Extract text from a PDF
Mr. MacKenty
Revision as of 14:22, 18 September 2018
This is a problem set. Some of these are easy, others are far more difficult. The purpose of these problem sets is:
- to build your skill applying computational thinking to a problem
- to assess your knowledge and skills of different programming practices
What is this problem set trying to do
- This is tricky.
- PDFs are a ubiquitous file format.
- They are famously difficult to get text from.
The Problem
Extract specific text from a PDF. Start here:
- From a terminal (inside Visual Studio Code or iTerm): pip3 install PyPDF2
- Find some silly PDF to use (um, with text).
- Use this code to get started:
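import PyPDF2

# Note: these names are from PyPDF2 1.x/2.x. PyPDF2 3.0 and later use
# PdfReader, len(reader.pages), reader.pages[i] and extract_text() instead.
# Make sure your PDF is in the same directory as the code you are executing:
pdfFileObject = open('YOURPDF.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
# Print the extracted text of every page, one page at a time
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())
# Close the file once all pages have been read
pdfFileObject.close()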
- PyPDF2 github: https://github.com/mstamy2/PyPDF2
- PyPDF2 tutorial: https://dzone.com/articles/an-intro-to-pypdf2
- PyPDF2 documentation: https://pythonhosted.org/PyPDF2/
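Once you have the text of each page, the "specific text" part of the problem becomes ordinary string searching. Here is a minimal sketch of one way to approach it, not the required solution. The names find_matching_lines, read_page_texts and sample_pages are hypothetical (they are not part of PyPDF2), and read_page_texts assumes the same PyPDF2 calls as the starter code above.

```python
def find_matching_lines(page_texts, keyword):
    """Return (page_number, line) pairs whose line contains keyword, ignoring case."""
    matches = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            if keyword.lower() in line.lower():
                matches.append((page_number, line.strip()))
    return matches


def read_page_texts(path):
    """Return the extracted text of each page (hypothetical helper built on the starter code)."""
    import PyPDF2  # imported here so the search function works without PyPDF2 installed
    with open(path, 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return [reader.getPage(i).extractText() for i in range(reader.numPages)]


# Stand-in page text so the search can be tried without a PDF
sample_pages = ["Invoice 42\nTotal due: $10", "Thank you\ntotal items: 3"]
print(find_matching_lines(sample_pages, "total"))
```

With a real file you would call find_matching_lines(read_page_texts('YOURPDF.pdf'), 'total') instead. Keeping the search separate from the PDF reading means you can test the search on plain strings, which helps because extracted PDF text is often messier than you expect.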
Unit Tests
- User Input: Name: Bill
- Expected output: Hello Bill
- User Input: Name: TJ
- Expected output: An administrator! Hello TJ
- User Input: Name: 123
- Expected output: Hello 123
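The cases above can be checked automatically with plain assert statements. This is only a sketch: the greet function and the ADMINISTRATORS set are hypothetical names, and treating TJ as the only administrator is an assumption taken from the second case.

```python
# Assumption: TJ is the only administrator, as suggested by the second test case
ADMINISTRATORS = {"TJ"}


def greet(name):
    """Build the greeting for name, with a prefix for administrators."""
    greeting = "Hello " + name
    if name in ADMINISTRATORS:
        return "An administrator! " + greeting
    return greeting


# Each assert mirrors one input/expected-output pair from the list above
assert greet("Bill") == "Hello Bill"
assert greet("TJ") == "An administrator! Hello TJ"
assert greet("123") == "Hello 123"
print("All unit tests passed")
```

If an assert fails, Python stops with an AssertionError on that line, so you know which case your code does not meet.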
Hacker edition
In the hacker version:
- Your program should test for valid user input. Only strings should be allowed.
THIS PART ISN'T DONE YET
How you will be assessed
Your solution will be graded using the following axes:
Scope
- To what extent does your code implement the features required by our specification?
- To what extent is there evidence of effort?
Correctness
- To what extent did your code meet specifications?
- To what extent did your code meet unit tests?
- To what extent is your code free of bugs?
Design
- To what extent is your code written well (i.e. clearly, efficiently, elegantly, and/or logically)?
- To what extent is your code eliminating repetition?
- To what extent is your code using functions appropriately?
Style
- To what extent is your code readable?
- To what extent is your code commented?
- To what extent are your variables well named?
- To what extent do you adhere to the style guide?
A possible solution
Click the expand link to see one possible solution, but NOT before you have tried and failed!
not yet!