Extract text from a PDF
This is a problem set. Some of these problems are easy; others are far more difficult. The purpose of these problem sets is:
- to build your skill applying computational thinking to a problem
- to assess your knowledge and skills of different programming practices
What is this problem set trying to do
- This is tricky.
- PDFs are a ubiquitous file format.
- They are famously difficult to get text from.
The Problem
Extract specific text from a PDF. Start here:
- From a terminal (inside Visual Studio Code or iTerm), run: pip3 install PyPDF2
- Find some silly PDF to use (um, with text).
- Extract the PDF's contents as plain text.
- Use this code to get started:
import PyPDF2

# Make sure your PDF is in the same directory as the code you are executing.
# This uses the PyPDF2 2.x/3.x names; older releases used PdfFileReader,
# numPages, getPage and extractText, which PyPDF2 3.0 removed.
with open('YOURPDF.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Print the extracted text of each page, in page order.
    for page in pdf_reader.pages:
        print(page.extract_text())
- PyPDF2 GitHub: https://github.com/mstamy2/PyPDF2
- PyPDF2 tutorial: https://dzone.com/articles/an-intro-to-pypdf2
- PyPDF2 documentation: https://pythonhosted.org/PyPDF2/
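Once each page's text is available as a string, picking out "specific text" is ordinary string processing. The sketch below is one possible approach, not the expected solution: the function name find_matching_lines and the sample page text are hypothetical, and it assumes you have already collected one string per page from the starter code.

```python
import re


def find_matching_lines(page_texts, pattern):
    """Return (page_number, line) pairs for lines matching a regular expression.

    page_texts is a list of strings, one per page, such as the text
    extracted from each page by the starter code.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    matches = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            if regex.search(line):
                matches.append((page_number, line.strip()))
    return matches


# Hypothetical page text standing in for real extraction output.
sample_pages = [
    "Invoice 2041\nTotal due: 15.00 EUR",
    "Thank you for your business.\nTotal paid: 0.00 EUR",
]
print(find_matching_lines(sample_pages, r"total"))
```

Keeping the search in a function that takes plain strings means it can be tested without a PDF on disk.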
How you will be assessed
Your solution will be graded along the following axes:
Scope
- To what extent does your code implement the features required by our specification?
- To what extent is there evidence of effort?
Correctness
- To what extent did your code meet specifications?
- To what extent did your code meet unit tests?
- To what extent is your code free of bugs?
Design
- To what extent is your code written well (i.e. clearly, efficiently, elegantly, and/or logically)?
- To what extent is your code eliminating repetition?
- To what extent is your code using functions appropriately?
Style
- To what extent is your code readable?
- To what extent is your code commented?
- To what extent are your variables well named?
- To what extent do you adhere to the style guide?
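As a rough illustration of these criteria (a small, commented, well-named function with unit tests), here is a minimal sketch. The helper clean_extracted_text is hypothetical and not part of the specification. It assumes extracted text may contain words hyphenated across line breaks, and it will also merge genuine compounds that happen to break at a line end.

```python
import re


def clean_extracted_text(raw_text):
    """Tidy text extracted from a PDF page.

    Rejoins words hyphenated across line breaks, then collapses runs of
    whitespace (including newlines) into single spaces.
    """
    # "extrac-\ntion" -> "extraction"
    rejoined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw_text)
    # Any run of whitespace becomes one space; trim both ends.
    return re.sub(r"\s+", " ", rejoined).strip()


# pytest-style unit tests: plain functions whose names start with "test_".
def test_rejoins_hyphenated_words():
    assert clean_extracted_text("extrac-\ntion") == "extraction"


def test_collapses_whitespace():
    assert clean_extracted_text("  two   words\n\nhere ") == "two words here"
```

If pytest is installed, running it against the file containing these functions collects and runs the two tests.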
A possible solution
Consult a possible solution only after you have tried and failed! None is posted yet.