Extract text from a PDF

From Computer Science Wiki
This is a problem set for you to work through [1]

This is a problem set. Some of these problems are easy; others are far more difficult. The purpose of these problem sets is:

  1. to build your skill applying computational thinking to a problem
  2. to assess your knowledge and skills of different programming practices

What is this problem set trying to do

  1. This is tricky.
  2. PDFs are a ubiquitous file format.
  3. They are famously difficult to get text from.

The Problem

Extract specific text from a PDF. Start here:

  1. From a terminal (inside Visual Studio Code or iTerm): pip3 install PyPDF2
  2. Find some silly PDF to use (um, with text).
  3. Parse the text into plain text.
  4. Use this code to get started:
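import PyPDF2

# Make sure your PDF is in the same directory as the code you are executing:
pdfFileObject = open('YOURPDF.pdf', 'rb')

# PyPDF2 3.x removed PdfFileReader, numPages and getPage(); use PdfReader and pages instead.
pdfReader = PyPDF2.PdfReader(pdfFileObject)
count = len(pdfReader.pages)
for i in range(count):
    page = pdfReader.pages[i]
    # Your code goes here: get the text out of each page.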
  1. PyPDF2 GitHub: https://github.com/mstamy2/PyPDF2
  2. PyPDF2 tutorial: https://dzone.com/articles/an-intro-to-pypdf2
  3. PyPDF2 documentation: https://pythonhosted.org/PyPDF2/
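
Once the text is out of the PDF, finding the specific text you want is ordinary string work. The following is a minimal sketch, not the official solution. It assumes a PyPDF2 release that provides PdfReader, pages and extract_text(). The function names extract_pages and find_matches, the date pattern and the sample strings are all made up for illustration.

```python
import re


def extract_pages(path):
    """Return one string of extracted text per page of the PDF at `path`."""
    # Imported inside the function so find_matches() can be used and tested
    # on plain strings even where PyPDF2 is not installed.
    import PyPDF2

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)  # PdfFileReader in older releases
        # extract_text() can come back empty (for example on scanned pages),
        # so fall back to an empty string.
        return [page.extract_text() or "" for page in reader.pages]


def find_matches(pages, pattern):
    """Return (page_number, matched_text) pairs, with pages numbered from 1."""
    regex = re.compile(pattern)
    results = []
    for page_number, text in enumerate(pages, start=1):
        for match in regex.finditer(text):
            results.append((page_number, match.group(0)))
    return results


# Hypothetical page texts standing in for what extract_pages() would return.
sample_pages = [
    "Invoice date: 2024-01-31",
    "No dates here",
    "Due 2024-02-29 or 2024-03-01",
]
print(find_matches(sample_pages, r"\d{4}-\d{2}-\d{2}"))
# prints [(1, '2024-01-31'), (3, '2024-02-29'), (3, '2024-03-01')]
```

With a real file you would call find_matches(extract_pages('YOURPDF.pdf'), your_pattern) instead. Keeping the search separate from the PDF reading means you can check the search logic on plain strings before you worry about what PyPDF2 gives you back.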

How you will be assessed

Your solution will be graded using the following axes:


  • To what extent does your code implement the features required by our specification?
  • To what extent is there evidence of effort?


  • To what extent did your code meet specifications?
  • To what extent did your code meet unit tests?
  • To what extent is your code free of bugs?


  • To what extent is your code written well (i.e. clearly, efficiently, elegantly, and/or logically)?
  • To what extent is your code eliminating repetition?
  • To what extent is your code using functions appropriately?


  • To what extent is your code readable?
  • To what extent is your code commented?
  • To what extent are your variables well named?
  • To what extent do you adhere to a style guide?
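
Several of these questions (unit tests, functions, readability) are easier to answer well if your text handling lives in small functions that can be tested without a PDF. Here is a rough sketch using Python's built-in unittest module. The helper clean_text and its tests are hypothetical examples, not the tests your solution will be graded against.

```python
import re
import unittest


def clean_text(raw):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


class CleanTextTests(unittest.TestCase):
    def test_collapses_line_breaks_and_spaces(self):
        self.assertEqual(clean_text("Hello\n  world \t again"), "Hello world again")

    def test_whitespace_only_input_becomes_empty(self):
        self.assertEqual(clean_text("   \n"), "")


# Run the tests without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanTextTests)
result = unittest.TextTestRunner().run(suite)
```

Text pulled out of a PDF often has odd line breaks and spacing, so a helper like this is a natural first thing to write and test.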


A possible solution

Look at a possible solution only after you have tried and failed!

not yet!