Extract text from a PDF

From Computer Science Wiki
This is a problem set for you to work through [1]

This is a problem set. Some of these problems are easy, others are far more difficult. The purposes of these problem sets are:

  1. to build your skill applying computational thinking to a problem
  2. to assess your knowledge and skills of different programming practices


What is this problem set trying to do

  1. This is tricky.
  2. PDFs are a ubiquitous file format.
  3. They are famously difficult to get text from.


The Problem

Extract specific text from a PDF. Start here:

  1. From a terminal (inside Visual Studio Code or iTerm): pip3 install PyPDF2
  2. Find some silly PDF to use (um, with text).
  3. Parse the text into plaintext.
  4. Use this code to get started:
import PyPDF2

# Make sure your PDF is in the same directory as the code you are executing.
# PdfReader, .pages and extract_text() replace the older PdfFileReader,
# numPages/getPage and extractText(), which were removed in PyPDF2 3.0.
with open('YOURPDF.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Print the text of each page in order
    for page in pdf_reader.pages:
        print(page.extract_text())
  1. PyPDF2 github: https://github.com/mstamy2/PyPDF2
  2. PyPDF2 tutorial: https://dzone.com/articles/an-intro-to-pypdf2
  3. PyPDF2 documentation: https://pythonhosted.org/PyPDF2/
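
Once each page's text is available as a string, picking out "specific text" could be sketched roughly as below. This is only one possible starting point, not the intended solution: the helper names normalize_text and find_matching_lines are hypothetical, and the sketch assumes you have already collected the page strings into a list (here, sample_pages stands in for real extracted text). It uses only the standard library re module.

```python
import re


def normalize_text(raw_text):
    """Rejoin words hyphenated across line breaks and tidy up whitespace."""
    # "pay-\nment" becomes "payment" (only when a lowercase letter follows)
    joined = re.sub(r"-\n(?=[a-z])", "", raw_text)
    # Collapse runs of spaces/tabs inside each line and trim the ends
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in joined.splitlines()]
    # Drop lines that are empty after trimming
    return "\n".join(line for line in lines if line)


def find_matching_lines(page_texts, pattern):
    """Return (page_number, line) pairs for lines matching a regex pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    matches = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in normalize_text(text).splitlines():
            if regex.search(line):
                matches.append((page_number, line))
    return matches


if __name__ == "__main__":
    # Stand-in for text extracted from a real PDF, one string per page
    sample_pages = [
        "Invoice   number: 12345\nThank you for your pay-\nment.",
        "Total due:   $99.50\nInvoice date: 2020-01-31",
    ]
    for page_number, line in find_matching_lines(sample_pages, r"invoice"):
        print(page_number, line)
```

Keeping the clean-up and the search in separate functions means each can be unit tested on plain strings, without needing a PDF on disk.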

How you will be assessed

Your solution will be graded along the following axes:


Scope

  • To what extent does your code implement the features required by our specification?
  • To what extent is there evidence of effort?

Correctness

  • To what extent did your code meet specifications?
  • To what extent did your code meet unit tests?
  • To what extent is your code free of bugs?

Design

  • To what extent is your code written well (i.e. clearly, efficiently, elegantly, and/or logically)?
  • To what extent is your code eliminating repetition?
  • To what extent is your code using functions appropriately?

Style

  • To what extent is your code readable?
  • To what extent is your code commented?
  • To what extent are your variables well named?
  • To what extent do you adhere to the style guide?

References

A possible solution

Click the expand link to see one possible solution, but NOT before you have tried and failed!

not yet!