Extract text from a PDF

From Computer Science Wiki
Revision as of 14:22, 18 September 2018

This is a problem set for you to work through [1]

This is a problem set. Some of these are easy, others are far more difficult. The purpose of these problem sets is:

  1. to build your skill applying computational thinking to a problem
  2. to assess your knowledge and skills of different programming practices


What is this problem set trying to do

  1. This one is tricky.
  2. PDFs are a ubiquitous file format.
  3. They are famously difficult to extract text from.
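
Part of the difficulty is that text pulled out of a PDF often comes back with stray line breaks, doubled spaces, or words hyphenated across lines. The sketch below shows one way to tidy such text before searching it. The function name `tidy_extracted_text` and the sample string are made up for illustration, and the hyphen rule is a simplification: it will also join words that really are hyphenated if they happen to break at the end of a line.

```python
import re


def tidy_extracted_text(raw):
    """Clean up common artifacts in text pulled out of a PDF (illustrative only)."""
    # Rejoin words split by a hyphen at the end of a line: "extrac-\ntion" -> "extraction"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse every run of whitespace (spaces, tabs, newlines) into a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Hypothetical sample of messy extracted text
sample = "PDF  text extrac-\ntion is\n  famously   tricky."
print(tidy_extracted_text(sample))  # prints: PDF text extraction is famously tricky.
```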


The Problem

Extract specific text from a PDF. Start here:

  1. From a terminal (inside Visual Studio Code or iTerm), run: pip3 install PyPDF2
  2. Find a PDF to use (one that actually contains text).
  3. Use this code to get started:
import PyPDF2

# Make sure your PDF is in the same directory as the code you are executing.
# Note: PyPDF2 3.x removed the older PdfFileReader, numPages, getPage and
# extractText names; on a 1.x install use those instead.
with open('YOURPDF.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Print the text of each page in order
    for page in pdf_reader.pages:
        print(page.extract_text())
  4. PyPDF2 GitHub: https://github.com/mstamy2/PyPDF2
  5. PyPDF2 tutorial: https://dzone.com/articles/an-intro-to-pypdf2
  6. PyPDF2 documentation: https://pythonhosted.org/PyPDF2/
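
Once the page text is available, "extract specific text" could be as simple as searching it for a keyword. The following is a minimal sketch of that idea, not the intended solution. The helper names `find_lines` and `read_page_texts` and the stand-in page texts are hypothetical, and `read_page_texts` assumes a PyPDF2 3.x install.

```python
def find_lines(page_texts, keyword):
    """Return (page_number, line) pairs whose line contains keyword, ignoring case."""
    needle = keyword.lower()
    matches = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            if needle in line.lower():
                matches.append((page_number, line.strip()))
    return matches


def read_page_texts(path):
    """Extract the text of every page with PyPDF2 (assumes PyPDF2 3.x is installed)."""
    import PyPDF2  # imported here so find_lines can be tried without the library

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        # Fall back to "" in case a page yields no text
        return [page.extract_text() or "" for page in reader.pages]


# Stand-in page texts so the search can be tried without a PDF on disk
pages = ["Course overview\nAssessment details", "Internal assessment\nAppendix"]
print(find_lines(pages, "assessment"))
# prints: [(1, 'Assessment details'), (2, 'Internal assessment')]
```

With a real file, the two helpers would be combined as `find_lines(read_page_texts('YOURPDF.pdf'), 'assessment')`. Keeping the search separate from the PDF reading makes it easy to test the search on plain strings.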

Unit Tests

  • User Input: Name: Bill
  • Expected output: Hello Bill
  • User Input: Name: TJ
  • Expected output: An administrator! Hello TJ
  • User Input: Name: 123
  • Expected output: Hello 123

Hacker edition

In the hacker version:

  • Your program should test for valid user input. Only strings should be accepted as user input.
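
One possible reading of this requirement is sketched below. The rule used here (a valid input is a non-empty string that is not made up only of digits) is an assumption, since the specification does not say exactly what counts as a string. The names `is_valid_text` and `ask_for_text` are hypothetical.

```python
def is_valid_text(value):
    """Return True for a non-empty string that is not purely numeric (assumed rule)."""
    if not isinstance(value, str):
        return False
    stripped = value.strip()
    # Reject blank input and input made only of digits, such as "123"
    return bool(stripped) and not stripped.isdigit()


def ask_for_text(prompt, read=input):
    """Keep asking until the user types something is_valid_text accepts."""
    while True:
        answer = read(prompt)
        if is_valid_text(answer):
            return answer.strip()
        print("Please enter some text, not a number or a blank line.")
```

The `read` parameter defaults to the built-in `input`, but a different function can be passed in so the loop can be tested without typing at a keyboard.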

This part isn't done yet.

How you will be assessed

Your solution will be graded along the following axes:


Scope

  • To what extent does your code implement the features required by our specification?
  • To what extent is there evidence of effort?

Correctness

  • To what extent did your code meet specifications?
  • To what extent did your code meet unit tests?
  • To what extent is your code free of bugs?

Design

  • To what extent is your code written well (i.e. clearly, efficiently, elegantly, and/or logically)?
  • To what extent is your code eliminating repetition?
  • To what extent is your code using functions appropriately?

Style

  • To what extent is your code readable?
  • To what extent is your code commented?
  • To what extent are your variables well named?
  • To what extent do you adhere to the style guide?

References

  1. http://www.flaticon.com/

A possible solution

Click the expand link to see one possible solution, but NOT before you have tried and failed!

not yet!