Extract text from a PDF: Difference between revisions

Revision as of 14:22, 18 September 2018

This is a problem set for you to work through. ^[1]

This is a problem set. Some of the problems are easy; others are far more difficult. The purposes of these problem sets are:

to build your skill applying computational thinking to a problem
to assess your knowledge and skills of different programming practices

What is this problem set trying to do

This is tricky.
PDFs are a ubiquitous file format.
They are famously difficult to extract text from.

The Problem

Extract specific text from a PDF. Start here:

From a terminal (inside Visual Studio Code or iTerm), run: pip3 install PyPDF2
Find some silly PDF to use (um, with text).
Use this code to get started:

import PyPDF2

# Make sure your PDF is in the same directory as the code you are executing.
# Recent PyPDF2 releases use these names; older releases used PdfFileReader,
# numPages, getPage(i), and extractText() instead.
with open('YOURPDF.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Print the extracted text of every page, in order.
    for page in pdf_reader.pages:
        print(page.extract_text())

PyPDF2 github: https://github.com/mstamy2/PyPDF2
PyPDF2 tutorial: https://dzone.com/articles/an-intro-to-pypdf2
PyPDF2 documentation: https://pythonhosted.org/PyPDF2/
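
Once you have a page's text as a string, pulling out specific text is ordinary string processing. The sketch below is one possible starting point, not the intended solution: `find_lines` and `sample_text` are hypothetical names, and the sample string stands in for whatever your PDF actually yields. It assumes the text you want sits on a single line and that matching should ignore case.

```python
def find_lines(page_text, keyword):
    """Return the lines of page_text that contain keyword, ignoring case."""
    needle = keyword.lower()
    matches = []
    for line in page_text.splitlines():
        # Collapse runs of whitespace, which extracted PDF text often contains.
        cleaned = " ".join(line.split())
        if cleaned and needle in cleaned.lower():
            matches.append(cleaned)
    return matches


# Hypothetical stand-in for the string returned by a page's text extraction.
sample_text = "Course  Guide\nAssessment is in May.\n\nInternal assessment:  20%"
print(find_lines(sample_text, "assessment"))
# prints ['Assessment is in May.', 'Internal assessment: 20%']
```

Keeping the search separate from the PDF reading means you can test it with plain strings before pointing it at a real file.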

Unit Tests

User Input: Name: Bill
Expected output: Hello Bill

User Input: Name: TJ
Expected output: An administrator! Hello TJ

User Input: Name: 123
Expected output: Hello 123
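
The cases listed above can be written as assertions so they run automatically. This is only a sketch under assumptions: the function name `greet` and the `ADMINISTRATORS` set are hypothetical, and treating "TJ" as the only administrator is inferred from the second test case rather than stated anywhere in the problem.

```python
# Hypothetical set of administrator names; only "TJ" appears in the test cases.
ADMINISTRATORS = {"TJ"}


def greet(name):
    """Build the greeting for name, with a prefix for administrators."""
    greeting = f"Hello {name}"
    if name in ADMINISTRATORS:
        return f"An administrator! {greeting}"
    return greeting


# The three listed cases, expressed as assertions.
assert greet("Bill") == "Hello Bill"
assert greet("TJ") == "An administrator! Hello TJ"
assert greet("123") == "Hello 123"
```

If every assertion passes, the script finishes silently; a failing case raises an AssertionError pointing at the line that failed.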

Hacker edition

In the hacker version:

Your program should test for valid user input. The user input should only allow strings.

This part isn't done yet.

How you will be assessed

Your solution will be graded along the following axes:

Scope

To what extent does your code implement the features required by our specification?
To what extent is there evidence of effort?

Correctness

To what extent did your code meet specifications?
To what extent did your code meet unit tests?
To what extent is your code free of bugs?

Design

To what extent is your code written well (i.e. clearly, efficiently, elegantly, and/or logically)?
To what extent is your code eliminating repetition?
To what extent is your code using functions appropriately?

Style

To what extent is your code readable?
To what extent is your code commented?
To what extent are your variables well named?
To what extent do you adhere to the style guide?

References

[1] http://www.flaticon.com/

A possible solution

Click the expand link to see one possible solution, but NOT before you have tried and failed!

not yet!

