2021-03-22 09:03:26 +01:00
|
|
|
# Test PDFs
|
2021-03-13 22:53:54 +01:00
|
|
|
|
2021-03-22 09:03:26 +01:00
|
|
|
This folder contains PDFs for testing purposes and the parse results of the PDFs. Generally there are 3 types of PDFs test setups:
|
|
|
|
|
|
|
|
1. Self generated PDFs
|
|
|
|
2. PDFs which entered `public domain` or have a otherwise permissive license like `Creative Commons SA`
|
|
|
|
3. PDFs where the license is unclear
|
|
|
|
|
|
|
|
For (1) and (2) we track the end-result and all transformation steps.
|
|
|
|
For (3) we only track the resulst of some transfomation stages (those who doesn't leak too much of the content)
|
|
|
|
|
|
|
|
## Self-generated PDFs
|
2021-03-13 22:53:54 +01:00
|
|
|
|
|
|
|
- [ExamplePdf](ExamplePdf.pdf)
|
|
|
|
|
2021-03-22 09:03:26 +01:00
|
|
|
## Included Public PDFs
|
|
|
|
|
|
|
|
_(PDFs which entered `public domain` or have a otherwise permissive license like `Creative Commons SA`)_
|
2021-03-13 22:53:54 +01:00
|
|
|
|
2021-03-13 22:57:49 +01:00
|
|
|
| File | Source | Author | License Information |
|
|
|
|
| ------------------------------------------------------------------ | ------------------------------------- | ------------------------------------------------ | ------------------------- |
|
|
|
|
| [Adventures-Of-Sherlock-Holmes](Adventures-Of-Sherlock-Holmes.pdf) | https://pdfreebooks.org/ | Arthur Doyle | Public Domain |
|
|
|
|
| [Alice-In-Wonderland](Alice-In-Wonderland.pdf) | https://pdfreebooks.org/ | Lewis Carroll | Public Domain |
|
|
|
|
| [Closed-Syllables](Closed-Syllables.pdf) | ? | Susan Jones | Creative Commons BY 4.0 |
|
|
|
|
| [Flash-Masques-Temperature](Flash-Masques-Temperature.pdf) | https://www.techtera.org/ | ? | Creative Commons BY 4.0 |
|
|
|
|
| [Grammar-Matters](Grammar-Matters.pdf) | ? | Debbie Kuhlmann | Creative Commons BY 4.0 |
|
2021-03-26 08:42:31 +01:00
|
|
|
| [Life-Of-God-In-Soul-Of-Man](Life-Of-God-In-Soul-Of-Man.pdf) | https://archive.org/ | Henry Scougal | Public Domain |
|
2021-03-13 22:57:49 +01:00
|
|
|
| [Safe-Communication](Safe-Communication.pdf) | https://www.england.nhs.uk/ | Nicola Davey & Ali Cole | Creative Commons BY 4.0 |
|
|
|
|
| [St-Mary-Witney-Social-Audit](St-Mary-Witney-Social-Audit.pdf) | https://catrionarobertson.com/ | Catriona Robertson | Creative Commons BY 4.0 |
|
|
|
|
| [The-Art-of-Public-Speaking](The-Art-of-Public-Speaking.pdf) | http://www.gutenberg.org/ebooks/16317 | Dale Carnagey, J. Berg Esenwein | Project Gutenberg License |
|
|
|
|
| [The-Man-Without-A-Body](The-Man-Without-A-Body) | ? | Edward Page Mitchell | Public Domain |
|
|
|
|
| [The-War-of-the-Worlds](The-War-of-the-Worlds.pdf) | http://www.planetpdf.com/ | H.G Wells | Public Domain |
|
|
|
|
| [Tragedy-Of-The-Commons](Tragedy-Of-The-Commons.pdf) | https://science.sciencemag.org | Garrett Hardin | Public Domain |
|
|
|
|
| [WoodUp](WoodUp.pdf) | https://bupress.unibz.it/ | Freie Universität Bozen-Bolzano / Giustino Tonon | Creative Commons BY 4.0 |
|
2021-03-22 09:03:26 +01:00
|
|
|
|
|
|
|
## PDFs not stored but paritally tested
|
|
|
|
|
|
|
|
- https://homepages.cwi.nl/~lex/files/dict.pdf
|
2021-03-28 10:21:26 +02:00
|
|
|
- Page numbers with current chapter
|
|
|
|
- https://github.com/mozilla/pdf.js/raw/master/web/compressed.tracemonkey-pldi-09.pdf
|
|
|
|
- No page numbers
|
2021-03-22 09:03:26 +01:00
|
|
|
|
|
|
|
# Known transformatino problems
|
|
|
|
|
|
|
|
_Tracks known problems with parsing and transforming certain PDFs ._
|
|
|
|
|
|
|
|
- `Remove Repetitive Elements`
|
2021-03-26 08:42:31 +01:00
|
|
|
- [](Life-Of-God-In-Soul-Of-Man.pdf)
|
|
|
|
- often numbers are cryptic text
|
|
|
|
- high variance in Y
|