mirror of
https://github.com/jzillmann/pdf-to-markdown.git
synced 2024-11-26 17:54:04 +01:00
Test PDFs
This folder contains PDFs for testing purposes, along with the parse results for those PDFs. Generally there are 3 types of PDF test setups:
- Self generated PDFs
- PDFs which entered the public domain or have an otherwise permissive license like Creative Commons SA
- PDFs where the license is unclear
For (1) and (2) we track the end result and all transformation steps. For (3) we only track the results of some transformation stages (those which don't leak too much of the content).
Self-generated PDFs
Included Public PDFs
(PDFs which entered the public domain or have an otherwise permissive license like Creative Commons SA)
File | Source | Author | License Information |
---|---|---|---|
Adventures-Of-Sherlock-Holmes | https://pdfreebooks.org/ | Arthur Conan Doyle | Public Domain |
Alice-In-Wonderland | https://pdfreebooks.org/ | Lewis Carroll | Public Domain |
Closed-Syllables | ? | Susan Jones | Creative Commons BY 4.0 |
Flash-Masques-Temperature | https://www.techtera.org/ | ? | Creative Commons BY 4.0 |
Grammar-Matters | ? | Debbie Kuhlmann | Creative Commons BY 4.0 |
Life-Of-God-In-Soul-Of-Man | https://archive.org/ | Henry Scougal | Public Domain |
Safe-Communication | https://www.england.nhs.uk/ | Nicola Davey & Ali Cole | Creative Commons BY 4.0 |
St-Mary-Witney-Social-Audit | https://catrionarobertson.com/ | Catriona Robertson | Creative Commons BY 4.0 |
The-Art-of-Public-Speaking | http://www.gutenberg.org/ebooks/16317 | Dale Carnagey, J. Berg Esenwein | Project Gutenberg License |
The-Man-Without-A-Body | ? | Edward Page Mitchell | Public Domain |
The-War-of-the-Worlds | http://www.planetpdf.com/ | H.G. Wells | Public Domain |
Tragedy-Of-The-Commons | https://science.sciencemag.org | Garrett Hardin | Public Domain |
WoodUp | https://bupress.unibz.it/ | Freie Universität Bozen-Bolzano / Giustino Tonon | Creative Commons BY 4.0 |
PDFs not stored but partially tested
- https://homepages.cwi.nl/~lex/files/dict.pdf
- Page numbers with current chapter
- https://github.com/mozilla/pdf.js/raw/master/web/compressed.tracemonkey-pldi-09.pdf
- No page numbers
Known transformation problems
Tracks known problems with parsing and transforming certain PDFs.