mirror of
https://github.com/jzillmann/pdf-to-markdown.git
synced 2024-11-30 03:34:22 +01:00
Test PDFs
This folder contains PDFs for testing purposes, together with their parse results. Generally, there are three types of PDF test setups:
1. Self-generated PDFs
2. PDFs which entered the public domain or have an otherwise permissive license like Creative Commons SA
3. PDFs where the license is unclear
For (1) and (2) we track the end result and all transformation steps. For (3) we only track the results of some transformation stages (those that don't leak too much of the content).
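The tracking rule above can be sketched roughly as follows. This is only an illustration in Python, not the project's actual test code: the `LicenseCategory`, `Stage`, and `tracked_stages` names and the `leaks_content` flag are assumptions made up for this example, and the real transformation stages are not listed here.

```python
from dataclasses import dataclass
from enum import Enum


class LicenseCategory(Enum):
    SELF_GENERATED = 1  # type (1): self-generated PDFs
    PERMISSIVE = 2      # type (2): public domain or permissive license
    UNCLEAR = 3         # type (3): license is unclear


@dataclass(frozen=True)
class Stage:
    # Hypothetical description of one transformation stage.
    name: str
    leaks_content: bool  # True if its tracked output would reproduce much of the PDF text


def tracked_stages(category: LicenseCategory, stages: list[Stage]) -> list[str]:
    """Return the names of the stages whose results would be tracked."""
    if category in (LicenseCategory.SELF_GENERATED, LicenseCategory.PERMISSIVE):
        # Types (1) and (2): every transformation step is tracked.
        return [stage.name for stage in stages]
    # Type (3): only stages that do not leak too much of the content.
    return [stage.name for stage in stages if not stage.leaks_content]
```

With two made-up stages, one that only records statistics and one that records full text, an unclear-license PDF would keep only the first, while a public-domain PDF would keep both.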
Self-generated PDFs
Included Public PDFs
(PDFs which entered the public domain or have an otherwise permissive license like Creative Commons SA)
File | Source | Author / Editor | License Information |
---|---|---|---|
Achieving-The-Paris-Climate-Agreement | https://link.springer.com/ | Sven Teske | Open Access, CC 4.0 |
Adventures-Of-Sherlock-Holmes | https://pdfreebooks.org/ | Arthur Conan Doyle | Public Domain |
Alice-In-Wonderland | https://pdfreebooks.org/ | Lewis Carroll | Public Domain |
CC_License_Agreement_of_siMPle | https://simple-plastics.eu/ | Aalborg University, Denmark and Alfred Wegener Institute | Creative Commons BY 4.0 |
CC-NC_Leitfaden | https://irights.info | Paul Klimpel | Creative Commons NC 4.0 |
Closed-Syllables | ? | Susan Jones | Creative Commons BY 4.0 |
Flash-Masques-Temperature | https://www.techtera.org/ | ? | Creative Commons BY 4.0 |
Grammar-Matters | ? | Debbie Kuhlmann | Creative Commons BY 4.0 |
Life-Of-God-In-Soul-Of-Man | https://archive.org/ | Henry Scougal | Public Domain |
Made-with-cc | https://creativecommons.org/ | Paul Stacey & Sarah Hinchliff Pearson | Public Domain |
Safe-Communication | https://www.england.nhs.uk/ | Nicola Davey & Ali Cole | Creative Commons BY-SA 4.0 |
St-Mary-Witney-Social-Audit | https://catrionarobertson.com/ | Catriona Robertson | Creative Commons BY 4.0 |
The-Art-of-Public-Speaking | http://www.gutenberg.org/ebooks/16317 | Dale Carnagey, J. Berg Esenwein | Project Gutenberg License |
The-Impact-of-Open-Access-Latin-American-Scholarship | https://about.jstor.org/ | John Kiplinger, Valerie Yaw | Creative Commons NC 4.0 |
The-Man-Without-A-Body | ? | Edward Page Mitchell | Public Domain |
The-War-of-the-Worlds | http://www.planetpdf.com/ | H. G. Wells | Public Domain |
Tragedy-Of-The-Commons | https://science.sciencemag.org | Garrett Hardin | Public Domain |
Watered-Soul-Blog-Book | https://wateredsoul.com/ | Wanda | Creative Commons BY 4.0 |
WoodUp | https://bupress.unibz.it/ | Freie Universität Bozen-Bolzano / Giustino Tonon | Creative Commons BY 4.0 |
PDFs not stored but partially tested
- https://homepages.cwi.nl/~lex/files/dict.pdf
  - Page numbers with current chapter
- https://github.com/mozilla/pdf.js/raw/master/web/compressed.tracemonkey-pldi-09.pdf
  - No page numbers
Known transformation problems
Tracks known problems with parsing and transforming certain PDFs.
See also KNOWN_ISSUES