pdf-to-markdown/examples/README.md
2021-03-26 08:42:31 +01:00


Test PDFs

This folder contains PDFs for testing purposes, together with their parse results. Generally there are 3 types of PDF test setups:

  1. Self generated PDFs
  2. PDFs which entered the public domain or have an otherwise permissive license like Creative Commons SA
  3. PDFs where the license is unclear

For (1) and (2) we track the end result and all transformation steps. For (3) we only track the results of some transformation stages (those which don't leak too much of the content).
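
The tracking rule above can be sketched roughly as follows. This is only an illustration, not the project's code: `PdfCategory`, `Stage`, `leaks_content`, and the stage names `stage-a` / `stage-b` are hypothetical, and it assumes each stage is flagged by hand as leaking content or not.

```python
from dataclasses import dataclass
from enum import Enum


class PdfCategory(Enum):
    SELF_GENERATED = 1      # type (1)
    PERMISSIVE_LICENSE = 2  # type (2)
    UNCLEAR_LICENSE = 3     # type (3)


@dataclass(frozen=True)
class Stage:
    name: str            # hypothetical stage name
    leaks_content: bool  # assumption: decided by hand for each stage


def tracked_outputs(category, stages):
    """Return the names of the outputs that would be tracked for a PDF."""
    if category is PdfCategory.UNCLEAR_LICENSE:
        # Only stage results that reveal little of the text, and no end result.
        return [s.name for s in stages if not s.leaks_content]
    # Types (1) and (2): every transformation step plus the end result.
    return [s.name for s in stages] + ["end-result"]


stages = [
    Stage("stage-a", leaks_content=False),
    Stage("stage-b", leaks_content=True),
]
```

With these placeholder stages, a type (3) PDF would only have the output of `stage-a` tracked, while a type (1) or (2) PDF would have both stages and the end result tracked.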

Self-generated PDFs

Included Public PDFs

(PDFs which entered the public domain or have an otherwise permissive license like Creative Commons SA)

| File | Source | Author | License Information |
| --- | --- | --- | --- |
| Adventures-Of-Sherlock-Holmes | https://pdfreebooks.org/ | Arthur Doyle | Public Domain |
| Alice-In-Wonderland | https://pdfreebooks.org/ | Lewis Carroll | Public Domain |
| Closed-Syllables | ? | Susan Jones | Creative Commons BY 4.0 |
| Flash-Masques-Temperature | https://www.techtera.org/ | ? | Creative Commons BY 4.0 |
| Grammar-Matters | ? | Debbie Kuhlmann | Creative Commons BY 4.0 |
| Life-Of-God-In-Soul-Of-Man | https://archive.org/ | Henry Scougal | Public Domain |
| Safe-Communication | https://www.england.nhs.uk/ | Nicola Davey & Ali Cole | Creative Commons BY 4.0 |
| St-Mary-Witney-Social-Audit | https://catrionarobertson.com/ | Catriona Robertson | Creative Commons BY 4.0 |
| The-Art-of-Public-Speaking | http://www.gutenberg.org/ebooks/16317 | Dale Carnagey, J. Berg Esenwein | Project Gutenberg License |
| The-Man-Without-A-Body | ? | Edward Page Mitchell | Public Domain |
| The-War-of-the-Worlds | http://www.planetpdf.com/ | H.G Wells | Public Domain |
| Tragedy-Of-The-Commons | https://science.sciencemag.org | Garrett Hardin | Public Domain |
| WoodUp | https://bupress.unibz.it/ | Freie Universität Bozen-Bolzano / Giustino Tonon | Creative Commons BY 4.0 |

PDFs not stored but partially tested

Known transformation problems

Tracks known problems with parsing and transforming certain PDFs.

  • Remove Repetitive Elements
      • often numbers are cryptic text
      • high variance in Y
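
To illustrate why a high variance in Y can get in the way, here is a minimal sketch of one possible detection approach. It is an assumption for illustration, not the project's actual algorithm: the function `find_repeated`, its parameters `y_tolerance` and `min_share`, and the `(text, y)` item shape are all hypothetical. It treats a text as repetitive when the same string shows up at roughly the same Y position on most pages.

```python
from collections import defaultdict


def find_repeated(pages, y_tolerance=2.0, min_share=0.6):
    """Find texts that recur at a similar Y position across pages.

    pages: list of pages, each a list of (text, y) tuples.
    """
    counts = defaultdict(int)
    for items in pages:
        # Bucket Y so that small offsets fall into the same key.
        # Values close to a bucket edge can still be split apart.
        seen = {(text.strip(), round(y / y_tolerance)) for text, y in items}
        for key in seen:
            counts[key] += 1  # count each key at most once per page
    threshold = min_share * len(pages)
    return {text for (text, _), n in counts.items() if n >= threshold}


# Header sits at almost the same height on every page: detected.
stable = [
    [("Chapter 1", 780.0), ("body a", 500.0)],
    [("Chapter 1", 780.4), ("body b", 500.0)],
    [("Chapter 1", 779.8), ("body c", 500.0)],
]

# Same header, but its Y position varies a lot: missed.
jittery = [
    [("Chapter 1", 780.0)],
    [("Chapter 1", 760.0)],
    [("Chapter 1", 742.0)],
]
```

Under these assumptions, `find_repeated(stable)` returns the header while `find_repeated(jittery)` returns nothing. Text that is extracted differently on each page would defeat the string comparison in the same way.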