pdf-to-markdown/KNOWN_ISSUES.md
2024-03-28 12:03:49 -06:00

Known Issues

Missing or wrong characters

The text that comes out of pdfjs sometimes looks very erroneous, e.g. in Life-Of-God-In-Soul-Of-Man. Interestingly, rendering the same PDF with pdfjs online looks fine, so this may just be a setup problem.

Uncovered TOC variants

Test PDFs not yet reviewed

Achieving-The-Paris-Climate-Agreement.pdf

  • wrong page mapping?
  • page numbers are not removed
  • no TOC
  • Roman numerals are wrong
  • subheadings under the TOC headings should be detected as well (clearly not handled in the code)
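
The Roman numeral issue could be approached with a strict parser for front-matter page numbers. The following is only a minimal sketch: `romanToInt` and `ROMAN_VALUES` are hypothetical names that do not come from the project code, and how the result would feed into page mapping is left open.

```javascript
// Hypothetical helper, not taken from the pdf-to-markdown code base.
const ROMAN_VALUES = { i: 1, v: 5, x: 10, l: 50, c: 100, d: 500, m: 1000 };

// Returns the integer value of a Roman numeral (1..3999), or null if
// the string is not a well-formed numeral (e.g. "iiii" or "page").
function romanToInt(text) {
  const s = text.trim().toLowerCase();
  // Strict pattern; the empty string would also match, so reject it first.
  const pattern = /^m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$/;
  if (s === '' || !pattern.test(s)) {
    return null;
  }
  let total = 0;
  for (let i = 0; i < s.length; i++) {
    const value = ROMAN_VALUES[s[i]];
    const next = ROMAN_VALUES[s[i + 1]] || 0;
    // A smaller symbol before a larger one is subtracted (iv = 4).
    total += value < next ? -value : value;
  }
  return total;
}
```

Ordinary words such as "mix" are also valid numerals, so a check like this would probably only be safe on lines that already sit where a page number is expected.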

Sherlock

  • words are not kept together

Made-with-cc.pdf

  • no TOC

Watered-Soul-Blog-Book.pdf

  • TOC: character minimum cuts out the year
  • TOC: stops too early

Life of God in Soul of man

  • Headline confusion: after a headline, the first words of a sentence are set large and get treated as a headline, which they should not be in this case (look at all heights in the line)
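
One possible reading of "all heights in the line" is that a line counts as a headline only when every text item in it is taller than body text. The sketch below illustrates that idea under stated assumptions: `isHeadlineLine`, the `bodyHeight` parameter, and the `tolerance` default are hypothetical, and the item shape `{ str, height }` is assumed rather than taken from the project code.

```javascript
// Hypothetical sketch; the item shape { str, height } is an assumption.
function isHeadlineLine(items, bodyHeight, tolerance = 0.5) {
  // Ignore whitespace-only items, whose height carries no signal.
  const visible = items.filter((item) => item.str.trim() !== '');
  if (visible.length === 0) {
    return false;
  }
  // Every visible item must be taller than body text. A single
  // body-height item means ordinary text with a large opening.
  return visible.every((item) => item.height > bodyHeight + tolerance);
}

// A sentence that merely starts with large words is not a headline.
const mixedLine = [
  { str: 'THE LIFE', height: 14 },
  { str: ' of God in the soul', height: 10 },
];
console.log(isHeadlineLine(mixedLine, 10));
```

The trade-off is that a real headline containing one small item, such as a footnote marker, would be rejected. A threshold on the share of tall characters might be more forgiving.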