Commit Graph

20 Commits

Author SHA1 Message Date
Johannes Zillmann
b5f3075bdf Clean up types
- merge `ItemType`/`BlockType` to `TextType`
- fix bug with duplicate and flattened types
2024-04-02 11:18:55 -06:00
Johannes Zillmann
55ae236928 Improve header detection
- fix tests
- still run header detection based on heights even if TOC headlines have been identified
2024-03-28 11:39:34 -06:00
Johannes Zillmann
7f5f4d7071 Add DetectHeaders transformation
- This is mainly code from 2 years ago (was in the stash)
- The tests were green but are failing now because of recent changes
- Plan is to first move all files to the root, to then be able to debug the tests better
2024-03-26 10:23:15 -06:00
Johannes Zillmann
02c2fd04fe DetectToc removes TOC items and marks headlines 2021-07-19 10:15:59 -06:00
Johannes Zillmann
d223e8a790 Move types to front 2021-07-18 14:25:55 -06:00
Johannes Zillmann
616909481a Don't print globals twice 2021-07-18 14:13:38 -06:00
Johannes Zillmann
46234417ad Fine tune line detection
* Before, lines were assembled that are really separate lines
2021-07-18 13:07:06 -06:00
Johannes Zillmann
e261583c65 Improve TOC headline detection 2021-04-27 08:29:00 +02:00
Johannes Zillmann
94a7405671 Look up and verify TOC links 2021-04-25 14:41:50 +02:00
Johannes Zillmann
19a76d6163 Publish TOC as global (rudimentary) 2021-04-25 08:15:10 +02:00
Johannes Zillmann
28c2b1a6a6 Have types instead of type 2021-04-18 16:23:52 +02:00
Johannes Zillmann
5b611cd506 Rename TocDetection to DetectToc 2021-04-18 15:31:45 +02:00
Johannes Zillmann
a1ea24cc3a Improved TOC detection
- Restrict pages before numbered line
2021-04-18 10:05:34 +02:00
Johannes Zillmann
a427806f68 Move width & height after x & y 2021-04-11 18:28:53 +02:00
Johannes Zillmann
6283ab7a96 Track evaluation score (optionally)
Makes it easier to see how a value got classified
2021-04-01 18:16:42 +02:00
Johannes Zillmann
898af7bbc8 Fix previous commit and re-use page mapping 2021-03-29 07:24:20 +02:00
Johannes Zillmann
388e8cc6b1 Find page mapping during statistics calculation 2021-03-28 23:45:26 +02:00
Johannes Zillmann
89d4bbd2f9 Cover globals in tests 2021-03-28 10:58:24 +02:00
Johannes Zillmann
4d1821f584 Qualify lines for removal based on multiple scores 2021-03-23 08:08:13 +01:00
Johannes Zillmann
c98145a63c Test for remote PDFs 2021-03-22 09:03:26 +01:00