Johannes Zillmann
b529dfa0a2
Detect Links
- Still needs a proper place, since this works on a per-`word` basis
2024-04-15 08:20:18 -06:00
Johannes Zillmann
3fa91a5d1e
FontStyle detection
- What is still missing is combining subsequent equal elements
2024-04-15 07:55:55 -06:00
Johannes Zillmann
fab5d4649c
List Levels
- No tests for this yet; need to revise the test infrastructure and the transformation, which modifies the item contents directly
2024-04-05 12:06:21 -06:00
Johannes Zillmann
182dd34c46
Detect lists & blocks
2024-04-02 16:23:19 -06:00
Johannes Zillmann
b5f3075bdf
Clean up types
- merge `ItemType`/`BlockType` to `TextType`
- fix bug with duplicate and flattened types
2024-04-02 11:18:55 -06:00
Johannes Zillmann
3c31c12768
Update known issues
2024-03-28 12:03:49 -06:00
Johannes Zillmann
55ae236928
Improve header detection
- fix tests
- still run header detection based on heights even if TOC headlines have been identified
2024-03-28 11:39:34 -06:00
Johannes Zillmann
7f5f4d7071
Add DetectHeaders transformation
- This is mainly code from 2 years ago (was in the stash)
- The tests were green but are failing now because of recent changes
- Plan is to first move all files to the root to then be able to debug the tests better
2024-03-26 10:23:15 -06:00
Johannes Zillmann
02c2fd04fe
DetectToc
removes TOC items and marks headlines
2021-07-19 10:15:59 -06:00
Johannes Zillmann
d223e8a790
Move types
to front
2021-07-18 14:25:55 -06:00
Johannes Zillmann
616909481a
Don't print globals twice
2021-07-18 14:13:38 -06:00
Johannes Zillmann
46234417ad
Fine tune line detection
* Before, lines were assembled that are really separate lines
2021-07-18 13:07:06 -06:00
Johannes Zillmann
e261583c65
Improve TOC headline detection
2021-04-27 08:29:00 +02:00
Johannes Zillmann
94a7405671
Lookup and verify toc links
2021-04-25 14:41:50 +02:00
Johannes Zillmann
19a76d6163
Publish TOC as global (rudimentary)
2021-04-25 08:15:10 +02:00
Johannes Zillmann
28c2b1a6a6
Have types
instead of type
2021-04-18 16:23:52 +02:00
Johannes Zillmann
5b611cd506
Rename TocDetection to DetectToc
2021-04-18 15:31:45 +02:00
Johannes Zillmann
243736ea0a
Fix typos
2021-04-18 11:38:34 +02:00
Johannes Zillmann
baa5b4fadc
Add 6 more test PDFs
2021-04-18 11:34:11 +02:00
Johannes Zillmann
a1ea24cc3a
Improved TOC detection
- Restrict pages before numbered line
2021-04-18 10:05:34 +02:00
Johannes Zillmann
ce6c9fe977
Initial TOC detection
2021-04-12 08:09:30 +02:00
Johannes Zillmann
a427806f68
Move width & height after x & y
2021-04-11 18:28:53 +02:00
Johannes Zillmann
642509a454
Refine repetitive character removal
2021-04-02 22:33:12 +02:00
Johannes Zillmann
6283ab7a96
Track evaluation score (optionally)
Makes it easier to see how a value got classified
2021-04-01 18:16:42 +02:00
Johannes Zillmann
d8fb3e0b24
Rename CalculateCoordinate to Unwrap... because that's what it really is
2021-03-31 10:08:05 +02:00
Johannes Zillmann
71ef84153c
Show page labels + default mapping to 1
2021-03-29 08:47:04 +02:00
Johannes Zillmann
898af7bbc8
Fix previous commit and re-use page mapping
2021-03-29 07:24:20 +02:00
Johannes Zillmann
388e8cc6b1
Find page mapping during statistics calculation
2021-03-28 23:45:26 +02:00
Johannes Zillmann
89d4bbd2f9
Cover globals in tests
2021-03-28 10:58:24 +02:00
Johannes Zillmann
d7d3502a25
Fix processing pdfs with no page numbers
2021-03-28 10:21:26 +02:00
Johannes Zillmann
21106d7e5e
Lower min score since accuracy has increased
2021-03-26 09:02:31 +01:00
Johannes Zillmann
0b096faa0c
More accurate page number detection
2021-03-26 08:42:31 +01:00
Johannes Zillmann
4340acb758
Simplify code
2021-03-24 23:08:36 +01:00
Johannes Zillmann
4d1821f584
Qualify lines for removal based on multiple scores
2021-03-23 08:08:13 +01:00
Johannes Zillmann
c98145a63c
Test for remote PDFs
2021-03-22 09:03:26 +01:00
Johannes Zillmann
68c4d9a4a3
Consolidate repetitive element eviction
* Solely rely on neighbour similarity
* Cut out `y` in the middle
2021-03-16 07:02:31 +01:00
Johannes Zillmann
f42358d63b
Remove empty items
2021-03-16 05:50:57 +01:00
Johannes Zillmann
5af033c0f1
Round and limit y
2021-03-15 20:37:41 +01:00
Johannes Zillmann
a90e6207dc
Add similarity checks to repetitive element removal
2021-03-15 09:16:50 +01:00
Johannes Zillmann
9bd5043f2e
Very basic removal of repetitive elements
2021-03-14 12:15:37 +01:00
Johannes Zillmann
8e024ee544
Fix layout
2021-03-13 22:57:49 +01:00
Johannes Zillmann
60596e7416
#24 Add first external PDFs for testing
2021-03-13 22:53:54 +01:00
Johannes Zillmann
db86552965
Fix tests
2021-03-13 22:50:02 +01:00
Johannes Zillmann
713a82b41d
Stabilize font display in tests
* If multiple PDFs are tested one after another, their font IDs change (e.g. `g_d0_f1` becomes `g_d1_f1`)
2021-03-13 19:38:47 +01:00
Johannes Zillmann
417cc2ab94
Add Test infrastructure for example PDFs
2021-03-13 08:46:22 +01:00
Johannes Zillmann
ef0bd7ebbe
Add example files
2017-03-29 08:17:14 +02:00