GitHub - VikParuchuri/surya: OCR, layout analysis, reading order, table recognition in 90+ languages
Surya is a document OCR toolkit that does OCR in 90+ languages and benchmarks favorably against cloud services. It works on a range of documents (see #usage and #benchmarks for more details). [...] A hosted API for all Surya models is available at https://www.datalab.to/. It works with PDFs, images, Word docs, and PowerPoints. [...] I benchmarked OCR against Google Cloud Vision since it has similar language coverage to Surya. [...] This will evaluate Surya and optionally Tesseract on multilingual PDFs from Common Crawl (with synthetic data for missing languages).
Marker converts PDF to Markdown quickly and accurately. It supports a wide range of documents (optimized for books and scientific papers). [...] Here are some known limitations that are on the roadmap to address: Marker will not convert 100% of equations to LaTeX. [...] Batch conversion: marker /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000 (here --workers is the number of PDFs to convert at once). [...] Then run benchmark.py like this: python benchmark.py data/pdfs data/references report.json --nougat (this will benchmark Marker against other text extraction methods).
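The batch command quoted above can be assembled from a script. The following is a minimal sketch, not part of Marker itself: the helper name build_marker_command is hypothetical, the flags are only those that appear in the snippet, and treating --metadata_file and --min_length as optional is an assumption. Only --workers is explained in the snippet, and flags may differ between Marker versions, so check the project README before running the result.

```python
import shlex


def build_marker_command(input_dir, output_dir, workers=10, max_files=10,
                         metadata_file=None, min_length=None):
    """Build the argument list for the batch command quoted in the snippet.

    Hypothetical helper: it only formats arguments and does not run Marker.
    """
    cmd = [
        "marker", input_dir, output_dir,
        "--workers", str(workers),  # number of PDFs to convert at once
        "--max", str(max_files),    # flag taken verbatim from the snippet
    ]
    # Assumption: these two flags can be left out when not needed.
    if metadata_file is not None:
        cmd += ["--metadata_file", metadata_file]
    if min_length is not None:
        cmd += ["--min_length", str(min_length)]
    return cmd


cmd = build_marker_command(
    "/path/to/input/folder",
    "/path/to/output/folder",
    metadata_file="/path/to/metadata.json",
    min_length=10000,
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

Passing the returned list to subprocess.run(cmd, check=True) would execute it, assuming Marker is installed and the marker executable is on the PATH.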