Conquering PDFs: document understanding beyond plain text

Ines Montani

Wednesday 17:10 in Zeiss Plenary (Spectrum)

For the practical examples, I'll be using spaCy, and the new Docling library and layout analysis models. I'll also cover Optical Character Recognition (OCR) for image-based text, how to convert tabular data to pandas DataFrames, and strategies for creating training and evaluation data for information extraction tasks like text classification and entity recognition using PDFs and other documents as inputs.

Ines Montani

Ines Montani is a developer specializing in tools for AI and NLP technology. She’s the co-founder and CEO of Explosion and a core developer of spaCy, a popular open-source library for Natural Language Processing in Python, and Prodigy, a modern annotation tool for creating training data for machine learning models.