PDFs - When a thousand words are worth more than a picture (or table).

Wednesday 15:10 in Hassium

PDF, a must-have in RAG systems, ensures visual fidelity across platforms and devices at the expense of the core condition for computers to properly process and interpret text: semantics. Upon rendering, any logical arrangement of text explodes into meaningless visual shards of data that portray the bigger picture for the human eye, but no longer convey the information computers should grasp. Such a bottleneck already makes proper ingestion of text-only documents a big challenge, let alone when tables or figures come into play: the ultimate nightmare for PDF parsers, not to mention developers. The rest you must have already foreseen: a RAG system barfing unreliable knowledge from bad chunks (produced by regular PDF parsing), if those ever get retrieved from a vector database at all.

In this talk you can gather some vision-driven insights on how to leverage the strengths of PDFs and language models to produce good chunks for ingestion into a vector database. In other words, how multimodal models can go beyond trivial reverse engineering by decomposing tables into their building blocks, in plain language, the way they would be explained to another human; or better yet, the way humans would ask questions about such pieces of knowledge. This brings robustness to retrieval, the backbone of RAG. And from such a strategy, we can transfer the same rationale to figures.
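To make the idea of "building blocks in plain language" concrete, here is a minimal sketch of the kind of chunks such a decomposition aims for. It is not the speaker's method: in the talk a multimodal model reads the table, whereas this sketch swaps in a fixed sentence template over an already-extracted table. The function name `table_to_statements` and the sample pricing table are hypothetical, made up for illustration only.

```python
from typing import List, Sequence


def table_to_statements(
    header: Sequence[str],
    rows: Sequence[Sequence[str]],
    caption: str = "",
) -> List[str]:
    """Turn every non-key cell into one standalone, plain-language sentence.

    Assumption: the first column identifies the row (its "key").
    A multimodal model would phrase these freely; a template stands in here.
    """
    key_name = header[0]
    statements = []
    for row in rows:
        key = row[0]
        for col_name, value in zip(header[1:], row[1:]):
            sentence = f"the {col_name} for {key_name} {key} is {value}."
            if caption:
                sentence = f"In the table '{caption}', {sentence}"
            # Capitalize the first letter so each chunk reads as a sentence.
            statements.append(sentence[0].upper() + sentence[1:])
    return statements


# Hypothetical table, invented for this example.
example_header = ["plan", "monthly price", "storage"]
example_rows = [["Basic", "$5", "10 GB"], ["Pro", "$12", "100 GB"]]

for chunk in table_to_statements(example_header, example_rows):
    print(chunk)
```

Each sentence carries its own row and column context, so it can be embedded and retrieved on its own, without the surrounding grid.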

Get ready to boost your retrieval skills, as we:

Analyze the semantic bottlenecks, from the anatomy of a PDF stream to how parsers traverse it;
(Briefly) approach the never-ending debate on the ideal chunk format for ingestion in vector databases;
Build some chunks using multimodal models to decompose tables into their building blocks, preserving plain language;
Run an experiment that measures retrieval quality, comparing the decomposition strategy against PDF parsers and reverse engineering techniques;
And last, but not least, transfer the same rationale to figures.
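
As a rough idea of what measuring retrieval quality can look like, here is a minimal sketch. The abstract does not say which metrics the experiment uses; hit rate at k and mean reciprocal rank (MRR) are common choices and are an assumption here. The query and chunk identifiers are toy placeholders, not results from the talk.

```python
from typing import Dict, List


def hit_rate_at_k(
    rankings: Dict[str, List[str]], relevant: Dict[str, str], k: int
) -> float:
    """Fraction of queries whose relevant chunk appears in the top-k results."""
    hits = sum(1 for q, ranked in rankings.items() if relevant[q] in ranked[:k])
    return hits / len(rankings)


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]], relevant: Dict[str, str]
) -> float:
    """Average of 1/rank of the relevant chunk; a miss contributes 0."""
    total = 0.0
    for q, ranked in rankings.items():
        if relevant[q] in ranked:
            total += 1.0 / (ranked.index(relevant[q]) + 1)
    return total / len(rankings)


# Toy input: for each query, the chunk ids a retriever returned, best first.
toy_rankings = {
    "q1": ["c3", "c1", "c7"],
    "q2": ["c2", "c9", "c4"],
    "q3": ["c8", "c6", "c5"],
}
# Toy ground truth: the single chunk that answers each query.
toy_relevant = {"q1": "c1", "q2": "c2", "q3": "c0"}

print(hit_rate_at_k(toy_rankings, toy_relevant, k=1))
print(hit_rate_at_k(toy_rankings, toy_relevant, k=3))
print(mean_reciprocal_rank(toy_rankings, toy_relevant))
```

Running the same functions once per chunking strategy, over the same queries and ground truth, gives numbers that can be compared side by side.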

By then, you'll have enough food for thought to get your hands dirty, clone the repo, and tweak the experiment yourself. Come along, gather some insights, and get inspired to break down tables and figures from your own PDF files and improve retrieval in your RAG systems.

Caio Benatti Moretti

Caio holds a PhD in Computer Science and has been working with data and AI in both academia and industry since 2014. Currently working as a DS/MLE Consultant at Xebia Data, he is particularly keen on neural networks in their many forms and applications. His enthusiasm even led him to make a neural network fit inside a business card. With experience designing applications and taking them into production, Caio has recently been focusing on how (Generative) AI can augment human productivity.