PDF, a must-have in RAG systems, ensures visual fidelity across platforms and devices, at the expense of the one thing computers need to properly process and interpret text: semantics. That means any logical arrangement of text, upon rendering, explodes into dumb visual shards of data that quite literally paint the bigger picture for the human eye to perceive, but no longer convey the information computers should grasp. This bottleneck already makes proper ingestion of text-only documents a big challenge; when tables or figures come into play, it becomes the ultimate nightmare for PDF parsers, not to mention developers. The rest you must have already foreseen: a RAG system barfing unreliable knowledge out of bad chunks produced by regular PDF parsing, if those chunks ever get retrieved from the vector database at all.
In this talk you will gather vision-driven insights on how to leverage the strengths of PDFs and language models to produce good chunks for ingestion into a vector database. Or, in other words, how multimodal models can go beyond trivial reverse engineering by decomposing tables into their building blocks, in plain language, the way they would be explained to another human; or better yet, the way humans would ask questions about such pieces of knowledge. That brings robustness to retrieval, the backbone of RAG. And from such a strategy, the same rationale transfers to figures.
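To make the idea concrete, here is a minimal sketch of that pipeline, not the talk's actual code: it assumes an OpenAI vision model and Chroma as the vector store, and the model name, prompt, and file name are illustrative placeholders.

```python
import base64

import chromadb
from openai import OpenAI

client = OpenAI()

def table_to_plain_language(image_path: str) -> list[str]:
    """Ask a multimodal model to decompose a table image into standalone,
    plain-language sentences -- one candidate chunk per line."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Explain each fact in this table as a standalone, "
                         "plain-language sentence, one per line, as you would "
                         "explain it to another person."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return [line for line in response.choices[0].message.content.splitlines()
            if line.strip()]

# Ingest the plain-language chunks into a vector store (Chroma, for brevity).
chunks = table_to_plain_language("table_snippet.png")  # hypothetical table crop
collection = chromadb.Client().create_collection(name="table_chunks")
collection.add(ids=[f"chunk-{i}" for i in range(len(chunks))], documents=chunks)

# Retrieval now matches questions against sentences, not visual table fragments.
print(collection.query(query_texts=["What was the Q3 revenue?"], n_results=3))
```

The key design choice is that what lands in the vector database is language shaped like the questions users will ask, rather than whatever fragments a PDF parser happens to emit.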
Get ready to boost your retrieval skills, as we:
- parse tables and figures out of PDF files;
- decompose them into plain-language chunks with multimodal models;
- ingest those chunks into a vector database;
- put retrieval to the test.
By then, you'll have enough food for thought to get your hands dirty, clone the repo, and tweak the experiment yourself. Come along, gather some insights, and get inspired to break down tables and figures from your own PDF files and to improve retrieval in your RAG systems.