
Multi-Format Document RAG: Building a Retrieval Pipeline Across PDFs, Word, Excel, and Audio
Enterprise knowledge lives in PDFs, Word documents, spreadsheets, presentations, and even audio recordings. A RAG pipeline that only handles one format misses most of the organisation's knowledge.
Most RAG tutorials start with a single file type. Load a PDF, split it into chunks, embed the chunks, and query. The demo works. Then someone asks: "Can it also search our Excel pricing sheets, the recorded customer calls, and that PowerPoint deck from last quarter?" The answer is usually silence.
Enterprise knowledge does not live in one format. It is scattered across PDFs, Word documents, spreadsheets, presentations, HTML exports, images of whiteboards, and audio recordings of meetings. A multi-format document RAG pipeline that handles all of these sources through a single retrieval path is not a nice-to-have — it is a prerequisite for any system that claims to represent what an organisation actually knows.
Why Single-Format Pipelines Fail in Practice
The appeal of a single-format pipeline is simplicity. PDF-only ingestion is well understood, well documented, and relatively easy to build. But the moment it ships, its limitations become obvious.
Consider a typical enterprise scenario. A product team stores specifications in Word documents. Finance maintains pricing models in Excel. Legal keeps contracts as scanned PDFs. The sales team records client calls. Marketing publishes HTML newsletters. A single-format RAG pipeline that only ingests PDFs will miss specifications, pricing, call transcripts, and marketing content entirely. The retrieval system answers questions from a fraction of the knowledge base, and users learn not to trust it.
The problem compounds over time. Teams that know their content is excluded stop contributing to the knowledge system. The pipeline becomes a PDF search engine rather than an organisational memory. Building a document-to-RAG pipeline that handles every format from the start avoids this failure mode.
Format-Specific Challenges
Each document format presents distinct extraction challenges. Understanding these is essential before designing a unified pipeline.
PDFs: The Deceptive Standard
PDFs look simple but are architecturally complex. A digital-native PDF contains extractable text layers, but a scanned PDF is essentially an image wrapped in a container. Table extraction from PDFs remains one of the hardest problems in document AI — columns misalign, headers span multiple rows, and footnotes interrupt data regions. Multi-column layouts, embedded charts, and mixed orientation pages add further complexity. A robust PDF parser must handle all of these variants without manual configuration per document.
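One pragmatic way to handle the digital-versus-scanned split is a routing heuristic on the extracted text layer. Below is a minimal sketch, assuming text has already been pulled per page with a library such as pypdf; the character threshold and page ratio are illustrative choices, not standards:

```python
def classify_pdf_pages(page_texts, min_chars=25):
    """Label each page as 'digital' or 'scanned' based on extractable text.

    A page whose text layer yields almost no characters is most likely a
    scanned image and should be routed to OCR instead of plain extraction.
    """
    labels = []
    for text in page_texts:
        stripped = (text or "").strip()
        labels.append("digital" if len(stripped) >= min_chars else "scanned")
    return labels


def needs_ocr(page_texts, threshold=0.5):
    """Route the whole document to OCR when most pages look scanned."""
    labels = classify_pdf_pages(page_texts)
    return labels.count("scanned") / max(len(labels), 1) > threshold
```

Per-page routing (rather than per-document) is often better for mixed documents, where a digital contract has a scanned signature page appended.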
Word Documents: Structure Without Consistency
DOCX files carry rich structural metadata — headings, lists, tables, footnotes, comments, tracked changes. The challenge is that authors use these features inconsistently. One team uses Heading 2 for section titles. Another uses bold body text. A third uses manual line breaks instead of paragraph styles. The parser must extract semantic structure even when the formatting is informal, and it must decide whether tracked changes and comments are part of the canonical content or noise.
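The heading-inference problem can be approximated with a small rule: trust Heading styles where they exist, and fall back to "short and fully bold" for authors who fake headings with body text. A sketch, assuming a DOCX parser (such as python-docx) has already produced per-paragraph text, style name, and bold flags; the dict shape and word cutoff are assumptions for illustration:

```python
def infer_headings(paragraphs, max_heading_words=12):
    """Tag paragraphs as headings or body text.

    `paragraphs` is a list of dicts like {"text": ..., "style": ..., "bold": ...}.
    A paragraph counts as a heading if it uses a Heading style, or if it is
    short and fully bold, which covers informal formatting habits.
    """
    tagged = []
    for p in paragraphs:
        text = p.get("text", "").strip()
        style = (p.get("style") or "").lower()
        is_styled = style.startswith("heading")
        is_informal = p.get("bold", False) and 0 < len(text.split()) <= max_heading_words
        kind = "heading" if (is_styled or is_informal) else "body"
        tagged.append({"text": text, "kind": kind})
    return tagged
```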
Spreadsheets: Data Masquerading as Documents
Excel and CSV files sit at the boundary between structured and unstructured data. A clean spreadsheet with column headers and typed values is essentially a database table. But enterprise spreadsheets rarely look like that. They contain merged cells, embedded notes, multi-sheet workbooks where Sheet 3 references Sheet 1, pivot tables, and free-text columns where someone typed three paragraphs into a single cell. A RAG pipeline that ingests spreadsheets must handle both the tabular and narrative aspects of these files.
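One way to serve both aspects is to serialise each row against its headers, so the embedding model sees column semantics rather than bare values. A minimal sketch; the "header: value" format and the rows-per-chunk grouping are illustrative choices:

```python
def rows_to_chunks(header, rows, rows_per_chunk=50):
    """Serialise spreadsheet rows into retrievable text chunks.

    Each row becomes a "header: value" line, and rows are grouped so one
    chunk carries enough context to answer a question on its own. Empty
    cells are skipped rather than serialised as blanks.
    """
    lines = []
    for row in rows:
        pairs = [f"{h}: {v}" for h, v in zip(header, row) if str(v).strip()]
        lines.append("; ".join(pairs))
    return ["\n".join(lines[i:i + rows_per_chunk])
            for i in range(0, len(lines), rows_per_chunk)]
```

Free-text cells containing whole paragraphs would still pass through as ordinary values here; a fuller pipeline might route unusually long cells to the prose chunking path instead.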
Presentations: Visual Knowledge
PowerPoint decks encode knowledge visually — in slide titles, bullet points, speaker notes, and embedded charts. The text is fragmented by design. A single concept might span three slides with five bullets each. Chunking strategies that work for prose documents fail here because the unit of meaning in a presentation is the slide or slide group, not the paragraph.
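Slide-group chunking can be sketched as merging consecutive slides that share a title, on the assumption that a repeated title marks one concept continued across slides. The slide dict shape below is hypothetical, standing in for what a PPTX parser would emit:

```python
def chunk_slides(slides, max_slides_per_chunk=3):
    """Group presentation slides into retrieval chunks.

    `slides` is a list of dicts like {"title": ..., "bullets": [...], "notes": ...}.
    Consecutive slides sharing a title are merged, capped so a chunk never
    grows unbounded. Speaker notes travel with their slide.
    """
    chunks, current, current_title = [], [], None

    def flush():
        if current:
            chunks.append("\n".join(current))
            current.clear()

    for slide in slides:
        title = slide.get("title", "")
        if title != current_title or len(current) >= max_slides_per_chunk:
            flush()
            current_title = title
        body = [title] if not current else []  # title once per chunk
        body += slide.get("bullets", [])
        if slide.get("notes"):
            body.append(slide["notes"])
        current.append("\n".join(body))
    flush()
    return chunks
```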
Audio: The Unindexed Archive
Meeting recordings, customer calls, and conference presentations contain enormous amounts of institutional knowledge that never gets written down. Ingesting audio requires transcription as a first step, but the challenges go beyond speech-to-text accuracy. Speaker diarisation (identifying who said what), timestamp alignment, and handling domain-specific terminology all affect retrieval quality. A multi-format document RAG pipeline must treat audio as a first-class source, not an afterthought.
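Once transcription and diarisation have produced timed, speaker-labelled segments, chunking for retrieval reduces to time-windowed merging. A sketch, assuming segment dicts in roughly the shape a Whisper-style transcriber plus a diarisation pass would yield; the field names and the two-minute window are assumptions:

```python
def chunk_transcript(segments, max_seconds=120.0):
    """Merge diarised transcript segments into time-windowed chunks.

    `segments` is a list of dicts like {"start": s, "end": s, "speaker": ...,
    "text": ...}. Each chunk keeps its start/end times so answers can link
    back to a position in the recording.
    """
    def merge(buf):
        text = " ".join(f'{s["speaker"]}: {s["text"]}' for s in buf)
        return {"start": buf[0]["start"], "end": buf[-1]["end"], "text": text}

    chunks, buf = [], []
    for seg in segments:
        if buf and seg["end"] - buf[0]["start"] > max_seconds:
            chunks.append(merge(buf))
            buf = []
        buf.append(seg)
    if buf:
        chunks.append(merge(buf))
    return chunks
```

Keeping speaker labels inline ("Alice: ...") also lets retrieval answer who-said-what questions without a separate lookup.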
HTML and Images
HTML pages from internal wikis, knowledge bases, and exported emails carry their own quirks — nested tables for layout, inline styles that obscure semantic meaning, and boilerplate navigation that must be stripped. Images of whiteboards, diagrams, and handwritten notes require OCR or vision models to extract any text at all.
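Stripping boilerplate navigation can be sketched with Python's standard-library HTMLParser. The set of skipped tags here is an assumption and would be tuned per source; real wikis often need site-specific rules on top:

```python
from html.parser import HTMLParser


class BoilerplateStripper(HTMLParser):
    """Extract visible text from HTML while skipping non-content regions."""

    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._depth = 0   # nesting depth inside skipped elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = BoilerplateStripper()
    parser.feed(html)
    return " ".join(parser.parts)
```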
Unifying Formats in a Single Pipeline
The key architectural insight is that format-specific parsing is an ingestion concern, not a pipeline concern. Each format needs its own parser, but once the content is extracted, every document — regardless of its original format — enters the same processing path.
A well-designed multi-format pipeline follows four stages:
Clean — Raw extraction output is normalised. Character encoding issues are resolved, boilerplate is removed, and formatting artefacts from the source format are stripped. The output is clean text with preserved structural markers.
Transform — Cleaned content is chunked, enriched with metadata, and embedded. Chunking strategies may vary slightly by source type (slide-level for presentations, row-group-level for spreadsheets, paragraph-level for documents), but the embedding and indexing process is identical.
Integrate — Chunks are loaded into the vector store and linked back to their source documents. Metadata includes the original format, source location, extraction timestamp, and any structural context (page number, sheet name, slide index, speaker label).
Serve — A single retrieval interface queries across all sources. The user asks a question and gets answers drawn from PDFs, spreadsheets, transcripts, and presentations — ranked by relevance, not by format.
This four-stage architecture — Clean, Transform, Integrate, Serve — means that adding a new format only requires writing a new parser at the ingestion layer. The rest of the pipeline remains untouched.
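That separation can be sketched as a parser registry at the ingestion layer feeding one shared processing function. The parser bodies below are placeholders standing in for real extraction libraries; the registry pattern is the point:

```python
PARSERS = {}


def parser(*extensions):
    """Register a format-specific parser for the ingestion layer."""
    def register(fn):
        for ext in extensions:
            PARSERS[ext] = fn
        return fn
    return register


@parser(".pdf")
def parse_pdf(path):
    return f"text extracted from PDF {path}"   # placeholder for a real PDF parser


@parser(".xlsx", ".csv")
def parse_sheet(path):
    return f"rows serialised from spreadsheet {path}"  # placeholder


def ingest(path):
    """Parse (format-specific), then enter the shared Clean stage."""
    ext = path[path.rfind("."):].lower()
    if ext not in PARSERS:
        raise ValueError(f"no parser registered for {ext}")
    raw = PARSERS[ext](path)
    cleaned = " ".join(raw.split())   # Clean: normalise whitespace, encodings, etc.
    return {"source": path, "format": ext, "text": cleaned}  # ready for Transform
```

Adding a new format is then one decorated function; Clean, Transform, Integrate, and Serve never change.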
The Visual Pipeline Advantage
Configuring a multi-format pipeline in code is possible but error-prone. Each parser has its own parameters (OCR settings for PDFs, sheet selection for Excel, transcription model for audio), and the interactions between parsing, chunking, and embedding are difficult to reason about in a configuration file.
A visual canvas where pipeline stages are represented as nodes provides a fundamentally different workflow. You can see multiple ingest nodes — one for each format — converging into shared cleaning, transformation, and indexing nodes. The data flow is explicit. When something breaks, you can trace the path from a specific source format through each processing stage to understand where the failure occurred.
Ertas supports eight parsers — PDF, Word, PowerPoint, Excel, CSV, HTML, Image, and Audio — all feeding into the same pipeline through a visual canvas. Each parser appears as a distinct ingest node, but downstream processing is shared. This means a team can start with PDF ingestion, verify that retrieval works, and then add Excel and audio sources without rebuilding anything.
Practical Considerations
Building a RAG pipeline for Word, Excel, PDF, and other formats raises several practical questions that are worth addressing early.
Chunk size consistency. Different formats produce chunks of very different lengths. A spreadsheet row might be 20 tokens. A PDF page might be 800. Normalising chunk sizes across formats improves embedding quality and retrieval fairness — otherwise long-document formats dominate search results simply because they produce more text per chunk.
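A simple normalisation pass can split oversized chunks and greedily merge undersized ones. This sketch approximates token counts with word counts; a production pipeline would use the embedding model's tokenizer, and the thresholds are illustrative:

```python
def normalise_chunks(chunks, min_tokens=100, max_tokens=400):
    """Bring chunks from all formats into a comparable size band."""
    # Split anything too long into max_tokens-sized pieces.
    sized = []
    for chunk in chunks:
        words = chunk.split()
        for i in range(0, len(words), max_tokens):
            sized.append(" ".join(words[i:i + max_tokens]))
    # Greedily merge consecutive short pieces up to the minimum.
    merged, buf = [], []
    for chunk in sized:
        buf.append(chunk)
        if sum(len(c.split()) for c in buf) >= min_tokens:
            merged.append("\n".join(buf))
            buf = []
    if buf:  # leftover tail: attach to the last chunk rather than emit a runt
        if merged:
            merged[-1] += "\n" + "\n".join(buf)
        else:
            merged.append("\n".join(buf))
    return merged
```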
Metadata for provenance. Users need to know where an answer came from. "Page 14 of the Q3 report" is useful. "Chunk 847" is not. Preserving format-specific location metadata (page, sheet, slide, timestamp) through the pipeline is essential for trust.
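Provenance is easiest to preserve as a small metadata record attached to every chunk, with one renderer that knows each format's natural locator. A sketch with illustrative field names:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provenance:
    source: str
    fmt: str
    page: Optional[int] = None        # PDFs, Word
    sheet: Optional[str] = None       # spreadsheets
    slide: Optional[int] = None       # presentations
    timestamp: Optional[float] = None  # audio, seconds from start

    def cite(self):
        """Render a human-readable location, e.g. 'q3-report.pdf, page 14'."""
        if self.page is not None:
            where = f"page {self.page}"
        elif self.sheet is not None:
            where = f"sheet {self.sheet!r}"
        elif self.slide is not None:
            where = f"slide {self.slide}"
        elif self.timestamp is not None:
            m, s = divmod(int(self.timestamp), 60)
            where = f"{m:02d}:{s:02d}"
        else:
            where = self.fmt
        return f"{self.source}, {where}"
```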
Incremental updates. Enterprise document repositories change constantly. The pipeline must support re-ingesting updated documents without reprocessing the entire corpus. This requires tracking document versions and only processing deltas.
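Delta detection can be as simple as comparing content hashes between the index and the current repository state. A sketch using SHA-256 over raw bytes; the in-memory dicts stand in for whatever store the pipeline actually uses:

```python
import hashlib


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def find_deltas(index, documents):
    """Decide which documents need (re)processing.

    `index` maps path -> hash of the last ingested version; `documents` maps
    path -> current raw bytes. Returns paths to process (new or changed) and
    paths whose chunks should be deleted (document removed).
    """
    to_process = [
        path for path, data in documents.items()
        if index.get(path) != content_hash(data)
    ]
    to_delete = [path for path in index if path not in documents]
    return to_process, to_delete
```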
Access control. Not every user should see every document. Format-aware metadata makes it possible to apply source-level permissions at retrieval time, ensuring that the RAG system respects the same access rules as the original document store.
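At retrieval time this becomes a post-search filter on group metadata inherited from the source document. A sketch; the `allowed_groups` field name is an assumption, and filtering happens after vector search but before generation, so the LLM never sees restricted content:

```python
def filter_by_access(results, user_groups):
    """Drop retrieved chunks the user may not see.

    Each result carries `allowed_groups` metadata inherited from its source
    document; a chunk survives only if the user shares at least one group.
    A chunk with no groups at all is treated as restricted, not public.
    """
    user = set(user_groups)
    return [r for r in results if user & set(r.get("allowed_groups", []))]
```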
Conclusion
A multi-format document RAG pipeline is not a luxury feature — it is the minimum viable architecture for enterprise retrieval. Organisations that start with a single-format pipeline inevitably face the same problem: the system knows about PDFs but is blind to everything else. By designing for multiple formats from the beginning, with format-specific parsers feeding into a shared processing path, teams build retrieval systems that actually represent what the organisation knows. The alternative is a search engine that only sees a fraction of the answers.