Introduction
R2R’s ingestion pipeline processes a variety of document formats and transforms them into searchable content. It integrates with vector databases and knowledge graphs for retrieval and analysis. R2R offers two main implementations for ingestion:
- Light: Uses R2R’s built-in ingestion logic, which supports a wide range of file types including TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images, audio, and video. This is the default for the ‘light’ installation.
- Full: Leverages Unstructured’s open-source ingestion platform to handle supported file types. This is the default for the ‘full’ installation and provides more advanced parsing capabilities.
Key Configuration Areas
Many of the settings managed by r2r.toml relate to the ingestion process. Some of them, from default_ingestion_settings.toml, are described below:
- The [database] section configures the Postgres database used for semantic search and document management. During retrieval, this database is queried to find the most relevant document chunks based on vector similarity.
- The [ingestion] section determines how different file types are processed and converted into text. It also sets how text is split into smaller, manageable pieces, which affects the granularity of information storage and retrieval.
- The [embedding] section defines the model and parameters for converting text into vector embeddings. During retrieval, the same settings are used to embed the user’s query so that it can be compared against the stored document embeddings.
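To make the roles of these three sections concrete, here is a minimal Python sketch of the underlying idea: text is split into overlapping chunks, each chunk is embedded, and a query is embedded the same way and compared by cosine similarity. This is not R2R’s implementation. The function names (split_text, toy_embed, cosine, search), the character-based splitting, and the bag-of-words "embedding" are all assumptions chosen for illustration; a real deployment would use the configured embedding model and Postgres for storage.

```python
import math
from collections import Counter


def split_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Embed the query like the chunks and return the most similar chunks."""
    query_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:top_k]
```

Smaller chunks with some overlap give finer-grained matches at the cost of storing more vectors, which is the trade-off the chunking settings control.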
Key Features
- Multi-format Support: Handles various document types including TXT, JSON, HTML, PDF, DOCX, PPTX, XLSX, CSV, Markdown, images, audio, and video.
- Customizable: Supports the addition of custom parsers for specific data types.
- Asynchronous Processing: Efficiently manages data handling with asynchronous operations.
- Dual Storage: Supports ingestion into both vector databases for embedding-based search and knowledge graphs for structured information retrieval.
- Modular Design: Composed of distinct pipes that can be customized or extended.
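The asynchronous and modular points above can be pictured with a small Python sketch: each pipe is an async function that takes a list of text items and returns a new list, and a pipeline is an ordered list of such pipes. This illustrates the pattern only and is not R2R’s actual pipe interface. The names Pipe, parse_pipe, chunk_pipe, run_pipeline, and ingest_all are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

# A pipe is any async function mapping a list of text items to a new list.
Pipe = Callable[[list[str]], Awaitable[list[str]]]


async def parse_pipe(items: list[str]) -> list[str]:
    """Hypothetical parsing stage: normalise whitespace in each item."""
    return [" ".join(item.split()) for item in items]


async def chunk_pipe(items: list[str]) -> list[str]:
    """Hypothetical chunking stage: split each item into sentences on '.'."""
    return [s.strip() for item in items for s in item.split(".") if s.strip()]


async def run_pipeline(document: str, pipes: list[Pipe]) -> list[str]:
    """Pass one document through each pipe in order."""
    items = [document]
    for pipe in pipes:
        items = await pipe(items)
    return items


async def ingest_all(documents: list[str], pipes: list[Pipe]) -> list[list[str]]:
    """Run the pipeline for every document concurrently; results keep input order."""
    return await asyncio.gather(*(run_pipeline(doc, pipes) for doc in documents))
```

Because every stage shares the same signature, a custom pipe can be added or swapped by changing the list passed to ingest_all, for example `asyncio.run(ingest_all(docs, [parse_pipe, chunk_pipe]))`.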