Parsing
R2R supports different parsing providers to extract text from various document formats. The parsing provider is set in r2r.toml and can be one of the following:
- r2r: Default offering for light installations, a simple and lightweight parser included in R2R.
- unstructured_local: Default offering for full installations, which uses the open-source Unstructured package.
- unstructured_api: Cloud offering of Unstructured.
Supported File Types
R2R supports parsing for the following file types:
- BMP (Bitmap Image)
- CSV (Comma-Separated Values)
- DOC (Microsoft Word Document)
- DOCX (Microsoft Word Document)
- EML (Electronic Mail)
- EPUB (Electronic Publication)
- GIF (Graphics Interchange Format)
- HEIC (High-Efficiency Image Format)
- HTM (HyperText Markup)
- HTML (HyperText Markup Language)
- JPEG (Joint Photographic Experts Group)
- JPG (Joint Photographic Experts Group)
- JSON (JavaScript Object Notation)
- MD (Markdown)
- MSG (Microsoft Outlook Message)
- MP3 (MPEG Audio Layer III)
- MP4 (MPEG-4 Part 14)
- ODT (Open Document Text)
- ORG (Org Mode)
- PDF (Portable Document Format)
- P7S (PKCS#7)
- PNG (Portable Network Graphics)
- PPT (PowerPoint)
- PPTX (Microsoft PowerPoint Presentation)
- RST (reStructuredText)
- RTF (Rich Text Format)
- SVG (Scalable Vector Graphics)
- TSV (Tab-Separated Values)
- TXT (Plain Text)
- XLS (Microsoft Excel Spreadsheet)
- XLSX (Microsoft Excel Spreadsheet)
- XML (Extensible Markup Language)
- TIFF (Tagged Image File Format)
Parsing providers for an R2R system cannot be configured at runtime and are instead configured server side.
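A client may want to check file extensions against the supported list before submitting documents. The following is a minimal sketch of such a pre-check. The `is_supported` helper is hypothetical and not part of R2R, and the server remains the authority on what it will actually parse.

```python
from pathlib import Path

# Extensions copied from the supported file types list, lowercased.
SUPPORTED_EXTENSIONS = frozenset({
    "bmp", "csv", "doc", "docx", "eml", "epub", "gif", "heic", "htm", "html",
    "jpeg", "jpg", "json", "md", "msg", "mp3", "mp4", "odt", "org", "pdf",
    "p7s", "png", "ppt", "pptx", "rst", "rtf", "svg", "tsv", "txt", "xls",
    "xlsx", "xml", "tiff",
})


def is_supported(path: str) -> bool:
    """Return True if the file's final extension is in the supported list."""
    # Path.suffix keeps the leading dot, so strip it before the lookup.
    suffix = Path(path).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_EXTENSIONS
```

The check is case-insensitive and looks only at the final extension, so it says nothing about whether the file's contents match its name.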
Chunking
R2R uses chunking to break down parsed documents into smaller, manageable pieces for efficient processing and retrieval. Configure the chunking settings in r2r.toml. The key options are:
- provider: The chunking provider (defaults to "r2r").
- chunking_strategy: The chunking method ("recursive").
- chunk_size: The target size for each chunk.
- chunk_overlap: The number of characters to overlap between chunks.
- excluded_parsers: List of parsers to exclude (e.g., ["mp4"]).
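To make the interaction between chunk_size and chunk_overlap concrete, here is a deliberately simplified sketch. It uses a fixed-size sliding window over characters, which is not R2R's "recursive" strategy. The `chunk_text` function is hypothetical and measures sizes in characters as an assumption.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a plain
    sliding window, not a recursive or structure-aware splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With a chunk size of 4 and an overlap of 1, the string "abcdefghij" is split into "abcd", "defg", and "ghij": each chunk begins with the last character of the previous one. A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks.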
The Unstructured provider uses its own set of options:
- strategy: The overall chunking strategy ("auto", "fast", or "hi_res").
- chunking_strategy: The specific chunking method ("by_title" or "basic").
- new_after_n_chars: Soft maximum size for a chunk.
- max_characters: Hard maximum size for a chunk.
- combine_under_n_chars: Minimum size for combining small sections.
- overlap: Number of characters to overlap between chunks.
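The size-related options above relate to one another: the soft maximum should not exceed the hard maximum, for example. The sketch below collects the options in a dataclass and reports inconsistent combinations. The `UnstructuredChunking` class, its default values, and the specific checks are assumptions chosen for illustration, not rules taken from R2R or Unstructured.

```python
from dataclasses import dataclass

# Allowed values as listed in this section.
PARTITION_STRATEGIES = {"auto", "fast", "hi_res"}
CHUNKING_STRATEGIES = {"by_title", "basic"}


@dataclass(frozen=True)
class UnstructuredChunking:
    # Default values are illustrative assumptions, not R2R's defaults.
    strategy: str = "auto"
    chunking_strategy: str = "by_title"
    new_after_n_chars: int = 512      # soft maximum
    max_characters: int = 1024        # hard maximum
    combine_under_n_chars: int = 128  # small sections below this are merged
    overlap: int = 20

    def problems(self) -> list:
        """Return human-readable descriptions of inconsistent settings."""
        found = []
        if self.strategy not in PARTITION_STRATEGIES:
            found.append(f"strategy: unknown value {self.strategy!r}")
        if self.chunking_strategy not in CHUNKING_STRATEGIES:
            found.append(
                f"chunking_strategy: unknown value {self.chunking_strategy!r}"
            )
        if self.new_after_n_chars > self.max_characters:
            found.append("new_after_n_chars: soft maximum exceeds max_characters")
        if self.combine_under_n_chars > self.max_characters:
            found.append("combine_under_n_chars: exceeds max_characters")
        if not 0 <= self.overlap < self.max_characters:
            found.append("overlap: must be in [0, max_characters)")
        return found
```

An empty list from `problems()` means the combination passed these sanity checks. It does not guarantee that the server will accept the configuration.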
Supported Providers
- r2r: The default chunking provider.
- Unstructured: The provider for full installations, using the open-source Unstructured library for local processing.
Advanced Configuration Options
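When using the Unstructured chunking provider, you can specify additional parameters in the configuration file, such as strategy, chunking_strategy, new_after_n_chars, max_characters, combine_under_n_chars, and overlap.
Runtime Configuration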
The chunking configuration can be specified at runtime with the ingest_files endpoint, allowing dynamic adjustment of chunking parameters based on the input documents or specific use cases.
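One way to picture runtime configuration is as per-request overrides layered on top of the server's defaults. The sketch below shows that layering only. The `resolve_chunking_config` helper and the example default values are hypothetical, and the actual request shape of the ingest_files endpoint is not shown in this section.

```python
from typing import Any, Mapping, Optional

# Option names taken from the chunking options described in this section.
KNOWN_OPTIONS = frozenset({
    "provider", "chunking_strategy", "chunk_size", "chunk_overlap",
    "excluded_parsers", "strategy", "new_after_n_chars", "max_characters",
    "combine_under_n_chars", "overlap",
})


def resolve_chunking_config(
    defaults: Mapping[str, Any],
    overrides: Optional[Mapping[str, Any]] = None,
) -> dict:
    """Return a new dict with the overrides applied on top of the defaults.

    Unknown option names are rejected so that typos do not pass silently.
    Neither input mapping is modified.
    """
    overrides = dict(overrides or {})
    unknown = sorted(set(overrides) - KNOWN_OPTIONS)
    if unknown:
        raise ValueError(f"unknown chunking option(s): {', '.join(unknown)}")
    return {**defaults, **overrides}


# Illustrative defaults; the numeric values are assumptions.
SERVER_DEFAULTS = {
    "provider": "r2r",
    "chunking_strategy": "recursive",
    "chunk_size": 512,
    "chunk_overlap": 20,
}
```

For instance, `resolve_chunking_config(SERVER_DEFAULTS, {"chunk_size": 256})` keeps every default except chunk_size, which suits a batch of short documents that benefits from smaller chunks.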