# Parsing & Chunking

> Learn how to configure document chunking in your R2R deployment

## Parsing

R2R supports different parsing providers to extract text from various document formats. To configure the parsing provider:

```toml example r2r.toml
[parsing]
provider = "unstructured_local" # or "r2r" | "unstructured_api"
excluded_parsers = ["mp4"]
```

Available providers:

* `r2r`: The default for `light` installations; a simple, lightweight parser included with R2R.
* `unstructured_local`: The default for `full` installations; uses the open-source Unstructured package.
* `unstructured_api`: The cloud offering of Unstructured.

### Supported File Types

**R2R supports parsing for the following file types:**

* BMP (Bitmap Image)
* CSV (Comma-Separated Values)
* DOC (Microsoft Word Document)
* DOCX (Microsoft Word Document)
* EML (Electronic Mail)
* EPUB (Electronic Publication)
* GIF (Graphics Interchange Format)
* HEIC (High-Efficiency Image Format)
* HTM (HyperText Markup Language)
* HTML (HyperText Markup Language)
* JPEG (Joint Photographic Experts Group)
* JPG (Joint Photographic Experts Group)
* JSON (JavaScript Object Notation)
* MD (Markdown)
* MSG (Microsoft Outlook Message)
* MP3 (MPEG Audio Layer III)
* MP4 (MPEG-4 Part 14)
* ODT (Open Document Text)
* ORG (Org Mode)
* PDF (Portable Document Format)
* P7S (PKCS#7)
* PNG (Portable Network Graphics)
* PPT (Microsoft PowerPoint Presentation)
* PPTX (Microsoft PowerPoint Presentation)
* RST (reStructured Text)
* RTF (Rich Text Format)
* SVG (Scalable Vector Graphics)
* TSV (Tab-Separated Values)
* TXT (Plain Text)
* XLS (Microsoft Excel Spreadsheet)
* XLSX (Microsoft Excel Spreadsheet)
* XML (Extensible Markup Language)
* TIFF (Tagged Image File Format)

<Note> Parsing providers cannot be changed at runtime; they are configured server-side. </Note>

**Refer to the [Unstructured documentation](https://docs.unstructured.io/welcome) for details about their ingestion capabilities and limitations.**
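
As a rough illustration of how the supported file types and `excluded_parsers` interact, the sketch below filters a list of files before ingestion. The `is_parseable` helper and the `SUPPORTED_EXTENSIONS` set are hypothetical names introduced here, not part of the R2R API; the extensions are copied from the list above.

```python
from pathlib import Path

# Extensions copied from the supported file types list above.
SUPPORTED_EXTENSIONS = {
    "bmp", "csv", "doc", "docx", "eml", "epub", "gif", "heic", "htm", "html",
    "jpeg", "jpg", "json", "md", "msg", "mp3", "mp4", "odt", "org", "pdf",
    "p7s", "png", "ppt", "pptx", "rst", "rtf", "svg", "tsv", "txt", "xls",
    "xlsx", "xml", "tiff",
}


def is_parseable(path: str, excluded_parsers: frozenset = frozenset()) -> bool:
    """Return True if the file extension is supported and not excluded.

    Hypothetical helper for illustration; it is not part of the R2R API.
    """
    extension = Path(path).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS and extension not in excluded_parsers


files = ["report.PDF", "talk.mp4", "archive.zip"]
# Mirrors excluded_parsers = ["mp4"] from the example configuration.
accepted = [f for f in files if is_parseable(f, frozenset({"mp4"}))]
print(accepted)
```

A client-side check like this only saves a round trip; the server-side parsing configuration remains the source of truth.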

## Chunking

R2R uses chunking to break down parsed documents into smaller, manageable pieces for efficient processing and retrieval. Configure the chunking settings in `r2r.toml`:

```toml r2r.toml
[chunking]
provider = "unstructured_local"
strategy = "auto"
chunking_strategy = "by_title"
new_after_n_chars = 512
max_characters = 1_024
combine_under_n_chars = 128
overlap = 20
```

Key chunking configuration options:

* `provider`: The chunking provider (defaults to `"r2r"`).

**For R2R:**

* `chunking_strategy`: The chunking method (`"recursive"`).
* `chunk_size`: The target size for each chunk.
* `chunk_overlap`: The number of characters to overlap between chunks.
* `excluded_parsers`: List of parsers to exclude (e.g., `["mp4"]`).

**For Unstructured:**

* `strategy`: The overall chunking strategy (`"auto"`, `"fast"`, or `"hi_res"`).
* `chunking_strategy`: The specific chunking method (`"by_title"` or `"basic"`).
* `new_after_n_chars`: Soft maximum size for a chunk, in characters.
* `max_characters`: Hard maximum size for a chunk, in characters.
* `combine_under_n_chars`: Size threshold, in characters, below which small sections are combined.
* `overlap`: Number of characters to overlap between chunks.
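
To make the size and overlap options more concrete, here is a minimal sketch of a fixed-size sliding-window chunker. It is a deliberately simplified stand-in: it is not R2R's `recursive` strategy or Unstructured's `by_title` logic, and the `chunk_text` function is a hypothetical name used only for this illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters. Simplified
    illustration only; not the splitter used by R2R or Unstructured.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the effect the overlap settings aim for: context at a chunk boundary is not lost to retrieval.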

## Supported Providers

<Tabs>
  <Tab title="Unstructured Local">
    ```bash
    # Ensure unstructured is installed
    # See the full installation docs: https://r2r-docs.sciphi.ai/introduction/documentation/installation/full/docker

    # Set 'provider = "unstructured_local"' for `ingestion` in `my_r2r.toml`.
    r2r serve --config-path=my_r2r.toml
    ```

    This is the default `full` provider, using the open-source Unstructured library for local processing.
  </Tab>

  <Tab title="Unstructured API">
    ```bash
    export UNSTRUCTURED_API_KEY=your_unstructured_api_key
    export UNSTRUCTURED_API_URL=your_unstructured_api_url
    # .. set other environment variables

    # Optional - Update default provider
    # Set 'provider = "unstructured_api"' for `ingestion` in `my_r2r.toml`.
    r2r serve --config-path=my_r2r.toml
    ```

    Uses the Unstructured platform API for chunking, which may offer additional features or performance benefits.
  </Tab>

  <Tab title="R2R">
    ```bash
    # No additional setup required

    # Optional - Update default provider
    # Set 'provider = "r2r"' for `ingestion` in `my_r2r.toml`.
    r2r serve --config-path=my_r2r.toml
    ```

    This is the default `light` provider, using the open-source R2R library for local processing.
  </Tab>
</Tabs>

### Advanced Configuration Options

When using the Unstructured chunking provider, you can specify additional parameters in the configuration file:

```toml
[ingestion]
provider = "unstructured_local"
strategy = "auto"  # "auto", "fast", or "hi_res"
chunking_strategy = "by_title"  # "by_title" or "basic"

# Core chunking parameters
combine_under_n_chars = 128
max_characters = 500
new_after_n_chars = 1500
overlap = 0

# Additional chunking options
coordinates = false
encoding = "utf-8"
extract_image_block_types = []  # List of image block types to extract
# gz_uncompressed_content_type: unset by default (TOML has no null value; omit the key)
# hi_res_model_name: unset by default
include_orig_elements = true
include_page_breaks = false

languages = []  # List of languages to consider
multipage_sections = true
ocr_languages = []  # List of languages for OCR
output_format = "application/json"
overlap_all = false
pdf_infer_table_structure = true

# similarity_threshold: unset by default
skip_infer_table_types = []  # List of table types to skip inference
split_pdf_concurrency_level = 5
split_pdf_page = true
# starting_page_number: unset by default
unique_element_ids = false
xml_keep_tags = false
```

These options allow fine-tuning of the chunking process for specific document types or requirements. See the [Unstructured chunking documentation](https://docs.unstructured.io/open-source/core-functionality/chunking) for details on the available settings.
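
Because a typo in one of these options can be easy to miss, a small pre-flight check may help. The sketch below is a hypothetical helper (`check_ingestion_settings` is not part of R2R or Unstructured); the allowed values are taken from the comments in the configuration above, and the non-negative integer rule for the size options is an assumption.

```python
# Allowed values as listed in the configuration comments above.
ALLOWED = {
    "strategy": {"auto", "fast", "hi_res"},
    "chunking_strategy": {"by_title", "basic"},
}
# Options assumed (for this sketch) to be non-negative character counts.
SIZE_KEYS = ("combine_under_n_chars", "max_characters", "new_after_n_chars", "overlap")


def check_ingestion_settings(settings: dict) -> list[str]:
    """Return human-readable problems found in the settings (hypothetical helper)."""
    problems = []
    for key, allowed in ALLOWED.items():
        value = settings.get(key)
        if value is not None and value not in allowed:
            problems.append(f"{key}: unexpected value {value!r}")
    for key in SIZE_KEYS:
        value = settings.get(key)
        if value is None:
            continue  # unset options are left to the provider's defaults
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            problems.append(f"{key}: expected a non-negative integer, got {value!r}")
    return problems


print(check_ingestion_settings({"strategy": "slow", "overlap": -1}))
```

An empty list means no problems were found by these particular checks; it does not guarantee the provider will accept the configuration.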

### Runtime Configuration

The chunking configuration can be specified at runtime with the [`ingest_files`](/api-reference/endpoint/ingest_files) endpoint, allowing dynamic adjustment of chunking parameters based on the input documents or specific use cases.
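
One way to adjust chunking per document is to build the configuration from the file being ingested. The sketch below is illustrative only: `ingestion_config_for`, `BASE_CONFIG`, and `OVERRIDES` are hypothetical names, and the override values are placeholders rather than recommended settings.

```python
from pathlib import Path

# Baseline settings shared by every request (values are illustrative).
BASE_CONFIG = {
    "provider": "unstructured_local",
    "chunking_strategy": "by_title",
    "max_characters": 1000,
}

# Per-extension overrides; assumptions for this sketch, not tuned values.
OVERRIDES = {
    ".pdf": {"strategy": "hi_res", "max_characters": 1024},
    ".md": {"max_characters": 512},
}


def ingestion_config_for(path: str) -> dict:
    """Build a chunking configuration dict for one file (hypothetical helper)."""
    config = dict(BASE_CONFIG)  # copy so the baseline is never mutated
    config.update(OVERRIDES.get(Path(path).suffix.lower(), {}))
    return config


print(ingestion_config_for("paper.pdf"))
print(ingestion_config_for("notes.txt"))
```

The resulting dictionary could then be supplied as the `ingestion_config` argument of an `ingest_files` call.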

### Combining Chunking with Other R2R Components

Chunking is a crucial part of the document processing pipeline in R2R. It works in conjunction with other components such as parsing, embedding, and retrieval. For example:

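```python
from r2r import R2RClient

# Assumes an R2R server is reachable at this address; adjust as needed.
client = R2RClient("http://localhost:7272")

response = client.ingest_files(
    file_paths=["document.pdf"],
    # Chunking settings applied to this request
    ingestion_config={
        "provider": "unstructured_local",
        "chunking_strategy": "by_title",
        "max_characters": 1000,
    },
    embedding_config={...},  # placeholder: supply your embedding settings
    # ... other configurations
)
```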

For more detailed information on configuring chunking and other ingestion settings, please refer to the [Ingestion Configuration documentation](/documentation/configuration/ingestion/overview).

## Next Steps

To learn more about configuring other components of R2R, explore the following pages:

* [Embedding Configuration](/documentation/configuration/ingestion/embedding)
* [Knowledge Graph Configuration](/documentation/configuration/knowledge-graph/overview)
* [Retrieval Configuration](/documentation/configuration/retrieval/overview)
