Ingestion

Document Ingestion and Management

Ingest Files

Ingest files or directories into your R2R system:

# Assumes `client` is an already-initialized R2R client.
file_paths = ['path/to/file1.txt', 'path/to/file2.txt']
metadatas = [{'key1': 'value1'}, {'key2': 'value2'}]

ingest_response = client.ingest_files(
    file_paths=file_paths,
    metadatas=metadatas,
    # optionally override chunking settings at runtime
    ingestion_config={
        "provider": "unstructured_local",
        "strategy": "auto",
        "chunking_strategy": "by_title",
        "new_after_n_chars": 256, # soft maximum
        "max_characters": 512, # hard maximum
        "combine_under_n_chars": 64, # hard minimum
        "overlap": 100,
    }
)

Response

response

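list[dict]

The response from the R2R system after ingesting the files: one entry per file, each containing a status message, the queued task_id, and the document_id.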

[{'message': 'Ingestion task queued successfully.', 'task_id': '6e27dfca-606d-422d-b73f-2d9e138661b4', 'document_id': 'c3291abf-8a4e-5d9d-80fd-232ef6fd8526'}, ...]
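Because each entry pairs a task_id with a document_id, it can be handy to index the response by document ID. The following is a minimal sketch that assumes the list shape shown above; `index_tasks` is a hypothetical helper, not part of the R2R client.

```python
def index_tasks(ingest_response):
    """Map each document_id to its task_id (hypothetical helper)."""
    return {item["document_id"]: item["task_id"] for item in ingest_response}


# Sample entry copied from the response shown above.
sample_response = [
    {
        "message": "Ingestion task queued successfully.",
        "task_id": "6e27dfca-606d-422d-b73f-2d9e138661b4",
        "document_id": "c3291abf-8a4e-5d9d-80fd-232ef6fd8526",
    }
]

tasks_by_document = index_tasks(sample_response)
print(tasks_by_document)
```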

file_paths

list[str]

required

A list of file paths or directory paths to ingest. If a directory path is provided, all files within the directory and its subdirectories will be ingested.
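To preview which files a directory path would contribute, the recursive expansion can be approximated locally. This is only a sketch of the behaviour described above, not R2R's own code; `expand_paths` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def expand_paths(paths):
    """Return every regular file under the given file or directory paths."""
    files = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # rglob("*") walks the directory and all of its subdirectories.
            files.extend(sorted(p for p in path.rglob("*") if p.is_file()))
        elif path.is_file():
            files.append(path)
    return [str(p) for p in files]


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "sub").mkdir()
    (Path(root) / "a.txt").write_text("a")
    (Path(root) / "sub" / "b.txt").write_text("b")
    found = [Path(p).name for p in expand_paths([root])]

print(found)
```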

metadatas

Optional[list[dict]]

An optional list of metadata dictionaries corresponding to each file. If provided, the length should match the number of files being ingested.

document_ids

Optional[list[Union[UUID, str]]]

An optional list of document IDs to assign to the ingested files. If provided, the length should match the number of files being ingested.

versions

Optional[list[str]]

An optional list of version strings for the ingested files. If provided, the length should match the number of files being ingested.
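Since metadatas, document_ids, and versions must each line up with the files being ingested, a quick length check before calling ingest_files can catch mismatches early. This sketch assumes file_paths contains only individual files (no directories); `check_parallel_lists` is a hypothetical helper, not part of the R2R client.

```python
def check_parallel_lists(file_paths, metadatas=None, document_ids=None, versions=None):
    """Raise ValueError if any optional list does not match len(file_paths)."""
    optional = {
        "metadatas": metadatas,
        "document_ids": document_ids,
        "versions": versions,
    }
    for name, values in optional.items():
        if values is not None and len(values) != len(file_paths):
            raise ValueError(
                f"{name} has {len(values)} entries but "
                f"{len(file_paths)} files were given"
            )


file_paths = ["path/to/file1.txt", "path/to/file2.txt"]
metadatas = [{"key1": "value1"}, {"key2": "value2"}]

# Passes silently because both lists have two entries.
check_parallel_lists(file_paths, metadatas=metadatas)
```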

ingestion_config

Optional[Union[dict, ChunkingConfig]]

The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime.

provider

str

default:"unstructured_local"

Which chunking provider to use. Options are “r2r”, “unstructured_local”, or “unstructured_api”.

max_chunk_size

Optional[int]

default:"None"

Sets a maximum size on output chunks.

combine_under_n_chars

Optional[int]

Combine chunks smaller than this number of characters.

max_characters

Optional[int]

Maximum number of characters per chunk.

coordinates

bool

default:"False"

Whether to include coordinates in the output.

encoding

Optional[str]

Encoding to use for text files.

extract_image_block_types

Optional[list[str]]

Types of image blocks to extract.

gz_uncompressed_content_type

Optional[str]

Content type for uncompressed gzip files.

hi_res_model_name

Optional[str]

Name of the high-resolution model to use.

include_orig_elements

Optional[bool]

default:"False"

Whether to include original elements in the output.

include_page_breaks

bool

Whether to include page breaks in the output.

languages

Optional[list[str]]

List of languages to consider for text processing.

multipage_sections

bool

default:"True"

Whether to allow sections to span multiple pages.

new_after_n_chars

Optional[int]

Start a new chunk after this many characters.

ocr_languages

Optional[list[str]]

Languages to use for OCR.

output_format

str

default:"application/json"

Format of the output.

overlap

int

default:"0"

Number of characters to overlap between chunks.

overlap_all

bool

default:"False"

Whether to overlap all chunks.

pdf_infer_table_structure

bool

default:"True"

Whether to infer table structure in PDFs.

similarity_threshold

Optional[float]

Threshold for considering chunks similar.

skip_infer_table_types

Optional[list[str]]

Types of tables to skip inferring.

split_pdf_concurrency_level

int

default:"5"

Concurrency level for splitting PDFs.

split_pdf_page

bool

default:"True"

Whether to split PDFs by page.

starting_page_number

Optional[int]

Page number to start processing from.

strategy

str

default:"auto"

Strategy for processing. Options are “auto”, “fast”, or “hi_res”.

chunking_strategy

Optional[str]

default:"by_title"

Strategy for chunking. Options are “by_title” or “basic”.

unique_element_ids

bool

default:"False"

Whether to generate unique IDs for elements.

xml_keep_tags

bool

default:"False"

Whether to keep XML tags in the output.
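To build intuition for how max_characters and overlap interact, here is a toy character-window chunker. It is deliberately simplistic and is not how R2R or the unstructured providers actually chunk documents; `chunk_text` is a hypothetical function written only for illustration.

```python
def chunk_text(text, max_characters=512, overlap=0):
    """Split text into windows of at most max_characters characters.

    Consecutive windows share `overlap` characters.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be in [0, max_characters)")
    step = max_characters - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_characters])
        # Stop once the current window reaches the end of the text.
        if start + max_characters >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", max_characters=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is what an overlap of 1 means in this toy model.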

Update Files

Update existing documents:

file_paths = ["/path/to/r2r/examples/data/aristotle_v2.txt"]
document_ids = ["9fbe403b-c11c-5aae-8ade-ef22980c3ad1"]
update_response = client.update_files(
    file_paths=file_paths,
    document_ids=document_ids,
    metadatas=[{"x": "y"}],  # overwrites the existing metadata
)

Response

response

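list[dict]

The response from the R2R system after updating the files: one entry per file, each containing a status message, the queued task_id, and the document_id.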

[{'message': 'Update files task queued successfully.', 'task_id': '6e27dfca-606d-422d-b73f-2d9e138661b4', 'document_id': '9f375ce9-efe9-5b57-8bf2-a63dee5f3621'}, ...]

file_paths

list[str]

required

A list of file paths to update.

document_ids

Optional[list[Union[UUID, str]]]

A list of document IDs corresponding to the files being updated. When not provided, R2R attempts to generate the correct document ID from the user ID and file path.

metadatas

Optional[list[dict]]

An optional list of metadata dictionaries for the updated files.
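The document IDs in the examples on this page are version-5 (name-based) UUIDs, which is consistent with deriving an ID deterministically from a user ID and file path. The sketch below shows the general idea only: the namespace and the "user_id:file_path" name format are assumptions, not R2R's actual scheme, so the resulting IDs will not match those R2R generates.

```python
import uuid


def derive_document_id(user_id, file_path):
    """Derive a deterministic version-5 UUID from a user ID and file path.

    The namespace and name format are assumptions made for this sketch.
    """
    return uuid.uuid5(uuid.NAMESPACE_DNS, f"{user_id}:{file_path}")


user_id = "2acb499e-8428-543b-bd85-0d9098718220"
first = derive_document_id(user_id, "aristotle_v2.txt")
second = derive_document_id(user_id, "aristotle_v2.txt")

# The same inputs always produce the same ID.
print(first == second, first.version)
```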

Documents Overview

Retrieve high-level information about documents. Results are restricted to the calling user's files unless the caller is a superuser, in which case documents from all users are returned:

documents_overview = client.documents_overview()

Response

response

list[dict]

A list of dictionaries containing document information.

[
  {
    'document_id': '9fbe403b-c11c-5aae-8ade-ef22980c3ad1',
    'version': 'v0',
    'collection_ids': [],
    'ingestion_status': 'success',
    'restructuring_status': 'pending',
    'user_id': '2acb499e-8428-543b-bd85-0d9098718220',
    'title': 'aristotle.txt',
    'created_at': '2024-07-21T20:09:14.218741Z',
    'updated_at': '2024-07-21T20:09:14.218741Z',
    'metadata': {'title': 'aristotle.txt', 'version': 'v0', 'x': 'y'}
  },
  ...
]

document_ids

Optional[list[Union[UUID, str]]]

An optional list of document IDs to filter the overview.
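The overview is a plain list of dictionaries, so it can be filtered client-side. A minimal sketch, assuming the response shape shown above; `documents_with_status` is a hypothetical helper and the sample entry is trimmed from the example response.

```python
def documents_with_status(overview, status):
    """Return the IDs of documents whose ingestion_status equals `status`."""
    return [
        doc["document_id"]
        for doc in overview
        if doc.get("ingestion_status") == status
    ]


# Trimmed copy of the example entry shown above.
sample_overview = [
    {
        "document_id": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1",
        "version": "v0",
        "ingestion_status": "success",
        "title": "aristotle.txt",
    }
]

succeeded = documents_with_status(sample_overview, "success")
print(succeeded)
```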

Document Chunks

Fetch chunks for a particular document:

document_id = "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"
chunks = client.document_chunks(document_id)

Response

response

list[dict]

A list of dictionaries containing chunk information.

[{
  'text': 'Aristotle[A] (Greek: Ἀριστοτέλης Aristotélēs, pronounced [aristotélɛːs]; 384–322 BC) was an Ancient Greek philosopher and polymath...',
  'user_id': '2acb499e-8428-543b-bd85-0d9098718220',
  'document_id': '9fbe403b-c11c-5aae-8ade-ef22980c3ad1',
  'extraction_id': 'aeba6400-1bd0-5ee9-8925-04732d675434',
  'fragment_id': 'f48bcdad-4155-52a4-8c9d-8ba06e996ba3',
  'metadata': {'title': 'aristotle.txt', 'version': 'v0', 'chunk_order': 0, 'document_type': 'txt', 'unstructured_filetype': 'text/plain', 'unstructured_languages': ['eng'], 'unstructured_parent_id': '971399f6ba2ec9768d2b5b92ab9d17d6', 'partitioned_by_unstructured': True}
},
...]

document_id

str

required

The ID of the document to retrieve chunks for.
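Each chunk carries a chunk_order value in its metadata, so the chunks can be sorted back into document order. This sketch assumes the response shape shown above; `reassemble` is a hypothetical helper and the sample chunks are made-up placeholders.

```python
def reassemble(chunks):
    """Join chunk texts in the order given by metadata['chunk_order']."""
    ordered = sorted(chunks, key=lambda chunk: chunk["metadata"]["chunk_order"])
    return "\n".join(chunk["text"] for chunk in ordered)


# Made-up placeholder chunks, deliberately out of order.
sample_chunks = [
    {"text": "second chunk", "metadata": {"chunk_order": 1}},
    {"text": "first chunk", "metadata": {"chunk_order": 0}},
]

document_text = reassemble(sample_chunks)
print(document_text)
```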

Delete Documents

Delete a document by its ID:

delete_response = client.delete(
  {
    "document_id":
      {"$eq": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"}
  }
)

Response

response

dict

The response from the R2R system after successfully deleting the documents.

{'results': {}}

filters

list[dict]

required

A list of logical filters over document fields that identifies the set of documents to delete (e.g., {"document_id": {"$eq": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"}}). Filters may reference fields such as "user_id" or "title" and use comparison operators such as neq and gte, written in the same form as "$eq" in the example.
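To make the filter semantics concrete, the following toy matcher evaluates a filter dictionary against a single document record in memory. It is an illustration only, not R2R's implementation; the operator names "$neq" and "$gte" are assumed by analogy with "$eq", and `matches` is a hypothetical function.

```python
import operator

# Operator names other than "$eq" are assumptions for this sketch.
OPERATORS = {
    "$eq": operator.eq,
    "$neq": operator.ne,
    "$gte": operator.ge,
}


def matches(document, filters):
    """Return True if the document satisfies every condition in `filters`.

    A document that lacks a filtered field never matches.
    """
    for field, condition in filters.items():
        if field not in document:
            return False
        for op_name, expected in condition.items():
            if not OPERATORS[op_name](document[field], expected):
                return False
    return True


document = {
    "document_id": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1",
    "title": "aristotle.txt",
}
delete_filter = {"document_id": {"$eq": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"}}

print(matches(document, delete_filter))  # True
```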
