Document Ingestion and Management
Ingest Files
Ingest files or directories into your R2R system:
const files = [
  { path: 'path/to/file1.txt', name: 'file1.txt' },
  { path: 'path/to/file2.txt', name: 'file2.txt' }
];
const metadatas = [{ key1: 'value1' }, { key2: 'value2' }];
const ingestResponse = await client.ingestFiles(files, {
  metadatas,
  user_ids: ['user-id-1', 'user-id-2'],
});
files
Array<string | File | { path: string; name: string }>
required
An array of file paths, File objects, or objects with path and name properties to ingest.
metadatas
An optional array of metadata objects corresponding to each file.
document_ids
An optional array of document IDs to assign to the ingested files.
user_ids
An optional array of user IDs associated with the ingested files.
ingestion_config
Optional[Union[dict, ChunkingConfig]]
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime; an illustrative override is sketched after the parameter list below.
provider
str
default: "unstructured_local"
Which chunking provider to use. Options are “r2r”, “unstructured_local”, or “unstructured_api”.
Which chunking method to apply. Options are “by_title”, “basic”, “recursive”, or “character”.
The average size of chunks, in tokens.
The default overlap between chunks.
max_chunk_size
Optional[int]
default: "None"
Sets a maximum size on output chunks.
Combine chunks smaller than this number of characters.
Maximum number of characters per chunk.
Whether to include coordinates in the output.
Encoding to use for text files.
Types of image blocks to extract.
gz_uncompressed_content_type
Content type for uncompressed gzip files.
Name of the high-resolution model to use.
include_orig_elements
Optional[bool]
default: "False"
Whether to include original elements in the output.
Whether to include page breaks in the output.
List of languages to consider for text processing.
Whether to allow sections to span multiple pages.
Start a new chunk after this many characters.
Languages to use for OCR.
output_format
str
default: "application/json"
Number of characters to overlap between chunks.
Whether to overlap all chunks.
pdf_infer_table_structure
Whether to infer table structure in PDFs.
Threshold for considering chunks similar.
Types of tables to skip inferring.
split_pdf_concurrency_level
Concurrency level for splitting PDFs.
Whether to split PDFs by page.
Page number to start processing from.
Strategy for processing. Options are “auto”, “fast”, or “hi_res”.
chunking_strategy
Optional[str]
default: "by_title"
Strategy for chunking. Options are “by_title” or “basic”.
Whether to generate unique IDs for elements.
Whether to keep XML tags in the output.
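For illustration, the sketch below reuses the files and metadatas arrays from the earlier snippet and passes an ingestion_config override built from a few of the options documented above (provider, chunking_strategy, max_chunk_size). The exact shape of the config object accepted by ingestFiles may differ between R2R versions, so verify the field names against your installed SDK.
// Sketch: overriding the chunking configuration at ingestion time.
// Field names mirror the parameters documented above; treat them as assumptions to verify.
const configuredIngestResponse = await client.ingestFiles(files, {
  metadatas,
  ingestion_config: {
    provider: 'unstructured_local', // 'r2r', 'unstructured_local', or 'unstructured_api'
    chunking_strategy: 'by_title',  // 'by_title' or 'basic'
    max_chunk_size: 1024,           // cap on the size of output chunks
  },
});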
Update Files
Update existing documents:
const files = [
  { path: '/path/to/updated_file1.txt', name: 'updated_file1.txt' }
];
const document_ids = ['document-id-1'];
const updateResponse = await client.updateFiles(files, {
  document_ids,
  metadatas: [{ key: 'updated_value' }] // to overwrite the existing metadata
});
files
Array<File | { path: string; name: string }>
required
An array of File objects or objects with path and name properties to update.
An array of document IDs corresponding to the files being updated.
metadatas
Array<Record<string, any>>
An optional array of metadata objects for the updated files.
The ingestion config override parameter enables developers to customize their R2R chunking strategy at runtime; an illustrative override for updateFiles is sketched after the parameter list below.
Which chunking provider to use, r2r or unstructured. Selecting unstructured is generally recommended when parsing with unstructured_local or unstructured_api.
Which chunking method to apply. When using unstructured, by_title or basic are supported.
The average size of chunks, in tokens.
The default overlap between chunks.
max_chunk_size
Optional[int]
default: "None"
Sets a maximum size on output chunks.
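Under the same caveat that the accepted config shape may vary across versions, a chunking override can be sketched for updateFiles as well, reusing the files and document_ids arrays from the update snippet above:
// Sketch: re-ingesting updated files with an overridden chunking strategy.
const reconfiguredUpdateResponse = await client.updateFiles(files, {
  document_ids,
  ingestion_config: {
    provider: 'unstructured_local',
    chunking_strategy: 'basic',
    max_chunk_size: 512,
  },
});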
Documents Overview
Retrieve high-level document information. Results are restricted to the user's own files, except when called by a superuser, in which case results across all users are returned:
const documentsOverview = await client.documentsOverview();
An optional array of document IDs to filter the overview.
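To restrict the overview to a subset of documents, the optional array of document IDs described above can be supplied. The sketch below assumes documentsOverview accepts that array as its first argument; check your SDK version for the exact signature.
// Sketch: limiting the overview to specific documents by ID.
const filteredOverview = await client.documentsOverview([
  '9fbe403b-c11c-5aae-8ade-ef22980c3ad1',
]);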
Document Chunks
Fetch chunks for a particular document:
const documentId = '9fbe403b-c11c-5aae-8ade-ef22980c3ad1';
const chunks = await client.documentChunks(documentId);
The ID of the document to retrieve chunks for.
Delete Documents
Delete a document by its ID:
const deleteResponse = await client.delete({ document_id: "91662726-7271-51a5-a0ae-34818509e1fd" });
filters
{ [key: string]: string | string[] }
required
A list of logical filters to perform over input document fields, identifying the unique set of documents to delete (e.g., {"document_id": {"$eq": "9fbe403b-c11c-5aae-8ade-ef22980c3ad1"}}). Logical operations might include variables such as "user_id" or "title" and filters like neq, gte, etc.
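As a sketch of the operator-style filters described above (assuming your deployment accepts the $eq operator shown in the example filter), a delete can also target documents by a field such as user_id:
// Sketch: deleting every document associated with a given user via an $eq filter.
const deleteByUserResponse = await client.delete({
  user_id: { $eq: 'user-id-1' },
});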