Documents & Ingestion
Documents are the core unit of Remem—text content that gets encrypted, chunked, embedded, classified, and made searchable.
Overview
When you ingest a document into Remem, the following happens automatically:
- Content is encrypted with your tenant’s Data Encryption Key (DEK)
- Text is split into semantic chunks (~1200 char targets, adaptive v2 chunking)
- Each chunk is embedded using voyage-3.5-lite (1024-dim vectors)
- Content is classified by Grok 4 Fast (category, tags, sensitivity, extracted data)
- Vectors are indexed in your isolated Qdrant collection
- BM25 search vectors are computed for hybrid search
- Large PDFs and Markdown files can be indexed into a PageIndex tree for long-document retrieval (optional)
- Long documents (>= 20k characters) are chunked more aggressively and stored as both a summary chunk and full chunks so they rank properly in normal search
For long documents (>= 20k chars), PageIndex adds section-level context in rich queries but does not replace normal chunk search.
Ingest a Document
POST /v1/documents/ingest
You can ingest documents in two ways: JSON body (for text content) or multipart form (for file uploads).
Method 1: JSON Body (Text)
Use this method when you have text content ready to ingest directly.
Request Fields
| Field | Type | Required | Description |
|---|---|---|---|
| content | string | Yes | Raw text content to ingest |
| title | string | No | Document title (auto-generated if omitted) |
| source | string | No | Ingestion source (enum): api, quick_capture, folder_sync, gmail. Default: api |
| source_id | string | No | External ID for deduplication. If a document with the same source_id exists, creates a new version |
| source_path | string | No | Original file path or URI (stored encrypted; returned on GET /v1/documents/{id}) |
| mime_type | string | No | MIME type hint (e.g., text/plain, text/markdown) |
| metadata | object | No | Arbitrary key-value pairs, stored encrypted |
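A minimal JSON-body request can be sketched in Python. The endpoint and field names come from the table above; the helper name and example values are illustrative:

```python
import json

def build_ingest_payload(content, title=None, source="api", source_id=None, metadata=None):
    """Build the JSON body for POST /v1/documents/ingest.

    Only content is required; the other fields are optional per the table above.
    """
    if not content:
        raise ValueError("content is required")
    payload = {"content": content, "source": source}
    if title is not None:
        payload["title"] = title
    if source_id is not None:
        payload["source_id"] = source_id
    if metadata is not None:
        payload["metadata"] = metadata
    return payload

# Serialize and send as the request body with Content-Type: application/json.
body = json.dumps(build_ingest_payload(
    "Invoice from Acme Corp, $1,200 due March 15.",
    title="Acme invoice",
    metadata={"project": "ops"},
))
```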
source is the ingestion channel, not the file format. File formats are inferred by the classifier and show up
as source_type (e.g., pdf, markdown, image). If you need to preserve a local file path, set source to
folder_sync and pass source_path. Invalid source values return a 422.
Method 2: Multipart Form (File Upload)
Use this method to upload files directly (PDFs, images, text files, etc.).
Supported File Types
| Category | Extensions | Notes |
|---|---|---|
| Text | .txt, .md, .markdown, .vtt, .srt, .log, .rtf | Processed directly (VTT/SRT transcripts supported) |
| PDF | .pdf | Rendered to images, text extracted |
| Images | .jpg, .jpeg, .png, .heic, .webp, .gif | Vision-based classification |
| Code | .py, .js, .ts, .swift, .go, .rs, .java, .cpp, .c, .h | Text extraction |
| Data | .csv, .tsv, .json, .yaml, .yml, .toml, .xml | Structured data extraction |
| Email | .eml, .msg | Email parsing |
| Web | .html, .htm | HTML parsing |
Max content size: Practical limit is ~50KB of text for JSON body ingestion. For file uploads, the limit is 10MB.
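For uploads, most HTTP clients build the multipart body for you. Here is a sketch that prepares a requests-style files mapping and enforces the 10MB limit; the form field name "file" is an assumption, not confirmed above:

```python
import mimetypes

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10MB file-upload limit from the docs

def prepare_upload(filename, data):
    """Prepare a requests-style multipart `files` mapping for the upload.

    The form field name "file" is an assumption; check your API reference.
    """
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 10MB upload limit")
    mime, _ = mimetypes.guess_type(filename)
    return {"file": (filename, data, mime or "application/octet-stream")}
```

With the `requests` library this could then be sent as `requests.post(url, headers=auth_headers, files=prepare_upload("report.pdf", data))`.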
Classifier
source_type values are: pdf, image, text, code, markdown, document, spreadsheet, email,
web, unknown. These are inferred by Remem (file extension + classifier) and are not user-supplied.
Async Processing Pipeline
After ingestion, the document goes through a multi-stage processing pipeline. Here’s what happens:
Content Encryption
Text content is encrypted with your tenant’s Data Encryption Key (DEK) using AES-256-GCM.
Text Chunking
Content is split into semantic chunks using adaptive v2 chunking (~1200 char targets, 1800 max, 15% overlap).
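Remem’s adaptive v2 chunker isn’t documented in detail; a simplified sketch of target/max/overlap chunking using the numbers above (1200 target, 1800 max, 15% overlap):

```python
def chunk_text(text, target=1200, max_len=1800, overlap=0.15):
    """Greedy character chunker: cut near `target`, never past `max_len`,
    preferring a whitespace boundary, with ~15% overlap between chunks.
    Illustrative sketch only; not Remem's actual chunking algorithm."""
    chunks = []
    step_overlap = int(target * overlap)
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # Prefer to break at whitespace between target and max length.
            cut = text.rfind(" ", start + target, end)
            if cut != -1:
                end = cut
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Back up so consecutive chunks share some context.
        start = max(end - step_overlap, start + 1)
    return chunks
```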
Document Classification
Grok 4 Fast analyzes the document and extracts:
- Category (e.g., “invoice”, “manual”, “screenshot”)
- Tags (key:value format like vendor:amazon, topic:finance)
- Sensitivity level (public, internal, confidential, personal)
- Language detection
- Summary for search
- Extracted structured data (amounts, dates, names, etc.)
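Tags come back as key:value strings (e.g., vendor:amazon). A small client-side helper to parse them into a dict; grouping multiple values per key into lists is an assumption about how you may want to consume them:

```python
def parse_tags(tags):
    """Parse "key:value" tag strings into a dict of key -> list of values.

    Keeping tags without a colon under a "misc" key is an assumption.
    """
    parsed = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if not sep:
            key, value = "misc", tag
        parsed.setdefault(key, []).append(value)
    return parsed
```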
PageIndex for Long Documents
PageIndex is an optional long-document indexer that complements standard vector + BM25 search. It is designed to improve retrieval quality for very large PDFs and Markdown files by building a hierarchical map of sections. How it works:
- Runs asynchronously in a separate pageindex-worker
- Only applies to PDF and Markdown inputs above a size threshold (default: 20,000 characters)
- Generates a tree of section nodes with short summaries
- Stores node summaries encrypted in PostgreSQL (same DEK model as documents)
- Does not replace standard chunking or vector search — it augments rich-mode results
- PageIndex uses an external LLM (OpenAI-compatible API; Grok is the default in production).
- If PageIndex is disabled or the worker is offline, ingestion still completes normally and all other search paths continue to work.
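Conceptually, the PageIndex tree is a hierarchy of section nodes with short summaries. A minimal sketch of such a structure; the node fields are illustrative, not Remem’s schema:

```python
from dataclasses import dataclass, field

@dataclass
class SectionNode:
    """One node in a PageIndex-style section tree (illustrative shape)."""
    title: str
    summary: str
    children: list = field(default_factory=list)

    def walk(self):
        """Yield (depth, node) pairs in document order."""
        stack = [(0, self)]
        while stack:
            depth, node = stack.pop()
            yield depth, node
            for child in reversed(node.children):
                stack.append((depth + 1, child))

doc = SectionNode("Handbook", "Top-level overview", [
    SectionNode("Setup", "How to install and configure"),
    SectionNode("Usage", "Day-to-day operations", [
        SectionNode("Search", "Query syntax and modes"),
    ]),
])
titles = [node.title for _, node in doc.walk()]
```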
Retrieve a Document
GET /v1/documents/{document_id}
Fetch a document with decrypted content, metadata, and classification results.
The API respects your API key’s sensitivity scope. If your key has internal access, you cannot retrieve confidential or personal documents.
Update a Document
POST /v1/documents/{document_id}/update
Creates a new version of the document. Does not overwrite the original.
Versioning: Each update creates a new document version. The previous version is marked as superseded but remains in the database for audit purposes. Old chunks and vectors are queued for deletion after the new version is indexed.
Delete a Document
DELETE /v1/documents/{document_id}
Soft delete—marks the document as deleted and excludes it from search results immediately.
Hard Deletion
Hard deletion removes the document from all stores:
- PostgreSQL (document record, chunks, metadata)
- Qdrant (vector embeddings)
- DigitalOcean Spaces (raw file)
- Redis (cached data)
To remove a document from all stores, pass the hard_delete flag on the DELETE request.
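The docs don’t specify how the hard_delete flag is passed; assuming it is a query parameter, the request URL could be built like this (the base URL is hypothetical):

```python
from urllib.parse import urlencode

def delete_url(base, document_id, hard_delete=False):
    """Build the DELETE /v1/documents/{document_id} URL.

    Passing hard_delete as a query parameter is an assumption.
    """
    url = f"{base}/v1/documents/{document_id}"
    if hard_delete:
        url += "?" + urlencode({"hard_delete": "true"})
    return url
```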
Classification Fields
When a document is processed, Grok 4 Fast extracts the following metadata fields:
category
What type of document is this? A free-form string determined by the LLM based on content, not file extension.
Examples: "invoice", "manual", "receipt", "screenshot", "meeting_notes", "property-listing", "guidebook"
Categories are not enums. The LLM decides what the document is based on its actual content.
tags
Key:value tag strings extracted by the classifier (e.g., vendor:amazon, topic:finance).
sensitivity
Access control level. One of four levels:
- public — Can be shared freely
- internal — Within organization only
- confidential — Limited access (financial, legal)
- personal — Private to user (default)
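API-key sensitivity scoping (see Retrieve a Document) can be modeled client-side as a set of readable levels per scope. The internal row follows from the note that an internal key cannot read confidential or personal documents; the other rows are assumptions by analogy:

```python
# Readable sensitivity levels per key scope. The "internal" row is derived
# from the docs above; the "confidential" and "personal" rows are assumptions.
READABLE = {
    "public": {"public"},
    "internal": {"public", "internal"},
    "confidential": {"public", "internal", "confidential"},
    "personal": {"public", "internal", "confidential", "personal"},
}

def can_read(key_scope, doc_sensitivity):
    """Client-side pre-check mirroring the server's scoping (illustrative)."""
    return doc_sensitivity in READABLE[key_scope]
```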
language
Detected language. ISO 639-1 language code (e.g., "en", "es", "fr", "de"). Used for language-specific search and filtering.
summary
1-2 sentence description. A concise summary of the document content (max 500 chars) optimized for search.
Example: "Q1 planning meeting discussing priorities for EU expansion, mobile app launch by March, and hiring two engineers."
search_text
Detailed indexable description. A longer, more detailed description extracted by the LLM for full-text search. Includes key entities, topics, and context that might not be in the summary.
extracted
Structured key-value data. Free-form JSON object containing structured data extracted from the document. Only present when the LLM detects extractable fields.
Examples include invoices, property listings, and meeting notes.
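A hypothetical extracted object for an invoice; the field names are illustrative, since the LLM decides which fields to extract per document:

```python
# Illustrative only; not a guaranteed schema.
invoice_extracted = {
    "vendor": "Acme Corp",
    "invoice_number": "INV-2024-0042",
    "amount": 1200.00,
    "currency": "USD",
    "due_date": "2024-03-15",
}
```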
confidence
Classification confidence score. Float between 0.0 and 1.0 indicating how confident the LLM is in its classification.
- >= 0.7 — High confidence, classification is reliable
- < 0.7 — Low confidence, may require manual review
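The 0.7 threshold can drive a simple client-side review queue, for example:

```python
HIGH_CONFIDENCE = 0.7  # threshold from the docs above

def needs_review(classification):
    """Flag classifications below the reliable-confidence threshold."""
    return classification["confidence"] < HIGH_CONFIDENCE

docs = [
    {"id": "a", "confidence": 0.92},
    {"id": "b", "confidence": 0.41},
]
review_queue = [d["id"] for d in docs if needs_review(d)]
```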
Common Pitfalls & Tips
Max content for JSON body: ~50KB of text. For larger content, use multipart file upload instead.
Idempotent Ingestion
Use the Idempotency-Key header to ensure duplicate requests don’t create multiple documents.
When a request is retried with the same Idempotency-Key, Remem returns the cached response from the first request instead of creating a duplicate.
Idempotency keys are stored in Redis for 24 hours and are tenant-scoped. After 24 hours, the key expires and a new request would be treated as fresh.
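Any unique string works as an Idempotency-Key; deriving it from the content makes retries deduplicate naturally. A sketch (the key format is a client-side convention, not a Remem requirement):

```python
import hashlib

def idempotency_key(content, source_id=""):
    """Derive a stable Idempotency-Key from the request content.

    Hash-derived keys are a client-side convention, not required by Remem.
    """
    digest = hashlib.sha256(f"{source_id}:{content}".encode("utf-8")).hexdigest()
    return f"ingest-{digest[:32]}"
```

Send the result in the Idempotency-Key header; a retried request with identical content then reuses the same key automatically.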
Raw File Storage & TTL
Remem stores raw files in DigitalOcean Spaces with a hybrid retention policy:
| Document Type | Retention Policy |
|---|---|
| High-value (contracts, invoices, legal, tax, medical) | Kept forever |
| User-starred | Kept forever |
| Everything else | Kept for 90 days, then deleted |
Regardless of type, text content, embeddings, summaries, and extracted data are always kept.
Even after raw files are deleted, searchable content remains. You can still query the document using semantic search—you just can’t retrieve the original PDF or image.
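The retention table can be read as a decision function. An illustrative sketch; the exact high-value category list is Remem’s, and matching on classifier categories is an assumption:

```python
# High-value document categories kept forever (list is illustrative).
HIGH_VALUE_CATEGORIES = {"contract", "invoice", "legal", "tax", "medical"}

def raw_file_retention_days(category, starred=False):
    """Return how long the raw file is kept; None means kept forever.

    Text content, embeddings, summaries, and extracted data are always kept.
    """
    if starred or category in HIGH_VALUE_CATEGORIES:
        return None  # kept forever
    return 90  # everything else: 90 days
```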
Next Steps
- Learn how to query your documents using fast and rich modes
- Set up API key sensitivity scoping for fine-grained access control
- Integrate with Claude Desktop via MCP