
Documents & Ingestion

Documents are the core unit of Remem—text content that gets encrypted, chunked, embedded, classified, and made searchable.

Overview

When you ingest a document into Remem, the following happens automatically:
  • Content is encrypted with your tenant’s Data Encryption Key (DEK)
  • Text is split into semantic chunks (~1200 char targets, adaptive v2 chunking)
  • Each chunk is embedded using voyage-3.5-lite (1024-dim vectors)
  • Content is classified by Grok 4 Fast (category, tags, sensitivity, extracted data)
  • Vectors are indexed in your isolated Qdrant collection
  • BM25 search vectors are computed for hybrid search
  • Large PDFs and Markdown files can be indexed into a PageIndex tree for long-document retrieval (optional)
  • Long documents (>= 20k characters) are chunked more aggressively and stored as a summary chunk plus full chunks so they rank properly in normal search
Ingestion is asynchronous. The API queues the document and returns a job_id immediately. The document is not immediately searchable—expect 5-15 seconds for typical text documents.
PageIndex adds section-level context in rich-mode queries but does not replace normal chunk search.

Ingest a Document

POST /v1/documents/ingest

You can ingest documents in two ways: JSON body (for text content) or multipart form (for file uploads).

Method 1: JSON Body (Text)

Use this method when you have text content ready to ingest directly.
curl -X POST https://api.remem.io/v1/documents/ingest \
  -H "Content-Type: application/json" \
  -H "X-API-Key: vlt_..." \
  -d '{
    "title": "Meeting Notes - Q1 Planning",
    "content": "We decided to focus on three priorities...",
    "source": "api",
    "source_id": "meeting-2026-02-04",
    "source_path": "/home/user/Notes/meetings/2026-02-04.md",
    "mime_type": "text/plain",
    "metadata": {"type": "meeting_notes", "date": "2026-02-04"}
  }'
Response:
{
  "job_id": "1770218554334-0",
  "message": "Document queued for ingestion"
}

Request Fields

Field | Type | Required | Description
content | string | Yes | Raw text content to ingest
title | string | No | Document title (auto-generated if omitted)
source | string | No | Ingestion source (enum): api, quick_capture, folder_sync, gmail. Default: api
source_id | string | No | External ID for deduplication. If a document with the same source_id exists, creates a new version
source_path | string | No | Original file path or URI (stored encrypted; returned on GET /v1/documents/{id})
mime_type | string | No | MIME type hint (e.g., text/plain, text/markdown)
metadata | object | No | Arbitrary key-value pairs, stored encrypted
source is the ingestion channel, not the file format. File formats are inferred by the classifier and show up as source_type (e.g., pdf, markdown, image). If you need to preserve a local file path, set source to folder_sync and pass source_path. Invalid source values return a 422.
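The request body can be assembled client-side before sending. Below is a minimal sketch of a payload builder that enforces the source enum and omits unset optional fields; the function name and structure are illustrative, not part of any official SDK.

```python
import json

# Allowed ingestion channels per the field table; invalid values return a 422.
ALLOWED_SOURCES = {"api", "quick_capture", "folder_sync", "gmail"}

def build_ingest_payload(content, title=None, source="api", source_id=None,
                         source_path=None, mime_type=None, metadata=None):
    """Build the JSON body for POST /v1/documents/ingest.

    Only `content` is required; optional fields are included only when set,
    keeping the request body minimal.
    """
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"invalid source {source!r}; the server would return 422")
    payload = {"content": content, "source": source}
    for key, value in [("title", title), ("source_id", source_id),
                       ("source_path", source_path), ("mime_type", mime_type),
                       ("metadata", metadata)]:
        if value is not None:
            payload[key] = value
    return payload

body = build_ingest_payload(
    content="We decided to focus on three priorities...",
    title="Meeting Notes - Q1 Planning",
    source_id="meeting-2026-02-04",
)
print(json.dumps(body, indent=2))
```

Pair this with any HTTP client; the server only sees the resulting JSON.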

Method 2: Multipart Form (File Upload)

Use this method to upload files directly (PDFs, images, text files, etc.).
curl -X POST https://api.remem.io/v1/documents/ingest \
  -H "X-API-Key: vlt_..." \
  -F "file=@report.pdf" \
  -F "title=Q1 Report" \
  -F "source=api" \
  -F "source_path=/home/user/Reports/Q1/report.pdf" \
  -F 'metadata={"department": "engineering"}'
Response:
{
  "job_id": "1770218554335-0",
  "message": "Document queued for ingestion"
}

Supported File Types

Category | Extensions | Notes
Text | .txt, .md, .markdown, .vtt, .srt, .log, .rtf | Processed directly (VTT/SRT transcripts supported)
PDF | .pdf | Rendered to images, text extracted
Images | .jpg, .jpeg, .png, .heic, .webp, .gif | Vision-based classification
Code | .py, .js, .ts, .swift, .go, .rs, .java, .cpp, .c, .h | Text extraction
Data | .csv, .tsv, .json, .yaml, .yml, .toml, .xml | Structured data extraction
Email | .eml, .msg | Email parsing
Web | .html, .htm | HTML parsing
Max content size: Practical limit is ~50KB of text for JSON body ingestion. For file uploads, the limit is 10MB.
Classifier source_type values are: pdf, image, text, code, markdown, document, spreadsheet, email, web, unknown. These are inferred by Remem (file extension + classifier) and are not user-supplied.
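As a rough illustration of how extensions relate to source_type values, here is a simplified lookup. The real pipeline combines extension hints with LLM classification, so treat this mapping as an approximation rather than the actual implementation.

```python
import os

# Illustrative extension-to-source_type mapping (a subset of the supported
# file types table). The production classifier also inspects content, so
# actual results may differ; ambiguous formats are omitted here.
EXT_TO_SOURCE_TYPE = {
    ".pdf": "pdf",
    ".md": "markdown", ".markdown": "markdown",
    ".txt": "text", ".log": "text", ".vtt": "text", ".srt": "text",
    ".py": "code", ".js": "code", ".ts": "code", ".go": "code", ".rs": "code",
    ".csv": "spreadsheet", ".tsv": "spreadsheet",
    ".eml": "email", ".msg": "email",
    ".html": "web", ".htm": "web",
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".heic": "image",
}

def guess_source_type(filename: str) -> str:
    """Best-effort source_type guess from a filename's extension."""
    _, ext = os.path.splitext(filename.lower())
    return EXT_TO_SOURCE_TYPE.get(ext, "unknown")
```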

Async Processing Pipeline

After ingestion, the document goes through a multi-stage processing pipeline. Here’s what happens:
1. Document Queued
Your document is added to the Redis Streams job queue.

2. Worker Picks Up Job
A background worker retrieves the job for processing.

3. Content Encrypted
Text content is encrypted with your tenant’s Data Encryption Key (DEK) using AES-256-GCM.

4. Text Chunking
Content is split into semantic chunks using adaptive v2 chunking (~1200 char targets, 1800 max, 15% overlap).

5. Embedding Generation
Each chunk is embedded using voyage-3.5-lite (1024-dimensional vectors).

6. Document Classification
Grok 4 Fast analyzes the document and extracts:
  • Category (e.g., “invoice”, “manual”, “screenshot”)
  • Tags (key:value format like vendor:amazon, topic:finance)
  • Sensitivity level (public, internal, confidential, personal)
  • Language detection
  • Summary for search
  • Extracted structured data (amounts, dates, names, etc.)

7. Vector Indexing
Embeddings are indexed in your isolated Qdrant collection for semantic search.

8. BM25 Indexing
BM25 search vectors are computed for keyword-based hybrid search.

9. PageIndex Tree (Optional)
For long PDFs and Markdown files (default: ≥20k characters), a separate worker builds a hierarchical PageIndex tree of sections and summaries. Node summaries are stored encrypted in PostgreSQL and later used in rich-mode retrieval.
Documents are not immediately searchable. For typical text documents, expect 5-15 seconds of processing time. Large PDFs or image-heavy documents may take longer.
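The chunking step's size arithmetic can be sketched as follows. Note this is a plain sliding-window approximation: the production chunker is semantic (it prefers sentence and paragraph boundaries), so only the target/max/overlap numbers here come from the docs.

```python
def chunk_text(text: str, target: int = 1200, max_len: int = 1800,
               overlap_frac: float = 0.15) -> list[str]:
    """Sliding-window chunker showing the ~1200-char target with 15% overlap.

    The real adaptive v2 chunker respects semantic boundaries; this sketch
    only demonstrates the size and overlap arithmetic.
    """
    if len(text) <= max_len:
        return [text]
    step = int(target * (1 - overlap_frac))  # advance ~85% of the target size
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + target])
        start += step
    return chunks
```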

PageIndex for Long Documents

PageIndex is an optional long-document indexer that complements standard vector + BM25 search. It is designed to improve retrieval quality for very large PDFs and Markdown files by building a hierarchical map of sections. How it works:
  • Runs asynchronously in a separate pageindex-worker
  • Only applies to PDF and Markdown inputs above a size threshold (default: 20,000 characters)
  • Generates a tree of section nodes with short summaries
  • Stores node summaries encrypted in PostgreSQL (same DEK model as documents)
  • Does not replace standard chunking or vector search — it augments rich-mode results
Notes:
  • PageIndex uses an external LLM (OpenAI-compatible API; Grok is the default in production).
  • If PageIndex is disabled or the worker is offline, ingestion still completes normally and all other search paths continue to work.
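The eligibility rule above reduces to a small predicate. This sketch encodes the documented defaults (PDF/Markdown only, ≥20,000 characters); the constant names are illustrative.

```python
PAGEINDEX_MIN_CHARS = 20_000            # default size threshold from the docs
PAGEINDEX_TYPES = {"pdf", "markdown"}   # only these inputs get a tree

def pageindex_eligible(source_type: str, char_count: int) -> bool:
    """True when a document qualifies for PageIndex tree building.

    Everything else relies on standard chunk + BM25 search alone.
    """
    return source_type in PAGEINDEX_TYPES and char_count >= PAGEINDEX_MIN_CHARS
```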

Retrieve a Document

GET /v1/documents/{document_id}

Fetch a document with decrypted content, metadata, and classification results.
curl https://api.remem.io/v1/documents/{document_id} \
  -H "X-API-Key: vlt_..."
Response:
{
  "id": "a1b2c3d4-...",
  "tenant_id": "...",
  "title": "Meeting Notes - Q1 Planning",
  "content": "We decided to focus on three priorities...",
  "source": "api",
  "status": "completed",
  "chunk_count": 3,
  "created_at": "2026-02-04T10:30:00Z",
  "classification": {
    "category": "meeting_notes",
    "tags": ["project:q1-planning", "topic:strategy"],
    "sensitivity": "internal",
    "language": "en",
    "summary": "Q1 planning meeting discussing priorities for EU expansion and mobile app launch",
    "confidence": 0.92,
    "extracted": {
      "date": "2026-01-15",
      "attendees": ["Alice", "Bob"],
      "priorities": ["EU expansion", "mobile app", "hiring"]
    }
  }
}
The API respects your API key’s sensitivity scope. If your key has internal access, you cannot retrieve confidential or personal documents.
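The scope rule can be modeled as a cutoff over an ordered list of levels. This is a sketch only: the linear ordering below (public < internal < confidential < personal) is an assumption consistent with the docs' example, not a confirmed server implementation detail.

```python
# Assumed ordering from least to most restricted; the docs confirm that an
# `internal` key cannot read confidential or personal documents.
SENSITIVITY_ORDER = ["public", "internal", "confidential", "personal"]

def allowed_levels(key_scope: str) -> set[str]:
    """Sensitivity levels an API key may read, given its maximum scope."""
    cutoff = SENSITIVITY_ORDER.index(key_scope)
    return set(SENSITIVITY_ORDER[:cutoff + 1])
```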

Update a Document

POST /v1/documents/{document_id}/update

Creates a new version of the document. Does not overwrite the original.
curl -X POST https://api.remem.io/v1/documents/{document_id}/update \
  -H "Content-Type: application/json" \
  -H "X-API-Key: vlt_..." \
  -d '{
    "content": "Updated meeting notes...",
    "title": "Meeting Notes - Q1 Planning (Revised)"
  }'
Response:
{
  "id": "a1b2c3d4-...",
  "version": 2,
  "is_new_version": true,
  "message": "New version created"
}
Versioning: Each update creates a new document version. The previous version is marked as superseded but remains in the database for audit purposes. Old chunks and vectors are queued for deletion after the new version is indexed.
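The version-on-update semantics can be sketched with a tiny in-memory model: an update never overwrites, it appends a new version and marks the previous one superseded. This is illustrative only and says nothing about the actual storage schema.

```python
def update_document(versions: list[dict], new_content: str) -> dict:
    """Append a new version, marking the previous one superseded.

    Mirrors the documented behavior: old versions remain for audit; their
    chunks and vectors are cleaned up after the new version is indexed.
    """
    if versions:
        versions[-1]["superseded"] = True
    new_version = {
        "version": len(versions) + 1,
        "content": new_content,
        "superseded": False,
    }
    versions.append(new_version)
    return new_version

history: list[dict] = []
update_document(history, "original notes")
latest = update_document(history, "revised notes")
```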

Delete a Document

DELETE /v1/documents/{document_id}

Soft delete: marks the document as deleted and excludes it from search results immediately.
curl -X DELETE https://api.remem.io/v1/documents/{document_id} \
  -H "X-API-Key: vlt_..."
Response:
{
  "message": "Document marked for deletion",
  "deleted_at": "2026-02-04T11:00:00Z"
}
Soft delete: The document is hidden from queries immediately but remains in the database. Hard deletion (complete removal from PostgreSQL, Qdrant, and Spaces) happens asynchronously within 30 days.

Hard Deletion

Hard deletion removes the document from all stores:
  • PostgreSQL (document record, chunks, metadata)
  • Qdrant (vector embeddings)
  • DigitalOcean Spaces (raw file)
  • Redis (cached data)
Hard delete is triggered automatically 30 days after soft delete, or you can request immediate hard delete using the hard_delete flag:
curl -X DELETE "https://api.remem.io/v1/documents/{document_id}?hard_delete=true" \
  -H "X-API-Key: vlt_..."

Classification Fields

When a document is processed, Grok 4 Fast extracts the following metadata fields:
category
What type of document is this? A free-form string determined by the LLM based on content, not file extension. Examples: "invoice", "manual", "receipt", "screenshot", "meeting_notes", "property-listing", "guidebook".
Categories are not enums. The LLM decides what the document is based on its actual content.
tags
Key-value tags for filtering and organization. Semi-structured tags in key:value format. Examples:
  • vendor:amazon
  • topic:real-estate
  • project:q4-2025
  • department:engineering
Tags are used for filtering in queries and organizing documents into collections.
sensitivity
Access control level. One of four levels:
  • public — Can be shared freely
  • internal — Within organization only
  • confidential — Limited access (financial, legal)
  • personal — Private to user (default)
API keys are scoped to a maximum sensitivity level. Queries automatically filter results based on the key’s access level.
language
Detected language as an ISO 639-1 code (e.g., "en", "es", "fr", "de"). Used for language-specific search and filtering.
summary
A concise 1-2 sentence description of the document content (max 500 chars), optimized for search. Example: "Q1 planning meeting discussing priorities for EU expansion, mobile app launch by March, and hiring two engineers."
Detailed indexable description
A longer, more detailed description extracted by the LLM for full-text search. It includes key entities, topics, and context that might not be in the summary.
extracted
Structured key-value data: a free-form JSON object containing data extracted from the document. Only present when the LLM detects extractable fields. Examples:

Invoice:
{
  "vendor": "Amazon",
  "amount": 49.99,
  "date": "2026-01-10",
  "currency": "USD"
}
Property Listing:
{
  "address": "123 Main St",
  "price": 450000,
  "beds": 3,
  "sqft": 1800
}
Meeting Notes:
{
  "date": "2026-01-10",
  "attendees": ["Alice", "Bob"],
  "topic": "Q1 Planning"
}
Extracted data is stored encrypted in PostgreSQL and can be queried using structured filters.
confidence
Classification confidence score: a float between 0.0 and 1.0 indicating how confident the LLM is in its classification.
  • >= 0.7 — High confidence, classification is reliable
  • < 0.7 — Low confidence, may require manual review
If the primary model (Grok 4 Fast) returns confidence below 0.7, the system automatically falls back to Claude Haiku 4.5.
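The fallback logic amounts to one threshold check. In this sketch, `primary` and `fallback` are stand-in callables for the Grok 4 Fast and Claude Haiku 4.5 calls; the function name and shape are illustrative.

```python
CONFIDENCE_THRESHOLD = 0.7  # below this, the result is considered unreliable

def classify_with_fallback(document: str, primary, fallback) -> dict:
    """Run the primary classifier; retry with the fallback on low confidence.

    Each callable takes the document text and returns a classification dict
    containing a `confidence` key.
    """
    result = primary(document)
    if result.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        result = fallback(document)
    return result
```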

Common Pitfalls & Tips

Ingestion is async. Don’t expect documents to be immediately searchable. Poll the query API or wait 5-15 seconds before testing search.
Large documents are chunked automatically. You don’t need to split them yourself. Send the full content—the system handles semantic chunking.
Duplicate detection uses source_id. Always set it for idempotent ingestion. If you retry the same source_id, the system creates a new version instead of a duplicate.
Metadata is encrypted at rest. You can store sensitive key-value pairs (customer IDs, account numbers, etc.) safely. They’re encrypted in PostgreSQL and only decrypted during queries.
File uploads have a 10MB limit. For larger files, consider splitting them or using a file transfer service.
Supported file types. Remem supports PDF, images, text, markdown, spreadsheets, code, and more. Check the supported file types table above.
Max content for JSON body: ~50KB of text. For larger content, use multipart file upload instead.
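A polling helper for the "ingestion is async" pitfall might look like the sketch below. The status callback is injected so the loop stays testable; in practice you would pass a function that calls GET /v1/documents/{id} and returns its `status` field.

```python
import time

def wait_until_searchable(get_status, timeout: float = 30.0,
                          interval: float = 2.0) -> bool:
    """Poll a status callback until the document reports `completed`.

    `get_status` stands in for a GET /v1/documents/{document_id} request;
    returns False if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "completed":
            return True
        time.sleep(interval)
    return False
```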

Idempotent Ingestion

Use the Idempotency-Key header to ensure duplicate requests don’t create multiple documents:
curl -X POST https://api.remem.io/v1/documents/ingest \
  -H "Content-Type: application/json" \
  -H "X-API-Key: vlt_..." \
  -H "Idempotency-Key: upload-20260204-meeting-notes" \
  -d '{
    "title": "Meeting Notes",
    "content": "..."
  }'
If you retry this request with the same Idempotency-Key, Remem returns the cached response from the first request instead of creating a duplicate.
Idempotency keys are stored in Redis for 24 hours and are tenant-scoped. After 24 hours, the key expires and a new request would be treated as fresh.
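One practical way to pick a key is to derive it deterministically from the payload, so retries of the same request reuse the same key while a changed payload gets a fresh one. This scheme is a suggestion, not a requirement; any stable string works.

```python
import hashlib

def idempotency_key(source_id: str, content: str) -> str:
    """Derive a stable Idempotency-Key from the payload.

    Hashing the content ties the key to this exact document, so a retry
    reuses the cached response but an edited payload is treated as new.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"{source_id}-{digest}"
```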

Raw File Storage & TTL

Remem stores raw files in DigitalOcean Spaces with a hybrid retention policy:
Document Type | Retention Policy
High-value (contracts, invoices, legal, tax, medical) | Kept forever
User-starred | Kept forever
Everything else | Kept for 90 days, then deleted

Text content, embeddings, summaries, and extracted data are always kept, regardless of raw-file retention.
Even after raw files are deleted, searchable content remains. You can still query the document using semantic search—you just can’t retrieve the original PDF or image.
To mark a document for permanent retention:
curl -X POST https://api.remem.io/v1/documents/{document_id}/star \
  -H "X-API-Key: vlt_..."
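The hybrid retention policy reduces to a simple decision function. This sketch encodes the documented rules; the category strings are examples, since categories are free-form LLM output rather than an enum.

```python
# Example high-value categories; real category strings are free-form,
# so a production check would be fuzzier than exact set membership.
HIGH_VALUE = {"contract", "invoice", "legal", "tax", "medical"}

def raw_file_retention_days(category: str, starred: bool):
    """Raw-file retention in days, or None for 'kept forever'.

    High-value categories and starred documents keep their raw files
    indefinitely; everything else is deleted after 90 days. Derived text,
    embeddings, and extracted data are always kept regardless.
    """
    if starred or category in HIGH_VALUE:
        return None  # kept forever
    return 90
```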

Next Steps