
Documents & Ingestion

Documents are the core unit of Remem—text content that gets encrypted, chunked, embedded, classified, and made searchable.

Overview

When you ingest a document into Remem, the following happens automatically:
  • Content is encrypted with your tenant’s Data Encryption Key (DEK)
  • Text is split into semantic chunks (~1200 char targets, adaptive v2 chunking)
  • Each chunk is embedded using voyage-3.5-lite (1024-dim vectors)
  • Content is classified by Grok 4 Fast (category, tags, sensitivity, extracted data)
  • Vectors are indexed in your isolated Qdrant collection
  • BM25 search vectors are computed for hybrid search
  • Large PDFs and Markdown files can be indexed into a PageIndex tree for long-document retrieval (optional)
  • Long documents (>= 20k characters) are chunked more aggressively and stored as a summary chunk plus full chunks so they rank properly in normal search
Ingestion is asynchronous. The API queues the document and returns a job_id immediately. The document is not immediately searchable—expect 5-15 seconds for typical text documents.
PageIndex adds section-level context in rich-mode queries but does not replace normal chunk search.

Ingest a Document

POST /v1/documents/ingest

You can ingest documents in two ways: JSON body (for text content) or multipart form (for file uploads).

Method 1: JSON Body (Text)

Use this method when you have text content ready to ingest directly.
curl -X POST https://api.remem.io/v1/documents/ingest \
  -H "Content-Type: application/json" \
  -H "X-API-Key: vlt_..." \
  -d '{
    "title": "Meeting Notes - Q1 Planning",
    "content": "We decided to focus on three priorities...",
    "source": "api",
    "source_id": "meeting-2026-02-04",
    "source_path": "/home/user/Notes/meetings/2026-02-04.md",
    "mime_type": "text/plain",
    "metadata": {"type": "meeting_notes", "date": "2026-02-04"}
  }'
Response:
{
  "job_id": "1770218554334-0",
  "message": "Document queued for ingestion"
}

Request Fields

Field | Type | Required | Description
content | string | Yes | Raw text content to ingest
title | string | No | Document title (auto-generated if omitted)
source | string | No | Ingestion source (enum): api, quick_capture, folder_sync, gmail. Default: api
source_id | string | No | External ID for deduplication. If a document with the same source_id exists, creates a new version
source_path | string | No | Original file path or URI (stored encrypted; returned on GET /v1/documents/{id})
mime_type | string | No | MIME type hint (e.g., text/plain, text/markdown)
metadata | object | No | Arbitrary key-value pairs, stored encrypted
source is the ingestion channel, not the file format. File formats are inferred by the classifier and show up as source_type (e.g., pdf, markdown, image). If you need to preserve a local file path, set source to folder_sync and pass source_path. Invalid source values return a 422.
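The request body can be assembled client-side before sending. Below is a minimal sketch of a payload builder that enforces the source enum and omits unset optional fields; the function name and structure are illustrative, not part of any official SDK.

```python
import json

# Allowed ingestion channels per the field table; invalid values return a 422.
ALLOWED_SOURCES = {"api", "quick_capture", "folder_sync", "gmail"}

def build_ingest_payload(content, title=None, source="api", source_id=None,
                         source_path=None, mime_type=None, metadata=None):
    """Build the JSON body for POST /v1/documents/ingest.

    Only `content` is required; optional fields are included only when set,
    keeping the request body minimal.
    """
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"invalid source {source!r}; the server would return 422")
    payload = {"content": content, "source": source}
    for key, value in [("title", title), ("source_id", source_id),
                       ("source_path", source_path), ("mime_type", mime_type),
                       ("metadata", metadata)]:
        if value is not None:
            payload[key] = value
    return payload

body = build_ingest_payload(
    content="We decided to focus on three priorities...",
    title="Meeting Notes - Q1 Planning",
    source_id="meeting-2026-02-04",
)
print(json.dumps(body, indent=2))
```

Pair this with any HTTP client; the server only sees the resulting JSON.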

Method 2: Multipart Form (File Upload)

Use this method to upload files directly (PDFs, images, text files, etc.).
curl -X POST https://api.remem.io/v1/documents/ingest \
  -H "X-API-Key: vlt_..." \
  -F "file=@report.pdf" \
  -F "title=Q1 Report" \
  -F "source=api" \
  -F "source_path=/home/user/Reports/Q1/report.pdf" \
  -F 'metadata={"department": "engineering"}'
Response:
{
  "job_id": "1770218554335-0",
  "message": "Document queued for ingestion"
}

Supported File Types

Category | Extensions | Notes
Text | .txt, .md, .markdown, .vtt, .srt, .log, .rtf | Processed directly (VTT/SRT transcripts supported)
PDF | .pdf | Rendered to images, text extracted
Images | .jpg, .jpeg, .png, .heic, .webp, .gif | Vision-based classification
Code | .py, .js, .ts, .swift, .go, .rs, .java, .cpp, .c, .h | Text extraction
Data | .csv, .tsv, .json, .yaml, .yml, .toml, .xml | Structured data extraction
Email | .eml, .msg | Email parsing
Web | .html, .htm | HTML parsing
Max content size: Practical limit is ~50KB of text for JSON body ingestion. For file uploads, the limit is 10MB.
Classifier source_type values are: pdf, image, text, code, markdown, document, spreadsheet, email, web, unknown. These are inferred by Remem (file extension + classifier) and are not user-supplied.
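As a rough illustration of how extensions relate to source_type values, here is a simplified lookup. The real pipeline combines extension hints with LLM classification, so treat this mapping as an approximation rather than the actual implementation.

```python
import os

# Illustrative extension-to-source_type mapping (a subset of the supported
# file types table). The production classifier also inspects content, so
# actual results may differ; ambiguous formats are omitted here.
EXT_TO_SOURCE_TYPE = {
    ".pdf": "pdf",
    ".md": "markdown", ".markdown": "markdown",
    ".txt": "text", ".log": "text", ".vtt": "text", ".srt": "text",
    ".py": "code", ".js": "code", ".ts": "code", ".go": "code", ".rs": "code",
    ".csv": "spreadsheet", ".tsv": "spreadsheet",
    ".eml": "email", ".msg": "email",
    ".html": "web", ".htm": "web",
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".heic": "image",
}

def guess_source_type(filename: str) -> str:
    """Best-effort source_type guess from a filename's extension."""
    _, ext = os.path.splitext(filename.lower())
    return EXT_TO_SOURCE_TYPE.get(ext, "unknown")
```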

Async Processing Pipeline

After ingestion, the document goes through a multi-stage processing pipeline. Here’s what happens:
1. Document Queued
Your document is added to the Redis Streams job queue.

2. Worker Picks Up Job
A background worker retrieves the job for processing.

3. Content Encrypted
Text content is encrypted with your tenant’s Data Encryption Key (DEK) using AES-256-GCM.

4. Text Chunking
Content is split into semantic chunks using adaptive v2 chunking (~1200 char targets, 1800 max, 15% overlap).

5. Embedding Generation
Each chunk is embedded using voyage-3.5-lite (1024-dimensional vectors).

6. Document Classification
Grok 4 Fast analyzes the document and extracts:
  • Category (e.g., “invoice”, “manual”, “screenshot”)
  • Tags (key:value format like vendor:amazon, topic:finance)
  • Sensitivity level (public, internal, confidential, personal)
  • Language detection
  • Summary for search
  • Extracted structured data (amounts, dates, names, etc.)

7. Vector Indexing
Embeddings are indexed in your isolated Qdrant collection for semantic search.

8. BM25 Indexing
BM25 search vectors are computed for keyword-based hybrid search.

9. PageIndex Tree (Optional)
For long PDFs and Markdown files (default: ≥20k characters), a separate worker builds a hierarchical PageIndex tree of sections and summaries. Node summaries are stored encrypted in PostgreSQL and later used in rich-mode retrieval.
Documents are not immediately searchable. For typical text documents, expect 5-15 seconds of processing time. Large PDFs or image-heavy documents may take longer.
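The chunking step's size arithmetic can be sketched as follows. Note this is a plain sliding-window approximation: the production chunker is semantic (it prefers sentence and paragraph boundaries), so only the target/max/overlap numbers here come from the docs.

```python
def chunk_text(text: str, target: int = 1200, max_len: int = 1800,
               overlap_frac: float = 0.15) -> list[str]:
    """Sliding-window chunker showing the ~1200-char target with 15% overlap.

    The real adaptive v2 chunker respects semantic boundaries; this sketch
    only demonstrates the size and overlap arithmetic.
    """
    if len(text) <= max_len:
        return [text]
    step = int(target * (1 - overlap_frac))  # advance ~85% of the target size
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + target])
        start += step
    return chunks
```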

PageIndex for Long Documents

PageIndex is an optional long-document indexer that complements standard vector + BM25 search. It is designed to improve retrieval quality for very large PDFs and Markdown files by building a hierarchical map of sections. How it works:
  • Runs asynchronously in a separate pageindex-worker
  • Only applies to PDF and Markdown inputs above a size threshold (default: 20,000 characters)
  • Generates a tree of section nodes with short summaries
  • Stores node summaries encrypted in PostgreSQL (same DEK model as documents)
  • Does not replace standard chunking or vector search — it augments rich-mode results
Notes:
  • PageIndex uses an external LLM (OpenAI-compatible API; Grok is the default in production).
  • If PageIndex is disabled or the worker is offline, ingestion still completes normally and all other search paths continue to work.
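The eligibility rule above reduces to a small predicate. This sketch encodes the documented defaults (PDF/Markdown only, ≥20,000 characters); the constant names are illustrative.

```python
PAGEINDEX_MIN_CHARS = 20_000            # default size threshold from the docs
PAGEINDEX_TYPES = {"pdf", "markdown"}   # only these inputs get a tree

def pageindex_eligible(source_type: str, char_count: int) -> bool:
    """True when a document qualifies for PageIndex tree building.

    Everything else relies on standard chunk + BM25 search alone.
    """
    return source_type in PAGEINDEX_TYPES and char_count >= PAGEINDEX_MIN_CHARS
```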

Retrieve a Document

GET /v1/documents/{document_id}

Fetch a document with decrypted content, metadata, and classification results.
curl https://api.remem.io/v1/documents/{document_id} \
  -H "X-API-Key: vlt_..."
Response:
{
  "id": "a1b2c3d4-...",
  "tenant_id": "...",
  "title": "Meeting Notes - Q1 Planning",
  "content": "We decided to focus on three priorities...",
  "source": "api",
  "status": "completed",
  "chunk_count": 3,
  "created_at": "2026-02-04T10:30:00Z",
  "classification": {
    "category": "meeting_notes",
    "tags": ["project:q1-planning", "topic:strategy"],
    "sensitivity": "internal",
    "language": "en",
    "summary": "Q1 planning meeting discussing priorities for EU expansion and mobile app launch",
    "confidence": 0.92,
    "extracted": {
      "date": "2026-01-15",
      "attendees": ["Alice", "Bob"],
      "priorities": ["EU expansion", "mobile app", "hiring"]
    }
  }
}
The API respects your API key’s sensitivity scope. If your key has internal access, you cannot retrieve confidential or personal documents.
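The scope rule can be modeled as a cutoff over an ordered list of levels. This is a sketch only: the linear ordering below (public < internal < confidential < personal) is an assumption consistent with the docs' example, not a confirmed server implementation detail.

```python
# Assumed ordering from least to most restricted; the docs confirm that an
# `internal` key cannot read confidential or personal documents.
SENSITIVITY_ORDER = ["public", "internal", "confidential", "personal"]

def allowed_levels(key_scope: str) -> set[str]:
    """Sensitivity levels an API key may read, given its maximum scope."""
    cutoff = SENSITIVITY_ORDER.index(key_scope)
    return set(SENSITIVITY_ORDER[:cutoff + 1])
```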

Update a Document

POST /v1/documents/{document_id}/update

Creates a new version of the document. Does not overwrite the original.
curl -X POST https://api.remem.io/v1/documents/{document_id}/update \
  -H "Content-Type: application/json" \
  -H "X-API-Key: vlt_..." \
  -d '{
    "content": "Updated meeting notes...",
    "title": "Meeting Notes - Q1 Planning (Revised)"
  }'
Response:
{
  "id": "a1b2c3d4-...",
  "version": 2,
  "is_new_version": true,
  "message": "New version created"
}
Versioning: Each update creates a new document version. The previous version is marked as superseded but remains in the database for audit purposes. Old chunks and vectors are queued for deletion after the new version is indexed.
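The version-on-update semantics can be sketched with a tiny in-memory model: an update never overwrites, it appends a new version and marks the previous one superseded. This is illustrative only and says nothing about the actual storage schema.

```python
def update_document(versions: list[dict], new_content: str) -> dict:
    """Append a new version, marking the previous one superseded.

    Mirrors the documented behavior: old versions remain for audit; their
    chunks and vectors are cleaned up after the new version is indexed.
    """
    if versions:
        versions[-1]["superseded"] = True
    new_version = {
        "version": len(versions) + 1,
        "content": new_content,
        "superseded": False,
    }
    versions.append(new_version)
    return new_version

history: list[dict] = []
update_document(history, "original notes")
latest = update_document(history, "revised notes")
```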

Delete a Document

DELETE /v1/documents/{document_id}

Soft delete: marks the document as deleted and excludes it from search results immediately.
curl -X DELETE https://api.remem.io/v1/documents/{document_id} \
  -H "X-API-Key: vlt_..."
Response:
{
  "message": "Document marked for deletion",
  "deleted_at": "2026-02-04T11:00:00Z"
}
Soft delete: The document is hidden from queries immediately but remains in the database. Hard deletion (complete removal from PostgreSQL, Qdrant, and Spaces) happens asynchronously within 30 days.

Hard Deletion

Hard deletion removes the document from all stores:
  • PostgreSQL (document record, chunks, metadata)
  • Qdrant (vector embeddings)
  • DigitalOcean Spaces (raw file)
  • Redis (cached data)
Hard delete is triggered automatically 30 days after soft delete, or you can request immediate hard delete using the hard_delete flag:
curl -X DELETE "https://api.remem.io/v1/documents/{document_id}?hard_delete=true" \
  -H "X-API-Key: vlt_..."

Classification Fields

When a document is processed, Grok 4 Fast extracts the following metadata fields:
category
What type of document is this? A free-form string determined by the LLM based on content, not file extension. Examples: "invoice", "manual", "receipt", "screenshot", "meeting_notes", "property-listing", "guidebook".
Categories are not enums. The LLM decides what the document is based on its actual content.
tags
Key-value tags for filtering and organization. Semi-structured tags in key:value format. Examples:
  • vendor:amazon
  • topic:real-estate
  • project:q4-2025
  • department:engineering
Tags are used for filtering in queries and organizing documents into collections.
sensitivity
Access control level. One of four levels:
  • public — Can be shared freely
  • internal — Within organization only
  • confidential — Limited access (financial, legal)
  • personal — Private to user (default)
API keys are scoped to a maximum sensitivity level. Queries automatically filter results based on the key’s access level.
language
Detected language as an ISO 639-1 code (e.g., "en", "es", "fr", "de"). Used for language-specific search and filtering.
summary
A concise 1-2 sentence description of the document content (max 500 chars), optimized for search. Example: "Q1 planning meeting discussing priorities for EU expansion, mobile app launch by March, and hiring two engineers."
Detailed indexable description
A longer, more detailed description extracted by the LLM for full-text search. It includes key entities, topics, and context that might not be in the summary.
extracted
Structured key-value data: a free-form JSON object containing data extracted from the document. Only present when the LLM detects extractable fields. Examples:

Invoice:
{
  "vendor": "Amazon",
  "amount": 49.99,
  "date": "2026-01-10",
  "currency": "USD"
}
Property Listing:
{
  "address": "123 Main St",
  "price": 450000,
  "beds": 3,
  "sqft": 1800
}
Meeting Notes:
{
  "date": "2026-01-10",
  "attendees": ["Alice", "Bob"],
  "topic": "Q1 Planning"
}
Extracted data is stored encrypted in PostgreSQL and can be queried using structured filters.
confidence
Classification confidence score: a float between 0.0 and 1.0 indicating how confident the LLM is in its classification.
  • >= 0.7 — High confidence, classification is reliable
  • < 0.7 — Low confidence, may require manual review
If the primary model (Grok 4 Fast) returns confidence below 0.7, the system automatically falls back to Claude Haiku 4.5.
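The fallback logic amounts to one threshold check. In this sketch, `primary` and `fallback` are stand-in callables for the Grok 4 Fast and Claude Haiku 4.5 calls; the function name and shape are illustrative.

```python
CONFIDENCE_THRESHOLD = 0.7  # below this, the result is considered unreliable

def classify_with_fallback(document: str, primary, fallback) -> dict:
    """Run the primary classifier; retry with the fallback on low confidence.

    Each callable takes the document text and returns a classification dict
    containing a `confidence` key.
    """
    result = primary(document)
    if result.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        result = fallback(document)
    return result
```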

Common Pitfalls & Tips

Ingestion is async. Don’t expect documents to be immediately searchable. Poll the query API or wait 5-15 seconds before testing search.
Large documents are chunked automatically. You don’t need to split them yourself. Send the full content—the system handles semantic chunking.
Duplicate detection uses source_id. Always set it for idempotent ingestion. If you retry the same source_id, the system creates a new version instead of a duplicate.
Metadata is encrypted at rest. You can store sensitive key-value pairs (customer IDs, account numbers, etc.) safely. They’re encrypted in PostgreSQL and only decrypted during queries.
File uploads have a 10MB limit. For larger files, consider splitting them or using a file transfer service.
Supported file types. Remem supports PDF, images, text, markdown, spreadsheets, code, and more. Check the supported file types table above.
Max content for JSON body: ~50KB of text. For larger content, use multipart file upload instead.
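A polling helper for the "ingestion is async" pitfall might look like the sketch below. The status callback is injected so the loop stays testable; in practice you would pass a function that calls GET /v1/documents/{id} and returns its `status` field.

```python
import time

def wait_until_searchable(get_status, timeout: float = 30.0,
                          interval: float = 2.0) -> bool:
    """Poll a status callback until the document reports `completed`.

    `get_status` stands in for a GET /v1/documents/{document_id} request;
    returns False if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "completed":
            return True
        time.sleep(interval)
    return False
```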

Idempotent Ingestion

Use the Idempotency-Key header to ensure duplicate requests don’t create multiple documents:
curl -X POST https://api.remem.io/v1/documents/ingest \
  -H "Content-Type: application/json" \
  -H "X-API-Key: vlt_..." \
  -H "Idempotency-Key: upload-20260204-meeting-notes" \
  -d '{
    "title": "Meeting Notes",
    "content": "..."
  }'
If you retry this request with the same Idempotency-Key, Remem returns the cached response from the first request instead of creating a duplicate.
Idempotency keys are stored in Redis for 24 hours and are tenant-scoped. After 24 hours, the key expires and a new request would be treated as fresh.
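One practical way to pick a key is to derive it deterministically from the payload, so retries of the same request reuse the same key while a changed payload gets a fresh one. This scheme is a suggestion, not a requirement; any stable string works.

```python
import hashlib

def idempotency_key(source_id: str, content: str) -> str:
    """Derive a stable Idempotency-Key from the payload.

    Hashing the content ties the key to this exact document, so a retry
    reuses the cached response but an edited payload is treated as new.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"{source_id}-{digest}"
```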

Raw File Storage & TTL

Remem stores raw files in DigitalOcean Spaces with a hybrid retention policy:
Document Type | Retention Policy
High-value (contracts, invoices, legal, tax, medical) | Kept forever
User-starred | Kept forever
Everything else | Kept for 90 days, then deleted

Text content, embeddings, summaries, and extracted data are always kept, regardless of raw-file retention.
Even after raw files are deleted, searchable content remains. You can still query the document using semantic search—you just can’t retrieve the original PDF or image.
To mark a document for permanent retention:
curl -X POST https://api.remem.io/v1/documents/{document_id}/star \
  -H "X-API-Key: vlt_..."
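The hybrid retention policy reduces to a simple decision function. This sketch encodes the documented rules; the category strings are examples, since categories are free-form LLM output rather than an enum.

```python
# Example high-value categories; real category strings are free-form,
# so a production check would be fuzzier than exact set membership.
HIGH_VALUE = {"contract", "invoice", "legal", "tax", "medical"}

def raw_file_retention_days(category: str, starred: bool):
    """Raw-file retention in days, or None for 'kept forever'.

    High-value categories and starred documents keep their raw files
    indefinitely; everything else is deleted after 90 days. Derived text,
    embeddings, and extracted data are always kept regardless.
    """
    if starred or category in HIGH_VALUE:
        return None  # kept forever
    return 90
```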

Next Steps