TrustGraph Document Embeddings API

This API provides import, export, and query capabilities for document embeddings. It handles document chunks together with their vector embeddings and metadata. Import and export run over WebSocket connections, while queries follow a request/response pattern.

Schema Overview

DocumentEmbeddings Structure

metadata: Document metadata (ID, user, collection, RDF triples)
chunks: Array of document chunks with embeddings

ChunkEmbeddings Structure

chunk: Text chunk as bytes
vectors: Array of vector embeddings (Array of Array of Double)

DocumentEmbeddingsRequest Structure

vectors: Query vector embeddings
limit: Maximum number of results
user: User identifier
collection: Collection identifier

DocumentEmbeddingsResponse Structure

error: Error information if operation fails
documents: Array of matching documents as bytes
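
The structures above can be mirrored as plain Python dataclasses. This is only an illustrative sketch: the class names follow the schema names in this section, but the field types and the `to_dict` helper are assumptions, not the TrustGraph SDK's own definitions.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ChunkEmbeddings:
    chunk: str                   # chunk text (bytes internally, a string in JSON)
    vectors: list[list[float]]   # one or more embedding vectors for this chunk


@dataclass
class Metadata:
    id: str
    user: str
    collection: str
    metadata: list[dict] = field(default_factory=list)  # RDF triples as s/p/o dicts


@dataclass
class DocumentEmbeddings:
    metadata: Metadata
    chunks: list[ChunkEmbeddings] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Hypothetical helper: returns the nested dict shape used in the JSON examples
        return asdict(self)
```

Calling `to_dict()` on a `DocumentEmbeddings` instance yields a dictionary with top-level `metadata` and `chunks` keys that can be passed to `json.dumps`.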

Import/Export Operations

Import - WebSocket Endpoint

Endpoint: /api/v1/flow/{flow}/import/document-embeddings

Method: WebSocket connection

Request Format:

{
    "metadata": {
        "id": "doc-123",
        "user": "alice",
        "collection": "research",
        "metadata": [
            {
                "s": {"v": "doc-123", "e": true},
                "p": {"v": "dc:title", "e": true},
                "o": {"v": "Research Paper", "e": false}
            }
        ]
    },
    "chunks": [
        {
            "chunk": "This is the first chunk of the document...",
            "vectors": [
                [0.1, 0.2, 0.3, 0.4],
                [0.5, 0.6, 0.7, 0.8]
            ]
        },
        {
            "chunk": "This is the second chunk...",
            "vectors": [
                [0.9, 0.8, 0.7, 0.6],
                [0.5, 0.4, 0.3, 0.2]
            ]
        }
    ]
}

Response: Import operations are fire-and-forget with no response payload.
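
As a minimal sketch, an import message in the format above could be assembled in Python before it is sent over the WebSocket. The function name `build_import_message` and its signature are hypothetical and exist only for this illustration; only the JSON shape comes from the request format shown above.

```python
import json


def build_import_message(doc_id, user, collection, chunks, triples=None):
    """Serialize one document's chunks and vectors into an import message.

    `chunks` is a list of (text, vectors) pairs. This helper is an
    illustrative assumption, not part of TrustGraph.
    """
    message = {
        "metadata": {
            "id": doc_id,
            "user": user,
            "collection": collection,
            "metadata": triples or [],  # optional list of s/p/o triples
        },
        "chunks": [
            {"chunk": text, "vectors": vectors} for text, vectors in chunks
        ],
    }
    return json.dumps(message)


payload = build_import_message(
    "doc-123", "alice", "research",
    [("This is the first chunk of the document...", [[0.1, 0.2, 0.3, 0.4]])],
)
```

Because import is fire-and-forget, the caller gets no confirmation; validating the message locally before sending is one way to catch mistakes early.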

Export - WebSocket Endpoint

Endpoint: /api/v1/flow/{flow}/export/document-embeddings

Method: WebSocket connection

The export endpoint streams document embeddings in real time. Each message contains:

{
    "metadata": {
        "id": "doc-123",
        "user": "alice",
        "collection": "research",
        "metadata": [
            {
                "s": {"v": "doc-123", "e": true},
                "p": {"v": "dc:title", "e": true},
                "o": {"v": "Research Paper", "e": false}
            }
        ]
    },
    "chunks": [
        {
            "chunk": "Decoded text content of chunk",
            "vectors": [[0.1, 0.2, 0.3, 0.4]]
        }
    ]
}
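
A consumer written in Python might reduce each exported message to a short summary. The sketch below assumes each WebSocket message is a JSON string in the shape shown above; `summarize_export_message` is a hypothetical helper name.

```python
import json


def summarize_export_message(raw):
    """Return (document id, chunk count, vector dimensions seen) for one message."""
    message = json.loads(raw)
    chunks = message.get("chunks", [])
    # Collect the distinct vector lengths across all chunks
    dimensions = sorted({len(v) for c in chunks for v in c.get("vectors", [])})
    return message["metadata"]["id"], len(chunks), dimensions
```

More than one value in the returned dimensions list indicates that the message mixes vectors of different sizes.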

Query Operations

Query Document Embeddings

Purpose: Find documents similar to provided vector embeddings

Request:

{
    "vectors": [
        [0.1, 0.2, 0.3, 0.4, 0.5],
        [0.6, 0.7, 0.8, 0.9, 1.0]
    ],
    "limit": 10,
    "user": "alice",
    "collection": "research"
}

Response:

{
    "documents": [
        "base64-encoded-document-1",
        "base64-encoded-document-2"
    ]
}
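
Since the documents in a query response arrive as base64 strings, a client has to decode them before use. The following sketch shows one way to do that; `decode_documents` is a hypothetical helper, and treating the decoded bytes as UTF-8 text is an assumption based on the text encoding notes in this document.

```python
import base64


def decode_documents(response):
    """Decode the base64 strings in a query response into raw bytes."""
    return [base64.b64decode(doc) for doc in response.get("documents", [])]


# Build a sample response locally so the sketch runs without a server
sample = {
    "documents": [base64.b64encode("first match".encode("utf-8")).decode("ascii")]
}

# Assumption: the decoded bytes are UTF-8 text
texts = [raw.decode("utf-8") for raw in decode_documents(sample)]
```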

WebSocket Usage Examples

Importing Document Embeddings

// Connect to import endpoint
const ws = new WebSocket('ws://api-gateway:8080/api/v1/flow/my-flow/import/document-embeddings');

// Wait for the connection to open; calling send() while still connecting throws
ws.onopen = () => {
    // Send document embeddings
    ws.send(JSON.stringify({
        metadata: {
            id: "doc-123",
            user: "alice",
            collection: "research"
        },
        chunks: [
            {
                chunk: "Document content chunk 1",
                vectors: [[0.1, 0.2, 0.3]]
            }
        ]
    }));
};

Exporting Document Embeddings

// Connect to export endpoint
const ws = new WebSocket('ws://api-gateway:8080/api/v1/flow/my-flow/export/document-embeddings');

// Listen for exported data
ws.onmessage = (event) => {
    const documentEmbeddings = JSON.parse(event.data);
    console.log('Received document:', documentEmbeddings.metadata.id);
    console.log('Chunks:', documentEmbeddings.chunks.length);
};

Data Format Details

Metadata Format

Each metadata triple contains:

s: Subject (object with v for value and e for is_entity boolean)
p: Predicate (object with v for value and e for is_entity boolean)
o: Object (object with v for value and e for is_entity boolean)
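
To make the triple layout concrete, here is a small sketch that builds triples in this format. The helper names `term` and `triple` are hypothetical, and defaulting the subject and predicate to entities simply mirrors the examples earlier in this document.

```python
def term(value, is_entity):
    """Build one s/p/o term: `v` holds the value, `e` marks it as an entity."""
    return {"v": value, "e": bool(is_entity)}


def triple(subject, predicate, obj, object_is_entity=False):
    """Build a metadata triple; subject and predicate are treated as entities here."""
    return {
        "s": term(subject, True),
        "p": term(predicate, True),
        "o": term(obj, object_is_entity),
    }


# A literal title attached to the document entity
title = triple("doc-123", "dc:title", "Research Paper")
```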

Vector Format

Vectors are arrays of floating-point numbers
Each chunk can have multiple vectors (different embedding models)
Vectors should be consistently dimensioned within a collection
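
A client could check dimensions before importing. The sketch below is a hypothetical helper, not part of TrustGraph, and it assumes every vector in the batch should share a single dimension; a deployment that mixes embedding models of different sizes would need a per-model check instead.

```python
def check_dimensions(chunks, expected=None):
    """Return the shared vector dimension, or raise ValueError on a mismatch.

    `chunks` is a list of dicts with a "vectors" key, as in the import format.
    If `expected` is None, the first vector seen sets the dimension.
    """
    for index, chunk in enumerate(chunks):
        for vector in chunk["vectors"]:
            if expected is None:
                expected = len(vector)
            elif len(vector) != expected:
                raise ValueError(
                    f"chunk {index}: expected {expected} dimensions, got {len(vector)}"
                )
    return expected
```

Passing the collection's known dimension as `expected` catches a batch that is internally consistent but does not match what is already stored.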

Text Encoding

Chunk text is handled internally as UTF-8 encoded bytes
The WebSocket API accepts and returns plain text strings
Base64 encoding is used for binary data in query responses

Python SDK

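import asyncio

from trustgraph.clients.document_embeddings_client import DocumentEmbeddingsClient


async def main():
    # Create client
    client = DocumentEmbeddingsClient()

    # Query similar documents
    request = {
        "vectors": [[0.1, 0.2, 0.3, 0.4]],
        "limit": 5,
        "user": "alice",
        "collection": "research"
    }

    # await is only valid inside a coroutine, so the query runs in main()
    response = await client.query(request)
    documents = response.documents
    print(f"Found {len(documents)} matching documents")


asyncio.run(main())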

Integration with TrustGraph

Storage Integration

Document embeddings are stored in vector databases
Metadata is cross-referenced with knowledge graph
Supports multi-tenant isolation by user and collection

Processing Pipeline

Document Ingestion: Text documents loaded via text-load API
Chunking: Documents split into manageable chunks
Embedding Generation: Vector embeddings created for each chunk
Storage: Embeddings stored via import API
Retrieval: Similar documents found via query API
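
The chunking step can be illustrated with a generic fixed-size splitter. This document does not describe how TrustGraph chunks text, so the sketch below is an assumption for illustration only: it uses character windows with a small overlap, and the function name and default sizes are hypothetical.

```python
def split_into_chunks(text, size=1000, overlap=100):
    """Split text into fixed-size character windows that overlap slightly."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap smaller than size")
    step = size - overlap  # each window starts this many characters after the last
    return [text[start:start + size] for start in range(0, len(text), step)]
```

For example, `split_into_chunks("abcdefghij", size=4, overlap=1)` returns `["abcd", "defg", "ghij", "j"]`. Each resulting chunk would then be embedded and sent through the import API.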

Use Cases

Semantic Search: Find documents similar to query embeddings
RAG Systems: Retrieve relevant document chunks for question answering
Document Clustering: Group similar documents using embeddings
Content Recommendations: Suggest related documents to users
Knowledge Discovery: Find connections between document collections

Error Handling

Common error scenarios:

Invalid vector dimensions
Missing required metadata fields
User/collection access restrictions
WebSocket connection failures
Malformed JSON data

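For request/response operations such as queries, errors are returned in the response error field (import is fire-and-forget and returns no payload):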

{
    "error": {
        "type": "ValidationError",
        "message": "Invalid vector dimensions"
    }
}
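
On the client side, a response in this shape could be turned into an exception. This is a sketch under stated assumptions: `DocumentEmbeddingsError` and `documents_or_raise` are hypothetical names, and the response is assumed to have already been parsed into a dictionary.

```python
class DocumentEmbeddingsError(Exception):
    """Hypothetical exception wrapping the `error` field of a response."""

    def __init__(self, error_type, message):
        super().__init__(f"{error_type}: {message}")
        self.error_type = error_type
        self.message = message


def documents_or_raise(response):
    """Return the documents list, raising if the response carries an error."""
    error = response.get("error")
    if error:
        raise DocumentEmbeddingsError(
            error.get("type", "UnknownError"), error.get("message", "")
        )
    return response.get("documents", [])
```

Checking the error field first keeps callers from mistaking a failed query for an empty result set.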

Performance Considerations

Batch Processing: Import multiple documents in a single WebSocket session rather than opening a connection per document
Vector Dimensions: Consistent embedding dimensions improve performance
Collection Sizing: Limit collections to reasonable sizes for query performance
Real-time vs Batch: Choose the appropriate method based on use case requirements