TrustGraph Document Embeddings API
This API provides import, export, and query capabilities for document embeddings. It handles document chunks with their vector embeddings and metadata, supporting both real-time WebSocket operations and request/response patterns.
Schema Overview
DocumentEmbeddings Structure
metadata
: Document metadata (ID, user, collection, RDF triples)chunks
: Array of document chunks with embeddings
ChunkEmbeddings Structure
chunk
: Text chunk as bytesvectors
: Array of vector embeddings (Array of Array of Double)
DocumentEmbeddingsRequest Structure
vectors
: Query vector embeddingslimit
: Maximum number of resultsuser
: User identifiercollection
: Collection identifier
DocumentEmbeddingsResponse Structure
error
: Error information if operation failsdocuments
: Array of matching documents as bytes
Import/Export Operations
Import - WebSocket Endpoint
Endpoint: /api/v1/flow/{flow}/import/document-embeddings
Method: WebSocket connection
Request Format:
{
"metadata": {
"id": "doc-123",
"user": "alice",
"collection": "research",
"metadata": [
{
"s": {"v": "doc-123", "e": true},
"p": {"v": "dc:title", "e": true},
"o": {"v": "Research Paper", "e": false}
}
]
},
"chunks": [
{
"chunk": "This is the first chunk of the document...",
"vectors": [
[0.1, 0.2, 0.3, 0.4],
[0.5, 0.6, 0.7, 0.8]
]
},
{
"chunk": "This is the second chunk...",
"vectors": [
[0.9, 0.8, 0.7, 0.6],
[0.5, 0.4, 0.3, 0.2]
]
}
]
}
Response: Import operations are fire-and-forget with no response payload.
Export - WebSocket Endpoint
Endpoint: /api/v1/flow/{flow}/export/document-embeddings
Method: WebSocket connection
The export endpoint streams document embeddings data in real-time. Each message contains:
{
"metadata": {
"id": "doc-123",
"user": "alice",
"collection": "research",
"metadata": [
{
"s": {"v": "doc-123", "e": true},
"p": {"v": "dc:title", "e": true},
"o": {"v": "Research Paper", "e": false}
}
]
},
"chunks": [
{
"chunk": "Decoded text content of chunk",
"vectors": [[0.1, 0.2, 0.3, 0.4]]
}
]
}
Query Operations
Query Document Embeddings
Purpose: Find documents similar to provided vector embeddings
Request:
{
"vectors": [
[0.1, 0.2, 0.3, 0.4, 0.5],
[0.6, 0.7, 0.8, 0.9, 1.0]
],
"limit": 10,
"user": "alice",
"collection": "research"
}
Response:
{
"documents": [
"base64-encoded-document-1",
"base64-encoded-document-2"
]
}
WebSocket Usage Examples
Importing Document Embeddings
// Connect to import endpoint
const ws = new WebSocket('ws://api-gateway:8080/api/v1/flow/my-flow/import/document-embeddings');
// Send document embeddings
ws.send(JSON.stringify({
metadata: {
id: "doc-123",
user: "alice",
collection: "research"
},
chunks: [
{
chunk: "Document content chunk 1",
vectors: [[0.1, 0.2, 0.3]]
}
]
}));
Exporting Document Embeddings
// Connect to export endpoint
const ws = new WebSocket('ws://api-gateway:8080/api/v1/flow/my-flow/export/document-embeddings');
// Listen for exported data
ws.onmessage = (event) => {
const documentEmbeddings = JSON.parse(event.data);
console.log('Received document:', documentEmbeddings.metadata.id);
console.log('Chunks:', documentEmbeddings.chunks.length);
};
Data Format Details
Metadata Format
Each metadata triple contains:
s
: Subject (object withv
for value ande
for is_entity boolean)p
: Predicate (object withv
for value ande
for is_entity boolean)o
: Object (object withv
for value ande
for is_entity boolean)
Vector Format
- Vectors are arrays of floating-point numbers
- Each chunk can have multiple vectors (different embedding models)
- Vectors should be consistently dimensioned within a collection
Text Encoding
- Chunk text is handled as UTF-8 encoded bytes internally
- WebSocket API accepts/returns plain text strings
- Base64 encoding used for binary data in query responses
Python SDK
from trustgraph.clients.document_embeddings_client import DocumentEmbeddingsClient
# Create client
client = DocumentEmbeddingsClient()
# Query similar documents
request = {
"vectors": [[0.1, 0.2, 0.3, 0.4]],
"limit": 5,
"user": "alice",
"collection": "research"
}
response = await client.query(request)
documents = response.documents
Integration with TrustGraph
Storage Integration
- Document embeddings are stored in vector databases
- Metadata is cross-referenced with knowledge graph
- Supports multi-tenant isolation by user and collection
Processing Pipeline
- Document Ingestion: Text documents loaded via text-load API
- Chunking: Documents split into manageable chunks
- Embedding Generation: Vector embeddings created for each chunk
- Storage: Embeddings stored via import API
- Retrieval: Similar documents found via query API
Use Cases
- Semantic Search: Find documents similar to query embeddings
- RAG Systems: Retrieve relevant document chunks for question answering
- Document Clustering: Group similar documents using embeddings
- Content Recommendations: Suggest related documents to users
- Knowledge Discovery: Find connections between document collections
Error Handling
Common error scenarios:
- Invalid vector dimensions
- Missing required metadata fields
- User/collection access restrictions
- WebSocket connection failures
- Malformed JSON data
Errors are returned in the response error
field:
{
"error": {
"type": "ValidationError",
"message": "Invalid vector dimensions"
}
}
Performance Considerations
- Batch Processing: Import multiple documents in single WebSocket session
- Vector Dimensions: Consistent embedding dimensions improve performance
- Collection Sizing: Limit collections to reasonable sizes for query performance
- Real-time vs Batch: Choose appropriate method based on use case requirements