tg-load-doc-embeds
Loads document embeddings from MessagePack format into TrustGraph processing pipelines.
Synopsis
tg-load-doc-embeds -i INPUT_FILE [options]
Description
The tg-load-doc-embeds command loads document embeddings from MessagePack files into a running TrustGraph system. This is typically used to restore previously saved document embeddings or to load embeddings generated by external systems.
The command reads document embedding data in MessagePack format and streams it to TrustGraph’s document embeddings import API via WebSocket connections.
Options
Required Arguments
-i, --input-file FILE
: Input MessagePack file containing document embeddings
Optional Arguments
-u, --url URL
: TrustGraph API URL (default: $TRUSTGRAPH_API or http://localhost:8088/)
-f, --flow-id ID
: Flow instance ID to use (default: default)
--format FORMAT
: Input format - msgpack or json (default: msgpack)
--user USER
: Override user ID from input data
--collection COLLECTION
: Override collection ID from input data
Examples
Basic Loading
tg-load-doc-embeds -i document-embeddings.msgpack
Load with Custom Flow
tg-load-doc-embeds \
-i embeddings.msgpack \
-f "document-processing-flow"
Override User and Collection
tg-load-doc-embeds \
-i embeddings.msgpack \
--user "research-team" \
--collection "research-docs"
Load from JSON Format
tg-load-doc-embeds \
-i embeddings.json \
--format json
Production Loading
tg-load-doc-embeds \
-i production-embeddings.msgpack \
-u https://trustgraph-api.company.com/ \
-f "production-flow" \
--user "system" \
--collection "production-docs"
Input Data Format
MessagePack Structure
Document embeddings are stored as MessagePack records with this structure:
["de", {
"m": {
"i": "document-id",
"m": [{"metadata": "objects"}],
"u": "user-id",
"c": "collection-id"
},
"c": [{
"c": "text chunk content",
"v": [0.1, 0.2, 0.3, ...]
}]
}]
Components
- Document Metadata (m):
  - i: Document ID
  - m: Document metadata objects
  - u: User ID
  - c: Collection ID
- Chunks (c): Array of text chunks with embeddings:
  - c: Text content of the chunk
  - v: Vector embedding array
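A quick way to sanity-check this structure before loading is to inspect the file with tg-dump-msgpack, which is used for the same purpose elsewhere on this page. The sketch below assumes the summary output and the ["de", ...] record prefix shown in the examples later in this document:
# Sketch: confirm a file contains document embedding ("de") records
inspect_embeddings_file() {
    local file="$1"
    # Print the summary view (record counts, vector dimension)
    tg-dump-msgpack -i "$file" --summary
    # Count the "de" records in the full dump
    local de_count
    de_count=$(tg-dump-msgpack -i "$file" | grep -c '^\["de"')
    echo "Document embedding records: $de_count"
}
# Inspect before loading
inspect_embeddings_file "document-embeddings.msgpack"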
Use Cases
Backup Restoration
# Restore document embeddings from backup
restore_embeddings() {
    local backup_file="$1"
    local target_collection="$2"
    echo "Restoring document embeddings from: $backup_file"
    if [ ! -f "$backup_file" ]; then
        echo "Backup file not found: $backup_file"
        return 1
    fi
    # Verify backup file
    if tg-dump-msgpack -i "$backup_file" --summary | grep -q "Vector dimension:"; then
        echo "✓ Backup file contains embeddings"
    else
        echo "✗ Backup file does not contain valid embeddings"
        return 1
    fi
    # Load embeddings
    tg-load-doc-embeds \
        -i "$backup_file" \
        --collection "$target_collection"
    echo "Embedding restoration complete"
}
# Restore from backup
restore_embeddings "backup-20231215.msgpack" "restored-docs"
Data Migration
# Migrate embeddings between environments
migrate_embeddings() {
    local source_file="$1"
    local target_env="$2"
    local target_user="$3"
    echo "Migrating embeddings to: $target_env"
    # Load to target environment
    tg-load-doc-embeds \
        -i "$source_file" \
        -u "https://$target_env/api/" \
        --user "$target_user" \
        --collection "migrated-docs"
    echo "Migration complete"
}
# Migrate to production
migrate_embeddings "dev-embeddings.msgpack" "prod.company.com" "migration-user"
Batch Processing
# Load multiple embedding files
batch_load_embeddings() {
    local input_dir="$1"
    local collection="$2"
    echo "Batch loading embeddings from: $input_dir"
    for file in "$input_dir"/*.msgpack; do
        if [ -f "$file" ]; then
            echo "Loading: $(basename "$file")"
            tg-load-doc-embeds \
                -i "$file" \
                --collection "$collection"
            if [ $? -eq 0 ]; then
                echo "✓ Loaded: $(basename "$file")"
            else
                echo "✗ Failed: $(basename "$file")"
            fi
        fi
    done
    echo "Batch loading complete"
}
# Load all embeddings
batch_load_embeddings "embeddings/" "batch-processed"
Incremental Loading
# Load new embeddings incrementally
incremental_load() {
    local embeddings_dir="$1"
    local processed_log="processed_embeddings.log"
    # Create log if it doesn't exist
    touch "$processed_log"
    for file in "$embeddings_dir"/*.msgpack; do
        if [ -f "$file" ]; then
            # Check if already processed
            if grep -q "$(basename "$file")" "$processed_log"; then
                echo "Skipping already processed: $(basename "$file")"
                continue
            fi
            echo "Processing new file: $(basename "$file")"
            if tg-load-doc-embeds -i "$file"; then
                echo "$(date): $(basename "$file")" >> "$processed_log"
                echo "✓ Processed: $(basename "$file")"
            else
                echo "✗ Failed: $(basename "$file")"
            fi
        fi
    done
}
# Run incremental loading
incremental_load "embeddings/"
Advanced Usage
Parallel Loading
# Load multiple files in parallel
parallel_load_embeddings() {
    local files=("$@")
    local max_parallel=3
    local current_jobs=0
    for file in "${files[@]}"; do
        # Wait if max parallel jobs reached
        while [ $current_jobs -ge $max_parallel ]; do
            wait -n  # Wait for any job to complete
            current_jobs=$((current_jobs - 1))
        done
        # Start loading in background
        (
            echo "Loading: $file"
            tg-load-doc-embeds -i "$file"
            echo "Completed: $file"
        ) &
        current_jobs=$((current_jobs + 1))
    done
    # Wait for all remaining jobs
    wait
    echo "All parallel loading completed"
}
# Load files in parallel
embedding_files=(embeddings1.msgpack embeddings2.msgpack embeddings3.msgpack)
parallel_load_embeddings "${embedding_files[@]}"
Validation and Loading
# Validate before loading
validate_and_load() {
    local file="$1"
    local collection="$2"
    echo "Validating embedding file: $file"
    # Check file exists and is readable
    if [ ! -r "$file" ]; then
        echo "Error: Cannot read file $file"
        return 1
    fi
    # Validate MessagePack structure
    if ! tg-dump-msgpack -i "$file" --summary > /dev/null 2>&1; then
        echo "Error: Invalid MessagePack format"
        return 1
    fi
    # Check for document embeddings
    if ! tg-dump-msgpack -i "$file" | grep -q '^\["de"'; then
        echo "Error: No document embeddings found"
        return 1
    fi
    # Get embedding statistics
    summary=$(tg-dump-msgpack -i "$file" --summary)
    vector_dim=$(echo "$summary" | grep "Vector dimension:" | awk '{print $3}')
    if [ -n "$vector_dim" ]; then
        echo "✓ Found embeddings with dimension: $vector_dim"
    else
        echo "Warning: Could not determine vector dimension"
    fi
    # Load embeddings
    echo "Loading validated embeddings..."
    tg-load-doc-embeds -i "$file" --collection "$collection"
    echo "Loading complete"
}
# Validate and load
validate_and_load "embeddings.msgpack" "validated-docs"
Progress Monitoring
# Monitor loading progress
monitor_loading() {
    local file="$1"
    local log_file="loading_progress.log"
    # Start loading in background
    tg-load-doc-embeds -i "$file" > "$log_file" 2>&1 &
    local load_pid=$!
    echo "Monitoring loading progress (PID: $load_pid)..."
    # Monitor progress
    while kill -0 $load_pid 2>/dev/null; do
        if [ -f "$log_file" ]; then
            # Extract progress from log
            embeddings_count=$(grep -o "Document embeddings:.*[0-9]" "$log_file" | tail -1 | awk '{print $3}')
            if [ -n "$embeddings_count" ]; then
                echo "Progress: $embeddings_count embeddings loaded"
            fi
        fi
        sleep 5
    done
    # Check final status
    wait $load_pid
    if [ $? -eq 0 ]; then
        echo "✓ Loading completed successfully"
    else
        echo "✗ Loading failed"
        cat "$log_file"
    fi
    rm "$log_file"
}
# Monitor loading
monitor_loading "large-embeddings.msgpack"
Data Transformation
# Transform embeddings during loading
transform_and_load() {
    local input_file="$1"
    local output_file="transformed-$(basename "$input_file")"
    local new_user="$2"
    local new_collection="$3"
    echo "Transforming embeddings: user=$new_user, collection=$new_collection"
    # This would require a transformation script
    # For now, we'll show the concept
    # Load with override parameters
    tg-load-doc-embeds \
        -i "$input_file" \
        --user "$new_user" \
        --collection "$new_collection"
    echo "Transformation and loading complete"
}
# Transform during loading
transform_and_load "original.msgpack" "new-user" "new-collection"
Performance Optimization
Memory Management
# Monitor memory usage during loading
monitor_memory_usage() {
    local file="$1"
    echo "Starting memory-monitored loading..."
    # Start loading in background
    tg-load-doc-embeds -i "$file" &
    local load_pid=$!
    # Monitor memory usage
    while kill -0 $load_pid 2>/dev/null; do
        memory_usage=$(ps -p $load_pid -o rss= 2>/dev/null | awk '{print $1/1024}')
        if [ -n "$memory_usage" ]; then
            echo "Memory usage: ${memory_usage}MB"
        fi
        sleep 10
    done
    wait $load_pid
    echo "Loading completed"
}
Chunked Loading
# Load large files in chunks
chunked_load() {
    local large_file="$1"
    local chunk_size=1000  # Records per chunk
    echo "Loading large file in chunks: $large_file"
    # Split the MessagePack file (this would need special tooling)
    # For demonstration, assuming we have pre-split files
    for chunk in "${large_file%.msgpack}"_chunk_*.msgpack; do
        if [ -f "$chunk" ]; then
            echo "Loading chunk: $(basename "$chunk")"
            tg-load-doc-embeds -i "$chunk"
            # Add delay between chunks to reduce system load
            sleep 2
        fi
    done
    echo "Chunked loading complete"
}
Error Handling
File Not Found
Exception: [Errno 2] No such file or directory
Solution: Verify file path and ensure the MessagePack file exists.
Invalid Format
Exception: Unpack failed
Solution: Verify the file is a valid MessagePack file with document embeddings.
WebSocket Connection Issues
Exception: Connection failed
Solution: Check API URL and ensure TrustGraph is running with WebSocket support.
Memory Errors
MemoryError: Unable to allocate memory
Solution: Process large files in smaller chunks or increase available memory.
Flow Not Found
Exception: Flow not found
Solution: Verify the flow ID exists with tg-show-flows.
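Transient failures such as dropped WebSocket connections can often be resolved by simply retrying. A minimal retry wrapper, assuming tg-load-doc-embeds exits with a non-zero status on failure (as the incremental loading example above relies on), might look like this:
# Sketch: retry a load a few times before giving up
load_with_retry() {
    local file="$1"
    local max_attempts=3
    local attempt=1
    while [ $attempt -le $max_attempts ]; do
        echo "Load attempt $attempt of $max_attempts: $file"
        if tg-load-doc-embeds -i "$file"; then
            echo "✓ Loaded: $file"
            return 0
        fi
        echo "Attempt $attempt failed, retrying in 10 seconds..."
        sleep 10
        attempt=$((attempt + 1))
    done
    echo "✗ Giving up after $max_attempts attempts: $file"
    return 1
}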
Integration with Other Commands
Complete Workflow
# Complete document processing workflow
process_documents_workflow() {
    local pdf_dir="$1"
    local embeddings_file="embeddings.msgpack"
    echo "Starting complete document workflow..."
    # 1. Load PDFs
    for pdf in "$pdf_dir"/*.pdf; do
        tg-load-pdf "$pdf"
    done
    # 2. Wait for processing
    sleep 30
    # 3. Save embeddings
    tg-save-doc-embeds -o "$embeddings_file"
    # 4. Process embeddings (example: load to different collection)
    tg-load-doc-embeds -i "$embeddings_file" --collection "processed-docs"
    echo "Complete workflow finished"
}
Backup and Restore
# Complete backup and restore cycle
backup_restore_cycle() {
    local backup_file="embeddings-backup.msgpack"
    echo "Creating embeddings backup..."
    tg-save-doc-embeds -o "$backup_file"
    echo "Simulating data loss..."
    # (In real scenario, this might be system failure)
    echo "Restoring from backup..."
    tg-load-doc-embeds -i "$backup_file" --collection "restored"
    echo "Backup/restore cycle complete"
}
Environment Variables
TRUSTGRAPH_API
: Default API URL
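Setting this variable lets you omit the -u option from each invocation, for example:
# Use the API endpoint from the environment instead of passing -u
export TRUSTGRAPH_API="https://trustgraph-api.company.com/"
tg-load-doc-embeds -i document-embeddings.msgpack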
Related Commands
- tg-save-doc-embeds - Save document embeddings to MessagePack
- tg-dump-msgpack - Analyze MessagePack files
- tg-load-pdf - Load PDF documents for processing
- tg-show-flows - List available flows
API Integration
This command uses TrustGraph’s WebSocket API for document embeddings import, specifically the /api/v1/flow/{flow-id}/import/document-embeddings endpoint.
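For reference, the full import URL is the API base URL joined with this endpoint path, with the flow ID substituted in. The sketch below only derives and prints that URL for the default flow and default base URL; it is illustrative, and the command itself manages the WebSocket connection:
# Sketch: derive the import endpoint URL (illustrative only)
api_base="${TRUSTGRAPH_API:-http://localhost:8088/}"
flow_id="default"
echo "Import endpoint: ${api_base%/}/api/v1/flow/${flow_id}/import/document-embeddings"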
Best Practices
- Validation: Always validate MessagePack files before loading
- Backups: Keep backups of original embedding files
- Monitoring: Monitor memory usage and loading progress
- Chunking: Process large files in manageable chunks
- Error Handling: Implement robust error handling and retry logic
- Documentation: Document the source and format of embedding files
- Testing: Test loading procedures in non-production environments
Troubleshooting
Loading Stalls
# Check WebSocket connection
netstat -an | grep :8088
# Check system resources
free -h
df -h
Incomplete Loading
# Compare input vs loaded data
input_count=$(tg-dump-msgpack -i input.msgpack | grep '^\["de"' | wc -l)
echo "Input embeddings: $input_count"
# Check loaded data (would need query command)
# loaded_count=$(tg-query-embeddings --count)
# echo "Loaded embeddings: $loaded_count"
Performance Issues
# Monitor network usage
iftop
# Check TrustGraph service logs
docker logs trustgraph-service