tg-save-doc-embeds
Saves document embeddings from TrustGraph processing streams to MessagePack format files.
Synopsis
tg-save-doc-embeds -o OUTPUT_FILE [options]
Description
The tg-save-doc-embeds command connects to TrustGraph’s document embeddings export stream and saves the embeddings to a file in MessagePack format. This is useful for creating backups of document embeddings, exporting data for analysis, or preparing data for migration between systems.
The command should typically be started before document processing begins to capture all embeddings as they are generated.
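A minimal sketch of that ordering is shown below; tg-load-pdf is assumed here as the ingestion step, so substitute whichever loader drives your flow:
# Start the saver first so no embeddings are missed, then ingest
tg-save-doc-embeds -o embeddings.msgpack &
save_pid=$!
# Submit documents once the saver is listening (loader is an assumption)
tg-load-pdf document.pdf
# Stop the saver after processing has finished
kill $save_pid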
Options
Required Arguments
-o, --output-file FILE
: Output file for saved embeddings
Optional Arguments
-u, --url URL
: TrustGraph API URL (default: $TRUSTGRAPH_API or http://localhost:8088/)
-f, --flow-id ID
: Flow instance ID to monitor (default: default)
--format FORMAT
: Output format - msgpack or json (default: msgpack)
--user USER
: Filter by user ID (default: no filter)
--collection COLLECTION
: Filter by collection ID (default: no filter)
Examples
Basic Document Embeddings Export
tg-save-doc-embeds -o document-embeddings.msgpack
Export from Specific Flow
tg-save-doc-embeds \
-o research-embeddings.msgpack \
-f "research-processing-flow"
Filter by User and Collection
tg-save-doc-embeds \
-o filtered-embeddings.msgpack \
--user "research-team" \
--collection "research-docs"
Export to JSON Format
tg-save-doc-embeds \
-o embeddings.json \
--format json
Production Backup
tg-save-doc-embeds \
-o "backup-$(date +%Y%m%d-%H%M%S).msgpack" \
-u https://production-api.company.com/ \
-f "production-flow"
Output Format
MessagePack Structure
Document embeddings are saved as MessagePack records:
["de", {
"m": {
"i": "document-id",
"m": [{"metadata": "objects"}],
"u": "user-id",
"c": "collection-id"
},
"c": [{
"c": "text chunk content",
"v": [0.1, 0.2, 0.3, ...]
}]
}]
Components
- Record Type: "de" indicates document embeddings
- Metadata (m): Document information and context
- Chunks (c): Text chunks with their vector embeddings
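Because tg-dump-msgpack can render these records as JSON (as the examples below do), jq offers a quick way to inspect them. This sketch assumes the dump emits one JSON array per record:
# List each document ID with its chunk count, keeping only "de" records
tg-dump-msgpack -i document-embeddings.msgpack | \
    jq -r 'select(.[0] == "de") | "\(.[1].m.i): \(.[1].c | length) chunks"'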
Use Cases
Backup Creation
# Create regular backups of document embeddings
create_embeddings_backup() {
local backup_dir="embeddings-backups"
local timestamp=$(date +%Y%m%d_%H%M%S)
local backup_file="$backup_dir/embeddings-$timestamp.msgpack"
mkdir -p "$backup_dir"
echo "Creating embeddings backup: $backup_file"
# Start backup process
tg-save-doc-embeds -o "$backup_file" &
save_pid=$!
echo "Backup process started (PID: $save_pid)"
echo "To stop: kill $save_pid"
echo "Backup file: $backup_file"
# Optionally wait for a specific duration
# sleep 3600 # Run for 1 hour
# kill $save_pid
}
# Create backup
create_embeddings_backup
Data Migration Preparation
# Prepare embeddings for migration
prepare_migration_data() {
local source_env="$1"
local collection="$2"
local migration_file="migration-$(date +%Y%m%d).msgpack"
echo "Preparing migration data from: $source_env"
echo "Collection: $collection"
# Export embeddings from source
tg-save-doc-embeds \
-o "$migration_file" \
-u "http://$source_env:8088/" \
--collection "$collection" &
export_pid=$!
# Let it run for specified time to capture data
echo "Capturing embeddings for migration..."
echo "Process PID: $export_pid"
# In practice, you'd run this for the duration needed
# sleep 1800 # 30 minutes
# kill $export_pid
echo "Migration data will be saved to: $migration_file"
}
# Prepare migration from dev to production
prepare_migration_data "dev-server" "processed-docs"
Continuous Export
# Continuous embeddings export with rotation
continuous_export() {
local output_dir="continuous-exports"
local rotation_hours=24
local file_prefix="embeddings"
mkdir -p "$output_dir"
while true; do
timestamp=$(date +%Y%m%d_%H%M%S)
output_file="$output_dir/${file_prefix}-${timestamp}.msgpack"
echo "Starting export to: $output_file"
# Start export for specified duration
timeout ${rotation_hours}h tg-save-doc-embeds -o "$output_file"
# Compress completed file
gzip "$output_file"
echo "Export completed and compressed: ${output_file}.gz"
# Optional: clean up old files
find "$output_dir" -name "*.msgpack.gz" -mtime +30 -delete
# Brief pause before next rotation
sleep 60
done
}
# Start continuous export (run in background)
continuous_export &
Analysis and Research
# Export embeddings for research analysis
export_for_research() {
local research_topic="$1"
local output_file="research-${research_topic}-$(date +%Y%m%d).msgpack"
echo "Exporting embeddings for research: $research_topic"
# Start export with filtering
tg-save-doc-embeds \
-o "$output_file" \
--collection "$research_topic" &
export_pid=$!
echo "Research export started (PID: $export_pid)"
echo "Output: $output_file"
# Create analysis script
cat > "analyze-${research_topic}.sh" << EOF
#!/bin/bash
# Analysis script for $research_topic embeddings
echo "Analyzing $research_topic embeddings..."
# Basic statistics
echo "=== Basic Statistics ==="
tg-dump-msgpack -i "$output_file" --summary
# Detailed analysis
echo "=== Detailed Analysis ==="
tg-dump-msgpack -i "$output_file" | head -10
echo "Analysis complete for $research_topic"
EOF
chmod +x "analyze-${research_topic}.sh"
echo "Analysis script created: analyze-${research_topic}.sh"
}
# Export for different research topics
export_for_research "cybersecurity"
export_for_research "climate-change"
Advanced Usage
Selective Export
# Export embeddings with multiple filters
selective_export() {
local users=("user1" "user2" "user3")
local collections=("docs1" "docs2")
for user in "${users[@]}"; do
for collection in "${collections[@]}"; do
output_file="embeddings-${user}-${collection}.msgpack"
echo "Exporting for user: $user, collection: $collection"
tg-save-doc-embeds \
-o "$output_file" \
--user "$user" \
--collection "$collection" &
# Store PID for later management
echo $! > "${output_file}.pid"
done
done
echo "All selective exports started"
}
Monitoring and Statistics
# Monitor export progress with statistics
monitor_export() {
local output_file="$1"
local pid_file="${output_file}.pid"
if [ ! -f "$pid_file" ]; then
echo "PID file not found: $pid_file"
return 1
fi
local export_pid=$(cat "$pid_file")
echo "Monitoring export (PID: $export_pid)..."
echo "Output file: $output_file"
while kill -0 "$export_pid" 2>/dev/null; do
if [ -f "$output_file" ]; then
file_size=$(stat -c%s "$output_file" 2>/dev/null || echo "0")
human_size=$(numfmt --to=iec-i --suffix=B "$file_size")
# Try to count embeddings
embedding_count=$(tg-dump-msgpack -i "$output_file" 2>/dev/null | grep -c '^\["de"')
echo "File size: $human_size, Embeddings: $embedding_count"
else
echo "Output file not yet created..."
fi
sleep 30
done
echo "Export process completed"
rm "$pid_file"
}
# Start export and monitor
tg-save-doc-embeds -o "monitored-export.msgpack" &
echo $! > "monitored-export.msgpack.pid"
monitor_export "monitored-export.msgpack"
Export Validation
# Validate exported embeddings
validate_export() {
local export_file="$1"
echo "Validating export file: $export_file"
# Check file exists and has content
if [ ! -s "$export_file" ]; then
echo "✗ Export file is empty or missing"
return 1
fi
# Check MessagePack format
if tg-dump-msgpack -i "$export_file" --summary > /dev/null 2>&1; then
echo "✓ Valid MessagePack format"
else
echo "✗ Invalid MessagePack format"
return 1
fi
# Check for document embeddings
embedding_count=$(tg-dump-msgpack -i "$export_file" | grep -c '^\["de"')
if [ "$embedding_count" -gt 0 ]; then
echo "✓ Contains $embedding_count document embeddings"
else
echo "✗ No document embeddings found"
return 1
fi
# Get vector dimension information
summary=$(tg-dump-msgpack -i "$export_file" --summary)
if echo "$summary" | grep -q "Vector dimension:"; then
dimension=$(echo "$summary" | grep "Vector dimension:" | awk '{print $3}')
echo "✓ Vector dimension: $dimension"
else
echo "⚠ Could not determine vector dimension"
fi
echo "Validation completed successfully"
}
Export Scheduling
# Scheduled export with cron-like functionality
schedule_export() {
local schedule="$1" # e.g., "daily", "hourly", "weekly"
local output_prefix="$2"
case "$schedule" in
"hourly")
interval=3600
;;
"daily")
interval=86400
;;
"weekly")
interval=604800
;;
*)
echo "Invalid schedule: $schedule"
return 1
;;
esac
echo "Starting $schedule exports with prefix: $output_prefix"
while true; do
timestamp=$(date +%Y%m%d_%H%M%S)
output_file="${output_prefix}-${timestamp}.msgpack"
echo "Starting scheduled export: $output_file"
# Run export for the scheduled interval
timeout ${interval}s tg-save-doc-embeds -o "$output_file"
# Validate and compress
if validate_export "$output_file"; then
gzip "$output_file"
echo "✓ Export completed and compressed: ${output_file}.gz"
else
echo "✗ Export validation failed: $output_file"
mv "$output_file" "${output_file}.failed"
fi
# Brief pause before next cycle
sleep 60
done
}
# Start daily scheduled exports
schedule_export "daily" "daily-embeddings" &
Performance Considerations
Memory Management
# Monitor memory usage during export
monitor_memory_export() {
local output_file="$1"
# Start export
tg-save-doc-embeds -o "$output_file" &
export_pid=$!
echo "Monitoring memory usage for export (PID: $export_pid)..."
while kill -0 "$export_pid" 2>/dev/null; do
memory_usage=$(ps -p "$export_pid" -o rss= 2>/dev/null | awk '{print $1/1024}')
if [ -n "$memory_usage" ]; then
echo "Memory usage: ${memory_usage}MB"
fi
sleep 10
done
echo "Export completed"
}
Network Optimization
# Optimize for network conditions
network_optimized_export() {
local output_file="$1"
local api_url="$2"
echo "Starting network-optimized export..."
# Use compression and buffering
tg-save-doc-embeds \
-o "$output_file" \
-u "$api_url" \
--format msgpack & # MessagePack is more compact than JSON
export_pid=$!
# Monitor network usage
echo "Monitoring export (PID: $export_pid)..."
while kill -0 "$export_pid" 2>/dev/null; do
# Monitor network connections
connections=$(netstat -an | grep ":8088" | wc -l)
echo "Active connections: $connections"
sleep 30
done
}
Error Handling
Connection Issues
Exception: WebSocket connection failed
Solution: Check API URL and ensure TrustGraph WebSocket service is running.
Disk Space Issues
Exception: No space left on device
Solution: Free up disk space or use a different output location.
Permission Errors
Exception: Permission denied
Solution: Check write permissions for the output file location.
Memory Issues
MemoryError: Unable to allocate memory
Solution: Monitor memory usage and consider using smaller export windows.
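For transient failures such as dropped connections, a simple retry wrapper can keep an export running. This is an illustrative sketch, not a feature of the tool itself; since the command streams until interrupted, any non-zero exit (e.g. a dropped WebSocket) triggers a restart:
# Illustrative retry wrapper: restart the export if it exits non-zero,
# giving up after a fixed number of attempts
retry_export() {
    local output_file="$1"
    local max_attempts=5
    local attempt=1
    until tg-save-doc-embeds -o "$output_file"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Export failed after $max_attempts attempts" >&2
            return 1
        fi
        echo "Export exited with an error; retrying in 30s (attempt $attempt)..." >&2
        attempt=$((attempt + 1))
        sleep 30
    done
}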
Integration with Other Commands
Complete Backup Workflow
# Complete backup and restore workflow
backup_restore_workflow() {
local backup_file="embeddings-backup.msgpack"
echo "=== Backup Phase ==="
# Create backup
tg-save-doc-embeds -o "$backup_file" &
backup_pid=$!
# Let it run for a while
sleep 300 # 5 minutes
kill $backup_pid
echo "Backup created: $backup_file"
# Validate backup
validate_export "$backup_file"
echo "=== Restore Phase ==="
# Restore from backup (to different collection)
tg-load-doc-embeds -i "$backup_file" --collection "restored"
echo "Backup and restore workflow completed"
}
Analysis Pipeline
# Export and analyze embeddings
export_analyze_pipeline() {
local topic="$1"
local export_file="analysis-${topic}.msgpack"
echo "Starting export and analysis pipeline for: $topic"
# Export embeddings
tg-save-doc-embeds \
-o "$export_file" \
--collection "$topic" &
export_pid=$!
# Run for analysis duration
sleep 600 # 10 minutes
kill $export_pid
# Analyze exported data
echo "Analyzing exported embeddings..."
tg-dump-msgpack -i "$export_file" --summary
# Count embeddings by user
echo "Embeddings by user:"
tg-dump-msgpack -i "$export_file" | \
jq -r '.[1].m.u' | \
sort | uniq -c
echo "Analysis pipeline completed"
}
Environment Variables
TRUSTGRAPH_API
: Default API URL
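Setting it once avoids repeating -u on every invocation:
# Point subsequent tg-* commands at a non-default API endpoint
export TRUSTGRAPH_API="https://production-api.company.com/"
tg-save-doc-embeds -o embeddings.msgpack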
Related Commands
tg-load-doc-embeds - Load document embeddings from files
tg-dump-msgpack - Analyze MessagePack files
tg-show-flows - List available flows for monitoring
API Integration
This command uses TrustGraph’s WebSocket API for document embeddings export, specifically the /api/v1/flow/{flow-id}/export/document-embeddings endpoint.
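To sanity-check the endpoint outside the CLI, a generic WebSocket client can connect to the same path. This assumes websocat (a third-party tool) is installed, the default flow ID, and that the service accepts plain ws:// connections on the default port; the messages printed may be binary:
# Hypothetical manual check: print the first few raw messages and exit
websocat "ws://localhost:8088/api/v1/flow/default/export/document-embeddings" | head -5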
Best Practices
- Start Early: Begin export before processing starts to capture all data
- Monitoring: Monitor export progress and file sizes
- Validation: Always validate exported files
- Compression: Use compression for long-term storage
- Rotation: Implement file rotation for continuous exports
- Backup: Keep multiple backup copies in different locations (see the sketch after this list)
- Documentation: Document export schedules and procedures
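A minimal sketch of the compression and off-site backup practices; backup-host and /srv/embeddings-backups/ are placeholders for your own infrastructure:
# Compress, checksum, and copy a completed export off-site
gzip embeddings-backup.msgpack
sha256sum embeddings-backup.msgpack.gz > embeddings-backup.msgpack.gz.sha256
rsync -av embeddings-backup.msgpack.gz{,.sha256} backup-host:/srv/embeddings-backups/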
Troubleshooting
No Data Captured
# Check if processing is generating embeddings
tg-show-flows | grep processing
# Verify WebSocket connection
netstat -an | grep :8088
Large File Issues
# Monitor file growth
watch -n 5 'ls -lh *.msgpack'
# Check available disk space
df -h
Process Management
# List running export processes
ps aux | grep tg-save-doc-embeds
# Kill stuck processes
pkill -f tg-save-doc-embeds