tg-save-doc-embeds

Saves document embeddings from TrustGraph processing streams to MessagePack format files.

Synopsis

tg-save-doc-embeds -o OUTPUT_FILE [options]

Description

The tg-save-doc-embeds command connects to TrustGraph’s document embeddings export stream and saves the embeddings to a file in MessagePack format. This is useful for creating backups of document embeddings, exporting data for analysis, or preparing data for migration between systems.

The command should typically be started before document processing begins to capture all embeddings as they are generated.

Options

Required Arguments

  • -o, --output-file FILE: Output file for saved embeddings

Optional Arguments

  • -u, --url URL: TrustGraph API URL (default: $TRUSTGRAPH_API or http://localhost:8088/)
  • -f, --flow-id ID: Flow instance ID to monitor (default: default)
  • --format FORMAT: Output format - msgpack or json (default: msgpack)
  • --user USER: Filter by user ID (default: no filter)
  • --collection COLLECTION: Filter by collection ID (default: no filter)

Examples

Basic Document Embeddings Export

tg-save-doc-embeds -o document-embeddings.msgpack

Export from Specific Flow

tg-save-doc-embeds \
  -o research-embeddings.msgpack \
  -f "research-processing-flow"

Filter by User and Collection

tg-save-doc-embeds \
  -o filtered-embeddings.msgpack \
  --user "research-team" \
  --collection "research-docs"

Export to JSON Format

tg-save-doc-embeds \
  -o embeddings.json \
  --format json

Production Backup

tg-save-doc-embeds \
  -o "backup-$(date +%Y%m%d-%H%M%S).msgpack" \
  -u https://production-api.company.com/ \
  -f "production-flow"

Output Format

MessagePack Structure

Document embeddings are saved as MessagePack records:

["de", {
  "m": {
    "i": "document-id",
    "m": [{"metadata": "objects"}],
    "u": "user-id",
    "c": "collection-id"
  },
  "c": [{
    "c": "text chunk content",
    "v": [0.1, 0.2, 0.3, ...]
  }]
}]

Components

  • Record Type: "de" indicates document embeddings
  • Metadata (m): Document information and context
  • Chunks (c): Text chunks with their vector embeddings
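The record layout above can be read back with a short Python sketch. This is illustrative only, not part of TrustGraph: it assumes the file is a plain stream of MessagePack-encoded records as shown, and it uses the third-party msgpack package (pip install msgpack). The helper names are hypothetical.

```python
# Illustrative sketch: iterate over "de" records in a saved embeddings file.
# Assumes a plain stream of MessagePack records; requires the third-party
# `msgpack` package. Helper names are not part of TrustGraph.
import msgpack

def iter_doc_embeddings(path):
    """Yield (metadata, chunks) for each document-embeddings record."""
    with open(path, "rb") as f:
        for record in msgpack.Unpacker(f, raw=False):
            kind, payload = record
            if kind != "de":          # skip any non-embedding record types
                continue
            yield payload["m"], payload["c"]

def summarize(path):
    """Count documents and chunks, and collect observed vector dimensions."""
    docs = chunks = 0
    dims = set()
    for meta, chunk_list in iter_doc_embeddings(path):
        docs += 1
        chunks += len(chunk_list)
        for chunk in chunk_list:
            dims.add(len(chunk["v"]))
    return {"documents": docs, "chunks": chunks, "dimensions": sorted(dims)}
```

A sketch like this is useful for spot-checking an export file offline, without the tg-dump-msgpack tooling.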

Use Cases

Backup Creation

# Create regular backups of document embeddings
create_embeddings_backup() {
  local backup_dir="embeddings-backups"
  local timestamp=$(date +%Y%m%d_%H%M%S)
  local backup_file="$backup_dir/embeddings-$timestamp.msgpack"
  
  mkdir -p "$backup_dir"
  
  echo "Creating embeddings backup: $backup_file"
  
  # Start backup process
  tg-save-doc-embeds -o "$backup_file" &
  save_pid=$!
  
  echo "Backup process started (PID: $save_pid)"
  echo "To stop: kill $save_pid"
  echo "Backup file: $backup_file"
  
  # Optionally wait for a specific duration
  # sleep 3600  # Run for 1 hour
  # kill $save_pid
}

# Create backup
create_embeddings_backup

Data Migration Preparation

# Prepare embeddings for migration
prepare_migration_data() {
  local source_env="$1"
  local collection="$2"
  local migration_file="migration-$(date +%Y%m%d).msgpack"
  
  echo "Preparing migration data from: $source_env"
  echo "Collection: $collection"
  
  # Export embeddings from source
  tg-save-doc-embeds \
    -o "$migration_file" \
    -u "http://$source_env:8088/" \
    --collection "$collection" &
  
  export_pid=$!
  
  # Let it run for specified time to capture data
  echo "Capturing embeddings for migration..."
  echo "Process PID: $export_pid"
  
  # In practice, you'd run this for the duration needed
  # sleep 1800  # 30 minutes
  # kill $export_pid
  
  echo "Migration data will be saved to: $migration_file"
}

# Prepare migration from dev to production
prepare_migration_data "dev-server" "processed-docs"

Continuous Export

# Continuous embeddings export with rotation
continuous_export() {
  local output_dir="continuous-exports"
  local rotation_hours=24
  local file_prefix="embeddings"
  
  mkdir -p "$output_dir"
  
  while true; do
    timestamp=$(date +%Y%m%d_%H%M%S)
    output_file="$output_dir/${file_prefix}-${timestamp}.msgpack"
    
    echo "Starting export to: $output_file"
    
    # Start export for specified duration
    timeout ${rotation_hours}h tg-save-doc-embeds -o "$output_file"
    
    # Compress completed file
    gzip "$output_file"
    
    echo "Export completed and compressed: ${output_file}.gz"
    
    # Optional: clean up old files
    find "$output_dir" -name "*.msgpack.gz" -mtime +30 -delete
    
    # Brief pause before next rotation
    sleep 60
  done
}

# Start continuous export (run in background)
continuous_export &

Analysis and Research

# Export embeddings for research analysis
export_for_research() {
  local research_topic="$1"
  local output_file="research-${research_topic}-$(date +%Y%m%d).msgpack"
  
  echo "Exporting embeddings for research: $research_topic"
  
  # Start export with filtering
  tg-save-doc-embeds \
    -o "$output_file" \
    --collection "$research_topic" &
  
  export_pid=$!
  
  echo "Research export started (PID: $export_pid)"
  echo "Output: $output_file"
  
  # Create analysis script
  cat > "analyze-${research_topic}.sh" << EOF
#!/bin/bash
# Analysis script for $research_topic embeddings

echo "Analyzing $research_topic embeddings..."

# Basic statistics
echo "=== Basic Statistics ==="
tg-dump-msgpack -i "$output_file" --summary

# Detailed analysis
echo "=== Detailed Analysis ==="
tg-dump-msgpack -i "$output_file" | head -10

echo "Analysis complete for $research_topic"
EOF
  
  chmod +x "analyze-${research_topic}.sh"
  echo "Analysis script created: analyze-${research_topic}.sh"
}

# Export for different research topics
export_for_research "cybersecurity"
export_for_research "climate-change"

Advanced Usage

Selective Export

# Export embeddings with multiple filters
selective_export() {
  local users=("user1" "user2" "user3")
  local collections=("docs1" "docs2")
  
  for user in "${users[@]}"; do
    for collection in "${collections[@]}"; do
      output_file="embeddings-${user}-${collection}.msgpack"
      
      echo "Exporting for user: $user, collection: $collection"
      
      tg-save-doc-embeds \
        -o "$output_file" \
        --user "$user" \
        --collection "$collection" &
      
      # Store PID for later management
      echo $! > "${output_file}.pid"
    done
  done
  
  echo "All selective exports started"
}

Monitoring and Statistics

# Monitor export progress with statistics
monitor_export() {
  local output_file="$1"
  local pid_file="${output_file}.pid"
  
  if [ ! -f "$pid_file" ]; then
    echo "PID file not found: $pid_file"
    return 1
  fi
  
  local export_pid=$(cat "$pid_file")
  
  echo "Monitoring export (PID: $export_pid)..."
  echo "Output file: $output_file"
  
  while kill -0 "$export_pid" 2>/dev/null; do
    if [ -f "$output_file" ]; then
      file_size=$(stat -c%s "$output_file" 2>/dev/null || echo "0")
      human_size=$(numfmt --to=iec-i --suffix=B "$file_size")
      
      # Try to count embeddings
      # grep -c prints 0 itself on no match; "|| true" only guards the exit status
      embedding_count=$(tg-dump-msgpack -i "$output_file" 2>/dev/null | grep -c '^\["de"' || true)
      
      echo "File size: $human_size, Embeddings: $embedding_count"
    else
      echo "Output file not yet created..."
    fi
    
    sleep 30
  done
  
  echo "Export process completed"
  rm "$pid_file"
}

# Start export and monitor
tg-save-doc-embeds -o "monitored-export.msgpack" &
echo $! > "monitored-export.msgpack.pid"
monitor_export "monitored-export.msgpack"

Export Validation

# Validate exported embeddings
validate_export() {
  local export_file="$1"
  
  echo "Validating export file: $export_file"
  
  # Check file exists and has content
  if [ ! -s "$export_file" ]; then
    echo "✗ Export file is empty or missing"
    return 1
  fi
  
  # Check MessagePack format
  if tg-dump-msgpack -i "$export_file" --summary > /dev/null 2>&1; then
    echo "✓ Valid MessagePack format"
  else
    echo "✗ Invalid MessagePack format"
    return 1
  fi
  
  # Check for document embeddings
  # grep -c prints 0 itself on no match; "|| true" only guards the exit status
  embedding_count=$(tg-dump-msgpack -i "$export_file" | grep -c '^\["de"' || true)
  
  if [ "$embedding_count" -gt 0 ]; then
    echo "✓ Contains $embedding_count document embeddings"
  else
    echo "✗ No document embeddings found"
    return 1
  fi
  
  # Get vector dimension information
  summary=$(tg-dump-msgpack -i "$export_file" --summary)
  if echo "$summary" | grep -q "Vector dimension:"; then
    dimension=$(echo "$summary" | grep "Vector dimension:" | awk '{print $3}')
    echo "✓ Vector dimension: $dimension"
  else
    echo "⚠ Could not determine vector dimension"
  fi
  
  echo "Validation completed successfully"
}

Export Scheduling

# Scheduled export with cron-like functionality
schedule_export() {
  local schedule="$1"  # e.g., "daily", "hourly", "weekly"
  local output_prefix="$2"
  
  case "$schedule" in
    "hourly")
      interval=3600
      ;;
    "daily")
      interval=86400
      ;;
    "weekly")
      interval=604800
      ;;
    *)
      echo "Invalid schedule: $schedule"
      return 1
      ;;
  esac
  
  echo "Starting $schedule exports with prefix: $output_prefix"
  
  while true; do
    timestamp=$(date +%Y%m%d_%H%M%S)
    output_file="${output_prefix}-${timestamp}.msgpack"
    
    echo "Starting scheduled export: $output_file"
    
    # Run export for the scheduled interval
    timeout ${interval}s tg-save-doc-embeds -o "$output_file"
    
    # Validate and compress
    if validate_export "$output_file"; then
      gzip "$output_file"
      echo "✓ Export completed and compressed: ${output_file}.gz"
    else
      echo "✗ Export validation failed: $output_file"
      mv "$output_file" "${output_file}.failed"
    fi
    
    # Brief pause before next cycle
    sleep 60
  done
}

# Start daily scheduled exports
schedule_export "daily" "daily-embeddings" &

Performance Considerations

Memory Management

# Monitor memory usage during export
monitor_memory_export() {
  local output_file="$1"
  
  # Start export
  tg-save-doc-embeds -o "$output_file" &
  export_pid=$!
  
  echo "Monitoring memory usage for export (PID: $export_pid)..."
  
  while kill -0 "$export_pid" 2>/dev/null; do
    memory_usage=$(ps -p "$export_pid" -o rss= 2>/dev/null | awk '{print $1/1024}')
    
    if [ -n "$memory_usage" ]; then
      echo "Memory usage: ${memory_usage}MB"
    fi
    
    sleep 10
  done
  
  echo "Export completed"
}

Network Optimization

# Optimize for network conditions
network_optimized_export() {
  local output_file="$1"
  local api_url="$2"
  
  echo "Starting network-optimized export..."
  
  # Use compression and buffering
  tg-save-doc-embeds \
    -o "$output_file" \
    -u "$api_url" \
    --format msgpack &  # MessagePack is more compact than JSON
  
  export_pid=$!
  
  # Monitor network usage
  echo "Monitoring export (PID: $export_pid)..."
  
  while kill -0 "$export_pid" 2>/dev/null; do
    # Monitor network connections
    connections=$(netstat -an | grep ":8088" | wc -l)
    echo "Active connections: $connections"
    sleep 30
  done
}

Error Handling

Connection Issues

Exception: WebSocket connection failed

Solution: Check API URL and ensure TrustGraph WebSocket service is running.

Disk Space Issues

Exception: No space left on device

Solution: Free up disk space or use a different output location.

Permission Errors

Exception: Permission denied

Solution: Check write permissions for the output file location.

Memory Issues

MemoryError: Unable to allocate memory

Solution: Monitor memory usage and consider using smaller export windows.

Integration with Other Commands

Complete Backup Workflow

# Complete backup and restore workflow
backup_restore_workflow() {
  local backup_file="embeddings-backup.msgpack"
  
  echo "=== Backup Phase ==="
  
  # Create backup
  tg-save-doc-embeds -o "$backup_file" &
  backup_pid=$!
  
  # Let it run for a while
  sleep 300  # 5 minutes
  kill $backup_pid
  
  echo "Backup created: $backup_file"
  
  # Validate backup
  validate_export "$backup_file"
  
  echo "=== Restore Phase ==="
  
  # Restore from backup (to different collection)
  tg-load-doc-embeds -i "$backup_file" --collection "restored"
  
  echo "Backup and restore workflow completed"
}

Analysis Pipeline

# Export and analyze embeddings
export_analyze_pipeline() {
  local topic="$1"
  local export_file="analysis-${topic}.msgpack"
  
  echo "Starting export and analysis pipeline for: $topic"
  
  # Export embeddings
  tg-save-doc-embeds \
    -o "$export_file" \
    --collection "$topic" &
  
  export_pid=$!
  
  # Run for analysis duration
  sleep 600  # 10 minutes
  kill $export_pid
  
  # Analyze exported data
  echo "Analyzing exported embeddings..."
  tg-dump-msgpack -i "$export_file" --summary
  
  # Count embeddings by user
  echo "Embeddings by user:"
  tg-dump-msgpack -i "$export_file" | \
    jq -r '.[1].m.u' | \
    sort | uniq -c
  
  echo "Analysis pipeline completed"
}

Environment Variables

  • TRUSTGRAPH_API: Default API URL

API Integration

This command uses TrustGraph’s WebSocket API for document embeddings export, specifically the /api/v1/flow/{flow-id}/export/document-embeddings endpoint.
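As a sketch of how a client might derive the stream URL from a configured API URL and flow ID, the hypothetical helper below maps an HTTP API URL onto that endpoint. The endpoint path comes from the documentation above; the http-to-ws scheme mapping is an assumption about the client, and `export_ws_url` is not part of any TrustGraph library.

```python
# Illustrative sketch: build the document-embeddings export WebSocket URL
# from the REST API URL. The http->ws mapping is an assumption; the
# endpoint path is the one documented above.
from urllib.parse import urlsplit, urlunsplit

def export_ws_url(api_url: str, flow_id: str = "default") -> str:
    parts = urlsplit(api_url)
    # Map http -> ws and https -> wss; pass other schemes through unchanged.
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme, parts.scheme)
    path = parts.path.rstrip("/") + f"/api/v1/flow/{flow_id}/export/document-embeddings"
    return urlunsplit((scheme, parts.netloc, path, "", ""))
```

For example, with the default API URL this yields ws://localhost:8088/api/v1/flow/default/export/document-embeddings.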

Best Practices

  1. Start Early: Begin export before processing starts to capture all data
  2. Monitoring: Monitor export progress and file sizes
  3. Validation: Always validate exported files
  4. Compression: Use compression for long-term storage
  5. Rotation: Implement file rotation for continuous exports
  6. Backup: Keep multiple backup copies in different locations
  7. Documentation: Document export schedules and procedures

Troubleshooting

No Data Captured

# Check if processing is generating embeddings
tg-show-flows | grep processing

# Verify WebSocket connection
netstat -an | grep :8088

Large File Issues

# Monitor file growth
watch -n 5 'ls -lh *.msgpack'

# Check available disk space
df -h

Process Management

# List running export processes
ps aux | grep tg-save-doc-embeds

# Kill stuck processes
pkill -f tg-save-doc-embeds