tg-load-sample-documents

Loads predefined sample documents into TrustGraph library for testing and demonstration purposes.

Synopsis

tg-load-sample-documents [options]

Description

The tg-load-sample-documents command loads a curated set of sample documents into TrustGraph’s document library. These documents include academic papers, government reports, and reference materials that demonstrate TrustGraph’s capabilities and provide data for testing and evaluation.

The command downloads documents from public sources and adds them to the library with comprehensive metadata including RDF triples for semantic relationships.

Options

Optional Arguments

-u, --url URL: TrustGraph API URL (default: $TRUSTGRAPH_URL or http://localhost:8088/)
-U, --user USER: User ID for document ownership (default: trustgraph)

Examples

Basic Loading

tg-load-sample-documents

Load with Custom User

tg-load-sample-documents -U "demo-user"

Load to Custom Environment

tg-load-sample-documents -u http://demo.trustgraph.ai:8088/

Sample Documents

The command loads the following sample documents:

1. NASA Challenger Report

Title: Report of the Presidential Commission on the Space Shuttle Challenger Accident, Volume 1
Topics: Safety engineering, space shuttle, NASA
Format: PDF
Source: NASA Technical Reports Server
Use Case: Demonstrates technical document processing and safety analysis

2. Old Icelandic Dictionary

Title: A Concise Dictionary of Old Icelandic
Topics: Language, linguistics, Old Norse, grammar
Format: PDF
Publication: 1910, Clarendon Press
Use Case: Historical document processing and linguistic analysis

3. US Intelligence Threat Assessment

Title: Annual Threat Assessment of the U.S. Intelligence Community - March 2025
Topics: National security, cyberthreats, geopolitics
Format: PDF
Source: Director of National Intelligence
Use Case: Current affairs analysis and security research

4. Intelligence and State Policy

Title: The Role of Intelligence and State Policies in International Security
Topics: Intelligence, international security, state policy
Format: PDF (sample)
Publication: Cambridge Scholars Publishing, 2021
Use Case: Academic research and policy analysis

5. Globalization and Intelligence

Title: Beyond the Vigilant State: Globalisation and Intelligence
Topics: Intelligence, globalization, security studies
Format: PDF
Author: Richard J. Aldrich
Use Case: Academic paper analysis and research

Use Cases

Demo Environment Setup

# Set up demonstration environment
setup_demo_environment() {
  echo "Setting up TrustGraph demo environment..."
  
  # Initialize system
  tg-init-trustgraph
  
  # Load sample documents
  echo "Loading sample documents..."
  tg-load-sample-documents -U "demo"
  
  # Wait for processing
  echo "Waiting for document processing..."
  sleep 60
  
  # Start document processing
  echo "Starting document processing..."
  tg-show-library-documents -U "demo" | \
    grep "| id" | \
    awk '{print $3}' | \
    while read doc_id; do
      proc_id="demo_proc_$(date +%s)_${doc_id}"
      tg-start-library-processing -d "$doc_id" --id "$proc_id" -U "demo"
    done
  
  echo "Demo environment ready!"
  echo "Try: tg-invoke-document-rag -q 'What caused the Challenger accident?' -U demo"
}

Testing Data Pipeline

# Test complete document processing pipeline
test_document_pipeline() {
  echo "Testing document processing pipeline..."
  
  # Load sample documents
  tg-load-sample-documents -U "test"
  
  # List loaded documents
  echo "Loaded documents:"
  tg-show-library-documents -U "test"
  
  # Start processing for each document
  tg-show-library-documents -U "test" | \
    grep "| id" | \
    awk '{print $3}' | \
    while read doc_id; do
      echo "Processing document: $doc_id"
      proc_id="test_$(date +%s)_${doc_id}"
      tg-start-library-processing -d "$doc_id" --id "$proc_id" -U "test"
    done
  
  # Wait for processing
  echo "Processing documents... (this may take several minutes)"
  sleep 300
  
  # Test document queries
  echo "Testing document queries..."
  
  test_queries=(
    "What is the Challenger accident?"
    "What is Old Icelandic?"
    "What are the main cybersecurity threats?"
    "What is intelligence policy?"
  )
  
  for query in "${test_queries[@]}"; do
    echo "Query: $query"
    tg-invoke-document-rag -q "$query" -U "test" | head -5
    echo "---"
  done
  
  echo "Pipeline test complete!"
}

Educational Environment

# Set up educational/training environment
setup_educational_environment() {
  local class_name="$1"
  
  echo "Setting up educational environment for: $class_name"
  
  # Create user for the class
  class_user=$(echo "$class_name" | tr '[:upper:]' '[:lower:]' | tr ' ' '-')
  
  # Load sample documents for the class
  tg-load-sample-documents -U "$class_user"
  
  # Process documents
  echo "Processing documents for educational use..."
  tg-show-library-documents -U "$class_user" | \
    grep "| id" | \
    awk '{print $3}' | \
    while read doc_id; do
      proc_id="edu_$(date +%s)_${doc_id}"
      tg-start-library-processing \
        -d "$doc_id" \
        --id "$proc_id" \
        -U "$class_user" \
        --collection "education"
    done
  
  echo "Educational environment ready for: $class_name"
  echo "User: $class_user"
  echo "Collection: education"
}

# Set up for different classes
setup_educational_environment "AI Research Methods"
setup_educational_environment "Security Studies"

Benchmarking and Performance Testing

# Benchmark document processing performance
benchmark_processing() {
  echo "Starting document processing benchmark..."
  
  # Load sample documents
  start_time=$(date +%s)
  tg-load-sample-documents -U "benchmark"
  load_time=$(date +%s)
  
  echo "Document loading time: $((load_time - start_time))s"
  
  # Count documents
  doc_count=$(tg-show-library-documents -U "benchmark" | grep -c "| id")
  echo "Documents loaded: $doc_count"
  
  # Start processing
  processing_ids=()
  tg-show-library-documents -U "benchmark" | \
    grep "| id" | \
    awk '{print $3}' | \
    while read doc_id; do
      proc_id="bench_$(date +%s)_${doc_id}"
      processing_ids+=("$proc_id")
      tg-start-library-processing -d "$doc_id" --id "$proc_id" -U "benchmark"
    done
  
  processing_start=$(date +%s)
  
  # Monitor processing completion
  echo "Monitoring processing completion..."
  while true; do
    active_processing=$(tg-show-flows | grep -c "bench_" || echo "0")
    
    if [ "$active_processing" -eq 0 ]; then
      break
    fi
    
    echo "Active processing jobs: $active_processing"
    sleep 30
  done
  
  processing_end=$(date +%s)
  
  echo "Processing completion time: $((processing_end - processing_start))s"
  echo "Total benchmark time: $((processing_end - start_time))s"
  
  # Test query performance
  echo "Testing query performance..."
  query_start=$(date +%s)
  
  for i in {1..10}; do
    tg-invoke-document-rag \
      -q "What are the main topics in these documents?" \
      -U "benchmark" > /dev/null
  done
  
  query_end=$(date +%s)
  echo "Average query time: $(echo "scale=2; ($query_end - $query_start) / 10" | bc)s"
}

Advanced Usage

Selective Document Loading

# Load only specific types of documents
load_by_category() {
  local category="$1"
  
  case "$category" in
    "government")
      echo "Loading government documents..."
      # This would require modifying the script to load selectively
      # For now, we load all and filter by tags later
      tg-load-sample-documents -U "gov-docs"
      ;;
    "academic")
      echo "Loading academic documents..."
      tg-load-sample-documents -U "academic-docs"
      ;;
    "historical")
      echo "Loading historical documents..."
      tg-load-sample-documents -U "historical-docs"
      ;;
    *)
      echo "Loading all sample documents..."
      tg-load-sample-documents
      ;;
  esac
}

# Load by category
load_by_category "government"
load_by_category "academic"

Multi-Environment Loading

# Load sample documents to multiple environments
multi_environment_setup() {
  local environments=("dev" "staging" "demo")
  
  for env in "${environments[@]}"; do
    echo "Setting up $env environment..."
    
    tg-load-sample-documents \
      -u "http://$env.trustgraph.company.com:8088/" \
      -U "sample-data"
    
    echo "✓ $env environment loaded"
  done
  
  echo "All environments loaded with sample documents"
}

Custom Document Sets

# Create custom document loading scripts based on the sample
create_custom_loader() {
  local domain="$1"
  
  cat > "load-${domain}-documents.py" << 'EOF'
#!/usr/bin/env python3
"""
Custom document loader for specific domain
Based on tg-load-sample-documents
"""

import argparse
import os
from trustgraph.api import Api

# Define your own document set here
documents = [
    {
        "id": "https://example.com/doc/custom-1",
        "title": "Custom Document 1",
        "url": "https://example.com/docs/custom1.pdf",
        # Add your document definitions...
    }
]

# Rest of the implementation similar to tg-load-sample-documents
EOF
  
  echo "Custom loader created: load-${domain}-documents.py"
}

# Create custom loaders for different domains
create_custom_loader "medical"
create_custom_loader "legal"
create_custom_loader "technical"

Document Analysis

Content Analysis

# Analyze loaded sample documents
analyze_sample_documents() {
  echo "Analyzing sample documents..."
  
  # Get document statistics
  total_docs=$(tg-show-library-documents | grep -c "| id")
  echo "Total documents: $total_docs"
  
  # Analyze by type
  echo "Document types:"
  tg-show-library-documents | \
    grep "| kind" | \
    awk '{print $3}' | \
    sort | uniq -c
  
  # Analyze tags
  echo "Popular tags:"
  tg-show-library-documents | \
    grep "| tags" | \
    sed 's/.*| tags.*| \(.*\) |.*/\1/' | \
    tr ',' '\n' | \
    sed 's/^ *//;s/ *$//' | \
    sort | uniq -c | sort -nr | head -10
  
  # Document sizes (would need additional API)
  echo "Document analysis complete"
}

Query Testing

# Test sample documents with various queries
test_sample_queries() {
  echo "Testing sample document queries..."
  
  # Define test queries for different domains
  queries=(
    "What caused the Challenger space shuttle accident?"
    "What is Old Norse language?"
    "What are current cybersecurity threats?"
    "How does globalization affect intelligence services?"
    "What are the main security challenges in international relations?"
  )
  
  for query in "${queries[@]}"; do
    echo "Testing query: $query"
    echo "===================="
    
    result=$(tg-invoke-document-rag -q "$query" 2>/dev/null)
    
    if [ $? -eq 0 ]; then
      echo "$result" | head -3
      echo "✓ Query successful"
    else
      echo "✗ Query failed"
    fi
    
    echo ""
  done
}

Error Handling

Network Issues

Exception: Connection failed during download

Solution: Check internet connectivity and retry. Documents are cached locally after first download.

Insufficient Storage

Exception: No space left on device

Solution: Free up disk space. Sample documents total approximately 50-100MB.

API Connection Issues

Exception: Connection refused

Solution: Verify TrustGraph API is running and accessible.

Processing Failures

Exception: Document processing failed

Solution: Check TrustGraph service logs and ensure all components are running.

Monitoring and Validation

Loading Progress

# Monitor sample document loading
monitor_sample_loading() {
  echo "Starting sample document loading with monitoring..."
  
  # Start loading in background
  tg-load-sample-documents &
  load_pid=$!
  
  # Monitor progress
  while kill -0 $load_pid 2>/dev/null; do
    doc_count=$(tg-show-library-documents 2>/dev/null | grep -c "| id" || echo "0")
    echo "Documents loaded so far: $doc_count"
    sleep 10
  done
  
  wait $load_pid
  
  if [ $? -eq 0 ]; then
    final_count=$(tg-show-library-documents | grep -c "| id")
    echo "✓ Loading completed successfully"
    echo "Total documents loaded: $final_count"
  else
    echo "✗ Loading failed"
  fi
}

Validation

# Validate sample document loading
validate_sample_loading() {
  echo "Validating sample document loading..."
  
  # Expected document count (based on current sample set)
  expected_docs=5
  
  # Check actual count
  actual_docs=$(tg-show-library-documents | grep -c "| id")
  
  if [ "$actual_docs" -eq "$expected_docs" ]; then
    echo "✓ Document count correct: $actual_docs"
  else
    echo "⚠ Document count mismatch: expected $expected_docs, got $actual_docs"
  fi
  
  # Check for expected documents
  expected_titles=(
    "Challenger"
    "Icelandic"
    "Intelligence"
    "Threat Assessment"
    "Vigilant State"
  )
  
  for title in "${expected_titles[@]}"; do
    if tg-show-library-documents | grep -q "$title"; then
      echo "✓ Found document containing: $title"
    else
      echo "✗ Missing document containing: $title"
    fi
  done
  
  echo "Validation complete"
}

Environment Variables

TRUSTGRAPH_URL: Default API URL

tg-show-library-documents - List loaded documents
tg-start-library-processing - Process loaded documents
tg-invoke-document-rag - Query processed documents
tg-load-pdf - Load individual PDF documents

API Integration

This command uses the Library API to add sample documents to TrustGraph’s document repository.

Best Practices

Demo Preparation: Use for setting up demonstration environments
Testing: Ideal for testing document processing pipelines
Education: Excellent for training and educational purposes
Development: Use in development environments for consistent test data
Benchmarking: Suitable for performance testing and optimization
Documentation: Great for documenting TrustGraph capabilities

Troubleshooting

Download Failures

# Check document URLs are accessible
curl -I "https://ntrs.nasa.gov/api/citations/19860015255/downloads/19860015255.pdf"

# Check local cache
ls -la doc-cache/

Processing Issues

# Check document processing status
tg-show-library-processing

# Verify documents are in library
tg-show-library-documents | grep -E "(Challenger|Icelandic|Intelligence)"

Performance Problems

# Monitor system resources during loading
top
df -h