tg-load-pdf

Loads PDF documents into TrustGraph for processing and analysis.

Synopsis

tg-load-pdf [options] file1.pdf [file2.pdf ...]

Description

The tg-load-pdf command loads PDF documents into TrustGraph by directing them to the PDF decoder service. The command extracts content, generates document metadata, and makes the documents available for processing by other TrustGraph services.

Each PDF is assigned a unique identifier based on its content hash, and comprehensive metadata can be attached including copyright information, publication details, and keywords.

Note: Consider using tg-add-library-document followed by tg-start-library-processing for more comprehensive document management.

Options

Required Arguments

  • files: One or more PDF files to load

Optional Arguments

  • -u, --url URL: TrustGraph API URL (default: $TRUSTGRAPH_URL or http://localhost:8088/)
  • -f, --flow-id ID: Flow instance ID to use (default: default)
  • -U, --user USER: User ID for document ownership (default: trustgraph)
  • -C, --collection COLLECTION: Collection to assign document (default: default)

Document Metadata

  • --name NAME: Document name/title
  • --description DESCRIPTION: Document description
  • --identifier ID: Custom document identifier
  • --document-url URL: Source URL for the document
  • --keyword KEYWORD: Document keywords (can be specified multiple times)
  • --copyright-notice NOTICE: Copyright notice text
  • --copyright-holder HOLDER: Copyright holder name
  • --copyright-year YEAR: Copyright year
  • --license LICENSE: Copyright license

Publication Details

  • --publication-organization ORG: Publishing organization
  • --publication-description DESC: Publication description
  • --publication-date DATE: Publication date

Examples

Basic PDF Loading

tg-load-pdf document.pdf

Multiple Files

tg-load-pdf report1.pdf report2.pdf manual.pdf

With Basic Metadata

tg-load-pdf \
  --name "Technical Manual" \
  --description "System administration guide" \
  --keyword "technical" --keyword "manual" \
  technical-manual.pdf

Complete Metadata

tg-load-pdf \
  --name "Annual Report 2023" \
  --description "Company annual financial report" \
  --copyright-holder "Acme Corporation" \
  --copyright-year "2023" \
  --license "All Rights Reserved" \
  --publication-organization "Acme Corporation" \
  --publication-date "2023-12-31" \
  --keyword "financial" --keyword "annual" --keyword "report" \
  annual-report-2023.pdf

Custom Flow and Collection

tg-load-pdf \
  -f "document-processing-flow" \
  -U "finance-team" \
  -C "financial-documents" \
  --name "Budget Analysis" \
  budget-2024.pdf

Document Processing

Content Extraction

The PDF loader:

  1. Calculates SHA256 hash for unique document ID
  2. Extracts text content from PDF
  3. Preserves document structure and formatting metadata
  4. Generates searchable text index

Metadata Generation

Document metadata includes:

  • Document ID: SHA256 hash-based unique identifier
  • Content Hash: For duplicate detection
  • File Information: Size, format, creation date
  • Custom Metadata: User-provided attributes

Integration with Processing Pipeline

# Load PDF and start processing
tg-load-pdf research-paper.pdf --name "AI Research Paper"

# Check processing status
tg-show-flows | grep "document-processing"

# Query loaded content
tg-invoke-document-rag -q "What is the main conclusion?" -C "default"

Error Handling

File Not Found

Exception: [Errno 2] No such file or directory: 'missing.pdf'

Solution: Verify file path and ensure PDF exists.

Invalid PDF Format

Exception: PDF parsing failed: Invalid PDF structure

Solution: Verify PDF is not corrupted and is a valid PDF file.

Permission Errors

Exception: [Errno 13] Permission denied: 'protected.pdf'

Solution: Check file permissions and ensure read access.

Flow Not Found

Exception: Flow instance 'invalid-flow' not found

Solution: Verify flow ID exists with tg-show-flows.

API Connection Issues

Exception: Connection refused

Solution: Check API URL and ensure TrustGraph is running.

Advanced Usage

Batch Processing

# Process all PDFs in directory
for pdf in *.pdf; do
  echo "Loading $pdf..."
  tg-load-pdf \
    --name "$(basename "$pdf" .pdf)" \
    --collection "research-papers" \
    "$pdf"
done

Organized Loading

# Load with structured metadata
categories=("technical" "financial" "legal")
for category in "${categories[@]}"; do
  for pdf in "$category"/*.pdf; do
    if [ -f "$pdf" ]; then
      tg-load-pdf \
        --collection "$category-documents" \
        --keyword "$category" \
        --name "$(basename "$pdf" .pdf)" \
        "$pdf"
    fi
  done
done

CSV-Driven Loading

# Load PDFs with metadata from CSV
# Format: filename,title,description,keywords
while IFS=',' read -r filename title description keywords; do
  if [ -f "$filename" ]; then
    echo "Loading $filename..."
    
    # Convert comma-separated keywords to multiple --keyword args
    keyword_args=""
    IFS='|' read -ra KEYWORDS <<< "$keywords"
    for kw in "${KEYWORDS[@]}"; do
      keyword_args="$keyword_args --keyword \"$kw\""
    done
    
    eval "tg-load-pdf \
      --name \"$title\" \
      --description \"$description\" \
      $keyword_args \
      \"$filename\""
  fi
done < documents.csv

Publication Processing

# Load academic papers with publication details
load_academic_paper() {
    local file="$1"
    local title="$2"
    local authors="$3"
    local journal="$4"
    local year="$5"
    
    tg-load-pdf \
      --name "$title" \
      --description "Academic paper: $title" \
      --copyright-holder "$authors" \
      --copyright-year "$year" \
      --publication-organization "$journal" \
      --publication-date "$year-01-01" \
      --keyword "academic" --keyword "research" \
      "$file"
}

# Usage
load_academic_paper "ai-paper.pdf" "AI in Healthcare" "Smith et al." "AI Journal" "2023"

Monitoring and Validation

Load Status Checking

# Check document loading progress
check_load_status() {
    local file="$1"
    local expected_name="$2"
    
    echo "Checking load status for: $file"
    
    # Check if document appears in library
    if tg-show-library-documents | grep -q "$expected_name"; then
        echo "✓ Document loaded successfully"
    else
        echo "✗ Document not found in library"
        return 1
    fi
}

# Monitor batch loading
for pdf in *.pdf; do
    name=$(basename "$pdf" .pdf)
    check_load_status "$pdf" "$name"
done

Content Verification

# Verify PDF content is accessible
verify_pdf_content() {
    local pdf_name="$1"
    local test_query="$2"
    
    echo "Verifying content for: $pdf_name"
    
    # Try to query the document
    result=$(tg-invoke-document-rag -q "$test_query" -C "default" 2>/dev/null)
    
    if [ $? -eq 0 ] && [ -n "$result" ]; then
        echo "✓ Content accessible via RAG"
    else
        echo "✗ Content not accessible"
        return 1
    fi
}

# Verify loaded documents
verify_pdf_content "Technical Manual" "What is the installation process?"

Performance Optimization

Parallel Loading

# Load multiple PDFs in parallel
pdf_files=(document1.pdf document2.pdf document3.pdf)
for pdf in "${pdf_files[@]}"; do
    (
        echo "Loading $pdf in background..."
        tg-load-pdf \
          --name "$(basename "$pdf" .pdf)" \
          --collection "batch-$(date +%Y%m%d)" \
          "$pdf"
    ) &
done
wait
echo "All PDFs loaded"

Size-Based Processing

# Process files based on size
for pdf in *.pdf; do
    size=$(stat -c%s "$pdf")
    if [ $size -lt 10485760 ]; then  # < 10MB
        echo "Processing small file: $pdf"
        tg-load-pdf --collection "small-docs" "$pdf"
    else
        echo "Processing large file: $pdf"
        tg-load-pdf --collection "large-docs" "$pdf"
    fi
done

Document Organization

Collection Management

# Organize by document type
organize_by_type() {
    local pdf="$1"
    local filename=$(basename "$pdf" .pdf)
    
    case "$filename" in
        *manual*|*guide*) collection="manuals" ;;
        *report*|*analysis*) collection="reports" ;;
        *spec*|*specification*) collection="specifications" ;;
        *legal*|*contract*) collection="legal" ;;
        *) collection="general" ;;
    esac
    
    tg-load-pdf \
      --collection "$collection" \
      --name "$filename" \
      "$pdf"
}

# Process all PDFs
for pdf in *.pdf; do
    organize_by_type "$pdf"
done

Metadata Standardization

# Apply consistent metadata standards
standardize_metadata() {
    local pdf="$1"
    local dept="$2"
    local year="$3"
    
    local name=$(basename "$pdf" .pdf)
    local collection="$dept-$(date +%Y)"
    
    tg-load-pdf \
      --name "$name" \
      --description "$dept document from $year" \
      --copyright-holder "Company Name" \
      --copyright-year "$year" \
      --collection "$collection" \
      --keyword "$dept" --keyword "$year" \
      "$pdf"
}

# Usage
standardize_metadata "finance-report.pdf" "finance" "2023"

Integration with Other Services

Library Integration

# Alternative approach using library services
load_via_library() {
    local pdf="$1"
    local name="$2"
    
    # Add to library first
    tg-add-library-document \
      --name "$name" \
      --file "$pdf" \
      --collection "documents"
    
    # Start processing
    tg-start-library-processing \
      --collection "documents"
}

Workflow Integration

# Complete document workflow
process_document_workflow() {
    local pdf="$1"
    local name="$2"
    
    echo "Starting document workflow for: $name"
    
    # 1. Load PDF
    tg-load-pdf --name "$name" "$pdf"
    
    # 2. Wait for processing
    sleep 5
    
    # 3. Verify availability
    if tg-show-library-documents | grep -q "$name"; then
        echo "Document available in library"
        
        # 4. Test RAG functionality
        tg-invoke-document-rag -q "What is this document about?"
        
        # 5. Extract key information
        tg-invoke-prompt extract-key-points \
          text="Document: $name" \
          format="bullet_points"
    else
        echo "Document processing failed"
    fi
}

Environment Variables

  • TRUSTGRAPH_URL: Default API URL

API Integration

This command uses the document loading API to process PDF files and make them available for text extraction, search, and analysis.

Best Practices

  1. Metadata Completeness: Provide comprehensive metadata for better organization
  2. Collection Organization: Use logical collections for document categorization
  3. Error Handling: Implement robust error handling for batch operations
  4. Performance: Consider file sizes and processing capacity
  5. Monitoring: Verify successful loading and processing
  6. Security: Ensure sensitive documents are properly protected
  7. Backup: Maintain backups of source PDFs

Troubleshooting

PDF Processing Issues

# Check PDF validity
file document.pdf
pdfinfo document.pdf

# Try alternative PDF processors
qpdf --check document.pdf

Memory Issues

# For large PDFs, monitor memory usage
free -h
# Consider processing large files separately

Content Extraction Problems

# Verify PDF contains extractable text
pdftotext document.pdf test-output.txt
cat test-output.txt | head -20