tg-load-pdf
Loads PDF documents into TrustGraph for processing and analysis.
Synopsis
tg-load-pdf [options] file1.pdf [file2.pdf ...]
Description
The tg-load-pdf command loads PDF documents into TrustGraph by directing them to the PDF decoder service. The command extracts content, generates document metadata, and makes the documents available for processing by other TrustGraph services.
Each PDF is assigned a unique identifier based on its content hash, and comprehensive metadata can be attached including copyright information, publication details, and keywords.
Note: Consider using tg-add-library-document followed by tg-start-library-processing for more comprehensive document management.
Options
Required Arguments
files
: One or more PDF files to load
Optional Arguments
-u, --url URL
: TrustGraph API URL (default: $TRUSTGRAPH_URL or http://localhost:8088/)
-f, --flow-id ID
: Flow instance ID to use (default: default)
-U, --user USER
: User ID for document ownership (default: trustgraph)
-C, --collection COLLECTION
: Collection to assign document (default: default)
Document Metadata
--name NAME
: Document name/title
--description DESCRIPTION
: Document description
--identifier ID
: Custom document identifier
--document-url URL
: Source URL for the document
--keyword KEYWORD
: Document keywords (can be specified multiple times)
Copyright Information
--copyright-notice NOTICE
: Copyright notice text
--copyright-holder HOLDER
: Copyright holder name
--copyright-year YEAR
: Copyright year
--license LICENSE
: Copyright license
Publication Details
--publication-organization ORG
: Publishing organization
--publication-description DESC
: Publication description
--publication-date DATE
: Publication date
Examples
Basic PDF Loading
tg-load-pdf document.pdf
Multiple Files
tg-load-pdf report1.pdf report2.pdf manual.pdf
With Basic Metadata
tg-load-pdf \
    --name "Technical Manual" \
    --description "System administration guide" \
    --keyword "technical" --keyword "manual" \
    technical-manual.pdf
Complete Metadata
tg-load-pdf \
    --name "Annual Report 2023" \
    --description "Company annual financial report" \
    --copyright-holder "Acme Corporation" \
    --copyright-year "2023" \
    --license "All Rights Reserved" \
    --publication-organization "Acme Corporation" \
    --publication-date "2023-12-31" \
    --keyword "financial" --keyword "annual" --keyword "report" \
    annual-report-2023.pdf
Custom Flow and Collection
tg-load-pdf \
    -f "document-processing-flow" \
    -U "finance-team" \
    -C "financial-documents" \
    --name "Budget Analysis" \
    budget-2024.pdf
Document Processing
Content Extraction
The PDF loader:
- Calculates SHA256 hash for unique document ID
- Extracts text content from PDF
- Preserves document structure and formatting metadata
- Generates searchable text index
Metadata Generation
Document metadata includes:
- Document ID: SHA256 hash-based unique identifier
- Content Hash: For duplicate detection
- File Information: Size, format, creation date
- Custom Metadata: User-provided attributes
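Because document IDs are derived from a SHA256 content hash, files with identical bytes can be spotted locally before anything is uploaded. The following is a minimal sketch under stated assumptions: `content_hash` and `find_duplicates` are hypothetical helper names, and the digest computed here is the plain SHA256 of the file bytes, which may not match the exact identifier format TrustGraph stores.

```shell
# Hypothetical helper: print the SHA256 hex digest of a file's bytes.
# Falls back to shasum where sha256sum is not installed (e.g. macOS).
content_hash() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum < "$1" | cut -d ' ' -f 1
    else
        shasum -a 256 < "$1" | cut -d ' ' -f 1
    fi
}

# Print every file whose content already appeared earlier in the argument list.
find_duplicates() {
    local seen f h
    seen=$(mktemp) || return 1
    for f in "$@"; do
        [ -r "$f" ] || continue          # skip missing or unreadable files
        h=$(content_hash "$f")
        if grep -qxF -- "$h" "$seen"; then
            echo "$f"                    # same bytes as an earlier file
        else
            echo "$h" >> "$seen"
        fi
    done
    rm -f "$seen"
}
```

Running `find_duplicates *.pdf` before a batch load lists the files that could be skipped.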
Integration with Processing Pipeline
# Load PDF and start processing
tg-load-pdf research-paper.pdf --name "AI Research Paper"
# Check processing status
tg-show-flows | grep "document-processing"
# Query loaded content
tg-invoke-document-rag -q "What is the main conclusion?" -C "default"
Error Handling
File Not Found
Exception: [Errno 2] No such file or directory: 'missing.pdf'
Solution: Verify file path and ensure PDF exists.
Invalid PDF Format
Exception: PDF parsing failed: Invalid PDF structure
Solution: Verify PDF is not corrupted and is a valid PDF file.
Permission Errors
Exception: [Errno 13] Permission denied: 'protected.pdf'
Solution: Check file permissions and ensure read access.
Flow Not Found
Exception: Flow instance 'invalid-flow' not found
Solution: Verify that the flow ID exists with tg-show-flows.
API Connection Issues
Exception: Connection refused
Solution: Check API URL and ensure TrustGraph is running.
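Several of the errors above (missing file, unreadable file, invalid PDF) can be caught locally before the loader is called. The sketch below is one possible pre-flight check. `preflight_pdf` is a hypothetical helper, and its return codes are chosen for illustration only. It checks just the `%PDF-` signature at the start of the file, so it will not detect deeper structural corruption.

```shell
# Hypothetical pre-flight check; return codes are illustrative:
#   0 = looks loadable, 2 = not found, 13 = not readable, 1 = not a PDF
preflight_pdf() {
    local f="$1"
    [ -f "$f" ] || { echo "not found: $f" >&2; return 2; }
    [ -r "$f" ] || { echo "not readable: $f" >&2; return 13; }
    # A PDF file starts with the five-byte signature "%PDF-"
    [ "$(head -c 5 "$f")" = "%PDF-" ] || { echo "not a PDF: $f" >&2; return 1; }
}
```

A wrapper script might call `preflight_pdf "$pdf" && tg-load-pdf "$pdf"` so that obviously bad inputs never reach the API.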
Advanced Usage
Batch Processing
# Process all PDFs in directory
for pdf in *.pdf; do
    echo "Loading $pdf..."
    tg-load-pdf \
        --name "$(basename "$pdf" .pdf)" \
        --collection "research-papers" \
        "$pdf"
done
Organized Loading
# Load with structured metadata
categories=("technical" "financial" "legal")
for category in "${categories[@]}"; do
    for pdf in "$category"/*.pdf; do
        if [ -f "$pdf" ]; then
            tg-load-pdf \
                --collection "$category-documents" \
                --keyword "$category" \
                --name "$(basename "$pdf" .pdf)" \
                "$pdf"
        fi
    done
done
CSV-Driven Loading
# Load PDFs with metadata from CSV
# Format: filename,title,description,keywords
# Keywords are separated by '|'; fields must not contain commas.
while IFS=',' read -r filename title description keywords; do
    if [ -f "$filename" ]; then
        echo "Loading $filename..."
        # Convert pipe-separated keywords to repeated --keyword arguments
        keyword_args=()
        IFS='|' read -ra KEYWORDS <<< "$keywords"
        for kw in "${KEYWORDS[@]}"; do
            keyword_args+=(--keyword "$kw")
        done
        # Pass the arguments as an array; this avoids eval and its quoting hazards
        tg-load-pdf \
            --name "$title" \
            --description "$description" \
            "${keyword_args[@]}" \
            "$filename"
    fi
done < documents.csv
Publication Processing
# Load academic papers with publication details
load_academic_paper() {
    local file="$1"
    local title="$2"
    local authors="$3"
    local journal="$4"
    local year="$5"
    tg-load-pdf \
        --name "$title" \
        --description "Academic paper: $title" \
        --copyright-holder "$authors" \
        --copyright-year "$year" \
        --publication-organization "$journal" \
        --publication-date "$year-01-01" \
        --keyword "academic" --keyword "research" \
        "$file"
}
# Usage
load_academic_paper "ai-paper.pdf" "AI in Healthcare" "Smith et al." "AI Journal" "2023"
Monitoring and Validation
Load Status Checking
# Check document loading progress
check_load_status() {
    local file="$1"
    local expected_name="$2"
    echo "Checking load status for: $file"
    # Check if document appears in library
    if tg-show-library-documents | grep -q "$expected_name"; then
        echo "✓ Document loaded successfully"
    else
        echo "✗ Document not found in library"
        return 1
    fi
}
# Monitor batch loading
for pdf in *.pdf; do
    name=$(basename "$pdf" .pdf)
    check_load_status "$pdf" "$name"
done
Content Verification
# Verify PDF content is accessible
verify_pdf_content() {
    local pdf_name="$1"
    local test_query="$2"
    echo "Verifying content for: $pdf_name"
    # Try to query the document
    result=$(tg-invoke-document-rag -q "$test_query" -C "default" 2>/dev/null)
    if [ $? -eq 0 ] && [ -n "$result" ]; then
        echo "✓ Content accessible via RAG"
    else
        echo "✗ Content not accessible"
        return 1
    fi
}
# Verify loaded documents
verify_pdf_content "Technical Manual" "What is the installation process?"
Performance Optimization
Parallel Loading
# Load multiple PDFs in parallel
pdf_files=(document1.pdf document2.pdf document3.pdf)
for pdf in "${pdf_files[@]}"; do
    (
        echo "Loading $pdf in background..."
        tg-load-pdf \
            --name "$(basename "$pdf" .pdf)" \
            --collection "batch-$(date +%Y%m%d)" \
            "$pdf"
    ) &
done
wait
echo "All PDFs loaded"
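The loop above starts one background job per file, which can overwhelm the API when the file list is long. One way to cap concurrency is `xargs -P`, sketched below. `load_bounded`, `MAX_JOBS`, and `LOADER` are illustrative names and not part of tg-load-pdf; `LOADER` exists only so the function can be dry-run with another command. The `-0` and `-P` options are available in GNU and BSD xargs but are not required by POSIX.

```shell
# Sketch: load PDFs with a cap on concurrent uploads.
# Usage: load_bounded COLLECTION file1.pdf [file2.pdf ...]
load_bounded() {
    local collection="$1"
    shift
    [ "$#" -gt 0 ] || return 0           # nothing to load
    # NUL-delimited names survive spaces; -P limits parallel processes.
    printf '%s\0' "$@" |
        xargs -0 -n 1 -P "${MAX_JOBS:-4}" \
            "${LOADER:-tg-load-pdf}" --collection "$collection"
}

# Dry run: print the arguments each invocation would receive.
LOADER=echo MAX_JOBS=2 load_bounded "batch-docs" "report one.pdf" "report2.pdf"
```

Because jobs finish in arbitrary order, output lines from different files may appear in any sequence.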
Size-Based Processing
# Process files based on size
for pdf in *.pdf; do
    # wc -c is portable; stat -c%s works only with GNU stat
    size=$(wc -c < "$pdf" | tr -d '[:space:]')
    if [ "$size" -lt 10485760 ]; then # < 10 MB
        echo "Processing small file: $pdf"
        tg-load-pdf --collection "small-docs" "$pdf"
    else
        echo "Processing large file: $pdf"
        tg-load-pdf --collection "large-docs" "$pdf"
    fi
done
Document Organization
Collection Management
# Organize by document type
organize_by_type() {
    local pdf="$1"
    local filename=$(basename "$pdf" .pdf)
    case "$filename" in
        *manual*|*guide*) collection="manuals" ;;
        *report*|*analysis*) collection="reports" ;;
        *spec*|*specification*) collection="specifications" ;;
        *legal*|*contract*) collection="legal" ;;
        *) collection="general" ;;
    esac
    tg-load-pdf \
        --collection "$collection" \
        --name "$filename" \
        "$pdf"
}
# Process all PDFs
for pdf in *.pdf; do
    organize_by_type "$pdf"
done
Metadata Standardization
# Apply consistent metadata standards
standardize_metadata() {
    local pdf="$1"
    local dept="$2"
    local year="$3"
    local name=$(basename "$pdf" .pdf)
    local collection="$dept-$(date +%Y)"
    tg-load-pdf \
        --name "$name" \
        --description "$dept document from $year" \
        --copyright-holder "Company Name" \
        --copyright-year "$year" \
        --collection "$collection" \
        --keyword "$dept" --keyword "$year" \
        "$pdf"
}
# Usage
standardize_metadata "finance-report.pdf" "finance" "2023"
Integration with Other Services
Library Integration
# Alternative approach using library services
load_via_library() {
    local pdf="$1"
    local name="$2"
    # Add to library first
    tg-add-library-document \
        --name "$name" \
        --file "$pdf" \
        --collection "documents"
    # Start processing
    tg-start-library-processing \
        --collection "documents"
}
Workflow Integration
# Complete document workflow
process_document_workflow() {
    local pdf="$1"
    local name="$2"
    echo "Starting document workflow for: $name"
    # 1. Load PDF
    tg-load-pdf --name "$name" "$pdf"
    # 2. Wait for processing
    sleep 5
    # 3. Verify availability
    if tg-show-library-documents | grep -q "$name"; then
        echo "Document available in library"
        # 4. Test RAG functionality
        tg-invoke-document-rag -q "What is this document about?"
        # 5. Extract key information
        tg-invoke-prompt extract-key-points \
            text="Document: $name" \
            format="bullet_points"
    else
        echo "Document processing failed"
    fi
}
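The fixed `sleep 5` in the workflow above is a guess at how long processing takes. A polling loop with a bounded number of attempts is usually more reliable. The sketch below shows one way to write it. `wait_until` and `document_listed` are hypothetical helpers, and the attempt count and delay are arbitrary values to tune for your deployment.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, DELAY seconds
# apart, and succeed as soon as the command succeeds.
# Usage: wait_until ATTEMPTS DELAY command [args...]
wait_until() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        # Do not sleep after the final attempt
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Succeeds once NAME appears in the library listing (fixed-string match).
document_listed() {
    tg-show-library-documents | grep -qF -- "$1"
}

# Example (not run here): poll for up to a minute instead of a fixed sleep.
# wait_until 12 5 document_listed "Technical Manual"
```

Using `grep -F` treats the document name as a literal string, so names containing characters such as `.` or `[` are matched exactly.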
Environment Variables
TRUSTGRAPH_URL
: Default API URL
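Wrapper scripts that call several TrustGraph commands may want to resolve the API URL once, using the same precedence documented for -u/--url. The sketch below assumes that an explicit value wins over TRUSTGRAPH_URL, which in turn wins over the built-in default. `resolve_api_url` is a hypothetical helper, not part of the TrustGraph CLI.

```shell
# Hypothetical helper: print the API URL a wrapper script should use.
# Precedence: explicit argument, then $TRUSTGRAPH_URL, then the documented default.
resolve_api_url() {
    printf '%s\n' "${1:-${TRUSTGRAPH_URL:-http://localhost:8088/}}"
}

url=$(resolve_api_url)
echo "Using TrustGraph API at: $url"
```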
Related Commands
- tg-add-library-document: Add documents to library
- tg-start-library-processing: Process library documents
- tg-show-library-documents: List library documents
- tg-invoke-document-rag: Query document content
- tg-show-flows: Monitor processing flows
API Integration
This command uses the document loading API to process PDF files and make them available for text extraction, search, and analysis.
Best Practices
- Metadata Completeness: Provide comprehensive metadata for better organization
- Collection Organization: Use logical collections for document categorization
- Error Handling: Implement robust error handling for batch operations
- Performance: Consider file sizes and processing capacity
- Monitoring: Verify successful loading and processing
- Security: Ensure sensitive documents are properly protected
- Backup: Maintain backups of source PDFs
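As a sketch of the error-handling practice for batch operations, the function below keeps going after a failed file and reports a summary at the end. It assumes that tg-load-pdf exits with a non-zero status on failure, which should be verified for your version. `load_batch` and the `LOADER` override are illustrative names used only to make the function easy to dry-run.

```shell
# Sketch: load each file, continue past failures, and summarize at the end.
# Returns 0 only if every file loaded successfully.
load_batch() {
    local failed=0 pdf
    for pdf in "$@"; do
        if "${LOADER:-tg-load-pdf}" "$pdf"; then
            echo "ok: $pdf"
        else
            echo "FAILED: $pdf" >&2
            failed=$((failed + 1))
        fi
    done
    echo "done: $# file(s), $failed failure(s)"
    [ "$failed" -eq 0 ]
}
```

Calling `load_batch *.pdf || exit 1` in a script makes the overall run fail visibly when any single file could not be loaded.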
Troubleshooting
PDF Processing Issues
# Check PDF validity
file document.pdf
pdfinfo document.pdf
# Try alternative PDF processors
qpdf --check document.pdf
Memory Issues
# For large PDFs, monitor memory usage
free -h
# Consider processing large files separately
Content Extraction Problems
# Verify PDF contains extractable text
pdftotext document.pdf test-output.txt
head -n 20 test-output.txt