tg-start-library-processing
Submits a library document for processing through TrustGraph workflows.
Synopsis
tg-start-library-processing -d DOCUMENT_ID --id PROCESSING_ID [options]
Description
The tg-start-library-processing command initiates processing of a document stored in TrustGraph's document library. This triggers workflows that can extract text, generate embeddings, create knowledge graphs, and enable document search and analysis.
Each processing job is assigned a unique processing ID for tracking and management purposes.
Options
Required Arguments
-d, --document-id ID
: Document ID from the library to process
--id, --processing-id ID
: Unique identifier for this processing job
Optional Arguments
-u, --api-url URL
: TrustGraph API URL (default: $TRUSTGRAPH_URL or http://localhost:8088/)
-U, --user USER
: User ID for processing context (default: trustgraph)
-i, --flow-id ID
: Flow instance to use for processing (default: default)
--collection COLLECTION
: Collection to assign processed data (default: default)
--tags TAGS
: Comma-separated tags for the processing job
Examples
Basic Document Processing
tg-start-library-processing -d "doc_123456789" --id "proc_001"
Processing with Custom Collection
tg-start-library-processing \
    -d "research_paper_456" \
    --id "research_proc_001" \
    --collection "research-papers" \
    --tags "nlp,research,2023"
Processing with Specific Flow
tg-start-library-processing \
    -d "technical_manual" \
    --id "manual_proc_001" \
    -i "document-analysis-flow" \
    -U "technical-team" \
    --collection "technical-docs"
Processing Multiple Documents
# Process several documents in sequence
documents=("doc_001" "doc_002" "doc_003")
for i in "${!documents[@]}"; do
    doc_id="${documents[$i]}"
    proc_id="batch_proc_$(printf %03d $((i+1)))"
    echo "Processing document: $doc_id"
    tg-start-library-processing \
        -d "$doc_id" \
        --id "$proc_id" \
        --collection "batch-processing" \
        --tags "batch,automated"
done
Processing Workflow
Document Processing Steps
- Document Retrieval: Fetch document from library
- Content Extraction: Extract text and metadata
- Text Processing: Clean and normalize content
- Embedding Generation: Create vector embeddings
- Knowledge Extraction: Generate triples and entities
- Index Creation: Make content searchable
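No dedicated step-by-step status command is shown here; a rough way to confirm that the final indexing step has completed is to query the processed content, as the later examples in this page do. A minimal sketch, where the document ID, processing ID, and query text are placeholder values:
# Sketch: confirm a processed document has become searchable.
# "doc_123456789", "proc_001", and the query text are placeholders.
tg-start-library-processing -d "doc_123456789" --id "proc_001"
sleep 60   # allow the flow time to work through the steps above
if tg-invoke-document-rag -q "test" 2>/dev/null | grep -q "doc_123456789"; then
    echo "Document content is searchable"
else
    echo "Processing may still be in progress"
fi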
Processing Types
Different document types may trigger different processing workflows:
- PDF Documents: Text extraction, OCR if needed
- Text Files: Direct text processing
- Images: OCR and image analysis
- Structured Data: Schema extraction and mapping
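Because the appropriate workflow varies by document type, it can help to pick a flow instance per file extension before submitting. A minimal sketch; the flow IDs used here (pdf-ocr-flow, text-flow, image-ocr-flow) are hypothetical and would have to exist in your deployment (check with tg-show-flows):
# Choose a flow instance based on file type before submitting.
# The flow names below are placeholder examples, not built-in flows.
filename="quarterly_report.pdf"
doc_id="doc_123456789"
proc_id="typed_proc_001"
case "$filename" in
    *.pdf)              flow="pdf-ocr-flow" ;;
    *.txt|*.md)         flow="text-flow" ;;
    *.png|*.jpg|*.tiff) flow="image-ocr-flow" ;;
    *)                  flow="default" ;;
esac
tg-start-library-processing -d "$doc_id" --id "$proc_id" -i "$flow"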
Use Cases
Batch Document Processing
# Process all unprocessed documents
process_all_documents() {
    local collection="$1"
    local batch_id="batch_$(date +%Y%m%d_%H%M%S)"
    echo "Starting batch processing for collection: $collection"
    # Get all document IDs
    tg-show-library-documents | \
        grep "| id" | \
        awk '{print $3}' | \
        while read -r doc_id; do
            proc_id="${batch_id}_${doc_id}"
            echo "Processing document: $doc_id"
            tg-start-library-processing \
                -d "$doc_id" \
                --id "$proc_id" \
                --collection "$collection" \
                --tags "batch,automated,$(date +%Y%m%d)"
            # Add delay to avoid overwhelming the system
            sleep 2
        done
}
# Process all documents
process_all_documents "processed-docs"
Department-Specific Processing
# Process documents by department
process_by_department() {
    local dept="$1"
    local flow="$2"
    echo "Processing documents for department: $dept"
    # Find documents with department tag
    tg-show-library-documents -U "$dept" | \
        grep "| id" | \
        awk '{print $3}' | \
        while read -r doc_id; do
            proc_id="${dept}_proc_$(date +%s)_${doc_id}"
            echo "Processing $dept document: $doc_id"
            tg-start-library-processing \
                -d "$doc_id" \
                --id "$proc_id" \
                -i "$flow" \
                -U "$dept" \
                --collection "${dept}-processed" \
                --tags "$dept,departmental"
        done
}
# Process documents for different departments
process_by_department "research" "research-flow"
process_by_department "finance" "document-flow"
process_by_department "legal" "compliance-flow"
Priority Processing
# Process high-priority documents first
priority_processing() {
    local priority_tags=("urgent" "high-priority" "critical")
    for tag in "${priority_tags[@]}"; do
        echo "Processing $tag documents..."
        tg-show-library-documents | \
            grep -B5 -A5 "$tag" | \
            grep "| id" | \
            awk '{print $3}' | \
            while read -r doc_id; do
                proc_id="priority_$(date +%s)_${doc_id}"
                echo "Processing priority document: $doc_id"
                tg-start-library-processing \
                    -d "$doc_id" \
                    --id "$proc_id" \
                    --collection "priority-processed" \
                    --tags "priority,$tag"
            done
    done
}
priority_processing
Conditional Processing
# Process documents based on criteria
conditional_processing() {
    local criteria="$1"
    local flow="$2"
    echo "Processing documents matching criteria: $criteria"
    tg-show-library-documents | \
        grep -B10 -A10 "$criteria" | \
        grep "| id" | \
        awk '{print $3}' | \
        while read -r doc_id; do
            # Check if already processed
            if tg-invoke-document-rag -q "test" 2>/dev/null | grep -q "$doc_id"; then
                echo "Document $doc_id already processed, skipping"
                continue
            fi
            proc_id="conditional_$(date +%s)_${doc_id}"
            echo "Processing document: $doc_id"
            tg-start-library-processing \
                -d "$doc_id" \
                --id "$proc_id" \
                -i "$flow" \
                --collection "conditional-processed" \
                --tags "conditional,$criteria"
        done
}
# Process technical documents
conditional_processing "technical" "technical-flow"
Advanced Usage
Processing with Validation
# Process with pre and post validation
validated_processing() {
    local doc_id="$1"
    local proc_id="$2"
    local collection="$3"
    echo "Starting validated processing for: $doc_id"
    # Pre-processing validation
    if ! tg-show-library-documents | grep -q "$doc_id"; then
        echo "ERROR: Document $doc_id not found"
        return 1
    fi
    # Check if processing ID is unique
    if tg-show-flows | grep -q "$proc_id"; then
        echo "ERROR: Processing ID $proc_id already in use"
        return 1
    fi
    # Start processing
    echo "Starting processing..."
    tg-start-library-processing \
        -d "$doc_id" \
        --id "$proc_id" \
        --collection "$collection" \
        --tags "validated,$(date +%Y%m%d)"
    # Monitor processing
    echo "Monitoring processing progress..."
    timeout=300  # 5 minutes
    elapsed=0
    interval=10
    while [ $elapsed -lt $timeout ]; do
        if tg-invoke-document-rag -q "test" -C "$collection" 2>/dev/null | grep -q "$doc_id"; then
            echo "✓ Processing completed successfully"
            return 0
        fi
        echo "Processing in progress... (${elapsed}s elapsed)"
        sleep $interval
        elapsed=$((elapsed + interval))
    done
    echo "⚠ Processing timeout reached"
    return 1
}
# Usage
validated_processing "doc_123" "validated_proc_001" "validated-docs"
Parallel Processing with Limits
# Process multiple documents in parallel with concurrency limits
parallel_processing() {
    local doc_list=("$@")
    local max_concurrent=5
    local current_jobs=0
    echo "Processing ${#doc_list[@]} documents with max $max_concurrent concurrent jobs"
    for doc_id in "${doc_list[@]}"; do
        # Wait if max concurrent jobs reached
        while [ $current_jobs -ge $max_concurrent ]; do
            wait -n  # Wait for any job to complete (requires bash 4.3+)
            current_jobs=$((current_jobs - 1))
        done
        # Start processing in background
        (
            proc_id="parallel_$(date +%s)_${doc_id}"
            echo "Starting processing: $doc_id"
            tg-start-library-processing \
                -d "$doc_id" \
                --id "$proc_id" \
                --collection "parallel-processed" \
                --tags "parallel,batch"
            echo "Completed processing: $doc_id"
        ) &
        current_jobs=$((current_jobs + 1))
    done
    # Wait for all remaining jobs
    wait
    echo "All processing jobs completed"
}
# Get document list and process in parallel
doc_list=($(tg-show-library-documents | grep "| id" | awk '{print $3}'))
parallel_processing "${doc_list[@]}"
Processing with Retry Logic
# Process with automatic retry on failure
processing_with_retry() {
    local doc_id="$1"
    local proc_id="$2"
    local max_retries=3
    local retry_delay=30
    for attempt in $(seq 1 $max_retries); do
        echo "Processing attempt $attempt/$max_retries for document: $doc_id"
        if tg-start-library-processing \
            -d "$doc_id" \
            --id "${proc_id}_attempt_${attempt}" \
            --collection "retry-processed" \
            --tags "retry,attempt_$attempt"; then
            # Wait and check if processing succeeded
            sleep $retry_delay
            if tg-invoke-document-rag -q "test" 2>/dev/null | grep -q "$doc_id"; then
                echo "✓ Processing succeeded on attempt $attempt"
                return 0
            else
                echo "Processing started but content not yet accessible"
            fi
        else
            echo "✗ Processing failed on attempt $attempt"
        fi
        if [ $attempt -lt $max_retries ]; then
            echo "Retrying in ${retry_delay}s..."
            sleep $retry_delay
        fi
    done
    echo "✗ Processing failed after $max_retries attempts"
    return 1
}
# Usage
processing_with_retry "doc_123" "retry_proc_001"
Configuration-Driven Processing
# Process documents based on configuration file
config_driven_processing() {
    local config_file="$1"
    if [ ! -f "$config_file" ]; then
        echo "Configuration file not found: $config_file"
        return 1
    fi
    echo "Processing documents based on configuration: $config_file"
    # Example configuration format:
    # doc_id,flow_id,collection,tags
    # doc_123,research-flow,research-docs,nlp research
    while IFS=',' read -r doc_id flow_id collection tags; do
        # Skip header line
        if [ "$doc_id" = "doc_id" ]; then
            continue
        fi
        proc_id="config_$(date +%s)_${doc_id}"
        echo "Processing: $doc_id -> $collection (flow: $flow_id)"
        tg-start-library-processing \
            -d "$doc_id" \
            --id "$proc_id" \
            -i "$flow_id" \
            --collection "$collection" \
            --tags "$tags"
    done < "$config_file"
}
# Create example configuration
cat > processing_config.csv << EOF
doc_id,flow_id,collection,tags
doc_123,research-flow,research-docs,nlp research
doc_456,finance-flow,finance-docs,financial quarterly
doc_789,general-flow,general-docs,general processing
EOF
# Process based on configuration
config_driven_processing "processing_config.csv"
Error Handling
Document Not Found
Exception: Document not found
Solution: Verify the document exists with tg-show-library-documents.
Processing ID Conflict
Exception: Processing ID already exists
Solution: Use a unique processing ID or check existing jobs with tg-show-flows.
Flow Not Found
Exception: Flow instance not found
Solution: Verify the flow exists with tg-show-flows or tg-show-flow-classes.
Insufficient Resources
Exception: Processing queue full
Solution: Wait for current jobs to complete or scale processing resources.
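One way to avoid filling the queue in the first place is to throttle submissions until the number of active processing flows drops below a limit. A rough sketch, reusing the tg-show-flows check from the monitoring examples below; the "proc_" match pattern and the limit are assumptions about your naming and capacity:
# Throttle new submissions while many processing flows are active.
# The "proc_" pattern assumes your processing IDs share that prefix.
max_active=10
while [ "$(tg-show-flows | grep -c "proc_")" -ge "$max_active" ]; do
    echo "Processing queue busy, waiting..."
    sleep 30
done
tg-start-library-processing -d "doc_123456789" --id "proc_next_001"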
Monitoring and Management
Processing Status
# Monitor processing progress
monitor_processing() {
    local proc_id="$1"
    local timeout="${2:-300}"  # 5 minutes default
    echo "Monitoring processing: $proc_id"
    elapsed=0
    interval=10
    while [ $elapsed -lt $timeout ]; do
        # Check if processing is active
        if tg-show-flows | grep -q "$proc_id"; then
            echo "Processing active... (${elapsed}s elapsed)"
        else
            echo "Processing completed or stopped"
            break
        fi
        sleep $interval
        elapsed=$((elapsed + interval))
    done
    if [ $elapsed -ge $timeout ]; then
        echo "Monitoring timeout reached"
    fi
}
# Monitor specific processing job
monitor_processing "proc_001" 600
Batch Monitoring
# Monitor multiple processing jobs
monitor_batch() {
    local proc_pattern="$1"
    echo "Monitoring batch processing: $proc_pattern"
    while true; do
        # grep -c already prints 0 when nothing matches
        active_jobs=$(tg-show-flows | grep -c "$proc_pattern")
        if [ "$active_jobs" -eq 0 ]; then
            echo "All batch processing jobs completed"
            break
        fi
        echo "Active jobs: $active_jobs"
        sleep 30
    done
}
# Monitor batch processing
monitor_batch "batch_proc_"
Environment Variables
TRUSTGRAPH_URL
: Default API URL
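Exporting the variable once per session lets you omit -u/--api-url from each invocation; replace the example host with your deployment's URL:
# Point tg-start-library-processing at a remote API instead of the local default
export TRUSTGRAPH_URL="http://trustgraph.example.com:8088/"
tg-start-library-processing -d "doc_123456789" --id "proc_001"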
Related Commands
tg-show-library-documents - List available documents
tg-stop-library-processing - Stop processing jobs
tg-show-flows - Monitor processing flows
tg-invoke-document-rag - Query processed documents
API Integration
This command uses the Library API to initiate document processing workflows.
Best Practices
- Unique IDs: Always use unique processing IDs to avoid conflicts (see the sketch after this list)
- Resource Management: Monitor system resources during batch processing
- Error Handling: Implement retry logic for robust processing
- Monitoring: Track processing progress and completion
- Collection Organization: Use meaningful collection names
- Tagging: Apply consistent tagging for better organization
- Documentation: Document processing procedures and configurations
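For the first point, a simple convention is to derive processing IDs from a timestamp or a UUID so repeated runs never collide. A minimal sketch; uuidgen may not be installed on every system, so the date-based form is the portable fallback:
# Generate a processing ID that is unique across repeated runs
proc_id="proc_$(date +%Y%m%d_%H%M%S)_$$"
# Alternative if uuidgen is available:
# proc_id="proc_$(uuidgen)"
tg-start-library-processing -d "doc_123456789" --id "$proc_id"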
Troubleshooting
Processing Not Starting
# Check document exists
tg-show-library-documents | grep "document-id"
# Check flow is available
tg-show-flows | grep "flow-id"
# Check system resources
free -h
df -h
Slow Processing
# Check processing queue
tg-show-flows | grep processing | wc -l
# Monitor system load
top
htop
Processing Failures
# Check processing logs
# (Log location depends on TrustGraph configuration)
# Retry with different flow
tg-start-library-processing -d "doc-id" --id "retry-proc" -i "alternative-flow"