tg-load-structured-data

Load structured data (CSV, JSON, XML) into TrustGraph for querying and analysis.

Note: This is an emerging utility that may change as structured data capabilities become more integrated into the TrustGraph platform.

Synopsis

tg-load-structured-data -f FILE -s SCHEMA [OPTIONS]

Description

The tg-load-structured-data command loads structured data files into TrustGraph, making them available for GraphQL queries, natural language queries, and agent-based extraction. It supports CSV, JSON, and XML input, with either automatic schema detection or a manually supplied schema definition.

This tool bridges the gap between traditional structured data and TrustGraph’s knowledge graph capabilities, enabling:

  • Direct querying of structured data via GraphQL
  • Natural language queries against tabular data
  • Integration with document-based knowledge
  • Agent-based data enrichment
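
As a quick orientation, a minimal end-to-end session might look like the following sketch (the sample file and collection name are illustrative; the query command is documented under "Integration with Queries" below):

```shell
# Create a small sample CSV (illustrative data)
printf 'id,name,price\n1,Widget,9.99\n2,Gadget,19.50\n' > products.csv

# Preview the auto-detected schema, then load for real
tg-load-structured-data -f products.csv -s auto --dry-run
tg-load-structured-data -f products.csv -c products

# Query the loaded data via GraphQL
tg-invoke-structured-query -q 'query { products { id name price } }'
```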

Options

Option                        Description                            Default
-f, --file FILE               Data file to load (CSV, JSON, XML)     Required
-s, --schema SCHEMA           Schema definition file or auto-detect  auto
-c, --collection COLLECTION   Target collection name                 From filename
-t, --type TYPE               Object type name                       From schema
-u, --url URL                 TrustGraph API URL                     http://localhost:8088/
--format FORMAT               Force input format: csv, json, xml     Auto-detect
--delimiter DELIM             CSV delimiter character                ,
--has-header BOOL             CSV has header row                     true
--batch-size N                Records per batch                      1000
--validate                    Validate data before loading           false
--update                      Update existing records                false
--dry-run                     Preview without loading                false
-h, --help                    Show help message                      -

Supported Formats

CSV Files

# Load CSV with auto-detected schema
tg-load-structured-data -f customers.csv -s auto

# Custom delimiter: a real tab character ($'\t' syntax requires bash/zsh)
tg-load-structured-data -f data.tsv --delimiter $'\t'

# No header row
tg-load-structured-data -f data.csv --has-header false

JSON Files

# Load JSON array
tg-load-structured-data -f products.json

# Load newline-delimited JSON
tg-load-structured-data -f events.jsonl

# Nested JSON with schema
tg-load-structured-data -f complex.json -s schema.json

XML Files

# Load XML with schema
tg-load-structured-data -f catalog.xml -s catalog-schema.xsd

# Auto-detect XML structure
tg-load-structured-data -f data.xml -s auto

Schema Definition

Auto-Detection

The tool can automatically detect schemas:

# Auto-detect from CSV headers
tg-load-structured-data -f sales.csv -s auto

# Preview detected schema
tg-load-structured-data -f data.csv -s auto --dry-run

Manual Schema Files

Define schemas in JSON format:

{
  "name": "Product",
  "fields": [
    {"name": "id", "type": "string", "required": true},
    {"name": "name", "type": "string", "required": true},
    {"name": "price", "type": "number", "required": false},
    {"name": "category", "type": "string", "required": false},
    {"name": "in_stock", "type": "boolean", "default": true}
  ],
  "indexes": ["id", "category"]
}

Load with schema:

tg-load-structured-data -f products.csv -s product-schema.json

Examples

Basic Data Loading

# Load customer data
tg-load-structured-data -f customers.csv -c customers

# Load with specific type
tg-load-structured-data -f employees.csv -t Employee

# Load to custom collection
tg-load-structured-data -f q1-sales.csv -c sales-2024-q1

Data Validation

# Validate before loading
tg-load-structured-data -f data.csv --validate

# Dry run to preview
tg-load-structured-data -f large-dataset.csv --dry-run

# Show validation errors
tg-load-structured-data -f data.csv --validate 2> errors.log

Batch Processing

# Load large file in batches
tg-load-structured-data -f huge-dataset.csv --batch-size 5000

# Process directory of files
for file in data/*.csv; do
  tg-load-structured-data -f "$file" -c "$(basename "$file" .csv)"
done

Update Operations

# Update existing records
tg-load-structured-data -f updated-products.csv --update

# Replace entire collection
tg-invoke-objects-query -c products --delete-all
tg-load-structured-data -f new-products.csv -c products
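
Because --delete-all is destructive, a more cautious replacement (a sketch built from the flags documented above) validates the new file before deleting anything:

```shell
# Only delete the old collection if the replacement file passes validation
tg-load-structured-data -f new-products.csv --validate --dry-run \
  && tg-invoke-objects-query -c products --delete-all \
  && tg-load-structured-data -f new-products.csv -c products \
  || echo "replacement aborted: new data failed validation or load" >&2
```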

Integration with Queries

After loading data, query it using:

GraphQL Queries

# Query loaded data
tg-invoke-structured-query -q 'query { products { id name price } }'

Natural Language Queries

# Ask questions about the data (single quotes keep the shell from expanding $50)
tg-invoke-nlp-query -q 'Show all products under $50'

Object Queries

# Direct object queries
tg-invoke-objects-query -c products -t Product

Advanced Features

Data Transformation

# Apply transformations during load
tg-load-structured-data -f raw-data.csv \
  --transform "price:number,date:datetime"

# Custom field mapping
tg-load-structured-data -f legacy.csv \
  --field-map "ProductID:id,ProductName:name"

Relationship Detection

# Auto-detect foreign keys
tg-load-structured-data -f orders.csv \
  --detect-relations

# Specify relationships
tg-load-structured-data -f orders.csv \
  --relation "customer_id:customers.id"

Incremental Loading

# Load only new records
tg-load-structured-data -f daily-data.csv \
  --incremental --key-field id

# Timestamp-based loading
tg-load-structured-data -f events.csv \
  --incremental --timestamp-field created_at \
  --since "2024-01-01"
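
For a recurring daily job, the --since value can be computed at run time. This sketch assumes GNU date, with a BSD/macOS fallback:

```shell
# Yesterday's date in YYYY-MM-DD (GNU date first, BSD date as fallback)
SINCE=$(date -d yesterday +%F 2>/dev/null || date -v-1d +%F)

tg-load-structured-data -f events.csv \
  --incremental --timestamp-field created_at \
  --since "$SINCE"
```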

Data Types

Supported data types for schema fields:

Type      Description       Example
string    Text data         "Product Name"
number    Numeric values    42.99
integer   Whole numbers     100
boolean   True/false        true
date      Date only         "2024-01-15"
datetime  Date and time     "2024-01-15T10:30:00Z"
array     List of values    ["tag1", "tag2"]
object    Nested structure  {"address": {...}}

Performance Considerations

Large Files

For files over 100MB:

# Use larger batch sizes
tg-load-structured-data -f large.csv --batch-size 10000

# Stream a compressed file from stdin (decompressed on the fly)
gzip -dc data.csv.gz | tg-load-structured-data -f - --format csv

# Split into chunks (after splitting, only the first chunk retains the header row)
split -l 100000 huge.csv chunk_
for chunk in chunk_*; do
  tg-load-structured-data -f "$chunk"
done

Optimization Tips

  1. Index key fields for faster queries
  2. Use appropriate batch sizes based on record size
  3. Validate locally before loading large datasets
  4. Consider partitioning very large datasets
  5. Use incremental loading for regular updates

Error Handling

Validation Errors

$ tg-load-structured-data -f data.csv --validate

Validation Errors:
  Row 10: Invalid date format in field 'created_date'
  Row 25: Missing required field 'id'
  Row 30: Type mismatch in field 'price' (expected number, got string)

Recovery Options

# Skip invalid records
tg-load-structured-data -f data.csv --skip-errors

# Log errors and continue
tg-load-structured-data -f data.csv \
  --on-error continue \
  --error-log errors.txt

# Stop on first error (default)
tg-load-structured-data -f data.csv --on-error stop

Environment Variables

Variable                Description               Default
TRUSTGRAPH_URL          Default API URL           http://localhost:8088/
TRUSTGRAPH_BATCH_SIZE   Default batch size        1000
TRUSTGRAPH_TEMP_DIR     Temporary file directory  /tmp
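
The variables substitute for the corresponding command-line flags, which is convenient in scripts and CI jobs (host name below is illustrative):

```shell
# Point all subsequent loads at a remote TrustGraph instance
export TRUSTGRAPH_URL="http://trustgraph.example.com:8088/"
export TRUSTGRAPH_BATCH_SIZE=5000

tg-load-structured-data -f data.csv   # picks up the exported defaults
```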

Exit Codes

Code  Description
0     Success
1     Error (validation, loading, etc.)
2     Partial success (some records failed)
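
In scripts, the exit code distinguishes a clean load from a partial one. A wrapper might branch on it like this (file and collection names illustrative):

```shell
tg-load-structured-data -f data.csv -c products
status=$?
case "$status" in
  0) echo "load complete" ;;
  2) echo "partial load: some records failed" >&2 ;;
  *) echo "load failed (exit $status)" >&2 ;;
esac
```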

Limitations

Current limitations (may change in future versions):

  • Maximum file size: 2GB (uncompressed)
  • Maximum records per collection: 10 million
  • Maximum field count: 1000 per schema
  • Nested depth limit: 10 levels for JSON/XML
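
The size and field-count limits can be checked before a long load is attempted. A rough pre-flight sketch for CSV input (the sample file is created only for illustration):

```shell
FILE=products.csv
printf 'id,name,price\n1,Widget,9.99\n' > "$FILE"     # sample data for illustration

size=$(wc -c < "$FILE")                              # bytes, vs. the 2GB limit
fields=$(head -n 1 "$FILE" | awk -F, '{print NF}')   # columns, vs. the 1000-field limit

if [ "$size" -gt $((2 * 1024 * 1024 * 1024)) ]; then
  echo "refusing to load: $FILE exceeds the 2GB limit" >&2
elif [ "$fields" -gt 1000 ]; then
  echo "refusing to load: $FILE has more than 1000 fields" >&2
else
  tg-load-structured-data -f "$FILE" --validate
fi
```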

Troubleshooting

Schema Detection Issues

If auto-detection fails:

  • Ensure consistent data types in columns
  • Check for special characters in headers
  • Consider creating manual schema

Memory Issues

For memory errors with large files:

  • Reduce batch size
  • Process file in chunks
  • Increase available memory

Slow Loading

If loading is slow:

  • Increase batch size for small records
  • Disable validation for trusted data
  • Use parallel loading for multiple files

Future Enhancements

Planned improvements for this utility:

  • Direct database connections (PostgreSQL, MySQL)
  • Real-time streaming data support
  • Advanced transformation pipelines
  • Automatic relationship discovery
  • Data quality profiling

See Also