Structure Descriptor Language (SDL)

The TrustGraph Structure Descriptor Language (SDL) is a JSON-based configuration language that describes how to parse, transform, and import structured data into TrustGraph. It provides a declarative approach to data ingestion, supporting multiple input formats and complex transformation pipelines without requiring custom code.

Overview

SDL enables you to:

Import multiple formats - CSV, JSON, XML, Excel, Parquet, and fixed-width files
Transform data - Clean, normalize, and convert data types during import
Validate data - Apply quality checks and constraints
Handle errors - Configure how to deal with invalid data
Enrich data - Use lookup tables and calculated fields

Basic Structure

Every SDL configuration has this top-level structure:

{
  "version": "1.0",
  "metadata": { ... },
  "format": { ... },
  "globals": { ... },
  "preprocessing": [ ... ],
  "mappings": [ ... ],
  "postprocessing": [ ... ],
  "output": { ... }
}

Required Sections

Section	Description
`version`	SDL specification version (currently “1.0”)
`format`	Input data format and parsing options
`mappings`	Field-by-field mapping from source to target
`output`	Output configuration for TrustGraph

Optional Sections

Section	Description
`metadata`	Configuration information (name, author, etc.)
`globals`	Variables, constants, and lookup tables
`preprocessing`	Operations applied before field mapping
`postprocessing`	Operations applied after field mapping

Format Definition

The format section describes the input data format and parsing options:

{
  "format": {
    "type": "csv|json|xml|fixed-width|excel|parquet",
    "encoding": "utf-8",
    "options": {
      // Format-specific options
    }
  }
}

CSV Format

{
  "format": {
    "type": "csv",
    "encoding": "utf-8",
    "options": {
      "delimiter": ",",
      "quote_char": "\"",
      "escape_char": "\\",
      "skip_rows": 1,
      "has_header": true,
      "null_values": ["", "NULL", "null", "N/A"]
    }
  }
}

CSV Options:

Option	Type	Default	Description
`delimiter`	string	`","`	Field separator character
`quote_char`	string	`"\""`	Character used to quote fields
`escape_char`	string	`"\\"`	Character used to escape quotes
`skip_rows`	integer	`0`	Number of rows to skip at start
`has_header`	boolean	`true`	First row contains column names
`null_values`	array	`[""]`	Values to treat as null

JSON Format

{
  "format": {
    "type": "json",
    "options": {
      "root_path": "$.data",
      "array_mode": "records",
      "flatten": false
    }
  }
}

JSON Options:

Option	Type	Default	Description
`root_path`	string	`"$"`	JSONPath to array of records
`array_mode`	string	`"records"`	How to handle arrays: `records` or `single`
`flatten`	boolean	`false`	Flatten nested objects

XML Format

{
  "format": {
    "type": "xml",
    "options": {
      "root_element": "//records/record",
      "namespaces": {
        "ns": "http://example.com/namespace"
      }
    }
  }
}

XML Options:

Option	Type	Default	Description
`root_element`	string	Required	XPath to record elements
`namespaces`	object	`{}`	Namespace prefixes and URIs

Global Settings

The globals section defines lookup tables, variables, and constants used throughout the configuration:

{
  "globals": {
    "variables": {
      "current_date": "2024-01-01",
      "batch_id": "BATCH_001",
      "default_confidence": 0.8
    },
    "lookup_tables": {
      "country_codes": {
        "US": "United States",
        "UK": "United Kingdom",
        "CA": "Canada"
      },
      "status_mapping": {
        "1": "active",
        "0": "inactive"
      }
    },
    "constants": {
      "source_system": "legacy_crm",
      "import_type": "full"
    }
  }
}

Usage in expressions:

Variables: ${current_date}
Constants: ${source_system}
Lookup tables: Used with lookup transform

Field Mappings

The mappings section defines how source data maps to target fields:

{
  "mappings": [
    {
      "target_field": "person_name",
      "source": "$.name",
      "transforms": [
        {"type": "trim"},
        {"type": "title_case"},
        {"type": "required"}
      ],
      "validation": [
        {"type": "min_length", "value": 2},
        {"type": "max_length", "value": 100}
      ]
    }
  ]
}

Mapping Fields

Field	Type	Required	Description
`target_field`	string	Yes	Name of the target field
`source`	string	Yes	Source path expression
`transforms`	array	No	List of transformations to apply
`validation`	array	No	List of validation rules

Source Path Expressions

Different formats use different path expression languages:

Format	Example	Description
CSV	`"customer_name"` or `"[2]"`	Column name or index
JSON	`"$.user.profile.email"`	JSONPath expression
XML	`"//product[@id='123']/price"`	XPath expression
Fixed-width	`"customer_name"`	Field name from definition

Transformations

Transforms are applied to field values in the order specified:

String Transforms

{"type": "trim"}                                    // Remove whitespace
{"type": "upper"}                                   // Convert to uppercase
{"type": "lower"}                                   // Convert to lowercase  
{"type": "title_case"}                              // Title Case Format
{"type": "replace", "pattern": "old", "replacement": "new"}
{"type": "regex_replace", "pattern": "\\d+", "replacement": "XXX"}
{"type": "substring", "start": 0, "end": 10}       // Extract substring
{"type": "pad_left", "length": 10, "char": "0"}    // Pad with characters

Type Conversions

{"type": "to_string"}                               // Convert to string
{"type": "to_int"}                                  // Convert to integer
{"type": "to_float"}                                // Convert to float
{"type": "to_bool"}                                 // Convert to boolean
{"type": "to_date", "format": "YYYY-MM-DD"}        // Parse date
{"type": "parse_json"}                              // Parse JSON string

Data Operations

{"type": "default", "value": "default_value"}      // Set default if null
{"type": "lookup", "table": "table_name"}          // Lookup in table
{"type": "concat", "values": ["field1", " - ", "field2"]}
{"type": "calculate", "expression": "${field1} + ${field2}"}
{"type": "conditional", 
 "condition": "${age} > 18", 
 "true_value": "adult", 
 "false_value": "minor"}

Expression Syntax

Expressions use ${} syntax to reference:

Current field value: ${.}
Other fields: ${field_name}
Variables: ${variable_name}

Examples:

{"type": "calculate", "expression": "${price} * 1.1"}
{"type": "conditional", "condition": "${status} == 'active'", "true_value": "yes", "false_value": "no"}

Validation Rules

Validation rules check data quality and can trigger error handling:

Basic Validations

{"type": "required"}                                // Field must have value
{"type": "not_null"}                               // Field cannot be null
{"type": "min_length", "value": 5}                // Minimum string length
{"type": "max_length", "value": 100}              // Maximum string length
{"type": "range", "min": 0, "max": 1000}          // Numeric range
{"type": "pattern", "value": "^[A-Z]{2,3}$"}      // Regex pattern
{"type": "in_list", "values": ["active", "inactive", "pending"]}

Custom Validations

{
  "type": "custom",
  "expression": "${age} >= 18 && ${country} == 'US'",
  "message": "Must be 18+ and in US"
}

{
  "type": "cross_field",
  "fields": ["start_date", "end_date"],
  "expression": "${start_date} < ${end_date}",
  "message": "Start date must be before end date"
}

Preprocessing and Postprocessing

Preprocessing

Operations applied to all records before field mapping:

{
  "preprocessing": [
    {
      "type": "filter",
      "condition": "${status} != 'deleted'"
    },
    {
      "type": "sort",
      "field": "created_date",
      "order": "asc"
    }
  ]
}

Postprocessing

Operations applied to all records after field mapping:

{
  "postprocessing": [
    {
      "type": "deduplicate",
      "key_fields": ["email", "phone"]
    },
    {
      "type": "aggregate",
      "group_by": ["country"],
      "functions": {
        "total_count": {"type": "count"},
        "avg_age": {"type": "avg", "field": "age"}
      }
    }
  ]
}

Output Configuration

The output section configures how processed data is saved:

{
  "output": {
    "format": "trustgraph-objects",
    "schema_name": "person",
    "options": {
      "batch_size": 1000,
      "confidence": 0.9,
      "source_span_field": "raw_text",
      "metadata": {
        "source": "crm_import",
        "version": "1.0"
      }
    },
    "error_handling": {
      "on_validation_error": "skip",
      "on_transform_error": "default",
      "max_errors": 100,
      "error_output": "errors.json"
    }
  }
}

Output Options

Option	Type	Description
`format`	string	Always “trustgraph-objects”
`schema_name`	string	Target schema name
`batch_size`	integer	Records per batch
`confidence`	float	Confidence score (0.0-1.0)
`metadata`	object	Additional metadata to include

Error Handling

Option	Values	Description
`on_validation_error`	`skip`, `fail`, `log`	What to do when validation fails
`on_transform_error`	`skip`, `fail`, `default`	What to do when transform fails
`max_errors`	integer	Maximum errors before stopping
`error_output`	string	File to write error details

Complete Example

Here’s a complete SDL configuration for importing customer data from CSV:

{
  "version": "1.0",
  "metadata": {
    "name": "Customer Import from CRM CSV",
    "description": "Imports customer data from legacy CRM system",
    "author": "Data Team",
    "created": "2024-01-01T00:00:00Z"
  },
  "format": {
    "type": "csv",
    "encoding": "utf-8",
    "options": {
      "delimiter": ",",
      "has_header": true,
      "skip_rows": 1,
      "null_values": ["", "NULL", "N/A"]
    }
  },
  "globals": {
    "variables": {
      "import_date": "2024-01-01",
      "default_confidence": 0.85
    },
    "lookup_tables": {
      "country_codes": {
        "US": "United States",
        "CA": "Canada", 
        "UK": "United Kingdom"
      },
      "status_codes": {
        "1": "active",
        "0": "inactive"
      }
    }
  },
  "preprocessing": [
    {
      "type": "filter",
      "condition": "${status} != ''"
    }
  ],
  "mappings": [
    {
      "target_field": "full_name",
      "source": "customer_name",
      "transforms": [
        {"type": "trim"},
        {"type": "title_case"}
      ],
      "validation": [
        {"type": "required"},
        {"type": "min_length", "value": 2},
        {"type": "max_length", "value": 100}
      ]
    },
    {
      "target_field": "email",
      "source": "email_address", 
      "transforms": [
        {"type": "trim"},
        {"type": "lower"}
      ],
      "validation": [
        {"type": "pattern", "value": "^[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,}$"}
      ]
    },
    {
      "target_field": "age",
      "source": "age",
      "transforms": [
        {"type": "to_int"},
        {"type": "default", "value": 0}
      ],
      "validation": [
        {"type": "range", "min": 0, "max": 120}
      ]
    },
    {
      "target_field": "country",
      "source": "country_code",
      "transforms": [
        {"type": "lookup", "table": "country_codes"},
        {"type": "default", "value": "Unknown"}
      ]
    },
    {
      "target_field": "status",
      "source": "status",
      "transforms": [
        {"type": "lookup", "table": "status_codes"},
        {"type": "default", "value": "unknown"}
      ]
    }
  ],
  "postprocessing": [
    {
      "type": "deduplicate",
      "key_fields": ["email"]
    }
  ],
  "output": {
    "format": "trustgraph-objects",
    "schema_name": "customer",
    "options": {
      "confidence": "${default_confidence}",
      "batch_size": 500,
      "metadata": {
        "import_date": "${import_date}",
        "source": "legacy_crm"
      }
    },
    "error_handling": {
      "on_validation_error": "log",
      "on_transform_error": "default",
      "max_errors": 50,
      "error_output": "customer_import_errors.json"
    }
  }
}

Usage with TrustGraph

SDL configurations are used with the tg-load-structured-data command:

# Use SDL configuration file
tg-load-structured-data -f data.csv -d config.json -c customers

# Generate SDL from sample data
tg-load-structured-data -f data.csv -s auto -c customers --save-config config.json

Best Practices

1. Start Simple

Begin with basic field mappings and add complexity incrementally:

{
  "mappings": [
    {
      "target_field": "name",
      "source": "customer_name",
      "transforms": [{"type": "trim"}]
    }
  ]
}

2. Use Descriptive Names

Choose clear names for target fields and include metadata:

{
  "metadata": {
    "name": "Customer Import - Legacy CRM System",
    "description": "Imports customer records from CSV export of legacy CRM",
    "version": "1.2"
  }
}

3. Handle Data Quality

Include validation and error handling:

{
  "validation": [
    {"type": "required"},
    {"type": "pattern", "value": "^[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,}$"}
  ]
}

4. Use Lookup Tables

Define lookup tables for code-to-value mappings:

{
  "globals": {
    "lookup_tables": {
      "department_codes": {
        "ENG": "Engineering",
        "SALES": "Sales",
        "MKT": "Marketing"
      }
    }
  }
}

5. Test with Small Datasets

Validate your configuration with a small sample before processing large files.

6. Document Complex Logic

Use comments in metadata to explain complex transformations:

{
  "metadata": {
    "description": "Age calculation: Uses birth_date if available, otherwise uses age_years field with ${current_date} as reference"
  }
}

Error Handling

SDL provides several levels of error handling:

Transform Errors

When a transformation fails (e.g., converting “abc” to integer):

skip: Skip the record entirely
fail: Stop processing with error
default: Use the field’s default value

Validation Errors

When validation rules fail:

skip: Skip the record entirely
fail: Stop processing with error
log: Log error but continue processing

Error Output

All errors are logged to the specified error output file with details:

Record number
Field name
Error type and message
Original value

Structure Descriptor Language (SDL)

Overview

Basic Structure

Required Sections

Optional Sections

Format Definition

CSV Format

JSON Format

XML Format

Global Settings

Field Mappings

Mapping Fields

Source Path Expressions

Transformations

String Transforms

Type Conversions

Data Operations

Expression Syntax

Validation Rules

Basic Validations

Custom Validations

Preprocessing and Postprocessing

Preprocessing

Postprocessing

Output Configuration

Output Options

Error Handling

Complete Example

Usage with TrustGraph

Best Practices

1. Start Simple

2. Use Descriptive Names

3. Handle Data Quality

4. Use Lookup Tables

5. Test with Small Datasets

6. Document Complex Logic

Error Handling

Transform Errors

Validation Errors

Error Output

See Also