Structured Data Load from a Document
In this guide a document containing tabular data will be loaded using the schema defined in the previous guide.
Overview
This guide walks through using a previously defined schema in an extraction flow. In an object-extraction flow, the system uses LLM prompts to located objects matching defined schemas, which are then loaded into the object store.
What You’ll Learn
- How to load test documents into TrustGraph
- How to start an object extraction flow
- How to process documents through the extraction pipeline
Prerequisites
Before starting this guide, ensure you have:
- A running TrustGraph instance version 1.3 or later (see Installation Guide)
- Python 3.10 with the TrustGraph CLI tools installed (
pip install trustgraph-cli
) - Sample documents or structured data files to process
Step 1: Load Data into TrustGraph
You have two options for loading a document into TrustGraph:
- Using the workbench: The librarian screen allows loading a PDF or text document.
- Command line: Running the same operation using CLI
Prepare Test Documents
For this example, we’ll use a sample document containing city demographic data that matches our schema.
-
Access the Sample Document Open this Google Docs document containing city demographic data: https://raw.githubusercontent.com/trustgraph-ai/example-data/main/cities/most-populous-cities.pdf
- Save as PDF
- In Google Docs, click File → Download → PDF Document (.pdf)
- Save the file as
cities_data.pdf
to your local machine
- Alternative: Create Your Own Test Document If you prefer, you can create your own test document with city information. Make sure it includes:
- City names and their countries
- Population figures
- Climate descriptions
- Primary languages spoken
- Local currencies
Example content format:
Tokyo, Japan has a population of 37.4 million people. The climate is humid subtropical with hot summers and mild winters. The primary language is Japanese and the currency is the Japanese Yen. Delhi, India has a population of 32.9 million. The climate is semi-arid with extreme summers and cool winters. The primary language is Hindi and the currency is the Indian Rupee. Shanghai, China has a population of 28.5 million people. The climate is humid subtropical with hot, humid summers and cool, damp winters. The primary language is Mandarin Chinese and the currency is the Chinese Yuan. São Paulo, Brazil has a population of 22.8 million people. The climate is subtropical highland with warm, rainy summers and mild, dry winters. The primary language is Portuguese and the currency is the Brazilian Real. Dhaka, Bangladesh has a population of 22.4 million people. The climate is tropical monsoon with hot, humid summers and mild winters. The primary language is Bengali and the currency is the Bangladeshi Taka.
Load Documents into TrustGraph using the Workbench
-
Navigate to Library In the TrustGraph workbench, click on Library in the navigation menu.
-
Upload Documents Click Upload Documents button.
- Configure Document Upload
- Title: Enter
30 Most Populous Cities
- Document Type: Select the
PDF
button - Select Files: Click Select PDF files and choose your
cities_data.pdf
file
- Title: Enter
- Submit Upload Click Submit to load the document into the library.
Using the Command Line
Load the document using the tg-add-library-document
command:
tg-add-library-document cities_data.pdf \
--name "30 Most Populous Cities" \
--kind "application/pdf"
Step 2: Start an Object Extraction Flow
Use the Workbench to create a new flow on the Flows page.
Select ‘Create’, give your flow an ID e.g. object-extraction
and select the object-extract
flow class. Give it a helpful description e.g. Object extraction
.
Step 3: Launch Document Processing
When loading the document on the workbench, it can help to decide to store the data in a particular collection for later. Click the dialog top right, and set the collection to cities
.
On the Library page, select your document containing city information, click ‘Submit’ at the bottom of the screen.
Select your new object extraction processing and submit the document.
You can track progress on the Grafana dashboard. The sample document provided should process quickly, say under 1 minute. The processing makes use of an LLM to perform extraction so there will be time needed when processing large datasets.
After about 1 minute, the data should be available for querying.
Next Steps
See the next guide.
Best Practices
Document Preparation**
- Ensure documents contain clear, structured information
- Use consistent formatting and terminology
- Include relevant context around entities
Further Reading
- tg-add-library-document - Load documents into the library
- TrustGraph CLI Reference - Complete CLI documentation