Agent Extraction Process
The agent extraction process uses AI agents to automatically extract structured data, knowledge graphs, and objects from unstructured text. This guide explains how the extraction workflow operates and how to configure it for different use cases.
Overview
Agent extraction is a multi-stage process that:
- Analyzes unstructured text using LLM-based agents
- Identifies entities, relationships, and structured data
- Extracts information according to defined schemas
- Stores results in the knowledge graph and object storage
- Makes data queryable through GraphQL and other APIs
Extraction Workflow
1. Document Ingestion
from trustgraph.api import Api
api = Api("http://localhost:8088/").flow().id("default")
# Load document for extraction
document_text = """
Apple Inc. announced its Q4 2024 earnings today. The company reported
revenue of $89.5 billion, led by strong iPhone 15 sales. CEO Tim Cook
stated that the new AI features have driven customer upgrades.
"""
2. Agent Configuration
Agents can be configured with specific extraction instructions:
# Configure extraction agent
extraction_prompt = """
Extract the following information:
1. Company entities (name, executives, products)
2. Financial data (revenue, earnings, metrics)
3. Product information (name, features)
4. Relationships between entities
5. Temporal information (dates, quarters)
Structure the output as:
- Knowledge graph triples
- Structured objects matching defined schemas
"""
3. Invoke Agent Extraction
# Execute extraction
response = api.invoke_agent(
prompt=extraction_prompt,
text=document_text,
extraction_mode="comprehensive" # Options: simple, comprehensive, schema-based
)
# Response contains extracted data
print(f"Extracted {len(response['triples'])} knowledge graph triples")
print(f"Extracted {len(response['objects'])} structured objects")
print(f"Extraction ID: {response['extraction_id']}")
Extraction Modes
Simple Extraction
Basic entity and relationship extraction:
response = api.invoke_agent(
prompt="Extract key entities and relationships",
text=document_text,
extraction_mode="simple"
)
# Results include:
# - Named entities (people, organizations, locations)
# - Basic relationships
# - Key facts
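Assuming the response exposes the extracted entities as a list of dictionaries (the entities, type, and name keys here are illustrative, following the conventions used elsewhere in this guide), you can iterate over them directly:
# Iterate over extracted entities (response shape assumed, not guaranteed)
for entity in response.get("entities", []):
    print(f"{entity['type']}: {entity['name']}")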
Comprehensive Extraction
Detailed multi-level extraction:
response = api.invoke_agent(
prompt="Perform comprehensive knowledge extraction",
text=document_text,
extraction_mode="comprehensive"
)
# Results include:
# - All entities with properties
# - Complex relationships
# - Temporal information
# - Contextual metadata
# - Inferred relationships
Schema-Based Extraction
Extract data matching specific schemas:
# Define target schema
schema = {
"Company": {
"fields": ["name", "ceo", "revenue", "products"],
"required": ["name"]
},
"Product": {
"fields": ["name", "features", "company"],
"required": ["name", "company"]
}
}
response = api.invoke_agent(
prompt="Extract data according to provided schemas",
text=document_text,
extraction_mode="schema-based",
schemas=schema
)
# Results match the defined schemas exactly
Knowledge Graph Extraction
The agent creates knowledge graph triples; the examples below use shorthand identifiers rather than full RDF URIs:
# Example extracted triples
triples = [
("Apple_Inc", "rdf:type", "Organization"),
("Apple_Inc", "has_ceo", "Tim_Cook"),
("Apple_Inc", "reported_revenue", "89.5_billion"),
("iPhone_15", "rdf:type", "Product"),
("iPhone_15", "manufactured_by", "Apple_Inc"),
("iPhone_15", "has_feature", "AI_capabilities"),
("Q4_2024", "rdf:type", "TimePeriod"),
("89.5_billion", "reported_in", "Q4_2024")
]
# Triples are automatically stored in the knowledge graph
# Query them using SPARQL or GraphQL
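To see how these shorthand triples map onto concrete RDF, here is a minimal sketch using rdflib; the example.org namespace is an illustrative assumption, since TrustGraph handles triple storage for you:
from rdflib import Graph, Namespace, RDF
EX = Namespace("http://example.org/")  # illustrative namespace, not TrustGraph's
g = Graph()
g.bind("ex", EX)
for s, p, o in triples:
    # Map the rdf:type shorthand to the real RDF type predicate
    predicate = RDF.type if p == "rdf:type" else EX[p]
    g.add((EX[s], predicate, EX[o]))
print(g.serialize(format="turtle"))  # Turtle rendering of the extracted graph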
Object Extraction
Structured objects are extracted and stored:
# Example extracted objects
objects = [
{
"schema": "Company",
"data": {
"id": "apple-inc",
"name": "Apple Inc.",
"ceo": "Tim Cook",
"revenue": 89500000000,
"revenue_period": "Q4 2024",
"products": ["iPhone 15"]
}
},
{
"schema": "Product",
"data": {
"id": "iphone-15",
"name": "iPhone 15",
"company": "apple-inc",
"features": ["AI capabilities"],
"category": "Smartphone"
}
}
]
# Objects are stored and queryable via Structured Query API
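Before relying on extracted objects downstream, it can be useful to check them against the schema definitions from the schema-based extraction example above; this pure-Python sketch checks required fields only:
def check_required_fields(obj, schemas):
    # Return the required fields missing from one extracted object
    spec = schemas.get(obj["schema"], {})
    return [f for f in spec.get("required", []) if f not in obj["data"]]
for obj in objects:
    missing = check_required_fields(obj, schema)
    if missing:
        print(f"{obj['schema']} object is missing required fields: {missing}")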
Extraction Pipelines
Sequential Extraction
Process documents through multiple extraction stages:
# Stage 1: Entity extraction
entities = api.invoke_agent(
prompt="Extract all named entities",
text=document_text,
extraction_mode="simple"
)
# Stage 2: Relationship extraction
relationships = api.invoke_agent(
prompt=f"Extract relationships between these entities: {entities['entities']}",
text=document_text,
extraction_mode="comprehensive"
)
# Stage 3: Property extraction
properties = api.invoke_agent(
prompt="Extract properties and attributes for each entity",
text=document_text,
context=entities
)
Batch Extraction
Process multiple documents:
documents = ["doc1.txt", "doc2.txt", "doc3.txt"]
all_extractions = []
for doc_path in documents:
with open(doc_path, "r", encoding="utf-8") as f:
text = f.read()
response = api.invoke_agent(
prompt="Extract structured data",
text=text,
extraction_mode="comprehensive",
metadata={"source": doc_path}
)
all_extractions.append(response)
# Aggregate results
total_triples = sum(len(e['triples']) for e in all_extractions)
total_objects = sum(len(e['objects']) for e in all_extractions)
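If the client is safe to call from multiple threads (an assumption worth verifying for your deployment), the same batch can be processed concurrently:
from concurrent.futures import ThreadPoolExecutor
def extract_file(doc_path):
    # Read one document and run a comprehensive extraction over it
    with open(doc_path, "r", encoding="utf-8") as f:
        return api.invoke_agent(
            prompt="Extract structured data",
            text=f.read(),
            extraction_mode="comprehensive",
            metadata={"source": doc_path}
        )
with ThreadPoolExecutor(max_workers=4) as pool:
    all_extractions = list(pool.map(extract_file, documents))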
Querying Extracted Data
Query Knowledge Graph
# SPARQL query over the extracted triples (shorthand identifiers as above;
# a real endpoint would also need PREFIX declarations)
sparql_query = """
SELECT ?company ?revenue ?period
WHERE {
?company rdf:type Organization .
?company reported_revenue ?revenue .
?revenue reported_in ?period .
}
"""
results = api.query_triples(sparql_query)
Query Structured Objects
# GraphQL query for extracted objects
response = api.structured_query(
question="Show all companies with revenue over 50 billion"
)
companies = response["data"]["companies"]
Combined Queries
# Natural language query across all extracted data
response = api.structured_query(
question="What products were mentioned with AI features?"
)
Advanced Configuration
Custom Extraction Rules
extraction_config = {
"rules": {
"financial_data": {
"patterns": ["revenue", "earnings", "profit", "sales"],
"extract_as": "FinancialMetric",
"include_context": True
},
"temporal": {
"patterns": ["Q[1-4] \\d{4}", "\\d{4}"],
"extract_as": "TimePeriod"
}
},
"confidence_threshold": 0.8,
"include_metadata": True
}
response = api.invoke_agent(
prompt="Extract financial information",
text=document_text,
config=extraction_config
)
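The temporal patterns above are ordinary regular expressions, so you can sanity-check them locally with Python's re module before adding them to the config:
import re
# Same patterns as in extraction_config, combined for a quick local test
temporal = re.compile(r"Q[1-4] \d{4}|\b\d{4}\b")
print(temporal.findall(document_text))  # e.g. ['Q4 2024']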
Extraction Validation
# Validate extracted data against schemas
validation_result = api.validate_extraction(
extraction_id=response["extraction_id"],
schemas=schema
)
if validation_result["valid"]:
print("Extraction passed validation")
else:
print(f"Validation errors: {validation_result['errors']}")
Best Practices
1. Clear Extraction Prompts
Provide specific instructions:
# Good: Specific and structured
prompt = """
Extract:
1. Company names and their executives
2. Financial metrics with time periods
3. Product names and features
Format as knowledge graph triples.
"""
# Avoid: Vague instructions
prompt = "Extract important information"
2. Schema Definition
Define schemas before extraction:
# Define clear schemas (same structure as in schema-based extraction above)
schemas = {
    "Person": {"fields": ["name", "title", "company"], "required": ["name"]},
    "Company": {"fields": ["name", "industry", "revenue"], "required": ["name"]},
    "Product": {"fields": ["name", "category", "features"], "required": ["name"]}
}
3. Incremental Processing
Process large documents in chunks:
chunks = split_document(large_document, chunk_size=1000)  # helper sketched below
previous_extractions = []
for chunk in chunks:
    response = api.invoke_agent(
        prompt="Extract entities",
        text=chunk,
        context=previous_extractions
    )
    previous_extractions.append(response)
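split_document is not part of the client API; a minimal character-based sketch follows (a production splitter would respect sentence or paragraph boundaries):
def split_document(text, chunk_size=1000, overlap=100):
    # Naive splitter with a small overlap so entities spanning a boundary survive
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks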
4. Validation and Quality Control
Always validate critical extractions:
# Set a quality threshold
if response["confidence"] < 0.7:
    # Request human review (flag_for_review is a placeholder; see sketch below)
    flag_for_review(response["extraction_id"])
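flag_for_review is a placeholder rather than a library call; a minimal sketch might append the ID to a local review queue:
import json
def flag_for_review(extraction_id, path="review_queue.jsonl"):
    # Append the extraction ID to a simple JSON-lines review queue
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"extraction_id": extraction_id}) + "\n")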
Monitoring Extraction
Check Extraction Status
# Get extraction details using the ID returned by invoke_agent
extraction_id = response["extraction_id"]
status = api.get_extraction_status(extraction_id)
print(f"Status: {status['state']}")
print(f"Progress: {status['progress']}%")
print(f"Triples extracted: {status['triple_count']}")
print(f"Objects extracted: {status['object_count']}")
View Extraction Logs
# Get detailed logs
logs = api.get_extraction_logs(extraction_id)
for log in logs:
print(f"{log['timestamp']}: {log['message']}")
Error Handling
try:
    response = api.invoke_agent(
        prompt="Extract data",
        text=document_text
    )
except ExtractionError as e:  # exception class assumed to be exported by the client library
    print(f"Extraction failed: {e.message}")
    # Fall back to a simpler extraction mode
    response = api.invoke_agent(
        prompt="Extract basic entities only",
        text=document_text,
        extraction_mode="simple"
    )
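For transient failures, a retry wrapper with exponential backoff is a common pattern; this sketch is not part of the client API and assumes ExtractionError covers retryable failures:
import time
def invoke_with_retry(attempts=3, base_delay=2.0, **kwargs):
    # Retry the agent call, doubling the delay after each failed attempt
    for attempt in range(attempts):
        try:
            return api.invoke_agent(**kwargs)
        except ExtractionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
response = invoke_with_retry(prompt="Extract data", text=document_text)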
See Also
- Object Extraction Process - Detailed object extraction
- Agent API - Agent API reference
- Structured Query Integration - Query extracted data
- Object Storage API - Store extracted objects