TrustGraph Metrics API

This API provides access to TrustGraph system metrics through a Prometheus proxy endpoint. It gives authenticated clients access to monitoring and observability data from TrustGraph system components.

Overview

The Metrics API is implemented as a proxy to a Prometheus metrics server, providing:

  • System performance metrics
  • Service health information
  • Resource utilization data
  • Request/response statistics
  • Error rates and latency metrics

Authentication

All metrics endpoints require Bearer token authentication:

Authorization: Bearer <your-api-token>

Unauthorized requests return HTTP 401.

Endpoint

Base Path: /api/metrics

Method: GET

Description: Proxies requests to the underlying Prometheus API

Usage Examples

Query Current Metrics

# Get all available metrics
curl -H "Authorization: Bearer your-token" \
  "http://api-gateway:8080/api/metrics/query?query=up"

# Get specific metric with time range
curl -H "Authorization: Bearer your-token" \
  "http://api-gateway:8080/api/metrics/query_range?query=cpu_usage&start=1640995200&end=1640998800&step=60"

# Get metric metadata
curl -H "Authorization: Bearer your-token" \
  "http://api-gateway:8080/api/metrics/metadata"

Common Prometheus API Endpoints

The Metrics API supports the standard Prometheus HTTP API endpoints, including those listed below (a short Python walkthrough follows the list):

Instant Queries

GET /api/metrics/query?query=<prometheus_query>

Range Queries

GET /api/metrics/query_range?query=<query>&start=<timestamp>&end=<timestamp>&step=<duration>

Metadata

GET /api/metrics/metadata
GET /api/metrics/metadata?metric=<metric_name>

Series

GET /api/metrics/series?match[]=<series_selector>

Label Values

GET /api/metrics/label/<label_name>/values

Targets

GET /api/metrics/targets
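
As a quick illustration of these endpoints, the sketch below calls the metadata, label-values, and series paths through the proxy from Python. The gateway address and token are placeholders; the paths are the ones listed above.

import requests

BASE = "http://api-gateway:8080/api/metrics"      # example gateway address
HEADERS = {"Authorization": "Bearer your-token"}  # replace with a real token

# Metric metadata exposed through the proxy
metadata = requests.get(f"{BASE}/metadata", headers=HEADERS, timeout=30).json()

# Values of the "job" label
jobs = requests.get(f"{BASE}/label/job/values", headers=HEADERS, timeout=30).json()

# Series matching a selector (note the literal "match[]" parameter name)
series = requests.get(
    f"{BASE}/series",
    headers=HEADERS,
    params={"match[]": "up"},
    timeout=30,
).json()

print(metadata.get("data"), jobs.get("data"), series.get("data"))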

Example Queries

System Health

# Check if services are up
curl -H "Authorization: Bearer token" \
  "http://api-gateway:8080/api/metrics/query?query=up"

# Get service uptime
curl -H "Authorization: Bearer token" \
  "http://api-gateway:8080/api/metrics/query?query=time()-process_start_time_seconds"

Performance Metrics

# CPU usage
curl -H "Authorization: Bearer token" \
  "http://api-gateway:8080/api/metrics/query?query=rate(cpu_seconds_total[5m])"

# Memory usage
curl -H "Authorization: Bearer token" \
  "http://api-gateway:8080/api/metrics/query?query=process_resident_memory_bytes"

# Request rate
curl -H "Authorization: Bearer token" \
  "http://api-gateway:8080/api/metrics/query?query=rate(http_requests_total[5m])"

TrustGraph-Specific Metrics

# Document processing rate
curl -H "Authorization: Bearer token" \
  "http://api-gateway:8080/api/metrics/query?query=rate(trustgraph_documents_processed_total[5m])"

# Knowledge graph size
curl -H "Authorization: Bearer token" \
  "http://api-gateway:8080/api/metrics/query?query=trustgraph_triples_count"

# Embedding generation rate
curl -H "Authorization: Bearer token" \
  "http://api-gateway:8080/api/metrics/query?query=rate(trustgraph_embeddings_generated_total[5m])"

Response Format

Responses follow the standard Prometheus API format:

Successful Query Response

{
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "up",
                    "instance": "api-gateway:8080",
                    "job": "trustgraph"
                },
                "value": [1640995200, "1"]
            }
        ]
    }
}

Range Query Response

{
    "status": "success", 
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {
                    "__name__": "cpu_usage",
                    "instance": "worker-1"
                },
                "values": [
                    [1640995200, "0.15"],
                    [1640995260, "0.18"],
                    [1640995320, "0.12"]
                ]
            }
        ]
    }
}

Error Response

{
    "status": "error",
    "errorType": "bad_data",
    "error": "invalid query syntax"
}
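
A minimal sketch of consuming these response shapes in Python: it checks the status field, raises on error responses, and extracts samples from either a vector (instant query) or matrix (range query) result. Field names follow the samples above; the helper name is illustrative.

def parse_prometheus_response(body):
    """Return a list of (metric_labels, samples) tuples from a query response."""
    if body.get("status") != "success":
        # Error responses carry errorType and error fields
        raise RuntimeError(f"{body.get('errorType')}: {body.get('error')}")

    data = body["data"]
    results = []
    for item in data["result"]:
        if data["resultType"] == "vector":
            # Instant query: a single [timestamp, value] pair
            samples = [item["value"]]
        else:
            # Range query ("matrix"): a list of [timestamp, value] pairs
            samples = item["values"]
        results.append((item["metric"], samples))
    return results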

Available Metrics

Standard System Metrics

  • up: Service availability (1 = up, 0 = down)
  • process_resident_memory_bytes: Memory usage
  • process_cpu_seconds_total: CPU time
  • http_requests_total: HTTP request count
  • http_request_duration_seconds: Request latency

TrustGraph-Specific Metrics

  • trustgraph_documents_processed_total: Documents processed count
  • trustgraph_triples_count: Knowledge graph triple count
  • trustgraph_embeddings_generated_total: Embeddings generated count
  • trustgraph_flow_executions_total: Flow execution count
  • trustgraph_pulsar_messages_total: Pulsar message count
  • trustgraph_errors_total: Error count by component (queried in the sketch below)
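
For example, the error counter can be combined with rate() and an aggregation. The sketch below assumes the errors metric carries a component label, which may differ in your deployment.

import requests

# Label name "component" is an assumption; adjust to match your metric labels
query = 'sum by (component) (rate(trustgraph_errors_total[5m]))'

response = requests.get(
    "http://api-gateway:8080/api/metrics/query",
    headers={"Authorization": "Bearer your-token"},
    params={"query": query},
    timeout=30,
)
print(response.json())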

Time Series Queries

Time Ranges

Use standard Prometheus duration formats for range selectors:

  • 5m: 5 minutes
  • 1h: 1 hour
  • 1d: 1 day
  • 1w: 1 week

Rate Calculations

# 5-minute rate
rate(metric_name[5m])

# Increase over time
increase(metric_name[1h])

Aggregations

# Sum across instances
sum(metric_name)

# Average by label
avg by (instance) (metric_name)

# Top 5 values
topk(5, metric_name)
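
Putting these pieces together, a range query for an aggregated request rate might look like the sketch below. The one-hour window and 60-second step are arbitrary examples.

import time
import requests

end = int(time.time())
start = end - 3600  # last hour

response = requests.get(
    "http://api-gateway:8080/api/metrics/query_range",
    headers={"Authorization": "Bearer your-token"},
    params={
        "query": "sum by (instance) (rate(http_requests_total[5m]))",
        "start": start,
        "end": end,
        "step": 60,  # one sample per minute
    },
    timeout=30,
)
print(response.json()["data"]["resultType"])  # "matrix"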

Integration Examples

Python Integration

import requests

def query_metrics(token, query):
    """Run an instant PromQL query through the metrics proxy."""
    headers = {"Authorization": f"Bearer {token}"}
    params = {"query": query}

    response = requests.get(
        "http://api-gateway:8080/api/metrics/query",
        headers=headers,
        params=params,
        timeout=30,
    )
    # Raise on HTTP-level errors (401, 400, 500, ...)
    response.raise_for_status()

    return response.json()

# Get system uptime
uptime = query_metrics("your-token", "time() - process_start_time_seconds")
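
A companion helper for range queries, following the same pattern; the parameters mirror the query_range endpoint shown earlier, and the step value is an example.

import time

def query_metrics_range(token, query, start, end, step=60):
    """Run a PromQL range query through the metrics proxy."""
    headers = {"Authorization": f"Bearer {token}"}
    params = {"query": query, "start": start, "end": end, "step": step}

    response = requests.get(
        "http://api-gateway:8080/api/metrics/query_range",
        headers=headers,
        params=params,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Last hour of request rate, one point per minute
now = int(time.time())
rates = query_metrics_range("your-token", "rate(http_requests_total[5m])", now - 3600, now)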

JavaScript Integration

async function queryMetrics(token, query) {
    const response = await fetch(
        `http://api-gateway:8080/api/metrics/query?query=${encodeURIComponent(query)}`,
        {
            headers: {
                'Authorization': `Bearer ${token}`
            }
        }
    );

    // Surface HTTP-level failures (e.g. 401, 400) before parsing the body
    if (!response.ok) {
        throw new Error(`Metrics query failed with HTTP ${response.status}`);
    }

    return await response.json();
}

// Get request rate
const requestRate = await queryMetrics('your-token', 'rate(http_requests_total[5m])');

Error Handling

Common HTTP Status Codes

  • 200: Success
  • 400: Bad request (invalid query)
  • 401: Unauthorized (invalid/missing token)
  • 422: Unprocessable entity (query execution error)
  • 500: Internal server error

Error Types

  • bad_data: Invalid query syntax
  • timeout: Query execution timeout
  • canceled: Query was canceled
  • execution: Query execution error
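
A minimal sketch of handling these cases in Python: HTTP-level failures surface as status codes, while query-level failures come back as errorType and error fields in the JSON body. Function name and exception choices are illustrative.

import requests

def safe_query(token, query):
    response = requests.get(
        "http://api-gateway:8080/api/metrics/query",
        headers={"Authorization": f"Bearer {token}"},
        params={"query": query},
        timeout=30,
    )

    if response.status_code == 401:
        raise PermissionError("Invalid or missing API token")
    if response.status_code in (400, 422):
        body = response.json()
        # errorType is one of: bad_data, timeout, canceled, execution
        raise ValueError(f"{body.get('errorType')}: {body.get('error')}")
    response.raise_for_status()  # any other non-2xx status

    return response.json()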

Best Practices

Query Optimization

  • Use appropriate time ranges to limit data volume
  • Apply label filters to reduce result sets
  • Use recording rules for frequently accessed metrics

Rate Limiting

  • Avoid high-frequency polling
  • Cache results when appropriate (see the caching sketch below)
  • Use appropriate step sizes for range queries
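
One way to avoid high-frequency polling is a small time-based cache in front of the query helper. The 30-second TTL below is an arbitrary example; tune it to how fresh the data needs to be.

import time

_cache = {}      # query string -> (fetched_at, result)
CACHE_TTL = 30   # seconds

def cached_query(token, query):
    now = time.time()
    hit = _cache.get(query)
    if hit and now - hit[0] < CACHE_TTL:
        return hit[1]  # serve the cached result

    result = query_metrics(token, query)  # helper from the Python example above
    _cache[query] = (now, result)
    return result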

Security

  • Keep API tokens secure
  • Use HTTPS in production
  • Rotate tokens regularly

Use Cases

  • System Monitoring: Track system health and performance
  • Capacity Planning: Monitor resource utilization trends
  • Alerting: Set up alerts based on metric thresholds
  • Performance Analysis: Analyze system performance over time
  • Debugging: Investigate issues using detailed metrics
  • Business Intelligence: Track document processing and knowledge extraction metrics