Monitoring

Description

Monitor TrustGraph deployments using Prometheus and Grafana

Difficulty

Intermediate

Duration

30 min

You will need

  • A running TrustGraph deployment

Goal

Access metrics, logs, and dashboards to monitor TrustGraph system health and performance.

Observability Stack

TrustGraph deployments include a complete observability stack for monitoring system health and performance.

Components:

  • Prometheus - Time-series metrics database that collects and stores metrics from all TrustGraph services
  • Grafana - Visualization platform providing dashboards for metrics and logs
  • Loki - Log aggregation system that collects logs from TrustGraph components

What’s monitored:

TrustGraph components expose metrics and logs automatically. The monitoring stack currently captures:

  • All TrustGraph processing components (flows, queues, processors)
  • API gateway request/response metrics
  • Pulsar message queue statistics
  • System resource usage

Note: Infrastructure components (Cassandra, Pulsar, etc.) are not yet integrated into the monitoring stack; integration may be added in future releases.

Accessing Grafana

Grafana provides the primary interface for viewing metrics and logs.

Access Grafana at http://localhost:3000 (or your deployment hostname).

Default credentials:

  • Username: admin
  • Password: admin
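Before logging in, it can help to confirm that Grafana is reachable. The sketch below uses Grafana's `/api/health` endpoint and assumes the default address shown above. The function names are illustrative and not part of TrustGraph.

```python
import json
import urllib.request


def grafana_health_url(base_url: str) -> str:
    """Build the URL of Grafana's health endpoint from a base URL."""
    return base_url.rstrip("/") + "/api/health"


def is_database_ok(body: bytes) -> bool:
    """Return True if a health response body reports database == "ok"."""
    payload = json.loads(body)
    return payload.get("database") == "ok"


def check_grafana(base_url: str = "http://localhost:3000",
                  timeout: float = 5.0) -> bool:
    """Fetch the health endpoint and report whether Grafana looks healthy.

    Assumes Grafana is reachable at base_url; adjust for your deployment.
    """
    url = grafana_health_url(base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status == 200 and is_database_ok(resp.read())
```

Calling `check_grafana()` against a running deployment should return `True` once Grafana has started. A connection error usually means the service is not up yet or is exposed on a different hostname.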

Figure: Grafana overview screen

The home screen shows:

  • Recent dashboards
  • Starred dashboards
  • Navigation to logs and metrics

Available Dashboards

Grafana includes pre-configured dashboards for monitoring TrustGraph:

Figure: Grafana dashboards list

TrustGraph Dashboard - Main monitoring dashboard showing:

  • Flow processing rates
  • Queue depths and throughput
  • API request metrics
  • System resource usage
  • Processing latency

Custom Dashboards - Create additional dashboards for:

  • Specific flow instances
  • Document processing metrics
  • LLM usage and costs
  • Custom business metrics

Viewing Logs

Loki collects logs from all TrustGraph components, providing centralized log access.

Figure: TrustGraph logs

Log sources:

  • Processing flows (document-load, graph-rag, etc.)
  • API gateway requests
  • Initialization services
  • Error messages and stack traces

Query logs:

  • Use the Explore interface in Grafana
  • Filter by component, log level, or time range
  • Search log content for debugging

Current limitation: Only TrustGraph components send logs to Loki. Infrastructure components (Cassandra, Pulsar) log to their own systems.
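Logs can also be fetched outside Grafana through Loki's HTTP API. The sketch below builds a LogQL query and a `query_range` request URL, then extracts log lines from a response body. Two things here are assumptions rather than documented TrustGraph settings: the Loki address (`http://localhost:3100` is Loki's usual default port) and the `job` label name. Check the labels shown in Grafana's Explore view for the ones your deployment uses.

```python
import json
import time
import urllib.parse


def logql_selector(labels: dict, contains: str = "") -> str:
    """Build a LogQL stream selector with an optional substring filter."""
    # json.dumps produces a double-quoted, escaped string literal.
    matchers = ", ".join(f"{k}={json.dumps(v)}" for k, v in sorted(labels.items()))
    query = "{" + matchers + "}"
    if contains:
        query += " |= " + json.dumps(contains)
    return query


def query_range_url(base_url: str, query: str, minutes: int = 15,
                    limit: int = 100, now_ns=None) -> str:
    """Build a /loki/api/v1/query_range URL covering the last `minutes`."""
    end_ns = time.time_ns() if now_ns is None else now_ns
    start_ns = end_ns - minutes * 60 * 1_000_000_000
    params = urllib.parse.urlencode(
        {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


def extract_lines(body: bytes) -> list:
    """Flatten a Loki 'streams' response into a list of log lines."""
    payload = json.loads(body)
    lines = []
    for stream in payload["data"]["result"]:
        for _timestamp, line in stream["values"]:
            lines.append(line)
    return lines


# Hypothetical label value; replace with a label your deployment exposes.
example_query = logql_selector({"job": "trustgraph"}, contains="error")
```

Passing the URL from `query_range_url("http://localhost:3100", example_query)` to `urllib.request.urlopen` and feeding the response body to `extract_lines` yields the matching log lines as plain strings.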

TrustGraph Dashboard

The main TrustGraph dashboard provides comprehensive system monitoring:

Figure: TrustGraph dashboard part 1

Top section shows:

  • Knowledge backlog - backlog on knowledge extraction queues
  • Graph and triple load backlog
  • Latency through LLM as a heatmap
  • Error rates

Figure: TrustGraph dashboard part 2

Middle section displays:

  • Request rates per queue
  • Pub/sub queue backlogs
  • Histogram of chunk sizes
  • Count of rate-limit events

Figure: TrustGraph dashboard part 3

Bottom section includes:

  • Resource utilization (CPU, memory)
  • List of models in use + token counts
  • Token usage
  • Token cost (based on token cost configuration)

Using Prometheus

Prometheus provides direct access to raw metrics data for custom queries and analysis.

Access Prometheus at http://localhost:9090.

Figure: Prometheus web UI

Use Prometheus to:

  • Execute custom PromQL queries
  • Explore available metrics
  • Test alert expressions
  • Analyze time-series data
  • Debug metric collection

Example queries:

# Messages processed per second, averaged over the last minute
rate(trustgraph_messages_processed_total[1m])

# Queue depth for a specific flow
trustgraph_queue_depth{flow="default"}

# API request latency (95th percentile)
histogram_quantile(0.95, rate(trustgraph_api_duration_seconds_bucket[5m]))
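The same queries can be run from a script through the Prometheus HTTP API (`/api/v1/query`). The sketch below assumes Prometheus is reachable at the address given above. The helper names are illustrative, and the metric name is taken from the example queries, so verify it against the metrics your deployment exposes.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_vector(body: bytes) -> list:
    """Turn an instant-vector response into [(labels, value), ...]."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...}, "value": [unix_ts, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def run_query(promql: str, base_url: str = "http://localhost:9090",
              timeout: float = 10.0) -> list:
    """Execute a PromQL instant query and return the parsed samples."""
    url = instant_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(resp.read())
```

For example, `run_query('trustgraph_queue_depth{flow="default"}')` returns one `(labels, value)` pair per matching series, or an empty list if no series matches.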

Monitoring Best Practices

Regular checks:

  • Monitor queue depths - growing queues indicate processing bottlenecks
  • Track error rates - spikes suggest configuration or resource issues
  • Watch processing latency - a sustained rise translates directly into slower responses
  • Review logs for errors - catch issues before they impact users
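One way to make "growing queues" concrete is to fit a straight line through recent queue-depth samples and look at its slope. The sketch below is a minimal illustration of that idea using least squares. The sample format, a list of (Unix timestamp, value) pairs, mirrors what a Prometheus range query returns after the values are converted to floats. The function names and the default threshold are assumptions.

```python
def slope_per_second(samples: list) -> float:
    """Least-squares slope of value over time for [(unix_ts, value), ...]."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("timestamps must not all be equal")
    return numerator / denominator


def is_backlog_growing(samples: list, min_slope: float = 0.0) -> bool:
    """Report whether the queue depth trends upward faster than min_slope."""
    return slope_per_second(samples) > min_slope


# Queue depth rising by 60 messages per minute: slope is 1 message/second.
rising = [(0, 10.0), (60, 70.0), (120, 130.0)]
print(slope_per_second(rising))  # 1.0
```

PromQL's `deriv()` function applies the same linear-regression approach to a gauge on the server side, which is usually preferable once the check moves into a dashboard or alert.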

Set up alerts for:

  • Queue depth exceeding thresholds
  • Error rates above acceptable levels
  • Processing failures
  • Resource exhaustion (CPU, memory, disk)
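Threshold alerts tend to be noisy if they fire on a single bad sample. The sketch below shows one common remedy: only raise an alert when the threshold is exceeded for several consecutive samples. This is similar in spirit to the `for` clause of a Prometheus alerting rule. The function name and default values are illustrative assumptions.

```python
def sustained_breach(values: list, limit: float, min_consecutive: int = 3) -> bool:
    """Return True if `values` exceeds `limit` for min_consecutive samples in a row."""
    run = 0
    for value in values:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value > limit else 0
        if run >= min_consecutive:
            return True
    return False


# A single spike does not trigger; a sustained breach does.
print(sustained_breach([10, 500, 12, 11], limit=100))      # False
print(sustained_breach([10, 500, 520, 610], limit=100))    # True
```

Choosing `min_consecutive` is a trade-off: larger values suppress short spikes but delay detection of real problems by that many scrape intervals.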

Performance tuning:

  • Use metrics to identify slow processors
  • Balance flow processing resources
  • Optimize queue configurations
  • Scale components based on load patterns

Troubleshooting with Metrics

Queue backlog growing:

  • Check processor resource limits
  • Verify flow configuration
  • Look for processing errors in logs
  • Consider scaling processors

High error rates:

  • Filter logs by error level
  • Identify failing components
  • Check resource availability
  • Review recent configuration changes

Slow API responses:

  • Check API gateway metrics
  • Review queue processing times
  • Verify LLM response latency
  • Examine database query performance

Resource exhaustion:

  • Monitor CPU and memory usage
  • Identify resource-hungry components
  • Adjust container limits
  • Scale horizontally if needed

Next Steps

  • Configure alert rules in Prometheus
  • Create custom Grafana dashboards for your workflows
  • Export metrics to external monitoring systems
  • Set up long-term metric retention
  • Integrate infrastructure component metrics