Google Cloud Platform Deployment

Description

Production GKE deployment with VertexAI Gemini integration and comprehensive GCP services

Difficulty

Intermediate

Duration

2 - 4 hr

You will need

  • GCP account with billing enabled (see below for setup)
  • gcloud CLI installed and configured
  • Pulumi installed locally
  • Node.js and npm for the Pulumi deployment code
  • kubectl command-line tool
  • Python 3.11+ for CLI tools
  • Basic command-line and Kubernetes familiarity

Goal

Deploy a production-ready TrustGraph environment on Google Kubernetes Engine with VertexAI Gemini integration using Infrastructure as Code.

Overview

This guide walks you through deploying TrustGraph on Google Cloud Platform’s Kubernetes Engine (GKE) using Pulumi (Infrastructure as Code). The deployment automatically provisions a production-ready Kubernetes cluster integrated with Google’s VertexAI services.

Pulumi is an open-source Infrastructure as Code tool that uses general-purpose programming languages (TypeScript/JavaScript in this case) to define cloud infrastructure. Unlike manual deployments, Pulumi provides:

  • Reproducible, version-controlled infrastructure
  • Testable and retryable deployments
  • Automatic resource dependency management
  • Simple rollback capabilities

Once deployed, you’ll have a complete TrustGraph stack running on GCP infrastructure with:

  • Google Kubernetes Engine (GKE) cluster (2-node pool, configurable)
  • VertexAI Gemini Flash 1.5 integration
  • Complete monitoring with Grafana and Prometheus
  • Web workbench for document processing and Graph RAG
  • Secure secrets management

Why Google Cloud Platform for TrustGraph?

GCP offers unique advantages for AI-focused organizations:

  • VertexAI Integration: Native access to Google’s Gemini models for state-of-the-art LLM capabilities
  • ML/AI Optimization: Purpose-built infrastructure for machine learning workloads
  • Global Infrastructure: Deploy across 40+ regions worldwide with Google’s network
  • Sustainability: Carbon-neutral operations with renewable energy commitment
  • Free Tier & Credits: $300 in free credits for new users to get started

Ideal for organizations requiring cutting-edge AI capabilities and ML/AI-optimized infrastructure.

Getting ready

GCP Account

You’ll need a GCP account with billing enabled. If you don’t have one:

  1. Sign up at https://cloud.google.com/
  2. Complete account verification
  3. Enable billing for your project
  4. New users receive $300 in free credits

Create a GCP Project

Create a dedicated project for TrustGraph:

  1. Navigate to the GCP Console
  2. Click on the project dropdown at the top
  3. Click New Project
  4. Enter a project name (e.g., trustgraph-prod)
  5. Note the Project ID - you’ll need this later

Enable Required APIs

Enable the necessary GCP APIs for your project:

gcloud services enable container.googleapis.com
gcloud services enable compute.googleapis.com
gcloud services enable aiplatform.googleapis.com
gcloud services enable iam.googleapis.com
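
To confirm that the four APIs above are active, one option is to compare them against the list of enabled services. The following is a minimal sketch: missing_apis is a hypothetical helper name, and it assumes that gcloud services list with the value(config.name) format prints one service name per line.

```shell
# APIs this guide enables, one per line.
REQUIRED_APIS="container.googleapis.com
compute.googleapis.com
aiplatform.googleapis.com
iam.googleapis.com"

# Read enabled service names on stdin (one per line) and print each
# required API that is absent from that list.
missing_apis() {
  enabled=$(cat)
  for api in $REQUIRED_APIS; do
    printf '%s\n' "$enabled" | grep -qxF "$api" || echo "$api"
  done
}

# Usage (assumes an authenticated gcloud with a default project set):
#   gcloud services list --enabled --format="value(config.name)" | missing_apis
```

No output means all four APIs are enabled; any name printed still needs gcloud services enable.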

Install gcloud CLI

Install the Google Cloud CLI:

Linux:

curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init

macOS:

brew install --cask google-cloud-sdk
gcloud init

Windows:

Download the installer from cloud.google.com/sdk/docs/install

Verify installation:

gcloud version

Configure gcloud Authentication

Authenticate with your GCP account:

gcloud auth login
gcloud auth application-default login

Set your default project:

gcloud config set project YOUR_PROJECT_ID

Python

You need Python 3.11 or later installed for the TrustGraph CLI tools.

Check your Python version:

python3 --version

If you need to install or upgrade Python, visit python.org.
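
If you want to script the version check, a small comparison function is enough. This is a sketch only: python_version_ok is a hypothetical name, and it assumes the version string has the form "Python X.Y.Z" as printed by python3 --version.

```shell
# Succeed if a "Python X.Y.Z" version string is 3.11 or later.
python_version_ok() {
  version=${1#Python }   # strip the leading "Python " if present
  major=${version%%.*}   # text before the first dot
  rest=${version#*.}     # text after the first dot
  minor=${rest%%.*}      # second version component
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 11 ]; }
}

# Usage:
#   python_version_ok "$(python3 --version)" && echo "Python is new enough"
```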

Pulumi

Install Pulumi on your local machine:

Linux:

curl -fsSL https://get.pulumi.com | sh

macOS:

brew install pulumi/tap/pulumi

Windows:

Download the installer from pulumi.com.

Verify installation:

pulumi version

Full installation details are at pulumi.com.

kubectl

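Install kubectl to manage your Kubernetes cluster, following the Kubernetes installation documentation for your platform. If you installed the gcloud CLI as a standalone SDK, one option is to run gcloud components install kubectl. Recent kubectl versions also need the gke-gcloud-auth-plugin component to authenticate against GKE.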

Verify installation:

kubectl version --client

Node.js

The Pulumi deployment code uses TypeScript/JavaScript, so you’ll need Node.js installed:

  • Download: nodejs.org (LTS version recommended)
  • Linux: sudo apt install nodejs npm (Ubuntu/Debian) or sudo dnf install nodejs (Fedora)
  • macOS: brew install node

Verify installation:

node --version
npm --version
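
With all the tools installed, a quick way to confirm nothing was missed is to check that each command is on your PATH. The sketch below uses a hypothetical helper named missing_tools; it only tests for presence, not for versions.

```shell
# Print the name of every listed command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: report any prerequisite from this guide that is not yet installed.
#   missing_tools gcloud pulumi kubectl python3 node npm
```

No output means every listed tool was found.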

VertexAI Access

The deployment uses Google’s VertexAI with Gemini Flash 1.5 as the default model. VertexAI is available in GCP projects with billing enabled once the aiplatform.googleapis.com API has been enabled (see Enable Required APIs above).

Available Gemini models include:

  • gemini-1.5-flash (fast, cost-effective)
  • gemini-1.5-pro (advanced reasoning)
  • gemini-1.0-pro (production-ready)

Prepare the deployment

Get the Pulumi code

Clone the TrustGraph GCP Pulumi repository:

git clone https://github.com/trustgraph-ai/pulumi-trustgraph-gke.git
cd pulumi-trustgraph-gke/pulumi

Install dependencies

Install the Node.js dependencies for the Pulumi project:

npm install

Configure GCP Project

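Set your GCP project ID for Pulumi. Pulumi configuration is stored per stack, so if you have not yet logged in and created a stack, complete the Configure Pulumi state and Create a Pulumi stack steps below first, then run: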

pulumi config set gcp:project YOUR_PROJECT_ID

Configure Pulumi state

You need to tell Pulumi where to store its state. You can store it in a cloud storage bucket (Pulumi supports backends such as Google Cloud Storage and S3), but for experimentation you can just use local state:

pulumi login --local

When storing secrets in the Pulumi state, Pulumi uses a secret passphrase to encrypt them. When using Pulumi in a production or shared environment, you should evaluate the security arrangements around secrets.

This guide sets the passphrase to the empty string, on the assumption that leaving secrets effectively unencrypted is acceptable for a development deployment:

export PULUMI_CONFIG_PASSPHRASE=
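
For anything beyond a throwaway deployment, you may prefer a random passphrase to an empty one. The following is a sketch of one way to generate it, not part of the TrustGraph repository; generate_passphrase is a hypothetical helper and it assumes /dev/urandom is available.

```shell
# Print 32 random bytes from /dev/urandom as 64 hexadecimal characters.
generate_passphrase() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -dc '0-9a-f'
  echo
}

# Usage:
#   export PULUMI_CONFIG_PASSPHRASE="$(generate_passphrase)"
```

Keep a copy of the generated value somewhere safe: Pulumi needs the same passphrase on every later run to decrypt the stack’s secrets.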

Create a Pulumi stack

Initialize a new Pulumi stack for your deployment:

pulumi stack init dev

You can use any name instead of dev - this helps you manage multiple deployments (dev, staging, prod, etc.).

Configure the stack

Apply settings for region, zone, and cluster configuration:

pulumi config set gcp:region us-central1
pulumi config set gcp:zone us-central1-a
pulumi config set clusterName trustgraph-gke
pulumi config set nodeCount 2

Available regions include:

  • us-central1 (Iowa, USA)
  • us-east1 (South Carolina, USA)
  • europe-west1 (Belgium)
  • europe-west4 (Netherlands)
  • asia-southeast1 (Singapore)
  • australia-southeast1 (Sydney)

Refer to GCP Regions for a complete list.
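
If you change the region, the zone should change with it, since GCP zone names are the region name plus a one-letter suffix (us-central1-a is in us-central1). A minimal sketch of a consistency check you could run before pulumi config set follows; zone_in_region is a hypothetical helper name.

```shell
# Succeed if a zone name such as us-central1-a belongs to the given region.
zone_in_region() {
  zone=$1
  region=$2
  # Drop the trailing "-<letter>" from the zone and compare with the region.
  [ "${zone%-*}" = "$region" ]
}

# Usage:
#   zone_in_region europe-west4-b europe-west4 && echo "zone matches region"
```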

Configure VertexAI

Set the VertexAI model and location:

pulumi config set vertexaiModel gemini-1.5-flash
pulumi config set vertexaiLocation us-central1

Refer to the repository’s README for additional configuration options.

Deploy with Pulumi

Preview the deployment

Before deploying, preview what Pulumi will create:

pulumi preview

This shows all the resources that will be created:

  • GKE Kubernetes cluster
  • Node pool with specified machine types
  • VPC network and subnets
  • Service accounts with VertexAI permissions
  • IAM roles and bindings
  • Kubernetes secrets for GCP credentials
  • TrustGraph deployments, services, and config maps

Review the output to ensure everything looks correct.

Deploy the infrastructure

Deploy the complete TrustGraph stack:

pulumi up

Pulumi will ask for confirmation before proceeding. Type yes to continue.

The deployment typically takes 10 - 15 minutes and progresses through these stages:

  1. Creating GKE cluster (6-8 minutes)
    • Provisions GKE cluster
    • Creates node pool
    • Configures VPC networking
  2. Configuring service accounts (1-2 minutes)
    • Creates service account
    • Sets up VertexAI permissions
    • Creates Kubernetes secrets
  3. Deploying TrustGraph (4-6 minutes)
    • Applies Kubernetes manifests
    • Deploys all TrustGraph services
    • Starts pods and initializes services

You’ll see output showing the creation progress of all resources.

Configure and verify kubectl access

After deployment completes, configure kubectl to access your GKE cluster:

gcloud container clusters get-credentials trustgraph-gke --zone us-central1-a

Verify access:

kubectl get nodes

You should see your GKE nodes listed as Ready.

Check pod status

Verify that all pods are running:

kubectl -n trustgraph get pods

You should see output similar to this (pod names will have different random suffixes):

NAME                                        READY   STATUS      RESTARTS   AGE
agent-manager-74fbb8b64-nzlwb               1/1     Running     0          5m
api-gateway-b6848c6bb-nqtdm                 1/1     Running     0          5m
cassandra-6765fff974-pbh65                  1/1     Running     0          5m
pulsar-d85499879-x92qv                      1/1     Running     0          5m
text-completion-58ccf95586-6gkff            1/1     Running     0          5m
workbench-ui-5fc6d59899-8rczf               1/1     Running     0          5m
...

All pods should show Running status. Some init pods (names ending in -init) may show a failed or Completed status. This is normal: their job is to initialise cluster resources and then exit.
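
With many pods in the list, it can be easier to filter for the ones that need attention. The sketch below assumes the default kubectl get pods column layout shown above, with STATUS in the third column; unhealthy_pods is a hypothetical helper name.

```shell
# Read `kubectl get pods` output on stdin and print the name and status of
# every pod that is neither Running nor Completed (the header is skipped).
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage:
#   kubectl -n trustgraph get pods | unhealthy_pods
```

Init pods that have failed will also appear in this output and, as noted above, can be ignored.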

Access services via port-forwarding

Since the Kubernetes cluster is running on GCP, you’ll need to set up port-forwarding to access TrustGraph services from your local machine.

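Open three separate terminal windows and run these commands (keep them running). The export KUBECONFIG lines assume a kubeconfig.yaml file exists in the current directory; if you configured kubectl with gcloud container clusters get-credentials as shown above, omit them so that kubectl uses your default configuration: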

Terminal 1 - API Gateway:

export KUBECONFIG=$(pwd)/kubeconfig.yaml
kubectl -n trustgraph port-forward svc/api-gateway 8088:8088

Terminal 2 - Workbench UI:

export KUBECONFIG=$(pwd)/kubeconfig.yaml
kubectl -n trustgraph port-forward svc/workbench-ui 8888:8888

Terminal 3 - Grafana:

export KUBECONFIG=$(pwd)/kubeconfig.yaml
kubectl -n trustgraph port-forward svc/grafana 3000:3000

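With these port-forwards running, you can access:

  • API Gateway: http://localhost:8088
  • Workbench UI: http://localhost:8888
  • Grafana: http://localhost:3000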

Keep these terminal windows open while you’re working with TrustGraph. If you close them, you’ll lose access to the services.
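
If you would rather not keep three terminals open, the same port-forwards can run as background jobs in one shell. This is a sketch under assumptions: the function names and the KUBECTL override variable are hypothetical, and kubectl is expected to already point at your GKE cluster.

```shell
# Command used to talk to the cluster; override it to dry-run the functions.
KUBECTL="${KUBECTL:-kubectl}"
FORWARD_PIDS=""

# Start the three port-forwards from this guide as background jobs and
# remember their process IDs.
start_port_forwards() {
  for spec in api-gateway:8088 workbench-ui:8888 grafana:3000; do
    svc=${spec%%:*}
    port=${spec##*:}
    "$KUBECTL" -n trustgraph port-forward "svc/$svc" "$port:$port" &
    FORWARD_PIDS="$FORWARD_PIDS $!"
  done
}

# Stop every port-forward started by start_port_forwards.
stop_port_forwards() {
  # Word splitting of the PID list is intentional here.
  [ -z "$FORWARD_PIDS" ] || kill $FORWARD_PIDS 2>/dev/null
  FORWARD_PIDS=""
}
```

Call start_port_forwards before working with the services and stop_port_forwards in the same shell when you are done.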

Install CLI tools

Now install the TrustGraph command-line tools. These tools help you interact with TrustGraph, load documents, and verify the system.

Create a Python virtual environment and install the CLI:

python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
pip install trustgraph-cli

Startup period

It can take 2-3 minutes for all services to stabilize after deployment. Services like Pulsar and Cassandra need time to initialize properly.

Verify system health

tg-verify-system-status

If everything is working, the output looks something like this:

============================================================
TrustGraph System Status Verification
============================================================

Phase 1: Infrastructure
------------------------------------------------------------
[00:00] ⏳ Checking Pulsar...
[00:03] ⏳ Checking Pulsar... (attempt 2)
[00:03] ✓ Pulsar: Pulsar healthy (0 cluster(s))
[00:03] ⏳ Checking API Gateway...
[00:03] ✓ API Gateway: API Gateway is responding

Phase 2: Core Services
------------------------------------------------------------
[00:03] ⏳ Checking Processors...
[00:03] ✓ Processors: Found 34 processors (≥ 15)
[00:03] ⏳ Checking Flow Classes...
[00:06] ⏳ Checking Flow Classes... (attempt 2)
[00:09] ⏳ Checking Flow Classes... (attempt 3)
[00:22] ⏳ Checking Flow Classes... (attempt 4)
[00:35] ⏳ Checking Flow Classes... (attempt 5)
[00:38] ⏳ Checking Flow Classes... (attempt 6)
[00:38] ✓ Flow Classes: Found 9 flow class(es)
[00:38] ⏳ Checking Flows...
[00:38] ✓ Flows: Flow manager responding (1 flow(s))
[00:38] ⏳ Checking Prompts...
[00:38] ✓ Prompts: Found 16 prompt(s)

Phase 3: Data Services
------------------------------------------------------------
[00:38] ⏳ Checking Library...
[00:38] ✓ Library: Library responding (0 document(s))

Phase 4: User Interface
------------------------------------------------------------
[00:38] ⏳ Checking Workbench UI...
[00:38] ✓ Workbench UI: Workbench UI is responding

============================================================
Summary
============================================================
Checks passed: 8/8
Checks failed: 0/8
Total time: 00:38

✓ System is healthy!

The Checks failed line is the one to watch and should be zero. If you are having issues, see the Troubleshooting section later in this guide.
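
Because services can take a few minutes to stabilize, you may want to repeat the check until it passes. The sketch below is a generic retry helper; retry is a hypothetical name, and it assumes tg-verify-system-status exits with a non-zero status when checks fail, which this guide does not confirm.

```shell
# Run a command until it succeeds, up to a maximum number of attempts,
# sleeping between attempts. Returns non-zero if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  until "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage (assumption: the tool signals failure through its exit status):
#   retry 10 30 tg-verify-system-status
```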

If everything appears to be working, the following parts of the deployment guide are a whistle-stop tour through various parts of the system.

Test LLM access

Test that VertexAI Gemini integration is working by invoking the LLM through the gateway:

tg-invoke-llm 'Be helpful' 'What is 2 + 2?'

You should see output like:

2 + 2 = 4

This confirms that TrustGraph can successfully communicate with Google’s VertexAI service.

Load sample documents

Load a small set of sample documents into the library for testing:

tg-load-sample-documents

This downloads documents from the internet and caches them locally. The download can take a little time to run.

Workbench

TrustGraph includes a web interface for document processing and Graph RAG.

Access the TrustGraph workbench at http://localhost:8888 (requires port-forwarding to be running).

By default, there are no credentials.

You should be able to navigate to the Flows tab and see a single default flow running. The guide will return to the workbench to load a document.

Monitoring dashboard

Access Grafana monitoring at http://localhost:3000 (requires port-forwarding to be running).

Default credentials:

  • Username: admin
  • Password: admin

All TrustGraph components expose metrics that Prometheus collects and Grafana displays. The Grafana deployment is configured with 2 dashboards:

  • Overview metrics dashboard: Shows processing metrics
  • Logs dashboard: Shows collated TrustGraph container logs

For a newly launched system, the metrics won’t be particularly interesting yet.

Check the LLM is working

Back in the workbench, select the Assistant tab.

In the top line, next to the word Assistant, change the mode to Basic LLM.

Enter a question in the prompt box at the bottom of the tab and press Send. If everything works, after a short period you should see a response to your query.

[Figure: Simple LLM usage]

If LLM interactions are not working, check the Grafana logs dashboard for errors in the text-completion service.

Working with a document

Load a document

Back in the workbench:

  1. Navigate to the Library page
  2. In the upper right-hand corner, there is a dark/light mode widget. To its left is a selector widget. Ensure the top and bottom lines say “default”. If not, click on the widget and change them.
  3. On the library tab, select a document (e.g., “Beyond State Vigilance”)
  4. Click Submit on the action bar
  5. Choose a processing flow (use Default processing flow)
  6. Click Submit to process

Beyond State Vigilance is a relatively short document, so it’s a good one to start with.

Select the Vector Search tab. Enter a string (e.g., “document”) in the search bar and hit RETURN. The search term doesn’t matter a great deal. If information has started to load, you should see some search results.

The vector search attempts to find up to 10 terms which are the closest matches for your search term. It does this even if the search terms are not a strong match, so this is a simple way to observe whether data has loaded.

[Figure: Vector search results]

Look at knowledge graph

Click on one of the Vector Search result terms on the left-hand side. This shows the relationships in the knowledge graph that link to that term.

[Figure: Relationships view]

You can then click on the Graph view button to go to a 3D view of the discovered relationships.

Query with Graph RAG

  1. Navigate to Assistant tab
  2. Change the Assistant mode to GraphRAG
  3. Enter your question (e.g., “What is this document about?”)
  4. You will see the answer to your question after a short period

Troubleshooting

Deployment Issues

Pulumi deployment fails

Diagnosis:

Check the Pulumi error output for specific failure messages, and inspect the stack’s resources and logs for more detail:

# View detailed error information
pulumi stack --show-urns
pulumi logs

Resolution:

  • Authentication errors: Verify gcloud auth application-default login was run and your project ID is correct
  • API not enabled: Ensure all required GCP APIs are enabled (see “Enable Required APIs” section)
  • Quota limits: Check your GCP project hasn’t hit resource quotas (GKE clusters, CPUs, IP addresses)
  • Billing not enabled: Verify billing is enabled for your GCP project
  • Permissions: Ensure your account has Owner or Editor role on the project

Pods stuck in Pending state

Diagnosis:

kubectl -n trustgraph get pods | grep Pending
kubectl -n trustgraph describe pod <pod-name>

Look for scheduling failures or resource constraints in the describe output.

Resolution:

  • Insufficient resources: Increase node count or machine type in your Pulumi configuration
  • PersistentVolume issues: Check PV/PVC status with kubectl -n trustgraph get pv,pvc
  • Node issues: Check node status with kubectl get nodes
  • Resource quotas: Verify your GCP project hasn’t hit CPU or memory quotas

VertexAI integration not working

Diagnosis:

Test LLM connectivity:

tg-invoke-llm '' 'What is 2+2'

A timeout or error indicates VertexAI configuration issues. Check the text-completion pod logs:

kubectl -n trustgraph logs -l app=text-completion

Resolution:

  • Verify VertexAI API is enabled: gcloud services list --enabled | grep aiplatform
  • Check service account has VertexAI permissions: gcloud projects get-iam-policy YOUR_PROJECT_ID
  • Ensure the Gemini model is available in your selected region
  • Review Pulumi outputs to confirm VertexAI configuration: pulumi stack output
  • Verify billing is enabled (VertexAI requires active billing)

Port-forwarding connection issues

Diagnosis:

Port-forward commands fail or connections time out.

Resolution:

  • Verify kubectl is configured: kubectl config current-context
  • Check that the target service exists: kubectl -n trustgraph get svc
  • Ensure no other process is using the port (e.g., port 8088, 8888, or 3000)
  • Try restarting the port-forward with verbose logging: kubectl port-forward -v=6 ...
  • Check GKE cluster connectivity: gcloud container clusters describe trustgraph-gke --zone us-central1-a

Service Failure

Pods in CrashLoopBackOff

Diagnosis:

# Find crashing pods
kubectl -n trustgraph get pods | grep CrashLoopBackOff

# View logs from crashed container
kubectl -n trustgraph logs <pod-name> --previous
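
When several pods are crashing, the two steps above can be combined so that each crashing pod gets its own log command. This is a sketch; crash_log_commands is a hypothetical helper, and it assumes the default kubectl get pods layout with STATUS in the third column.

```shell
# Read `kubectl get pods` output on stdin and print, for each pod in
# CrashLoopBackOff, the command that shows its previous container's logs.
crash_log_commands() {
  awk '$3 == "CrashLoopBackOff" {
    print "kubectl -n trustgraph logs " $1 " --previous"
  }'
}

# Usage:
#   kubectl -n trustgraph get pods | crash_log_commands
```

The function only prints the commands, so you can review them before running any.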

Resolution:

Check the logs to identify why the container is crashing. Common causes:

  • Application errors (configuration issues)
  • Missing dependencies (ensure all required services are running)
  • Incorrect secrets or environment variables
  • Resource limits too low
  • GCP credentials not properly configured

Service not responding

Diagnosis:

Check service and pod status:

kubectl -n trustgraph get svc
kubectl -n trustgraph get pods
kubectl -n trustgraph logs <pod-name>

Resolution:

  • Verify the pod is running and ready
  • Check pod logs for errors
  • Ensure port-forwarding is active for the service
  • Use tg-verify-system-status to check overall system health
  • Check GKE cluster health: gcloud container clusters describe trustgraph-gke --zone us-central1-a

GCP-Specific Issues

GKE cluster creation fails

Diagnosis:

Check GCP quota and permissions:

gcloud compute project-info describe --project=YOUR_PROJECT_ID

Resolution:

  • Request quota increases if needed via GCP Console
  • Verify your account has roles/container.admin permission
  • Check if the zone has available capacity
  • Try a different zone or region

VertexAI quota exceeded

Diagnosis:

Error messages about VertexAI quota or rate limits.

Resolution:

  • Check VertexAI quotas in GCP Console under “IAM & Admin” → “Quotas”
  • Request quota increases if needed
  • Switch to a different Gemini model with higher quotas
  • Implement rate limiting in your application

Shutting down

Clean shutdown

When you’re finished with your TrustGraph deployment, clean up all resources:

pulumi destroy

Pulumi will show you all the resources that will be deleted and ask for confirmation. Type yes to proceed.

The destruction process typically takes 8-12 minutes and removes:

  • All TrustGraph Kubernetes resources
  • The GKE cluster
  • Node pools
  • Service accounts and IAM bindings
  • VPC network resources (if created)
  • All associated storage

Cost Warning: GCP charges for running GKE clusters and node instances. Make sure to destroy your deployment when you’re not using it to avoid unnecessary costs. GKE charges include cluster management fees plus compute costs.

Verify cleanup

After pulumi destroy completes, verify all resources are removed:

# Check Pulumi stack status
pulumi stack

# Verify no resources remain
pulumi stack --show-urns

# Check GCP for remaining resources
gcloud container clusters list
gcloud compute instances list
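
To turn the cluster check into a yes/no answer, you can test whether the cluster name still appears in the list. The sketch below uses a hypothetical helper named cluster_is_gone and assumes gcloud container clusters list with the value(name) format prints one cluster name per line.

```shell
# Read cluster names on stdin (one per line) and succeed only if the given
# cluster name is not among them.
cluster_is_gone() {
  ! grep -qxF "$1"
}

# Usage:
#   gcloud container clusters list --format="value(name)" \
#     | cluster_is_gone trustgraph-gke && echo "cluster removed"
```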

Delete the Pulumi stack

If you’re completely done with this deployment, you can remove the Pulumi stack:

pulumi stack rm dev

This removes the stack’s state but does not delete any cloud resources, so run pulumi destroy first.

Cost Optimization

Monitor Costs

Keep track of your GCP spending:

  1. Navigate to Billing in GCP Console
  2. View cost breakdown by service
  3. Set up budget alerts

Cost-Saving Tips

  • Preemptible Nodes: Use preemptible VMs for non-production workloads (60-90% cheaper)
  • Autoscaling: Configure cluster autoscaling to scale down during idle periods
  • Resource Requests: Set appropriate CPU/memory requests to avoid over-provisioning
  • Committed Use Discounts: For long-term deployments, purchase committed use contracts
  • Regional vs Zonal: Use zonal clusters instead of regional ones for lower costs, at the expense of high availability

Example cost estimates (us-central1):

  • Cluster management fee: $0.10/hour (~$73/month)
  • 2 x n1-standard-2 nodes: ~$100/month
  • VertexAI API calls: Pay per use (varies by model and usage)
  • Total estimated: ~$180-250/month for basic deployment
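
The monthly figures above follow from hourly prices. As a worked example, the sketch below converts an hourly rate to a monthly one, assuming 730 hours per month (8760 hours per year divided by 12); monthly_cost is a hypothetical helper name.

```shell
# Convert an hourly price in USD to an approximate monthly price,
# assuming 730 hours per month.
monthly_cost() {
  awk -v hourly="$1" 'BEGIN { printf "%.2f\n", hourly * 730 }'
}

monthly_cost 0.10   # cluster management fee from the list above; prints 73.00
```

The same function can be applied to current node pricing for your region to refresh the estimate.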

Next Steps

Now that you have TrustGraph running on GCP:

  • Guides: See Guides for things you can do with your running TrustGraph
  • Scale the cluster: Configure GKE autoscaling or increase node pool size
  • Production hardening: Set up Cloud Armor, Cloud NAT, and private GKE cluster
  • Integrate GCP services: Connect to Cloud Storage, BigQuery, or Cloud SQL
  • CI/CD: Set up Cloud Build for automated deployments
  • Monitoring: Integrate with Cloud Monitoring and Cloud Logging
  • Multi-region: Deploy across multiple GCP regions for high availability
  • Advanced VertexAI: Explore other Gemini models or fine-tuning options

Additional Resources