The Runner: Terraform Multi-Cloud Provisioning

The Runner is the workhorse of PipeOps. When someone clicks “Create Server,” the Runner is what actually makes it happen.

It’s a Go service that provisions infrastructure across multiple cloud providers, manages state, handles failures, and streams logs back in real time. The same Runner also handles application deployments once infrastructure is provisioned.

The Flow

Controller receives “provision cluster” request
Validates params (region, instance types, node count)
Queues job in RabbitMQ
Runner picks up job from queue
Loads our custom infrastructure modules
Injects customer-specific variables
Executes provisioning workflow
Streams output back to the user in real time
Extracts outputs (cluster endpoints, credentials)
Updates database with results
Triggers k8-agent deployment
Marks job complete
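
The Runner-side part of this flow is essentially an ordered list of steps that stops at the first failure. Here is a minimal Go sketch of that idea. It is not the Runner's actual code: the `Job`, `Step`, and `RunSteps` names and the `ErrQuota` failure are made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrQuota is a hypothetical failure used in the demo below.
var ErrQuota = errors.New("quota exceeded")

// Job carries what each step reads or fills in (field names are illustrative).
type Job struct {
	ClusterName string
	Outputs     map[string]string
}

// Step is one named stage of the provisioning flow.
type Step struct {
	Name string
	Run  func(*Job) error
}

// RunSteps executes steps in order and stops at the first failure.
// It returns the names of the steps that completed, so the caller
// knows exactly where the flow stopped.
func RunSteps(job *Job, steps []Step) ([]string, error) {
	var completed []string
	for _, s := range steps {
		if err := s.Run(job); err != nil {
			return completed, fmt.Errorf("step %q failed: %w", s.Name, err)
		}
		completed = append(completed, s.Name)
	}
	return completed, nil
}

// ok is a placeholder step that always succeeds.
func ok(*Job) error { return nil }

func main() {
	job := &Job{ClusterName: "customer-production", Outputs: map[string]string{}}
	steps := []Step{
		{"load modules", ok},
		{"inject variables", ok},
		{"execute workflow", func(*Job) error { return ErrQuota }},
		{"extract outputs", ok},
	}
	completed, err := RunSteps(job, steps)
	fmt.Println("completed:", completed)
	fmt.Println("error:", err)
}
```

Keeping the step name in the error is what later makes it possible to tell a user which step failed.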

Average time: 10-15 minutes for a new cluster. Faster for updates.

Multi-Cloud Complexity

Supporting AWS, GCP, Azure, DigitalOcean, and Linode meant handling each provider’s quirks:

AWS: VPCs, subnets, security groups, IAM roles. Everything needs tagging. EKS takes 12 minutes to provision.

GCP: Networks, subnetworks, firewall rules, service accounts. GKE is faster (8 minutes) but has weird networking defaults.

Azure: Resource groups, virtual networks, NSGs, managed identities. AKS takes forever (15 minutes). Their API is slow.

DigitalOcean: Simple but limited. K8s setup is quick (6 minutes) but features are basic.

Linode: Similar to DO. Fast but fewer regions.

The Runner abstracts all this. User picks “AWS, us-east-1, 3 nodes” and we handle the rest.
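
One common way to build this kind of abstraction in Go is a small interface with one implementation per provider. The sketch below is an assumption about the shape, not the Runner's real types: `Provider`, `ClusterRequest`, `Plan`, and the map keys are hypothetical, and only three providers are wired up to keep it short. The resource lists come from the descriptions above.

```go
package main

import "fmt"

// ClusterRequest is the small set of choices the user makes.
type ClusterRequest struct {
	Provider  string
	Region    string
	NodeCount int
}

// Provider hides one cloud's building blocks behind a common interface.
type Provider interface {
	// Resources lists what must exist before a cluster is usable.
	Resources() []string
}

type awsProvider struct{}
type gcpProvider struct{}
type azureProvider struct{}

func (awsProvider) Resources() []string {
	return []string{"VPC", "subnets", "security groups", "IAM roles", "EKS cluster"}
}

func (gcpProvider) Resources() []string {
	return []string{"network", "subnetworks", "firewall rules", "service accounts", "GKE cluster"}
}

func (azureProvider) Resources() []string {
	return []string{"resource group", "virtual network", "NSGs", "managed identities", "AKS cluster"}
}

// providers maps the user's choice to an implementation (keys are assumed).
var providers = map[string]Provider{
	"aws":   awsProvider{},
	"gcp":   gcpProvider{},
	"azure": azureProvider{},
}

// Plan resolves a request into the ordered list of things to create.
func Plan(req ClusterRequest) ([]string, error) {
	p, ok := providers[req.Provider]
	if !ok {
		return nil, fmt.Errorf("unsupported provider %q", req.Provider)
	}
	if req.NodeCount < 1 {
		return nil, fmt.Errorf("node count must be at least 1, got %d", req.NodeCount)
	}
	return p.Resources(), nil
}

func main() {
	plan, err := Plan(ClusterRequest{Provider: "aws", Region: "us-east-1", NodeCount: 3})
	fmt.Println(plan, err)
}
```

Adding a provider then means adding one type and one map entry, while callers keep using `Plan`.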

State Management

With hundreds of customers and thousands of clusters, infrastructure state management is critical.

For BYOC (Bring Your Own Cloud) deployments, each cluster’s state is stored in your cloud provider account. You bring your AWS/GCP/Azure account, we provision infrastructure in it, and state lives in your bucket.

Each BYOC cluster gets:

Separate state file in your S3/GCS bucket
State locking mechanisms (DynamoDB for AWS, native for GCS)
Automated backups every hour
Versioning enabled (can roll back if needed)

State bucket structure in your account:

s3://customer-state-bucket/
  cluster-production/
    state.tf
    backups/
      state.2024-10-31.tf
  cluster-staging/
    state.tf
    backups/
      state.2024-10-31.tf
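
To make the layout concrete, here is a small Go sketch that builds the object keys shown above. `StateKey` and `BackupKey` are hypothetical helper names. The layout only shows one backup name per day, and how hourly backups are named is not stated, so this keeps the daily form.

```go
package main

import (
	"fmt"
	"path"
	"time"
)

// exampleDay matches the date used in the layout above.
var exampleDay = time.Date(2024, time.October, 31, 0, 0, 0, 0, time.UTC)

// StateKey returns the object key of a cluster's current state file.
func StateKey(cluster string) string {
	return path.Join("cluster-"+cluster, "state.tf")
}

// BackupKey returns the object key of a cluster's backup for the given day.
func BackupKey(cluster string, day time.Time) string {
	name := "state." + day.UTC().Format("2006-01-02") + ".tf"
	return path.Join("cluster-"+cluster, "backups", name)
}

func main() {
	fmt.Println(StateKey("production"))           // cluster-production/state.tf
	fmt.Println(BackupKey("staging", exampleDay)) // cluster-staging/backups/state.2024-10-31.tf
}
```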

For PipeOps-managed infrastructure (Nova multi-tenant clusters), we handle state internally with the same backup and versioning guarantees.

If state gets corrupted (it happens), we restore from backup automatically.

Infrastructure Modules

We don’t provision infrastructure from scratch for each cluster. The Runner uses a modular approach with well-tested components:

Kubernetes Module: Core cluster provisioning (EKS, GKE, AKS)
Network Module: VPC, subnets, routing, security groups
Essentials Module: Ingress controllers, cert-manager, monitoring stack
Orchestration Layer: Combines modules into complete infrastructure stacks

These modules are versioned and locked to prevent breaking changes. Updates are tested in staging environments before gradual rollout to production clusters.

Error Handling

Infrastructure provisioning fails. A lot. Our error handling:

Cloud API failures: Retry with exponential backoff (up to 5 times)
Quota limits: Surface clear error with fix
Invalid config: Validate before execution
Partial failures: Mark what succeeded, offer cleanup or continue
Process crashes: Capture logs, save state, alert ops team
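
The retry rule for cloud API failures can be sketched in a few lines of Go. This is an illustration under assumptions, not the Runner's implementation: it treats "up to 5 times" as five attempts in total, doubles the delay after each failed attempt, and uses two made-up sentinel errors (`ErrTransient`, `ErrQuota`) to separate failures worth retrying from ones that should surface immediately.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinels: only ErrTransient is worth retrying.
var (
	ErrTransient = errors.New("transient cloud API failure")
	ErrQuota     = errors.New("quota limit reached")
)

// Retry calls fn until it succeeds, fails with a non-transient error,
// or maxAttempts is reached. The wait doubles after every failed attempt.
// It returns the number of attempts made and the final error.
func Retry(maxAttempts int, base time.Duration, fn func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	delay := base
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = fn()
		if err == nil || !errors.Is(err, ErrTransient) {
			return attempt, err // success, or an error retrying will not fix
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	attempts, err := Retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return ErrTransient // first two calls fail
		}
		return nil
	})
	fmt.Println(attempts, err) // 3 <nil>
}
```

Quota errors skip the retry loop on purpose: waiting does not raise a quota, so the user needs to see that error right away.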

When provisioning fails, users get:

What went wrong
At what step
How to fix it
Option to retry or roll back

No “Something went wrong.” Actual information.
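
In Go, one way to carry that information from the failing step up to the dashboard is a structured error type. The sketch below is hypothetical: the `ProvisionError` type, its fields, and the sample messages are invented to mirror the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// ProvisionError carries what a user needs to act on a failure.
type ProvisionError struct {
	What  string // what went wrong
	Step  string // at what step
	Fix   string // how to fix it
	Cause error  // underlying error, kept for logs
}

func (e *ProvisionError) Error() string {
	return fmt.Sprintf("%s (during %s). How to fix: %s", e.What, e.Step, e.Fix)
}

// Unwrap lets errors.Is and errors.As reach the underlying cause.
func (e *ProvisionError) Unwrap() error { return e.Cause }

func main() {
	cause := errors.New("vCPU limit exceeded") // hypothetical low-level error
	var err error = &ProvisionError{
		What:  "Not enough compute quota in us-east-1",
		Step:  "node group creation",
		Fix:   "request a quota increase or pick a smaller instance type",
		Cause: cause,
	}
	fmt.Println(err)

	// Callers can recover the structured fields from a plain error value.
	var pe *ProvisionError
	if errors.As(err, &pe) {
		fmt.Println("failed step:", pe.Step)
	}
}
```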

Concurrent Execution

The Runner runs many jobs concurrently, each isolated from the others:

Each job gets isolated workspace
Separate working directories
No shared state between jobs
Resource limits per job (CPU, memory)
Job timeouts (30 minutes max)

We can run 50+ Terraform jobs simultaneously. The bottleneck is cloud provider APIs, not our infrastructure.
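
Here is a minimal Go sketch of bounded concurrency with one working directory and one timeout per job. It assumes a buffered channel used as a semaphore, and `JobEnv`, `JobFunc`, and `RunJobs` are hypothetical names. CPU and memory limits are not shown, since they are usually enforced outside the process (for example by the container runtime).

```go
package main

import (
	"context"
	"fmt"
	"os"
	"sync"
	"time"
)

// JobEnv is what a job receives: a deadline-bound context and a private directory.
type JobEnv struct {
	Ctx     context.Context
	Workdir string
}

// JobFunc is one unit of work.
type JobFunc func(JobEnv) error

// RunJobs runs at most limit jobs at a time. Each job gets its own
// temporary working directory and its own timeout.
func RunJobs(limit int, timeout time.Duration, jobs []JobFunc) []error {
	if limit < 1 {
		limit = 1
	}
	results := make([]error, len(jobs))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for i, job := range jobs {
		wg.Add(1)
		go func(i int, job JobFunc) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it

			dir, err := os.MkdirTemp("", fmt.Sprintf("job-%d-", i))
			if err != nil {
				results[i] = err
				return
			}
			defer os.RemoveAll(dir)

			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			defer cancel()
			results[i] = job(JobEnv{Ctx: ctx, Workdir: dir})
		}(i, job)
	}
	wg.Wait()
	return results
}

func main() {
	jobs := []JobFunc{
		func(env JobEnv) error { fmt.Println("working in", env.Workdir); return nil },
		func(env JobEnv) error {
			<-env.Ctx.Done() // simulates a job that never finishes
			return env.Ctx.Err()
		},
	}
	// The demo uses a short timeout; the limit described above is 30 minutes.
	for i, err := range RunJobs(50, 100*time.Millisecond, jobs) {
		fmt.Printf("job %d: %v\n", i, err)
	}
}
```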

Streaming Logs

Users watch provisioning output in real time in the dashboard. It’s WebSocket-based:

Runner → RabbitMQ → Controller → WebSocket → Browser

The provisioning process writes to stdout/stderr. The Runner captures that output and publishes it to the queue, and the Controller forwards it to connected clients. Users see everything as it happens.
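
The capture side of that pipeline can be sketched as a line-by-line read that hands each line to a publish callback. This is a simplified illustration: `StreamLines` and `fakeProcessOutput` are hypothetical names, the publish callback stands in for the queue client, and the sample output lines are invented.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// fakeProcessOutput stands in for the stdout pipe of a provisioning process.
func fakeProcessOutput() io.Reader {
	return strings.NewReader("Creating VPC...\nCreating cluster...\nDone.\n")
}

// StreamLines reads r line by line and hands each line to publish as soon
// as it is available. It returns how many lines were published.
func StreamLines(r io.Reader, publish func(line string) error) (int, error) {
	sc := bufio.NewScanner(r)
	n := 0
	for sc.Scan() {
		if err := publish(sc.Text()); err != nil {
			return n, err
		}
		n++
	}
	return n, sc.Err()
}

func main() {
	n, err := StreamLines(fakeProcessOutput(), func(line string) error {
		fmt.Println("publish:", line) // real code would send to the queue here
		return nil
	})
	fmt.Println(n, err) // 3 <nil>
}
```

With `os/exec`, the reader would come from `cmd.StdoutPipe()` instead of the stand-in. Note that `bufio.Scanner` rejects lines longer than 64 KB by default, so very long output lines need a larger buffer.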

Cleanup on Failure

If provisioning fails halfway, we clean up what was created. Otherwise users end up with orphaned resources costing money.

The cleanup process:

Mark resources that were successfully created
Run cleanup procedures on partial state
If cleanup fails, retry up to 3 times
If still failing, alert ops team
Manually investigate and resolve
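
The steps above can be sketched in Go as follows. Two assumptions are baked in: resources are destroyed in reverse creation order (so dependents go before what they depend on), and "retry up to 3 times" means one initial attempt plus three retries. `Resource`, `Cleanup`, and `ErrStillInUse` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

const maxCleanupRetries = 3

// ErrStillInUse is a hypothetical failure used in the demo below.
var ErrStillInUse = errors.New("resource still in use")

// Resource is something that was successfully created before the failure.
type Resource struct{ ID string }

// Cleanup destroys created resources in reverse creation order. Each failed
// destroy is retried up to maxCleanupRetries times. Anything still failing
// is passed to alert and returned for manual investigation.
func Cleanup(created []Resource, destroy func(Resource) error, alert func(Resource, error)) []Resource {
	var orphaned []Resource
	for i := len(created) - 1; i >= 0; i-- {
		r := created[i]
		err := destroy(r)
		for retry := 0; err != nil && retry < maxCleanupRetries; retry++ {
			err = destroy(r)
		}
		if err != nil {
			alert(r, err)
			orphaned = append(orphaned, r)
		}
	}
	return orphaned
}

func main() {
	created := []Resource{{"vpc-1"}, {"subnet-1"}, {"cluster-1"}}
	orphaned := Cleanup(created,
		func(r Resource) error {
			if r.ID == "vpc-1" {
				return ErrStillInUse // never succeeds in this demo
			}
			fmt.Println("destroyed", r.ID)
			return nil
		},
		func(r Resource, err error) { fmt.Println("ALERT:", r.ID, err) },
	)
	fmt.Println("needs manual cleanup:", orphaned)
}
```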

We log everything for post-mortem analysis.

Variable Injection

Customers don’t write infrastructure code. They fill out a form, and we generate the required configuration:

cluster_name: "customer-production"
region: "us-east-1"
node_count: 3
instance_type: "t3.medium"
enable_monitoring: true
backup_schedule: "daily"

We validate types, ranges, and dependencies before execution. Invalid config never reaches the provisioning stage.
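
A rough Go sketch of that validation step is below. The struct mirrors the form fields above, but every rule in it is an assumption made for illustration: the name pattern, the 1 to 100 node range, the allowed schedules, and the "monitoring needs at least 2 nodes" dependency are not the Runner's real limits.

```go
package main

import (
	"fmt"
	"regexp"
)

// ClusterConfig mirrors the form fields. Go's static types cover the
// "types" part of validation once the form has been decoded.
type ClusterConfig struct {
	ClusterName      string
	Region           string
	NodeCount        int
	InstanceType     string
	EnableMonitoring bool
	BackupSchedule   string
}

// nameRe is an assumed naming rule: lowercase, digits, hyphens, 3 to 40 chars.
var nameRe = regexp.MustCompile(`^[a-z][a-z0-9-]{2,39}$`)

// Validate returns every problem found, so the user can fix them in one pass.
func Validate(c ClusterConfig) []string {
	var problems []string
	if !nameRe.MatchString(c.ClusterName) {
		problems = append(problems, "cluster_name must be 3-40 lowercase letters, digits, or hyphens")
	}
	if c.Region == "" {
		problems = append(problems, "region is required")
	}
	if c.InstanceType == "" {
		problems = append(problems, "instance_type is required")
	}
	if c.NodeCount < 1 || c.NodeCount > 100 { // assumed range
		problems = append(problems, "node_count must be between 1 and 100")
	}
	if c.BackupSchedule != "daily" && c.BackupSchedule != "weekly" { // assumed options
		problems = append(problems, "backup_schedule must be daily or weekly")
	}
	// Hypothetical dependency between two fields.
	if c.EnableMonitoring && c.NodeCount < 2 {
		problems = append(problems, "enable_monitoring requires at least 2 nodes")
	}
	return problems
}

func main() {
	cfg := ClusterConfig{
		ClusterName:      "customer-production",
		Region:           "us-east-1",
		NodeCount:        3,
		InstanceType:     "t3.medium",
		EnableMonitoring: true,
		BackupSchedule:   "daily",
	}
	fmt.Println("problems:", len(Validate(cfg)))
}
```

Returning all problems at once, rather than stopping at the first, suits a form: the user sees everything that needs fixing in a single round trip.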

Cost Estimation

Before provisioning, we estimate monthly costs using cloud provider pricing APIs and our infrastructure module calculations.

“This cluster will cost approximately $450/month.”

Not perfect (data transfer and storage costs are hard to predict), but close enough for budget planning.
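
The core arithmetic of such an estimate is simple: hourly prices multiplied by the hours in a month. The Go sketch below uses the common approximation of 730 hours per month (8,760 hours in a year divided by 12). The prices in the example are placeholders chosen for the demo, not real provider rates, and `EstimateMonthly` is a hypothetical helper.

```go
package main

import "fmt"

// hoursPerMonth is the usual approximation: 8760 hours per year / 12.
const hoursPerMonth = 730.0

// EstimateMonthly returns an approximate monthly compute cost in USD.
// It ignores data transfer and storage, which are hard to predict.
func EstimateMonthly(nodeCount int, nodeHourlyUSD, controlPlaneHourlyUSD float64) float64 {
	hourly := float64(nodeCount)*nodeHourlyUSD + controlPlaneHourlyUSD
	return hourly * hoursPerMonth
}

func main() {
	// Placeholder prices: 3 nodes at $0.05/hour plus a $0.10/hour control plane.
	cost := EstimateMonthly(3, 0.05, 0.10)
	fmt.Printf("This cluster will cost approximately $%.2f/month.\n", cost)
}
```

In a real estimator the two hourly figures would come from the provider pricing APIs for the chosen region and instance type.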

What’s Next

Working on:

Drift detection (alert when manual changes make infrastructure diverge from Terraform state)
Cost optimization recommendations
Faster provisioning (parallel resource creation where safe)
Better rollback mechanisms
Support for more cloud providers

The Runner evolved from “provision clusters” to handling all infrastructure operations. Updates, scaling, configuration changes, teardowns - all go through the Runner now.

These infrastructure patterns represent two years of production learnings, edge cases, and optimizations across multiple cloud providers.

Alex Idowu - Cloud Infrastructure & DevOps Engineer

Technical blog about Kubernetes, Terraform, BuildKit, and cloud infrastructure automation. Building PipeOps - production-grade platform engineering.
