CI/CD / INFRASTRUCTURE

Self-Hosted Runners & Hybrid CI for Heavy Workloads

Auto-scaling runners for CPU/GPU workloads with spot instances and on-prem integration

70%

CI Cost Reduction

4x

Faster Builds

99.9%

Runner Availability

Quick Facts

Industry: AI/ML Platform

Scale: 500+ daily builds

Timeline: 8 weeks to production

Stack: GitHub Actions, Kubernetes, AWS EC2 Spot

Infra: Hybrid on-prem + cloud runners

The Challenge

An AI/ML platform running 500+ daily builds — including heavy GPU model-training jobs and CPU-intensive integration tests — was spending over $18K/month on managed CI runners. Builds queued for 20+ minutes during peak hours, and GPU jobs were bottlenecked by limited shared-runner availability.

On-prem hardware with proprietary FPGA accelerators couldn’t be accessed from cloud runners, forcing engineers to run hardware-validation tests manually. The team needed a unified solution spanning cloud and on-prem with cost-efficient scaling.

Pain Points

$18K/month on managed CI runners with limited control

20+ minute queue times during peak build hours

GPU builds bottlenecked by shared runner scarcity

No access to on-prem FPGA hardware from cloud CI

Build images missing proprietary toolchains — 40% cache miss rate

Zero visibility into per-team and per-project CI costs

Our Solution

🚀

Auto-Scaling Runner Fleet

Kubernetes-based runner controller that scales from 0 to 100+ runners based on job queue depth. New runners provision in under 60 seconds with pre-baked images, and terminate automatically when idle — eliminating both queue wait times and wasted capacity.
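The scale-from-zero rule above can be sketched as a simple target calculation driven by queue depth. This is an illustrative model, not the actual controller's code; the names (MAX_RUNNERS, desired_runners) and the one-runner-per-job assumption are placeholders.

```python
MAX_RUNNERS = 100      # fleet ceiling from the case study
RUNNERS_PER_JOB = 1    # assumed: one ephemeral runner per job

def desired_runners(queued_jobs: int, running_jobs: int) -> int:
    """Scale to cover every queued job, from zero up to the fleet ceiling.

    No warm pool is kept: when the queue drains, the target drops and
    idle runners terminate themselves.
    """
    target = (queued_jobs + running_jobs) * RUNNERS_PER_JOB
    return max(0, min(target, MAX_RUNNERS))

print(desired_runners(12, 30))   # 42 -- cover queued + running jobs
print(desired_runners(0, 0))     # 0 -- scale to zero when idle
print(desired_runners(150, 20))  # 100 -- capped at the ceiling
```

In practice a controller like this re-evaluates the target on every webhook or polling tick, so runners provision only while demand exists.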

💰

Spot Instance Orchestration

Runs 80% of builds on AWS EC2 spot instances with multi-AZ capacity pools and automatic fallback to on-demand. Intelligent instance-type diversification across c6i, m6i, and r6i families ensures 99.9% spot fulfillment rate and 70% cost savings over on-demand pricing.
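The diversification-plus-fallback behavior can be sketched as trying each spot pool in turn before paying on-demand rates. The pool list and availability zones below are hypothetical examples, not the platform's real configuration.

```python
# Hypothetical diversified spot pools (instance type, availability zone).
SPOT_POOLS = [
    ("c6i.4xlarge", "us-east-1a"),
    ("m6i.4xlarge", "us-east-1b"),
    ("r6i.4xlarge", "us-east-1c"),
]

def provision(has_spot_capacity):
    """Try each spot pool; fall back to on-demand so the build always runs."""
    for instance_type, az in SPOT_POOLS:
        if has_spot_capacity(instance_type, az):
            return ("spot", instance_type, az)
    # Every spot pool is empty or interrupted: guarantee capacity on-demand.
    return ("on-demand", "c6i.4xlarge", "us-east-1a")

# Example: only the r6i pool has capacity right now.
print(provision(lambda t, az: t.startswith("r6i")))
# -> ('spot', 'r6i.4xlarge', 'us-east-1c')
```

Spreading requests across several instance families and zones is what keeps the fulfillment rate high: a capacity shortage in one pool rarely hits all three at once.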

🎮

GPU-Optimized Build Pipeline

Dedicated GPU runner pools with pre-warmed CUDA toolchains and cached ML framework layers. Parallel test sharding across multiple g5 instances reduces model-validation jobs from 45 minutes to under 12 minutes. Smart scheduling routes GPU jobs exclusively to GPU runners.
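The parallel test sharding can be illustrated with a round-robin split of test files across GPU instances. This is a minimal sketch; the function name and the 4-shard count are assumptions for the example.

```python
def shard(tests: list[str], shards: int) -> list[list[str]]:
    """Round-robin tests across shards so each GPU instance gets ~equal work."""
    return [tests[i::shards] for i in range(shards)]

tests = [f"test_model_{i}" for i in range(10)]
for idx, group in enumerate(shard(tests, 4)):
    print(f"shard {idx}: {group}")
```

Wall-clock time then approaches the slowest shard rather than the sum of all tests, which is how a 45-minute validation job can drop under 12 minutes across four instances.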

🔧

On-Prem / Cloud Hybrid

Unified job routing across cloud and on-premises runners via a single control plane. On-prem runners access proprietary FPGA hardware for validation tests while cloud runners handle standard builds — all orchestrated through the same GitHub Actions workflow definitions.
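The unified routing idea can be sketched as a label-to-pool lookup, mirroring how GitHub Actions `runs-on` labels select runner groups. The label names and pool names below are illustrative, not the actual configuration.

```python
# Hypothetical routing table: runs-on label -> runner pool.
POOLS = {
    "self-hosted-fpga": "onprem-fpga",   # on-prem hardware-validation rack
    "self-hosted-gpu":  "cloud-gpu",     # g5 spot pool
    "self-hosted-cpu":  "cloud-spot",    # c6i/m6i/r6i spot pool
}

def route(labels: set[str]) -> str:
    """Send a job to the first pool its labels qualify it for."""
    for label, pool in POOLS.items():
        if label in labels:
            return pool
    return "cloud-spot"  # default pool for standard builds

print(route({"self-hosted-fpga"}))          # onprem-fpga
print(route({"self-hosted-gpu", "linux"}))  # cloud-gpu
```

Because the routing key is just a workflow label, the same workflow definitions run unchanged whether a job lands on a cloud spot instance or the on-prem FPGA rack.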

Results

70%

CI Cost Reduction

$18K → $5.4K/month

4x

Faster Builds

32 min → 8 min average

99.9%

Runner Availability

Zero queued-timeout failures

0 min

Queue Wait (p95)

Down from 20+ minutes

Frequently Asked Questions

When should you use self-hosted runners instead of cloud-hosted CI?

Self-hosted runners are ideal when builds require specialized hardware (GPUs, FPGAs), access to on-prem resources, or when managed runner costs exceed $5K–10K/month. They also benefit teams needing custom toolchains, longer execution times, or compliance-driven isolation.

How do spot instances reduce CI/CD infrastructure costs?

AWS spot instances offer up to 90% savings over on-demand pricing. CI workloads are ephemeral and fault-tolerant, making them a perfect fit. With automatic fallback to on-demand and queue-based retry logic, builds continue uninterrupted even during spot interruptions.
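The retry-then-fallback behavior described above can be sketched as follows; `run_with_spot_fallback` and the retry count are hypothetical names for the example, and `InterruptedError` stands in for a spot reclaim notice.

```python
def run_with_spot_fallback(run_build, max_spot_retries: int = 2):
    """Retry interrupted spot builds, then fall back to on-demand capacity."""
    for _ in range(max_spot_retries):
        try:
            return run_build(capacity="spot")
        except InterruptedError:  # simulated spot reclaim signal
            continue
    return run_build(capacity="on-demand")

# Simulated build that gets reclaimed on every spot attempt:
def flaky_build(capacity):
    if capacity == "spot":
        raise InterruptedError("spot instance reclaimed")
    return f"built on {capacity}"

print(run_with_spot_fallback(flaky_build))  # built on on-demand
```

Since a CI job is idempotent, re-running it on fresh capacity loses nothing but a few minutes, which is why spot interruptions are acceptable here in a way they would not be for stateful services.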

How do you optimize GPU builds in a CI pipeline?

GPU optimization involves dedicated runner pools with pre-warmed CUDA toolchains, layer caching for ML framework images, parallel test sharding across GPU instances, and smart scheduling that routes GPU jobs exclusively to GPU runners — avoiding expensive idle time.

What is the cost comparison between managed and self-hosted CI runners?

Managed runners charge $0.008–$0.016/minute with limited customization. Self-hosted on spot instances cost $0.001–$0.004/minute with full hardware control. At 500+ daily builds, self-hosted typically saves 60–80% while delivering 2–4x faster execution through optimized images and local caching.
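A back-of-envelope check of the savings range, using rates from the comparison above. The specific rates and build profile chosen here are illustrative assumptions, not the case study's actual bill.

```python
MANAGED_PER_MIN = 0.008      # low end of managed runner pricing
SELF_HOSTED_PER_MIN = 0.003  # assumed mid-range spot-backed self-hosted rate

builds_per_day, avg_minutes = 500, 8
monthly_minutes = builds_per_day * avg_minutes * 30  # 120,000 min/month

managed = monthly_minutes * MANAGED_PER_MIN
self_hosted = monthly_minutes * SELF_HOSTED_PER_MIN
savings = 1 - self_hosted / managed

print(f"managed: ${managed:,.0f}/mo, self-hosted: ${self_hosted:,.0f}/mo")
print(f"savings: {savings:.0%}")
```

Even at the cheapest managed tier, this lands squarely in the 60–80% savings band; larger or GPU-class managed runners bill at higher multipliers, which pushes real-world savings toward the top of that range.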

Related Resources

Case Study
Unified CI/CD Platform Migration

Consolidating Jenkins and legacy CI tools into a single platform with reusable workflows.

Read Case Study →
Article
CI/CD Pipeline Automation Guide

Complete guide to building automated, cost-efficient CI/CD pipelines at scale.

Read Article →
Service
Cloud & DevOps Services

Kubernetes, CI/CD, and infrastructure engineering expertise.

Learn More →

Ready to Optimize Your CI/CD Infrastructure?

Get a free CI cost assessment and a roadmap to faster, cheaper builds with self-hosted runners.

Get Free Assessment
Explore Cloud Services