Discover How the “Monster Pod” Revolutionized Our Approach to Scaling Machine Learning Models

Scaling a complex system of machine learning models while delivering real-time insights is no small feat. ZestyAI’s engineering team reimagined its architecture to overcome these challenges, leveraging NVIDIA’s Triton Inference Server and introducing the “Monster Pod.” This transformation halved API response times, increased throughput by 10x, and cut cloud costs by 75%. Dive into how strategic experimentation and innovative design unlocked efficiency and positioned ZestyAI for future growth.
By Andrew Merski, VP, Engineering
At ZestyAI, we deliver critical insights to insurance clients using machine learning models. Our API processes a significant volume of data, including imagery, geolocation, and structured data, to produce real-time results. The complexity of each request places immense demands on our infrastructure.
In our previous system, each ML model operated as an independent microservice. Each model scaled independently, and each instance required its own GPU. While functional, this architecture introduced critical issues:
- Scaling overhead: every increase in traffic meant spinning up additional GPU-backed instances for each model in the request path.
- Resource underutilization: a GPU dedicated to a single model sat idle whenever that model wasn't busy.
- Network traffic: each request hopped between many services, adding latency and bandwidth cost to every call.
This architecture also resulted in significant operational complexity. Each model’s independent deployment meant substantial manual effort in testing, scaling, and troubleshooting. Cloud costs also escalated rapidly as new models were added, creating diminishing returns for each improvement in service quality.
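To make the old pattern concrete, here is a minimal sketch (with hypothetical service names and endpoints) of what serving a single request looked like when every model lived behind its own GPU-backed microservice:

```python
import requests

# Hypothetical per-model microservices from the old architecture.
# Each one is a separate deployment pinned to its own GPU.
MODEL_SERVICES = {
    "model_a": "http://model-a-svc:8080/predict",
    "model_b": "http://model-b-svc:8080/predict",
    "model_c": "http://model-c-svc:8080/predict",
}

def score_request(payload: dict) -> dict:
    """Fan a single API request out to every model service over the network."""
    results = {}
    for name, url in MODEL_SERVICES.items():
        # Every call here is a cross-service network hop; each target
        # service scales (and bills for its GPU) independently.
        resp = requests.post(url, json=payload, timeout=30)
        resp.raise_for_status()
        results[name] = resp.json()
    return results
```

Every new model added another box in this diagram: another deployment to test, another GPU to pay for, and another network hop in the request path.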
Faced with scaling challenges and rising customer demand, we reimagined the entire architecture. At the heart of the solution was NVIDIA’s Triton Inference Server, a tool designed for efficient multi-model serving.
Triton enabled us to serve all of our models concurrently through a single serving layer on shared GPUs, in place of one bespoke GPU-backed service per model.
However, Triton required significant investment in layers of customization to meet our needs. Its low-level interface and lack of native autoscaling demanded a tailored implementation.
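As a rough illustration of what multi-model serving behind one endpoint looks like in practice, here is a minimal sketch using Triton's standard Python HTTP client. The model names, tensor names, and shapes are placeholders rather than our actual configuration, and a real deployment wraps this low-level interface in the customization layers described above:

```python
import numpy as np
import tritonclient.http as httpclient

# One client, one Triton server process: every model is reached through it.
client = httpclient.InferenceServerClient(url="localhost:8000")

def infer(model_name: str, batch: np.ndarray) -> np.ndarray:
    """Run a batch through one of the models hosted by the shared Triton server."""
    # Placeholder tensor names; real names come from each model's config.pbtxt.
    inp = httpclient.InferInput("INPUT__0", list(batch.shape), "FP32")
    inp.set_data_from_numpy(batch)
    out = httpclient.InferRequestedOutput("OUTPUT__0")
    result = client.infer(model_name=model_name, inputs=[inp], outputs=[out])
    return result.as_numpy("OUTPUT__0")

# Different models, same server, same GPUs.
features = np.random.rand(4, 3, 224, 224).astype(np.float32)
scores_a = infer("model_a", features)
scores_b = infer("model_b", features)
```

With Triton, adding a model generally means adding an entry to the shared model repository rather than standing up a new service, which is what makes the consolidation below possible.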
To maximize Triton’s potential, we introduced the “Monster Pod,” consolidating all models and supporting microservices into a single Kubernetes pod. Co-locating everything meant inter-service calls no longer crossed the cluster network, the models drew from a shared pool of GPUs, and the entire inference stack scaled as a single unit; a simplified sketch of the idea follows below.
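This is a heavily simplified sketch, using the official Kubernetes Python client: one pod definition that co-locates a Triton container (holding the GPUs and the model repository) with a supporting service that previously ran as its own deployment. The image tags, container names, and resource figures are illustrative assumptions, not our production values:

```python
from kubernetes import client, config

def build_monster_pod() -> client.V1Pod:
    """Assemble a single pod that hosts Triton plus its supporting services."""
    triton = client.V1Container(
        name="triton",
        # Illustrative image tag; use whichever Triton release you actually run.
        image="nvcr.io/nvidia/tritonserver:24.05-py3",
        args=["tritonserver", "--model-repository=/models"],
        # All GPU-backed models live behind this one container.
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "2"}),
    )
    orchestrator = client.V1Container(
        name="orchestrator",
        # Hypothetical CPU-only sidecar: pre/post-processing and request fan-in,
        # talking to Triton over localhost instead of the cluster network.
        image="example.com/inference-orchestrator:latest",
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name="monster-pod", labels={"app": "inference"}),
        spec=client.V1PodSpec(containers=[triton, orchestrator]),
    )

if __name__ == "__main__":
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=build_monster_pod())
```

In a real cluster this spec would typically be managed by a Deployment, so that entire replicas of the stack are added or removed together; the unit of scaling becomes the whole inference platform rather than any one model.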
This project revealed critical insights that extend beyond Triton or even ML systems:
1. The "Microservices vs. Monolith" Debate Isn’t Binary
Architectural decisions don’t have to be all-or-nothing. For instance, while our deployment consolidated models into a single pod, we retained microservices for other aspects of the platform. Evaluating “single vs. many” decisions at multiple levels allowed us to optimize each layer independently.
2. Understand the Bottlenecks Before Designing Solutions
Identifying the root causes of inefficiency—scaling overhead, resource underutilization, network traffic—helped us design a system that addressed these challenges holistically rather than incrementally.
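For example, a first pass at confirming the resource-underutilization part of that diagnosis can be as simple as sampling GPU utilization on the serving hosts over time. Here is a minimal sketch using NVIDIA's pynvml bindings, shown only to illustrate the kind of measurement involved:

```python
import time
import pynvml

# Sample GPU utilization for a minute to see how much of the
# dedicated-GPU-per-model capacity is actually being used.
pynvml.nvmlInit()
try:
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    samples = {i: [] for i in range(len(handles))}
    for _ in range(60):                      # one sample per second
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)
            samples[i].append(util.gpu)      # percent of time the GPU was busy
        time.sleep(1)
    for i, vals in samples.items():
        print(f"GPU {i}: mean utilization {sum(vals) / len(vals):.1f}%")
finally:
    pynvml.nvmlShutdown()
```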
3. The Power of Consolidation
Integrating multiple components into a single deployment reduced complexity, improved performance, and simplified scaling. This approach may not suit every scenario, but in our case, it delivered transformative results.
4. Be Open to Temporary Solutions (Flexibility Leads to Innovation)
The “Monster Pod” started as a quick workaround but became a permanent fixture due to its outsized impact. Being open to experimentation unlocked unexpected benefits, such as easier resource planning and reduced operational complexity.
Rebuilding our ML inference platform was a bold move that paid off. The new architecture produced dramatic improvements across key metrics:
- API response times were cut in half.
- Throughput increased by 10x.
- Cloud costs fell by 75%.
These gains position us to scale with growing demand while maintaining industry-leading performance. Additionally, the simplified architecture has freed up engineering resources to focus on innovation rather than maintenance.
While Triton Inference Server played a critical role, the real success lay in our architectural decisions and willingness to rethink the status quo. This project underscores the value of experimentation and the importance of tailoring solutions to meet unique challenges.
The lessons learned from this journey will continue to inform our approach to system design and scalability as we look ahead. The Monster Pod has not only transformed our current capabilities but has also set the stage for future growth and innovation.
For a deeper dive into the technical details, check out Andrew Merski’s original blog on Medium.