Microsoft reveals how it scales Kubernetes for OpenAI

Ever wondered how Microsoft runs OpenAI’s massive AI infrastructure on Kubernetes? Jorge Palma, Principal PM Lead for Azure Kubernetes Service, reveals the secrets behind scaling Kubernetes to 75,000 nodes and beyond, all while maintaining complete commitment to open source principles.

In this exclusive KubeCon interview, Palma pulls back the curtain on one of the world’s most demanding Kubernetes deployments. You’ll discover how Microsoft transforms vanilla Kubernetes into an enterprise-grade platform capable of handling millions of cores, and why it refuses to add any proprietary “secret sauce” despite competitive pressure.

The conversation reveals a fascinating dichotomy in Microsoft’s approach to Kubernetes. On one hand, AKS Standard offers complete flexibility for teams who want to make every infrastructure decision themselves. On the other, AKS Automatic provides a fully opinionated platform where Microsoft makes all the hard choices about networking, ingress, security, and scaling, allowing developers to focus purely on their applications.

What you’ll learn in this video

Palma shares technical insights that will change how you think about Kubernetes at scale. He explains how Microsoft pushed past the traditional 100-node cluster limits to support tens of thousands of nodes, and why those improvements benefit the entire Kubernetes community, not just Azure customers.

You’ll hear about the specific technical breakthroughs in etcd compaction, API server optimization, and controller tuning that made extreme scaling possible. Palma reveals why Microsoft has several maintainers on the etcd project and how the company works upstream to keep Kubernetes portable.

The discussion takes a surprising turn when Palma challenges the narrative that Kubernetes is inherently complex. He argues that AI has fundamentally changed the game, allowing anyone to generate production-ready manifests and Dockerfiles without deep Kubernetes expertise. Microsoft’s MCP servers encode best practices directly into AI systems, democratizing access to enterprise-grade Kubernetes deployments.

Key topics covered

The architectural differences between AKS Standard and AKS Automatic
How OpenAI scales to 75,000+ nodes on Azure Kubernetes Service
Microsoft’s open source philosophy and why 100% of AKS components are upstream
Breaking through traditional Kubernetes scaling limits through community collaboration
The three categories of AI users and their different infrastructure needs
How AI tools are eliminating Kubernetes complexity for developers
Bridging the gap between business decision makers and Kubernetes operations
Microsoft’s Agentic Operations capabilities with Copilot integration

Palma doesn’t shy away from discussing the challenges hyperscalers face. He candidly addresses the period of manual work and investigation required when pushing Kubernetes to unprecedented scales, and explains why Microsoft always brings those learnings back to the community rather than keeping them proprietary.

The conversation concludes with Microsoft’s vision for making Kubernetes accessible to business users, not just engineers. Palma describes how AI agents integrated with Microsoft Copilot allow non-technical users to query cluster capacity, optimize costs, and even deploy applications, all without understanding Kubernetes internals.

Whether you’re managing a handful of Kubernetes nodes or planning for massive AI workloads, this interview provides invaluable insights into the future of container orchestration at scale. Watch now to discover how Microsoft is shaping the next generation of cloud native infrastructure.