DevOps for AI: Running LLMs in Production with Kubernetes and Kubeflow

9 March 2026

Large Language Models (LLMs) are rapidly becoming part of modern software systems. From chatbots and copilots to retrieval systems and AI agents, organizations are integrating generative AI into real production environments. But while building AI prototypes has become easier than ever, operating LLMs reliably in production remains a serious challenge.

At Kubernetes Community Days New York, Aarno Aukia shared practical insights into what it takes to run LLMs using proven DevOps practices. His talk highlighted an important reality: AI systems still need strong DevOps foundations – perhaps even more than traditional software systems.

Aarno Aukia’s talk at KCD New York

DevOps Meets AI

DevOps has always been about bridging the gap between development and operations. Developers focus on building application logic and managing data, while operations teams ensure that software runs reliably in production. Over the past decade, DevOps practices have matured around automation, observability, and continuous delivery.

In many organizations today, software follows a well-established pipeline: developers commit code to Git, automated CI/CD pipelines build and package the application, and Kubernetes deploys and runs it in production. Monitoring and logging systems then provide visibility into how the application behaves, allowing developers to continuously improve it.

This feedback loop has become the backbone of modern cloud-native development.

When AI enters the picture, however, this workflow changes in several important ways.

AI Systems Behave Differently

One of the biggest differences between traditional applications and AI-driven systems is determinism. Traditional software behaves predictably: given the same input, it produces the same output every time. LLMs behave very differently.

Large language models are probabilistic systems. They generate responses by predicting the next token based on context, effectively making statistical decisions about what comes next. This means that even small changes in prompts or user input can produce very different results.
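
To make this concrete, here is a toy sketch (not taken from the talk) of temperature-scaled sampling over next-token probabilities; the vocabulary and logits are invented for illustration, but the mechanism is why repeated runs can yield different outputs for the same input:

```python
# Toy illustration (not a real LLM): the model assigns scores (logits) to
# candidate next tokens and samples from the resulting distribution, so
# repeated runs can produce different outputs for the same input.
import math
import random

def sample_next_token(logits: dict, temperature: float = 1.0) -> str:
    # Temperature-scaled softmax: lower temperature sharpens the
    # distribution, higher temperature flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())
    exps = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exps.values())
    weights = [exps[tok] / total for tok in exps]
    return random.choices(list(exps), weights=weights, k=1)[0]

# Invented logits for the prompt "The capital of France is ..."
logits = {"Paris": 2.1, "London": 0.9, "Berlin": 0.4}
print([sample_next_token(logits, temperature=0.8) for _ in range(5)])
# e.g. ['Paris', 'Paris', 'London', 'Paris', 'Paris'] -- varies per run
```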

A seemingly harmless modification to a system prompt can completely change the behavior of a model. In one example, simply adding a seasonal theme to a chatbot prompt caused the model to fail at answering basic questions.

For operations teams, this creates a new category of complexity. Instead of debugging deterministic systems, they now have to manage systems whose outputs can change subtly depending on context.

Testing therefore becomes significantly more complicated.

The Challenge of Testing AI

Traditional software testing is relatively straightforward. A test provides an input and verifies that the output exactly matches an expected value.

AI systems do not fit into this model. When an LLM generates an answer, the output might be correct even if the exact wording differs from what was expected. At the same time, the response could contain subtle factual errors or hallucinations.

Determining whether an answer is acceptable often requires semantic evaluation rather than strict comparisons. In some cases, organizations even use another LLM to evaluate the output of the first one. This introduces an entirely new testing paradigm that many teams are still learning how to manage.
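
As a hedged illustration of that paradigm, the sketch below uses a second model as a judge via an OpenAI-compatible API (the kind of endpoint a vLLM server exposes); the endpoint, model name, and grading prompt are placeholders, not details from the talk:

```python
# Sketch of an "LLM as judge" test helper, assuming an OpenAI-compatible
# endpoint. Endpoint URL, model name, and grading prompt are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def answer_is_acceptable(question: str, answer: str, reference: str) -> bool:
    judge_prompt = (
        "You are grading a chatbot answer. Reply with exactly PASS or FAIL.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "PASS only if the candidate is factually consistent with the reference."
    )
    verdict = client.chat.completions.create(
        model="judge-model",  # placeholder model name
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,  # keep the judge itself as deterministic as possible
    )
    return verdict.choices[0].message.content.strip().startswith("PASS")
```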

More Artifacts to Manage

AI systems also introduce additional artifacts that must be tracked and versioned.

In traditional DevOps pipelines, the primary artifacts are source code and container images. With AI workloads, teams must also manage datasets, training artifacts, prompts, and model files. These models are often very large, sometimes tens of gigabytes in size, and must be stored and versioned carefully.

Without proper versioning, it becomes extremely difficult to debug issues or reproduce results later. If a model behaves unexpectedly, teams need to know exactly which model version, dataset, and configuration were used during deployment.
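
As one possible approach (not prescribed in the talk), model files hosted on the Hugging Face Hub can be pinned to an exact revision so a deployment is reproducible; the repository name and commit hash below are placeholders:

```python
# Pinning a model artifact to an exact revision (a git commit hash) instead
# of a floating reference like "main". Repository and hash are placeholders.
from huggingface_hub import snapshot_download

model_path = snapshot_download(
    repo_id="my-org/my-llm",   # placeholder repository
    revision="<commit-sha>",   # exact commit, so the bytes never drift
)
print(model_path)  # local copy of exactly this model version
```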

This dramatically increases the operational complexity of AI systems.

Observability Becomes Critical

Because LLMs are non-deterministic, observability becomes even more important than in traditional systems.

Logging must capture far more context than before. Instead of logging only application events, teams may need to record the full prompt, the model response, the model version, and the surrounding configuration. This allows operators to reconstruct what happened when something goes wrong.
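
A minimal sketch of what such a log record could look like, using only the Python standard library; the field names and values are illustrative:

```python
# Minimal structured logging for an LLM service: each interaction is
# recorded with enough context to reconstruct it later.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm-gateway")

def log_interaction(prompt: str, response: str, model: str,
                    revision: str, params: dict) -> None:
    logger.info(json.dumps({
        "ts": time.time(),
        "model": model,        # which model served the request
        "revision": revision,  # exact model version in use
        "params": params,      # temperature, max_tokens, ...
        "prompt": prompt,      # full prompt, including the system prompt
        "response": response,  # full generated output
    }))

log_interaction(
    prompt="What are your opening hours?",
    response="We are open 9:00-17:00 on weekdays.",
    model="my-llm",
    revision="<commit-sha>",
    params={"temperature": 0.2, "max_tokens": 256},
)
```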

Without detailed observability, debugging AI systems can quickly become impossible.

Open Models vs Hosted APIs

Another important operational consideration is the choice between closed and open models.

Hosted AI APIs offer convenience and powerful capabilities, but they also come with trade-offs. In many cases, organizations cannot control exactly when model updates happen or which minor version is running at a given time. This can make debugging and reproducibility difficult.

Open-weight and open-source models provide more operational control. They can be downloaded, versioned, tested locally, and deployed on internal infrastructure. This allows organizations to decide when and how updates are rolled out.
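
For example, a downloaded open-weight model can be smoke-tested locally with vLLM's offline inference API before it is promoted to a cluster; the model path and prompt below are placeholders:

```python
# Smoke-testing a locally stored open-weight model with vLLM's offline
# inference API. The model path and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="/models/my-llm")  # version-pinned local model files
params = SamplingParams(temperature=0.2, max_tokens=64)

outputs = llm.generate(["What are your opening hours?"], params)
print(outputs[0].outputs[0].text)
```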

For many regulated industries such as finance, healthcare, or government, this level of control is essential.

Kubernetes as the Foundation

This is where Kubernetes becomes an important part of the AI infrastructure stack.

Kubernetes already solves many of the operational challenges associated with running distributed systems. It provides mechanisms for container orchestration, resource management, autoscaling, and fault tolerance. Importantly for AI workloads, it can also manage GPU resources.
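
As a sketch, requesting a GPU for an inference container with the official Kubernetes Python client looks like this; the image and resource sizes are illustrative, and the nvidia.com/gpu resource name assumes the NVIDIA device plugin is installed:

```python
# Requesting a GPU for an inference container with the official Kubernetes
# Python client. GPUs are requested like any other resource.
from kubernetes import client

llm_container = client.V1Container(
    name="llm-server",
    image="vllm/vllm-openai:latest",  # placeholder image tag
    resources=client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "32Gi"},
        limits={"nvidia.com/gpu": "1"},  # one full GPU per replica
    ),
)
```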

However, operating Kubernetes itself is not trivial. As discussed in our article Best Kubernetes Distributions in 2026 – And Why You Might Not Want to Run Them Yourself, running clusters in production requires significant operational expertise.

Kubeflow and the Machine Learning Lifecycle

Kubeflow extends Kubernetes with specialized components for machine learning workflows. It helps manage the entire lifecycle of AI models, from training to production inference.

Kubeflow Pipelines allow teams to automate workflows for model development and training. These pipelines can orchestrate complex processes such as data preprocessing, training runs, evaluation steps, and packaging models for deployment.
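
A minimal sketch of such a pipeline with the kfp v2 SDK, with placeholder components standing in for real evaluation and packaging logic:

```python
# Minimal Kubeflow Pipelines (kfp v2) sketch: two placeholder components
# wired into a pipeline and compiled to a spec that Kubeflow can execute.
from kfp import compiler, dsl

@dsl.component
def evaluate_model(model_uri: str) -> float:
    # Placeholder: a real step would score the model against a test set.
    print(f"evaluating {model_uri}")
    return 1.0

@dsl.component
def package_model(model_uri: str, score: float):
    # Placeholder: a real step would package the model for deployment.
    print(f"packaging {model_uri} with score {score}")

@dsl.pipeline(name="llm-eval-and-package")
def llm_pipeline(model_uri: str = "s3://models/my-llm"):
    evaluation = evaluate_model(model_uri=model_uri)
    package_model(model_uri=model_uri, score=evaluation.output)

compiler.Compiler().compile(llm_pipeline, "pipeline.yaml")
```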

For many organizations using LLMs, however, the main focus is not training models but serving them reliably in production.

This is where KServe plays a key role.

Serving LLMs with KServe

KServe is a Kubernetes-native model serving framework that simplifies deploying and operating AI models. It allows teams to run inference services on top of Kubernetes using standard APIs.

A typical deployment consists of a container running a model server, often based on runtimes such as vLLM. The container loads the model, uses GPU resources for inference, and exposes an API endpoint for applications.
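
A hedged sketch of such an InferenceService, expressed here as a Python dict that could be serialized to YAML for kubectl or applied via the Kubernetes API; the names, image, and model path are placeholders:

```python
# Sketch of a KServe InferenceService with a custom vLLM predictor
# container. Names, image, and model path are placeholders.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "my-llm"},
    "spec": {
        "predictor": {
            "minReplicas": 1,  # lower bound; KServe can also scale to zero
            "maxReplicas": 4,  # upper bound for autoscaling
            "containers": [{
                "name": "kserve-container",
                "image": "vllm/vllm-openai:latest",     # model server
                "args": ["--model", "/models/my-llm"],  # placeholder path
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }],
        }
    },
}
```

From this single resource, KServe creates the underlying serving infrastructure, including deployments, services, and autoscaling configuration.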

KServe integrates with Kubernetes autoscaling mechanisms and observability tools, making it possible to scale AI workloads dynamically and monitor their behavior in production.

Because everything runs as Kubernetes resources, teams can apply the same DevOps practices they already use for other applications.

A Rapidly Evolving Ecosystem

The ecosystem around AI infrastructure is evolving extremely quickly. New projects are emerging to address the unique challenges of running LLMs at scale.

One example is llm-d, a Kubernetes-native framework designed specifically for distributed LLM inference. It builds on existing technologies such as vLLM but adds specialized capabilities like request routing, model selection, caching, and intelligent scaling.

These kinds of tools illustrate how the cloud-native ecosystem is adapting to the operational needs of AI workloads.

AI Still Needs DevOps

Despite the hype surrounding generative AI, one lesson is clear: AI systems still require strong operational foundations.

Running LLMs in production involves far more than simply calling an API. It requires careful management of models, infrastructure, observability, and deployment processes.

Kubernetes and Kubeflow provide a powerful platform for addressing these challenges. By applying proven DevOps principles to AI systems, organizations can build platforms that are not only intelligent but also reliable and scalable.

As AI becomes a standard component of modern applications, the ability to operate these systems effectively will become just as important as the models themselves.

This is also where platform approaches come into play. Instead of every team building and operating complex stacks themselves, platforms can provide ready-to-use services on top of Kubernetes. One example is Servala – Sovereign App Store, a Kubernetes-native marketplace that connects organizations with a catalog of managed cloud-native services such as databases, storage, developer tools, and AI-ready infrastructure components.

Markus Speth

Marketing, Communications, People
