Kubernetes OpenShift Tech

Building a Modern Load Balancer and NAT gateway with Fedora bootable containers

18 June 2026

In my team, we have a space called “innovation week”: twice a year, each engineer gets a week to work on a cool idea. From our internal documentation:

Innovation week is a space for engineers to work on a cool idea that they’ve had but which may not directly contribute to the team’s products or processes. Engineers are free to work on something that’s related to an existing tool or to invest this time to learn a new technology or tool.

I got my first hands-on experience with RHEL 10 image mode at the Red Hat Summit Zurich in January 2026. Since then, I’ve been kicking around the idea of building a modern highly available (HA) load balancer (LB) and NAT gateway appliance based on RHEL image mode (or rather the technology underpinning RHEL image mode: bootable containers).

This idea is a clear fit for innovation week as it ticks all the boxes: it’s a cool idea; it’s not immediately contributing to our products, but related to something we run in production; and it’s a learning opportunity for new tools. However, ultimately, I decided on this project for my innovation week of H1 2026 because I was eager to get more hands-on experience with bootable containers.

In the rest of this blog post, we’ll take a closer look at the design and the various components (both off-the-shelf and custom) that make up the proof-of-concept (PoC) implementation of this idea. Additionally, I’ll cover my learnings from a longer period of hands-on work with bootable containers.

If you’re not familiar with bootable containers yet, I recommend reading Fedora bootc’s “What is a Bootable Container?” and optionally also Fedora bootc’s “Getting Started with Bootable Containers”.

Architecture and Design

The PoC design consists of two main parts:

A bootable container that contains all the software and generic config that’s required to set up a pair of VMs that act as a highly available (HA) load balancer and NAT gateway for a Kubernetes cluster running in a private network.
A service on each VM that fetches all cluster-specific configuration from the cluster in front of which it is deployed. The most dynamic cluster-specific configuration is the set of backend node IPs for the Kubernetes API and the ingress controller. However, the service also fetches other configuration from the cluster, such as the specific floating IPs to use, SSH public keys, and the credentials for managing cloud provider floating IP attachments to the active replica.

I discussed this design, along with a few alternatives, with other VSHNeers over the last few months, but we quickly came to the conclusion that the design outlined above is well suited to our primary use case: a load balancer and NAT gateway appliance for OpenShift clusters deployed in private networks on cloud infrastructure that doesn’t offer suitable managed LB and NAT gateway products.

In particular, we liked that this design allows us to leverage our existing rich tool set for managing Kubernetes clusters (Project Syn) to deploy the load balancer and NAT gateway configuration through the same pipeline as the cluster configuration.

Component choices

For building the bootable container base image, I considered RHEL 10 image mode, the Fedora bootc base images, and the Universal Blue base images. I ended up going with the Fedora 44 bootc base image, since that image strikes a nice balance between minimal and familiar, without introducing additional complexities such as managing RHEL entitlements for building the bootable container.

Initially, I had planned to use Ignition to apply custom config on VM first boot. However, after realizing that Ignition requires a bunch of auxiliary tooling and scripts that are missing from the Fedora bootc base image, I ended up using cloud-init instead.

For the actual load balancing and NAT gateway functionality, I mostly stuck to what we’re running in production: HAProxy for load balancing API and ingress traffic, Keepalived for managing internal VIP attachment (together with Floaty for cloud provider floating IP attachment), and conntrackd for egress TCP connection state synchronization between the two LB VMs.

For traffic routing, which is implemented using iptables on our production LBs, I decided to try out firewalld. firewalld turned out to be well suited to a declarative configuration approach, and I was able to set up fairly solid firewall and routing policies for the PoC with only a handful of firewalld config files.

For network interface management, I leveraged NetworkManager, which is Fedora’s default choice for network interface management. This was an easy choice, since I have plenty of experience with it from managing OpenShift clusters.

Finally, for building the service that fetches the cluster-specific config from Kubernetes, I used kubebuilder to build a custom Kubernetes operator in Go (with the slight twist that the operator’s controller-manager will run outside the cluster). Again, this is familiar technology: we’ve written a few Kubernetes operators already, most recently Espejote (I highly recommend the Espejote blog post if you haven’t read it already).

Building the custom bootable container

I spent the first day or so familiarizing myself with the process of building a custom bootable container outside a carefully curated lab environment. The first breakthrough was my discovery of the Universal Blue image-template repository on GitHub: that repository provides a nice experience for building custom bootable containers out of the box, including a fairly comprehensive set of build commands encoded in a Justfile (just is a modern take on make).

However, I quickly realized that my current local environment (Ubuntu 24.04 LTS) isn’t the ideal platform for some bootable container workflows. In particular, I ended up using a Fedora Atomic VM to run bootc-image-builder, which builds a bootable VM image for various platforms from a bootable container. For the PoC, I used bootc-image-builder to generate a qcow2 image that I could boot in libvirt/QEMU.

As already noted above, the second big takeaway from building my own bootable container was that the Ignition first boot experience of Fedora/RHEL CoreOS requires quite a bit more than just installing the ignition package. If you’re curious, feel free to check out how Fedora CoreOS builds its Ignition first boot experience with Dracut. In the spirit of iterating quickly, I ended up picking cloud-init as the tool for applying first boot configuration instead.

You can find the source code I used to build the fedora-bootc-loadbalancer bootable container on GitHub at simu/fedora-bootc-loadbalancer. Notably, this repository currently lacks any CI that regularly builds the container and associated qcow2 base image. For the PoC, I was doing all the container builds locally.

Testing the custom bootable container

While one of the advantages of bootable containers is that you can just run them in podman to test changes, I did most of the testing locally in libvirt/QEMU. The main reason for testing in libvirt was that I didn’t fancy spending a bunch of time trying to replicate a VM with multiple network interfaces when running the bootable container in podman.

There are two main approaches for creating VMs from a bootable container image: you can either boot the VM from any bootc-enabled ISO and use bootc switch to pivot the VM to your custom bootable container, or you can build an ISO or importable disk image of your bootable container with bootc-image-builder.

I used bootc-image-builder to generate a qcow2 image of the bootable container and created VMs from that image. I chose this approach, since provisioning the LB VMs from a custom base image is my preferred option for a potential future production setup. Additionally, once I started relying on the custom Kubernetes controller to render initial cluster-specific configurations during first boot, creating VMs from a custom base image which includes the controller binary was much simpler.

For the PoC, I used our OpenShift Lab cluster’s integrated container registry to store the bootable container. This, in combination with deploying an authentication config for bootc via cloud-init, allowed me to update the VMs with bootc upgrade and also enabled bootc switch to change the deployed image tag. Being able to fetch the latest bootable container from the registry allowed me to test changes to the bootable container relatively quickly even without the rapid iteration that’s enabled by running the bootable container in podman locally.

Note that while I used a private registry for the PoC, the bootable container doesn’t contain any secrets, and an eventual production-ready version would host the bootable container on a public registry. All required secrets are cluster-specific. To avoid having to build a separate bootable container for each cluster, they should be injected at first boot via cloud-init or Ignition, or updated through the Kubernetes API once the VM is operational.

In addition to testing new versions via bootc upgrade, I also regularly built a new qcow2 base image from the latest version of the bootable container. Doing so allowed me to refine and improve the first boot process by creating new VMs based on the latest version of the bootable container.

Building the bootc-loadbalancer-controller-manager

Building and testing the Kubernetes operator that fetches the cluster-specific config from the Kubernetes API and renders appropriate configurations for the various services on the VMs was quite straightforward: I did a lot of the initial prototyping and testing against a local Kind cluster, which enabled very quick iteration.

Once the controller was able to render the essential service configurations (HAProxy front and back ends, Keepalived config for the private VIPs, Floaty config for the public VIPs, and conntrackd config) based on the custom resource, I started to integrate the controller into the bootable container.
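To make the rendering step more concrete, here is a minimal sketch of how a list of backend node IPs could be turned into an HAProxy back end. It is written in Python for brevity (the actual controller is written in Go), and the function name, server naming scheme, and example addresses are assumptions for illustration rather than the controller’s real code.

```python
import ipaddress


def render_backend(name: str, node_ips: list[str], port: int) -> str:
    """Render one HAProxy TCP back end with a server line per node IP.

    The node IPs are sorted numerically so that the rendered text only
    changes when the set of nodes changes, not when their order does.
    """
    lines = [
        f"backend {name}",
        "    mode tcp",
        "    balance roundrobin",
    ]
    ordered = sorted(node_ips, key=ipaddress.ip_address)
    for index, ip in enumerate(ordered):
        # "check" enables HAProxy health checks for the server.
        lines.append(f"    server node{index} {ip}:{port} check")
    return "\n".join(lines) + "\n"


# Hypothetical control plane node IPs in the cluster's private network.
print(render_backend("api", ["192.168.100.11", "192.168.100.10"], 6443))
```

Rendering deterministically matters once configuration changes are detected by comparing hashes: a stable ordering avoids needless service reloads.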

For the PoC, the integration of the controller into the bootable container build process was quite primitive: I simply copied the compiled Go binary into the repository that I used to build the bootable container and the container build process copied it to /usr/bin/bootc-loadbalancer-controller-manager in the container. Additionally, I added a very simple systemd service which runs the controller to the bootable container.

For the PoC, I side-stepped Kubernetes authentication for the controller: early testing used the Kind cluster’s admin kubeconfig injected into the LB VM through cloud-init. For the integration testing towards the end of the week, I used the service account and RBAC that’s generated by kubebuilder and a manually created long-lived service account token.

In the future, we’d probably want to build a more dynamic authentication mechanism in the controller, possibly using Kubernetes’ TokenRequest API.

Towards the end of the week, I extended the controller with a mechanism that reloads (or restarts) services on the LB VMs when their configuration is updated. This is implemented by hashing all configurations and tracking the currently deployed hashes for each LB VM in the custom resource’s status field. For a production-ready implementation, we’d probably want a secondary loop which triggers a reconciliation whenever one of the managed configuration files is changed on disk.
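The hash-based change detection can be sketched as follows. This is a simplified Python illustration, not the Go code from the controller: the file-to-unit mapping, the function names, and the choice of SHA-256 are assumptions, and a plain dictionary stands in for the custom resource’s status field.

```python
import hashlib

# Hypothetical mapping from a managed config file to the systemd unit using it.
UNIT_FOR_FILE = {
    "/etc/haproxy/haproxy.cfg": "haproxy.service",
    "/etc/keepalived/keepalived.conf": "keepalived.service",
    "/etc/conntrackd/conntrackd.conf": "conntrackd.service",
}


def config_hash(content: str) -> str:
    """Return a stable SHA-256 digest for one rendered configuration."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan_reloads(rendered: dict[str, str], deployed_hashes: dict[str, str]):
    """Compare freshly rendered configs against the hashes recorded as deployed.

    Returns the units whose config changed and the hashes to record next.
    """
    new_hashes = {path: config_hash(text) for path, text in rendered.items()}
    changed_units = sorted(
        UNIT_FOR_FILE[path]
        for path, digest in new_hashes.items()
        if deployed_hashes.get(path) != digest
    )
    return changed_units, new_hashes


rendered = {
    "/etc/haproxy/haproxy.cfg": "backend api\n",
    "/etc/keepalived/keepalived.conf": "vrrp_instance internal {}\n",
}
# First run: nothing is recorded as deployed, so every unit is reported.
units, status = plan_reloads(rendered, {})
print(units)

# A node joins: only the HAProxy config changes, so only HAProxy is reported.
rendered["/etc/haproxy/haproxy.cfg"] += "    server node0 192.168.100.10:6443 check\n"
units, status = plan_reloads(rendered, status)
print(units)
```

Keeping the hashes per LB VM, as the PoC does, lets each replica decide independently whether it still needs to reload a service.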

You can find the code for the Kubernetes controller on GitHub at simu/bootc-load-balancer-controller. Also note that the published version of the bootable container source code at simu/fedora-bootc-loadbalancer clones the controller GitHub repository and compiles the controller binary for each container build instead of relying on a manual copy.

SELinux adventures

Naturally, no blog post about building a custom system on Fedora is complete without some SELinux adventures.

After integrating the Kubernetes controller into the bootable container, I started testing the various service configurations. The initial tests were done on a single VM and with some creative integrations, such as configuring the API HAProxy backend on the libvirt VM to point to the Kind cluster on the host. During the initial testing I quickly found that the Keepalived configuration (which was copied almost verbatim from the working production LBs) didn’t work as expected: SELinux was saying “no” to Keepalived setting up the VRRP notification FIFO and Floaty reading from the same FIFO.

To my pleasant surprise, working with SELinux is quite ergonomic in 2026 (shout out to Computing for Geeks’ SELinux Survival Guide for Fedora which provides a fairly comprehensive overview of modern SELinux tooling).

First, I verified that SELinux was the root cause of Keepalived VRRP notifications not working correctly by temporarily switching to SELinux permissive mode with setenforce 0. After that, I refreshed my memory on how to work with SELinux, and installed the audit and policycoreutils-python-utils packages in my bootable container, since those aren’t part of the Fedora bootc base image.

At that point, I was able to extract the SELinux audit log entries in a nice format (including suggestions for remediation) with sealert. I then used the command line suggested by sealert, using ausearch and audit2allow to turn the log entries into a custom policy. Once the custom policy allowed Keepalived and Floaty to operate normally, I extracted the plain text policy definition and included a step to compile and install the custom policy in the bootable container build script.

After including the custom policy in the bootable container, VMs provisioned from (or updated to) the latest bootable container image automatically used the custom policy and there was no need to manually adjust SELinux anymore for fresh VMs.

After getting Keepalived and Floaty to work, I also extended the bootable container build script with some semanage port commands to label additional ports as http_port_t. This change allowed HAProxy to bind to all the configured front end ports, and to open connections to all the configured back ends.

Extending the controller to work without a Kubernetes API

While testing the controller on the LB VM, it quickly became clear that we needed a “bootstrap” rendering mode which works without a Kubernetes API. In this bootstrap mode, the controller renders initial configurations from a local YAML file which contains the custom resource and referenced cloud provider credentials secret.

After refactoring the controller’s reconcile logic to make it easier to call when there’s no working Kubernetes API, and introducing some extra command line arguments, the controller gained a “render-from-file” mode. The PoC uses this mode in cloud-init’s runcmd to generate initial configurations during first boot of each VM.

Configuring the network interfaces on the LB VMs

At this point, I had also implemented support for rendering custom NetworkManager .nmconnection configuration files in the controller. This enabled me to disable cloud-init’s network rendering completely and instead use the controller to configure the various network interfaces on each LB:

the public interface where client ingress traffic arrives and on which egress traffic is routed to the internet.
the private cluster network interface on which the LBs manage the default gateway IP and load balance the Kubernetes API.
a dummy interface on which all the cloud provider floating IPs are assigned so that loop back traffic on those IPs works as expected.

In order to generate these NetworkManager configurations, the controller must know which of the VM’s network interfaces are attached to the public internet and the cluster’s private network.

By default, the controller selects the interface that has the VM’s default route as the public interface. However, optionally, the public interface name can be specified as a command line argument. Additionally, the cluster network CIDR is a mandatory command line argument for the controller. The controller determines the private cluster network interface by searching for a network interface that’s assigned an IP in the cluster network CIDR.
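The CIDR-based lookup of the private interface can be illustrated with a short Python sketch using the standard `ipaddress` module. The controller itself is written in Go and reads the addresses from Linux; here the interface names and addresses are passed in as a hypothetical dictionary so the example stays self-contained.

```python
import ipaddress
from typing import Optional


def find_cluster_interface(
    addresses: dict[str, list[str]], cluster_cidr: str
) -> Optional[str]:
    """Return the first interface with an address inside the cluster CIDR.

    `addresses` maps interface names to their addresses in CIDR notation.
    Returns None when no interface matches.
    """
    network = ipaddress.ip_network(cluster_cidr)
    for name, addrs in sorted(addresses.items()):
        for addr in addrs:
            # ip_interface() accepts "address/prefix"; .ip drops the prefix.
            if ipaddress.ip_interface(addr).ip in network:
                return name
    return None


# Hypothetical interface names and addresses on one LB VM.
interfaces = {
    "enp1s0": ["192.168.122.10/24"],
    "enp2s0": ["192.168.100.2/24"],
}
print(find_cluster_interface(interfaces, "192.168.100.0/24"))
```

A production implementation would also have to decide what to do when several interfaces match the CIDR; this sketch simply returns the first match in name order.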

The controller needs the interface names to read details about each interface from Linux. However, the controller then extracts the interfaces’ MAC addresses and uses those in the .nmconnection configuration files as stable identifiers for each interface. This approach ensures that the on-disk NetworkManager configuration doesn’t depend on the actual interface names (which may change in some circumstances in cloud provider environments).
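As an illustration of a MAC-bound connection profile, the following Python sketch renders a NetworkManager keyfile that matches the interface by `mac-address` and deliberately omits `interface-name`. The connection ID, MAC address, static address, and function name are hypothetical, and the real controller (written in Go) may render different or additional settings.

```python
import configparser
import io


def render_nmconnection(conn_id: str, mac: str, zone: str, address: str) -> str:
    """Render a minimal NetworkManager keyfile bound to a MAC address."""
    parser = configparser.ConfigParser(interpolation=None)
    # connection.zone assigns the interface to a firewalld zone.
    parser["connection"] = {"id": conn_id, "type": "ethernet", "zone": zone}
    # Matching on the MAC address keeps the profile independent of the
    # kernel-assigned interface name.
    parser["ethernet"] = {"mac-address": mac.upper()}
    parser["ipv4"] = {"method": "manual", "address1": address}
    buffer = io.StringIO()
    # NetworkManager keyfiles use "key=value" without surrounding spaces.
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


print(render_nmconnection("cluster", "52:54:00:ab:cd:ef", "internal", "192.168.100.2/24"))
```

When writing such a file to /etc/NetworkManager/system-connections, keep in mind that NetworkManager only loads keyfiles that are owned by root and not readable by other users (mode 0600).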

At this point, the VMs were almost fully functional as load balancers (notably the services proxied by HAProxy weren’t yet reachable from outside the VM), but didn’t yet act as NAT gateways.

Configuring the firewall

While researching how to operate firewalld, I found the official firewalld blog post Policy Sets: A Home Router in four commands, which describes how a VM running firewalld can be turned into a functional simple router with only four commands.

After some testing, I started with firewalld’s gateway policy set as the base to define a custom firewalld configuration for the LB VMs. The final firewalld configuration extends the gateway policy to allow the LB VMs to act as NAT gateways/routers and exposes all the services proxied by HAProxy to external clients.

For the PoC implementation, the bootable container build script applies the following cluster-independent firewalld configurations:

It enables the gateway-lan-to-HOST, gateway-lan-to-world and gateway-world-to-host policies (which are part of the firewalld gateway policy set).
It adds a custom firewalld service definition for the OpenShift machine-config-server (TCP port 22623) in /usr/lib/firewalld/services.
It customizes the firewalld internal zone definition to allow traffic from the cluster network to the services exposed by HAProxy (Kubernetes API, machine-config-server, and ingress controller).

Additionally, the Kubernetes controller applies the following cluster-specific firewalld configurations:

It deploys a custom definition for the external zone which allows access to the services exposed by HAProxy (Kubernetes API and ingress controller) and blocks SSH access on the public floating IPs.
It adds the LB’s public interface to the external zone (via NetworkManager .nmconnection file).
It adds the LB’s private cluster network interface to the internal zone (via NetworkManager .nmconnection file).
It deploys a custom SNAT rule to use a floating IP as the source IP for traffic that’s forwarded from the cluster network to the public interface. This rule is deployed using firewalld’s direct mode, since the high-level firewalld configuration doesn’t currently allow configuring custom SNAT rules.

We translate the egress traffic source IP to a floating IP to ensure a failover of the internal gateway IP between the NAT gateway VMs doesn’t change the source IP of egress traffic, which would break established TCP connections to external servers. Keepalived and Floaty ensure that the internal default gateway VIP and the egress source floating IP are attached to the active LB replica. Finally, conntrackd ensures that egress TCP connection state is synchronized between the two LB replicas. All together, these three configurations provide transparent failover for TCP egress traffic.
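For readers unfamiliar with firewalld’s direct interface, the sketch below assembles a `firewall-cmd --direct --add-rule` invocation for such an SNAT rule. It is an assumption-laden Python illustration: the post doesn’t state how the controller (written in Go) deploys the rule, so the use of `firewall-cmd`, the function name, and the interface name are hypothetical, while the addresses are taken from the validation setup described later in this post.

```python
import ipaddress


def snat_direct_rule(cluster_cidr: str, public_interface: str, egress_ip: str) -> list[str]:
    """Build a firewall-cmd call that adds a permanent IPv4 SNAT direct rule.

    Traffic from the cluster network leaving through the public interface
    gets its source address rewritten to the egress floating IP.
    """
    network = ipaddress.ip_network(cluster_cidr)
    source = ipaddress.ip_address(egress_ip)
    if network.version != 4 or source.version != 4:
        raise ValueError("this sketch only handles IPv4")
    return [
        "firewall-cmd", "--permanent", "--direct", "--add-rule",
        # Address family, table, chain, and priority of the direct rule.
        "ipv4", "nat", "POSTROUTING", "0",
        # The remaining arguments are passed through as iptables-style args.
        "-s", str(network),
        "-o", public_interface,
        "-j", "SNAT", "--to-source", str(source),
    ]


# "enp1s0" is a hypothetical public interface name.
print(" ".join(snat_direct_rule("192.168.100.0/24", "enp1s0", "198.51.100.253")))
```

Using SNAT with a fixed floating IP instead of masquerading is what keeps the egress source address identical on both replicas, which is the precondition for the failover behaviour described above.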

End-to-end validation of the PoC

Initially, I had planned to set up a temporary OpenShift cluster for the end-to-end validation of the PoC.

However, I was running out of time, so I decided to do the end-to-end validation with a Talos Linux cluster running in a private libvirt network and a pair of VMs acting as LBs and NAT gateways for the cluster. Also, to avoid having to find real public IP addresses to use as floating IPs, I allocated fake cloud provider floating IPs from the RFC 6890 TEST-NET-2 range throughout the project.

The overall configuration of the end-to-end validation was set up as follows:

I allocated floating IPs for the API (198.51.100.100), the ingress controller (198.51.100.80) and the egress traffic (198.51.100.253) from TEST-NET-2.
I set up two libvirt networks:
- a regular libvirt “NAT” network (192.168.122.0/24). This network acted as the LBs’ public network (host interface virbr0).
- a libvirt “isolated” network (192.168.100.0/24). This network acted as the cluster’s private network (host interface virbr1).
The two LB VMs were each configured with a network interface in both of those libvirt networks.
- Keepalived on the LB VMs managed 192.168.100.1 as the default gateway IP for the libvirt isolated network.
The Talos Linux VMs were configured with a single network interface in the isolated network (192.168.100.0/24) and received the default gateway IP via DHCP.
On the host machine, traffic from 198.51.100.253 which was received on virbr0 was masqueraded to the host’s primary IP. This established outbound connectivity for VMs in the isolated network.
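The address plan above can be sanity-checked with a few lines of Python. This is only an illustrative sketch (the variable and function names are made up for this example); it relies on the fact that TEST-NET-2 is the documentation range 198.51.100.0/24.

```python
import ipaddress

# TEST-NET-2, reserved for documentation and therefore safe to use as
# fake floating IPs in a lab.
TEST_NET_2 = ipaddress.ip_network("198.51.100.0/24")

floating_ips = {
    "api": "198.51.100.100",
    "ingress": "198.51.100.80",
    "egress": "198.51.100.253",
}
public_network = ipaddress.ip_network("192.168.122.0/24")
cluster_network = ipaddress.ip_network("192.168.100.0/24")
gateway_vip = ipaddress.ip_address("192.168.100.1")


def outside_range(ips: dict[str, str], network) -> list[str]:
    """Return the names of all IPs that are not part of `network`."""
    return sorted(
        name for name, ip in ips.items()
        if ipaddress.ip_address(ip) not in network
    )


# All floating IPs should come from TEST-NET-2.
print(outside_range(floating_ips, TEST_NET_2))
# The Keepalived-managed gateway VIP must live in the cluster network.
print(gateway_vip in cluster_network)
# The two libvirt networks must not overlap.
print(public_network.overlaps(cluster_network))
```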

Setting up this validation environment was not without its own challenges: I had to hack the libvirt dnsmasq config for the isolated network in order to distribute the default gateway IP managed by the LB VMs over DHCP. Additionally, I ended up building a very simple mock floating IP API which allowed the LB VMs to dynamically configure static next hop routes on the host. To use that API, I adapted the Keepalived example VRRP notification bash script to call the host’s mock API instead of using Floaty to handle the Keepalived VRRP notifications. Since this script is essentially a replacement for Floaty, the associated moving parts in the bootable container are called fake-floaty.

After a few iterations and some last-minute additions in the Kubernetes controller to generate HAProxy back ends suitable for Talos rather than OpenShift, I was able to bootstrap the Talos cluster with 198.51.100.100 as the Kubernetes API IP. Once the Talos cluster was bootstrapped, I also successfully validated the dynamic functionality of the PoC, watching the HAProxy configurations on the LBs get updated and HAProxy reloaded instantly when Talos nodes joined the cluster.

Conclusion and outlook

I learned a lot throughout the week. In particular, I gained a much deeper understanding of bootable containers and firewalld.

If you’re interested in digging into the implementation, you can find all the code on GitHub: the custom bootable container is built from simu/fedora-bootc-loadbalancer and the Kubernetes controller is built from simu/bootc-load-balancer-controller.

Overall, I believe the PoC showed promise. However, quite a bit of work is still needed to bring the implementation up to production quality. Currently, the happy path mostly works in a local environment, but I wasn’t able to verify the PoC on a real cloud provider due to time constraints and there are various failure paths where the controller hard crashes. Additionally, there are still a number of open questions, in particular in regard to the OpenShift bootstrap process. Further, some optional features which are available on our production LBs (such as the ability to configure HAProxy to expose node port services) aren’t implemented yet.

Finally, while there are currently no concrete plans to move forward with this project in the short term, the results we were able to show after a week of work on the PoC implementation are very promising. To conclude, we’ll definitely keep the bootable container approach in mind for any future products.

Simon Gerber

Simon Gerber is a DevOps engineer at VSHN.
