Can AKS break the Hub & Spoke goals?
As the first part of a series of articles about building, running and operating services within AKS clusters, I want to dive into some core theory. I hope to give some insight into the pitfalls we had to overcome while building large-scale infrastructure in Azure over the past 2 years, how to solve them, and how to make the decisions that could save you a lot of time and energy down the road. We are going to get right into the details of security principles, problem solving, and controlling outbound traffic down to the pod and service level, incorporating eBPF.
Internet
|
v
[Ingress IP]---->[SLB]---->[Egress IP]
| | |
| v |
| [AKS Control Plane]
| |
| |----------->[AKS Cluster]
| / |
| / |
| / |
| [NSG]<--/ |
| | |
| v |
+---------------->[aks-subnet]<--+
|
v
[aks-egress-vnet]---+
| |
v |
{Outbound Traffic} |
| |
v |
+---------+ |
| [hub vnet] |
| [Azure Firewall]<-----+
+---------+
|
v
Azure required services
More often than not, your infrastructure will mirror the understanding, competence and goals of the people building it, and it will be more detailed in the areas they know well than in the areas outside their interest or confidence. How well do they understand the goals and requirements? Do the people writing the requirements understand how the underlying technology actually works, or are they focused only on business goals and logic? How many engineers in your team really understand core Azure networking? Often, simply making things work within budget and timescale can cost your team and company in the long run. This is the loaded term technical debt, which of course can compound.
Adhering to an iterative process is not anathema; refinement and reconstruction form the core of our practice as architects and engineers. This is the way. However, AKS can be a different beast: there are many ways to build a cluster, and many plugin and networking decisions that will become massive gotchas in your near future. The idea of these articles is to help you avoid some of the major pitfalls by setting out what you need to plan for early on. In the next section, let's look at the Hub & Spoke principles and address the first concern.
Hub & Spoke Architecture
At its core, the Hub & Spoke model is a paradigm for dataflow management. It requires the ability to oversee, document and scrutinise data travelling in all directions, whether north-south or east-west. The model uses virtual networks, termed "Spokes", each tailored to specific workloads, while central "Hubs" handle control functions, regulating and inspecting traffic within the network and, when necessary, between the Spokes. Microsoft documents this model extensively, highlighting particular services such as secured virtual hubs in Virtual WAN and DNS proxy across virtual networks, which we will examine in more depth later.
- The Azure Firewall or an NVA (network virtual appliance), with Palo Alto as the preferred choice, is integrated via the VNET to control outbound traffic.
- For inbound traffic, the architecture uses virtual network peering and subnets, complemented by service endpoints and/or Private Link.
- The architecture connects to essential services such as Cosmos DB, Event Grid, Functions, Storage, Azure API Management and, crucially, AKS.
- Routing is centralised, with VNET routes integrated through the secured hub's virtual router. Alternatives include a blend of UDRs (user-defined routes) and virtual hub routing for more bespoke traffic management.
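To make the traffic flow concrete, here is a minimal sketch of the topology as a data structure. It assumes the simplest form of the model, where spokes peer only with the hub and never with each other. The names `Topology`, `path`, `spoke-aks` and `spoke-data` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

HUB = "hub"            # hypothetical name for the hub VNET that hosts the firewall
INTERNET = "internet"  # stands in for any destination outside the network


@dataclass
class Topology:
    """Spokes peer only with the hub, so anything leaving a spoke crosses the hub."""
    spokes: set = field(default_factory=set)

    def path(self, src: str, dst: str) -> list:
        """Return the hops from a spoke to another spoke or to the internet."""
        if src not in self.spokes:
            raise ValueError(f"unknown spoke: {src}")
        if dst != INTERNET and dst not in self.spokes:
            raise ValueError(f"unknown destination: {dst}")
        if src == dst:
            return [src]          # traffic inside a spoke never reaches the hub
        return [src, HUB, dst]    # east-west and north-south both transit the hub


topology = Topology(spokes={"spoke-aks", "spoke-data"})
east_west = topology.path("spoke-aks", "spoke-data")
north_south = topology.path("spoke-aks", INTERNET)
print(east_west)
print(north_south)
```

The point of the sketch is the single return statement: every path that leaves a spoke contains the hub, which is what gives you one place to inspect and log traffic.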
This model's growing popularity comes from the control it gives over internet-facing workloads: it insulates them from direct internet exposure and provides a monitored, restricted common exit point. It gives businesses the means to direct traffic, apply modern firewall technologies, use intrusion detection and prevention systems, and implement network segmentation and workload isolation. It also enables centralised security logging and analysis with tools like Microsoft Sentinel or other SIEM systems. While the model is a robust theoretical framework, it can present tangible challenges and limits when AKS is your principal orchestration platform.
Issue 1: Outbound type
Let's take a look at the first and most important consideration when adding AKS to a spoke VNET: outbound traffic routing. The default, out-of-the-box (and often recommended) AKS configuration causes the following major headaches:
- Bypassing Centralised Inspection: If an AKS cluster is configured in a spoke with its default outbound settings, the outbound traffic from the AKS cluster will bypass the firewall in the hub. This means that the security layer provided by the firewall for traffic inspection and control is circumvented.
- Outbound Rules Conflict: The default outbound rules for AKS might conflict with the routing rules established in the hub and spoke architecture. AKS, by default, sets up its own Network Security Group (NSG) rules which can interfere with the centralised traffic routing.
- Egress IP Address Management: By default, AKS sends egress traffic through a Standard Load Balancer whose public IP addresses are provisioned and managed by Azure. When using a hub and spoke model with Azure Firewall, you typically want to control your own egress IP addresses and SNAT.
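The bypass in the first point comes down to route selection on the AKS subnet. The sketch below is a simplified model of it, not Azure's implementation: it assumes only two rules, that the longest matching prefix wins and that a user-defined route replaces a system route with the same prefix. All address ranges, the firewall IP `10.0.1.4` and the function names are hypothetical.

```python
import ipaddress

# Hypothetical addressing, for illustration only.
SPOKE_SYSTEM_ROUTES = {
    "10.1.0.0/16": "VnetLocal",  # the spoke VNET's own address space
    "0.0.0.0/0": "Internet",     # default system route: straight out to the internet
}
# A UDR on the AKS subnet that sends everything else to the firewall in the hub.
FIREWALL_UDR = {"0.0.0.0/0": "VirtualAppliance(10.0.1.4)"}


def effective_routes(system_routes, user_routes):
    """Merge routes; a user-defined route replaces a system route with the same prefix."""
    merged = dict(system_routes)
    merged.update(user_routes)
    return merged


def next_hop(routes, destination):
    """Return the next hop of the longest-prefix route that contains the destination."""
    address = ipaddress.ip_address(destination)
    best = None
    for prefix, hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if address in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, hop)
    return best[1] if best else None


with_udr = effective_routes(SPOKE_SYSTEM_ROUTES, FIREWALL_UDR)
default_hop = next_hop(SPOKE_SYSTEM_ROUTES, "93.184.216.34")  # no UDR on the subnet
udr_hop = next_hop(with_udr, "93.184.216.34")                 # UDR in place
local_hop = next_hop(with_udr, "10.1.2.3")                    # traffic inside the spoke
print(default_hop, udr_hop, local_hop)
```

Without the UDR, an internet-bound packet from a pod matches only the default system route and never sees the hub. With the UDR in place the same packet is handed to the firewall, while traffic inside the spoke still stays local because its route is more specific.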
So it goes without saying that you should route all internet-bound traffic through your firewall/NVA. As you can see, it's not just about inspecting and seeing where your pods are reaching out to, but also about keeping a small set of managed outbound IPs so your team can provide whitelists to the 3rd parties you work with.
In our next article we will use Terraform to configure the cluster's outbound type to use UDR, look at network plugins/CNI, and address DNS proxy at the spoke VNET level.
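From the 3rd party's side, a whitelist is only workable if your egress addresses are few and stable. As a rough sketch of the check they would run, assume the firewall egresses from a single public IP prefix. The prefix `203.0.113.8/30` comes from a range reserved for documentation, and `is_expected_egress` is a hypothetical helper.

```python
import ipaddress

# Hypothetical public IP prefix attached to the firewall (documentation range).
FIREWALL_EGRESS = [ipaddress.ip_network("203.0.113.8/30")]


def is_expected_egress(source_ip, allowed=FIREWALL_EGRESS):
    """Return True if a connection's source IP is one of the managed egress IPs."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in allowed)


via_firewall = is_expected_egress("203.0.113.9")   # inside the managed prefix
via_default = is_expected_egress("198.51.100.7")   # some other, unmanaged address
print(via_firewall, via_default)
```

If the cluster egresses through addresses you do not manage, this list can change without warning and the 3rd party starts rejecting your traffic.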