How to set up your Anka Build Cloud for High Availability using Kubernetes
As Kubernetes grows in popularity and maturity, we’ve received more questions from customers about using it to deploy the Build Cloud Controller & Registry. Here we describe what High Availability looks like for the Anka Build Cloud Controller & Registry software.
There are many ways to deploy your Anka Build Cloud into Kubernetes. We have a Helm chart built for AWS that will get you up and running quickly. You can also use it as a reference when creating your own, should you want a more advanced configuration: https://github.com/veertuinc/helm-charts/tree/main/charts/anka-build-cloud
Notes on ETCD
Running ETCD on Kubernetes can be problematic for many reasons. Here is a list of things to consider before you design your deployments:
- ETCD is highly sensitive to disk IO. If your PV is slow, etcd will time out queries and things will not work as expected. This is especially true under heavy load, which you may not see until production testing, so be sure to address it sooner rather than later.
- ETCD pod[s] running on a cluster with multiple nodes must either have nodeAffinity or distributed data across the nodes. If the etcd pod is rescheduled on a different node in the cluster, it will lose its data, and that data is critical for the Controller to function. A sign that the pod was rescheduled: you’ll wake up one day and Anka Node and permission groups, auth keys, etc. will be gone.
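The two considerations above can be sketched as a single StatefulSet that pins etcd to one node and keeps its data on a PersistentVolumeClaim backed by a fast storage class. This is a minimal illustration, not the layout used by the Helm chart: the resource names, node hostname, image reference, storage class, and volume size are all placeholder assumptions that you would replace with values from your own cluster.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: anka-etcd                      # hypothetical name
spec:
  serviceName: anka-etcd               # hypothetical headless Service for this StatefulSet
  replicas: 1
  selector:
    matchLabels:
      app: anka-etcd
  template:
    metadata:
      labels:
        app: anka-etcd
    spec:
      affinity:
        nodeAffinity:
          # Hard requirement: only schedule onto the named node so the pod
          # never lands somewhere its data is not available.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - etcd-node-1    # hypothetical node hostname
      containers:
        - name: etcd
          image: example.com/etcd:placeholder   # placeholder; use the etcd image you run
          command:
            - etcd
            - --data-dir=/var/lib/etcd
            - --listen-client-urls=http://0.0.0.0:2379
            - --advertise-client-urls=http://anka-etcd:2379
          ports:
            - containerPort: 2379
          volumeMounts:
            - name: etcd-data
              mountPath: /var/lib/etcd
  volumeClaimTemplates:
    - metadata:
        name: etcd-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd     # hypothetical class; choose one backed by low-latency disks
        resources:
          requests:
            storage: 10Gi              # illustrative size
```

A StatefulSet is used here because its volume claim stays bound to the same pod identity across restarts. Whether you also need the node pin depends on whether your volume type can follow the pod to another node.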
Answers to Frequently Asked Questions
- In our experience, the default nginx ingress controller doesn’t handle large file transfers well (for example, pushing a VM template to the registry). You will need to adjust the nginx configuration to allow large file transfers and to keep long-running transfers from hitting the default timeouts. Example:
  metadata:
    annotations:
      nginx.ingress.kubernetes.io/client-max-body-size: "0"
      nginx.ingress.kubernetes.io/proxy-body-size: "0"
      nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
      nginx.ingress.kubernetes.io/proxy-buffering: "off"
- When using an AWS NLB, there is an immutable idle connection timeout value of 350s. This can cause registry push/pull actions to time out. You’ll need to create an ingress ALB that accepts longer connections.
- If the registry pod has resource limits which are hit, it can be rescheduled on another host while it’s performing actions. We recommend avoiding this as much as possible.
- Per https://github.com/kubernetes/kubernetes/issues/43916, memory limits on the registry may cause it to restart frequently, because the Linux kernel defaults to using a lot of cached memory for disk IO operations and that cache can count against the limit. It’s best to avoid setting memory limits on the registry when running in Kubernetes.
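The timeout and resource-limit points above might look like the following pair of manifests. This is a sketch under stated assumptions, not a configuration taken from the Helm chart: the resource names, hostname, image reference, port 8089, timeout values, and request sizes are illustrative placeholders. The two timeout annotations are standard ingress-nginx annotations whose values are in seconds.

```yaml
# Ingress for the registry with extended proxy timeouts for long-running pushes/pulls.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: anka-registry                  # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"   # illustrative value, in seconds
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"   # illustrative value, in seconds
spec:
  ingressClassName: nginx
  rules:
    - host: registry.example.com       # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: anka-registry    # hypothetical Service name
                port:
                  number: 8089         # assumed registry port
---
# Registry Deployment that sets resource requests but deliberately no limits.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: anka-registry                  # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: anka-registry
  template:
    metadata:
      labels:
        app: anka-registry
    spec:
      containers:
        - name: registry
          image: example.com/anka-registry:placeholder   # placeholder image reference
          ports:
            - containerPort: 8089      # assumed registry port
          resources:
            requests:
              cpu: "500m"              # illustrative values; size for your workload
              memory: "1Gi"
            # No "limits" block: a memory limit can be reached through
            # page cache used for disk IO, restarting the pod mid-transfer.
```

Requests still give the scheduler enough information to place the pod sensibly, while omitting limits avoids the restart behaviour described in the linked Kubernetes issue.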