“Imagine that you’ve worked for more than a year on the infrastructure of a complex payments platform and within less than 30 minutes into a penetration test the whole production cluster is completely compromised. We’re talking being able to bypass firewall rules and gaining access to Google Cloud Services that you don’t have access to — all from within a tiny, teeny container.”
This is how Ana Calin kicked off her recent QCon London talk about the risks and payoffs of powering flexible payments in the cloud via Kubernetes. Calin is a Systems Engineer at Paybase, an API-driven payments service provider that focuses on marketplaces, the gig and sharing economies, and cryptocurrencies. Having achieved the hard-to-get PCI DSS (Payment Card Industry Data Security Standard) Level One compliance, Paybase is built to make regulation easier for its customers. Its solution is aimed at businesses deciding whether to become a regulated payment institution themselves or to save money by integrating with a third party.
Calin described how Paybase overcame some insecure defaults in managed Kubernetes services to achieve this top level of compliance, despite regulators generally lacking an understanding of container technology.
Two Ways to Prevent a Kubernetes Compromise
This compromise came about during a planned internal infrastructure penetration test of Paybase’s production cluster, before it was actively used by customers. Starting from the access provisioned inside a single container, the tester ended up gaining access to Google Cloud services that should have been out of reach.
Calin broke down what caused this security compromise and what mitigations were applied to prevent future attacks. This isn’t an exhaustive Kubernetes security primer, but one team’s story of faults they found by default.
Weak Link #1: Google Kubernetes Engine
Versions of Google Kubernetes Engine (GKE), Google Cloud Platform’s managed Kubernetes service, before 1.12.0 came with some insecure-by-default options:
- Compute Engine scope: a read-write access scope from GKE to Google Compute Engine (GCE).
- Default service account: a service account in GCP allows programmatic access between a given GCP service and other services. The default service account provisioned with any Google Cloud project has the Editor role by default, which grants programmatic access to edit any service within Google Cloud, including changing firewall rules.
- Legacy metadata endpoints: these are enabled by default. By querying the metadata endpoint from within a GKE cluster, you gain access to the Kubernetes API as the kubelet, the agent that runs on each node in the cluster and makes sure that containers are running in a pod. You can use this to read secrets from any node or scheduled pod. The endpoints are disabled by default from version 1.12.0, but earlier versions are still at risk.
Calin warned that if you don’t specify a non-default service account and don’t disable a node’s legacy metadata endpoints, someone can gain access to certain secrets through the Kubernetes API.
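The three defaults above can be checked mechanically. What follows is a minimal sketch, assuming a node pool description shaped like the node config object exposed by the GKE API (`serviceAccount`, `oauthScopes`, and a `metadata` map). The function name `find_insecure_defaults` and the sample data are hypothetical and are not taken from Calin’s talk.

```python
# Read-write Compute Engine scope, one of the insecure defaults listed above.
COMPUTE_RW_SCOPE = "https://www.googleapis.com/auth/compute"


def find_insecure_defaults(node_config):
    """Return a list of findings for one node pool's config dict.

    The dict layout mirrors GKE's node config as an assumption; adapt the
    keys if your tooling exposes them differently.
    """
    findings = []

    # An unset or "default" service account means the project's default
    # service account, which carries the Editor role.
    if node_config.get("serviceAccount", "default") == "default":
        findings.append("default service account in use")

    if COMPUTE_RW_SCOPE in node_config.get("oauthScopes", []):
        findings.append("read-write Compute Engine scope granted")

    # Legacy metadata endpoints stay on unless explicitly disabled.
    metadata = node_config.get("metadata", {})
    if metadata.get("disable-legacy-endpoints") != "true":
        findings.append("legacy metadata endpoints enabled")

    return findings


# Hypothetical sample data for illustration only.
risky = {"oauthScopes": [COMPUTE_RW_SCOPE]}
hardened = {
    "serviceAccount": "gke-nodes@example-project.iam.gserviceaccount.com",
    "oauthScopes": ["https://www.googleapis.com/auth/logging.write"],
    "metadata": {"disable-legacy-endpoints": "true"},
}
print(find_insecure_defaults(risky))
print(find_insecure_defaults(hardened))
```

The risky sample trips all three checks, while the hardened one returns an empty list.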
Weak Link #2: Tiller
Helm is an open-source tool for streamlining the installation and management of Kubernetes applications. Tiller is the server-side component of Helm that manages the releases of your charts, which are your templated Kubernetes resources packaged together.
Helm’s documentation already advises that you should never provision Tiller in a cluster with all the default options on, but, at Paybase, while the platform was still under development with no users, they left those defaults on with plans to change them before production went live.
“We did this but we said we would change it later on and decided to leave the defaults as they were for the penetration test to see to what extent the setup could be compromised,” Calin said.
Tiller comes with mTLS (mutual Transport Layer Security) key-based authentication between the client and server disabled by default. Anyone who then compromises Tiller, like Calin’s pen tester, has cluster admin permissions through the cluster-admin Kubernetes role-based access control (RBAC) role that Tiller requires to function properly. They can then remove or provision anything at will.
To mitigate this, Calin says to enable mTLS for Helm. Alternatively, you can run Helm without Tiller in your cluster, an approach known as “Tillerless Helm.”
At the very minimum, she recommends binding Tiller to localhost, which means Tiller will only accept connections that originate inside the pod it runs in.
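A minimal sketch of how these mitigations could be verified, assuming the Tiller Deployment manifest has already been loaded into a Python dict (for example from JSON output of `kubectl`). The `--listen=localhost:44134` argument follows the override described in Helm 2’s security guidance. The environment variable names `TILLER_TLS_ENABLE` and `TILLER_TLS_VERIFY` are an assumption about how a TLS-enabled Tiller pod is configured and should be checked against your Helm version. The function name and sample manifests are hypothetical.

```python
def tiller_findings(deployment):
    """Inspect the first container of a Tiller Deployment dict."""
    container = deployment["spec"]["template"]["spec"]["containers"][0]
    findings = []

    # Without an explicit localhost listen address, Tiller is reachable
    # from other pods in the cluster.
    command = container.get("command", []) + container.get("args", [])
    if not any(arg.startswith("--listen=localhost:") for arg in command):
        findings.append("Tiller is not bound to localhost")

    # Assumed env var names; see the note above this block.
    env = {e["name"]: e.get("value", "") for e in container.get("env", [])}
    if env.get("TILLER_TLS_ENABLE") != "1" or env.get("TILLER_TLS_VERIFY") != "1":
        findings.append("mutual TLS is not enforced")

    return findings


def make_deployment(command, env):
    """Wrap a container spec in the minimal Deployment structure."""
    container = {"name": "tiller", "command": command, "env": env}
    return {"spec": {"template": {"spec": {"containers": [container]}}}}


default_install = make_deployment(["/tiller"], [])
locked_down = make_deployment(
    ["/tiller", "--listen=localhost:44134"],
    [
        {"name": "TILLER_TLS_ENABLE", "value": "1"},
        {"name": "TILLER_TLS_VERIFY", "value": "1"},
    ],
)
print(tiller_findings(default_install))
print(tiller_findings(locked_down))
```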
Security and Resilience: A Secure Kubernetes Cluster
After offering the two configurations that could put your cluster at risk, Calin went into the basics they’ve found are necessary for a more secure Kubernetes cluster — because nothing is impenetrable.
Referencing the Principle of Least Privilege, she said that you should only give engineers permission to what they absolutely need to perform their duties.
Calin also recommended using Kubernetes Network Policies to restrict traffic to a certain group of pods, as needed, or applying an Istio setup with authorization rules enabled, which also restricts traffic.
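As an illustration of the Network Policy approach, here is a minimal sketch that builds a `networking.k8s.io/v1` NetworkPolicy manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts. The helper name, namespace, labels and port are hypothetical examples, not Paybase’s actual configuration.

```python
import json


def allow_only_from(namespace, target_labels, allowed_labels, port):
    """Build a NetworkPolicy that admits ingress to pods matching
    `target_labels` only from pods matching `allowed_labels` in the
    same namespace, and only on the given TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "restrict-ingress", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": target_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": allowed_labels}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


policy = allow_only_from("payments", {"app": "ledger"}, {"app": "api"}, 8443)
print(json.dumps(policy, indent=2))
```

Once a pod is selected by a policy of type Ingress, any inbound traffic that no rule allows is dropped, so the selected pods stop accepting connections from the rest of the cluster.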
“Whatever you’re building, you should always assume it’s going to fail and someone is going to break it.” — Ana Calin, Paybase
She continued that you should write Pod Security Policies, so that “in the event your cluster is compromised via a container, the attacker can only deploy pods with certain privileges and are restricted from mounting the host filesystem into a newly created pod.”
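The kind of policy Calin describes could look roughly like the sketch below, which assumes a cluster version that still serves the `policy/v1beta1` PodSecurityPolicy API. The function name and the specific choice of allowed volume types are illustrative assumptions. The point is that `privileged` is off and `hostPath` is absent from the volume allowlist.

```python
import json


def restricted_psp(name="restricted"):
    """Build a PodSecurityPolicy manifest that blocks privileged pods
    and host filesystem mounts."""
    return {
        "apiVersion": "policy/v1beta1",
        "kind": "PodSecurityPolicy",
        "metadata": {"name": name},
        "spec": {
            "privileged": False,
            "allowPrivilegeEscalation": False,
            "hostNetwork": False,
            "hostPID": False,
            "hostIPC": False,
            # hostPath is deliberately left out of the allowed volume
            # types, so pods cannot mount the node's filesystem.
            "volumes": [
                "configMap",
                "secret",
                "emptyDir",
                "projected",
                "downwardAPI",
                "persistentVolumeClaim",
            ],
            "runAsUser": {"rule": "MustRunAsNonRoot"},
            "seLinux": {"rule": "RunAsAny"},
            "supplementalGroups": {"rule": "RunAsAny"},
            "fsGroup": {"rule": "RunAsAny"},
        },
    }


psp = restricted_psp()
print(json.dumps(psp, indent=2))
```

A policy only takes effect for pods whose service account or user is authorized to use it through RBAC, so the manifest alone is not sufficient.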
A good assurance for all software is to have role-based access control (RBAC) enabled. The main cloud providers that offer managed Kubernetes services (Google Cloud, Azure and AWS) have Kubernetes RBAC enabled by default, although this has only been enabled recently in Azure’s Kubernetes Service.
And always use security-scanned images within your clusters.
Calin said a resilient Kubernetes cluster should:
- Be architected with failure and elasticity in mind,
- Have a stable observability stack,
- Be tested a lot, including chaos engineering.
Overcoming Kubernetes Challenges to PCI DSS Compliance
Calin described Paybase’s Level One Payment Card Industry Data Security Standard (PCI DSS) compliance as a huge achievement. It’s something required of Payments Service Providers (PSP), yet most financial institutions choose to pay annual fines rather than achieve this level of compliance. In 2017, more than 80 percent of businesses remained non-compliant.
Compliance is especially hard to achieve when financial regulatory institutions don’t have a full understanding of distributed systems, since most financial institutions still operate on traditional legacy architectures.
PCI Req. 6.5.1 states that your app can’t be vulnerable to injection flaws. In order to check this box, Paybase leveraged PQL (Paybase Query Language), a domain-specific language, developed in house and inspired by SQL. Calin said PQL is injection-resistant because it doesn’t allow for mutative syntax and has the added benefit of being database agnostic.
PCI Req. 2.2.1 says that a compliant PSP must implement only one primary function per server, to prevent functions that require different security levels from coexisting on the same server.
Of course, as Calin said, “We don’t have any actual servers and the standards don’t say anything about virtual machines, let alone containers.”
For this requirement, Paybase translated “server” to mean “a deployable unit” like a pod or container. If this interpretation is broadly accepted, it then makes containers a logical solution to aid in compliance. For Paybase it was. They met the requirement using Network Policies that restricted traffic to different services, applying different Pod Security Policies, and only using internal trusted, scanned, and approved images.
PCI Req. 6.4.4 requires a PSP to remove all test data and accounts from system components before the system goes live or into production.
Calin described the “Normal way for cloud providers to have one organization, then one main project, and then under that project [account for Amazon Web Services], you have all the services. AWS recommends you should have one account for billing and then another for the rest of your application. Then you could split it at a Kubernetes namespace level within the same cluster.”
With this setup, from a PCI compliance point of view, your scope would be everything in the Google Cloud Platform project, because everything is encapsulated under the same virtual private cloud (VPC).
Instead, Paybase has a different project for each environment, plus a few extra projects for important things like the image repository, backups and the Terraform state. This reduces the PCI scope to the production project, which creates a compliance-friendly separation of concerns, easier RBAC, and risk reduction.
Finally, PCI Req. 11.2.1 requires a PSP to perform quarterly internal vulnerability scans, to address those vulnerabilities, and then to rescan, proving that all high-risk vulnerabilities are eliminated.
“When you are running on containers, you don’t have internal infrastructure as such. Instead, we make sure all of the images that are ever used within the cluster have been scanned, and no image that hasn’t passed a vulnerability scan will ever get built,” Calin said.
“When a dev pushes code to our Source Code Management, some integration tests run and a build image step is triggered. Then if the image build is successful, then it scans every package in that image and, if the image doesn’t have any vulnerabilities other than low [risk], it successfully pushes the image tag into GCR [Google Container Registry]. If the scan fails [vulnerabilities higher than medium risk are found], our devs don’t have any way to publish that image into GCR,” she continued.
At Paybase, new image builds happen daily because the company runs a distributed monolith, separated by entities, in which all the different application services run on the same image. This ensures the packages in an image are scanned daily, which means the likelihood of an old package with a vulnerability lingering in a cluster is very low.
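The gate Calin describes can be sketched as a small decision function. This is only an illustration: the severity names, the shape of a scan finding, and the `push` callback are assumptions, and the threshold follows the quoted wording that nothing above low risk may be published.

```python
# Assumed severity scale, ordered from least to most severe.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def may_publish(findings, max_allowed="low"):
    """Return True if no finding exceeds the `max_allowed` severity."""
    limit = SEVERITY_ORDER.index(max_allowed)
    return all(SEVERITY_ORDER.index(f["severity"]) <= limit for f in findings)


def publish_if_clean(image_tag, findings, push):
    """Call `push(image_tag)` only when the scan passes the gate."""
    if not may_publish(findings):
        return False
    push(image_tag)
    return True


# Hypothetical scan results; a plain list stands in for the registry.
registry = []
clean = [{"package": "libexample", "severity": "low"}]
dirty = clean + [{"package": "libother", "severity": "high"}]
print(publish_if_clean("app:1.0.1", clean, registry.append))
print(publish_if_clean("app:1.0.2", dirty, registry.append))
print(registry)
```

Only the first tag reaches the stand-in registry. Keeping the push step behind the gate, rather than scanning after publication, is what removes any path for an unscanned image to enter the cluster.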
Open Source Security is an Ongoing Journey
“Security is not a point in time but a never-ending ongoing journey. It just means we’re working more securely, not that we are 100 percent secure — there is no such thing as complete security,” Calin said.
She continued that you can definitely use open source software, as Paybase does, and still achieve a good level of security.
“It just requires a certain amount of work, but we are engineers and we like a challenge,” she said.
While she likes a good challenge, Calin did argue for trying to change the PCI compliance status quo. She shared the story of how Paybase had hired an auditor who had a general knowledge of containers and Google Cloud services, but to whom they still had to spend a lot of time explaining how it all worked.
“This shouldn’t be our job. It should be the job of people who are training the QSAs [Qualified Security Assessors] to do their job,” Calin said.
Finally, she recommended that engineers and really everyone on your team should be involved in making your company compliant.
“They [engineers] learn how much of a pain it is to become compliant, and they will appreciate more what they have to do to stay compliant,” Calin said.
The Cloud Native Computing Foundation, which manages the Kubernetes project, is a sponsor of InApps.