Red Hat Code Sleuths Uncover Mysterious Bug in Registry Service
Red Hat sponsored this post.
Alex Handy
Alex is a technical marketing manager at Red Hat. In his previous life, he cut his teeth covering the launch of the first iMac, before embarking upon a 20-plus year career as a technology journalist. His work has appeared in Wired, The Atlanta Journal-Constitution and the Austin American-Statesman.
After updating to OpenShift 4.3.19, Quay.io experienced intermittent service interruptions. The team quickly rolled back to 4.3.18, restoring service and steadying the waters, but everyone involved was now taking part in a murder mystery.
You’ve heard stories, but if you’re lucky, you’ve never experienced it. The bug is below you. It’s above you. It’s in the walls. It’s listening to us right now.
Troubleshooting and debugging are time-honored traditions of the methodical and systematic elimination of possibilities. But what happens if you cannot rule out a portion of the stack because your team does not have deep knowledge of it? Or worse yet, what if one of the layers of your stack is closed source software?
What if, horror of horrors, your stack is entirely open source and the bug is down in one of those layers? In Kubernetes? In Linux? Can your teams even begin to comprehend tracking down that type of bug? Can they even eliminate it as a possibility without reading hundreds of pages of code and documentation?
Growing to 10-Digit Scale
Red Hat’s Quay.io is a very large hosted service. There’s been a lot of news recently about the business of hosting container images at scale for enterprise cloud users, and Quay.io has quietly been performing that function since 2013 and growing steadily. In the month of August 2020 alone, Quay.io served 1 billion container pulls and had 100% uptime.
Back in 2014, when Quay.io was acquired by CoreOS, a decision was made to build an App Registry into the service. This predated the modern methods of cloud native artifact bundling that we use today in Kubernetes, with standards like OCI, but the functionality was nonetheless included in Quay's codebase. Because this feature wasn't why most users adopted Quay.io, it wasn't heavily used, and so it didn't get a lot of engineering scrutiny.
App Registry is a lesser-known feature of Quay.io that allows objects like Helm charts and containers with rich metadata to be stored. While most Quay.io customers don't use this feature, Red Hat OpenShift is a large user: the OperatorHub within OpenShift uses App Registry to host all of its Operators.
Every OpenShift 4 cluster uses the embedded OperatorHub to serve a catalog of available Operators, to install them and to provide updates to Operators that are already installed. As OpenShift 4 adoption has increased, so has the number of clusters globally. Each one of those clusters needs to download Operator content to run the embedded OperatorHub, using the App Registry inside Quay.io as a backend.
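For a sense of what that backend traffic looks like, here is a minimal sketch of a client listing Operator packages from an app-registry-style endpoint. The API root path and the namespace used here are illustrative assumptions for the example, not documentation of Quay.io's interface.

```python
# A minimal sketch of a client-side lookup against an app-registry-style
# backend. The endpoint path and the namespace are assumptions used for
# illustration, not a documented contract.
import json
import urllib.request

APPREGISTRY_BASE = "https://quay.io/cnr/api/v1"   # assumed app-registry API root
NAMESPACE = "community-operators"                 # hypothetical namespace

def list_packages(namespace: str) -> list:
    """Fetch the list of packages published under one namespace."""
    url = f"{APPREGISTRY_BASE}/packages?namespace={namespace}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    for pkg in list_packages(NAMESPACE)[:5]:
        # Each entry describes one Operator package and its available releases.
        print(pkg.get("name"), pkg.get("default"))
```

Multiply a request like this by every cluster refreshing its catalog, and the load on that one lesser-known feature adds up quickly.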
The Outage
Fast forward to this summer, and Quay.io is processing over one billion image requests per month, a rate of over 1.5 million per hour. It's a large-scale data distribution and retention service depended upon by enterprises around the globe, and it is itself hosted on Red Hat OpenShift, an open hybrid cloud platform for container-based IT teams.
After updating to OpenShift 4.3.19 from OpenShift 4.3.18, Quay.io's database froze and the service stopped working, resulting in intermittent service disruptions. During these periods, users experienced a range of outcomes, from slow container image access times to an inability to retrieve container images at all. The team quickly rolled back to 4.3.18, restoring service and steadying the waters, but everyone involved was now taking part in a murder mystery, each playing their very own Inspector Lynley.
But the culprit has already been mentioned: the App Registry. It turns out it had become the way internal teams at Red Hat were building Kubernetes Operators. The code behind App Registry had never been pushed to work at this scale, and the entire system suffered because of it.
We're not here to discuss the end results: they're almost boring compared to the giant bug hunt that ensued, which shows just how much of a CSI-style procedural such a search can become when Red Hat is involved.
Instead, we're here to discuss that bug hunt: the twists and turns, the sheer breadth of possibilities, and the methods used to track the culprit down. In the weeks after the crash, Red Hat employees working on Quay, OpenShift, the Linux kernel and all manner of other systems attempted to eliminate possibilities and identify the exact cause.
William Dettelback is an engineering manager on the Quay engineering team. When it came to the Quay.io outage, the first thing he saw was the Red Hat SRE team, run by Jay Ferrandini and Jonathan Beakley, isolate the changes that had taken place between the service functioning properly and its newly degraded state.
Dettelback says it's important to have this type of monitoring and performance measurement in place from the start; otherwise, when things go sideways, you cannot actually tell what changed. Without a baseline of system behavior, pinpointing exactly when the problem started is nigh impossible.
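As a rough illustration of the kind of baseline Dettelback describes, the sketch below records one service-level metric from a Prometheus-style metrics store before a change rolls out. The Prometheus URL and the latency expression are assumptions made for the example; they are not Quay.io's actual monitoring configuration.

```python
# A minimal sketch of baselining one service-level metric from a Prometheus-style
# metrics store. The PROM_URL and the metric expression are illustrative
# assumptions, not the real Quay.io monitoring setup.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.example.internal:9090"   # hypothetical metrics endpoint
QUERY = ('histogram_quantile(0.99, '
         'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))')

def instant_query(expr: str) -> float:
    """Run one instant query against the Prometheus HTTP API and return its value."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    result = body["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

if __name__ == "__main__":
    # Record the pre-change value so a post-upgrade regression is measurable.
    print("p99 request latency (s):", instant_query(QUERY))
```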
A Mile Wide, an Inch Deep
Fortunately, the number of changes across the systems involved was minimal. Unfortunately, those changes went deep. The OpenShift 4.3.18 to 4.3.19 upgrade included not only OpenShift updates, but also updates to the fundamental Linux systems and kernel used to power containers.
That's because the OpenShift platform is not just some PaaS, or some framework, or even simply some implementation of Kubernetes. Instead, it is a harmonization of thousands of open source projects, from the Linux kernel at the very bottom all the way up to support for serverless applications running on top with Knative. Red Hat engineers have first-hand expertise across the entire open source stack.
In OpenShift 4, the Linux operating system is delivered as a feature of the platform through Red Hat Enterprise Linux CoreOS. Each instance of this OS is provisioned and updated by Kubernetes itself, using the Kubernetes declarative API machine controllers as part of the OpenShift installer. The entire stack embraces the concepts of fully immutable infrastructure.
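As one small illustration of that declarative model, the sketch below lists the MachineConfig objects that describe the desired node state on an OpenShift 4 cluster. It assumes the kubernetes Python client is installed and a kubeconfig with sufficient read access is available; it is an inspection sketch, not part of the upgrade tooling itself.

```python
# A minimal sketch of inspecting the declarative OS configuration on an
# OpenShift 4 cluster through the Kubernetes API. Assumes the "kubernetes"
# Python client and a kubeconfig with cluster-read access.
from kubernetes import client, config

def list_machine_configs():
    config.load_kube_config()          # or config.load_incluster_config() inside a pod
    api = client.CustomObjectsApi()
    # MachineConfig objects describe the desired RHEL CoreOS state; the
    # machine-config machinery rolls nodes until they match it.
    objs = api.list_cluster_custom_object(
        group="machineconfiguration.openshift.io",
        version="v1",
        plural="machineconfigs",
    )
    for item in objs.get("items", []):
        print(item["metadata"]["name"])

if __name__ == "__main__":
    list_machine_configs()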
Red Hat engineers were able to quickly narrow down what had changed in the kernel to just a few networking packages. It turned out those amounted to only a few commits' worth of changes, but Dettelback said the team was able to learn this fact in a day, rather than spending their time researching the vagaries of the Linux kernel.
Stephen Cuppett, director of engineering for Red Hat OpenShift, said that the Quay team, the OpenShift teams and the Linux teams all tried to root out possible causes, narrowing the problem space as quickly as possible. But that wasn't as easy as it sounds, as the problem only manifested at tremendous scale, making it difficult to replicate in the lab.
Compounding matters, the Telemeter service, a remote debugging data stream, had been experiencing network-based outages after the 4.3.18 to 4.3.19 update, so both the Quay.io and Telemeter teams were initially convinced that they were tracking down the same bug.
“As a macro-level service failure,” said Dettelback, “there were a lot of avenues to chase down. We had application things to chase down, infrastructure things to chase down, we had OpenShift things, then the RHEL side of things for this. We knew we had a small number of deltas we were dealing with. After quite a bit of investigation, we figured out that the telemeter issues were networking related, but [it was] not the same issue Quay saw.”
This is when proper logging of performance metrics became important. When the outage occurred, the clusters’ performance metrics were captured and saved using a synthetic benchmark test against a smaller version of Quay in the staging environment. Since the bug was nearly impossible to reproduce in the lab, this data would be a lifeline to figuring out the cause. The team couldn’t simply spin the updated version of Quay.io back up and wait for it to fail again, as that would interrupt services for users who had built critical systems based on Quay.
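A synthetic benchmark of that kind might look something like the sketch below: a small concurrent load generator that times manifest fetches against a staging registry and reports latency percentiles. The staging host, repository and tag are hypothetical; only the standard registry HTTP API paths are assumed, and a real test would also handle authentication.

```python
# A minimal sketch of a synthetic load test against a smaller staging registry.
# The REGISTRY host, repository and tag are hypothetical; the /v2/ paths follow
# the standard OCI/Docker registry HTTP API.
import concurrent.futures
import time
import urllib.request

REGISTRY = "https://quay-staging.example.internal"   # hypothetical staging host
REPO, TAG = "team/sample-image", "latest"            # hypothetical repository
REQUESTS, WORKERS = 200, 20

def pull_manifest(_):
    """Time one manifest fetch, the first step of a container image pull."""
    start = time.perf_counter()
    req = urllib.request.Request(
        f"{REGISTRY}/v2/{REPO}/manifests/{TAG}",
        headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
    )
    with urllib.request.urlopen(req, timeout=30):
        pass
    return time.perf_counter() - start

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        latencies = sorted(pool.map(pull_manifest, range(REQUESTS)))
    # Comparing these percentiles between 4.3.18 and 4.3.19 nodes is the kind of
    # side-by-side measurement that exposes a breaking point.
    print("p50:", latencies[len(latencies) // 2],
          "p99:", latencies[int(len(latencies) * 0.99)])
```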
Thus, the data from the initial issue conditions was critical to troubleshooting. Said Dettelback, “We found Quay on OpenShift 4.3.18 versus 4.3.19 behaved very differently at that breaking point. That was the clue. We knew 4.3.19 wasn’t the smoking gun, but it was the thing we were concerned about. It didn’t explain why we went down, but we knew when we had to do the upgrade, [that] we had to be careful.”
The Usual Suspects
At first, the backup system was suspected as the cause, as the database had been running backup calls prior to the outage. That turned out not to be the case, but ruling it out closed entire avenues of possibility and narrowed the list of suspects.
Unfortunately, the initial list of suspects was as long as one in an Agatha Christie novel. Cuppett said, “We have different teams at all levels of the stack, so none of my folks had to investigate all of them. It could have been a very protracted path. It’s complicated; this crosses skills boundaries. From web services on Quay in Python, to Kubernetes in Go, to the Linux kernel in C, and then there’s networking… These are all different teams that have multiple engineers at Red Hat.”
That means, said Cuppett, “We had gone wide across the different layers with multiple teams. That way, when one team found conclusive evidence, other teams could quickly abandon other costly and deep paths of investigation. And there were plenty of wrong roads to choose from, so narrowing it down across the many teams helped to prevent any one team from wasting their time, or blocking the other investigations.”
In the end, the problem stemmed from increased demand on the App Registry in Quay, a feature that had never been tested at that scale and that was experiencing unexpectedly heavy usage from development teams over time. The underlying App Registry code has since been optimized, and the teams using those features are also being accommodated in other ways, reducing demand.
Said Dettelback, “The correct solution was multiple factors: it wasn’t one thing that took down Quay.io, it was a lot of traffic on a fairly vulnerable portion of Quay’s codebase that wasn’t designed to take the load it was taking. At a technical level, DNS resolution was slower on 4.3.19, but the way we determined that was that the team was able to build a reproducer in Python.”
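Red Hat has not published that reproducer, but a minimal sketch in the same spirit might simply time repeated DNS lookups of a hostname the application resolves often, so that resolution latency can be compared across node versions:

```python
# A minimal sketch of a DNS-latency reproducer in the spirit of the one the
# team built; this is not Red Hat's actual tool. It times repeated lookups of
# a hostname so the numbers can be compared between 4.3.18 and 4.3.19 nodes.
import socket
import statistics
import time

HOSTNAME = "quay.io"    # any hostname the application resolves frequently
SAMPLES = 100

def time_resolution(host: str) -> float:
    """Return how long one name resolution takes, in milliseconds."""
    start = time.perf_counter()
    socket.getaddrinfo(host, 443)
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    timings = [time_resolution(HOSTNAME) for _ in range(SAMPLES)]
    print(f"median: {statistics.median(timings):.2f} ms, "
          f"max: {max(timings):.2f} ms")
```

Run on nodes built from each release, the median and tail figures from a loop like this make a resolution slowdown stand out immediately against the baseline.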
The Abyss, Avoided
This could have been an endless dive into every possible avenue. While the idea of coming up against a bug in the Linux kernel might sound like being knighted as a new open source warrior, is that really what your developers should be spending their time on if they've never touched the kernel before? This is the expertise of Red Hat engineers, and their work on issues like this is one of the advantages of Red Hat support. And if your teams really do encounter a kernel bug and want to take on the challenge of fixing it, we'll help them do just that. We love bringing new contributors to open source!
“The kernel was one of those things where it looked very likely. ‘Oh, there was a kernel change! That could have had an upward effect on the stack.’ But we ruled that out very quickly,” said Dettelback.
So what’s the long-term solution? “I’d say the long-term solution is not the removal of app registry (that’s a tactical fix), but continuing to strengthen our cross-team collaboration across SRE, OCP and RHEL so we can fix these sorts of things faster. Because we have experts across the value chain and we work in an open manner, it’s easy to get the right people looking at the problem when you suspect it may be in their backyard. If we were a closed source shop or a less open organization, it would have been nearly impossible to get the collaboration and insight into chasing down what was going on when quay.io went down,” said Dettelback.