Entering the Serverless era
In this blog, we share the journey of building a Serverless-optimized Artifact Registry from the ground up. The main goals are to ensure that container image distribution scales seamlessly under bursty Serverless traffic and stays available under challenging scenarios such as major dependency failures.
Containers are the modern cloud-native deployment format, featuring isolation, portability and a rich tooling ecosystem. Databricks internal services have been running as containers since 2017. We deployed a mature and feature-rich open source project as the container registry. It worked well while services were deployed at a controlled pace.
Fast forward to 2021: when Databricks started to launch the Serverless DBSQL and Model Serving products, millions of VMs were expected to be provisioned each day, and each VM would pull 10+ images from the container registry. Unlike other internal services, Serverless image pull traffic is driven by customer usage and can reach a much higher upper bound.
Figure 1 shows one week of production traffic load (e.g. customers launching new data warehouses or Model Serving endpoints): the Serverless Dataplane peak traffic is more than 100x that of internal services.
Based on our stress tests, we concluded that the open source container registry could not meet the Serverless requirements.
Serverless challenges
Figure 2 shows the main challenges of serving Serverless workloads with open source container registry:
- Not sufficiently reliable: OSS registries generally have a complex architecture and dependencies such as relational databases, which bring in additional failure modes and a large blast radius.
- Hard to keep up with Databricks’ growth: in the open source deployment, image metadata is backed by vertically scaled relational databases and remote cache instances. Scaling up is slow, sometimes taking 10+ minutes. These instances can be overloaded when under-provisioned, or too expensive to run when over-provisioned.
- Costly to operate: OSS registries are not performance-optimized and tend to have high resource usage (CPU intensive). Running them at Databricks’ scale is prohibitively expensive.

What about cloud managed container registries? They are generally more scalable and offer availability SLAs. However, different cloud provider services have different quotas, limitations, reliability, scalability and performance characteristics. Since Databricks operates in multiple clouds, we found that this heterogeneity did not meet our requirements and was too costly to operate across.
Peer-to-peer (P2P) image distribution is another common approach to reduce the load on the registry, at a different infrastructure layer. It mainly reduces the load on registry metadata but is still subject to the aforementioned reliability risks. We later also introduced a P2P layer to reduce cloud storage egress throughput. At Databricks, we believe that each layer needs to be optimized to deliver reliability for the entire stack.
Introducing the Artifact Registry
We concluded that it was necessary to build a Serverless-optimized registry to meet the requirements and stay ahead of Databricks’ rapid growth. We therefore built Artifact Registry – a homegrown multi-cloud container registry service. Artifact Registry is designed with the following principles:
- Everything scales horizontally:
  - No relational databases; instead, metadata is persisted in cloud object storage (an existing dependency for image manifest and layer storage). Cloud object storage is much more scalable and is well abstracted across clouds.
  - No remote cache instances; the nature of the service allows us to cache effectively in memory.
- Scaling up/down in seconds: we added extensive caching for image manifest and blob requests to avoid hitting the slow path (the registry itself). As a result, only a few instances (provisioned in a few seconds) need to be added instead of hundreds (a simplified sketch of the metadata and caching design follows this list).
- Simple is reliable: unlike OSS registries, which consist of multiple components and dependencies, Artifact Registry embraces minimalism. As shown in Figure 3, behind the load balancer there is only one component and one cloud dependency (object storage). Effectively, it is a simple, stateless, horizontally scalable web service.
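
To make the first two principles concrete, here is a minimal sketch, not our production code, of a stateless manifest server that keeps metadata in object storage and serves hot manifests from an in-memory cache. The ObjectStore interface and the key layout are illustrative assumptions standing in for a cloud SDK (S3, GCS or Azure Blob):

```go
package registry

import (
	"context"
	"fmt"
	"sync"
)

// ObjectStore abstracts the single cloud dependency (object storage).
// In a real deployment this would wrap an S3/GCS/Azure Blob client.
type ObjectStore interface {
	Get(ctx context.Context, key string) ([]byte, error)
}

// ManifestServer is stateless apart from its in-memory cache, so any
// number of replicas can be added behind the load balancer.
type ManifestServer struct {
	store ObjectStore
	mu    sync.RWMutex
	cache map[string][]byte // manifest digest -> manifest bytes
}

func NewManifestServer(store ObjectStore) *ManifestServer {
	return &ManifestServer{store: store, cache: make(map[string][]byte)}
}

// GetManifest serves from memory when possible and falls back to object
// storage on a cache miss. Manifests fetched by digest are content-addressed,
// so cached entries never go stale.
func (s *ManifestServer) GetManifest(ctx context.Context, repo, digest string) ([]byte, error) {
	s.mu.RLock()
	m, ok := s.cache[digest]
	s.mu.RUnlock()
	if ok {
		return m, nil // fast path: no dependency touched
	}

	// Slow path: fetch from object storage (this key layout is hypothetical).
	m, err := s.store.Get(ctx, fmt.Sprintf("registry/%s/manifests/%s", repo, digest))
	if err != nil {
		return nil, err
	}

	s.mu.Lock()
	s.cache[digest] = m
	s.mu.Unlock()
	return m, nil
}
```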

Figures 4 and 5 show that P99 latency was reduced by 90%+ and CPU usage by 80% after migrating from the open source registry to Artifact Registry. We now only need to provision a few instances for the same load, versus thousands previously. In fact, handling production peak traffic does not require scaling out in most cases; when auto-scaling is triggered, it completes in a few seconds.


Surviving cloud object storage outages
With all the reliability improvements mentioned above, there is still a failure mode that occasionally happens: cloud object storage outages. Cloud object storage is generally very reliable and scalable; however, when it is unavailable (sometimes for hours), it can cause regional outages. At Databricks, we try hard to make cloud dependency failures as transparent as possible.
Artifact Registry is a regional service with an identical replica in each cloud/region. In case of a regional storage outage, image clients are able to fail over to a different region, trading off image download latency and egress cost. By carefully curating failover latency and capacity, we were able to quickly recover from cloud provider outages and continue serving Databricks’ customers.
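
As an illustration, the client-side failover can be as simple as walking an ordered list of regional endpoints. This is a minimal sketch under assumed names (pullFunc and the endpoint ordering are hypothetical), not the actual client:

```go
package registry

import (
	"context"
	"fmt"
)

// pullFunc performs one image pull against a single regional endpoint.
type pullFunc func(ctx context.Context, endpoint, image string) error

// PullWithFailover tries the home region first, then falls back to remote
// replicas, accepting higher download latency and egress cost in exchange
// for availability during a regional object storage outage.
func PullWithFailover(ctx context.Context, endpoints []string, image string, pull pullFunc) error {
	var lastErr error
	for _, ep := range endpoints {
		if err := pull(ctx, ep, image); err != nil {
			lastErr = err // e.g. regional storage outage; try the next region
			continue
		}
		return nil
	}
	return fmt.Errorf("all regions failed for image %s: %w", image, lastErr)
}
```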

Conclusions
In this blog post, we shared our journey of scaling container registries from serving low-churn internal traffic to customer-facing, bursty Serverless workloads. We purpose-built the Serverless-optimized Artifact Registry. Compared to the open source registry, it reduced P99 latency by 90% and resource usage by 80%. To further improve reliability, we made the system tolerate regional cloud provider outages. We also migrated all existing non-Serverless container registry use cases to Artifact Registry. Today, Artifact Registry continues to be a solid foundation that delivers reliability, scalability and efficiency amid Databricks’ rapid growth.
Acknowledgement
Building reliable and scalable Serverless infrastructure is a team effort from our major contributors: Robert Landlord, Tian Ouyang, Jin Dong, and Siddharth Gupta. The blog is also a team effort – we appreciate the insightful reviews provided by Xinyang Ge and Rohit Jnagal.