TLDR
At the end of 2017, we swapped out the job engine that CloudBolt had built in 2011 for a new Celery-based one, but we ran into a raft of trouble with that approach and were surprised at how poor a fit it was for our needs. We are now moving back to the CloudBolt-engineered Job Engine as, at least for our use cases, it is more stable, scalable, and sustainable. CloudBolt 8.4-Tallman will ship running the CB Job Engine, including a set of enhancements to make it everything we need it to be (including HA-ready in an active-active configuration, horizontally scalable, etc).
We are sharing our experience with this publicly for two reasons:
- To illuminate more details for customers who have experienced instability as a result of our change to Celery
- To provide more data points for others who are deciding between a handcrafted task management system and Celery
Introduction
When building an enterprise software application, one of the hardest types of decisions to make can be whether to use a third-party product to fulfill a particular need, or to build that functionality from scratch. Third party solutions can offer a valuable shortcut and may be more feature-rich than what a team could afford to build on their own. On the other hand, they may not meet the specific needs as perfectly as a custom-crafted tool would.
When we created the initial version of CloudBolt in 2011, we needed something that would execute tasks asynchronously. We surveyed the third party solutions available at the time and decided to write our own simple task management program, thinking we would swap it out for a third party one before long when they had matured further.
It took until late 2017 for us to get to that, by which time Celery had become the most widely-accepted, industrial-grade, and scalable solution. Celery is a modular library that supports multiple message brokers and concurrency models. It can be architected to process large quantities of tasks via distributed workers and solves a number of difficult problems that asynchronous processing can hit at scale. For CloudBolt version 7.7, we converted our job engine into one based on Celery.
A Little About Our Use Case
For those unaware, CloudBolt is a Django-based application that gives enterprise IT shops a single interface for self-service hybrid cloud provisioning and management. It spawns asynchronous jobs, on user demand, on a recurring schedule (e.g., once an hour), and one-time scheduled jobs. These jobs are used for provisioning VMs, creating full application stacks in multiple environments, deploying storage, refreshing inventory, powering off workloads on a schedule to save money, and so on. For some of our customers, CloudBolt needs to scale to run thousands of concurrent jobs. Ours is also a relatively large application, consisting of 77 different Django apps within the project, and consuming about 150MB of RAM when the full set of CloudBolt models are loaded.
CloudBolt jobs have a number of properties that have proven to be not well suited for Celery. For example, Celery’s support for automatic retrying of tasks cannot be used as most CloudBolt jobs are not guaranteed to be idempotent. Additionally, some CloudBolt jobs can take hours to run, which is far longer than most Celery installations are configured for. Finally, CloudBolt jobs are regularly and dynamically dependent on other jobs. Those relationships can make safe and predictable execution extremely difficult and is not something Celery is designed to handle efficiently.
Another aspect of CloudBolt’s use case is that the product has the rare (and highly appreciated) feature of allowing customers to provide arbitrary plug-in code, which is run by the job engine within the same memory space and with all the same power of CloudBolt-built jobs. If all of the code run by the job engine was owned by CloudBolt engineering, we could re-architect the code to streamline the dependencies on the underlying data model and reduce the memory footprint. But our customers love the automatic, full access they get with this model and we see value in empowering them to write a wide and deep range of plug-ins to extend and connect CloudBolt with other systems.
The Fallout
Celery has support for multiple worker pools, and the two that we were the best candidates for us were:
- Pre-fork, where each task spawns its own separate process.
- Eventlet, where each task spawns a new thread (lighter weight than a whole process).
We initially experimented with the pre-fork mode, which seemed to be the most used, but we found that the memory footprint of CloudBolt’s ORM quickly utilized more memory than we wanted to require our customers to allocate.
1,000 concurrent jobs x 150MB = 150GB of RAM
Knowing that our customers would eventually be scaling to tens of thousands of jobs, this memory footprint was infeasible.
Because of this, we switched to eventlet, which solved the memory footprint problem. However, this approach led to a series of problems that required more eventlet expertise than we have, and the learning curve was steeper than we anticipated. Some of the problems we encountered were database connections not being properly shared, then not being properly cleaned up, Celery worker processes unexpectedly dying when CPU load was high, eventlet pools not being able to cancel jobs (requiring us to write our own cancellation logic), an exception in eventlet DNS resolution that we were unable to consistently reproduce, database timeouts that lasted until we implemented eventlet database pooling, and the inability to store data on threads with eventlet’s greenlets. We worked through all of these problems, but more were continuing to arise.
The War of Attrition
When working on each CloudBolt release from 7.7 to 8.3, we identified a set of scalability & stability problems with our usage of Celery, tracked them down, and issued fixes for them. Many signs and testing led us to believe that they would eliminate most if not all of the problems for our customers. But with various new customer cases and test scenarios in our labs that utilized job execution in a different way, we continued to encounter new sets of problems that often required a significant effort to recreate, troubleshoot, and then fix.
After four minor releases and numerous patches, we stepped back to re-evaluate the pros and cons of using Celery vs a homegrown job engine. With our increased knowledge of and experience with Celery, the limitations of our handcrafted job engine seemed more like trivial inconveniences, and the decision was clear for us to move back to our CloudBolt-engineered Job Engine in CloudBolt 8.4-Tallman.
Conclusion
At each step along the way, our engineers made good decisions based on the information we had but failed to anticipate how what we thought was an enterprise-ready, battle-tested execution framework could have been so problematic. We have conducted two post-mortems
and put measures in place to ensure we detect problems in scalability and performance before those issues ever make it into even an alpha release of CloudBolt. These measures include automated nightly scalability tests that test several approaches to running large numbers of concurrent jobs and orders, cultural changes to encourage creativity in brainstorming what could go wrong and how to simulate problematic scenarios, incorporating pre-mortems into our planning process, and hiring engineers dedicated to writing automated tests.
Speaking for all of CloudBolt, we feel horrible about the regressions in stability that affected some of our customers as a result of switching our execution engine to Celery. We take pride in our product and the fact that this has been the only significant regression since we released our first GA product seven years ago. We will continue brainstorming and enacting measures to prevent anything like this from happening again.
Going forward, we are excited to add full, active-active HA to the CloudBolt Job Engine in 8.4-Tallman, improve locking mechanisms, test scalability and stability extensively, document it in full, and restore the stable foundation our customers need and deserve.