The likelihood of dealing with enterprise IT gremlins¹ is heightened during certain times of the year for any DevOps team. My brother, who works in IT disaster recovery for a healthcare agency, reminded me of this at our most recent Thanksgiving gathering. He had to address four hours of downtime right before the holiday because a DevOps-related change was pushed to the production system instead of a test environment. Sound familiar?
Whether it’s a holiday, the close of a quarter, or “go live” day, any number of factors can put extra stress on IT staff and give network gremlins more of a chance to plague an enterprise. Although not as mischievous as the mythical gremlins, sloppiness causes trouble, difficulties, and unexpected failures—threatening security as well as contributing to downtime and poor performance.
Self-Service Resources and IT Automation
A solid plan for self-service options and IT automation keeps gremlins at bay. End users need access to hardened resources and processes when the people who hold the keys to those resources are on PTO or swamped by other high-priority projects.
Leaving users waiting for resources or an update can drive them to workarounds or shortcuts, and you don’t want anyone in your organization going rogue during stressful times. The more self-service IT that enterprise IT and DevOps teams enable, the less likely folks are to fend for themselves.
Making any DevOps practice or IT process bulletproof against occasional mishaps is nearly impossible, but reducing their likelihood is worth the effort. The following approaches help:
- Eliminate bottlenecks
Consider a typical workflow from start to finish, and if there are dependencies that require manual input, make sure you have an alternative method for achieving the end result. One way to do this is to enable administrator access for trusted individuals who can step in if the primary admin is not available. In some cases, this person could be above or just below the primary admin on the org chart. Get your boss’s boss to intervene when necessary and the bottleneck will move along a little faster.
- Automate approvals
Always look at routine approval processes and automate them whenever you can. That does not mean approving every request automatically, but rather setting up automated checklists so that if a request meets the requirements, there’s no need for manual approval. You could also set up specific sets of resources that can be provisioned without approval. This is particularly useful when you want self-service resources but not an open faucet. IT automation eliminates unnecessary manual errors.
- Consolidate resources
Another way to reduce mishaps (the “who’s on first?” effect) is to centralize resource management with specific teams that have defined roles and plans for backup coverage. When resources and roles are scattered throughout the organization and someone with a key role is out on PTO, you’ll be scrambling to figure out where to get the IT resources you need—just like the old Abbott & Costello routine.
- Embed security
Security must be part of the whole process from start to finish. When provisioning IT resources on premises and in private and public cloud environments, give special consideration to containerization and other virtualized environments in the cloud. For a quick reference on security concerns and DevOps resources, see Enterprise Hybrid Cloud Containerization and Rugged DevOps and DevSecOps for Security. That post drills down into two manifestos that are also helpful in hardening security.
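To make the “automate approvals” idea above concrete, here is a minimal sketch of an automated approval checklist in Python. The rule names, quotas, and request fields are illustrative assumptions, not CloudBolt APIs: a request is auto-approved only if it passes every rule, and otherwise is routed to a human with the reasons attached.

```python
# Hypothetical auto-approval checklist: every rule must pass for a request
# to skip manual review. Rule names and limits are illustrative only.

AUTO_APPROVE_RULES = [
    ("environment is non-production", lambda r: r.get("environment") in {"dev", "test"}),
    ("cpu within quota",              lambda r: r.get("cpu_cores", 0) <= 4),
    ("memory within quota",           lambda r: r.get("memory_gb", 0) <= 16),
    ("expiration date set",           lambda r: r.get("expires_in_days", 0) > 0),
]

def review_request(request: dict) -> tuple:
    """Return ('approved', []) or ('manual-review', [failed rule names])."""
    failed = [name for name, rule in AUTO_APPROVE_RULES if not rule(request)]
    return ("approved", []) if not failed else ("manual-review", failed)
```

A small dev-environment request such as `review_request({"environment": "dev", "cpu_cores": 2, "memory_gb": 8, "expires_in_days": 30})` sails through automatically, while a production request falls back to a human reviewer with a clear list of why.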
A centrally managed platform like CloudBolt can get any IT organization on the right path to avoiding the “gremlin” effect, especially as we approach another holiday season and schedules and priorities will undoubtedly be different for many enterprises.
¹ Gremlins are unexplained problems or faults.
Over time, computing has gone from mainframes to bare-metal servers to on-premises virtualization to cloud server instances and containerization, and now to serverless computing. What’s next, codeless computing? Probably not, and luckily serverless computing is nowhere near that bizarre. The server element for executing code is essentially abstracted away from developers, and the technology is new enough that we’re still in Wild West territory.
Serverless Computing Explained
Serverless computing is a fancy way of saying that you don’t have to worry about servers when you want to execute code—an approach often referred to as Function-as-a-Service (FaaS). Major cloud providers already have compute capacity ready for anyone to reserve for running virtual machines (VMs) and containerized microservices.
For public cloud providers, why not take it one step further and run isolated code on demand as a way to make more money? This is great for developers who need to continuously add services and features to their application stack but don’t want to fuss with managing infrastructure.
Major cloud providers offer these serverless computing options with an emphasis on the payment model:
- Amazon Web Services (AWS)—AWS Lambda
“Run code without thinking about servers. Pay only for the compute time you consume.”
- Microsoft Azure—Functions
“Accelerate your development with an event-driven, serverless compute experience. Scale on demand and pay only for the resources you consume.”
- Google Cloud Platform (GCP)—Google Cloud Functions
“Event-driven serverless compute platform.”
- IBM Cloud—IBM Cloud Functions (aka OpenWhisk)
“…executes functions in response to incoming events and costs nothing when not in use.”
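Whichever provider you pick, the programming model is roughly the same: you write a plain function that receives an event and returns a response, and the platform handles everything underneath. Here is a minimal sketch in the shape of an AWS Lambda Python handler; the event fields are assumptions for illustration, and locally the function is just Python you can call directly.

```python
# The FaaS shape: a function that takes an event (a dict parsed from
# JSON) and a context object, and returns a response. No server code.

import json

def handler(event, context):
    """Illustrative handler: the 'name' field is an assumed event schema."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

# Locally there is no server to run; you just invoke the function:
# handler({"name": "DevOps"}, None)
```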
As great as these services are, though, we still have to contend with the good, the bad, and the ugly.
The Good
The good is the on-demand nature of this computing strategy at low cost. Suppose an application developer wants to give an aging application architecture a quick lift with a small feature that checks an Internet of Things (IoT) sensor in a smart home, such as an air-quality sensor, and automatically suggests or orders a new air filter. Instead of adding the compute power and infrastructure needed to serve many thousands of subscribers, they can develop an on-demand function that only needs to run occasionally.
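The air-filter scenario above might look something like the following sketch: a function that runs only when a sensor reading arrives, so there is no always-on server to pay for. The threshold, event fields, and "order" action are all hypothetical assumptions for illustration.

```python
# Hypothetical on-demand function: triggered per sensor reading, idle
# (and free) the rest of the time. Threshold and schema are illustrative.

FILTER_REPLACEMENT_THRESHOLD = 150  # assumed air-quality index cutoff

def check_air_quality(event, context=None):
    """Suggest ordering a new filter only when air quality is poor."""
    aqi = event["air_quality_index"]
    if aqi >= FILTER_REPLACEMENT_THRESHOLD:
        return {"action": "order_filter", "aqi": aqi}
    return {"action": "none", "aqi": aqi}
```

Because the function runs for milliseconds a few times a day, the pay-per-invocation model is far cheaper than provisioning instances to sit mostly idle.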
The Bad
The bad is that these functions can get complicated and hard to manage, especially if they must run for more than five minutes at a time within an application process. They must also be accessed through an API gateway, and any dependencies from common libraries must be packaged into them. This can be terribly inefficient compared to containerization. The more complicated the coding required, the less likely a serverless function is to suit the application architecture well. For more information, see What is Serverless Architecture? What are its Pros and Cons?
The Ugly
The ugly is that there is currently no standardization of serverless computing across the different public providers. Vendor lock-in becomes a real risk when these enticing functions—with their low prices—become addicting to developers and enterprises. They cannot be ported around as easily as containers can.
As Rick Kilcoyne, VP of Solutions Architecture at CloudBolt, stated in a recent article:
“…tantalizing as serverless computing is, one must be fully aware that moving code between serverless platforms is extremely difficult and only made more so by cloud vendor specific libraries, paradigms, and IAM. Serverless computing is the technological equivalent of a snare trap as there’s virtually no way to easily migrate from one platform to another once committed.”
Roundup
Serverless computing should definitely be a part of any enterprise hybrid cloud strategy. Just as a hybrid cloud application has a mix of public and private clouds, it can also have a mix of infrastructure technologies such as virtualization, containerization, and serverless computing with functions. Our CloudBolt hybrid cloud management platform helps you manage it all from one place.
To see how CloudBolt makes serverless computing easier, check out a demo.
At CloudBolt, we believe that software solutions should be easy to maintain, manage, and understand. We also believe they should be self-regulating and, when possible, self-healing. You will see a focus on this starting in 8.4—Tallman and continuing through our 9.x releases, which will give you better visibility into CloudBolt’s internal status, offer management capabilities directly from the web UI, and reduce the number of times you need to SSH to the CloudBolt VM to check things or perform actions.
CloudBolt 8.4—Tallman introduces a new Admin page called “System Status,” which provides several tools for checking on the health of CloudBolt itself.
The System Status Page in 8.4—Tallman
To see the System Status page in your newly installed or upgraded CloudBolt 8.4—Tallman, navigate to Admin > Support Tools > System Status. You will see a page that looks a bit like this:
There are three main parts of this page.
1. CloudBolt Mode
This section provides a way to put CloudBolt into an admin-only maintenance mode, which prevents any user who is not a Super Admin or CloudBolt admin from logging in or navigating this CloudBolt instance. This is useful when you need to perform maintenance on CloudBolt (e.g., upgrading it or making changes to the database) and want to prevent users from accessing it while it is in an intermediate state, but you yourself need to perform preparation and verification within the CB UI before and after the maintenance.
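Conceptually, an admin-only maintenance mode is just a gate in front of every request. The sketch below is a hypothetical illustration of that idea, not CloudBolt’s actual implementation; the flag and user shape are assumptions.

```python
# Illustrative maintenance-mode gate: when the flag is on, only admins
# get through. This is a conceptual sketch, not CloudBolt internals.

MAINTENANCE_MODE = True  # imagined as toggled from the System Status page

def allow_request(user: dict) -> bool:
    """Admins keep full access during maintenance; everyone else waits."""
    if not MAINTENANCE_MODE:
        return True
    return user.get("is_admin", False)
```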
2. Job Engine
This section shows the status of each job engine worker, each running on a different CloudBolt VM now that active-active Job Engines are supported. It also shows a chart of all jobs run in the last hour and day per job engine. When things are healthy and the job engines are not near their max concurrency limit, jobs should be split fairly evenly across the workers.
3. Health Checks
This section has several kinds of checks:
- Indications of the health of a specific service, as would be seen from the Linux command line when running `service <name> status`
- Tests of OS-level health, such as a check of available disk space on the root partition
- Functional tests, which perform some basic action to make sure systems are working properly. Functional tests in 8.4—Tallman include writing a file to disk and deleting it, creating an entry in the database and deleting it, and adding an entry to memcache and deleting it.
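To illustrate the difference between the OS-level and functional checks described above, here is a hedged sketch of two of them in Python. This is not CloudBolt’s implementation; the check names and the 10% free-space threshold are assumptions.

```python
# Illustrative health checks: each returns (check name, healthy?).
# Not CloudBolt's actual code; names and thresholds are assumed.

import os
import shutil
import tempfile

def check_disk_space(path="/", min_free_fraction=0.10):
    """OS-level check: is at least 10% of the partition free?"""
    usage = shutil.disk_usage(path)
    return ("disk space", usage.free / usage.total >= min_free_fraction)

def check_file_write():
    """Functional test: write a file to disk, then delete it."""
    try:
        fd, path = tempfile.mkstemp()
        os.write(fd, b"health check")
        os.close(fd)
        os.remove(path)
        return ("file write", True)
    except OSError:
        return ("file write", False)

for name, healthy in (check_disk_space(), check_file_write()):
    print(f"{name}: {'OK' if healthy else 'FAIL'}")
```

The functional test actually exercises the system (a real write and delete) rather than just inspecting state, which is why it catches problems a simple status query would miss.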
Ensuring the health of the systems that underlie CloudBolt can help you quickly home in on the root cause of an issue, and we hope the System Status page will shorten the time it takes to troubleshoot and resolve issues with CloudBolt.
What’s Next for the System Status Page
We have some ideas for what we might add next:
- Uptime metrics for each job engine worker
- The average time for jobs to complete for each worker
- Disk space checks for all partitions on the CB VM
- CPU, memory, I/O, and network utilization for the CB VM
- Uptime for the CB VM as a whole
- Network health checks, including:
- testing DNS lookups
- testing pinging the gateway
- testing connections to any configured proxies
If there are any of these that seem like they would be especially useful to you, we’d love to hear that to help us prioritize. We’d also love to hear any additional ideas you have for this new page!