Database Hibernation and Bursting

This week we’re talking for the first time about a lot of cool things we’ve had in the pipeline for a while. These are less about the nature of what’s inside a single database and more about the future direction of the product from a management and automation point of view.

We’ve always believed that being “cloud-scale” means many things. It’s about scaling out databases by adding resources on demand. It’s also about being agile, supporting the unexpected spikes that happen in the real world and taking advantage of those on-demand resources as efficiently as possible.

In a previous post I talked about NuoDB’s management model in relation to handling multiple databases through some simple command-line operations. In this post I’m going to take the next step and talk about how that model starts to become more powerful through some simple policy definitions and automation.

Building Blocks

Before I talk about some new features we’re testing, I want to explain what we already have to build on. Running on each provisioned host is an Agent that tracks everything happening locally. This means it knows about all the local database processes (Transaction Engines and Storage Managers). The Agent doesn’t know anything about the content of those databases, but it can listen to all the statistics being reported by the processes (memory use, CPU averages, SQL statements, connected clients, IO activity, queue sizes, etc.). Between those process statistics, the management messages it hears, and host-local statistics, the Agent has a pretty good idea of what’s happening on the local host.
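To make that concrete, here’s a rough sketch of the per-process view an Agent could assemble from those reports. The names below are purely illustrative (this isn’t the Agent’s actual data model); the point is that all of these signals are available locally, without ever looking at database content.

```python
from dataclasses import dataclass

@dataclass
class ProcessStats:
    process_id: str                # identifier for a TE or SM
    process_type: str              # "TE" or "SM"
    database: str                  # the database this process belongs to
    cpu_percent: float             # recent CPU average
    memory_mb: int                 # resident memory in use
    connected_clients: int         # active SQL connections (meaningful for TEs)
    sql_statements_per_sec: float  # recent SQL throughput
    io_ops_per_sec: float          # storage activity (meaningful for SMs)
    queue_size: int                # pending work inside the process
```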

Recall that at least one of these Agents must be acting as a Connection Broker (in practice, you want more than one for availability reasons). In addition to knowing what’s happening on its host, a Broker also has a global view of the Domain. This means that it knows about all hosts and the processes that are running on them, and it can use that global knowledge to make decisions about things like load balancing.
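As a toy example of what that global knowledge buys you, a load-balancing rule at a Broker could be as simple as the sketch below. It builds on the hypothetical ProcessStats structure above; none of this is the real Broker code.

```python
# Illustrative only: a Broker with a Domain-wide view could implement a
# simple rule like "send new clients to the least-busy TE for this database".
def pick_transaction_engine(domain_view, database):
    """domain_view maps host name -> list of ProcessStats for that host."""
    candidates = [
        stats
        for processes in domain_view.values()
        for stats in processes
        if stats.process_type == "TE" and stats.database == database
    ]
    # Route the new connection to the TE with the fewest active clients.
    return min(candidates, key=lambda s: s.connected_clients)
```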

We haven’t exposed it yet (that will happen soon), but the Agent has an extensible API. One of the features this gives you is the ability to write a new Service. The Service API gives you access to local or global events and updates, and therefore makes it pretty easy to introduce new functionality with either local or global control. This is the key management feature that we’re using to test out two new ideas around automating resource optimization.
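To give a feel for the shape of such a Service, here’s a hypothetical interface. Since the API hasn’t been published yet, treat every name below as an assumption; the only claim is that a Service can subscribe to both host-local and Domain-wide events.

```python
# A hedged sketch of what a pluggable Service might look like. The class
# and method names are assumptions for illustration only.
class Service:
    def on_local_stats(self, stats):
        """Local event: a process on this host reported new statistics."""

    def on_process_started(self, host, process_id):
        """Global event: a TE or SM came online somewhere in the Domain."""

    def on_process_stopped(self, host, process_id):
        """Global event: a TE or SM went away."""

    def on_connection_request(self, database_name):
        """Broker event: a client asked to connect to a database."""
```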

Database Hibernation & Waking

Often you have to run many databases, but at any given time some of those databases are idle. For instance, think about a blog, wiki or other web application that is backed by a database. Maybe it’s a production app, maybe it’s something in your testing lab, or maybe it’s just something on your home system.

In any case, you need to keep the database running in case someone uses the application. When not in use, however, the database software is still running and consuming CPU, memory and IO resources. In the case of a single database that might not be too bad, but when you start having to run more than a few databases, it can get pretty expensive pretty quickly.

This is what hibernation addresses. Each management Agent runs with a new Service that monitors local activity. If it sees that a process is inactive (e.g., a TE that isn’t serving any SQL requests or has no connected clients), then it has the option to shut it down. Automatically. Recall that in NuoDB a running database is just a collection of TEs and SMs. If all the processes are local, then the Agent can just shut down the database. If some of the processes are local and some remote, then shutting down the local processes shrinks the database’s footprint to the resources it actively needs.

When all of its processes are shut down, the database is in hibernation. It still has an on-disk representation in one or more archives, but it’s no longer taking up any active system resources, which frees those resources to support other databases that are doing work. We can make the decision to shut down a process based on any policy (see below).
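For illustration, one very simple version of such a policy might be “shut down any process that has been quiet for fifteen minutes.” The sketch below is exactly that, written against a hypothetical agent handle rather than the real API.

```python
# A minimal sketch of one possible hibernation policy. The `agent` calls
# for listing processes and shutting them down are assumptions, not the
# real API; the threshold is arbitrary.
IDLE_GRACE_PERIOD_SECS = 15 * 60   # how long a process must stay quiet

def check_for_hibernation(agent, now):
    for stats in agent.local_processes():
        if stats.connected_clients == 0 and stats.sql_statements_per_sec == 0:
            idle_for = now - agent.last_activity_time(stats.process_id)
            if idle_for >= IDLE_GRACE_PERIOD_SECS:
                # Shrink the database's footprint on this host. If these
                # were its last running processes anywhere, the database
                # is now hibernating: archives on disk, nothing in memory.
                agent.shutdown_process(stats.process_id)
```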

It’s only safe to shut down processes like this because we can also re-start, or wake, a database very easily. As long as we know the configuration for a database (again, see below), it’s just a matter of a Connection Broker reacting to a request for a database that isn’t running and starting the associated processes via our new Service. True, you lose cached data when you start cold, but as long as you’re choosing to shut down processes that seem to be pretty idle, the cost of re-populating the caches is small compared to continuing to consume system resources. In practice we see the startup cost for a TE/SM pair on a local host is around 35 milliseconds, so there’s little overhead from pure process management.
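Sketching that reaction in the same hypothetical style: when a connection request arrives for a database that isn’t running, the Broker-side Service looks up the saved configuration and starts the processes before handing back a TE. The broker calls and the configuration lookup below are assumptions about how such a Service could be wired up.

```python
def handle_connection_request(broker, database_name):
    if not broker.is_running(database_name):
        config = broker.saved_configuration(database_name)
        # Start an SM against the existing archive first, then a TE, so
        # the TE has storage to talk to when it comes online.
        broker.start_process("SM", config.archive_host, database_name)
        broker.start_process("TE", config.te_host, database_name)
    # Either way, hand the client a running TE to connect to.
    return broker.least_loaded_te(database_name)
```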

Collectively, the ability to dynamically shrink a database, possibly down to nothing, and then re-start it is very powerful. It answers the question above about how you handle available resources in an agile fashion. To handle sudden unexpected spikes, however, we need another piece.

Database Bursting

Let’s say you have a set of servers provisioned to run your databases. They’re running along just fine when suddenly one of the databases spikes in load. Taking the above example of hosted blogs, maybe some blog suddenly got very popular and now there are orders of magnitude more requests for its content. What do you do?

If you know all of the hosts that are provisioned and available, and you know something about the load on each host, you can choose to move processes around. If you really want to be ready for the worst case, then you keep a couple of “larger” systems provisioned explicitly to kick in when extra capacity is needed. We call this bursting.

If a blog suddenly gets wildly popular, our new Service sees this and looks for a way to react. It could choose to re-balance many processes, but that’s typically going to cause too much disruption. Instead, it can see that a server is reserved but available for exactly this case (or, in an environment like EC2, it could bring a server on-line on demand). It’s simply a matter of starting a new TE on that host and, as soon as it’s online, shutting down the local TE that is causing the spike in resource use. Depending on the storage configuration and the policy in play, we might also move the SM.
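Here’s that sequence sketched out, again with hypothetical names for the Domain-level calls and an arbitrary CPU threshold standing in for whatever policy is actually in play.

```python
# A sketch of the bursting reaction described above. The `domain` handle,
# the reserved-host lookup and the threshold are illustrative assumptions,
# not the actual implementation.
CPU_SPIKE_THRESHOLD = 90.0   # percent

def check_for_burst(domain, database_name):
    hot_te = domain.busiest_te(database_name)
    if hot_te.cpu_percent < CPU_SPIKE_THRESHOLD:
        return
    # Prefer a host held in reserve for exactly this case; in a cloud
    # environment this could instead provision a new instance on demand.
    burst_host = domain.reserved_host() or domain.provision_host()
    new_te = domain.start_process("TE", burst_host, database_name)
    domain.wait_until_online(new_te)
    # Only once the new TE is serving connections do we retire the one
    # that was causing the spike on the original host.
    domain.shutdown_process(hot_te.process_id)
```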

What we’ve just done is automatically react to a resource spike and move the database to a more capable system. On a temporary basis we “burst” off the host where the database was being served. At this point either the spike settles down and we can move the TE back to its original location, or we notify the domain administrator that the database needs to be moved permanently to a more capable home.

Source: http://www.nuodb.com/techblog/2013/04/08/database-hibernation-and-bursting/

License: You have permission to republish this article in any format, even commercially, but you must keep all links intact. Attribution required.