Configuration Management and Distributed Systems
Configuring distributed systems can be complex and difficult to manage. Reaching agreement across disciplines is often a challenge in implementing suitable configuration management systems. This article gives examples of how to address these challenges, particularly for large distributed systems.
I'm currently working on a large project involving multiple teams building relatively small, focused services. These services are deployed into an ecosystem of related applications that collectively form 'the system'. They operate in a shared-nothing environment and are designed for horizontal scalability - something that is regularly exercised to cope with fluctuating demands on the system.
Among the many challenges involved in designing and implementing systems such as these is configuration management. Although often seen as one of the less interesting (and to some, even boring) aspects of software engineering, ensuring that each application has the correct configuration across hundreds of nodes can be tricky. I wanted to demonstrate an approach that addresses all our requirements, and shows the potential to address similar challenges in other projects.
This may be directly applicable to you, or perhaps just an idea that you can modify to suit your situation. Either way, I hope it helps. Please keep in mind that this is written from the perspective of my current project, and these are not necessarily universal truths - so consider it in context.
Kinds of Configuration
There are many different kinds of configuration that we have to deal with. Some applications only contain one kind of configuration, while others contain every kind.
- Values that are sensitive and should not be accessible to unauthorised users. Good examples are credentials and encryption keys. We don't really want these distributed as files on individual nodes, but prefer they be requested from a service that includes some protection (e.g. access control, encryption, etc.).
- Values that do not change at runtime. These are typically read at startup and used to initialise the application. These should generally be available locally on the node to allow the app to start up when external resources (e.g. configuration services) might not be available.
- Values that change at runtime and must take effect without needing an application restart.
While our application must be able to handle these different kinds of configuration, we want to limit the impact this has on the programming model. We also want to use an appropriate mechanism to store, maintain and deliver this configuration to the relevant application without adding too much of a burden to the development or operations teams.
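As a rough illustration of keeping the programming model the same for all three kinds, here is a minimal sketch in plain Java. It assumes every value is consumed as a `Supplier<String>`; the class `ConfigKinds`, its method names and the `secretService` function are all hypothetical and are not taken from any particular library.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch: every kind of configuration is consumed as a Supplier<String>.
final class ConfigKinds {
    private ConfigKinds() {}

    // Static: the value is captured once at startup; later changes are ignored.
    static Supplier<String> staticValue(Map<String, String> local, String key) {
        final String captured = local.get(key);
        return () -> captured;
    }

    // Dynamic: the value is re-read on every call, so changes apply without a restart.
    static Supplier<String> dynamicValue(Map<String, String> live, String key) {
        return () -> live.get(key);
    }

    // Sensitive: the value is requested on demand from a protected service
    // (assumed here to be a simple function) rather than read from a local file.
    static Supplier<String> secureValue(Function<String, String> secretService, String key) {
        return () -> secretService.apply(key);
    }
}
```

The calling code only ever sees a `Supplier<String>`, so whether a value is static, dynamic or sensitive is decided where the supplier is created, not where it is used.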
The next question we need to address is one of configuration delivery. Once we have our config defined somewhere and maintained by whoever is responsible for it, how do we deliver it to our application?
There are many different approaches to this, but they all essentially resolve to one of two options:
- Push: whenever configuration changes, push it to the application that needs it. This can be achieved in a number of ways, including having the application register for configuration change events, or webhooks.
- Pull: applications request the latest configuration from the source. More often than not, this is done using frequent polling.
Of course, there are pros and cons to each approach. Polling is often considered wasteful - especially for configuration, which tends to change infrequently. The flip side is that a polling application can be late to the party: a critical configuration change is not reflected in the application until the next poll. Push delivery can be more effective, but you need to be aware of lost notifications and how to handle those situations (this also depends on the delivery mechanism you choose and whether it provides automatic reconnection, retries, etc.).
We might also want to combine the two approaches if we have different kinds of configuration in different stores. For example, we might have static configuration loaded once (because we know it won't change), but dynamic configuration pushed from the source as it changes. This should be a possibility; and again, it shouldn't impact the programming model if we can help it.
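One way to combine the two models, and to recover from lost notifications, is to accept pushed changes but also reconcile against the source every so often. The following is a minimal sketch under that assumption; `PushPullConfig`, `onPush` and `reconcile` are hypothetical names, and the source is assumed to be a function that returns the full configuration with no null values.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: accepts pushed changes and repairs missed ones by pulling.
final class PushPullConfig {
    private final Map<String, String> current = new ConcurrentHashMap<>();
    private final Supplier<Map<String, String>> source; // pull: fetches the full config

    PushPullConfig(Supplier<Map<String, String>> source) {
        this.source = source;
        reconcile(); // initial load
    }

    // Push path: called by whatever notification mechanism is in use.
    void onPush(String key, String value) {
        current.put(key, value);
    }

    // Pull path: make local state match the source's view.
    // Returns true if anything had to be changed (e.g. after a missed push).
    synchronized boolean reconcile() {
        Map<String, String> latest = source.get();
        if (latest.equals(current)) {
            return false;
        }
        current.keySet().retainAll(latest.keySet()); // drop keys removed at the source
        current.putAll(latest);                      // add or update everything else
        return true;
    }

    String get(String key) {
        return current.get(key);
    }
}
```

In a real application `reconcile` would probably run on a timer, for example via `ScheduledExecutorService.scheduleAtFixedRate`, at a much lower frequency than pure polling would need.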
Setting that aside, we also need to address configuration management. This is often the point at which many people switch off (maybe it's the word 'management'), but it is crucial to get this right. What good is a configurable system if no one can manage the configuration?!
Configuration management has many facets, and I won't attempt to cover them all. The main aspects I want to address are:
- Selecting an appropriate data store is essential. Think about your requirements for multiple data centres, data consistency, store availability, resilience, failure modes, etc.
- As well as operating the applications themselves, the configuration system will also need to be managed. Depending on your environment, you may want to be conservative and choose something the team is already familiar with; or perhaps you have a bit more flexibility in making this decision. Either way, it must be easy to operate as it is a critical component to all your other applications.
- Configuration must be visible to be effectively managed. How do you know whether a circuit breaker is open or closed? How do you know what behaviour to expect of your applications? Configuration must be visible (and modifiable, if dynamic) to be effectively used. This also ties in closely with application monitoring (although that's a massive topic of its own).
Storage & Operations
Where are we going to store this configuration? This question often raises a good debate. There are many options available, and you should be free to use whatever is most appropriate to your circumstances. I'll list a few that I've looked at (some of which are contained in the demo app shown later).
To quote from the ZooKeeper website:
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
ZooKeeper supports both the push and pull synchronisation models, depending on how you configure your application. It also provides security at a configuration item level through ACLs, which is particularly useful for the sensitive configuration values we spoke about previously.
Any resilient, highly available system must consider multiple data centres, and the same applies here. ZooKeeper works across data centres, but the specifics of how to set it up in this way are beyond the scope of this article. Take a look at Leaders, Followers and Observers in the docs if you want to know more.
Cassandra fits the pull synchronisation model only, and does not provide built-in security at a configuration item level. In that sense, it lacks some useful features offered by ZooKeeper. That said, Cassandra's multi data centre support is second to none and it is incredibly easy to operate.
Of course, as great as these systems are, sometimes you just want a simple configuration file. The application should be able to handle local and remote configuration files.
Our final option is basically our "catch-all" scenario. Sometimes you want to build something new and proprietary that suits your environment specifically. Sometimes you have an existing configuration management system that you need to integrate with (whether you like it or not). Regardless, you should be able to use configuration from these sources in the same way as from any other source.
There are loads of tools available for application monitoring, and configuration is no different. If nothing suits your needs, it's simple enough to build your own. A good reference is Hystrix and Turbine, built by Netflix to work together to monitor your application. While neither of these tools visualise your configuration, they provide a useful view into how your application is responding with its current configuration; and extending them to facilitate configuration changes is not insurmountable, as proven by Yammer's Breakerbox.
Ideally, developers should be able to write their applications consuming configuration in a consistent manner, regardless of the source from which we retrieve the config. Fortunately we're very familiar with this kind of problem; it's called abstraction. As is often the case, someone else has already felt the pain and created an abstraction for you. In this case, Apache Commons Configuration is a good candidate for evaluation. Once implemented, your use of configuration is isolated from the definition of where it comes from. Think of it as the JDBC of configuration sources. You can independently change your configuration sources without impact to the code that uses particular values.
It also provides composite configuration sources, allowing you to define multiple sources that are checked in order, before falling back to some default value. Netflix has created Archaius on top of Apache Commons Configuration, supporting dynamic properties and providing a few additional configuration sources (like ZooKeeper).
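To make the idea of ordered sources with a default fallback concrete, here is a minimal sketch in plain Java. It deliberately does not use the Apache Commons Configuration or Archaius APIs; `ConfigSource` and `CompositeSource` are hypothetical names chosen only to illustrate the lookup order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical abstraction: the application only ever sees this interface.
interface ConfigSource {
    Optional<String> lookup(String key);

    // Adapts any map (e.g. the contents of a loaded properties file) to a source.
    static ConfigSource of(Map<String, String> values) {
        return key -> Optional.ofNullable(values.get(key));
    }
}

// Checks each source in order; the first source that has the key wins.
final class CompositeSource implements ConfigSource {
    private final List<ConfigSource> sources;

    CompositeSource(List<ConfigSource> sources) {
        this.sources = new ArrayList<>(sources);
    }

    @Override
    public Optional<String> lookup(String key) {
        for (ConfigSource source : sources) {
            Optional<String> value = source.lookup(key);
            if (value.isPresent()) {
                return value;
            }
        }
        return Optional.empty();
    }

    // Falls back to the supplied default when no source defines the key.
    String getString(String key, String defaultValue) {
        return lookup(key).orElse(defaultValue);
    }
}
```

Swapping a file-backed source for a ZooKeeper- or Cassandra-backed one would then mean changing the list passed to `CompositeSource`, not the code that calls `getString`.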
Ops want a configuration system that is easy to support (remember, they need to support your apps as well as whatever system you choose for configuration). It must be reliable, so your apps always get the config they need to exhibit the correct behaviour. It must be easy to view and edit configuration, whether by scripts or a purpose-built UI.
By abstracting the configuration source from the application programming model, we give both Dev & Ops the freedom to choose the most appropriate configuration sources for each application; and importantly, the ability to change our minds without having to change the applications!
Finally, we need to keep in mind that our configuration system should support all environments from development through to production without adding unnecessary complexity. How you achieve this is very much an implementation choice, but the approach you take should be able to handle the many environments your application goes through before reaching production.
Archaius offers one approach: a mechanism that allows you to describe the context in which your application is running. This is a very flexible option, but not the only one.
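As a simplified stand-in for this kind of context-aware lookup (it is not the Archaius feature itself), the sketch below resolves an environment-specific override before falling back to the shared value. The class name `EnvironmentAwareConfig` and the `key@environment` naming convention are assumptions made purely for illustration.

```java
import java.util.Map;

// Hypothetical sketch: resolves "key@environment" before falling back to "key".
final class EnvironmentAwareConfig {
    private final Map<String, String> values;
    private final String environment;

    EnvironmentAwareConfig(Map<String, String> values, String environment) {
        this.values = values;
        this.environment = environment;
    }

    // Returns the environment-specific value if one exists, otherwise the
    // shared value, otherwise null.
    String get(String key) {
        String specific = values.get(key + "@" + environment);
        return specific != null ? specific : values.get(key);
    }
}
```

With this convention the same set of values can be shipped to every environment, and only the environment name differs between development and production.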
I've been working on this problem with a colleague of mine and here's a small application we put together to demonstrate the possibilities to an internal audience at work. Bear in mind that we're just pulling together existing tools and not inventing our own solution. The only "custom" code here is our Cassandra configuration source and the basic Dropwizard app that runs the demo... so barely anything.
All we wanted to demonstrate is how easy it is to change configuration sources without modifying the application. We've had numerous discussions about how and where to manage our configuration but have not yet come to a final decision. This spike was intended to demonstrate that we can be building our applications correctly now to consume configuration from a yet-to-be-defined configuration system in the future.
There are many other issues to consider here, and I've not really done justice to the complexity of the problem. I hope this has highlighted some of the issues and potential solutions, in case you're facing something similar.
To close, here's a brief list of other things you may want to investigate when trying to solve this for your own projects:
- How do you support multiple versions of your configuration at the same time? Perhaps across different environments, or even different versions of the same app in the same environment (e.g. during a rolling release).
- How do you ensure that all instances of your app have the same configuration? Breakerbox, mentioned earlier, is a useful reference here: it displays the config synchronisation status of each node.
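For the last question, one possible approach (not necessarily how Breakerbox does it) is for each node to publish a fingerprint of its effective configuration, so that a dashboard or script can spot nodes that have drifted. The sketch below assumes configuration is a flat map of non-null strings; `ConfigFingerprint` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a stable SHA-256 fingerprint of a node's configuration.
final class ConfigFingerprint {
    private ConfigFingerprint() {}

    // Hashes the key/value pairs in sorted key order, so two nodes holding the
    // same configuration produce the same fingerprint regardless of load order.
    static String of(Map<String, String> config) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, String> entry : new TreeMap<>(config).entrySet()) {
                digest.update(entry.getKey().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0); // separator between key and value
                digest.update(entry.getValue().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0); // separator between entries
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

Comparing fingerprints across nodes only tells you that configurations differ, not how; a node that disagrees with its peers would still need its actual values inspected.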