What Makes a Monolith Monolithic?

Photo of Stonehenge, 1877


It seems like everybody throws around the term “monolith”, but what do we mean by that?

Sam Newman started the ball rolling yesterday with this tweet:

My first response was a (semi) joke:

I say semi joke because, in truth, semantics (i.e. meaning) is critical. The English language has a horrible tendency to overload terms as it is, and in our line of work we tend to make it even worse. Lack of specificity obscures, rather than enlightens. The problem with the term “monolith” is that, while it’s a powerfully evocative term, it isn’t a simple one to define. My second response was closer to an actual definition:

The purpose of this post is to expand on that a bit.

The “mono” portion of the term is, in my opinion, the crucial part. I believe that quality of oneness is what defines a monolithic system. As I noted in the second tweet, it’s a matter of meta-coupling, whether that coupling exists in the form of deployment, data architecture, or execution style (Jeppe Cramon‘s post “Microservices: It’s not (only) the size that matters, it’s (also) how you use them – part 3” shows how temporal coupling can turn a distributed system into a runtime monolith). The following tweets between Anne Currie and Sam illustrate the amorphous nature of what is and isn’t a monolith:

Modules that can be deployed to run in a single process need not be considered monolithic, if they’re not tightly coupled. Likewise, running distributed isn’t a guarantee against being monolithic if the components are tightly coupled in any way. The emphasis on “in any way” is due to the fact that any of the types of coupling I mentioned above can be a deal killer. If all the “microservices” must be deployed simultaneously for the system to work, it’s a distributed monolith. If the communication is both synchronous and fault intolerant, it’s a distributed monolith. If there’s a single data store backing the entire system, it’s a distributed monolith. It’s not the modularity that defines it (you can have a modular monolith), but the inability to separate the parts without damaging the whole system.

I would also point out that I don’t consider “monolithic” to be derogatory, in and of itself. There is a trade-off involved in terms of coupling and complexity (and cost). While I generally prefer more flexibility, there is always the danger of over-engineering. If we’re hand-carving marble gargoyles to stick on a tool shed, chances are the customer won’t be pleased. The solution should bear at least a passing resemblance to the problem context it’s supposed to address.


Plug and Play or Punt and Pray?


On August 1, 2012, Knight Capital Group had a very bad day, losing $440 million in forty-five minutes. More than two weeks later, there has been no official detailed explanation of what happened. Knight CEO Thomas Joyce has stated “Sadly it was a very simple breakdown — a very large breakdown — but a very simple breakdown…”, but exactly what that “simple breakdown” was remains unknown.

In the absence of facts, anonymous statements and speculation about the cause of the disaster have been rife. In a Dr. Dobb’s article, “Wall Street and the Mismanagement of Software”, Robert Dewar, president and CEO of AdaCore, blamed testing:

It’s clear that Knight’s software was deployed without adequate verification. With a deadline that could not be extended, Knight had to choose between two alternatives: delaying their new system until they had a high degree of confidence in its reliability (possibly resulting in a loss of business to competitors in the interim), or deploying an incompletely verified system and hoping that any bugs would be minor. They did not choose wisely.

Other articles have focused on deployment issues. According to an August 14 article on Businessweek, the problem stemmed from an “old set of computer software that was inadvertently reactivated when a new program was installed”. On August 3, Nanex, LLC published “The Knightmare Explained” with the tagline “The following theory fits all available facts”:

We believe Knight accidentally released the test software they used to verify that their new market making software functioned properly, into NYSE’s live system.

In the safety of Knight’s test laboratory, this test software (we’ll call it, the Tester) sends patterns of buy and sell orders to its new Retail Liquidity Provider (RLP) Market Making software, and the resulting mock executions are recorded. This is how they could ensure their new market making software worked properly before deploying to the NYSE live system.

When the time comes to deploy the new market making software, which is likely handled by a different group, the Tester is accidentally included in the release package and started on NYSE’s live system. On the morning of August 1st, the Tester is ready to do its job: test market making software. Except this time it’s no longer in the lab, it’s running on NYSE’s live system. And it’s about to test any market making software running, not just Knights. With real orders and real dollars. And it won’t tell anyone about it, because that’s not its function.

Last December, I posted “Do you have releases or escapes?”, discussing the importance of release management. In that post, I stated that excellent code poorly delivered is effectively poor code. A professional release management practice is essential to creating and maintaining quality systems.

Obviously there will be configuration differences between environments and these represent a risk that must be managed. However, failing to standardize the deployment of code is needlessly introducing a risk. An effective release management process should promote repeatable (preferably automated) deployments across all environments. Deployments should be seen as an opportunity to test this process, with the goal of ensuring that the release to production is thoroughly uneventful.

If Nanex’s assessment is correct, either Knight Capital failed to have one standard release process or their process allowed the test harness access to the real world. Either case would make the events of August 1 possible. Avoidable errors are bad enough; one that costs $10 million per minute is epic.

Do you have releases or escapes?

When thinking about improving the quality of a development process, the mind naturally heads in certain directions: requirements gathering and tracking, design, coding practices, and testing and quality assurance. All of these are vital components, but without a solid release management process, they can be insufficient.

Excellent code that is poorly delivered will be perceived as poor code. Faulty release management will even cause problems prior to go-live in that time spent correcting release issues will likely eat into that scheduled for testing efforts.

I won’t attempt to create the definitive work on release management and environments, but I will outline the system I helped create and have used over the last eleven years. It’s appropriate for development groups creating in-house applications, both internal and customer-facing. It works equally well with traditional desktop applications, smart clients, and web applications. It does not, however, encompass performance testing, which is outside the scope of this post. Performance testing will require its own dedicated environment that mirrors the production environment.

First and foremost is to understand that beyond the development environment, administrative access to both production environments and non-production environments should be restricted. Even if you don’t have a dedicated release management team, at least two people should have that role (one primary and a backup). Those performing the release management function should not be involved in coding.

Access restrictions should not be viewed as a matter of trust, but of accountability and control. Even as lead architect and manager of a development team, I lacked (and didn’t want) the ability make changes to environments under the control of the release management team. Aside from making the auditors happy, not having access to make changes outside of the process insulates you from accusations that you made changes outside the process. People sometimes forget to document changes, but if they lack the ability to make a change in the first place, then that consideration can be eliminated when troubleshooting a deployment.

The purpose of forcing all changes into a controlled framework is to promote repeatability. Automated build and deployment tools help in this regard as well. Each environment that a build must be promoted through provides another chance to get the deployment process perfect before go-live. The first environment should catch almost all possible deployment errors, with only configuration and/or data errors left for the succeeding environments.

The next step is to construct a set of environments around your development process and the number of versions you need to support. In our case, we only deal with two versions, so we have two pre-production environment branches: current, which is the same version as production and is used for any hotfixes that may be required, and future which hosts the release currently under development. To support our process, we have three to four (depending on the application) pre-production environments per branch as follows:

  • Development: Used for coding, this environment typically consists of a shared database server and the virtual machines on the developers workstations that are used for web/application servers. As noted above, coders have unrestricted access to all components of this environment. All changes to code and database objects must originate in this environment and be promoted through the succeeding ones in order to be deployed to production.
  •  DevTest: This is the first controlled access environment and is used for integration testing of code by the development staff (for all applications with more than one developer assigned, we use a “no one tests their own code” rule). In addition to allowing the development team the ability to shake down the build as a whole, it verifies that the deployment instructions are complete. As noted previously, developers have no administrative access to the servers and have only read access to the database(s). This ensures that only documented changes made via the release process take place.
  • Test: This environment is used for functional testing by the test staff. As with all controlled environments, developers have no administrative access to the servers and have only read access to the database(s). Since the deployment has been verified in the previous environment (with the exception of environment-specific configuration and data changes), the chance that testing will be delayed due to a bad release should be greatly minimized.
  • UAT/Training: This environment is optional, based on the application and the preferences of the business owner(s). For those applications that use it, it allows for User Acceptance Testing and/or training to take place without impacting any functional testing that may still be under way.

These environments should share the same hardware architecture as the production environment, but need not be exact clones. For example, an application that consists of two web farms (one internal, one in the DMZ) and a common database server can have its pre-production needs be adequately served by a single database server and ten (fourteen if you include UAT/Training) web servers. Ideally, the database server should run a separate instance for each of the six (or eight) environments, but as long as the database name(s) are configurable, then they could all be handled by a single instance if absolutely necessary. The environments would look as follows:

Current Future
Development database instance only database instance only
DevTest internal web server, external web server, database instance internal web server, external web server, database instance
Test internal web server, external web server, database instance two internal web servers, two external web servers, database instance
UAT/Training internal web server, external web server, database instance internal web server, external web server, database instance

If the production environment is load balanced, then that must be accounted for in at least one environment since it can lead to functional issues (losing web session state if the balancing isn’t set up properly is a classic one). My practice is to do so in the Test Future environment since the most comprehensive functional testing occurs there and it is on the branch where new functionality is introduced.

I would imagine that some might have choked on the 10-14 web servers. Remember, however, that absent conflicting dependencies, these environment can be shared across multiple applications and virtualization technology can drastically reduce the number of physical boxes needed. Cloud computing (infrastructure as a service) could also be used to reduce infrastructure costs significantly.

The last step is to make the process as smooth as possible. Practice makes perfect, automation makes it more so. Releases should be boring.