A Two-Phase Production Deployment Plan

If you’ve been doing software development for any reasonable length of time, especially for web applications, you’re likely familiar with the Local (or Development), Staging (or Test), and Production environments.

If not:

Local or Development refers to the machine on which you’re actually building the product.
Staging or Test refers to the server designed to represent Production, though it is only accessible by developers, testers, clients, and perhaps some of the end users who evaluate features prior to the official rollout.
Production is the live version of the site. No development occurs on this server.

Most developers who are in the business or closely working with their client follow this particular setup.

In the past couple of months, there have been a few times when a single Production rollout has fallen short, either revealing bugs that were not caught in Staging or shipping code that did not hold up under Production-level loads.

As frustrating as that can be, I’ve ended up using a sort of two-phase Production deployment plan to help mitigate this.

The Typical Setup

[Diagram: a typical setup for a Development / Staging / Production environment.]

The usual setup for anyone who’s working with Development, Staging, Production, and source control may look something like the diagram above.

That is, the workflow is likely to consist of the following:

All work is done on Development and is then committed to Source Control.
After a certain milestone is reached, the version of the code in Source Control is deployed to Test. At this point, code may or may not be tagged as a certain version. Personally, I tag these based on the milestone that’s being deployed.
Once Test has been approved, the code in Source Control is tagged with a version (say 1.0, 1.1, or 2.0) and then deployed to Production.

This isn’t really anything new, but following this process even for the smallest of projects can pay dividends in maintenance and in making sure that you’re meeting your client’s requirements properly.

On Staging and Replication

But there’s an assumption buried in this particular process: that Staging accurately represents what’s in Production.

To phrase it another way: Staging should represent Production as closely as possible; so much so, perhaps, that the user can’t even tell the difference between the two environments.

For small-to-mid level sites, this is no problem. Setting up a reasonable number of users and replicating the contents of the Production database isn’t terrible, but if you’re working on a site that gets at least 20,000 hits per day and has several thousand active users, that can become a bit more of a challenge.

Here’s how it tends to play out:

The customer has a set of requirements and expectations for what to add to the site.
You develop the functionality and deploy it to Staging, s/he signs off, and you roll it out to Production.
Users then begin to contact support about various problems that are showing up: some with the new feature itself, and some that are collateral damage of the new feature.

A Real World Example

If you’ve been working in this space long enough, you know that sometimes this stuff just happens. In fact, it happened in a recent project in which users were able to sign up for a site and, while selecting their username, were free to use any letters and numbers.

Through the use of custom rewrite rules, users were able to access their profile by navigating to /user/username.

The regular expression powering the rewrite rule didn’t accurately capture numbers at the end of usernames so some user profiles were redirecting to the next closest match.
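To make that class of bug concrete, here’s a minimal sketch in Python. The project’s actual rewrite rule isn’t shown in this post, so both patterns below are hypothetical stand-ins: one that only captures letters (and so silently drops trailing digits), and one that captures letters and numbers and anchors the end of the path.

```python
import re

# Hypothetical patterns: the real rewrite rule is not reproduced in this post.
# The buggy rule only captures letters, so it stops at the first digit.
BUGGY_PROFILE_RULE = re.compile(r"^/user/([A-Za-z]+)")
# The fixed rule captures letters and digits and must match the whole path.
FIXED_PROFILE_RULE = re.compile(r"^/user/([A-Za-z0-9]+)/?$")


def username_from_path(rule, path):
    """Return the username captured by the rule, or None if the path does not match."""
    match = rule.match(path)
    return match.group(1) if match else None


print(username_from_path(BUGGY_PROFILE_RULE, "/user/tom42"))  # tom
print(username_from_path(FIXED_PROFILE_RULE, "/user/tom42"))  # tom42
```

With the first pattern, a request for tom42 resolves to tom, which is the kind of “next closest match” behavior described above.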

This was one of those bugs that slipped by testing and QA and ended up making it to Production. Tracing it down to the regular expression was easy enough, but had this been a larger problem, the most reasonable thing to do would’ve been to roll back to the last known good version of the codebase.

Granted, that’s part of the reason that we have source control, but it is a bit time-consuming.

So what would it look like if we were to invert that process? That is, rather than rolling something out with the ability to roll back the entire codebase, what if we were simply able to turn new functionality on and off?

A Production Deployment Plan

Rather than launching the latest version of the code with the ability to roll back, what about the alternative of launching the latest version of the code with the ability to enable and/or disable new functionality?

This isn’t necessarily a new idea (in fact, I’ve used it in previous jobs), but it saves a lot of the time otherwise spent dealing with snapshots of code.

Here’s how it works:

When you begin working on a new feature, you put it behind a flag, and the site or application is able to detect which environment it’s running in. That is, it’s aware of running in Development, Staging, or Production.
The flag can exist in a text file, a database option, or a query string parameter.
The flag is committed to source control and deployed to each environment. By default, it’s on for Development and Staging but is off for Production.
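The steps above can be sketched roughly as follows. This is only one way to do it, under a few assumptions of mine: the environment name comes from an APP_ENV variable, the defaults live in code that’s committed to source control, and a small JSON file on each server (flags.json here) holds any overrides. The flag name new_profile_urls is made up for the example.

```python
import json
import os
from pathlib import Path

# Hypothetical defaults, committed to source control:
# on for Development and Staging, off for Production.
DEFAULT_FLAGS = {
    "development": {"new_profile_urls": True},
    "staging": {"new_profile_urls": True},
    "production": {"new_profile_urls": False},
}


def current_environment():
    """Read the environment name from APP_ENV (an assumed variable name)."""
    return os.environ.get("APP_ENV", "development").lower()


def load_overrides(path):
    """Load per-server overrides from a JSON file; a missing file means none."""
    file = Path(path)
    if not file.exists():
        return {}
    return json.loads(file.read_text())


def is_enabled(flag, environment=None, overrides_path="flags.json"):
    """Return True if the flag is on, preferring an override over the default."""
    environment = environment or current_environment()
    overrides = load_overrides(overrides_path)
    if flag in overrides:
        return bool(overrides[flag])
    return DEFAULT_FLAGS.get(environment, {}).get(flag, False)


def set_flag(flag, enabled, overrides_path="flags.json"):
    """Toggle a flag on this server by writing it to the overrides file."""
    overrides = load_overrides(overrides_path)
    overrides[flag] = enabled
    Path(overrides_path).write_text(json.dumps(overrides))
```

In the application, new code paths are wrapped in a check like is_enabled("new_profile_urls"). Turning the feature on in Production is then a call to set_flag("new_profile_urls", True) on that server, and turning it off again is the same call with False; no redeploy is involved.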

At this point, you and the client determine when to activate the new feature(s). You then simply toggle the flag to on (clear the cache if need be) and then watch as the new feature is being used.

If a bug arises, you simply toggle the flag again. No mad dash to the server to tweak the file. No taking the site down to revert to a previous version. Simply toggle the flag and clear the cache.

Much easier, isn’t it?

Using a flag also offers finer control: we could activate the feature for a subset of users, or for a particular type of user, which lets us diagnose issues in the Production environment while most users still aren’t able to see the feature.
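As a rough sketch of what “a subset of users or a type of user” could look like, the example below turns a flag on for certain roles and for a stable percentage of everyone else. The rule format, the staff role, and the 10% figure are all assumptions for illustration, not something taken from a particular project.

```python
import hashlib

# Hypothetical rollout rule: staff always see the feature,
# plus a stable 10% of all other users.
ROLLOUT = {"new_profile_urls": {"roles": {"staff"}, "percent": 10}}


def bucket(flag, user_id):
    """Map a user to a stable bucket from 0 to 99 for the given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled_for(flag, user_id, role):
    """Return True if this user should see the flagged feature."""
    rule = ROLLOUT.get(flag)
    if rule is None:
        return False  # unknown flags are treated as off
    if role in rule["roles"]:
        return True
    return bucket(flag, user_id) < rule["percent"]
```

Hashing the flag name together with the user ID means a given user lands in the same bucket on every request, so the feature doesn’t flicker on and off for them, and different flags don’t all pick the same 10% of users.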

A Word About Overhead

There’s one challenge that comes with doing this: Keeping track of multiple flags.

As a codebase grows, multiple flags are bound to be introduced, and they can quickly become waste if they aren’t managed properly as each new phase or milestone is developed.

The more waste that’s in the system, the more potential there is for something to go wrong when trying to clean up flags. As such, I like to try to keep the number of active flags to no more than two.

Once a feature has proven itself, I’ll remove the flag and then redeploy the codebase.
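One lightweight way to hold yourself to a limit like that is to keep a registry of active flags and check it during the build or test run. The sketch below assumes such a registry exists; the flag names and milestones in it are made up.

```python
MAX_ACTIVE_FLAGS = 2  # the self-imposed limit mentioned above

# Hypothetical registry: flag name -> the milestone that introduced it.
ACTIVE_FLAGS = {
    "new_profile_urls": "1.1",
    "bulk_invites": "1.2",
}


def check_flag_budget(flags, limit=MAX_ACTIVE_FLAGS):
    """Raise ValueError if more flags are active than the agreed limit."""
    if len(flags) > limit:
        names = ", ".join(sorted(flags))
        raise ValueError(f"{len(flags)} active flags ({names}); the limit is {limit}")


check_flag_budget(ACTIVE_FLAGS)  # passes: two flags are within the limit
```

Adding a third entry without first retiring one makes the check fail, which is a nudge to remove a proven flag before introducing the next.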

Anyway, all of this will look a bit different depending on your process, configuration, and clients, but the bottom line is that there is a stronger alternative to simply performing code rollbacks.

Though it’s not without its overhead, I’ve found that this has been much less of a headache than dealing with the alternative, and it results in greater peace of mind when doing rollouts, especially for large sites.

With that said, I’d love to hear what you guys have used in your own projects. Even if you don’t dig this idea, I’m always up for hearing what alternatives exist, so feel free to leave a comment.