Practical tips and strategies for leading technical teams

Managing surge capacity on consumer websites

This is a great cautionary tale of how NOT to manage “surge” events on consumer websites.

The Dartford Crossing is a toll bridge near London used by an average of 130,000 vehicles every day.

After crossing, drivers have to pay the toll online by midnight the following day to avoid a large penalty charge.

Normally the online toll charge system works smoothly.

However, after a major upgrade of the online payments website, the bridge operator made a big mistake.

They emailed more than one million registered users saying they needed to immediately update their payment details or be charged a penalty.

You can probably guess what happened next! 😆💥

There was a massive surge in visits and interactive sessions as a large share of those one million-plus users tried to update their payment details at the same time.

Inevitably, the website became unavailable. Holding pages are now being used to throttle demand and allow a steady trickle of users to update their details.
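For readers who have not built one, a holding page is essentially an admission queue in front of the site. The sketch below is a minimal illustration of that idea, not a description of how the bridge operator's system works. The `WaitingRoom` class, its method names and the `max_active` limit are all hypothetical.

```python
from collections import deque


class WaitingRoom:
    """Hypothetical holding-page logic: admit at most max_active users at once."""

    def __init__(self, max_active):
        self.max_active = max_active  # assumed safe number of concurrent sessions
        self.active = set()           # users currently allowed onto the site
        self.queue = deque()          # users parked on the holding page, in order

    def arrive(self, user_id):
        """Return True if the user may proceed, False if they must wait."""
        if user_id in self.active:
            return True
        # Only admit directly when there is room and nobody is already waiting.
        if len(self.active) < self.max_active and not self.queue:
            self.active.add(user_id)
            return True
        if user_id not in self.queue:
            self.queue.append(user_id)
        return False

    def leave(self, user_id):
        """Free a slot and admit waiting users first-come, first-served."""
        self.active.discard(user_id)
        admitted = []
        while self.queue and len(self.active) < self.max_active:
            next_user = self.queue.popleft()
            self.active.add(next_user)
            admitted.append(next_user)
        return admitted


room = WaitingRoom(max_active=2)
print(room.arrive("a"), room.arrive("b"), room.arrive("c"))
print(room.leave("a"))
```

The key design choice is that `max_active` comes from tested backend capacity rather than from a guess, so the queue protects the systems behind it instead of simply delaying the failure.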

The disruption is so bad that it’s made the headlines of all the major UK and even some European news outlets – and not in a good way…

During my time at Sky, one of the biggest consumer brands in the UK, we regularly managed similar “surge” events to our e-commerce site at sky.com.

To avoid the commercial and reputational damage that comes from failing to manage demand, we followed a critical four-point strategy:

1/ We established a dedicated non-functional testing (NFT) team of highly talented engineers who ran simulated user traffic to identify bottlenecks, particularly in backend databases. Those bottlenecks were addressed and the tests re-run in a continuous iterative cycle, each time increasing both the performance and the capacity of the target service.

2/ We used auto-scaling to add capacity as demand increased, making sure the scale always remained within the backend capacity determined by the NFT team. Sometimes we would disable lower-priority services to free up “surge” capacity.

3/ We controlled incoming demand through customer segmentation. Instead of emailing all customers, or launching ads on all platforms, we would segment the audience into smaller, more manageable volumes. This would “flatten the peaks” and ensure demand remained within the available capacity. (This is where the Dartford Crossing probably got it wrong!)

4/ We actively monitored key SLIs/SLOs on a live conference bridge with key technical representatives, enabling an almost immediate response to any metrics trending toward a critical level. Waiting for an alert to trigger before opening a bridge was simply too slow.
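Points 2 and 3 come down to simple arithmetic, which the sketch below illustrates. It is a rough planning aid under stated assumptions, not the tooling we used at Sky. The function names, the response rate, the hourly capacity and the headroom factor are all hypothetical inputs you would replace with your own test results.

```python
import math


def plan_waves(recipients, expected_response_rate, capacity_per_hour, headroom=0.5):
    """Size email waves so expected visits per wave stay within tested capacity.

    Assumptions (all hypothetical): one wave is sent per hour, a fixed
    fraction of recipients responds within that hour, and only `headroom`
    of the tested capacity is planned for, leaving the rest as a buffer.
    Returns (max_recipients_per_wave, number_of_waves).
    """
    if recipients < 1:
        raise ValueError("recipients must be at least 1")
    if not 0 < expected_response_rate <= 1:
        raise ValueError("expected_response_rate must be in (0, 1]")
    if capacity_per_hour <= 0 or not 0 < headroom <= 1:
        raise ValueError("capacity must be positive and headroom in (0, 1]")

    usable_sessions = capacity_per_hour * headroom
    max_per_wave = max(1, math.floor(usable_sessions / expected_response_rate))
    waves = math.ceil(recipients / max_per_wave)
    return max_per_wave, waves


def capped_instances(desired, backend_limit):
    """Never scale the front end beyond what the backend was tested to support."""
    return min(desired, backend_limit)


# Illustrative numbers only: 1,000,000 recipients, 25% expected to respond
# within the hour, 20,000 sessions per hour tested, planning for half of that.
print(plan_waves(1_000_000, 0.25, 20_000, headroom=0.5))
print(capped_instances(desired=12, backend_limit=8))
```

With these made-up inputs the campaign splits into 25 waves of 40,000 recipients each, instead of a single send to the whole list.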

Using these strategies, we successfully fulfilled huge demand and achieved record levels of commercial success during events such as Black Friday, iPhone and Samsung launches, flash sales and even the “Martin Lewis effect” (I’ll save that one for another post!).

Follow this strategy and you will also be successful in running consumer website surge events.

About

If you are a C-level executive, people manager or senior engineer in a technology company then this blog is for you!

Packed full of practical knowledge and tools, you will learn how to create powerful teams of engineers who feel engaged and motivated to do their best work every day.

Written by John Swarbrick, drawing on his personal experience leading globally distributed technology teams at Cisco, Sky and high-growth startups.