Managing surge capacity on consumer websites

The Dartford Crossing is a toll bridge near London used by an average of 130,000 vehicles every day.

After crossing, drivers have to pay the toll online by midnight the following day to avoid a hefty penalty.

Normally the online toll charge system works smoothly.

However, after a major upgrade of the online payments website, the bridge operator made a big mistake.

They emailed more than one million registered users saying they needed to immediately update their payment details or be charged a penalty.

You can probably guess what happened next! 😆💥

There was a massive surge in visits and interactive sessions as more than 1,000,000 people all tried to update their payment details.

Inevitably, the website became unavailable, and holding pages are now being used to throttle demand and allow a steady trickle of users to update their details.

The disruption is so bad that it’s made the headlines of all the major UK and even some European news outlets – and not in a good way…

During my time at Sky, one of the biggest consumer brands in the UK, we regularly managed similar “surge” events to our e-commerce site at sky.com.

To avoid commercial and reputational damage from a failure to manage demand, we followed a critical four-point strategy:

1/ We established a dedicated, non-functional test (NFT) team of highly talented engineers who would run simulated user traffic to identify bottlenecks, particularly with backend databases. Those bottlenecks would be addressed, then the tests re-run on a constant iterative cycle, every time increasing both the performance and capacity of the target service.

2/ We used auto-scaling to add capacity as demand increased, making sure the scale always remained within the backend capacity determined by the NFT team. Sometimes we would disable lower-priority services to free up “surge” capacity.

3/ We controlled incoming demand through customer segmentation. Instead of emailing all customers, or launching ads on all platforms, we would segment the audience into smaller, more manageable volumes. This would “flatten the peaks” and ensure demand remained within the available capacity. (This is where the Dartford Crossing probably got it wrong!)

4/ We actively monitored key SLIs/SLOs on a live conference bridge with key technical representatives, enabling an almost immediate response to any metrics trending toward a critical level. Waiting for an alert to trigger before opening a bridge was simply too slow.
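Points 2 and 3 can be sketched in a few lines of Python. This is purely illustrative: the function names, the response rate and every capacity figure below are hypothetical assumptions, and real values would come from your own load tests. The first function caps auto-scaling at what the backend can absorb; the second splits a mailing into batches so the expected visitors per batch stay within a safe limit.

```python
import math


def capped_instances(demand_rps: float, per_instance_rps: float,
                     backend_max_rps: float, min_instances: int = 2) -> int:
    """Instances to run for the current demand.

    Scales up with demand, but never beyond the ceiling the backend
    can serve (backend_max_rps is assumed to come from load testing).
    """
    wanted = math.ceil(demand_rps / per_instance_rps)
    ceiling = math.floor(backend_max_rps / per_instance_rps)
    # Always keep at least min_instances running, even at low demand.
    return max(min_instances, min(wanted, ceiling))


def plan_batches(total_recipients: int, response_rate: float,
                 safe_visitors_per_batch: int) -> list[int]:
    """Split a mailing into batch sizes that "flatten the peaks".

    response_rate is the assumed fraction of recipients who visit the
    site soon after receiving the email.
    """
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    batch_size = math.floor(safe_visitors_per_batch / response_rate)
    if batch_size < 1:
        raise ValueError("safe_visitors_per_batch is too small")
    batches = []
    remaining = total_recipients
    while remaining > 0:
        size = min(batch_size, remaining)
        batches.append(size)
        remaining -= size
    return batches


if __name__ == "__main__":
    # Hypothetical figures: 1,000,000 recipients, 25% respond quickly,
    # and the site can safely absorb 25,000 visitors per batch.
    batches = plan_batches(1_000_000, 0.25, 25_000)
    print(len(batches), "batches, largest:", max(batches))
    # Demand of 5,000 rps wants 25 instances, but the backend caps it at 20.
    print(capped_instances(5_000, 200, 4_000))
```

With these assumed figures, one email to a million users becomes ten sends of 100,000, each spaced far enough apart for the previous wave to clear.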

Using these strategies, we successfully fulfilled huge demand and achieved record levels of commercial success during events such as Black Friday, iPhone and Samsung launches, flash sales and even the “Martin Lewis effect” (I’ll save that one for another post!).

Follow this strategy and you will also be successful in running consumer website surge events.
