Managing through a website outage and coming out on top

By John Miecielica, Director of Product Management for TeamQuest Corporation. 


As those of us in the IT industry know, things rarely go as planned. IT managers can try to plan, but more often than not incidents occur that cause downtime, and the resulting costs can hurt a company’s bottom line if not remedied immediately. Case in point: Target’s recent launch of the Lilly Pulitzer line.

Less than an hour after the Pulitzer line became available at Target.com, a deluge of online shoppers bombarded the site and the servers crashed. A mammoth sales event the company was counting on was blown apart because the very system designed to make it a success failed to perform under the unexpected onslaught.

Two questions need to be answered: why did Target’s site crash, and what can we learn about preventing such crashes in the future?

The answer is pretty straightforward. Providing a consistently pleasant online shopping experience during a high-profile launch takes planning. That planning requires IT and other business units to collaborate. IT needs to understand, from an enterprise-wide view, the demand that such a launch will generate so that it can optimize the systems needed to support it.

However simple all this sounds, putting it into practice is complicated. For starters, plans are only as good as the information going into them. Target expected the Lilly Pulitzer stock to last weeks, not hours. Because demand was far greater than anticipated, IT was under-provisioned and unprepared. It was a classic case of putting out fires instead of being prepared.

The fact is, IT organizations spend the bulk of their time fighting issues as they arise and not enough time planning ahead. A study of global IT managers showed that nearly half (49%) believe improper capacity planning is to blame for outages. If IT managers had the proper planning tools, companies like Target would see fewer of these outages because they would be better equipped to handle major traffic to their sites and to plan for the unexpected.

It’s that proverbial rock in the shoe. If you take a minute to remove the rock, you can walk much farther much faster and focus attention elsewhere along the way. IT organizations that put tools and processes in place to be more proactive can do more at a faster rate in order to provide greater value to the business. Some IT organizations have tools in place to gather data, but they lack mature processes to make sense of it. Mounds of data are useless until analyzed and translated into actionable information.

Fortunately for Target, they generated huge sales and sold out the entire collection in hours despite their website crashing. That’s not always the case. Others have fallen prey to capacity issues such as this, losing revenue, reputation and customers to their competition.

Understanding Agility and Risk
IT organizations are under the gun to operate efficiently and quickly in today’s fast-changing environments, which are increasingly difficult to manage. Every day it gets harder to keep pace with shifts in technology and with the successful business campaigns that drive buyers to stores and websites.

Today’s information-driven enterprise is more reliant on IT than ever. IT has to work consistently at optimizing its systems so that all business units can do the same. Ultimately, the primary interface with customers runs through IT, and IT pros are increasingly being called upon to answer to stakeholders and board members. Business is conducted directly via the Internet and websites, which fall under the IT department’s domain. A company’s reputation depends on the reliability of those interfaces.

A 2013 study by Ponemon of 563 US-based companies puts the average cost per outage incident at $690,204. That figure includes damage to organizational reputation, but doesn’t account for the potential personal damage to the careers of the men and women responsible for the IT operations.

Best Practices for Optimizing IT
Start by understanding your workloads and their resource demands. Once that analysis is complete, create a configuration larger than the analysis indicates, with some spare systems, to ensure you have sufficient resources to meet demand. Afterwards, go through the steps necessary to make the systems more efficient.

Collect Accurate Data
To better understand how to avoid these IT outages, IT managers must be able to collect data and analyze the current situation in terms of how many resources are being used to meet service levels. It is also important to be able to predict what demands will be placed on the system in the coming months, or in the coming days in the Target example. Ideally, this information comes from the business. Absent detailed business input, history is a good indicator of future demands for many systems. Measure resource demands over the past 6-12 months and feed those measurements into a queuing model to predict where and when a failure may occur. Leveraging all of this data will enable you to find the least expensive configuration needed to meet service levels.
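As a rough illustration of feeding measured history into a queuing model, the sketch below fits a straight-line trend to monthly peak arrival rates and asks a single-server M/M/1 model when the projected response time would first exceed a service-level target. The M/M/1 model, the linear trend, the function names, and every number here are assumptions chosen for illustration; they are not taken from Target's systems or from any particular capacity-planning product.

```python
# Illustrative sketch only: M/M/1 model, linear trend, and all numbers are assumptions.

def linear_trend(values):
    """Least-squares slope and intercept for values observed at x = 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mm1_response_time(arrival_rate, service_time):
    """Mean response time of an M/M/1 queue; infinite once the server saturates."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")
    return service_time / (1.0 - utilization)


def first_breach_month(monthly_rates, service_time, sla_seconds, horizon=24):
    """Months ahead at which projected response time first exceeds the SLA, else None."""
    slope, intercept = linear_trend(monthly_rates)
    last_month = len(monthly_rates) - 1
    for ahead in range(1, horizon + 1):
        projected_rate = slope * (last_month + ahead) + intercept
        if mm1_response_time(projected_rate, service_time) > sla_seconds:
            return ahead
    return None


# Hypothetical history: 12 months of peak-hour arrival rates (transactions per second).
history = [40 + 2 * month for month in range(12)]
months_left = first_breach_month(history, service_time=0.01, sla_seconds=0.045)
print("Months until the service level is at risk:", months_left)
```

A real system would call for a richer model (multiple resources, non-linear growth, seasonal peaks), but the shape of the exercise is the same: measured demand goes in, and a predicted point of failure comes out.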

History on Your Side
Think about an application whose resource data is tracked at 1-5 minute intervals. This data is aggregated to 1-hour intervals and kept for 13 months. The business forecast tells you that a large new customer is coming on board and is expected to double the volume of transactions on the system. You could take the historical system growth (determined by analyzing the past 13 months of resource consumption), input that as a base growth rate into your model, and then double the workload in the model per the business forecast. If the model indicates you will have an issue as the new customer is being added to the system, add the necessary IT resources (CPU, memory, additional virtual machines, additional cloud resources, etc.) to the model and re-evaluate the results. Do this iteratively until you get the desired results from the model in the least expensive configuration.
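The iterative loop described here can be sketched in a few lines. The example below is a simplification under stated assumptions: load is split evenly across identical servers, each server behaves like an M/M/1 queue, growth compounds monthly, and cost is proportional to server count, so the least expensive configuration is simply the smallest one that meets the target. The function name, parameters, and figures are hypothetical.

```python
# Illustrative sketch only: identical servers, even load split, M/M/1 per server.

def servers_needed(current_rate, monthly_growth, months_ahead, workload_multiplier,
                   service_time, target_seconds, max_servers=1000):
    """Smallest server count whose modeled response time meets the target."""
    # Base growth comes from history; the multiplier comes from the business forecast.
    future_rate = (current_rate * (1 + monthly_growth) ** months_ahead
                   * workload_multiplier)
    for servers in range(1, max_servers + 1):
        utilization = (future_rate / servers) * service_time
        if utilization >= 1.0:
            continue  # saturated: add resources to the model and re-evaluate
        response_time = service_time / (1.0 - utilization)
        if response_time <= target_seconds:
            return servers
    raise ValueError("target not reachable within max_servers")


# Hypothetical inputs: 60 transactions/sec today, 2% monthly growth from history,
# a six-month horizon, and a new customer that doubles the workload.
baseline = servers_needed(60, 0.02, 6, 1.0, service_time=0.01, target_seconds=0.05)
with_new_customer = servers_needed(60, 0.02, 6, 2.0, service_time=0.01, target_seconds=0.05)
print("Servers without the new customer:", baseline)
print("Servers with the new customer:", with_new_customer)
```

Stopping at the first configuration that meets the target is what keeps the answer inexpensive; in practice you would add headroom on top of it, in line with the spare capacity recommended earlier.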

IT has to know, at the business level, how infrastructure affects the business, and has to put that knowledge into practice. Business risk decreases when IT helps the business make the right decisions based on accurate data. Those decisions can mean that more customers are able to purchase what they want, when they want it.

 
