Prepare for ‘partial failures’ of IT infrastructure like Visa outage

Visa’s letter to the Treasury Select Committee, which details the cause of the recent outage that left millions of people unable to complete card transactions, highlights a critical challenge that organisations face when exposed to a ‘partial failure’ of IT infrastructure. This is according to Peter Groucutt, managing director of Databarracks.

This week, Visa revealed that a ‘rare defect’ in a switch caused a partial failure in its primary UK data centre. The issue delayed its secondary data centre from taking over responsibility for handling all card transactions, and was the root cause of millions of failed card transactions over 10 hours on Friday 1st June 2018.

 

In the wake of the outage, the Committee contacted the payments firm, seeking clarification of the cause and assurances about what action Visa is taking to prevent a repeat. Groucutt says a number of lessons can be learned from the findings:

 

“Businesses are often better prepared for a complete outage than for ‘partial failures’. When a system fails completely, the process to fail over is more clearly defined, whether it is a manual action or an automatic process. Partial failures, however, make that change-over difficult. Once the problem has been identified, you have to decide whether to switch fully to the secondary system or fix the problem on the primary. The point at which to fail over is specific to each organisation and to the issue it is dealing with.

 

“A switch issue, for instance, will require a different response from a natural disaster. An organisation with good Incident and Crisis Management processes will have planned for both: decisions will already have been made and documented, so in the event of an incident, the business knows exactly what to do.

 

“In practice, a business might decide that it can’t tolerate an outage longer than four hours. If it takes two hours to become fully operational at a second site, that leaves a window of just two hours to fix the issue before committing to fail-over.”
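The arithmetic in that example can be written down as a simple rule. The sketch below is illustrative only: the function names `repair_window` and `must_fail_over` are hypothetical, and the four-hour and two-hour figures are the ones from the example above, not values any particular organisation uses.

```python
from datetime import timedelta


def repair_window(max_tolerable_outage: timedelta,
                  failover_duration: timedelta) -> timedelta:
    """Time available to attempt a fix on the primary before fail-over must begin."""
    window = max_tolerable_outage - failover_duration
    # If failing over takes longer than the tolerable outage, there is no window at all.
    return max(window, timedelta(0))


def must_fail_over(elapsed: timedelta,
                   max_tolerable_outage: timedelta,
                   failover_duration: timedelta) -> bool:
    """True once the time spent on repair has used up the whole repair window."""
    return elapsed >= repair_window(max_tolerable_outage, failover_duration)


# Figures from the example: four-hour tolerance, two hours to bring up the second site.
tolerance = timedelta(hours=4)
failover = timedelta(hours=2)

print(repair_window(tolerance, failover))                           # 2:00:00
print(must_fail_over(timedelta(minutes=90), tolerance, failover))   # False
print(must_fail_over(timedelta(hours=2), tolerance, failover))      # True
```

In a real plan the two inputs would come from the organisation's own documented recovery objectives and from measured fail-over tests rather than from fixed constants.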

 

Groucutt continues: “We would expect Visa to have a very mature incident management process in place and, based on the reports, that was absolutely the case. Partial failures can be very difficult to plan for and manage, but the issue was identified and response protocols were initiated.”

 

Groucutt concludes: “The lesson Visa can take from the incident is that it wasn’t prepared for this particular partial failure, and it should address this by building new processes to allow the backup switch to take over. We can all do the same.

 

“It is a good idea to include issues like this in your testing. It’s not just switches: we’ve seen exactly this issue with UPS systems and generators too. An organisation will have a testing schedule for each of these technologies, so it’s important to include the impact of partial failures in those tests. A business should think about how quickly it can identify what the issue is and, importantly, the actions which then need to be taken either to fix the problem and recover or to manually take the system offline and fail over to a secondary site.”

 
