Common Causes of Outages
15 August 2024
Power remains the leading cause of outages
Donnellan & Lawrence (2024) and Davis et al. (2022) both find that on-site power problems are the leading cause of category 4 and 5 outages in data centres, the "serious" and "severe" levels of the Uptime Institute Outage Severity Rating (Uptime Institute, n.d.). These outages are often catastrophic, causing severe disruption and financial loss. Three other causes also recur frequently: cooling system failures, software/IT system errors, and network problems.
On average, 10 to 20 major outages occur each year (Yadav, 2024), resulting in customer disruptions, loss of business revenue, and reputational damage. In extreme cases, these outages can even lead to loss of life, especially in sectors where reliability is critical, such as healthcare and emergency services.
Figure 1. Primary Cause of Outage
(Source: Davis et al., 2022)
The growing impact of power-related outages
A more in-depth analysis by Lawrence & Simon (2023) shows that the most common cause of power-related outages is the failure of uninterruptible power supply (UPS) systems. A UPS is designed to provide emergency power to a load when the input power source fails. When the UPS itself fails, however, it becomes a single point of failure that can take an entire data centre offline.
Less commonly, transfer switch (generator/grid) failures and generator failures are identified as the cause. These components manage the transition between primary and backup power sources. Their failures often arise from inadequate maintenance, improper testing, or component wear and tear.
Although Uptime does not typically identify utility grid failures as a primary cause of outages, the slight rise in power-related failures in recent years may be linked to declining grid reliability. This decline leaves data centres more exposed and highlights the importance of robust power management systems. As the grid becomes less reliable, backup power systems are called on more often, which reveals inadequate maintenance practices and insufficient training at some sites.
Figure 2. Most Common Causes of Power Related Outages
(Source: Lawrence & Simon, 2023)
Cooling system failures: a silent threat
Cooling systems are another significant source of outages. Data centres generate a substantial amount of heat, and efficient cooling is essential to prevent critical components from overheating. A cooling failure can quickly cause equipment to overheat and shut down, resulting in an unplanned outage. Several factors contribute to cooling failures:
- Mechanical Failures: Malfunctioning compressors, pumps, and fans can halt cooling operations.
- Control System Errors: Faulty sensors and control algorithms can result in inadequate cooling or overcooling, stressing systems.
- Human Error: Poorly trained staff may mismanage cooling systems, leading to potential failures.
Ensuring regular maintenance and investing in advanced cooling technologies can mitigate these risks, safeguarding data centre operations.
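The faulty-sensor risk described above can be illustrated with a small sketch. It compares redundant inlet temperature readings against their median, so that one bad sensor cannot by itself trigger or suppress a cooling response. The function name, the sensor names, and both thresholds (27.0 °C and 5.0 °C) are hypothetical values chosen for illustration, not figures from the cited reports.

```python
from statistics import median


def check_inlet_temperatures(readings_c, max_inlet_c=27.0, max_deviation_c=5.0):
    """Return (consensus temperature, suspect sensor names, overheating flag).

    readings_c maps a sensor name to its reading in degrees Celsius.
    Both thresholds are illustrative assumptions, not vendor limits.
    """
    if not readings_c:
        raise ValueError("at least one sensor reading is required")
    # The median is robust to a single wildly wrong reading.
    consensus = median(readings_c.values())
    # A sensor far from the consensus is more likely faulty than correct.
    suspects = sorted(
        name
        for name, value in readings_c.items()
        if abs(value - consensus) > max_deviation_c
    )
    return consensus, suspects, consensus > max_inlet_c


if __name__ == "__main__":
    sample = {"rack-a1": 24.1, "rack-a2": 24.6, "rack-a3": 61.0}
    print(check_inlet_temperatures(sample))
```

In this example the third sensor is flagged as suspect while the consensus stays below the alarm threshold, so staff can investigate the sensor rather than the cooling plant.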
Software and IT system errors
Software and IT system errors are another leading cause of outages. These issues arise from:
- Configuration Mistakes: Misconfigurations during updates or installations can create vulnerabilities.
- Software Bugs: Undetected bugs in applications or systems can lead to unexpected failures.
- Cybersecurity Breaches: Attacks exploiting system vulnerabilities can disrupt services and cause outages.
The complexity of modern IT systems demands vigilant monitoring and robust patch management strategies. Implementing best practices for change management and cybersecurity can significantly reduce the likelihood of such errors.
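One common change management practice is to compare the configuration a system is supposed to have with the one it actually has. The sketch below shows the idea under simple assumptions: both configurations are flat dictionaries with string keys, and the setting names are hypothetical examples rather than settings from any particular product.

```python
MISSING = "<missing>"  # marker for a key that is absent on one side


def find_config_drift(desired, actual):
    """Map each drifted key to a (desired value, actual value) pair."""
    drift = {}
    # Walk the union of keys so that extra and missing settings both show up.
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # Hypothetical settings, for illustration only.
    desired = {"ntp_server": "10.0.0.1", "ssh_root_login": "no", "tls_min_version": "1.2"}
    actual = {"ntp_server": "10.0.0.1", "ssh_root_login": "yes", "debug": "on"}
    for key, (want, have) in find_config_drift(desired, actual).items():
        print(f"{key}: expected {want}, found {have}")
```

Running a check like this before and after each change gives a simple record of what was altered and flags settings that were changed outside the approved process.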
Network problems: the connectivity challenge
Network issues continue to pose significant challenges, often resulting from:
- Hardware Failures: Malfunctioning routers, switches, or cables can disrupt communication.
- Congestion and Latency: Overloaded networks can lead to degraded performance and outages.
- DDoS Attacks: Distributed Denial of Service attacks can overwhelm network infrastructure, leading to service interruptions.
To combat network problems, data centres must invest in redundant network architectures, robust monitoring tools, and comprehensive security measures.
Strategies to mitigate outage risks
Addressing the root causes of outages requires a multifaceted approach. Here are some key strategies:
Enhancing power infrastructure
- Regular Maintenance: Conduct routine inspections and maintenance of power systems to ensure optimal performance.
- Redundancy Planning: Implement redundant power paths and backup systems to minimise single points of failure.
- Advanced Monitoring: Use real-time monitoring systems to detect anomalies and address issues promptly.
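To make the redundancy point concrete, the sketch below estimates how a second power path changes expected downtime. It assumes that paths fail independently and that each path has a hypothetical availability of 99.9%. Both are illustrative assumptions, not figures from the cited reports, and real installations share common-mode risks (shared switchgear, maintenance errors) that this simple model ignores.

```python
HOURS_PER_YEAR = 8760


def parallel_availability(unit_availability, units):
    """Availability of `units` independent paths where any one can carry the load."""
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if units < 1:
        raise ValueError("units must be at least 1")
    # The system is down only when every path is down at the same time.
    return 1.0 - (1.0 - unit_availability) ** units


def expected_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for paths in (1, 2):
        a = parallel_availability(0.999, paths)  # 0.999 is an assumed value
        print(f"{paths} path(s): availability {a:.6f}, "
              f"downtime {expected_downtime_hours(a):.4f} h/year")
```

Under these assumptions a single path is expected to be down for about 8.76 hours a year, while two independent paths reduce that to roughly half a minute. The gap shrinks quickly once shared components are taken into account, which is why redundancy planning focuses on removing single points of failure rather than only adding equipment.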
Improving cooling systems
- Preventive Maintenance: Regularly maintain cooling equipment to prevent unexpected failures.
- Efficient Design: Optimise data centre layout to improve airflow and cooling efficiency.
- Adaptive Cooling Technologies: Invest in technologies like liquid cooling and AI-driven climate control for improved performance.
Strengthening IT and network systems
- Configuration Management: Employ automated tools for accurate configuration and change management.
- Cybersecurity Protocols: Implement robust cybersecurity measures to safeguard against attacks.
- Network Redundancy: Design networks with redundant paths to ensure continuous connectivity.
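Whether a network really has redundant paths can be checked mechanically. The sketch below removes each device in turn and tests whether the remaining devices can still reach one another. The device names and the topology are hypothetical, and the check assumes the full topology is connected to begin with. It uses a plain breadth-first search for clarity; for large networks, a linear-time articulation-point algorithm would be a better fit.

```python
from collections import deque


def reachable(adjacency, start, removed):
    """Breadth-first search from `start` that ignores the `removed` node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def single_points_of_failure(links):
    """Return the nodes whose loss disconnects the rest of the topology."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    critical = []
    for candidate in adjacency:
        remaining = [node for node in adjacency if node != candidate]
        if not remaining:
            continue
        # If some remaining node cannot be reached, the candidate is critical.
        if len(reachable(adjacency, remaining[0], candidate)) < len(remaining):
            critical.append(candidate)
    return sorted(critical)


if __name__ == "__main__":
    # Hypothetical topology with one aggregation switch.
    links = [("core-1", "agg-1"), ("core-2", "agg-1"),
             ("agg-1", "tor-1"), ("agg-1", "tor-2")]
    print(single_points_of_failure(links))
```

In the example topology every path passes through the single aggregation switch, so it is reported as a single point of failure. Adding a second aggregation switch linked to the same core and top-of-rack devices makes the list empty.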
Emerging trends in outage mitigation
As the data centre industry evolves, new trends and technologies are emerging to mitigate outage risks. These include:
Edge computing
Edge computing is gaining popularity as a means to decentralise data processing. By processing data closer to its source, edge computing reduces the reliance on centralised data centres, minimising the impact of potential outages.
AI and machine learning
Artificial intelligence and machine learning are being leveraged to predict and prevent outages. These technologies analyse vast amounts of data to identify patterns and predict potential failures before they occur, enabling proactive maintenance and intervention.
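As a much simpler stand-in for the machine learning models described above, the sketch below flags telemetry samples that deviate sharply from their recent history using a rolling z-score. This is a basic statistical technique rather than a trained model, and the window size, threshold, and sample readings are hypothetical values chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_alerts(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate sharply from the preceding window."""
    history = deque(maxlen=window)
    alerts = []
    for index, value in enumerate(samples):
        # Only score a sample once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A perfectly flat history has no spread to measure against.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                alerts.append(index)
        history.append(value)
    return alerts


if __name__ == "__main__":
    # Hypothetical temperature readings in degrees Celsius, one per interval.
    readings = [54.0, 54.2, 53.9, 54.1, 54.0, 54.1, 53.8, 60.5, 54.0]
    print(rolling_zscore_alerts(readings))
```

Here only the sudden jump to 60.5 is flagged. One limitation of this approach is that an outlier enters the history window and temporarily widens the spread, which can mask a second anomaly that follows shortly afterwards. Production systems typically use more robust statistics or trained models for this reason.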
Sustainable energy solutions
Adopting sustainable energy solutions reduces the dependency on traditional power sources. By integrating renewable energy and energy storage systems, data centres can enhance their resilience and reduce the risk of power-related outages.
Conclusion
Power-related outages remain a significant challenge for the data centre industry, but understanding their root causes and implementing targeted measures can reduce the risk considerably. By focusing on enhancing power infrastructure, improving cooling systems, and strengthening IT and network systems, data centres can reduce the likelihood of outages and enhance operational resilience.
As the industry continues to expand, leveraging emerging technologies and sustainable solutions will be crucial in ensuring reliability and minimising the impact of outages. For a detailed exploration of common causes of outages and mitigation strategies, visit our in-depth blog on The Cost of Downtime: Beyond Lost Revenue.
Ensuring your power systems are always ready to support your operations is essential in today’s fast-paced business environment. Contact us to learn how to keep your power systems in tip-top condition so they are ready whenever needed.
References
Davis, J., Bizo, D., Lawrence, A., Rogers, O., & Smolaks, M. (2022). Uptime Institute Global Data Center Survey 2022. Retrieved from Uptime Institute: https://uptimeinstitute.com/uptime_assets/6768eca6a75d792c8eeede827d76de0d0380dee6b5ced20fde45787dd3688bfe-2022-data-center-industry-survey-en.pdf
Donnellan, D., & Lawrence, A. (2024). Annual outage analysis 2024. Retrieved from Uptime Institute: https://intelligence.uptimeinstitute.com/resource/annual-outage-analysis-2024
Donnellan, D., Bizo, D., Davis, J., Lawrence, A., Rogers, O., Simon, L., & Smolaks, M. (2023). Uptime Institute Global Data Center Survey 2023. Retrieved from Uptime Institute: https://uptimeinstitute.com/uptime_assets/74fd7ed906aad2b6df2a96dfeb803dde83d52ee3dffdd8ae41a50fab4e23182f-uptime-institute_global-data-center-survey-2023_executive-summary.pdf
Flower, D. (2024). The True Cost Of Downtime (And How To Avoid It). Retrieved from Forbes: https://www.forbes.com/sites/forbestechcouncil/2024/04/10/the-true-cost-of-downtime-and-how-to-avoid-it/
Lawrence, A., & Simon, L. (2023). Annual outage analysis 2023. Retrieved from Uptime Institute: https://uptimeinstitute.com/uptime_assets/5f40588be8d57272f91e4526dc8f821521950b7bec7148f815b6612651d5a9b3-annual-outages-analysis-2023.pdf
STCLab, Inc. (2023). Here’s how much downtime is really costing your business. Retrieved from Medium: https://medium.com/stclab-tech-blog/heres-how-much-downtime-is-really-costing-your-business-1ee6d2667287
Trueman, C. (2024). TSMC evacuates fabs and suspends construction in earthquake aftermath; confirms all personnel are safe. Retrieved from Data Center Dynamics: https://www.datacenterdynamics.com/en/news/tsmc-evacuates-fabs-and-suspends-construction-in-earthquake-aftermath-confirms-all-personnel-are-safe/
Uptime Institute. (n.d.). Uptime Institute Outage Severity Rating. Retrieved from Uptime Institute: https://uptimeinstitute.com/resources/tools/outage-severity-rating
Yadav, N. (2024). Uptime: Frequency and severity of data center outages on the decline. Retrieved from Data Center Dynamics: https://www.datacenterdynamics.com/en/news/uptime-frequency-and-severity-of-data-center-outages-on-the-decline/