Towards a model to estimate the reliability of large-scale hybrid supercomputers

 

Bibliographic details
Authors: Rojas, Elvis; Meneses, Esteban; Jones, Terry; Maxwell, Don
Format: conference paper
Publication date: 2020
Description: Supercomputers stand as a fundamental tool for developing our understanding of the universe. State-of-the-art scientific simulations, big data analyses, and machine learning executions require high performance computing platforms. Such infrastructures have been growing lately with the addition of thousands of newly designed components, calling their resiliency into question. It is crucial to solidify our knowledge on the way supercomputers fail. Other recent studies have highlighted the importance of characterizing failures on supercomputers. This paper aims at modelling component failures of a supercomputer based on Mixed Weibull distributions. The model is built using a real-life multiyear failure record from a leadership-class supercomputer. Using several key observations from the data, we designed an analytical model that is robust enough to represent each of the main components of supercomputers, yet flexible enough to alter the composition of the machine and predict the resilience of future or hypothetical systems.
Country: Repositorio UNA
Institution: Universidad Nacional de Costa Rica
Repository: Repositorio UNA
Language: English
OAI Identifier:oai:https://repositorio.una.ac.cr:11056/26728
Online access: http://hdl.handle.net/11056/26728
Access Level: open access
Keywords: RESILIENCIA
MODELADO DE FALLOS
ANÁLISIS DE FALLOS
TOLERANCIA A FALLOS
FAULT TOLERANCE
RESILIENCE
FAILURE ANALYSIS
FAILURE MODELLING