Understanding failures through the lifetime of a top-level supercomputer

Bibliographic Details
Authors:	Rojas, Elvis; Meneses, Esteban; Jones, Terry; Maxwell, Don
Format:	Article
Publication date:	2021
Description:	High performance computing systems are required to solve grand challenges in many scientific disciplines. These systems assemble many components to be powerful enough to solve extremely complex problems. An inherent consequence is the intricacy of the interactions among all those components, especially when failures come into the picture. To design reliable supercomputing platforms in the future, it is crucial to understand how these systems fail. This paper presents the results of studying multi-year failure and workload records of a powerful supercomputer that topped the world rankings. We provide a thorough analysis of the data and characterize the reliability of the system along several dimensions: failure classification, failure-rate modelling, and the interplay between failures and workload. The results shed some light on the dynamics of top-level supercomputers and on sensitive areas ripe for improvement.
Country:	Repositorio UNA
Institution:	Universidad Nacional de Costa Rica
Repository:	Repositorio UNA
Language:	English
OAI Identifier:	oai:null:11056/21418
Online access:	http://hdl.handle.net/11056/21418
Keywords:	FAULT TOLERANCE; RESILIENCE; FAILURE ANALYSIS; HIGH PERFORMANCE COMPUTING