A study of checkpointing in large scale training of deep neural networks

 

Bibliographic Details
Authors: Rojas, Elvis; Kahira, Albert Njoroge; Meneses, Esteban; Bautista-Gomez, Leonardo; Badia, Rosa M
Format: preprint
Publication date: 2021
Description: Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems. While significant effort has gone into facilitating distributed training in DL frameworks, fault tolerance has been largely ignored. Checkpoint-restart is a common fault tolerance technique in HPC workloads. In this work, we examine the checkpointing implementation of popular DL platforms. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide take-away points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
Country: Repositorio UNA
Institution: Universidad Nacional de Costa Rica
Repository: Repositorio UNA
Language: English
OAI Identifier: oai:null:11056/26772
Online Access: http://hdl.handle.net/11056/26772
https://doi.org/10.48550/arXiv.2012.00825
Keywords: APRENDIZAJE PROFUNDO
RESILIENCIA
REDES NEURONALES
COMPUTACIÓN DE ALTO RENDIMIENTO
DEEP LEARNING
RESILIENCE
NEURAL NETWORKS
HIGH PERFORMANCE COMPUTING