Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster

Abstract

With the proliferation of deep learning, there is a strong need to operate GPU clusters efficiently, both for deep learning production in large AI companies and for research and development (R&D) in smaller research institutes and universities. Existing works have performed thorough trace analyses of large-scale production clusters in such companies, revealing the characteristics of deep learning production jobs and motivating the design of scheduling frameworks. However, R&D clusters differ significantly from production clusters in both job properties and user behaviors, calling for a different scheduling mechanism. In this paper, we present a detailed workload characterization of CloudBrain-I, an R&D cluster at Peng Cheng Laboratory, a research institute. By analyzing fine-grained resource utilization, we uncover a severe problem, resource underutilization, which is especially pronounced in R&D clusters yet has not been characterized by existing works. We further investigate two specific underutilization phenomena and draw several implications and lessons for R&D cluster scheduling. The traces will be open-sourced to motivate further studies in the community.
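
To make the underutilization analysis concrete, below is a minimal sketch of how such traces could be screened for low GPU utilization. The column names (job_id, gpu_util), the file name, and the 10% threshold are illustrative assumptions, not the schema of the CloudBrain-I traces.

    import pandas as pd

    # Hypothetical schema: one row per (job, sample) with a gpu_util value in [0, 100].
    # The real CloudBrain-I traces may use different column names and formats.
    trace = pd.read_csv("cloudbrain_gpu_trace.csv")

    # Average GPU utilization per job over its lifetime.
    per_job_util = trace.groupby("job_id")["gpu_util"].mean()

    # Flag jobs below an illustrative 10% threshold as underutilized.
    underutilized = per_job_util[per_job_util < 10.0]
    print(f"{len(underutilized)} of {len(per_job_util)} jobs average below 10% GPU utilization")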

Publication
In International Conference on Computer Design (ICCD)
Authors
Potato Yang, Ph.D. Student
Tianhao Fu, Ph.D. Student
Yingwei Luo, Professor
Xiaolin Wang, Professor
Zhenlin Wang, Professor