publications

2024

  1. arXiv
    Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
    2024
  2. SOSP
    LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism
    The 30th ACM Symposium on Operating Systems Principles (SOSP 2024), 2024
  3. arXiv
    LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
    arXiv preprint arXiv:2406.18485, 2024
  4. ICS
    Ymir: A Scheduler for Foundation Model Fine-tuning Workloads in Datacenters
    Wei Gao, Weiming Zhuang, Minghao Li, Peng Sun, Yonggang Wen, and Tianwei Zhang
    In Proceedings of the 38th ACM International Conference on Supercomputing, 2024
  5. ICS
    AutoSched: An Adaptive Self-configured Framework for Scheduling Deep Learning Training Workloads
    Wei Gao, Xu Zhang, Shan Huang, Shangwei Guo, Peng Sun, Yonggang Wen, and Tianwei Zhang
    In Proceedings of the 38th ACM International Conference on Supercomputing, 2024
  6. arXiv
    AMSP: Reducing Communication Overhead of ZeRO for Efficient LLM Training
    Qiaoling Chen, Qinghao Hu, Guoteng Wang, Yingtong Xiong, Ting Huang, Xun Chen, Yang Gao, Hang Yan, Yonggang Wen, Tianwei Zhang, and Peng Sun
    2024
  7. IWQoS
    Lins: Reducing Communication Overhead of ZeRO for Efficient LLM Training
    Qiaoling Chen, Qinghao Hu, Guoteng Wang, Yingtong Xiong, Ting Huang, Xun Chen, Yang Gao, Hang Yan, Yonggang Wen, Tianwei Zhang, and Peng Sun
    2024
  8. ASPLOS
    Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
    Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang
    In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024
  9. arXiv
    InternLM2 Technical Report
    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, and others
    arXiv preprint arXiv:2403.17297, 2024
  10. arXiv
    InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding
    Qiaoling Chen, Diandian Gu, Guoteng Wang, Xun Chen, YingTong Xiong, Ting Huang, Qinghao Hu, Xin Jin, Yonggang Wen, Tianwei Zhang, and others
    arXiv preprint arXiv:2401.09149, 2024
  11. WWW
    FedDSE: Distribution-aware Sub-model Extraction for Federated Learning over Resource-constrained Devices
    Haozhao Wang, Yabo Jia, Meng Zhang, Qinghao Hu, Hao Ren, Peng Sun, Yonggang Wen, and Tianwei Zhang
    In Proceedings of the ACM on Web Conference 2024, 2024
  12. TC
    UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands
    Wei Gao, Zhisheng Ye, Peng Sun, Tianwei Zhang, and Yonggang Wen
    IEEE Transactions on Computers, 2024
  13. CSUR
    Deep Learning Workload Scheduling in GPU Datacenters: A Survey
    Zhisheng Ye, Wei Gao, Qinghao Hu, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, and Yonggang Wen
    ACM Comput. Surv., 2024
  14. NSDI
    Characterization of Large Language Model Development in the Datacenter
    Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang
    In 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024, Santa Clara, CA, April 15-17, 2024, 2024

2023

  1. ASPLOS
    Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
    Qinghao Hu, Meng Zhang, Peng Sun, Yonggang Wen, and Tianwei Zhang
    In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, Vancouver, BC, Canada, March 25-29, 2023, 2023
  2. OSDI
    Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
    Qinghao Hu, Zhisheng Ye, Meng Zhang, Qiaoling Chen, Peng Sun, Yonggang Wen, and Tianwei Zhang
    In 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023, Boston, MA, USA, July 10-12, 2023, 2023
  3. arXiv
    Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication
    Meng Zhang, Qinghao Hu, Peng Sun, Yonggang Wen, and Tianwei Zhang
    arXiv preprint arXiv:2303.01277, 2023

2022

  1. TBD
    GradientFlow: Optimizing Network Performance for Large-Scale Distributed DNN Training
    Peng Sun, Yonggang Wen, Ruobing Han, Wansen Feng, and Shengen Yan
    IEEE Trans. Big Data, 2022
  2. TPDS
    Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters
    Zhisheng Ye, Peng Sun, Wei Gao, Tianwei Zhang, Xiaolin Wang, Shengen Yan, and Yingwei Luo
    IEEE Trans. Parallel Distributed Syst., 2022
  3. SoCC
    Titan: A Scheduler for Foundation Model Fine-tuning Workloads
    Wei Gao, Peng Sun, Yonggang Wen, and Tianwei Zhang
    In Proceedings of the 13th Symposium on Cloud Computing, SoCC 2022, San Francisco, California, November 7-11, 2022, 2022
  4. ATC
    Primo: Practical Learning-Augmented Systems with Interpretable Models
    Qinghao Hu, Harsha Nori, Peng Sun, Yonggang Wen, and Tianwei Zhang
    In 2022 USENIX Annual Technical Conference, USENIX ATC 2022, Carlsbad, CA, USA, July 11-13, 2022, 2022

2021

  1. SoCC
    Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
    Wei Gao, Zhisheng Ye, Peng Sun, Yonggang Wen, and Tianwei Zhang
    In SoCC ’21: ACM Symposium on Cloud Computing, Seattle, WA, USA, November 1-4, 2021, 2021
  2. SC
    Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters
    Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, and Tianwei Zhang
    In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, 2021

2020

  1. TBD
    GraphMP: I/O-Efficient Big Graph Analytics on a Single Commodity Machine
    Peng Sun, Yonggang Wen, Ta Nguyen Binh Duong, and Xiaokui Xiao
    IEEE Trans. Big Data, 2020
  2. ICDCS
    Elan: Towards Generic and Efficient Elastic Training for Deep Learning
    Lei Xie, Jidong Zhai, Baodong Wu, Yuanbo Wang, Xingcheng Zhang, Peng Sun, and Shengen Yan
    In 40th IEEE International Conference on Distributed Computing Systems, ICDCS 2020, Singapore, November 29 - December 1, 2020, 2020

2019

  1. arXiv
    Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes
    Peng Sun, Wansen Feng, Ruobing Han, Shengen Yan, and Yonggang Wen
    CoRR, 2019