NDS Group @ Shanghai AI Lab

We enhance the performance and efficiency of machine learning systems!

The Network and Distributed Systems (NDS) Group at the Shanghai Artificial Intelligence Laboratory, led by Dr. Peng Sun, specializes in developing efficient systems and architectures for deep learning model training and deployment. We create both internal and open-sourced systems to efficiently train large language models and multimodal models on thousands of AI chips.

Our research has been recognized at leading conferences such as OSDI, NSDI, and ASPLOS, highlighting our commitment to innovation. We received the Best Paper Award at ASPLOS 2024 and the Distinguished Paper Award at ASPLOS 2023.

The NDS Group operates under the Center of AI Training and Computation (CAIF), led by Prof. Dahua Lin and Xingcheng Zhang.

We are always seeking passionate individuals, including full-time research engineers and interns, to join our team. If you are interested in contributing to cutting-edge research in machine learning systems, please reach out to us at sunpeng@pjlab.org.cn.

Joint PhD Program

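The Network and Distributed Systems (NDS) Group is recruiting PhD students for joint training programs with Shanghai Jiao Tong University or Fudan University. Research focuses mainly on performance optimization and energy-consumption optimization for large-scale AI model training systems. The group has a strong research culture, provides ample computing resources and a living stipend, and works closely with well-known systems research teams in China and abroad, producing high-quality research results every year.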

Students interested in the NDS Group should send their CV to the advisor, Peng Sun, at sunpeng@pjlab.org.cn.

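  • (2025 enrollment) (all places filled) The Shanghai Artificial Intelligence Laboratory's 2025 joint PhD program with universities is now open. Undergraduates graduating in 2025 who are eligible for recommended admission to graduate study are welcome to apply. For details, see the admissions overview.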

selected publications

  1. arxiv
    Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
    2024
  2. arxiv
    LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
    arXiv preprint arXiv:2406.18485, 2024
  3. ASPLOS
    Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
    Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang
    In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024
  4. NSDI
    Characterization of Large Language Model Development in the Datacenter
    Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang
    In 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024, Santa Clara, CA, April 15-17, 2024
  5. ASPLOS
    Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
    Qinghao Hu, Meng Zhang, Peng Sun, Yonggang Wen, and Tianwei Zhang
    In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, Vancouver, BC, Canada, March 25-29, 2023
  6. OSDI
    Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
    Qinghao Hu, Zhisheng Ye, Meng Zhang, Qiaoling Chen, Peng Sun, Yonggang Wen, and Tianwei Zhang
    In 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023, Boston, MA, USA, July 10-12, 2023