NDS Group @ Shanghai AI Lab

We enhance the performance and efficiency of machine learning systems!

The Network and Distributed Systems (NDS) Group at the Shanghai Artificial Intelligence Laboratory, led by Dr. Peng Sun, specializes in developing efficient systems and architectures for deep learning model training and deployment. We build both internal and open-source systems to efficiently train large language models and multimodal models on thousands of AI chips.

Our research has been recognized at leading conferences such as OSDI, NSDI, and ASPLOS, highlighting our commitment to innovation. We received the Best Paper Award at ASPLOS 2024 and the Distinguished Paper Award at ASPLOS 2023.

The NDS Group operates under the Center of AI Training and Computation (CAIF), led by Prof. Dahua Lin and Xingcheng Zhang.

We are always seeking passionate individuals, including full-time research engineers and interns, to join our team. If you are interested in contributing to cutting-edge research in machine learning systems, please reach out to us at sunpeng@pjlab.org.cn.




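  • (2025 enrollment) The Shanghai AI Laboratory's 2025 joint PhD training program with universities is now open. Undergraduates graduating in 2025 who are eligible for recommended (exam-exempt) graduate admission are welcome to apply. For details, please see the admissions overview.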


selected publications

  1. arxiv
    LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
    arXiv preprint arXiv:2406.18485, 2024
  2. ASPLOS
    Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
    Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang
    In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024
  3. NSDI
    Characterization of Large Language Model Development in the Datacenter
    Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, and Tianwei Zhang
    In 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024, Santa Clara, CA, April 15-17, 2024
  4. ASPLOS
    Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
    Qinghao Hu, Meng Zhang, Peng Sun, Yonggang Wen, and Tianwei Zhang
    In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, Vancouver, BC, Canada, March 25-29, 2023
  5. OSDI
    Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
    Qinghao Hu, Zhisheng Ye, Meng Zhang, Qiaoling Chen, Peng Sun, Yonggang Wen, and Tianwei Zhang
    In 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023, Boston, MA, USA, July 10-12, 2023