NDS Group @ Shanghai AI Lab
We enhance the performance and efficiency of machine learning systems!
The Network and Distributed Systems (NDS) Group at the Shanghai Artificial Intelligence Laboratory, led by Dr. Peng Sun, specializes in developing efficient systems and architectures for deep learning model training and deployment. We build both internal and open-source systems to efficiently train large language models and multimodal models on thousands of AI chips.
Our research has been published at leading conferences such as OSDI, NSDI, and ASPLOS. We received the Best Paper Award at ASPLOS 2024 and the Distinguished Paper Award at ASPLOS 2023.
The NDS Group operates under the Center of AI Training and Computation (CAIF), led by Prof. Dahua Lin and Xingcheng Zhang.
We are always seeking passionate individuals, including full-time research engineers and interns, to join our team. If you are interested in contributing to cutting-edge research in machine learning systems, please reach out to us at sunpeng@pjlab.org.cn.
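Joint PhD Program
The Network and Distributed Systems (NDS) Group is recruiting PhD students to be jointly trained with Shanghai Jiao Tong University or Fudan University. Research focuses mainly on performance optimization and energy optimization of training systems for large-scale AI models. The group has a strong research culture, provides ample computing resources and a living stipend, and maintains close collaboration with well-known systems research teams in China and abroad. It produces high-quality research results every year.
Students interested in the NDS Group should send their CV to the advisor, Peng Sun, at sunpeng@pjlab.org.cn.
- (2025 enrollment) (positions filled) The Shanghai AI Laboratory's 2025 joint PhD program with universities is now open. Undergraduates graduating in 2025 who are eligible for recommended admission without examination are welcome to apply. For details, please refer to the admissions brochure.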
news
selected publications
- [arXiv] LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism. arXiv preprint arXiv:2406.18485, 2024
- [ASPLOS] Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024
- [NSDI] Characterization of Large Language Model Development in the Datacenter. In 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024, Santa Clara, CA, April 15-17, 2024
- [ASPLOS] Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, Vancouver, BC, Canada, March 25-29, 2023
- [OSDI] Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters. In 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023, Boston, MA, USA, July 10-12, 2023