


8th MLSys 2025: Santa Clara, CA, USA
- Matei Zaharia, Gauri Joshi, Yingyan (Celine) Lin: Proceedings of the Eighth Conference on Machine Learning and Systems, MLSys 2025, Santa Clara, CA, USA, May 12-15, 2025. OpenReview.net/mlsys.org 2025

Accept
- Carlo Siebenschuh, Kyle Hippe, Ozan Gökdemir, Alexander Brace, Arham Mushtaq Khan, Khalid Hossain, Yadu N. Babuji, Nicholas Chia, Venkatram Vishwanath, Arvind Ramanathan, Rick L. Stevens, Ian T. Foster, Robert Underwood: AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine.
- Rya Sanovar, Srikant Bharadwaj, Renée St. Amant, Victor Rühle, Saravan Rajmohan: LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers.
- Md Saidul Hoque Anik, Ariful Azad: SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations.
- Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze: FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.
- Shu Liu, Asim Biswal, Amog Kamsetty, Audrey Cheng, Luis Gaspar Schroeder, Liana Patel, Shiyi Cao, Xiangxi Mo, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia: Optimizing LLM Queries in Relational Data Analytics Workloads.
- Abhishek Moitra, Arkapravo Ghosh, Shrey Agrawal, Aporva Amarnath, Karthik Swaminathan, Priyadarshini Panda: MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs.
- Xinyi Zhang, Hanyu Zhao, Wencong Xiao, Xianyan Jia, Fei Xu, Yong Li, Wei Lin, Fangming Liu: Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling.
- Jianheng Ling, Pratik Worah, Yawen Wang, Yunchuan Kong, Chunlei Wang, Clifford Stein, Diwakar Gupta, Jason Behmer, Logan A. Bush, Prakash Ramanan, Rajesh Kumar, Thomas Chestna, Yajing Liu, Ying Liu, Ye Zhao, Kathryn S. McKinley, Meeyoung Park, Martin Maas: LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions.
- Ke Hong, Xiuhong Li, Lufang Chen, Qiuli Mao, Guohao Dai, Xuefei Ning, Shengen Yan, Yun Liang, Yu Wang: SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling.
- Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan: Balancing Pipeline Parallelism with Vocabulary Parallelism.
- Vithursan Thangarasa, Ganesh Venkatesh, Mike Lasby, Nish Sinnadurai, Sean Lie: Self-Data Distillation for Recovering Quality in Pruned Large Language Models.
- Francesco Daghero, Daniele Jahier Pagliari, Francesco Conti, Luca Benini, Massimo Poncino, Alessio Burrello: Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers.
- Lu Wang, Mayukh Das, Fangkai Yang, Bo Qiao, Hang Dong, Si Qin, Victor Rühle, Chetan Bansal, Eli Cortez, Íñigo Goiri, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang: ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud.
- Yujin Wang, Shunan Dong, Zongle Huang, Yichen You, Liu He, Huazhong Yang, Yongpan Liu, Hongyang Jia: HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression.
- Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, Eiko Yoneki: ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments.
- Geonhwa Jeong, Po-An Tsai, Abhimanyu Rajeshkumar Bambhaniya, Stephen W. Keckler, Tushar Krishna: Enabling Unstructured Sparse Acceleration on Structured Sparse Accelerators.
- Yue Gao, Ilia Shumailov, Kassem Fawaz: Supply-Chain Attacks in Machine Learning Frameworks.
- Jiacheng Yang, Jun Wu, Zhen Zhang, Xinwei Fu, Zhiying Xu, Zhen Jia, Yida Wang, Gennady Pekhimenko: ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation.
- Sohaib Ahmad, Qizheng Yang, Haoliang Wang, Ramesh K. Sitaraman, Hui Guan: DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling.
- Chenxi Yang, Yan Li, Martin Maas, Mustafa Uysal, Ubaid Ullah Hafeez, Arif Merchant, Richard McDougall: A Bring-Your-Own-Model Approach for ML-Driven Storage Placement in Warehouse-Scale Computers.
- Daiyaan Arfeen, Zhen Zhang, Xinwei Fu, Gregory R. Ganger, Yida Wang: PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training.
- Beichen Huang, Yueming Yuan, Zelei Shao, Minjia Zhang: MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators.
- Zichao Yue, Chenhui Deng, Zhiru Zhang: Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs.
- Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, Christina Delimitrou: Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training.
- Neel P. Bhatt, Yunhao Yang, Rohan Siva, Daniel Milan, Ufuk Topcu, Zhangyang Wang: Know Where You're Uncertain When Planning with Multimodal Foundation Models: A Formal Framework.
- Seonjin Na, Geonhwa Jeong, Byung Hoon Ahn, Aaron Jezghani, Jeffrey Young, Christopher J. Hughes, Tushar Krishna, Hyesoon Kim: FlexInfer: Flexible LLM Inference with CPU Computations.
- Baichuan Huang, Amir Aminifar: Efficient On-Device Machine Learning with a Biologically-Plausible Forward-Only Algorithm.
- Qidong Su, Wei Zhao, Xin Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko: Seesaw: High-throughput LLM Inference via Model Re-sharding.
- Minxue Tang, Yitu Wang, Jingyang Zhang, Louis DiValentin, Aolin Ding, Amin Hass, Yiran Chen, Hai Li: FedProphet: Memory-Efficient Federated Adversarial Training via Robust and Consistent Cascade Learning.
- Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han: QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving.
- Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han: LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention.
- Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, Saravan Rajmohan: AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds.
- Zaifeng Pan, Yitong Ding, Yue Guan, Zheng Wang, Zhongkai Yu, Xulong Tang, Yida Wang, Yufei Ding: FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference.
- Tianshu Huang, Arjun Ramesh, Emily Ruppel, Nuno Pereira, Anthony Rowe, Carlee Joe-Wong: Interference-aware Edge Runtime Prediction with Conformal Matrix Completion.
- Maximilian Böther, Abraham Sebastian, Pranjal Awasthi, Ana Klimovic, Srikumar Ramalingam: On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions.
- Huaifeng Zhang, Ahmed Ali-Eldin: The Hidden Bloat in Machine Learning Systems.
- Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu: COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts.
- Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, Jinwon Lee: APOLLO: SGD-like Memory, AdamW-level Performance.
- Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Xin Liu: TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives.
- Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Hari Subramoni, Dhabaleswar K. Panda: Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer.
- Chendong Wang, Anlan Zhang, Yifan Yang, Lili Qiu, Yuqing Yang, Xinyang Jiang, Feng Qian, Suman Banerjee: VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution.
- Jiachen Liu, Fan Lai, Eric Ding, Yiwen Zhang, Mosharaf Chowdhury: Venn: Resource Management For Collaborative Learning Jobs.
- Kasper Overgaard Mortensen, Konstantinos Skitsas, Emil Morre Christensen, Mohammad Sadegh Talebi, Andreas Pavlogiannis, Davide Mottin, Panagiotis Karras: SwiftVI: Time-Efficient Planning and Learning with MDPs.
- Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu: NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference.
- Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen: Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving.
- Tianle Zhong, Jiechen Zhao, Qiang Su, Geoffrey Fox: Youmu: Efficient Columnar Data Pipeline for LLM Training.
- Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang: Context Parallelism for Scalable Million-Token Inference.
- Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, Ravi Netravali: Marconi: Prefix Caching for the Era of Hybrid LLMs.

Accept with shepherding

- Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, Horace He: FlexAttention: A Programming Model for Generating Fused Attention Variants.
- Sandeep Polisetty, Juelin Liu, Yi Fung, Seung-Hwan Lim, Hui Guan, Marco Serafini: Spa: Scaling Graph Neural Network Training on Large graphs via Probabilistic splitting.
- Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ziyi Xu, Yilong Zhao, Ruihang Lai, Tianqi Chen: XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models.
- Marco Federici, Davide Belli, Mart van Baalen, Amir Jalalirad, Andrii Skliar, Bence Major, Markus Nagel, Paul N. Whatmough: Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking.
- Mohammadali Shakerdargah, Shan Lu, Chao Gao, Di Niu: Mas-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-constrained Edge Devices.
- Anxhelo Xhebraj, Sean Lee, Hanfeng Chen, Vinod Grover: Scaling Deep Learning Training with MPMD Pipeline Parallelism.
- Ahmad Faraz Khan, Samuel Fountain, Ahmed M. Abdelmoniem, Ali Raza Butt, Ali Anwar: FLStore: Efficient Federated Learning Storage for non-training workloads.
- Mingkai Zheng, Zhao Zhang: Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training.
- Lorenzo Sani, Alex Iacob, Zeyu Cao, Royson Lee, Bill Marino, Yan Gao, Wanru Zhao, Dongqi Cai, Zexi Li, Xinchi Qiu, Nicholas D. Lane: Photon: Federated LLM Pre-Training.
- Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Rühle, Saravan Rajmohan: TurboAttention: Efficient attention approximation for high throughputs llm.
- Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, Yi Wu: ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation.
- Zhiqiang Xie, Hao Kang, Ying Sheng, Tushar Krishna, Kayvon Fatahalian, Christos Kozyrakis: AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution.
- Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, Chao Yang: SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention.
