


7th MLSys 2024: Santa Clara, CA, USA
- Phillip B. Gibbons, Gennady Pekhimenko, Christopher De Sa: Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024. mlsys.org 2024
- Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy: Punica: Multi-Tenant LoRA Serving.
- Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry: ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time.
- Gyudong Kim, Mehdi Ghasemi, Soroush Heidari, Seungryong Kim, Young Geun Kim, Sarma B. K. Vrudhula, Carole-Jean Wu: HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning.
- Mohamed Assem Ibrahim, Shaizeen Aga, Ada Li, Suchita Pati, Mahzabeen Islam: JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training.
- Milos Nikolic, Enrique Torres-Sánchez, Jiahui Wang, Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Kareem Ibrahim, Andreas Moshovos: Schrodinger's FP: Training Neural Networks with Dynamic Floating-Point Containers.
- Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang: Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping.
- Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han: AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration.
- Ye Tian, Zhen Jia, Ziyue Luo, Yida Wang, Chuan Wu: DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines.
- Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath: Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference.
- Kiwan Maeng, G. Edward Suh: Accelerating ReLU for MPC-Based Private Inference with a Communication-Efficient Sign Estimation.
- Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang: FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics.
- Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You: HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices.
- Yifei Xu, Yuning Chen, Xumiao Zhang, Xianshang Lin, Pan Hu, Yunfei Ma, Songwu Lu, Wan Du, Zhuoqing Mao, Ennan Zhai, Dennis Cai: CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation.
- Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci: Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving.
- Jingtian Dang, Jianming Tong, Anupam Golder, Cong Hao, Arijit Raychowdhury, Tushar Krishna: Accurate Low-Degree Polynomial Approximation of Non-Polynomial Operators for Fast Private Inference in Homomorphic Encryption.
- Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai Li, Yiran Chen: SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models.
- Song Bian, Dacheng Li, Hongyi Wang, Eric P. Xing, Shivaram Venkataraman: Does Compressing Activations Help Model Parallel Training?
- Alok Tripathy, Katherine A. Yelick, Aydin Buluç: Distributed Matrix-Based Sampling for Graph Neural Network Training.
- Liang Luo, Buyun Zhang, Michael Tsang, Yinbin Ma, Ching-Hsiang Chu, Yuxin Chen, Shen Li, Yuchen Hao, Yanli Zhao, Guna Lakshminarayanan, Ellie Wen, Jongsoo Park, Dheevatsa Mudigere, Maxim Naumov: Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation.
- Shan Yu, Zhenting Zhu, Yu Chen, Hanchen Xu, Pengzhan Zhao, Yang Wang, Arthi Padmanabhan, Hugo Latapie, Harry Xu: VQPy: An Object-Oriented Approach to Modern Video Analytics.
- Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph Gonzalez, Ion Stoica: SLoRA: Scalable Serving of Thousands of LoRA Adapters.
- Ilia Markov, Kaveh Alim, Elias Frantar, Dan Alistarh: L-GreCo: Layerwise-adaptive Gradient Compression For Efficient Data-parallel Deep Learning.
- In Gim, Guojun Chen, Seung-Seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong: Prompt Cache: Modular Attention Reuse for Low-Latency Inference.
- Yunhao Yang, Neel P. Bhatt, Tyler Ingebrand, William Ward, Steven Carr, Atlas Wang, Ufuk Topcu: Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems.
- Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, Alexey Tumanov: VIDUR: A Large-Scale Simulation Framework for LLM Inference.
- Jian Meng, Yuan Liao, Anupreetham Anupreetham, Ahmed Hassan, Shixing Yu, Han-Sok Suh, Xiaofeng Hu, Jae-sun Seo: Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design.
- Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, Atlas Wang: Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache.
- Yuxuan Zhu, Jiachen Liu, Mosharaf Chowdhury, Fan Lai: FedTrans: Efficient Federated Learning via Multi-Model Transformation.
- Shixiong Qi, K. K. Ramakrishnan, Myungjin Lee: LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning.
- Yubo Gao, Maryam Haghifam, Christina Giannoula, Renbo Tu, Gennady Pekhimenko, Nandita Vijaykumar: Proteus: Preserving Model Confidentiality during Graph Optimizations.
- Elias Frantar, Dan Alistarh: QMoE: Sub-1-Bit Compression of Trillion Parameter Models.
- Size Zheng, Renze Chen, Meng Li, Zihao Ye, Luis Ceze, Yun Liang: vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs.
- Yichen Qian, Yongyi He, Rong Zhu, Jintao Huang, Zhijian Ma, Haibin Wang, Yaohua Wang, Xiuyu Sun, Defu Lian, Bolin Ding, Jingren Zhou: UniDM: A Unified Framework for Data Manipulation with Large Language Models.
- Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang: Efficient Post-training Quantization with FP8 Formats.
- Isha Chaudhary, Alex Renda, Charith Mendis, Gagandeep Singh: COMET: Neural Cost Model Explanation Framework.
- Yash Akhauri, Mohamed S. Abdelfattah: On Latency Predictors for Neural Architecture Search.
- Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Basar, Ravi K. Iyer: FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms.
