Compressing LLMs: The Truth Is Rarely Pure and Never Simple

https://machinelearning.apple.com/research/compressing-llms

Authors: Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang

Despite their remarkable achievements, modern Large Language Models (LLMs) come with exorbitant computational and memory footprints. Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs, achieving 50-60% sparsity and reducing the bit-width down to 3 or 4 bits per weight with negligible perplexity degradation over the uncompressed baseline. As recent research efforts focus on developing increasingly sophisticated compression methods, our work takes a step back and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric: perplexity (even for dense LLMs). We introduce the Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK), a collection of carefully curated tasks that re-defines the evaluation protocol for compressed LLMs, which show significant alignment with their dense counterparts on perplexity even though perplexity fails to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (e.g., 25-30%), and fail for N:M sparsity on knowledge-intensive tasks; current quantization methods are more successful than pruning; and yet pruned LLMs, even at 50% sparsity, remain robust in-context retrieval and summarization systems; among other findings. LLM-KICK is designed to holistically assess compressed LLMs' abilities for language understanding, reasoning, generation, in-context retrieval, in-context summarization, and more. We hope our study can foster the development of better LLM compression methods.
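For context on the compression regime the abstract describes, below is a minimal sketch of training-free, data-free magnitude pruning to a target sparsity ratio, written in PyTorch. It illustrates the simplest baseline in this family, not the specific SoTA pruning methods the paper evaluates (the toy model and the 50% sparsity value are placeholders chosen for illustration).

# A minimal sketch of training-free, data-free magnitude pruning at a target
# sparsity ratio. This is only an illustration of the simplest baseline in the
# setting the paper studies, not the SoTA methods it evaluates.
import torch
import torch.nn as nn

def magnitude_prune_(module: nn.Linear, sparsity: float) -> None:
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    weight = module.weight.data
    k = int(weight.numel() * sparsity)
    if k == 0:
        return
    # Threshold = k-th smallest absolute value across the whole weight matrix.
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    weight.mul_(mask.to(weight.dtype))

def prune_model(model: nn.Module, sparsity: float = 0.5) -> nn.Module:
    for m in model.modules():
        if isinstance(m, nn.Linear):
            magnitude_prune_(m, sparsity)
    return model

# Toy usage: a stand-in two-layer MLP instead of a full LLM.
toy = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
prune_model(toy, sparsity=0.5)
zeros = sum((m.weight == 0).sum().item() for m in toy.modules() if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in toy.modules() if isinstance(m, nn.Linear))
print(f"achieved sparsity: {zeros / total:.2%}")

The N:M sparsity mentioned in the abstract is a hardware-friendly variant of this idea: instead of one global threshold, each group of M consecutive weights keeps at most N nonzeros (e.g., 2:4).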

Related readings and updates.

*Equal Contributors
To deploy machine learning models on-device, practitioners use compression algorithms to shrink and speed up models while maintaining their high-quality output. A critical aspect of compression in practice is model comparison, including tracking many compression experiments, identifying subtle changes in model behavior, and negotiating complex accuracy-efficiency trade-offs. However, existing compression tools poorly support…

See paper details: https://machinelearning.apple.com/research/compress-compare

In this paper we introduce Principal Filter Analysis (PFA), an easy-to-use and effective method for neural network compression. PFA exploits the correlation between filter responses within network layers to recommend a smaller network that maintains as much as possible of the accuracy of the full model. We propose two algorithms: the first allows users to target compression to a specific network property, such as the number of trainable variable…

See paper details: https://machinelearning.apple.com/research/filter-distillation-for-network-compression
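To make the PFA teaser above concrete, here is a minimal sketch of the general idea of response-based filter selection, assuming (hypothetically) that "filter responses" means per-filter activation vectors collected over a batch: an eigenvalue analysis of their correlation suggests how many filters a layer needs to preserve most of its response energy. This is only an illustration of the concept, not Principal Filter Analysis as defined in the paper; the energy threshold and toy data are placeholders.

# Sketch: suggest how many filters a layer "needs" from the spectrum of its
# filter-response covariance. Illustrative only; not the paper's PFA algorithm.
import numpy as np

def suggested_filter_count(responses: np.ndarray, energy: float = 0.95) -> int:
    """responses: (num_samples, num_filters) activations for one layer.
    Returns the smallest number of principal components whose eigenvalues
    capture `energy` of the total response variance."""
    centered = responses - responses.mean(axis=0, keepdims=True)
    cov = np.cov(centered, rowvar=False)                 # (num_filters, num_filters)
    eigvals = np.maximum(np.linalg.eigvalsh(cov)[::-1], 0.0)  # descending, clipped
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, energy) + 1)

# Toy usage: 64 filters whose responses are driven by only 8 underlying factors,
# so far fewer than 64 components explain 95% of the variance.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 8))
mixing = rng.normal(size=(8, 64))
responses = latent @ mixing + 0.01 * rng.normal(size=(1000, 64))
print("keep roughly", suggested_filter_count(responses), "of 64 filters")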
