The “Super Weight:” How Even a Single Parameter can Determine a Large Language Model’s Behavior
Apple Machine Learning Research, published August 21, 2025
Figure 1: Super Weight Phenomenon: Pruning a single, special scalar, called the “super weight,” can completely destroy a Large Language Model’s ability to generate text. On the left, the original Llama-7B, which contains a super weight, produces a reasonable completion. On the right, after pruning the super weight, Llama-7B generates complete gibberish. This qualitative observation has quantitative impact as well: zero-shot accuracy drops to random and perplexity increases by orders of magnitude.
A recent paper from Apple researchers, “The Super Weight in Large Language Models,” reveals that an extremely small subset of parameters in LLMs (in some cases, a single parameter) can exert a disproportionate influence on an LLM’s overall functionality (see Figure 1). This work highlights the critical role of these “super weights” and their corresponding “super activations,” offering a new insight into LLM architecture and avenues for efficient model compression. The paper provides full technical details and experimental results; in this post, we provide a high-level overview of the key findings and their implications.
Understanding and Compressing Increasingly Large Models
While LLMs demonstrate impressive capabilities, their sheer size, often comprising billions or even hundreds of billions of parameters, presents significant challenges for deployment on resource-constrained hardware such as mobile devices. Reducing the size and computational complexity of LLMs for such platforms leads to corresponding reductions in memory and power consumption, enabling them to operate locally, privately, and without an internet connection. However, understanding the internal mechanisms of LLMs is critical, as naïve compression or simplification can lead to substantial degradation in model quality.
Identifying Super Weights and Their Impact
Prior research indicated that a small percentage of parameter outliers in LLMs are vital for maintaining model quality: if these weights are significantly modified (through compression) or removed entirely (pruned), the model’s output quality suffers. While this prior work showed that this fraction can be as small as 0.01% of the weights, in models with billions of parameters this still translates to hundreds of thousands of individual weights. In this work, Apple researchers identified a remarkably small number of parameters, termed “super weights,” that, if altered, can destroy an LLM’s ability to generate coherent text, for example by increasing perplexity by three orders of magnitude and reducing zero-shot accuracy to levels consistent with random guessing. For instance, in the Llama-7B model, removing its single super weight renders the model incapable of producing meaningful output. Conversely, removing thousands of other outlier weights, even those with larger magnitudes than the super weight, results in only marginal quality degradation.
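To make the scale of that comparison concrete, the short calculation below counts how many weights a 0.01% fraction represents. The parameter counts are nominal round numbers chosen for illustration, and the helper name `count_at_fraction` is hypothetical; neither comes from the paper.

```python
def count_at_fraction(num_params: int, per_ten_thousand: int = 1) -> int:
    """Count the weights in a fraction expressed in units of 0.01% (1 per 10,000)."""
    # Integer arithmetic avoids floating-point rounding on large counts.
    return num_params * per_ten_thousand // 10_000


# Nominal model sizes (assumed round numbers, for illustration only).
nominal_sizes = {"1B": 1_000_000_000, "7B": 7_000_000_000}

for name, num_params in nominal_sizes.items():
    outliers = count_at_fraction(num_params)
    print(f"{name}: 0.01% of the weights is {outliers:,} parameters")
```

Under these assumptions, even the smallest outlier fraction reported in prior work corresponds to hundreds of thousands of weights, compared with the single super weight identified in Llama-7B.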
This work proposes a methodology for locating these super weights that requires only a single forward pass through the model. The method leverages the observation that super weights induce correspondingly rare and large activation outliers, which we term “super activations.” These super activations often appear after the super weight and persist throughout subsequent layers with constant magnitude and position, irrespective of the input prompt; their channel aligns with that of the super weight. By detecting spikes in the input and output activation distributions of specific model components (e.g., the down projection of the feed-forward network), we can locate the super weights via their corresponding super activations. Intriguingly, the super weight is consistently found in the down projection of the feed-forward network following the attention block, typically in an early layer of the network. We have compiled an index of super weight coordinates for several common, openly available LLMs to facilitate further investigation by the research community.
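The spike-based search can be illustrated on a toy example. The sketch below is not the authors' implementation: it uses a small random matrix as a stand-in for a down projection, plants one large weight and one large input activation by hand, and assumes that the output spike gives the row of the weight and the input spike gives its column. All names (`matvec`, `argmax_abs`, `locate_candidate`) and sizes are hypothetical.

```python
import random


def matvec(weight, x):
    """Multiply a dense matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def argmax_abs(values):
    """Return the index of the entry with the largest absolute value."""
    return max(range(len(values)), key=lambda i: abs(values[i]))


def locate_candidate(weight, x):
    """Guess (row, col) of a dominant weight from activation spikes.

    Assumption: the row is the channel of the output spike and the
    column is the channel of the input spike.
    """
    y = matvec(weight, x)
    return argmax_abs(y), argmax_abs(x)


# Toy stand-in for a down projection: 6 output channels, 8 input channels.
rng = random.Random(0)
OUT_DIM, IN_DIM = 6, 8
weight = [[rng.uniform(-0.1, 0.1) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]

# Plant one outlier weight and a matching large input activation (both made up).
PLANTED_ROW, PLANTED_COL = 4, 5
weight[PLANTED_ROW][PLANTED_COL] = 3.0
x = [rng.uniform(-1.0, 1.0) for _ in range(IN_DIM)]
x[PLANTED_COL] = 50.0

found = locate_candidate(weight, x)
spike_before = max(abs(v) for v in matvec(weight, x))

# Pruning the located weight removes the output spike in this toy setting.
weight[found[0]][found[1]] = 0.0
spike_after = max(abs(v) for v in matvec(weight, x))

print("candidate coordinates:", found)
print("largest output magnitude before/after pruning:", spike_before, spike_after)
```

In a real model the activations would be captured during one forward pass, for instance with hooks on each layer's down projection, rather than constructed by hand as they are here.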
Model              No.   Coordinates
Llama 7B           2     [3968, 7003]
Llama 13B          2     [2231, 2278]
Llama 13B          2     [2231, 6939]
Llama 30B          3     [5633, 12817]
Llama 30B          3     [5633, 17439]
Llama 30B          10    [5633, 14386]
Llama2 7B          1     [2533, 7890]
Llama2 13B         3     [4743, 7678]
Mistral-7B v0.1    1     [2070,...