KVSmooth: Outlier Suppression Algorithm for KVCache Quantization¶
Overview¶
- Problem: In KVCache quantization, a small number of outliers in the key significantly increase the quantization scale. This leaves too few effective bits for most channels, which degrades attention scores and generation quality.
- Objective: Compress the dynamic range of `K` to make it easier to quantize, while maintaining numerical stability and accuracy, without changing the expected value of the attention score `QK^T`.
Preparations¶
Install msModelSlim. For details, see msModelSlim Installation Guide.
Principle and Implementation¶
Principle¶
Core Logic
- Smooth the activation values `key_states` of the KVCache by fusing the scaling coefficient `s` into the Q/K projection or normalization weight before RoPE:
  - `K' = K / s`
  - `Q' = Q × s`
  - With `Q'K'^T = QK^T`, the attention score remains unchanged, while the dynamic range of `K` is compressed and quantization becomes more robust.
- Outliers are migrated from `key_states` to `query_states`. During inference, only the `key_states` written to the KVCache are quantized, while `query_states` are not. This migration is therefore acceptable and does not introduce additional quantization error.
- RoPE rotates channels in pairs, so the channel dimensions have a pairwise relationship. The algorithm first takes the maximum between paired channels, and then restores the paired structure for scaling.
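The core logic above can be illustrated with a minimal pure-Python sketch. It is not the msModelSlim implementation: the pairing layout `(i, i + half)` (the rotate-half convention), the helper names, and the choice of the paired `|K|` maximum itself as the scale are all assumptions made for illustration. The real scale also depends on `smooth_factor`, whose formula is not given here.

```python
import math

def rope_rotate(vec, angles):
    """Rotate channel pairs (i, i + half) by angles[i] (rotate-half layout, assumed)."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        c, s = math.cos(angles[i]), math.sin(angles[i])
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def paired_scale(abs_max):
    """Take the max over each RoPE pair, then restore the paired layout."""
    half = len(abs_max) // 2
    pair_max = [max(abs_max[i], abs_max[i + half]) for i in range(half)]
    return pair_max + pair_max

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.5, -1.0, 2.0, 0.25]
k = [8.0, 0.1, -0.2, 0.3]          # channel 0 carries an outlier
q_angles = [0.3, 1.1]              # query and key sit at different positions,
k_angles = [0.7, 0.2]              # so their rotation angles differ

# Hypothetical scale: the paired per-channel |K| maximum (illustration only).
s = paired_scale([abs(x) for x in k])

k_smooth = [x / c for x, c in zip(k, s)]   # K' = K / s
q_smooth = [x * c for x, c in zip(q, s)]   # Q' = Q × s

score = dot(rope_rotate(q, q_angles), rope_rotate(k, k_angles))
score_smooth = dot(rope_rotate(q_smooth, q_angles), rope_rotate(k_smooth, k_angles))
```

Because both channels of a pair share one scale value, scaling commutes with the pairwise rotation. This is why `score` and `score_smooth` agree even though the scaling is applied before RoPE, while the largest magnitude in `k_smooth` drops from 8.0 to 1.0.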
Implementation¶
Code Implementation¶
The algorithm is implemented in msmodelslim/processor/kv_smooth, and the processing flow consists of two phases.
Observation phase¶
- Phase: `preprocess`.
- Wrap `past_key_values` with an injected observer, and capture `key_states` when the attention module calls `Cache.update()`.
- Use the observer to aggregate min/max over the `[batch, seq]` dimensions to obtain the maximum absolute value of each channel at each layer, which serves as the statistical basis for scaling.
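The aggregation step can be sketched as follows. `ChannelMinMaxObserver` is a hypothetical stand-in, not the observer class in msModelSlim, and the input is simplified to a nested list shaped `[batch][seq][channel]` (real `key_states` tensors also carry a head dimension).

```python
class ChannelMinMaxObserver:
    """Hypothetical observer: running per-channel min/max over batch and seq."""

    def __init__(self, num_channels):
        self.min_val = [float("inf")] * num_channels
        self.max_val = [float("-inf")] * num_channels

    def update(self, key_states):
        # key_states: nested list shaped [batch][seq][channel] (simplified layout)
        for sample in key_states:
            for token in sample:
                for c, value in enumerate(token):
                    self.min_val[c] = min(self.min_val[c], value)
                    self.max_val[c] = max(self.max_val[c], value)

    def abs_max(self):
        # The per-channel maximum absolute value follows from the tracked min/max.
        return [max(abs(lo), abs(hi)) for lo, hi in zip(self.min_val, self.max_val)]

observer = ChannelMinMaxObserver(num_channels=2)
observer.update([[[1.0, -3.0], [0.5, 2.0]]])   # first calibration batch
observer.update([[[-4.0, 0.1]]])               # second calibration batch
channel_abs_max = observer.abs_max()
```

Statistics accumulate across calls, so each calibration batch only tightens the running min/max. One such observer would be kept per layer.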
Smoothing phase¶
- Phase: `postprocess`.
- Calculate the scaling vector based on the maximum value of `|key_states|`, and rewrite the `weight` (and optional `bias`) of the corresponding module located before RoPE according to the fusion mode, so that the `key_states` written to the KVCache after RoPE are smoothed. Meanwhile, `query_states` are amplified accordingly.
  - `state-rope-linear`: Fold the scaling into `k_proj`/`q_proj` along the path `Linear → RoPE → KVCache`.
  - `state-rope-norm`: Fold the scaling into `k_norm`/`q_norm` along the path `Norm → RoPE → KVCache`.
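For the `state-rope-linear` case, the folding can be sketched as below, assuming the usual `y = W x + b` linear convention with one weight row per output channel. The helper names `linear` and `fold_scale` are hypothetical, and the scale values are made up for the example.

```python
def linear(x, weight, bias):
    """y[o] = sum_i weight[o][i] * x[i] + bias[o]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def fold_scale(weight, bias, scale, invert):
    """Scale each output channel: divide when invert=True (K side), multiply otherwise (Q side)."""
    factors = [1.0 / s if invert else s for s in scale]
    new_weight = [[w * f for w in row] for row, f in zip(weight, factors)]
    new_bias = [b * f for b, f in zip(bias, factors)]
    return new_weight, new_bias

x = [1.0, 2.0]
k_weight, k_bias = [[2.0, 3.0], [0.5, -1.0]], [0.0, 1.0]
q_weight, q_bias = [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]
scale = [4.0, 0.5]                      # illustrative per-channel scale

k_ref = linear(x, k_weight, k_bias)     # original key output
q_ref = linear(x, q_weight, q_bias)     # original query output

k_weight_s, k_bias_s = fold_scale(k_weight, k_bias, scale, invert=True)
q_weight_s, q_bias_s = fold_scale(q_weight, q_bias, scale, invert=False)

k_folded = linear(x, k_weight_s, k_bias_s)   # equals k_ref / scale
q_folded = linear(x, q_weight_s, q_bias_s)   # equals q_ref * scale
```

Folding the factor into `weight` and `bias` means no extra operation is added at inference time. For the `state-rope-norm` case, the same per-channel factor would instead be applied to the elementwise weight of the norm module.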
Application Requirements¶
- Data dependency on the calibration dataset: The suppression scaling factor is observed during calibration inference. If the data distribution of the calibration dataset deviates from that of the actual service, the smoothing effect will be affected.
- Model implementation constraints: The attention forward process must accept and use `past_key_values` or `past_key_value`. Otherwise, the suppression scaling factor cannot be observed.
- Fusion point constraints: Currently, fusion is supported for two types of paths: `Linear/Norm → RoPE → KVCache`.
- Fusion module constraints: The target `Linear` or `Norm` submodule must exist and have a writable `weight` (and optional `bias`). Other custom modules are not supported for now.
- RoPE assumption: By default, the algorithm performs reduction and restoration on paired channels according to the RoPE pairing. Evaluate and verify non-RoPE structures with caution.
- Quantization method assumption: The algorithm assumes that only the `key_states` and `value_states` of the KVCache are quantized, and that `query_states` are not. If `query_states` are also quantized, evaluate the applicability of this method with caution.
Function Description¶
YAML Configuration Example¶
The following example shows a YAML configuration when the algorithm is used as a processor:
    spec:
      process:
        - type: "kv_smooth"
          smooth_factor: 1.0  # Specifies the degree of smoothing aggressiveness. The value must be greater than 0, and a larger value indicates more aggressive smoothing.
          include: ["*"]  # Specifies the layers to be included. Wildcard matching is supported.
          exclude: ["model.layers.0.self_attn"]  # Specifies the layers to be excluded. Wildcard matching is supported.
YAML Configuration Fields¶
| Field | Purpose | Type | Default Value | Description | Example |
|---|---|---|---|---|---|
| `type` | Specifies the processor type. | str | `"kv_smooth"` | The value is fixed to `kv_smooth`. | `"kv_smooth"` |
| `smooth_factor` | Specifies the degree of smoothing aggressiveness. | float | `1.0` | The value must be greater than 0, and a larger value indicates more aggressive smoothing. | `1.5` |
| `include` | Specifies the modules to be included for smoothing. | List[str] | `["*"]` | Wildcard matching is supported. | `["model.layers.*.self_attn"]` |
| `exclude` | Specifies the modules to be excluded from smoothing. | List[str] | `[]` | Wildcard matching is supported. | `["model.layers.0.self_attn"]` |
Note:
- `smooth_factor` must be greater than 0.
- `include` and `exclude` support wildcard matching, for example, `model.layers.*.self_attn`.
- `exclude` has higher priority than `include`. That is, if a module matches both `include` and `exclude`, the module will be excluded.
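The priority rule can be sketched with the standard-library `fnmatch` module. This assumes shell-style wildcards; the matcher that msModelSlim actually uses may behave differently, and `is_selected` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def is_selected(name, include, exclude):
    """Return True if the module name is included and not excluded (exclude wins)."""
    if any(fnmatchcase(name, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(name, pattern) for pattern in include)

include = ["*"]
exclude = ["model.layers.0.self_attn"]
names = [f"model.layers.{i}.self_attn" for i in range(3)]

# Layer 0 matches both lists, so it is dropped.
selected = [n for n in names if is_selected(n, include, exclude)]
```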
Model Adaptation¶
Interface and Data Structure¶
    from abc import ABC
    from enum import Enum
    from typing import List

    from pydantic import BaseModel  # Assumption: BaseModel is the pydantic base class.


    # Fusion mode enumeration
    class KVSmoothFusedType(Enum):
        StateViaRopeToNorm = 'state-rope-norm'      # Supports the fusion of key_states/query_states → Norm.
        StateViaRopeToLinear = 'state-rope-linear'  # Supports the fusion of key_states/query_states → Linear.


    # Information about the KVSmooth unit, describing the model substructure and fusion mode
    class KVSmoothFusedUnit(BaseModel):
        attention_name: str                # Specifies the full module name, such as "model.layers.0.self_attn".
        layer_idx: int                     # Specifies the layer index, such as 0.
        fused_from_query_states_name: str  # Specifies the name of the module on the query_states branch before RoPE, such as "q_proj" or "q_norm".
        fused_from_key_states_name: str    # Specifies the name of the module on the key_states branch before RoPE, such as "k_proj" or "k_norm".
        fused_type: KVSmoothFusedType      # Specifies the fusion type.


    # Interface for adapting models to the KVSmooth algorithm
    class KVSmoothFusedInterface(ABC):
        # Retrieve a list of all units in the model that can be processed by using KVSmooth.
        def get_kvsmooth_fused_subgraph(self) -> List[KVSmoothFusedUnit]: ...

        # Obtain the head_dim information.
        def get_head_dim(self) -> int: ...

        # Obtain the num_key_value_groups information.
        def get_num_key_value_groups(self) -> int: ...

        # Obtain the num_key_value_heads information.
        def get_num_key_value_heads(self) -> int: ...
Adaptation Procedure¶
- Prerequisites
  - The attention forward process must accept `past_key_values` or `past_key_value` through `kwargs` and call `Cache.update()` internally. Otherwise, the observer cannot work.
  - The target path must comply with the `Linear/Norm → RoPE → KVCache` structure.
- Procedure
  - Have the model adapter inherit `KVSmoothFusedInterface` and implement all of its methods. For reference, see msmodelslim/model/qwen3/model_adapter.py.
  - In `get_kvsmooth_fused_subgraph()`, return a `KVSmoothFusedUnit` for each layer and specify the following parameters:
    - `attention_name`: specifies the full path (for example, `model.layers.{i}.self_attn`) that is consistent with `named_modules()`.
    - `layer_idx`: specifies the layer index used for `Cache.update()`.
    - `fused_from_query_states_name`: specifies the name of the `norm` or `linear` submodule on the `query_states` branch before RoPE, for example, `q_proj`.
    - `fused_from_key_states_name`: specifies the name of the `norm` or `linear` submodule on the `key_states` branch before RoPE, for example, `k_proj`.
    - `fused_type`: specifies the fusion mode. Valid values: `StateViaRopeToNorm` or `StateViaRopeToLinear`.
  - Provide the global structure information of the model by using `get_head_dim()`, `get_num_key_value_heads()`, and `get_num_key_value_groups()`.
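The procedure can be sketched with a self-contained toy adapter. To keep the example runnable without msModelSlim, the interface types are redefined locally as stand-ins (a `dataclass` replaces `BaseModel`); in a real adapter they would be imported from the library. `ToyConfig` and `ToyModelAdapter` are hypothetical, and deriving `head_dim` from `hidden_size` is an assumption that does not hold for every model.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import List

# Local stand-ins mirroring the interface listed above (not the library classes).
class KVSmoothFusedType(Enum):
    StateViaRopeToNorm = "state-rope-norm"
    StateViaRopeToLinear = "state-rope-linear"

@dataclass
class KVSmoothFusedUnit:
    attention_name: str
    layer_idx: int
    fused_from_query_states_name: str
    fused_from_key_states_name: str
    fused_type: KVSmoothFusedType

class KVSmoothFusedInterface(ABC):
    @abstractmethod
    def get_kvsmooth_fused_subgraph(self) -> List[KVSmoothFusedUnit]: ...
    @abstractmethod
    def get_head_dim(self) -> int: ...
    @abstractmethod
    def get_num_key_value_groups(self) -> int: ...
    @abstractmethod
    def get_num_key_value_heads(self) -> int: ...

@dataclass
class ToyConfig:  # hypothetical model configuration
    num_hidden_layers: int
    hidden_size: int
    num_attention_heads: int
    num_key_value_heads: int

class ToyModelAdapter(KVSmoothFusedInterface):
    def __init__(self, config: ToyConfig):
        self.config = config

    def get_kvsmooth_fused_subgraph(self) -> List[KVSmoothFusedUnit]:
        # One unit per decoder layer, on the Linear → RoPE → KVCache path.
        return [
            KVSmoothFusedUnit(
                attention_name=f"model.layers.{i}.self_attn",
                layer_idx=i,
                fused_from_query_states_name="q_proj",
                fused_from_key_states_name="k_proj",
                fused_type=KVSmoothFusedType.StateViaRopeToLinear,
            )
            for i in range(self.config.num_hidden_layers)
        ]

    def get_head_dim(self) -> int:
        # Assumption: head_dim is not configured separately.
        return self.config.hidden_size // self.config.num_attention_heads

    def get_num_key_value_groups(self) -> int:
        return self.config.num_attention_heads // self.config.num_key_value_heads

    def get_num_key_value_heads(self) -> int:
        return self.config.num_key_value_heads

adapter = ToyModelAdapter(ToyConfig(num_hidden_layers=2, hidden_size=64,
                                    num_attention_heads=8, num_key_value_heads=2))
units = adapter.get_kvsmooth_fused_subgraph()
```

A model whose Q/K branches pass through a norm before RoPE would instead name `q_norm`/`k_norm` and use `KVSmoothFusedType.StateViaRopeToNorm`.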
FAQ¶
Fallback Mismatch¶
Symptom: The warning log contains the message `are not matched any module`.
Solution: Check the complete module names to confirm whether `include` or `exclude` is set incorrectly.
Missing Head Dimension Information¶
Symptom: `UnsupportedError` is thrown, indicating that `get_head_dim`, `get_num_key_value_groups`, and `get_num_key_value_heads` are missing.
Solution: Ensure that the corresponding model adapter implements `KVSmoothFusedInterface`. Otherwise, the model is not applicable to the algorithm.
Attention Not Applicable¶
Symptom: The log contains the warning `past_key_values and past_key_value both are None`.
Solution: Check the model file in Transformers to ensure that `past_key_values` or `past_key_value` is passed to the forward process of the attention layer. Otherwise, the model is not applicable to the algorithm.
Inconsistent Module Names¶
Symptom: `ToDoError` is thrown with a message containing `has no submodule`.
Solution: Check the model adapter to ensure that the values of `fused_from_query_states_name` and `fused_from_key_states_name` are consistent with the actual names of the fused submodules.