
KVSmooth: Outlier Suppression Algorithm for KVCache Quantization

Overview

  • Problem: In KVCache quantization, a small number of outliers in key_states significantly inflate the quantization scale. Most channels are then left with too few effective bits, which degrades the attention scores and the generation quality.
  • Objective: Compress the dynamic range of K to make it easier to quantize, while maintaining numerical stability and accuracy, without changing the expected value of the attention score QK^T.
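
The effect described above can be illustrated with a small sketch. It assumes symmetric 8-bit quantization with a single scale shared by all channels of one key vector; the quantization scheme, the sample values, and the factor 100 are illustrative assumptions and are not taken from msModelSlim.

```python
def quantize_dequantize(row, bits=8):
    """Symmetric fake quantization with one scale shared by every channel."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in row) / qmax
    return [round(v / scale) * scale for v in row], scale


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# One key vector: seven ordinary channels followed by one outlier channel.
key = [0.11, -0.23, 0.07, 0.31, -0.18, 0.26, -0.09, 48.0]

deq, scale = quantize_dequantize(key)
normal_err = mean_abs_error(key[:-1], deq[:-1])

# Dividing the outlier channel by a per-channel factor shrinks the shared scale.
# (In KVSmooth the same factor is multiplied into the matching query channel.)
s = [1.0] * 7 + [100.0]
smoothed = [k / f for k, f in zip(key, s)]
deq_s, scale_s = quantize_dequantize(smoothed)
normal_err_s = mean_abs_error(smoothed[:-1], deq_s[:-1])

print(f"scale before: {scale:.4f}, after: {scale_s:.4f}")
print(f"error on ordinary channels before: {normal_err:.4f}, after: {normal_err_s:.4f}")
```

With the outlier present, the shared scale is larger than most ordinary values, so they round to zero or to a single step. After the outlier channel is scaled down, the same channels are resolved with a much finer step.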

Preparations

Install msModelSlim. For details, see msModelSlim Installation Guide.

Principle and Implementation

Principle

Core Logic

  • Smooth the key_states activations written to the KVCache by fusing a per-channel scaling coefficient s into the Q/K projection or normalization weights that precede RoPE:
    • K' = K / s
    • Q' = Q × s
    • With Q'K'^T = QK^T, the attention score remains unchanged, while the dynamic range of K is compressed, and the quantization is more robust.
  • Outliers are migrated from key_states to query_states. During inference, only key_states written to the KVCache are quantized, while query_states are not, so this migration does not introduce additional quantization error.
  • RoPE rotates channels in pairs, so both channels of a pair must share one scaling factor for the scaling to commute with the rotation. The algorithm therefore first takes the maximum over each pair of channels and then expands the result back to the paired layout before scaling.
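
The following is a minimal sketch of the core logic for a single head, not the msModelSlim implementation. It assumes the rotate_half pairing convention (channel i is paired with channel i + d/2) and uses the pair-wise maximum of |K| directly as s; the helper names `rope`, `paired_scale`, and `score` are hypothetical.

```python
import math


def rope(x, pos, base=10000.0):
    """Apply RoPE to one head vector, pairing channel i with channel i + d/2."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        theta = pos * base ** (-i / half)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def paired_scale(abs_max):
    """Take the max over each RoPE pair, then repeat it for both channels."""
    half = len(abs_max) // 2
    pair_max = [max(abs_max[i], abs_max[i + half]) for i in range(half)]
    return pair_max + pair_max


def score(q, k, s, q_pos=5, k_pos=3):
    """Attention logit after scaling Q by s and K by 1/s before RoPE."""
    q_scaled = [a * f for a, f in zip(q, s)]
    k_scaled = [b / f for b, f in zip(k, s)]
    return sum(a * b for a, b in zip(rope(q_scaled, q_pos), rope(k_scaled, k_pos)))


q = [0.5, -1.2, 0.3, 0.9]
k = [0.2, 30.0, -0.4, 0.1]      # channel 1 is an outlier
abs_max = [abs(v) for v in k]   # stand-in for calibration statistics

reference = score(q, k, [1.0] * 4)
paired = score(q, k, paired_scale(abs_max))  # one factor per RoPE pair
unpaired = score(q, k, abs_max)              # a different factor per channel

print(reference, paired, unpaired)
```

The paired factors leave the score unchanged because a scaling that is constant within each pair commutes with the 2D rotation, whereas the unpaired factors change the score once RoPE is applied.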

Implementation

Code Implementation

The algorithm is implemented in msmodelslim/processor/kv_smooth, and the processing flow consists of two phases.

Observation phase

  • Phase: preprocess.
  • Wrap past_key_values with an injected observer, and capture key_states when the attention module calls Cache.update().
  • Use the observer to aggregate min/max over the [batch, seq] dimensions and obtain the maximum absolute value of each channel at each layer, which serves as the statistical basis for scaling.
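
The observation phase can be sketched as a wrapper around the cache object. This is a simplified illustration under stated assumptions: `ToyCache` and `KeyAbsMaxObserver` are hypothetical names, key_states is a nested list shaped [batch][seq][channel] rather than a tensor, and only the absolute maximum is tracked.

```python
class ToyCache:
    """Stand-in for a KV cache exposing update(key_states, value_states, layer_idx)."""

    def __init__(self):
        self.keys = {}

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        self.keys.setdefault(layer_idx, []).append(key_states)
        return key_states, value_states


class KeyAbsMaxObserver:
    """Wraps a cache and records per-layer, per-channel max |key_states|."""

    def __init__(self, cache):
        self._cache = cache
        self.abs_max = {}  # layer_idx -> list with one maximum per channel

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        stats = self.abs_max.get(layer_idx)
        for sample in key_states:        # batch dimension
            for token in sample:         # sequence dimension
                if stats is None:
                    stats = [0.0] * len(token)
                stats = [max(m, abs(v)) for m, v in zip(stats, token)]
        if stats is not None:
            self.abs_max[layer_idx] = stats
        # Forward the call so the model behaves exactly as before.
        return self._cache.update(key_states, value_states, layer_idx, cache_kwargs)

    def __getattr__(self, name):
        # Delegate every other attribute to the wrapped cache.
        return getattr(self._cache, name)


cache = KeyAbsMaxObserver(ToyCache())
cache.update([[[0.1, -2.0], [0.4, 1.0]]], None, layer_idx=0)
cache.update([[[-0.9, 0.5]]], None, layer_idx=0)
print(cache.abs_max[0])
```

Because the wrapper forwards update() unchanged, the calibration forward pass produces the same outputs while the statistics accumulate across calibration batches.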

Smoothing phase

  • Phase: postprocess.
  • Calculate the scaling vector from the maximum value of |key_states|, and rewrite the weight (and optional bias) of the corresponding module located before RoPE according to the fusion mode, so that the key_states written to the KVCache after RoPE are smoothed. The query_states are amplified by the same factors.
    • state-rope-linear: Fold the scaling into k_proj/q_proj along the path of Linear → RoPE → KVCache.
    • state-rope-norm: Fold the scaling into k_norm/q_norm along the path of Norm → RoPE → KVCache.
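
For the state-rope-linear path, the fusion can be sketched as a row-wise rescaling of the projection weights. The sketch below assumes a single head, ignores grouped-query attention, and uses made-up weights and factors; `linear` and `fold_scale` are hypothetical helpers, and the way msModelSlim derives s from the statistics and smooth_factor is not reproduced here.

```python
import math


def linear(weight, bias, x):
    """y = W x + b, with W stored as [out_features][in_features]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def fold_scale(weight, bias, factors):
    """Multiply each output channel of a Linear layer by its factor."""
    new_weight = [[w * f for w in row] for row, f in zip(weight, factors)]
    new_bias = [b * f for b, f in zip(bias, factors)]
    return new_weight, new_bias


w_k = [[0.2, -0.1], [4.0, -3.0], [0.1, 0.3], [-0.2, 0.1]]
b_k = [0.0, 1.0, 0.0, 0.0]
w_q = [[0.5, 0.1], [-0.3, 0.2], [0.4, -0.6], [0.1, 0.7]]
b_q = [0.1, 0.0, -0.1, 0.0]
s = [1.0, 8.0, 1.0, 8.0]  # one factor per RoPE pair: (0, 2) and (1, 3)

# K' = K / s is folded into k_proj, Q' = Q * s into q_proj.
w_k_s, b_k_s = fold_scale(w_k, b_k, [1.0 / f for f in s])
w_q_s, b_q_s = fold_scale(w_q, b_q, s)

x = [1.5, -2.0]
k, q = linear(w_k, b_k, x), linear(w_q, b_q, x)
k_s, q_s = linear(w_k_s, b_k_s, x), linear(w_q_s, b_q_s, x)

print("max |K| before:", max(abs(v) for v in k), "after:", max(abs(v) for v in k_s))
print("QK^T before:", sum(a * b for a, b in zip(q, k)),
      "after:", sum(a * b for a, b in zip(q_s, k_s)))
```

Since the factors are folded into the weights once, inference needs no extra operator. For state-rope-norm, the same per-channel multiplication would presumably be applied to the element-wise weight of k_norm and q_norm.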

Application Requirements

  • Data dependency on the calibration dataset: The suppression scaling factor is derived from statistics observed during calibration inference. If the data distribution of the calibration dataset deviates from that of the actual service, the smoothing effect degrades.
  • Model implementation constraints: The attention forward process must accept and use past_key_values or past_key_value. Otherwise, the suppression scaling factor cannot be observed.
  • Fusion point constraints: Currently, fusion is supported for two paths only: Linear → RoPE → KVCache and Norm → RoPE → KVCache.
  • Fusion module constraints: The target Linear or Norm submodule must exist and have a writable weight (and optional bias). Other custom modules are not supported for now.
  • RoPE assumption: By default, the algorithm reduces and restores the scaling factors over paired channels, which assumes RoPE channel pairing. Evaluate and verify non-RoPE structures with caution.
  • Quantization method assumption: The algorithm assumes that only key_states and value_states of the KVCache are quantized and that query_states are not. If query_states are also quantized, evaluate the applicability of this method with caution.

Function Description

YAML Configuration Example

The following example shows a YAML configuration when the algorithm is used as a processor:

spec:
  process:
    - type: "kv_smooth"
      smooth_factor: 1.0                    # Specifies the degree of smoothing aggressiveness. The value must be greater than 0, and a larger value indicates more aggressive smoothing.
      include: ["*"]                        # Specifies the layers to be included. Wildcard matching is supported.
      exclude: ["model.layers.0.self_attn"] # Specifies the layers to be excluded. Wildcard matching is supported.

YAML Configuration Fields

| Field | Purpose | Type | Default Value | Description | Example |
| --- | --- | --- | --- | --- | --- |
| type | Specifies the processor type. | str | "kv_smooth" | The value is fixed to kv_smooth. | "kv_smooth" |
| smooth_factor | Specifies the degree of smoothing aggressiveness. | float | 1.0 | The value must be greater than 0, and a larger value indicates more aggressive smoothing. | 1.5 |
| include | Specifies the modules to be included for smoothing. | List[str] | ["*"] | Wildcard matching is supported. | ["model.layers.*.self_attn"] |
| exclude | Specifies the modules to be excluded from smoothing. | List[str] | [] | Wildcard matching is supported. | ["model.layers.0.self_attn"] |

Note:

  • smooth_factor must be greater than 0.
  • include and exclude support wildcard matching, for example, model.layers.*.self_attn.
  • exclude has higher priority than include. That is, if a module matches both include and exclude, the module will be excluded.
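
The include/exclude priority rule can be sketched as follows. The sketch assumes shell-style wildcard semantics as provided by Python's fnmatch module; whether msModelSlim uses exactly these semantics is an assumption, and `select_modules` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def select_modules(names, include=("*",), exclude=()):
    """Keep names that match any include pattern and no exclude pattern."""
    return [
        name
        for name in names
        if any(fnmatchcase(name, pattern) for pattern in include)
        and not any(fnmatchcase(name, pattern) for pattern in exclude)
    ]


names = [f"model.layers.{i}.self_attn" for i in range(3)]
selected = select_modules(
    names,
    include=["model.layers.*.self_attn"],
    exclude=["model.layers.0.self_attn"],
)
print(selected)
```

Layer 0 matches both lists and is dropped, which reflects the rule that exclude takes priority over include.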

Model Adaptation

Interface and Data Structure

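# Imports required by the definitions below (BaseModel is assumed to be pydantic's).
from abc import ABC
from enum import Enum
from typing import List

from pydantic import BaseModel


# Fusion mode enumeration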
class KVSmoothFusedType(Enum):
    StateViaRopeToNorm = 'state-rope-norm'  # Supports the fusion of key_states/query_states → Norm.
    StateViaRopeToLinear = 'state-rope-linear'  # Supports the fusion of key_states/query_states → Linear.


# Information about the KVSmooth unit, describing the model substructure and fusion mode
class KVSmoothFusedUnit(BaseModel):
    attention_name: str  # Specifies the full module name, such as "model.layers.0.self_attn".
    layer_idx: int  # Specifies the layer index, such as 0.
    fused_from_query_states_name: str  # Specifies the name of the module on the query_states branch before RoPE, such as "q_proj" or "q_norm".
    fused_from_key_states_name: str # Specifies the name of the module on the key_states branch before RoPE, such as "k_proj" or "k_norm".
    fused_type: KVSmoothFusedType  # Specifies the fusion type.


# Interface for adapting models to the KVSmooth algorithm
class KVSmoothFusedInterface(ABC):
    # Retrieve a list of all units in the model that can be processed by using KVSmooth.
    def get_kvsmooth_fused_subgraph(self) -> List[KVSmoothFusedUnit]: ...

    # Obtain the head_dim information.
    def get_head_dim(self) -> int: ...

    # Obtain the num_key_value_groups information.
    def get_num_key_value_groups(self) -> int: ...

    # Obtain the num_key_value_heads information.
    def get_num_key_value_heads(self) -> int: ...

Adaptation Procedure

  • Prerequisites
    • The attention forward process must accept past_key_values or past_key_value through kwargs and call Cache.update() internally. Otherwise, the observer cannot work.
    • The target path must comply with the Linear/Norm → RoPE → KVCache structure.
  • Procedure
    1. Make the model adapter inherit KVSmoothFusedInterface and implement all of its methods. For reference, see msmodelslim/model/qwen3/model_adapter.py.
    2. In get_kvsmooth_fused_subgraph(), return a KVSmoothFusedUnit for each layer and specify the following parameters:
      • attention_name: specifies the full path (for example, model.layers.{i}.self_attn) that is consistent with named_modules().
      • layer_idx: specifies the layer index used for Cache.update().
      • fused_from_query_states_name: specifies the name of the norm or linear submodule on the query_states branch before RoPE, such as q_proj.
      • fused_from_key_states_name: specifies the name of the norm or linear submodule on the key_states branch before RoPE, for example, k_proj.
      • fused_type: specifies the fusion mode. Valid values: StateViaRopeToNorm or StateViaRopeToLinear.
    3. Provide the global structure information of the model by using get_head_dim(), get_num_key_value_heads(), and get_num_key_value_groups().
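
The procedure above can be sketched as a toy adapter. To stay runnable without msModelSlim installed, the sketch redefines simplified stand-ins for the interface types (a dataclass replaces BaseModel); a real adapter would import the originals. `ToyConfig` and `ToyModelAdapter` are hypothetical, with field names borrowed from Hugging Face configuration conventions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import List


# Stand-ins that mirror the definitions in "Interface and Data Structure".
class KVSmoothFusedType(Enum):
    StateViaRopeToNorm = "state-rope-norm"
    StateViaRopeToLinear = "state-rope-linear"


@dataclass
class KVSmoothFusedUnit:
    attention_name: str
    layer_idx: int
    fused_from_query_states_name: str
    fused_from_key_states_name: str
    fused_type: KVSmoothFusedType


class KVSmoothFusedInterface(ABC):
    @abstractmethod
    def get_kvsmooth_fused_subgraph(self) -> List[KVSmoothFusedUnit]: ...

    @abstractmethod
    def get_head_dim(self) -> int: ...

    @abstractmethod
    def get_num_key_value_groups(self) -> int: ...

    @abstractmethod
    def get_num_key_value_heads(self) -> int: ...


@dataclass
class ToyConfig:  # hypothetical model configuration
    num_hidden_layers: int = 2
    hidden_size: int = 64
    num_attention_heads: int = 8
    num_key_value_heads: int = 2


class ToyModelAdapter(KVSmoothFusedInterface):
    def __init__(self, config: ToyConfig):
        self.config = config

    def get_kvsmooth_fused_subgraph(self) -> List[KVSmoothFusedUnit]:
        # One unit per decoder layer; names must match named_modules().
        return [
            KVSmoothFusedUnit(
                attention_name=f"model.layers.{i}.self_attn",
                layer_idx=i,
                fused_from_query_states_name="q_proj",
                fused_from_key_states_name="k_proj",
                fused_type=KVSmoothFusedType.StateViaRopeToLinear,
            )
            for i in range(self.config.num_hidden_layers)
        ]

    def get_head_dim(self) -> int:
        # Some models configure head_dim explicitly; this derivation is an assumption.
        return self.config.hidden_size // self.config.num_attention_heads

    def get_num_key_value_groups(self) -> int:
        return self.config.num_attention_heads // self.config.num_key_value_heads

    def get_num_key_value_heads(self) -> int:
        return self.config.num_key_value_heads


adapter = ToyModelAdapter(ToyConfig())
units = adapter.get_kvsmooth_fused_subgraph()
print([u.attention_name for u in units], adapter.get_head_dim())
```

For a model whose attention applies a norm to Q/K before RoPE, the adapter would instead return names such as q_norm and k_norm with StateViaRopeToNorm.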

FAQ

Fallback Mismatch

Symptom: The warning log contains the message "are not matched any module".

Solution: Check the complete module names to confirm whether include or exclude is incorrectly set.

Missing Head Dimension Information

Symptom: UnsupportedError is thrown, indicating that get_head_dim, get_num_key_value_groups, or get_num_key_value_heads is missing.

Solution: Ensure that the corresponding model adapter implements the KVSmoothFusedInterface. Otherwise, the model is not applicable to the algorithm.

Attention Not Applicable

Symptom: The log contains the warning "past_key_values and past_key_value both are None".

Solution: Check the model file in Transformers to ensure that past_key_values or past_key_value is passed to the forward process of the attention layer. Otherwise, the model is not applicable to the algorithm.

Inconsistent Module Names

Symptom: ToDoError is thrown with a message containing "has no submodule".

Solution: Check the model adapter to ensure that the values of fused_from_query_states_name and fused_from_key_states_name are consistent with the actual names of the fused submodules.