msModelSlim Quantization Weight Format¶

The msmodelslim llm-ptq tool produces quantized weights in the SafeTensors format as two files: the quant_model_weight.safetensors weight file and the quant_model_description.json weight description file.

msModelSlim quantization types:

W8A16: int8 quantization of Linear weights; activations are not quantized.

W8A8: int8 quantization of Linear weights, with int8 activation quantization.

W8A8S: sparse int8 quantization of Linear weights, with int8 activation quantization.

Note: The quantized weights generated by msModelSlim apply only to the signed int8 case, where values range from -128 to 127. If the open-source weights use an unsigned representation, subtract 128 from both the weight and the offset values before using them in int8 operations.
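The shift described in the note can be sketched as follows. This is a minimal illustration rather than code from convert_example.py: the function name `unsigned_to_signed` and the sample values are hypothetical, and NumPy arrays stand in for torch tensors.

```python
import numpy as np

def unsigned_to_signed(weight_u8, offset_u8):
    """Shift unsigned 8-bit weights and their offsets into the signed int8 range."""
    # Widen first so the subtraction cannot wrap around in uint8 arithmetic.
    weight_i8 = (weight_u8.astype(np.int16) - 128).astype(np.int8)
    # Offsets are stored as floating point, so a plain subtraction is enough.
    offset_signed = offset_u8.astype(np.float32) - 128.0
    return weight_i8, offset_signed

# Hypothetical sample data: 2 output channels, 3 input features.
weight_u8 = np.array([[0, 128, 255], [10, 200, 64]], dtype=np.uint8)
offset_u8 = np.array([[128.0], [100.0]], dtype=np.float32)
scale = np.array([[0.5], [0.25]], dtype=np.float32)

weight_i8, offset_signed = unsigned_to_signed(weight_u8, offset_u8)

# Shifting weight and offset by the same amount leaves the dequantized values unchanged.
deq_unsigned = (weight_u8.astype(np.float32) - offset_u8) * scale
deq_signed = (weight_i8.astype(np.float32) - offset_signed) * scale
```

Because the same constant is subtracted from the weight and from the offset, the difference `weight - offset` is preserved, which is why the scale does not need to change.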

The convert_example.py script provides a code sample to convert open-source ChatGLM2-6B weights into the msModelSlim quantization weight format. For details about how to obtain the weights, see the low-cost deployment section in ChatGLM2-6B README. Before executing the script, modify the input and output directory paths in lines 224 and 225. Use the python convert_example.py command to run the script.

Quantized Weight and Description File Formats¶

SafeTensors Weight Format¶

Weights are saved in the SafeTensors format, which loads as a Python dictionary containing both the quantized weights and the unmodified floating-point weights. Each key is a weight name, and each value is the corresponding weight tensor. Take ChatGLM2-6B as an example:

'transformer.embedding.word_embeddings.weight' is the weight of the word_embeddings layer in the floating-point model. Its name and value are unmodified, and its quantization type in the description file is 'FLOAT'.

'transformer.encoder.layers.0.self_attention.dense.weight' is the linear weight of the dense layer in layer 0 of the original model. Quantization has converted it to the int8 data type, and its quantization type in the description file is 'W8A16'.

'transformer.encoder.layers.0.self_attention.dense.weight_scale' is the weight_scale quantization parameter added when the linear operator of the dense layer in layer 0 is quantized. Its quantization type in the description file is 'W8A16'.

Code sample for ChatGLM2-6B W8A16 quantized weights:

{
    'transformer.embedding.word_embeddings.weight': tensor([...]),
    'transformer.encoder.final_layernorm.weight': tensor([...]),
    'transformer.encoder.layers.0.input_layernorm.weight': tensor([...]),
    'transformer.encoder.layers.0.mlp.dense_4h_to_h.weight': tensor([...]),
    'transformer.encoder.layers.0.mlp.dense_4h_to_h.weight_scale': tensor([...]),
    'transformer.encoder.layers.0.mlp.dense_4h_to_h.weight_offset': tensor([...]),
    'transformer.encoder.layers.0.mlp.dense_h_to_4h.weight': tensor([...]),
    'transformer.encoder.layers.0.mlp.dense_h_to_4h.weight_scale': tensor([...]),
    'transformer.encoder.layers.0.mlp.dense_h_to_4h.weight_offset': tensor([...]),
    'transformer.encoder.layers.0.post_attention_layernorm.weight': tensor([...]),
    'transformer.encoder.layers.0.self_attention.dense.weight': tensor([...]),
    'transformer.encoder.layers.0.self_attention.dense.weight_scale': tensor([...]),
    'transformer.encoder.layers.0.self_attention.dense.weight_offset': tensor([...]),
    'transformer.encoder.layers.0.self_attention.query_key_value.weight': tensor([...]),
    'transformer.encoder.layers.0.self_attention.query_key_value.weight_scale': tensor([...]),
    'transformer.encoder.layers.0.self_attention.query_key_value.weight_offset': tensor([...]),
    ...
    # Remaining layers follow the same pattern
    ...
    'transformer.output_layer.weight': tensor([...]),
    'transformer.rotary_pos_emb.inv_freq': tensor([...])
}

JSON Description File Format¶

The JSON description file loads as a Python dictionary. Each key is a weight name, and each value is the quantization type of that weight. Two keys describe the model as a whole: "model_quant_type" indicates the overall quantization type, and "kv_cache_type" indicates whether the KV cache is quantized. The remaining keys specify individual weight types: "FLOAT" indicates an unquantized floating-point weight, "W8A8" indicates W8A8 quantization, "W8A16" indicates W8A16 quantization, and "W8A8S" indicates sparse quantization. The following example shows a description file for ChatGLM2-6B W8A16 quantized weights. The order of the entries in the dictionary does not affect runtime execution.

{
    "model_quant_type": "W8A16",
    "kv_cache_type": "C8", # Indicates KV cache quantization is enabled.
    "transformer.embedding.word_embeddings.weight": "FLOAT",
    "transformer.rotary_pos_emb.inv_freq": "FLOAT",
    "transformer.encoder.layers.0.input_layernorm.weight": "W8A16",
    "transformer.encoder.layers.0.self_attention.query_key_value.weight": "W8A16",
    # If KV cache quantization is enabled, the following four lines are generated:
    "transformer.encoder.layers.0.self_attention.query_key_value.k_proj.kv_cache_scale": "W8A16",
    "transformer.encoder.layers.0.self_attention.query_key_value.k_proj.kv_cache_offset": "W8A16",
    "transformer.encoder.layers.0.self_attention.query_key_value.v_proj.kv_cache_scale": "W8A16",
    "transformer.encoder.layers.0.self_attention.query_key_value.v_proj.kv_cache_offset": "W8A16",
    "transformer.encoder.layers.0.self_attention.query_key_value.weight_scale": "W8A16",
    "transformer.encoder.layers.0.self_attention.query_key_value.weight_offset": "W8A16",
    "transformer.encoder.layers.0.post_attention_layernorm.weight": "FLOAT",
    "transformer.encoder.layers.0.mlp.dense_4h_to_h.weight": "W8A16",
    "transformer.encoder.layers.0.mlp.dense_4h_to_h.weight_scale": "W8A16",
    "transformer.encoder.layers.0.mlp.dense_4h_to_h.weight_offset": "W8A16",
    ...
    # Remaining layers follow the same pattern
    ...
    "transformer.encoder.final_layernorm.weight": "FLOAT",
    "transformer.output_layer.weight": "FLOAT"
}
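As a rough illustration of how a description file with this layout might be consumed, the sketch below separates the two model-level keys from the per-weight entries and groups the weight names by quantization type. The helper name `group_by_quant_type` is hypothetical, and the embedded JSON is a trimmed copy of the example above without the inline annotations, since JSON itself has no comment syntax.

```python
import json
from collections import defaultdict

# Keys that describe the whole model rather than a single weight.
GLOBAL_KEYS = {"model_quant_type", "kv_cache_type"}

def group_by_quant_type(description):
    """Return (model-level settings, {quantization type: [weight names]})."""
    global_settings = {k: v for k, v in description.items() if k in GLOBAL_KEYS}
    groups = defaultdict(list)
    for name, quant_type in description.items():
        if name not in GLOBAL_KEYS:
            groups[quant_type].append(name)
    return global_settings, dict(groups)

# Trimmed-down description, written as the JSON text that would be on disk.
description_text = """
{
    "model_quant_type": "W8A16",
    "kv_cache_type": "C8",
    "transformer.embedding.word_embeddings.weight": "FLOAT",
    "transformer.encoder.layers.0.self_attention.query_key_value.weight": "W8A16",
    "transformer.encoder.layers.0.self_attention.query_key_value.weight_scale": "W8A16",
    "transformer.encoder.layers.0.self_attention.query_key_value.weight_offset": "W8A16",
    "transformer.output_layer.weight": "FLOAT"
}
"""
global_settings, groups = group_by_quant_type(json.loads(description_text))
```

Grouping by type makes it easy to check that every weight marked as quantized has its companion parameters present in the .safetensors file.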

W8A16 Quantization¶

The msModelSlim tool generates three parameters for each quantized linear layer: weight, weight_scale, and weight_offset. In the .safetensors file, the full weight name combines the linear layer name with the parameter identifier. For example, in the ChatGLM2-6B quantized weight "transformer.encoder.layers.0.self_attention.query_key_value.weight_scale", "transformer.encoder.layers.0.self_attention.query_key_value" is the linear layer name, and "weight_scale" is the parameter identifier.
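The naming rule can be expressed as a small helper. This is only a sketch: `split_weight_name` and `group_by_layer` are hypothetical names, and the check is purely syntactic, so it does not tell a quantized linear weight apart from any other tensor whose name ends in ".weight".

```python
# Parameter identifiers that msModelSlim generates for a W8A16 linear layer.
W8A16_PARAMS = ("weight", "weight_scale", "weight_offset")

def split_weight_name(full_name):
    """Split 'layer.name.param' into (linear layer name, parameter identifier)."""
    layer_name, _, param_id = full_name.rpartition(".")
    if not layer_name or param_id not in W8A16_PARAMS:
        raise ValueError(f"not a W8A16 linear parameter: {full_name!r}")
    return layer_name, param_id

def group_by_layer(names):
    """Collect the parameter identifiers found for each linear layer name."""
    layers = {}
    for name in names:
        layer_name, param_id = split_weight_name(name)
        layers.setdefault(layer_name, set()).add(param_id)
    return layers

qkv = "transformer.encoder.layers.0.self_attention.query_key_value"
layers = group_by_layer([f"{qkv}.{p}" for p in W8A16_PARAMS])
layer_name, param_id = split_weight_name(f"{qkv}.weight_scale")
```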

Weight Description¶

weight: quantized weight in int8 precision. The data type is torch.Tensor, the dtype is torch.int8, and its shape matches the original floating-point layout, denoted as n, k = weight.shape, where k represents the hidden_size.

weight_scale: quantization scaling factor. The data type is torch.Tensor and the dtype is torch.float32. The shape is [n] for per_channel scenarios, or [n, k / group_size] for per_group scenarios.

weight_offset: quantization offset coefficient. The data type is torch.Tensor, and its dtype and shape are identical to those of weight_scale. Symmetric quantization scenarios require a zero-filled weight_offset.

Dequantization Formula¶

In per_channel scenarios:

deq_weight = (weight - weight_offset) * weight_scale

In per_group scenarios:

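n, k = weight.shape  # record the original shape before regrouping
weight = weight.reshape((-1, group_size))
# Use integer division: reshape requires integer dimensions
weight_offset = weight_offset.reshape((n * k // group_size, 1))
weight_scale = weight_scale.reshape((n * k // group_size, 1))
deq_weight = ((weight - weight_offset) * weight_scale).reshape((n, k))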
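Both dequantization formulas can be tried on small arrays. The sketch below uses NumPy in place of torch, and the function names and sample values are illustrative rather than taken from the demo. One detail is an assumption about how the per_channel formula is meant to be applied: because weight_scale has shape [n] while weight has shape [n, k], the scale and offset are reshaped to [n, 1] so that each output channel (row) uses its own values.

```python
import numpy as np

def dequant_per_channel(weight, weight_scale, weight_offset):
    """weight: int8 [n, k]; weight_scale and weight_offset: float32 [n]."""
    # Reshape [n] -> [n, 1] so broadcasting runs along each row.
    scale = weight_scale.reshape(-1, 1)
    offset = weight_offset.reshape(-1, 1)
    return (weight.astype(np.float32) - offset) * scale

def dequant_per_group(weight, weight_scale, weight_offset, group_size):
    """weight: int8 [n, k]; weight_scale and weight_offset: float32 [n, k // group_size]."""
    n, k = weight.shape
    num_groups = n * k // group_size
    grouped = weight.astype(np.float32).reshape(-1, group_size)
    offset = weight_offset.reshape(num_groups, 1)
    scale = weight_scale.reshape(num_groups, 1)
    return ((grouped - offset) * scale).reshape(n, k)

weight = np.array([[-128, 0, 127, 5], [10, -10, 20, -20]], dtype=np.int8)

per_channel = dequant_per_channel(
    weight,
    weight_scale=np.array([0.5, 0.25], dtype=np.float32),
    weight_offset=np.array([0.0, 2.0], dtype=np.float32),
)

per_group = dequant_per_group(
    weight,
    weight_scale=np.array([[0.5, 1.0], [0.25, 2.0]], dtype=np.float32),
    weight_offset=np.zeros((2, 2), dtype=np.float32),  # symmetric: zero-filled offset
    group_size=2,
)
```

With group_size=2, each row of the [2, 4] weight is split into two groups, so the scale and offset tensors have shape [2, 2].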

Note: The physical computation logic executed by the NPU quantization operator is (weight + weight_offset) * weight_scale. The Ascend inference framework automatically negates the offset values when loading the quantized weights.
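A small numerical check shows why the two conventions agree once the offset is negated at load time. The values are made up for illustration, and `loaded_offset` is a hypothetical name for the offset as the operator would see it.

```python
import numpy as np

weight = np.array([[12, -7], [3, 100]], dtype=np.int8).astype(np.float32)
weight_scale = np.array([[0.5], [0.125]], dtype=np.float32)
weight_offset = np.array([[4.0], [-2.0]], dtype=np.float32)

# Convention of the weight file: subtract the stored offset.
deq_file = (weight - weight_offset) * weight_scale

# Convention of the operator: add the offset it receives.
# Negating the stored offset on load makes the two expressions identical.
loaded_offset = -weight_offset
deq_npu = (weight + loaded_offset) * weight_scale
```

A converter therefore writes offsets in the subtract convention and leaves the negation to the framework.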

For details about code implementation, refer to MSModelSlimWeightProcessor.weight_process in the provided demo. Modify your implementation based on the respective dequantization formulas of the open-source model weights and the msModelSlim tool.

W8A8 and W8A8S Quantization¶

The msModelSlim quantization tool generates five parameters for each quantized linear layer: weight, input_scale, input_offset, deq_scale, and quant_bias. In the .safetensors weight file, the full weight name combines the linear layer name with the parameter identifier, which is similar to the structure used in W8A16.

Weight Description¶

weight: quantized weight in int8 precision. The data type is torch.Tensor, the dtype is torch.int8, and its shape matches the original floating-point layout, denoted as n, k = weight.shape, where k represents the hidden_size.

input_scale: scaling factor for activation quantization. The data type is torch.Tensor, the dtype is torch.float16 or torch.bfloat16, and the shape is [1].

input_offset: offset coefficient for activation quantization. The data type is torch.Tensor. The dtype and shape are identical to those of input_scale.

deq_scale: scaling factor for dequantization. The data type is torch.Tensor, the dtype is torch.int64 or torch.float32, and the shape is [n]. For compatibility with the Ascend quantization operator, if the open-source model is quantized in fp16 precision, you must convert the data type of deq_scale before passing it to the quantization operator, as shown in line 120 of the code sample. If the quantized weights of the open-source model use bf16 precision, this conversion is not needed.

quant_bias: offset coefficient for dequantization. The data type is torch.Tensor, the dtype is torch.int32, and the shape is [n].

Quantization and Dequantization Formulas¶

input_quant = input_fp / input_scale + input_offset
output_quant = input_quant * weight + quant_bias
output_dequant = output_quant * deq_scale
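The three formulas can be exercised end to end on random data. The sketch below is an illustration under several assumptions that the text does not state: the product in the second formula is taken to be a matrix multiplication against the transposed [n, k] weight, rounding and clipping to int8 are added to the first formula, and quant_bias and deq_scale are derived here in one way that makes the result approximate the floating-point output. It is not the msModelSlim calibration procedure, and NumPy arrays stand in for torch tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 4, 64, 8

# Floating-point reference layer: y = x @ w_fp.T
x_fp = rng.uniform(-1.0, 1.0, size=(m, k)).astype(np.float32)
w_fp = rng.uniform(-0.5, 0.5, size=(n, k)).astype(np.float32)

# Assumed per-channel symmetric weight quantization.
w_scale = np.abs(w_fp).max(axis=1) / 127.0                   # [n]
weight = np.round(w_fp / w_scale[:, None]).astype(np.int8)   # [n, k]

# Assumed per-tensor asymmetric activation quantization.
input_scale = np.float32((x_fp.max() - x_fp.min()) / 255.0)
input_offset = np.float32(np.round(-128.0 - x_fp.min() / input_scale))

# Assumed derivation: fold the activation offset into an int32 bias and
# combine both scales into one dequantization factor per output channel.
weight_sums = weight.astype(np.int32).sum(axis=1)
quant_bias = (-int(input_offset) * weight_sums).astype(np.int32)   # [n]
deq_scale = (input_scale * w_scale).astype(np.float32)             # [n]

# The three formulas from this section.
input_quant = np.clip(np.round(x_fp / input_scale + input_offset), -128, 127).astype(np.int8)
output_quant = input_quant.astype(np.int32) @ weight.astype(np.int32).T + quant_bias
output_dequant = output_quant.astype(np.float32) * deq_scale

# Largest deviation from the floating-point result.
max_error = float(np.abs(output_dequant - x_fp @ w_fp.T).max())
```

The integer matrix product accumulates in int32, which matches the int32 dtype given for quant_bias; only the final multiplication by deq_scale returns to floating point.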

For details about code implementation, refer to MSModelSlimWeightProcessor.weight_activation_process in the provided demo. Modify your implementation based on the respective calculation formulas of the open-source model weights and the msModelSlim tool.

Smooth Quant¶

After Smooth Quant is applied in msModelSlim, two parameters are generated for each normalization layer: module.weight and module.bias. The full weight name combines the normalization layer name with the parameter identifier. For example, in the ChatGLM2-6B quantized weights, "transformer.encoder.layers.0.input_layernorm.module.weight" combines the normalization layer name "transformer.encoder.layers.0.input_layernorm" with the parameter identifier "module.weight".

The Smooth Quant algorithm integrated into msModelSlim applies smooth operations specifically to the linear layer that follows a normalization layer, rather than to all linear layers. The structural advantage of this quantization strategy is that the scaling factor originally applied to the activation values can be mathematically migrated to the weight parameter (norm.weight) of the normalization layer within the original floating-point model. This optimization eliminates the performance overhead of executing extra operator layers.

module.weight: scaled normalization weight parameter (norm.weight). The data type, dtype, and shape match those of norm.weight.

module.bias: offset coefficient introduced after applying module.weight. The data type, dtype, and shape match those of norm.weight.
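The scale migration described in this section can be checked numerically. The sketch below assumes the usual Smooth Quant convention, in which activations are divided by a per-channel factor and the following linear weight is multiplied by it. The factor `smooth_scale` and all values are hypothetical, and module.bias is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, out_features = 8, 4

x_hat = rng.normal(size=(2, hidden))                   # normalized activations, before norm.weight
norm_weight = rng.uniform(0.5, 1.5, size=hidden)       # original norm.weight
linear_weight = rng.normal(size=(out_features, hidden))
smooth_scale = rng.uniform(0.5, 4.0, size=hidden)      # hypothetical per-channel smoothing factor

# Original model: the normalization output feeds the linear layer directly.
reference = (x_hat * norm_weight) @ linear_weight.T

# Smoothed model: the division by smooth_scale is folded into the norm weight,
# so no extra operator runs between the normalization and the linear layer.
module_weight = norm_weight / smooth_scale
smoothed_linear_weight = linear_weight * smooth_scale  # scales each input channel (column)
smoothed = (x_hat * module_weight) @ smoothed_linear_weight.T
```

The two outputs agree up to floating-point rounding, which is why only the linear layer directly after a normalization layer can absorb the factor at no runtime cost.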

To support specific fallback layers, the Smooth Quant weights generated by msModelSlim also include the original floating-point weight of the normalization layer (norm.weight). If the open-source model weights do not involve fallback scenarios, set this parameter to None.

For details about code implementation, refer to MSModelSlimWeightProcessor.anti_outlier_process in the provided demo. Modify your implementation based on the respective design strategies of the open-source model weights and the msModelSlim tool.

KV Cache Quantization¶

The KV Cache quantization feature provided by msModelSlim uses int8 precision. Four parameters are generated for each attention layer: k_proj.kv_cache_scale, k_proj.kv_cache_offset, v_proj.kv_cache_scale, and v_proj.kv_cache_offset. In fused scenarios (such as fused QKV or fused KV layers), the full names of these four parameters combine the fused linear layer name with the parameter identifier. In unfused scenarios, the full parameter names for the k_proj scaling factor and offset combine the linear layer name of the K-projection layer with their respective parameter identifiers; the parameter names for the v_proj scaling factor and offset follow the same pattern using the V-projection layer name. For example, in "transformer.encoder.layers.0.self_attention.query_key_value.k_proj.kv_cache_scale", the substring "transformer.encoder.layers.0.self_attention.query_key_value" is the fused QKV linear layer name, and "k_proj.kv_cache_scale" is the parameter identifier.
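The naming rule for fused and unfused layers might be sketched as below. The helper `kv_cache_param_names` is hypothetical, and the unfused branch reflects one reading of the rule above, namely that the K-projection or V-projection layer name is followed directly by kv_cache_scale or kv_cache_offset. The unfused layer names in the usage lines are illustrative and do not come from ChatGLM2-6B.

```python
KV_CACHE_PARAMS = ("kv_cache_scale", "kv_cache_offset")

def kv_cache_param_names(k_layer, v_layer=None):
    """Build the four KV cache parameter names for one attention layer.

    Fused case: pass only the fused linear layer name as k_layer.
    Unfused case: pass the K-projection and V-projection layer names.
    """
    if v_layer is None:
        # Fused: the identifier carries the k_proj / v_proj prefix.
        return [f"{k_layer}.{proj}.{param}"
                for proj in ("k_proj", "v_proj") for param in KV_CACHE_PARAMS]
    # Unfused (assumed reading): the layer name already identifies the projection.
    return [f"{layer}.{param}"
            for layer in (k_layer, v_layer) for param in KV_CACHE_PARAMS]

fused = kv_cache_param_names(
    "transformer.encoder.layers.0.self_attention.query_key_value"
)
unfused = kv_cache_param_names(
    "model.layers.0.self_attn.k_proj",  # hypothetical unfused layer names
    "model.layers.0.self_attn.v_proj",
)
```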

Weight Description¶

kv_cache_scale: scaling factor for KV Cache quantization. The data type is torch.Tensor, the dtype is torch.float32 or torch.float16, and the shape matches the size of the KV channel. In unfused scenarios, the shape matches the n-dimension of the linear layer for K or V projections, as detailed in the weight shape description within the Weight Description of the W8A16 Quantization section.

kv_cache_offset: offset coefficient for KV Cache quantization. The data type, dtype, and shape match those of kv_cache_scale.

Quantization formula:

cache_int = cache_fp / cache_scale + cache_offset

Dequantization formula:

cache_deq = (cache_int - cache_offset) * cache_scale
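A round trip through both formulas illustrates the expected precision loss. In this sketch, cache_scale and cache_offset correspond to the kv_cache_scale and kv_cache_offset parameters, with one value per channel. Rounding and clipping to the int8 range are assumptions added here, since the quantization formula leaves them implicit, and the sample values are made up.

```python
import numpy as np

def quantize_cache(cache_fp, cache_scale, cache_offset):
    # Rounding and clipping to int8 are assumptions; the formula leaves them implicit.
    q = np.round(cache_fp / cache_scale + cache_offset)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_cache(cache_int, cache_scale, cache_offset):
    return (cache_int.astype(np.float32) - cache_offset) * cache_scale

# Toy key cache: 3 tokens, 4 channels, one scale and offset per channel.
cache_fp = np.array([[0.10, -0.50, 1.00, 2.00],
                     [0.00, 0.25, -1.00, -2.00],
                     [-0.10, 0.50, 0.30, 0.70]], dtype=np.float32)
cache_scale = np.array([0.001, 0.004, 0.008, 0.016], dtype=np.float32)
cache_offset = np.array([0.0, 0.0, 2.0, -3.0], dtype=np.float32)

cache_int = quantize_cache(cache_fp, cache_scale, cache_offset)
cache_deq = dequantize_cache(cache_int, cache_scale, cache_offset)

# Largest round-trip error per channel.
round_trip_error = np.abs(cache_deq - cache_fp).max(axis=0)
```

As long as no value is clipped, the round-trip error in each channel stays within half of that channel's scale.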

For details about the code implementation, refer to MSModelSlimWeightProcessor.kv_cache_process in the provided demo. Modify your implementation based on the respective calculation formulas of the open-source model weights and the msModelSlim tool.