Offline Precision Pre-check for MindSpore Dynamic Graphs

Overview

This function scans all MindSpore Mint APIs in a trained MindSpore model, as well as the MindSpore APIs ported in the MSAdapter scenario, running on Ascend NPUs, and outputs diagnostic and analytical information about their precision. The tool takes the dump results of all APIs in the model as input, constructs a unit test for each API, compares the NPU output against a high-precision CPU benchmark, and calculates precision metrics to identify APIs with precision issues on the NPU. It supports both random generation mode and real data mode.

Concepts

  • Mint API: An API generated by MindSpore during dynamic graph execution, corresponding to a PyTorch API.

  • MSAdapter: A compatibility layer that enables some PyTorch code to run on MindSpore.

  • Random generation mode: Input data is constructed automatically from the value range recorded in the dump. The results are slightly less accurate than in real data mode, which makes this mode suitable for quickly locating potential precision issues.

  • Real data mode: The real input data saved by the model dump is used for comparison, which produces more reliable results.
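
To illustrate the idea behind random generation mode, the following minimal sketch draws input values uniformly from the Min/Max range recorded for a dumped tensor. It is not msProbe's implementation: the helper name random_input_from_stats, the uniform distribution, and the flat-list output are assumptions made for illustration only.

```python
import math
import random


def random_input_from_stats(entry, seed=0):
    """Build a flat list of random values matching a dumped tensor's size and value range.

    `entry` is one tensor record shaped like those under "input_args" in dump.json.
    Hypothetical helper: the generator used by msProbe may differ.
    """
    rng = random.Random(seed)
    count = math.prod(entry["shape"])  # total number of elements
    if entry["dtype"] == "Bool":
        # Bool tensors record Max/Min as true/false rather than numbers.
        choices = sorted({bool(entry["Min"]), bool(entry["Max"])})
        return [rng.choice(choices) for _ in range(count)]
    low, high = float(entry["Min"]), float(entry["Max"])
    return [rng.uniform(low, high) for _ in range(count)]


# Illustrative statistics record (values invented for the demo).
stats = {"type": "mindspore.Tensor", "dtype": "Float32", "shape": [2, 3], "Max": 1.5, "Min": -0.5}
values = random_input_from_stats(stats)
print(len(values), min(values) >= -0.5, max(values) <= 1.5)
```

Because only the value range is preserved, the generated data does not reproduce the distribution of the real inputs, which is why real data mode is the more reliable of the two.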

Offline Pre-check Process

The operation procedure is as follows:

  1. Install msProbe in the NPU environment.
  2. Add the PrecisionDebugger API of msProbe to the NPU training script to collect the data to be pre-checked. For details, see Precision Data Collection in MindSpore. Note that you need to set level to L1.
  3. Perform the pre-check, view the pre-check result file, and analyze the APIs that do not meet the pre-check requirements.

Preparations

Environment Setup

Install msProbe by referring to msProbe Installation Guide.

Constraints

Only MindSpore dynamic graph scenarios are supported.

Quick Start

Data Preparation

Create a dump.json file in the current directory to simulate the dump output file. The file content is as follows:

{
    "task": "statistics",
    "level": "L1",
    "dump_data_dir": null,
    "framework": "mindspore",
    "data": {
        "Mint.where.0.forward": {
            "input_args": [
             {
              "type": "mindspore.Tensor",
              "dtype": "Bool",
              "shape": [
               1,
               4096
              ],
              "Max": false,
              "Min": false,
              "Mean": null,
              "Norm": null
             },
             {
              "type": "int",
              "value": 0
             },
             {
              "type": "int",
              "value": 1
             }
            ],
            "input_kwargs": {},
            "output": [
             {
              "type": "mindspore.Tensor",
              "dtype": "Int64",
              "shape": [
               1,
               4096
              ],
              "Max": 1.0,
              "Min": 1.0,
              "Mean": 1.0,
              "Norm": 64.0
             }
            ]
           }}}
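
Before running the pre-check, it can help to confirm which APIs a dump file contains. The sketch below walks the "data" section of a structure like the sample above and prints each API name with a short description of its inputs. The helper name summarize_dump is hypothetical, and the sketch assumes that every tensor entry carries "dtype" and "shape" fields as in the sample.

```python
import json


def summarize_dump(dump):
    """Return (api_name, [input descriptions]) pairs for each entry under "data"."""
    summary = []
    for api_name, record in dump.get("data", {}).items():
        inputs = []
        for arg in record.get("input_args", []):
            if arg.get("type") == "mindspore.Tensor":
                inputs.append(f"Tensor[{arg['dtype']}]{arg['shape']}")
            else:
                inputs.append(f"{arg.get('type')}={arg.get('value')}")
        summary.append((api_name, inputs))
    return summary


# A trimmed copy of the sample above, inlined so that the sketch runs on its own.
sample = json.loads("""
{
  "task": "statistics", "level": "L1", "dump_data_dir": null, "framework": "mindspore",
  "data": {
    "Mint.where.0.forward": {
      "input_args": [
        {"type": "mindspore.Tensor", "dtype": "Bool", "shape": [1, 4096],
         "Max": false, "Min": false, "Mean": null, "Norm": null},
        {"type": "int", "value": 0},
        {"type": "int", "value": 1}
      ],
      "input_kwargs": {},
      "output": []
    }
  }
}
""")

for name, inputs in summarize_dump(sample):
    print(name, inputs)
```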

Running the Command

msprobe acc_check -api_info ./dump.json

For details about the result, see Pre-check Result Description.

Function Overview

Using acc_check for Pre-check

Function

The acc_check command is used to perform unit tests on all API execution units recorded in the dump.json file, compare the output differences between the NPU and CPU, and generate the forward and backward precision results. This function is applicable to API precision pre-check in single-device scenarios.

Precautions

  • The pre-check depends on the actual dump data. Ensure that level is set to L1 or mix for dump.

  • In random generation mode, the -save_error_data parameter saves additional input and output files. Check the available drive capacity in advance.
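
The dump-level precaution can be checked before acc_check is started. The following sketch assumes that the "level" field of dump.json reflects the level used for the dump, as in the Quick Start sample. The helper name check_dump_level is hypothetical and is not part of msProbe.

```python
import json
import os
import tempfile

ALLOWED_LEVELS = {"L1", "mix"}  # levels named in the precautions above


def check_dump_level(dump_json_path):
    """Return the dump level, raising ValueError if it is not usable for the pre-check."""
    with open(dump_json_path, encoding="utf-8") as f:
        level = json.load(f).get("level")
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"dump level is {level!r}; expected one of {sorted(ALLOWED_LEVELS)}")
    return level


# Demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dump.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"task": "statistics", "level": "L1", "data": {}}, f)
    level_found = check_dump_level(path)
    print(level_found)
```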

Syntax

msprobe acc_check -api_info <dump_json_path> [-o <out_path>] [-csv_path <result_csv_path>] [-save_error_data]

Optional fields are enclosed in square brackets ([]), and variables are enclosed in angle brackets (<>).

Parameters

Parameter | Mandatory (Yes/No) | Description
-api_info or --api_info_file | Yes | Path of the API information file dump.json (string). Mint APIs and some Tensor APIs are pre-checked. For the Tensor APIs supported by the pre-check, see List of APIs That Can Be Checked.
-o or --out_path | No | Path for saving the pre-check result (string). The default value is ./.
-csv_path or --result_csv_path | No | Path of the accuracy_checking_result_{timestamp}.csv file produced by an interrupted run (string). Set this parameter to resume acc_check from the interruption point. It must point to the file generated by the most recently interrupted run. For details, see Resumable Check.
-save_error_data | No | Saves the input and output data of APIs that do not meet the precision requirements (random generation mode).

Example 1: Perform a pre-check.

msprobe acc_check -api_info ./dump.json -o ./checker_result

The execution result of acc_check is generated in the path specified by the -o parameter, including the accuracy_checking_result_{timestamp}.csv and accuracy_checking_details_{timestamp}.csv files. accuracy_checking_result_{timestamp}.csv contains API-level data and indicates whether each API has passed the test. It is recommended to view the accuracy_checking_result_{timestamp}.csv file first. For APIs that failed the test or APIs with special focus, query the status of each output and comparison metric in the accuracy_checking_details_{timestamp}.csv file based on the API Name field. For details, see Pre-check Result Description.
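
The review order described above (result file first, then details for the failed APIs) can be sketched with Python's csv module. The column names come from Pre-check Result Description. The sample rows are made up for illustration, and the assumption that a details row's API Name starts with the API-level name followed by a dot (for example, Mint.add.0.forward.output.0) should be verified against your own files.

```python
import csv
import io


def failed_apis(result_rows):
    """Return API names whose forward or backward status is 'error'."""
    failed = []
    for row in result_rows:
        statuses = (row.get("Forward Test Success"), row.get("Backward Test Success"))
        if "error" in statuses:
            failed.append(row["API Name"])
    return failed


def details_for(api_name, detail_rows):
    """Return the detail rows of one API (prefix matching is an assumption)."""
    return [
        r for r in detail_rows
        if r["API Name"] == api_name or r["API Name"].startswith(api_name + ".")
    ]


# Made-up rows for illustration; real files are read with open(path, newline="").
result_csv = io.StringIO(
    "API Name,Forward Test Success,Backward Test Success,Message\n"
    "Mint.where.0,pass,,\n"
    "Mint.add.0,error,pass,\n"
)
details_csv = io.StringIO(
    "API Name,Cosine,MaxAbsErr,Status\n"
    "Mint.where.0.forward.output.0,1.0,0.0,pass\n"
    "Mint.add.0.forward.output.0,0.95,0.01,error\n"
    "Mint.add.0.backward.output.0,1.0,0.0,pass\n"
)
result_rows = list(csv.DictReader(result_csv))
detail_rows = list(csv.DictReader(details_csv))

for api in failed_apis(result_rows):
    for row in details_for(api, detail_rows):
        print(row["API Name"], row["Status"], row["Cosine"], row["MaxAbsErr"])
```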

In random generation mode, to save the input and output data that does not meet the requirements, append -save_error_data to the acc_check command.

Example 2: Save the input and output data that does not meet the requirements.

msprobe acc_check -api_info ./dump.json -o ./checker_result -save_error_data

The data of the APIs that do not meet the requirements will be saved in:

{out_path}/error_data/

Using multi_acc_check for Multi-thread Pre-check

Function

multi_acc_check runs acc_check concurrently on multiple NPUs to accelerate API precision pre-check for large models, such as 7B, 13B, and 38B models.

Precautions

  • All specified devices must be idle, because multi_acc_check starts acc_check Python processes on them concurrently.

  • The output results of multiple devices do not overwrite each other, and subdirectories with different timestamps are automatically created.

Syntax

msprobe multi_acc_check -api_info <dump_json_path> [-d <device_list>] [-o <out_path>] [-csv_path <result_csv_path>] [-save_error_data]

Parameters

Parameter | Description | Type | Mandatory (Yes/No)
-api_info or --api_info_file | Path of the API information file dump.json. Mint APIs and some Tensor APIs are pre-checked. For the Tensor APIs supported by the pre-check, see List of APIs That Can Be Checked. | str | Yes
-o or --out_path | Path for saving the pre-check result. The default value is ./. | str | No
-csv_path or --result_csv_path | Path of the accuracy_checking_result_{timestamp}.csv file produced by an interrupted run. Set this parameter to resume execution from the interruption point. It must point to the file generated by the most recently interrupted run. For details, see Resumable Check. | str | No
-d or --device | IDs of the devices on which acc_check runs. The default value is 0. Valid IDs range from 0 to the total number of devices minus 1, for example, 0 1 2 3 4. | List[int] | No
-save_error_data | Saves the input and output data of APIs that do not meet the precision requirements (random generation mode). | No value (flag) | No

For details about the pre-check time consumption baseline of a 38B language model with different numbers of devices, see Time Reference Baseline in "multi_acc_check" Mode.

Example

msprobe multi_acc_check -api_info ./dump.json -d 0 1 2 3

If -save_error_data is specified, the data of the APIs that do not meet the requirements will be saved in:

./ut_error_data{timestamp}/
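
When the pre-check is launched from a script, the command line can be assembled from the syntax above. The sketch below only builds and validates the argument list. The helper name build_multi_acc_check_cmd and the total_devices default of 8 are assumptions for illustration and are not part of msProbe.

```python
import shlex


def build_multi_acc_check_cmd(api_info, devices=(0,), out_path=None,
                              save_error_data=False, total_devices=8):
    """Assemble the argument list for `msprobe multi_acc_check`.

    `total_devices` is an assumed device count used only to validate the IDs.
    """
    for dev in devices:
        if not 0 <= dev < total_devices:
            raise ValueError(f"device ID {dev} is outside 0..{total_devices - 1}")
    cmd = ["msprobe", "multi_acc_check", "-api_info", api_info, "-d", *map(str, devices)]
    if out_path is not None:
        cmd += ["-o", out_path]
    if save_error_data:
        cmd.append("-save_error_data")
    return cmd


cmd = build_multi_acc_check_cmd("./dump.json", devices=[0, 1, 2, 3])
print(shlex.join(cmd))
# To launch it from Python: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.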

Output Description

After the multi_acc_check pre-check is performed, two CSV files are generated for each device. For details, see Pre-check Result Description.

Resumable Check

Function

Resumable check allows you to continue the check from the interruption point after a pre-check is interrupted due to environment or data scale issues, without needing to perform the comparison again.

Precautions

  • Ensure that the accuracy_checking_result_{timestamp}.csv and accuracy_checking_details_{timestamp}.csv files have not been modified.

  • The .csv file generated during the last interruption must be used. Using any other file may result in incorrect resumable check behavior.

  • Resumable check does not create a new .csv file. Instead, it appends the result to the original file.

Example

msprobe acc_check -api_info ./dump.json -csv_path xxx/accuracy_checking_result_{timestamp}.csv

The accuracy_checking_result_{timestamp}.csv file that was interrupted last time must be used. Do not modify the accuracy_checking_result_{timestamp}.csv and accuracy_checking_details_{timestamp}.csv files or their names. Otherwise, the results of the resumable check cannot be guaranteed.
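
To reduce the risk of passing the wrong file, a script can locate the result file before building the resume command. The sketch below picks the most recently modified accuracy_checking_result_*.csv in an output directory. Treating the newest file as the one from the interrupted run is an assumption that should be confirmed manually, and the helper names are hypothetical.

```python
import glob
import os
import tempfile


def latest_result_csv(out_dir):
    """Return the most recently modified accuracy_checking_result_*.csv in out_dir, or None."""
    pattern = os.path.join(out_dir, "accuracy_checking_result_*.csv")
    candidates = glob.glob(pattern)
    return max(candidates, key=os.path.getmtime) if candidates else None


def resume_command(api_info, out_dir):
    """Build the acc_check resume command, or return None if there is nothing to resume."""
    csv_path = latest_result_csv(out_dir)
    if csv_path is None:
        return None
    return ["msprobe", "acc_check", "-api_info", api_info, "-csv_path", csv_path]


# Demonstration with placeholder files (names and modification times are invented).
with tempfile.TemporaryDirectory() as tmp:
    for i, name in enumerate(["accuracy_checking_result_A.csv", "accuracy_checking_result_B.csv"]):
        path = os.path.join(tmp, name)
        open(path, "w").close()
        os.utime(path, (1000 + i, 1000 + i))  # make B newer than A
    demo_cmd = resume_command("./dump.json", tmp)
    print(os.path.basename(demo_cmd[-1]))
```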

Pre-check Result Description

The content of the accuracy_checking_result_{timestamp}.csv and accuracy_checking_details_{timestamp}.csv files generated during precision pre-check is as follows.

accuracy_checking_details_{timestamp}.csv

Field | Meaning
API Name | API name.
Bench Dtype | Data type of the benchmark API data.
Tested Dtype | Data type of the checked API data.
Shape | API shape.
Cosine | Cosine similarity between the checked data and the benchmark data.
MaxAbsErr | Maximum absolute error between the checked data and the benchmark data.
MaxRelativeErr | Maximum relative error between the checked data and the benchmark data.
Status | API pre-check status. The value pass indicates that the test is passed, and the value error indicates that the test is not passed.
Message | Message.

Note: PyTorch does not support backpropagation through tensors with an integer dtype, whereas MindSpore does. The pre-check for the backward process therefore compares only outputs with a floating-point dtype.

accuracy_checking_result_{timestamp}.csv

Field | Meaning
API Name | API name.
Forward Test Success | Whether the forward API passes the test. The value can be pass or error.
Backward Test Success | Whether the backward API passes the test. The value can be pass or error. If the value is empty, the API has no backward output.
Message | Message.

The pass/error status of Forward Test Success and Backward Test Success is determined by the cosine similarity and maximum absolute error values recorded in accuracy_checking_details_{timestamp}.csv. For details about the rules, see API Pre-check Metrics. Note that an API in accuracy_checking_details_{timestamp}.csv may have multiple forward or backward outputs. In this case, each output is recorded in a separate row. In accuracy_checking_result_{timestamp}.csv, the result is marked as pass only if all outputs of the API are pass. If any output is error, the result is marked as error.
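
The aggregation rule described above can be expressed in a few lines. This is a minimal sketch of the rule as stated, not the code used by msProbe. The function name aggregate_status is hypothetical.

```python
def aggregate_status(output_statuses):
    """Combine per-output statuses into one API-level status.

    Returns "" when there are no outputs (for example, an API without backward output).
    """
    if not output_statuses:
        return ""
    return "pass" if all(s == "pass" for s in output_statuses) else "error"


print(aggregate_status(["pass", "pass"]))
print(aggregate_status(["pass", "error"]))
print(repr(aggregate_status([])))
```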

API Pre-check Metrics

  • API pre-check metrics are used to determine whether an API meets the precision standards by checking the cosine similarity and maximum absolute error in accuracy_checking_details_{timestamp}.csv.
  • If the cosine similarity is greater than 0.99 and the maximum absolute error is less than 0.0001, the result is marked as pass. Otherwise, the result is marked as error.
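
The two metrics and the thresholds above can be reproduced on small vectors to see how the judgment works. The following pure-Python sketch operates on flat lists. The handling of all-zero vectors is an assumption, and the tool's own computation may differ in such edge cases.

```python
import math

COSINE_THRESHOLD = 0.99          # from the rule above
MAX_ABS_ERR_THRESHOLD = 0.0001   # from the rule above


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: two all-zero outputs count as identical.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def max_abs_err(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def judge(tested, bench):
    """Return (cosine, max_abs_err, status) for one output."""
    cos = cosine_similarity(tested, bench)
    err = max_abs_err(tested, bench)
    status = "pass" if cos > COSINE_THRESHOLD and err < MAX_ABS_ERR_THRESHOLD else "error"
    return cos, err, status


bench = [1.0, 2.0, 3.0]
print(judge([1.0, 2.0, 3.00001], bench)[2])  # small deviation
print(judge([1.0, 2.0, 3.1], bench)[2])      # deviation above the absolute-error threshold
```

The second case shows that a high cosine similarity alone is not sufficient: the vectors are nearly parallel, but the absolute error of about 0.1 exceeds the threshold, so the output is marked as error.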