Offline Precision Pre-check for MindSpore Dynamic Graphs¶

Overview¶

This tool scans all MindSpore Mint APIs in a trained MindSpore model, along with the MindSpore APIs ported in the MSAdapter scenario, running on Ascend NPUs, and outputs precision diagnostics and analysis. It takes the dump results of all APIs in the model as input, constructs a unit test for each API, compares the NPU output against a high-precision CPU benchmark, and calculates precision metrics to identify APIs with precision issues on the NPU. It supports both random generation mode and real data mode.

Concepts

Mint API: An API generated by MindSpore during dynamic graph execution, corresponding to a PyTorch API.
MSAdapter: A compatibility layer that enables some PyTorch code to run on MindSpore.
Random generation mode: Input data is automatically constructed based on the value range recorded in the dump. The check is slightly less accurate than in real data mode, so this mode is suited to quickly locating potential precision issues.
Real data mode: The real input data generated by model dump is used for comparison, resulting in more reliable results.
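
As a rough illustration of random generation mode, the sketch below builds a flat list of input values from the statistics that a dumped tensor entry records (dtype, shape, Min, and Max). The helper name `random_input_from_stats` and the uniform sampling strategy are assumptions made for illustration; they are not msProbe's actual generator.

```python
import math
import random


def random_input_from_stats(entry, seed=0):
    """Build a flat list of random values from a dumped tensor's statistics.

    Hypothetical helper: uniform sampling within [Min, Max] is an assumption,
    not msProbe's actual generation strategy.
    """
    rng = random.Random(seed)
    count = math.prod(entry["shape"])
    low, high = entry["Min"], entry["Max"]
    if entry["dtype"] == "Bool":
        # With Min == Max the tensor is constant; otherwise pick randomly.
        return [low if low == high else rng.random() < 0.5 for _ in range(count)]
    if entry["dtype"].startswith("Int"):
        return [rng.randint(int(low), int(high)) for _ in range(count)]
    return [rng.uniform(low, high) for _ in range(count)]


# Statistics in the style of a dumped Bool tensor entry.
stats = {"type": "mindspore.Tensor", "dtype": "Bool", "shape": [1, 4096],
         "Max": False, "Min": False}
values = random_input_from_stats(stats)
print(len(values), any(values))
```

Because only the value range is known, data generated this way can differ from what the model actually saw, which is why real data mode gives more reliable results.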

Offline Pre-check Process

The operation procedure is as follows:

Install msProbe in the NPU environment.
Add the PrecisionDebugger API of msProbe to the NPU training script to collect the data to be pre-checked. For details, see Precision Data Collection in MindSpore. Note that you need to set level to L1.
Perform the pre-check, view the pre-check result file, and analyze the APIs that do not meet the pre-check requirements.

Preparations¶

Environment Setup

Install msProbe by referring to msProbe Installation Guide.

Constraints

Only MindSpore dynamic graph scenarios are supported.

Quick Start¶

Data Preparation¶

Create a dump.json file in the current directory to simulate the dump output file. The file content is as follows:

{
    "task": "statistics",
    "level": "L1",
    "dump_data_dir": null,
    "framework": "mindspore",
    "data": {
        "Mint.where.0.forward": {
            "input_args": [
                {
                    "type": "mindspore.Tensor",
                    "dtype": "Bool",
                    "shape": [
                        1,
                        4096
                    ],
                    "Max": false,
                    "Min": false,
                    "Mean": null,
                    "Norm": null
                },
                {
                    "type": "int",
                    "value": 0
                },
                {
                    "type": "int",
                    "value": 1
                }
            ],
            "input_kwargs": {},
            "output": [
                {
                    "type": "mindspore.Tensor",
                    "dtype": "Int64",
                    "shape": [
                        1,
                        4096
                    ],
                    "Max": 1.0,
                    "Min": 1.0,
                    "Mean": 1.0,
                    "Norm": 64.0
                }
            ]
        }
    }
}
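
Before running the command, it can help to confirm that the file parses as JSON and was dumped at a level the pre-check accepts. The following is a minimal sketch; the helper name `summarize_dump` is hypothetical and is not part of msProbe.

```python
import json
from pathlib import Path


def summarize_dump(path):
    """Return (level, API names) from a dump.json file, checking the level."""
    with open(path, encoding="utf-8") as f:
        dump = json.load(f)
    level = dump.get("level")
    # The pre-check expects data dumped at level L1 or mix.
    if level not in ("L1", "mix"):
        raise ValueError(f"unexpected dump level: {level!r}")
    return level, sorted(dump.get("data", {}))


# Only runs when a dump.json exists in the current directory.
if Path("dump.json").exists():
    level, api_names = summarize_dump("dump.json")
    print(level, api_names)
```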

Running the Command¶

msprobe acc_check -api_info ./dump.json

For details about the result, see Pre-check Result Description.

Function Overview¶

Using acc_check for Pre-check¶

Function

The acc_check command is used to perform unit tests on all API execution units recorded in the dump.json file, compare the output differences between the NPU and CPU, and generate the forward and backward precision results. This function is applicable to API precision pre-check in single-device scenarios.

Precautions

The pre-check depends on the actual dump data. Ensure that level is set to L1 or mix for dump.
In random data mode, the -save_error_data parameter saves additional input and output files. Make sure that sufficient disk space is available beforehand.

Syntax

msprobe acc_check -api_info <dump_json_path> [-o <out_path>] [-csv_path <result_csv_path>] [-save_error_data]

Optional fields are enclosed in square brackets ([]), and variables are enclosed in angle brackets (<>).

Parameters

Parameter	Mandatory (Yes/No)	Description
`-api_info` or `--api_info_file`	Yes	Specifies the path of the API information file `dump.json`. The value is of the string type. Mint APIs and some Tensor APIs in the file are pre-checked. For details about the supported Tensor APIs for pre-check, see List of APIs That Can Be Checked.
`-o` or `--out_path`	No	Specifies the path for saving the pre-check result. The value is of the string type. The default value is `./`.
`-csv_path` or `--result_csv_path`	No	Specifies the path of the `accuracy_checking_result_{timestamp}.csv` file left by an interrupted run. Set this parameter only when you want to resume `acc_check` from the interruption point, and point it to the file generated by the most recent interrupted run. The value is of the string type. For details, see Resumable Check.
`-save_error_data`	No	Saves the API input and output data (random data mode) that does not meet the precision requirements.
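
When the pre-check is launched from a script, the command line can be assembled from the options in the table above. The sketch below uses only the documented flags; the function name `build_acc_check_cmd` is hypothetical.

```python
import shlex


def build_acc_check_cmd(api_info, out_path=None, result_csv_path=None,
                        save_error_data=False):
    """Assemble the acc_check argument list from the documented options."""
    cmd = ["msprobe", "acc_check", "-api_info", api_info]
    if out_path is not None:
        cmd += ["-o", out_path]
    if result_csv_path is not None:
        cmd += ["-csv_path", result_csv_path]
    if save_error_data:
        cmd.append("-save_error_data")
    return cmd


cmd = build_acc_check_cmd("./dump.json", out_path="./checker_result")
print(shlex.join(cmd))
# To actually run the pre-check: subprocess.run(cmd, check=True)
```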

Example 1: Perform a pre-check.

msprobe acc_check -api_info ./dump.json -o ./checker_result

The execution result of acc_check is generated in the path specified by the -o parameter and consists of the accuracy_checking_result_{timestamp}.csv and accuracy_checking_details_{timestamp}.csv files. accuracy_checking_result_{timestamp}.csv contains API-level data and indicates whether each API has passed the test, so it is recommended to view this file first. For APIs that failed the test or APIs of particular interest, look up the status of each output and comparison metric in the accuracy_checking_details_{timestamp}.csv file by using the API Name field. For details, see Pre-check Result Description.

In random data mode, if you need to save the input and output data that does not meet the requirements, add -save_error_data to the end of the acc_check command.

Example 2: Save the input and output data that does not meet the requirements.

msprobe acc_check -api_info ./dump.json -o ./checker_result -save_error_data

The data of the APIs that do not meet the requirements will be saved in:

{out_path}/error_data/

Using multi_acc_check for Multi-thread Pre-check¶

Function

multi_acc_check runs acc_check concurrently on multiple NPUs to accelerate API precision pre-check for large models, such as 7B, 13B, and 38B models.

Precautions

All devices whose IDs are specified must be idle. Running acc_check on multiple devices launches multiple Python processes concurrently.
The output results of multiple devices do not overwrite each other, and subdirectories with different timestamps are automatically created.

Syntax

msprobe multi_acc_check -api_info <dump_json_path> [-d <device_list>] [-o <out_path>] [-csv_path <result_csv_path>] [-save_error_data]

Parameters

Parameter	Description	Type	Mandatory (Yes/No)
`-api_info` or `--api_info_file`	Specifies the path of the API information file `dump.json`. Mint APIs and some Tensor APIs in the file are pre-checked. For details about the supported Tensor APIs for pre-check, see List of APIs That Can Be Checked.	str	Yes
`-o` or `--out_path`	Specifies the path for saving the pre-check result. The default value is `./`.	str	No
`-csv_path` or `--result_csv_path`	Specifies the path of the `accuracy_checking_result_{timestamp}.csv` file left by an interrupted run. Set this parameter only when you want to resume from the interruption point, and point it to the file generated by the most recent interrupted run. For details, see Resumable Check.	str	No
`-d` or `--device`	Specifies the IDs of the devices where `acc_check` runs. The default value is 0. Each ID must be between 0 and the total number of devices minus 1. Separate multiple IDs with spaces, for example, 0 1 2 3 4.	List[int]	No
`-save_error_data`	Saves the API input and output data (random data mode) that does not meet the precision requirements.	Empty	No

For details about the pre-check time consumption baseline of a 38B language model with different numbers of devices, see Time Reference Baseline in "multi_acc_check" Mode.

Example

msprobe multi_acc_check -api_info ./dump.json -d 0 1 2 3

If -save_error_data is specified, the data of the APIs that do not meet the requirements is saved in:

./ut_error_data{timestamp}/

Output Description

After the multi_acc_check pre-check is performed, two CSV files are generated for each device. For details, see Pre-check Result Description.

Resumable Check¶

Function

Resumable check allows you to continue the check from the interruption point after a pre-check is interrupted due to environment or data scale issues, without needing to perform the comparison again.

Precautions

Ensure that the accuracy_checking_result_{timestamp}.csv and accuracy_checking_details_{timestamp}.csv files have not been modified.
The .csv file generated during the last interruption must be used. Using any other file may result in incorrect resumable check behavior.
Resumable check does not create a new .csv file. Instead, it appends the result to the original file.

Example

msprobe acc_check -api_info ./dump.json -csv_path xxx/accuracy_checking_result_{timestamp}.csv

The accuracy_checking_result_{timestamp}.csv file that was interrupted last time must be used. Do not modify the accuracy_checking_result_{timestamp}.csv and accuracy_checking_details_{timestamp}.csv files or their names. Otherwise, the results of the resumable check cannot be guaranteed.
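
If several result files have accumulated in the output directory, a small script can help select the one to resume from. The sketch below assumes that the most recently modified accuracy_checking_result_*.csv file is the one left by the last interrupted run; that assumption, and the helper names, are illustrative only, so confirm the chosen file before resuming.

```python
from pathlib import Path


def latest_result_csv(out_dir):
    """Return the most recently modified accuracy_checking_result_*.csv, or None."""
    candidates = sorted(
        Path(out_dir).glob("accuracy_checking_result_*.csv"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def build_resume_cmd(api_info, out_dir):
    """Build the acc_check command that resumes from the last result file."""
    csv_path = latest_result_csv(out_dir)
    if csv_path is None:
        raise FileNotFoundError(f"no result CSV found in {out_dir}")
    # The file is passed as-is; it must not be renamed or edited.
    return ["msprobe", "acc_check", "-api_info", api_info,
            "-csv_path", str(csv_path)]
```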

Pre-check Result Description¶

The content of the accuracy_checking_result_{timestamp}.csv and accuracy_checking_details_{timestamp}.csv files generated during precision pre-check is as follows.

accuracy_checking_details_{timestamp}.csv

Field	Meaning
API Name	API name.
Bench Dtype	Type of the benchmark API data.
Tested Dtype	Type of the checked API data.
Shape	API shape.
Cosine	Cosine similarity between the checked data and the benchmark data.
MaxAbsErr	Maximum absolute error between the checked data and the benchmark data.
MaxRelativeErr	Maximum relative error between the checked data and the benchmark data.
Status	API pre-check status. The value `pass` indicates that the test is passed, and the value `error` indicates that the test is not passed.
Message	Message.

Note: PyTorch does not support backward differentiation (gradient computation) on tensors with integer dtype, whereas MindSpore does. The pre-check for the backward process therefore compares only outputs with floating-point dtype.

accuracy_checking_result_{timestamp}.csv

Field	Meaning
API Name	API name.
Forward Test Success	Whether the forward API passes the test. The value can be `pass` or `error`.
Backward Test Success	Whether the backward API passes the test. The value can be `pass` or `error`. If the value is empty, the API has no backward output.
Message	Message.

The pass/error status of Forward Test Success and Backward Test Success is determined by the cosine similarity and maximum absolute error values recorded in accuracy_checking_details_{timestamp}.csv. For details about the rules, see API Pre-check Metrics. Note that an API in accuracy_checking_details_{timestamp}.csv may have multiple forward or backward outputs. In this case, each output is recorded in a separate row. In accuracy_checking_result_{timestamp}.csv, the result is marked as pass only if all outputs of the API are pass. If any output is error, the result is marked as error.
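
To pick out the APIs that need attention, the API-level file can be filtered with a few lines of Python. The sketch below assumes that the CSV header uses the field names listed above and that fields are comma-separated; the sample rows are made up for illustration.

```python
import csv
import io


def failed_apis(result_csv):
    """Return API names whose forward or backward status is 'error'."""
    reader = csv.DictReader(result_csv)
    failed = []
    for row in reader:
        statuses = (row["Forward Test Success"], row["Backward Test Success"])
        if "error" in statuses:
            failed.append(row["API Name"])
    return failed


# Made-up sample rows using the documented column names.
sample = io.StringIO(
    "API Name,Forward Test Success,Backward Test Success,Message\n"
    "Mint.where.0,pass,,\n"
    "Mint.add.0,pass,error,\n"
)
print(failed_apis(sample))
```

For a real run, pass a file object opened with `open(path, newline="")` instead of the in-memory sample, then look up the returned names in accuracy_checking_details_{timestamp}.csv.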

API Pre-check Metrics¶

API pre-check metrics are used to determine whether an API meets the precision standards by checking the cosine similarity and maximum absolute error in accuracy_checking_details_{timestamp}.csv.
If the cosine similarity is greater than 0.99 and the maximum absolute error is less than 0.0001, the result is marked as pass. Otherwise, the result is marked as error.
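
The rule above can be sketched in plain Python as follows. This is an illustration of the stated thresholds on flattened outputs, not msProbe's implementation; the function names and the handling of all-zero vectors are assumptions.

```python
import math


def cosine_similarity(tested, bench):
    """Cosine similarity between two flattened outputs of equal length."""
    dot = sum(t * b for t, b in zip(tested, bench))
    norm_t = math.sqrt(sum(t * t for t in tested))
    norm_b = math.sqrt(sum(b * b for b in bench))
    if norm_t == 0.0 or norm_b == 0.0:
        # Assumption: two all-zero vectors count as identical.
        return 1.0 if norm_t == norm_b else 0.0
    return dot / (norm_t * norm_b)


def max_abs_err(tested, bench):
    """Largest element-wise absolute difference."""
    return max(abs(t - b) for t, b in zip(tested, bench))


def precheck_status(tested, bench, cos_threshold=0.99, abs_threshold=0.0001):
    """Apply the documented rule: both conditions must hold to pass."""
    cos = cosine_similarity(tested, bench)
    err = max_abs_err(tested, bench)
    return "pass" if cos > cos_threshold and err < abs_threshold else "error"


bench = [1.0, 2.0, 3.0]
print(precheck_status([1.0, 2.0, 3.00001], bench))
print(precheck_status([1.0, 2.0, 3.01], bench))
```

Note that the second input still has a cosine similarity very close to 1; it fails only because its maximum absolute error exceeds 0.0001, which shows why both metrics are checked.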
