Offline Precision Pre-check for MindSpore Dynamic Graphs
Overview
This feature scans all MindSpore Mint APIs in a trained MindSpore model, along with the MindSpore APIs ported in the MSAdapter scenario, running on Ascend NPUs, and outputs precision diagnostics and analysis. The tool takes the dump results of all APIs in the model as input, constructs a unit test for each API, compares the NPU output against a high-precision CPU benchmark, and calculates precision metrics to identify APIs with precision issues on the NPU. It supports two modes: random generation mode and real data mode.
Concepts
- Mint API: An API generated by MindSpore during dynamic graph execution, corresponding to a PyTorch API.
- MSAdapter: A compatibility layer that enables some PyTorch code to run on MindSpore.
- Random generation mode: Input data is automatically constructed based on the value range. Precision is slightly lower, making it suitable for quickly locating potential precision issues.
- Real data mode: The real input data generated by model dump is used for comparison, resulting in more reliable results.
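To make the idea behind random generation mode concrete, the following minimal sketch draws values uniformly from a given value range for a given shape. It is only an illustration of the concept, not msProbe's actual generator: the function name `random_values`, the uniform distribution, and the fixed seed are all assumptions.

```python
import random


def random_values(shape, low, high, seed=0):
    """Return a flat list with prod(shape) floats drawn uniformly from [low, high].

    Hypothetical helper for illustration; msProbe's real input construction
    may use a different distribution and dtype handling.
    """
    rng = random.Random(seed)
    count = 1
    for dim in shape:
        count *= dim
    return [rng.uniform(low, high) for _ in range(count)]


# Example: a [1, 4096] input whose recorded value range is [-1.0, 1.0].
values = random_values(shape=[1, 4096], low=-1.0, high=1.0)
print(len(values), min(values) >= -1.0, max(values) <= 1.0)
```

Because only the value range and shape are preserved, such inputs can differ from the real training data, which is why real data mode gives more reliable results.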
Offline Pre-check Process
The operation procedure is as follows:
- Install msProbe in the NPU environment.
- Add the PrecisionDebugger API of msProbe to the NPU training script to collect the data to be pre-checked. For details, see Precision Data Collection in MindSpore. Note that you need to set level to L1.
- Perform the pre-check, view the pre-check result file, and analyze the APIs that do not meet the pre-check requirements.
Preparations
Environment Setup
Install msProbe by referring to msProbe Installation Guide.
Constraints
Only MindSpore dynamic graph scenarios are supported.
Quick Start
Data Preparation
Create a dump.json file in the current directory to simulate the dump output file. The file content is as follows:
{
"task": "statistics",
"level": "L1",
"dump_data_dir": null,
"framework": "mindspore",
"data": {
"Mint.where.0.forward": {
"input_args": [
{
"type": "mindspore.Tensor",
"dtype": "Bool",
"shape": [
1,
4096
],
"Max": false,
"Min": false,
"Mean": null,
"Norm": null
},
{
"type": "int",
"value": 0
},
{
"type": "int",
"value": 1
}
],
"input_kwargs": {},
"output": [
{
"type": "mindspore.Tensor",
"dtype": "Int64",
"shape": [
1,
4096
],
"Max": 1.0,
"Min": 1.0,
"Mean": 1.0,
"Norm": 64.0
}
]
}}}
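If you prefer to create the file programmatically, the sketch below writes the same content with Python's standard `json` module. The names `SAMPLE_DUMP`, `write_sample_dump`, and `expected_l2_norm_of_constant` are hypothetical helpers introduced here. The last helper assumes that the `Norm` field is an L2 norm, which is consistent with the sample (an all-ones tensor with 4096 elements has an L2 norm of 64) but is not stated in this section.

```python
import json
import math
from pathlib import Path

# Same content as the dump.json shown above, expressed as a Python dict.
SAMPLE_DUMP = {
    "task": "statistics",
    "level": "L1",
    "dump_data_dir": None,
    "framework": "mindspore",
    "data": {
        "Mint.where.0.forward": {
            "input_args": [
                {"type": "mindspore.Tensor", "dtype": "Bool", "shape": [1, 4096],
                 "Max": False, "Min": False, "Mean": None, "Norm": None},
                {"type": "int", "value": 0},
                {"type": "int", "value": 1},
            ],
            "input_kwargs": {},
            "output": [
                {"type": "mindspore.Tensor", "dtype": "Int64", "shape": [1, 4096],
                 "Max": 1.0, "Min": 1.0, "Mean": 1.0, "Norm": 64.0},
            ],
        }
    },
}


def write_sample_dump(path="dump.json"):
    """Write the sample dump to `path` and return the content parsed back from disk."""
    target = Path(path)
    target.write_text(json.dumps(SAMPLE_DUMP, indent=1), encoding="utf-8")
    return json.loads(target.read_text(encoding="utf-8"))


def expected_l2_norm_of_constant(shape, value):
    """L2 norm of a tensor filled with `value` (assumes Norm means the L2 norm)."""
    return abs(value) * math.sqrt(math.prod(shape))
```

Calling `write_sample_dump()` with no argument creates dump.json in the current directory.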
Running the Command
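Based on the syntax documented under Using acc_check for Pre-check, a quick-start invocation on the dump.json file from Data Preparation would plausibly look like the command assembled below. This is a sketch under that assumption: the output path `./` is simply the documented default, and `run_if_available` is a hypothetical helper that only launches the tool when an `msprobe` executable is found on PATH.

```python
import shlex
import shutil
import subprocess

# Argument list built from the documented acc_check syntax.
command = ["msprobe", "acc_check", "-api_info", "./dump.json", "-o", "./"]
print(shlex.join(command))


def run_if_available(cmd):
    """Run `cmd` only if its executable is on PATH; return the exit code, else None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode
```

In an environment where msProbe is installed, `run_if_available(command)` starts the pre-check. Elsewhere it returns `None` without side effects.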
For details about the result, see Pre-check Result Description.
Function Overview
Using acc_check for Pre-check
Function
The acc_check command runs unit tests on all API execution units recorded in the dump.json file, compares the output differences between the NPU and CPU, and generates the forward and backward precision results. It is applicable to API precision pre-check in single-device scenarios.
Precautions
- The pre-check depends on the actual dump data. Ensure that level is set to L1 or mix for dump.
- In random data mode, the -save_error_data parameter saves additional input and output files. Evaluate the drive capacity in advance.
Syntax
msprobe acc_check -api_info <dump_json_path> [-o <out_path>] [-csv_path <result_csv_path>] [-save_error_data]
Optional fields are enclosed in square brackets ([]), and variables are enclosed in angle brackets (<>).
Parameters
| Parameter | Mandatory (Yes/No) | Description |
|---|---|---|
| -api_info or --api_info_file | Yes | Specifies the API information file dump.json. The value is of the string type. Mint APIs and some Tensor APIs are pre-checked. For details about the supported Tensor APIs for pre-check, see List of APIs That Can Be Checked. |
| -o or --out_path | No | Specifies the path for saving the pre-check result. The value is of the string type. The default value is ./. |
| -csv_path or --result_csv_path | No | Specifies the path of the accuracy_checking_result_{timestamp}.csv file generated by an interrupted run. Set this parameter to resume acc_check from the point where it was interrupted. The value is of the string type and must point to the accuracy_checking_result_{timestamp}.csv file of the most recently interrupted run. For details, see Resumable Check. |
| -save_error_data | No | Saves the API input and output data (random data mode) that does not meet the precision requirements. |
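The syntax and parameter table above can be summarized as a small command builder. The sketch below is illustrative only: `build_acc_check_command` is a hypothetical helper, `./output` is a placeholder path, and only the flags listed in the table are used.

```python
import shlex


def build_acc_check_command(api_info, out_path=None, csv_path=None,
                            save_error_data=False):
    """Assemble an `msprobe acc_check` argument list from the documented parameters."""
    cmd = ["msprobe", "acc_check", "-api_info", api_info]
    if out_path is not None:
        cmd += ["-o", out_path]            # where the two result CSV files are written
    if csv_path is not None:
        cmd += ["-csv_path", csv_path]     # result CSV of an interrupted run
    if save_error_data:
        cmd.append("-save_error_data")     # flag without a value
    return cmd


# A plain pre-check, and a pre-check that also saves failing inputs and outputs.
print(shlex.join(build_acc_check_command("./dump.json", out_path="./output")))
print(shlex.join(build_acc_check_command("./dump.json", out_path="./output",
                                         save_error_data=True)))
```

Keeping the arguments as a list avoids shell quoting problems when a path contains spaces.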
Example 1: Perform a pre-check.
The execution result of acc_check is generated in the path specified by the -o parameter, including the accuracy_checking_result_{timestamp}.csv and accuracy_checking_details_{timestamp}.csv files. accuracy_checking_result_{timestamp}.csv contains API-level data and indicates whether each API has passed the test. It is recommended to view the accuracy_checking_result_{timestamp}.csv file first. For APIs that failed the test or APIs with special focus, query the status of each output and comparison metric in the accuracy_checking_details_{timestamp}.csv file based on the API Name field. For details, see Pre-check Result Description.
In random data mode, if you need to save the input and output data that does not meet the requirements, add -save_error_data to the end of the acc_check command.
Example 2: Save the input and output data that does not meet the requirements.
The data of the APIs that do not meet the requirements will be saved in:
Using multi_acc_check for Multi-thread Pre-check
Function
multi_acc_check runs acc_check concurrently on multiple NPUs to accelerate API precision pre-check for large models, such as 7B, 13B, and 38B models.
Precautions
- All involved device IDs must be idle. acc_check running on multiple devices will invoke Python processes concurrently.
- The output results of multiple devices do not overwrite each other, and subdirectories with different timestamps are automatically created.
Syntax
msprobe multi_acc_check -api_info <dump_json_path> [-d <device_list>] [-o <out_path>] [-csv_path <result_csv_path>] [-save_error_data]
Parameters
| Parameter | Description | Type | Mandatory (Yes/No) |
|---|---|---|---|
| -api_info or --api_info_file | Specifies the API information file dump.json. Mint APIs and some Tensor APIs are pre-checked. For details about the supported Tensor APIs for pre-check, see List of APIs That Can Be Checked. | str | Yes |
| -o or --out_path | Specifies the path for saving the pre-check result. The default value is ./. | str | No |
| -csv_path or --result_csv_path | Specifies the path of the accuracy_checking_result_{timestamp}.csv file generated by an interrupted run. Set this parameter to resume execution from the point where acc_check was interrupted. The value must point to the accuracy_checking_result_{timestamp}.csv file of the most recently interrupted run. For details, see Resumable Check. | str | No |
| -d or --device | Specifies the IDs of the devices on which acc_check runs. The default value is 0. Each ID must be between 0 and the total number of devices minus 1, for example, 0 1 2 3 4. | List[int] | No |
| -save_error_data | Saves the API input and output data (random data mode) that does not meet the precision requirements. | Empty | No |
For details about the pre-check time consumption baseline of a 38B language model with different numbers of devices, see Time Reference Baseline in "multi_acc_check" Mode.
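As a rough sketch of the syntax above, the helper below validates the device IDs against the documented range (0 to the total number of devices minus 1) before assembling the command. `build_multi_acc_check_command` and its `total_devices` argument are hypothetical. The tool itself is not queried for the device count, so the caller must supply it.

```python
import shlex


def build_multi_acc_check_command(api_info, devices=(0,), total_devices=8,
                                  out_path=None):
    """Assemble an `msprobe multi_acc_check` argument list after checking device IDs."""
    for dev in devices:
        if not 0 <= dev < total_devices:
            raise ValueError(
                f"device ID {dev} is outside the range 0..{total_devices - 1}")
    # Device IDs are passed as space-separated values after -d, e.g. "-d 0 1 2 3".
    cmd = ["msprobe", "multi_acc_check", "-api_info", api_info,
           "-d", *map(str, devices)]
    if out_path is not None:
        cmd += ["-o", out_path]
    return cmd


print(shlex.join(build_multi_acc_check_command("./dump.json", devices=[0, 1, 2, 3])))
```

Rejecting an out-of-range ID before launching avoids starting a multi-device run that is bound to fail on one device.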
Example
The data of the APIs that do not meet the requirements will be saved in:
Output Description
After the multi_acc_check pre-check is performed, two CSV files are generated for each device. For details, see Pre-check Result Description.
Resumable Check
Function
Resumable check allows you to continue the check from the interruption point after a pre-check is interrupted due to environment or data scale issues, without needing to perform the comparison again.
Precautions
- Ensure that the accuracy_checking_result_{timestamp}.csv and accuracy_checking_details_{timestamp}.csv files have not been modified.
- The .csv file generated during the last interruption must be used. Using any other file may result in incorrect resumable check behavior.
- Resumable check does not create a new .csv file. Instead, it appends the result to the original file.
Example
The accuracy_checking_result_{timestamp}.csv file that was interrupted last time must be used. Do not modify the accuracy_checking_result_{timestamp}.csv and accuracy_checking_details_{timestamp}.csv files or their names. Otherwise, the results of the resumable check cannot be guaranteed.
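One way to pick up the result file of the last interrupted run is sketched below. It assumes that the most recently modified accuracy_checking_result_*.csv in the output directory belongs to that run, which is a heuristic rather than documented behavior. `latest_result_csv` and `build_resume_command` are hypothetical helpers.

```python
from pathlib import Path


def latest_result_csv(out_dir):
    """Return the most recently modified accuracy_checking_result_*.csv, or None.

    Heuristic: modification time is used as a stand-in for "last interrupted run".
    """
    candidates = sorted(Path(out_dir).glob("accuracy_checking_result_*.csv"),
                        key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


def build_resume_command(api_info, result_csv):
    """Assemble an acc_check command that resumes from `result_csv` via -csv_path."""
    return ["msprobe", "acc_check", "-api_info", api_info,
            "-csv_path", str(result_csv)]
```

For example, `build_resume_command("./dump.json", latest_result_csv("./"))` yields the argument list for resuming in the default output directory, provided a result file exists there.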
Pre-check Result Description
The content of the accuracy_checking_result_{timestamp}.csv and accuracy_checking_details_{timestamp}.csv files generated during precision pre-check is as follows.
accuracy_checking_details_{timestamp}.csv
| Field | Meaning |
|---|---|
| API Name | API name. |
| Bench Dtype | Type of the benchmark API data. |
| Tested Dtype | Type of the checked API data. |
| Shape | API shape. |
| Cosine | Cosine similarity between the checked data and the benchmark data. |
| MaxAbsErr | Maximum absolute error between the checked data and the benchmark data. |
| MaxRelativeErr | Maximum relative error between the checked data and the benchmark data. |
| Status | API pre-check status. The value pass indicates that the test is passed, and the value error indicates that the test is not passed. |
| Message | Message. |
Note: PyTorch does not support backpropagation through tensors with an integer dtype, whereas MindSpore does. The backward pre-check therefore compares only outputs with a floating-point dtype.
accuracy_checking_result_{timestamp}.csv
| Field | Meaning |
|---|---|
| API Name | API name. |
| Forward Test Success | Whether the forward API passes the test. The value can be pass or error. |
| Backward Test Success | Whether the backward API passes the test. The value can be pass or error. If the value is empty, the API has no backward output. |
| Message | Message. |
The pass/error status of Forward Test Success and Backward Test Success is determined by the cosine similarity and maximum absolute error values recorded in accuracy_checking_details_{timestamp}.csv. For details about the rules, see API Pre-check Metrics.
Note that an API in accuracy_checking_details_{timestamp}.csv may have multiple forward or backward outputs. In this case, each output is recorded in a separate row. In accuracy_checking_result_{timestamp}.csv, the result is marked as pass only if all outputs of the API are pass. If any output is error, the result is marked as error.
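The "pass only if all outputs pass" rule can be sketched as follows. The sample rows are invented for illustration and use only the API Name and Status fields from the details table. How the tool splits rows into the Forward Test Success and Backward Test Success columns is not covered by this sketch.

```python
import csv
import io


def summarize_status(details_rows):
    """Map each API Name to 'pass' only if every one of its rows has Status 'pass'."""
    summary = {}
    for row in details_rows:
        name, status = row["API Name"], row["Status"]
        still_ok = summary.get(name, "pass") == "pass" and status == "pass"
        summary[name] = "pass" if still_ok else "error"
    return summary


# Hypothetical excerpt of a details file: the second API has two outputs.
SAMPLE_DETAILS = """API Name,Status
Mint.where.0.forward,pass
Mint.split.0.forward,pass
Mint.split.0.forward,error
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_DETAILS)))
print(summarize_status(rows))
```

A single failing output is enough to mark the whole API as error, regardless of the order in which the rows appear.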
API Pre-check Metrics
- API pre-check metrics are used to determine whether an API meets the precision standards by checking the cosine similarity and maximum absolute error in accuracy_checking_details_{timestamp}.csv.
- If the cosine similarity is greater than 0.99 and the maximum absolute error is less than 0.0001, the result is marked as pass. Otherwise, the result is marked as error.
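The pass/error rule above can be reproduced on flat lists of values as a minimal sketch. The thresholds come from this section. The function names and the handling of all-zero vectors are assumptions, and msProbe's own implementation may differ in details such as dtype handling.

```python
import math

COSINE_THRESHOLD = 0.99          # cosine similarity must be greater than this
MAX_ABS_ERR_THRESHOLD = 0.0001   # maximum absolute error must be less than this


def cosine_similarity(tested, bench):
    """Cosine similarity between two equally long sequences of numbers."""
    dot = sum(t * b for t, b in zip(tested, bench))
    norm_t = math.sqrt(sum(t * t for t in tested))
    norm_b = math.sqrt(sum(b * b for b in bench))
    if norm_t == 0.0 or norm_b == 0.0:
        # Assumption: two all-zero vectors count as identical, otherwise dissimilar.
        return 1.0 if norm_t == norm_b else 0.0
    return dot / (norm_t * norm_b)


def max_abs_err(tested, bench):
    """Largest element-wise absolute difference."""
    return max(abs(t - b) for t, b in zip(tested, bench))


def check_status(tested, bench):
    """Return 'pass' if both documented thresholds are met, otherwise 'error'."""
    cos = cosine_similarity(tested, bench)
    err = max_abs_err(tested, bench)
    ok = cos > COSINE_THRESHOLD and err < MAX_ABS_ERR_THRESHOLD
    return "pass" if ok else "error"
```

Note that both conditions must hold: an output with a cosine similarity close to 1 is still marked as error if a single element deviates by 0.0001 or more.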