Case Study: Collecting and Analyzing Snapshot Data in verl

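Background

In PPO/RLHF scenarios, device memory pressure in verl training jobs is usually concentrated in a few stages: rollout generation, actor update, reference logprob computation, critic update, reward model inference, and checkpoint saving. To locate the source of peak memory usage, check whether memory keeps growing from step to step, and compare allocation differences across ranks, you can use verl's built-in torch_memory profiler to collect PyTorch memory snapshots.

This document describes how to collect snapshot data in verl and presents an analysis case based on MindStudio Insight.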

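Approach

Start with a minimal set of collection parameters to verify that the snapshot pipeline works end to end.
In single-device runs, collect rank 0 only and limit the number of collected steps to reduce runtime overhead.
In multi-device runs, collect a few representative ranks first; enable all-rank collection only when you need to compare every rank.
After collection finishes, check the output directory and confirm that the snapshot files exist and are non-empty.
Then analyze the snapshots with MindStudio Insight.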
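
The output check in the steps above can be scripted. The following is a minimal sketch, assuming the snapshots are written as `.pickle` files somewhere below the configured save path. The function name `find_snapshot_problems` is hypothetical and is not part of verl.

```python
from __future__ import annotations

from pathlib import Path


def find_snapshot_problems(save_path: str) -> list[str]:
    """Return human-readable problems found under save_path (empty list means OK)."""
    root = Path(save_path)
    if not root.is_dir():
        return [f"output directory does not exist: {root}"]

    # Assumption: snapshot files use the .pickle suffix and may sit in subdirectories.
    snapshots = sorted(root.rglob("*.pickle"))
    if not snapshots:
        return [f"no snapshot files found under: {root}"]

    # A zero-byte file suggests the dump was interrupted or never written.
    return [f"empty snapshot file: {p}" for p in snapshots if p.stat().st_size == 0]
```

Calling `find_snapshot_problems("/path/to/save_path")` right after a short trial run returns an empty list when every snapshot file exists and is non-empty. This only checks presence and size; it does not validate the snapshot contents.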

Snapshot Data Collection

Collection Parameters

The core parameters for enabling memory snapshot collection in verl are as follows:

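Parameter	Description
`global_profiler.tool`	Selects the profiling tool. The commands in this document set it to `torch_memory`, verl's PyTorch device memory collection tool.
`global_profiler.save_path`	Output directory for snapshot files.
`actor_rollout_ref.actor.profiler.enable`	Enables profiler integration on the actor side.
`actor_rollout_ref.actor.profiler.ranks`	List of ranks to collect. Suited to collecting a single rank or a few ranks.
`actor_rollout_ref.actor.profiler.all_ranks`	Collects all ranks. Produces more files and adds more runtime overhead.
`trainer.device`	Device type used by the training job, for example `cuda` or `npu`.
`global_profiler.steps`	Steps to collect. After each collection finishes, the existing memory history is cleared so that data from different collection windows does not mix.
`global_profiler.global_tool_config.torch_memory.trace_alloc_max_entries`	Number of memory allocation records to retain. Larger values make the snapshot more complete but increase overhead.
`global_profiler.global_tool_config.torch_memory.stack_depth`	Depth of the recorded call stack. Larger values make attribution easier but increase overhead.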

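Single-Device Collection Command

A single-device job only needs to collect rank 0. The following command uses verl.trainer.main_ppo as the entry point; replace the dataset, model, and training parameters with your actual job configuration.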

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/home/chenyan/verl_data/train.parquet \
    data.val_files=/home/chenyan/verl_data/test.parquet \
    data.train_batch_size=16 \
    data.max_prompt_length=512 \
    data.max_response_length=128 \
    data.filter_overlong_prompts=True \
    data.truncation=error \
    actor_rollout_ref.model.path=/data/models/Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.total_epochs=1 \
    trainer.default_local_dir=/home/chenyan/verl_outputs \
    trainer.device=npu \
    global_profiler.tool=torch_memory \
    actor_rollout_ref.actor.profiler.ranks='[0]' \
    actor_rollout_ref.actor.profiler.enable=True \
    global_profiler.steps='[1,2,3,4,5,6,7,8,9,10]' \
    global_profiler.save_path=/home/chenyan/verl_outputs/mem_snapshots_single \
    global_profiler.global_tool_config.torch_memory.trace_alloc_max_entries=100000 \
    global_profiler.global_tool_config.torch_memory.stack_depth=32

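Multi-Device Collection Command for Selected Ranks

For multi-device jobs, start by collecting a few representative ranks, such as rank 0 and rank 1, to compare memory allocation differences during actor execution. The following command uses a single node with 4 devices as an example.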

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/home/chenyan/verl_data/train.parquet \
    data.val_files=/home/chenyan/verl_data/test.parquet \
    data.train_batch_size=16 \
    data.max_prompt_length=512 \
    data.max_response_length=128 \
    data.filter_overlong_prompts=True \
    data.truncation=error \
    actor_rollout_ref.model.path=/data/models/Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.total_epochs=1 \
    trainer.default_local_dir=/home/chenyan/verl_outputs \
    trainer.device=npu \
    global_profiler.tool=torch_memory \
    actor_rollout_ref.actor.profiler.ranks='[0,1]' \
    actor_rollout_ref.actor.profiler.enable=True \
    global_profiler.steps='[1,2,3,4,5,6,7,8,9,10]' \
    global_profiler.save_path=/home/chenyan/verl_outputs/mem_snapshots_selected_ranks \
    global_profiler.global_tool_config.torch_memory.trace_alloc_max_entries=100000 \
    global_profiler.global_tool_config.torch_memory.stack_depth=32

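Multi-Device Collection Command for All Ranks

If you need to compare peak memory usage, allocation stacks, or step-to-step growth across all ranks, enable all-rank collection. This mode produces more snapshot files and adds more runtime overhead, so use it only over a short training window.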

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/home/chenyan/verl_data/train.parquet \
    data.val_files=/home/chenyan/verl_data/test.parquet \
    data.train_batch_size=16 \
    data.max_prompt_length=512 \
    data.max_response_length=128 \
    data.filter_overlong_prompts=True \
    data.truncation=error \
    actor_rollout_ref.model.path=/data/models/Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.2 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.total_epochs=1 \
    trainer.default_local_dir=/home/chenyan/verl_outputs \
    trainer.device=npu \
    global_profiler.tool=torch_memory \
    actor_rollout_ref.actor.profiler.all_ranks=True \
    actor_rollout_ref.actor.profiler.enable=True \
    global_profiler.steps='[1,2,3,4,5,6,7,8,9,10]' \
    global_profiler.save_path=/home/chenyan/verl_outputs/mem_snapshots_all_ranks \
    global_profiler.global_tool_config.torch_memory.trace_alloc_max_entries=100000 \
    global_profiler.global_tool_config.torch_memory.stack_depth=32
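
The three commands differ only in the device count, the rank selection, and the save path, so the profiler-related overrides can be generated from one place. The following is a minimal sketch of such a generator. The helper names `profiler_overrides` and `_fmt` are hypothetical and not part of verl; the override keys and default values are copied from the commands above.

```python
from __future__ import annotations


def _fmt(values: list[int]) -> str:
    """Format a list the way the commands above write it, e.g. [0,1]."""
    return "[" + ",".join(str(v) for v in values) + "]"


def profiler_overrides(
    save_path: str,
    steps: list[int],
    ranks: list[int] | None = None,
    max_entries: int = 100000,
    stack_depth: int = 32,
) -> list[str]:
    """Build the memory-snapshot overrides; ranks=None selects all-rank collection."""
    if ranks is None:
        rank_override = "actor_rollout_ref.actor.profiler.all_ranks=True"
    else:
        rank_override = f"actor_rollout_ref.actor.profiler.ranks={_fmt(ranks)}"

    prefix = "global_profiler.global_tool_config.torch_memory"
    return [
        "global_profiler.tool=torch_memory",
        rank_override,
        "actor_rollout_ref.actor.profiler.enable=True",
        f"global_profiler.steps={_fmt(steps)}",
        f"global_profiler.save_path={save_path}",
        f"{prefix}.trace_alloc_max_entries={max_entries}",
        f"{prefix}.stack_depth={stack_depth}",
    ]
```

One way to use the result is to append it to the task-specific arguments and pass the whole list to `subprocess.run` as an argument list. In that case no shell is involved, so the bracketed values need no quoting.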

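Organizing the Collected Results

After collection finishes, verl creates one subdirectory per step under the output directory specified by global_profiler.save_path. Each step directory directly contains the snapshot files collected at that step for every configured rank. File names follow the format torch_memory_rank{rank}_pid{pid}.pickle: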

mem_snapshots_selected_ranks/
├── step1/
│   ├── torch_memory_rank0_pid12345.pickle
│   └── torch_memory_rank1_pid12346.pickle
├── step2/
│   ├── torch_memory_rank0_pid12345.pickle
│   └── torch_memory_rank1_pid12346.pickle
└── step3/
    ├── torch_memory_rank0_pid12345.pickle
    └── torch_memory_rank1_pid12346.pickle
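
A layout like this can be indexed by step and rank before the files are opened in an analysis tool. The following is a minimal sketch that assumes exactly the directory and file naming shown above. The names `index_snapshots` and `missing_ranks` are hypothetical helpers, not part of verl.

```python
from __future__ import annotations

import re
from collections import defaultdict
from pathlib import Path

STEP_DIR = re.compile(r"^step(\d+)$")
SNAPSHOT_FILE = re.compile(r"^torch_memory_rank(\d+)_pid(\d+)\.pickle$")


def index_snapshots(save_path: str) -> dict[int, dict[int, Path]]:
    """Map step -> rank -> snapshot path for the stepN/torch_memory_rankR_pidP.pickle layout."""
    index: dict[int, dict[int, Path]] = defaultdict(dict)
    for step_dir in Path(save_path).iterdir():
        step_match = STEP_DIR.match(step_dir.name)
        if not (step_dir.is_dir() and step_match):
            continue
        for file in step_dir.iterdir():
            file_match = SNAPSHOT_FILE.match(file.name)
            if file_match:
                # If one rank has several files in a step, the last one seen wins.
                index[int(step_match.group(1))][int(file_match.group(1))] = file
    return dict(index)


def missing_ranks(index: dict[int, dict[int, Path]], expected: list[int]) -> dict[int, list[int]]:
    """Return step -> sorted ranks that have no snapshot in that step."""
    gaps = {step: sorted(set(expected) - set(ranks)) for step, ranks in index.items()}
    return {step: ranks for step, ranks in gaps.items() if ranks}
```

For a run configured with ranks 0 and 1, `missing_ranks(index_snapshots(save_path), [0, 1])` returns an empty dictionary when every collected step contains both ranks.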

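When comparing ranks, compare their data within the same step directory. When comparing experiments, record the model path, batch size, prompt/response length, rollout parallelism, FSDP/TP/PP configuration, and collection parameters.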
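
One way to keep that record is to store it next to the snapshots as a small JSON file. The following is a minimal sketch. The file name `experiment_record.json`, the helper names, and the record keys are all hypothetical; verl does not write or read such a file.

```python
from __future__ import annotations

import json
from pathlib import Path


def write_experiment_record(save_path: str, record: dict) -> Path:
    """Write the experiment settings as JSON inside the snapshot output directory."""
    out = Path(save_path) / "experiment_record.json"  # hypothetical file name
    out.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return out


def diff_records(a: dict, b: dict) -> dict:
    """Return key -> (value in a, value in b) for every setting that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Example record; the values are taken from the selected-rank command in this document.
EXAMPLE_RECORD = {
    "model_path": "/data/models/Qwen/Qwen2.5-7B-Instruct",
    "train_batch_size": 16,
    "max_prompt_length": 512,
    "max_response_length": 128,
    "rollout_tensor_model_parallel_size": 4,
    "profiler_ranks": [0, 1],
    "profiler_steps": list(range(1, 11)),
    "trace_alloc_max_entries": 100000,
    "stack_depth": 32,
}
```

Before comparing snapshots from two runs, `diff_records` on their two records shows which settings changed, which helps separate a real memory regression from a configuration difference.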