GYM Environment Training

GYM-style environment training wraps the “model → environment → reward” chain behind an abstract interface, letting the LLM interact with the environment as an Agent over multiple turns. The reward of each step is produced directly by the environment, so you don’t need a separate reward function to infer it from the trajectory. This document first introduces the interface, then walks through a complete custom example (FrozenLake) showing how to plug it into training.

Gym interface

GYM originates from the Gymnasium library. In ms-swift we define the following interface:

class Env(ABC):

    def __init__(self, env_config):
        """env_config comes from the env_config column of each dataset row and carries initialization arguments."""
        self.env_config = env_config

    @abstractmethod
    async def reset(self, config: RolloutInferRequest) -> Tuple[str, Dict[str, Any], str]:
        """
        Returns:
            - observation: sent to the model as the first user message
            - info: debug/log information, recorded in completions.jsonl
            - system_message: system prompt for this trajectory
        """
        pass

    @abstractmethod
    async def step(self, action: Messages) -> Tuple[str, float, bool, Dict[str, Any]]:
        """
        Args:
            action: the complete conversation messages so far; the last one is the model's latest reply
        Returns:
            - next_observation: next user message
            - reward: reward for the current step
            - done: whether the trajectory is finished
            - info: debug/log information
        """
        pass

    @abstractmethod
    async def close(self):
        """Release resources."""
        pass

The RolloutInferRequest received by reset contains the dataset row’s messages, data_dict (extra columns including env_config), etc. See the input example for the full structure.

If you need extra control over the conversation history between turns (e.g. dynamic compression, injecting hints), subclass MultiTurnScheduler and implement on_trajectory_start / on_turn_end hooks, or override step / run — see the multi-turn doc.

Launching training

Use the built-in gym_scheduler to wire the env into multi-turn rollout.

GYMScheduler is based on the generic hook protocol:

Inherits MultiTurnScheduler — no need to override the run method
Implements on_trajectory_start (calls env.reset) and on_turn_end (calls env.step)
Works with both server mode (run()) and colocate mode (run_multi_turn())

User-defined envs are loaded via --external_plugins your_plugin.py; the plugin runs envs['my_env'] = MyEnv to register them (the FrozenLake example below demonstrates the full pattern).

The built-in GYMScheduler completes the control logic via hooks:

class GYMScheduler(MultiTurnScheduler):
    def on_trajectory_start(self, requests):
        # Create an env for each request, call env.reset, inject initial observation
        for req in requests:
            env = self._create_env(req.data_dict.get('env_config', {}))
            observation, info, system_message = env.reset(req)
            req.messages = [system_msg, user_msg(observation)]
            self._envs[req.uuid] = env

    def on_turn_end(self, req, response_choice, current_turn):
        # Call env.step, accumulate reward, return done + rollout_infos
        next_obs, reward, done, info = env.step(deepcopy(req.messages))
        self._total_rewards[req.uuid] += reward
        return {
            'done': done,
            'rollout_infos': {
                'total_reward': self._total_rewards[req.uuid],
                'step_rewards': [...],
            }
        }

    def step(self, req, response_choice, current_turn):
        # Inject the next observation into a user message
        if self._pending_obs.get(req.uuid):
            req.messages.append({'role': 'user', 'content': next_obs})
        return {'infer_request': req}

Colocate mode:

megatron rlhf \
    --rlhf_type grpo \
    --vllm_mode colocate \
    --external_plugins examples/megatron/grpo/multi_turn/frozen_lake_plugin.py \
    --multi_turn_scheduler gym_scheduler \
    --gym_env frozen_lake \
    --use_gym_env true \
    --max_turns 10 \
    ...

# swift rlhf works the same way

Server mode

swift rollout \
    --model xxx \
    --use_gym_env true \
    --external_plugins examples/megatron/grpo/multi_turn/frozen_lake_plugin.py \
    --multi_turn_scheduler gym_scheduler \
    --gym_env frozen_lake \
    --max_turns 10

# On the trainer side, add --vllm_server_pass_dataset true so the env_config column reaches the rollout server.
megatron rlhf --vllm_mode server --vllm_server_pass_dataset true ...
# or swift rlhf --vllm_mode server --vllm_server_pass_dataset true ...

Two ways to select the environment:

Set it globally via --gym_env env_name (recommended — one env for the whole script);
Or specify it per dataset row via env_config.name (for mixed-env workloads; overrides --gym_env).

Example: writing a FrozenLake environment from scratch

FrozenLake environment (image from Gymnasium docs)

FrozenLake is a classic task from OpenAI Gym: the agent starts at the start cell, must cross a frozen lake to reach the goal, and avoid holes along the way. The original environment is illustrated above. The walkthrough below uses a text-only version of it (the same grid rendered as ASCII).

Full source: frozen_lake_plugin.

1. Define the Env

Each dataset row produces a freshly generated random 4x4 map (random holes + random S/G positions, BFS-validated to be solvable). Cell meanings: S start / G goal / H hole (stepping in = fail) / F safe ice / P player’s current position.

class FrozenLakeEnv(Env):
    def __init__(self, env_config):
        super().__init__(env_config)
        self.size = int(env_config.get('size', 4))
        self.p = float(env_config.get('p', 0.8))
        seed = env_config.get('seed')
        self.seed = int(seed) if seed is not None else None

    async def reset(self, config: RolloutInferRequest):
        self.grid = generate_random_map(size=self.size, p=self.p, seed=self.seed)
        ...
        return observation, {'seed': self.seed}, SYSTEM_PROMPT

    async def step(self, action: Messages):
        move = _parse_action(action[-1]['content'])  # <action>up|down|left|right</action>
        # Advance one cell, check G / H; the outer max_turns is enforced by the scheduler.
        if cell == 'G': return obs, 1.0, True, {'status': 'goal'}
        if cell == 'H': return obs, 0.0, True, {'status': 'hole'}
        ...

2. Register

Hook the env class into swift’s envs registry. --external_plugins imports the file at startup, so the registration takes effect automatically:

# examples/megatron/grpo/multi_turn/frozen_lake_plugin.py
from swift.rollout.gym_env import Env, envs

class FrozenLakeEnv(Env):
    ...

envs['frozen_lake'] = FrozenLakeEnv

3. Prepare the dataset

The dataset is just a placeholder here — the actual data is constructed by the env, with env_config.seed controlling map-generation randomness:

{"messages":[{"role":"user","content":"<placeholder>"}],"env_config":{"seed":0}}
{"messages":[{"role":"user","content":"<placeholder>"}],"env_config":{"seed":1}}
...
{"messages":[{"role":"user","content":"<placeholder>"}],"env_config":{"seed":127}}

4. (Optional) Blend in extra rewards

With --use_gym_env true, the env-provided total_reward is automatically added as one reward column — no reward function is required. To mix in additional signals (e.g. format/length checks), just pass them via --reward_funcs; the gym reward is appended as an extra column and blended with the reward_funcs through --reward_weights. For example, also enabling a format reward:

megatron rlhf ... --use_gym_env true --reward_funcs format --reward_weights 0.2 1.0
# the last entry of reward_weights corresponds to the gym total_reward

5. Train

Runnable script: examples/megatron/grpo/multi_turn/frozen_lake.sh

During training, observe rollout_infos.num_turns (steps per trajectory) and the reward mean in the logs. --log_completions true writes full conversations to completions.jsonl, so you can verify the model outputs in the <action>...</action> format turn by turn.

References:

https://gymnasium.farama.org/environments/toy_text/frozen_lake/
https://github.com/alibaba/ROLL/tree/main/roll/pipeline/agentic/env/frozen_lake
OpenEnv
TRL Sudoku GRPO Example

OpenEnv Environment Training

OpenEnv is an open-source Agentic RL environment framework by HuggingFace that communicates with environment servers via WebSocket. Unlike the local Env interface used by FrozenLake above, OpenEnv places environment logic in a separate server process, and swift communicates with it through OpenEnvScheduler + OpenEnvWrapper.

Architecture Comparison

Feature	Built-in Gym (`GYMScheduler`)	OpenEnv (`OpenEnvScheduler`)
Environment location	In-process (Python object)	Standalone server (WebSocket)
Environment interface	Subclass `Env`, implement `reset/step/close`	Server provides HTTP/WebSocket API
Registration	`--external_plugins` + `envs` registry	`--external_plugins` + `multi_turns` registry
Use case	Lightweight local envs (FrozenLake, etc.)	Complex server envs (TextArena, CARLA, etc.)
Concurrency control	Not needed	Built-in Semaphore for connection limiting

OpenEnvScheduler

OpenEnvScheduler extends GYMScheduler, replacing the local Env with OpenEnvWrapper (a WebSocket client). Key design:

_create_env: Creates an OpenEnvWrapper connected to the OpenEnv server
on_trajectory_start: Creates a wrapper per request, calls reset(), uses Semaphore to limit concurrency (default 4)
on_turn_end: Parses model output, calls wrapper.step(), accumulates reward
parse_action (overridable): Converts model text to action dict, default json.loads
format_observation (overridable): Converts server observation to string, default json.dumps

Users subclass OpenEnvScheduler and override parse_action, format_observation, on_trajectory_start, and on_turn_end to adapt to specific environments.

Example: Sudoku Environment

Using TextArena Sudoku as an example, the model places numbers on a 9x9 Sudoku grid via [row col number] format. Full code: sudoku_scheduler.py.

1. Start OpenEnv Server

Install OpenEnv and the Sudoku environment package (textarena and nltk are installed automatically as dependencies):

pip install openenv
pip install git+https://huggingface.co/spaces/openenv/sudoku

Use the provided startup script to start the local server (default port 8000). MAX_CONCURRENT_ENVS must be ≥ num_generations used in training:

TEXTARENA_ENV_ID=Sudoku-v0 MAX_CONCURRENT_ENVS=8 python examples/train/grpo/plugin/openenv/start_sudoku_server.py

The default python -m textarena_env.server.app only supports 1 concurrent session, which is insufficient for GRPO’s parallel multi-generation sampling. start_sudoku_server.py lifts this restriction by setting SUPPORTS_CONCURRENT_SESSIONS.

Point base_url to the local server in your dataset:

{"messages":[{"role":"user","content":"Play"}],"env_config":{"name":"openenv","base_url":"http://127.0.0.1:8000"}}

2. Custom Scheduler

Subclass OpenEnvScheduler to implement Sudoku-specific action parsing, observation formatting, and multi-component rewards:

from swift.rollout.multi_turn import OpenEnvScheduler

class SudokuScheduler(OpenEnvScheduler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._last_content_len = {}  # Content diff tracking

    async def on_trajectory_start(self, requests):
        # Create env, parse board, generate hints
        # hints include 'guaranteed moves' and candidate numbers
        ...

    async def on_turn_end(self, infer_request, response_choice, current_turn):
        # Parse [row col number], step env
        # Compute 5-component reward: empty_cell / valid_move / repetition / progress / correct
        # Return updated board + hints as next observation
        ...

    def parse_action(self, text):
        import re
        match = re.search(r'\[\s*(\d+)\s+(\d+)\s+(\d+)\s*\]', text)
        if match:
            row, col, num = match.groups()
            return {"message": f"[{row} {col} {num}]"}
        return {"message": "[1 1 1]"}

Multi-component reward system (adapted from TRL Sudoku example):

Reward component	Calculation	Purpose
`empty_cell_reward`	Targets empty cell +1 / overwrites -1	Guide model to valid positions
`valid_move_reward`	Valid new move +1 / warning -0.5 / invalid 0	Encourage legal moves
`repetition_reward`	Exponential penalty for repeats (-2^n, cap -10)	Avoid repetition
`progress_reward`	(filled - initial) / (81 - initial)	Measure solving progress
`correct_reward`	Binary reward from environment	Puzzle fully solved

Combined reward = sum of component averages, providing denser learning signal than a single binary reward.

3. Hints System

At each turn, the scheduler parses the current board state and provides hints to the model:

GUARANTEED MOVES: Cells with only one candidate (can be filled directly)
Other options: Cells with 2-3 candidates
MOVES ALREADY TRIED: Previously attempted moves (to avoid repetition)

This significantly reduces exploration difficulty and enables the model to make more valid moves.

4. Prepare Dataset

The dataset serves as a placeholder; actual boards are generated by the environment server. Point base_url to the OpenEnv hosted address:

{"messages":[{"role":"user","content":"Play"}],"env_config":{"name":"openenv","base_url":"http://127.0.0.1:8000"}}

5. Register Scheduler

sudoku_scheduler.py includes registration code at the end, loaded via --external_plugins:

# End of sudoku_scheduler.py
from swift.rollout.multi_turn import multi_turns
multi_turns['sudoku_scheduler'] = SudokuScheduler

6. Start Training

swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3.5-4B \
    --dataset examples/train/grpo/plugin/openenv/sudoku.jsonl \
    --external_plugins examples/train/grpo/plugin/openenv/sudoku_scheduler.py \
    --enable_thinking false \
    --max_completion_length 256 \
    --use_gym_env true \
    --multi_turn_scheduler sudoku_scheduler \
    --max_turns 20 \
    --use_vllm true \
    --vllm_mode colocate \
    ...

Runnable script: examples/train/grpo/plugin/openenv/run_grpo_sudoku.sh

Notes

vLLM mode: The example above uses --vllm_mode colocate, where vLLM and training share the same GPUs. If using --vllm_mode server, you need to start swift rollout separately as the vLLM server, and --multi_turn_scheduler / --max_turns should be passed to swift rlhf, not swift rollout.
Server concurrency: start_sudoku_server.py’s MAX_CONCURRENT_ENVS must be ≥ num_generations used in training. The default python -m textarena_env.server.app only supports 1 concurrent session.
Content diff: Environments like TextArena return cumulative messages (full history each turn). The scheduler tracks _last_content_len to return only the new portion, preventing context length explosion.
First-turn timing: on_trajectory_start is called BEFORE the first rollout, ensuring the model sees the actual environment observation (e.g., Sudoku board) rather than the placeholder text from the dataset.
enable_thinking: When using Qwen3.5 series models, set --enable_thinking false to skip <think> block generation.
Sync I/O: OpenEnvWrapper’s reset()/step() are synchronous WebSocket calls. OpenEnvScheduler subclasses should wrap these calls with asyncio.to_thread() to avoid blocking the event loop.