Reinforcement Learning for Lane Following

Overview

Train a reinforcement learning (RL) policy in simulation to keep a virtual racecar centered in its lane using a lightweight environment such as Donkey Car Gym or a custom Gazebo world. You will configure the simulator, implement a simple reward function, train the agent, and evaluate transfer potential to the real vehicle.

Prerequisites

  • Completed Train and Deploy an MLP Steering Policy
  • Python RL stack installed (uv pip install "stable-baselines3[extra]" gymnasium)
  • Donkey Car Gym environment or equivalent ROS 2-compatible simulator
  • GPU recommended for faster policy training

Learning Objectives

  • Set up and run a basic reinforcement learning training loop in simulation
  • Formulate lane following as a Markov Decision Process (MDP)
  • Implement a reward function balancing lane centering and speed
  • Train an RL agent (e.g., PPO) and monitor convergence
  • Evaluate policy robustness and plan sim-to-real transfer steps

1. Set Up the Simulator

  1. Install Donkey Car Gym:
    git clone https://github.com/tawnkramer/donkey_gym.git
    cd donkey_gym
    pip install -e .
    
  2. Launch the simulator:
    python scripts/donkey_sim.py --headless 0 --level 1
    
  3. Confirm the control loop frequency (~20 Hz) and the observation structure; a quick check is sketched below.
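
A quick sanity check (a sketch only), assuming the DonkeyEnv constructor used in the next section; exact keyword arguments and the [steering, throttle] action format may differ between donkey_gym versions:

import time
import numpy as np
from donkey_gym.envs.donkey_env import DonkeyEnv

env = DonkeyEnv(level=1, frame_skip=1, camera_height=80, camera_width=160)
obs = env.reset()
print('observation shape:', np.asarray(obs).shape)  # expect (80, 160, 3)

# Step with a neutral action and estimate the effective control rate
start = time.time()
n_steps = 100
for _ in range(n_steps):
    env.step(np.array([0.0, 0.0], dtype=np.float32))
print('approx. control rate: %.1f Hz' % (n_steps / (time.time() - start)))
env.close()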

2. Define the Environment Wrapper

Create fri_rl/envs/donkey_lane.py:

import gymnasium as gym
import numpy as np
from donkey_gym.envs.donkey_env import DonkeyEnv

class LaneFollowEnv(gym.Env):
    metadata = {'render_modes': ['rgb_array']}

    def __init__(self):
        super().__init__()
        self.env = DonkeyEnv(level=1, frame_skip=1, camera_height=80, camera_width=160)
        # Raw RGB camera frames from the simulator
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(80, 160, 3), dtype=np.uint8)
        # Continuous action: [steering in [-1, 1], throttle in [0, 1]]
        self.action_space = gym.spaces.Box(low=np.array([-1.0, 0.0]), high=np.array([1.0, 1.0]), dtype=np.float32)

    def step(self, action):
        # The underlying DonkeyEnv follows the old Gym API and returns a 4-tuple
        obs, _, done, info = self.env.step(action)
        # Recompute the reward from cross-track error and speed via the lane_reward helper below
        lane_error = info.get('cte', 0.0)
        speed = info.get('speed', 0.0)
        reward = lane_reward(lane_error, speed)
        # Gymnasium expects (obs, reward, terminated, truncated, info)
        return obs, reward, done, False, info

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # DonkeyEnv.reset() returns only the observation; Gymnasium expects (obs, info)
        return self.env.reset(), {}

    def render(self):
        return self.env.viewer.render(return_rgb_array=True)

    def close(self):
        self.env.close()

Add a helper function lane_reward(cte, speed) that returns 1.0 - abs(cte) - 0.1 * (1.0 - speed / speed_ref), where speed_ref is the target cruising speed for the track.
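
A minimal sketch of the helper; the default speed_ref of 1.0 is an assumed placeholder to tune for your track:

def lane_reward(cte: float, speed: float, speed_ref: float = 1.0) -> float:
    """Reward staying centered (small |cte|) while penalizing speeds below speed_ref."""
    return 1.0 - abs(cte) - 0.1 * (1.0 - speed / speed_ref)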

3. Train a PPO Agent

  1. Script train_ppo.py:
    from stable_baselines3 import PPO
    from stable_baselines3.common.callbacks import EvalCallback
    from fri_rl.envs.donkey_lane import LaneFollowEnv
    
    # Separate simulator instances for training and periodic evaluation
    env = LaneFollowEnv()
    eval_env = LaneFollowEnv()
    
    # CnnPolicy consumes the raw 80x160x3 camera frames; log to logs/ so TensorBoard (step 2) can read the run
    model = PPO('CnnPolicy', env, learning_rate=3e-4, n_steps=2048, batch_size=256,
                tensorboard_log='logs/', verbose=1)
    
    # Evaluate every 10k steps and keep the best checkpoint in artifacts/
    callback = EvalCallback(eval_env, best_model_save_path='artifacts/', eval_freq=10000, deterministic=True)
    model.learn(total_timesteps=1_000_000, callback=callback)
    model.save('artifacts/ppo_lane_follow')
    
  2. Monitor training with TensorBoard (tensorboard --logdir logs/).
  3. Expect convergence within 0.5–1.0 million steps depending on reward shaping.

4. Evaluate and Visualize

  • Run scripted evaluations to calculate mean cross-track error (CTE) and episode length (see the sketch after this list).
  • Record videos (model.get_env().render()) to present qualitative results.
  • Analyze failure cases (sharp turns, lighting changes) and adjust reward or curriculum.
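
A minimal evaluation loop, assuming the wrapper and checkpoint from the previous sections (the episode count of 10 and the 2000-step cap are arbitrary choices):

import numpy as np
from stable_baselines3 import PPO
from fri_rl.envs.donkey_lane import LaneFollowEnv

env = LaneFollowEnv()
model = PPO.load('artifacts/ppo_lane_follow')

episode_lengths, episode_ctes = [], []
for _ in range(10):
    obs, _ = env.reset()
    done, steps, ctes = False, 0, []
    while not done and steps < 2000:  # step cap in case the episode never terminates
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        ctes.append(abs(info.get('cte', 0.0)))
        steps += 1
    episode_lengths.append(steps)
    episode_ctes.append(np.mean(ctes))

print('mean episode length:', np.mean(episode_lengths))
print('mean |CTE|:', np.mean(episode_ctes))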

5. Plan Sim-to-Real Transfer

  • Compare observation modalities: if the real car uses grayscale images, replicate in sim.
  • Apply domain randomization (textures, lighting) during training to improve robustness.
  • Export the policy (model.policy) and convert it to TorchScript for ROS 2 deployment; a tracing sketch follows.
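
One possible export path, sketched under the assumption that tracing stable-baselines3's actor-critic policy works for this model (the PolicyWrapper class and output filename are illustrative; tracing details can vary across SB3 and PyTorch versions):

import torch
from stable_baselines3 import PPO

model = PPO.load('artifacts/ppo_lane_follow')
policy = model.policy.to('cpu').eval()

class PolicyWrapper(torch.nn.Module):
    """Expose only the deterministic action output for deployment."""
    def __init__(self, policy):
        super().__init__()
        self.policy = policy

    def forward(self, obs):
        # ActorCriticPolicy.forward returns (actions, values, log_prob); keep the actions
        actions, _, _ = self.policy(obs, deterministic=True)
        return actions

# Dummy input matching the policy's channel-first image layout (batch, C, H, W)
example_obs = torch.zeros(1, 3, 80, 160, dtype=torch.uint8)
traced = torch.jit.trace(PolicyWrapper(policy), example_obs)
traced.save('artifacts/ppo_lane_follow_ts.pt')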

6. Safety Considerations

  • Before real-world trials, test the RL policy in a hardware-in-the-loop setup with velocity caps.
  • Keep manual override ready and limit gains to reduce aggressive maneuvers.
  • Use the MLP policy as a fallback if the RL controller diverges; a rough sketch of this gating logic follows.
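
A rough illustration of the velocity-cap and fallback ideas (the thresholds and the safety_gate/mlp_action names are hypothetical placeholders, not part of the lab code):

import numpy as np

THROTTLE_CAP = 0.3   # hypothetical hardware-in-the-loop throttle limit
MAX_ABS_CTE = 2.0    # hypothetical divergence threshold, in the simulator's cte units

def safety_gate(rl_action, mlp_action, cte):
    """Clamp throttle and hand control to the MLP fallback if the RL policy diverges."""
    steering, throttle = float(rl_action[0]), float(rl_action[1])
    if abs(cte) > MAX_ABS_CTE:
        # RL controller has drifted too far from the lane center; use the fallback command
        steering, throttle = float(mlp_action[0]), float(mlp_action[1])
    return np.array([steering, min(throttle, THROTTLE_CAP)], dtype=np.float32)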

Wrap-Up

Archive training logs, best-performing checkpoints, and environment configuration files. Document lessons learned on reward shaping and domain randomization to inform future sim-to-real experiments.