Reinforcement Learning
Define a reward function and train an RL policy in-sim to convergence (smoothed reward >= 6.0) with a greedy rollout that reaches the goal safely and efficiently.
Try this first — before any explanation.
Same rover, new course C2 (open arena, one goal pad, one hazard). There are no labels and no expert — you're handed a training loop already wired up. It calls one function you must write: reward(obs, action, next_obs). Make the rover reliably reach the goal by writing only the reward. The trap: the obvious sparse reward (+1 at goal, 0 otherwise) leaves the curve flat near zero — the policy almost never stumbles onto the goal by chance. Reward design, not the algorithm, is the lever.
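As a reference point, the sparse reward described above could look like the sketch below. The observation layout is an assumption: it treats `next_obs` as a dict with a boolean `reached` flag, which the real harness may not match.

```python
def sparse_reward(obs, action, next_obs):
    """Sparse baseline: +1 on the step that reaches the goal, 0 otherwise."""
    # Assumption: next_obs is a dict carrying a boolean "reached" flag.
    return 1.0 if next_obs["reached"] else 0.0


# A step toward the goal and a step away from it score the same (0.0),
# so the policy gets no hint about which direction is better.
closer = sparse_reward({"goal_dist": 3.0}, None, {"goal_dist": 2.5, "reached": False})
farther = sparse_reward({"goal_dist": 3.0}, None, {"goal_dist": 3.5, "reached": False})
```

Because `closer` and `farther` are equal, every non-goal step looks identical to the learner, which is why the curve stays flat.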
Write the reward function (progress + safety + time) so the provided RL loop converges.
The idea, built visually.
Last lesson you gave the rover answers — labeled examples. But what if all you can give it is a score, higher is better, and let it figure out the rest? Think of the reward as a landscape and learning as climbing it. A reward that's flat everywhere except one pinprick at the goal? The policy is blind — it wanders, never feeling which way is up.
Shape the landscape so getting closer already pays a little, and now there's a slope to climb. Episode by episode the policy nudges toward actions that scored well; watch the smoothed reward rise and cross the line — that's convergence. We never told it the path; we shaped the incentive, and the path fell out.
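One common way to write the "slope" down is potential-based shaping. The sketch below assumes a potential equal to the negative distance to the goal, scaled by a weight k, and no discounting (γ = 1); both choices are illustrative assumptions.

```latex
\Phi(s) = -k \cdot \mathrm{goal\_dist}(s), \qquad
F(s, s') = \gamma\,\Phi(s') - \Phi(s)

% With \gamma = 1 the per-step shaping term is the shrink in distance:
F(s, s') = k\,\bigl(\mathrm{goal\_dist}(s) - \mathrm{goal\_dist}(s')\bigr)

% Summed over an episode s_0, \dots, s_T the terms telescope:
\sum_{t=0}^{T-1} F(s_t, s_{t+1}) = k\,\bigl(\mathrm{goal\_dist}(s_0) - \mathrm{goal\_dist}(s_T)\bigr)
```

The telescoped sum depends only on where the episode starts and ends, so wandering in circles earns nothing extra; only a net approach to the goal pays.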
▣ Stage animation: The arena lifts into a 3-D reward surface: with sparse reward a flat plain with one lonely spike, the policy ball wandering; it morphs to a gentle slope toward the goal and the ball rolls uphill; a split shows the reward curve crossing a dashed convergence threshold while the rover's C2 path straightens episode by episode.
Build it up, step by step.
- Step 1 (worked): run the harness with the sparse reward and plot the flat curve.
- Step 2 (worked): use the provided potential-based progress term, 2.0 * (per-step shrink in goal_dist), plus the terminal bonus.
- Step 3 (faded): add the hazard penalty (it must exceed the progress gained by cutting the corner) and a small per-step time penalty.
- Step 4 (independent): train to convergence and evaluate the greedy rollout.
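The four terms named in these steps might be combined as in the sketch below. Only the 2.0 progress weight comes from the lesson; the bonus and penalty values, and the dict keys `goal_dist`, `in_hazard` and `reached`, are assumptions chosen for illustration and will likely need tuning against the real harness.

```python
PROGRESS_WEIGHT = 2.0   # from the lesson: 2.0 * shrink in goal_dist
GOAL_BONUS = 10.0       # assumed terminal bonus
HAZARD_PENALTY = 5.0    # assumed; should exceed the progress a hazard step can earn
TIME_PENALTY = 0.01     # assumed small per-step cost


def reward(obs, action, next_obs):
    """Shaped reward: progress + safety + time, plus a terminal bonus."""
    # Assumption: obs/next_obs are dicts with "goal_dist" (float),
    # "in_hazard" (bool) and "reached" (bool).
    progress = obs["goal_dist"] - next_obs["goal_dist"]  # positive when closing in
    r = PROGRESS_WEIGHT * progress
    if next_obs["in_hazard"]:
        r -= HAZARD_PENALTY  # safety term
    r -= TIME_PENALTY        # time term: every step costs a little
    if next_obs["reached"]:
        r += GOAL_BONUS      # arriving, not just moving, is what pays most
    return r


safe_step = reward(
    {"goal_dist": 3.0}, None,
    {"goal_dist": 2.5, "in_hazard": False, "reached": False},
)
hazard_step = reward(
    {"goal_dist": 3.0}, None,
    {"goal_dist": 2.5, "in_hazard": True, "reached": False},
)
```

With these assumed weights, the same half-unit of progress is worth a small positive reward on a safe step and a clearly negative one inside the hazard, so clipping the zone never nets a gain on that step.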
How the Bench grades your run.
PASS WHEN Smoothed final reward >= 6.0 and non-decreasing over the last 100 episodes, greedy rollout reaches the goal with no hazard entry in <= 220 steps, and re-converges at seed 441.
- FAIL: final reward low and curve flat. The reward is sparse; add a dense per-step term for closing goal_dist (potential-based shaping).
- FAIL: converged but collided in hazard. The hazard penalty is smaller than the progress gained by cutting the corner; raise the penalty above the progress gained by clipping the zone.
- FAIL: reward high but reached=False. The shaping rewards moving, not arriving; keep the terminal bonus and confirm that progress is the shrink in goal_dist from obs to next_obs (obs goal_dist minus next_obs goal_dist), so it is positive only when the rover closes in.
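To self-check a run before submitting, the pass criteria above can be approximated as in the sketch below. The Bench's actual smoothing window and its exact test for "non-decreasing" are not stated here, so the 50-episode trailing average and the endpoint comparison are assumptions, as are all function names.

```python
def moving_average(values, window=50):
    """Trailing moving average; early entries average over what exists so far."""
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # drop the value that left the window
        out.append(total / min(i + 1, window))
    return out


def curve_passes(episode_rewards, threshold=6.0, tail=100, window=50):
    """Check the reward curve against the threshold and trend criteria."""
    if len(episode_rewards) < tail:
        return False
    smoothed = moving_average(episode_rewards, window)
    final_ok = smoothed[-1] >= threshold
    # Assumed reading of "non-decreasing over the last 100 episodes":
    # the smoothed curve ends no lower than it stood `tail` episodes earlier.
    trend_ok = smoothed[-1] >= smoothed[-tail]
    return final_ok and trend_ok


def rollout_passes(reached, hazard_entries, steps, max_steps=220):
    """Check the greedy rollout: goal reached, no hazard entry, within the step budget."""
    return reached and hazard_entries == 0 and steps <= max_steps
```

A curve that sits above 6.0 but is drifting downward fails `curve_passes` under this reading, which mirrors the requirement that the run has converged and is not decaying.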
Bring back what you've already mastered.
- From 3.1: what does RL have that supervised classification did not, that let it learn with no labels? (A reward signal / trial-and-error.)
- From 2.2/2.3: which is easier to guarantee never enters the hazard, the FSM rule or the learned policy, and why? (Rules give guarantees; learned policies give statistics.)
- From 2.1: re-train with heading_err removed from the observation — what happens to convergence and why? (The policy can only learn from what's in its state.)
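For the last question, one simple way to run the ablation is to strip the feature before the policy sees it. This is a minimal sketch that assumes observations are plain dicts; the helper name `without_feature` is hypothetical and not part of the provided harness.

```python
def without_feature(obs, name="heading_err"):
    """Return a copy of obs with one feature removed; the original is untouched."""
    # Assumption: obs is a dict keyed by feature name.
    return {key: value for key, value in obs.items() if key != name}


full_obs = {"goal_dist": 2.5, "heading_err": 0.3}
ablated_obs = without_feature(full_obs)
```

Feeding `ablated_obs` to the policy in place of `full_obs` lets you compare the two training curves directly.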
What you must demonstrate to advance.
Your reward drives the RL loop to a smoothed reward >= 6.0 that is non-decreasing over the last 100 episodes, the greedy rollout reaches the goal with no hazard entry in <= 220 steps, and training re-converges at seed 441 (L3: design a reward for a safe, efficient policy and repair a degenerate curve).
How this feeds your build.
Feeds the capstone (5.1) as the learned navigation component integrated beneath the FSM; in 5.2 its inference loop is a candidate to push to the metal if profiling flags it.