Ball-on-Plate Control: Dynamic Programming and Reinforcement Learning
An academic implementation project (for credit) comparing model-based and model-free control on a physical Ball-on-Plate system.
Overview
This project studies how two different decision-making paradigms behave on the same hardware platform:
- Dynamic Programming (DP) with a known model and explicit planning.
- Reinforcement Learning (RL) with a learned policy using Soft Actor-Critic (SAC).
The system objective is to move a ball to targets (or through maze-like paths) while avoiding obstacles and keeping trajectories smooth. The project emphasizes not only control performance, but also interpretability through live visual diagnostics.
Control Concept
Both controllers optimize the discounted return:
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
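For intuition, the return of a finite episode can be computed by folding the reward sequence backwards; a minimal sketch with a made-up reward sequence:

```python
GAMMA = 0.99  # discount factor (illustrative value)

def discounted_return(rewards, gamma=GAMMA):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode: three step costs followed by a goal bonus.
print(discounted_return([-1.0, -1.0, -1.0, 10.0]))
```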
For DP, planning is based on the Bellman optimality principle. The value iteration update is:
$$V_{k+1}(s) = \max_{a \in \mathcal{A}(s)} \big[\, r(s,a) + \gamma\, V_k(s') \,\big]$$
where s′ denotes the successor state under action a, and the greedy policy is extracted as:
$$\pi^*(s) = \arg\max_{a \in \mathcal{A}(s)} \big[\, r(s,a) + \gamma\, V(s') \,\big].$$
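A compact tabular value-iteration sketch matching these updates, assuming deterministic successor states (as in the simulated transitions described below); the function names are illustrative:

```python
def value_iteration(states, actions, step, reward, gamma=0.95, tol=1e-6):
    """Tabular value iteration for a deterministic MDP.

    actions(s)   -> iterable of actions available in state s
    step(s, a)   -> successor state s' (assumed to be contained in `states`)
    reward(s, a) -> scalar reward r(s, a)
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(reward(s, a) + gamma * V[step(s, a)] for a in actions(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy extraction: pi*(s) = argmax_a [ r(s, a) + gamma * V(s') ]
    policy = {
        s: max(actions(s), key=lambda a: reward(s, a) + gamma * V[step(s, a)])
        for s in states
    }
    return V, policy
```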
For RL (SAC), the agent learns a stochastic policy that trades off expected return against policy entropy (exploration):
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t} \gamma^t \big( r_t + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \big) \right].$$
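For a plain diagonal Gaussian policy the entropy bonus has a closed form; the snippet below only illustrates the shape of the entropy-regularized per-step reward (SAC in practice uses a squashed Gaussian, and all numbers here are made up):

```python
import math

def gaussian_entropy(stds):
    """Differential entropy of a diagonal Gaussian: 0.5 * sum(log(2*pi*e*sigma^2))."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s**2) for s in stds)

alpha = 0.2        # entropy temperature (illustrative)
r_t = -0.05        # hypothetical per-step environment reward
stds = [0.3, 0.3]  # per-dimension action standard deviations
print(r_t + alpha * gaussian_entropy(stds))  # entropy-regularized reward term
```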
Dynamic Programming (Model-Based)
The DP controller is formulated as a discrete MDP and acts as a planner on top of a lower-level PID layer.
- State: grid cell on a 30×30 discretization of the plate.
- Action: relative waypoint shift (Δi,Δj) with Δi,Δj∈{−5,…,5}.
- Transition model: each candidate action is evaluated by simulating the closed-loop PID + nonlinear ball dynamics.
- Planning time scales: outer step Δt_outer = 0.5 s, inner simulation step Δt_inner = 0.008 s.
- Reward design: obstacle penalties, near-obstacle penalty, step cost, stay-still penalty, zero terminal cost at goal.
To keep runtime responsive, transition rollouts are precomputed and cached (with a hash-based cache key derived from the full parameter settings), and value iteration runs in a background thread.
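A minimal sketch of this caching and background-planning idea; `simulate_pid_rollout` is a hypothetical stand-in for the closed-loop PID + nonlinear dynamics rollout, and the sweep budget is arbitrary:

```python
import hashlib
import json
import threading

def cache_key(params: dict) -> str:
    """Hash the full parameter settings (assumed JSON-serializable) so cached rollouts
    are only reused when every relevant setting matches."""
    return hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()

def precompute_transitions(cells, shifts, params, simulate_pid_rollout):
    """Simulate each (grid cell, waypoint shift) pair once; returns {(cell, shift): (next_cell, reward)}."""
    return {
        (cell, shift): simulate_pid_rollout(cell, shift, params)
        for cell in cells
        for shift in shifts
    }

def plan_in_background(table, gamma, sweeps, on_done):
    """Run value iteration over the cached transition table without blocking the control loop."""
    def worker():
        cells = {cell for cell, _ in table}
        V = {cell: 0.0 for cell in cells}
        for _ in range(sweeps):  # fixed sweep budget; a convergence check would also work
            new_V = {cell: float("-inf") for cell in cells}
            for (cell, _shift), (next_cell, r) in table.items():
                new_V[cell] = max(new_V[cell], r + gamma * V.get(next_cell, 0.0))
            V = new_V
        on_done(V)
    threading.Thread(target=worker, daemon=True).start()
```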
Reinforcement Learning (SAC)
The RL controller directly outputs plate-angle commands and is trained in simulation.
- Observation (8D): ball position, velocity, target position, and current plate angles.
- Action (2D): normalized commands in [−1, 1]², scaled to physical tilt setpoints in [−10°, 10°].
- Reward: combines target collection bonus, distance shaping, per-step cost, distance penalty, and stuck penalty.
- Training setup: Stable-Baselines3 SAC, 8 parallel environments, replay buffer, periodic evaluation/checkpointing.
The environment reuses the same nonlinear friction-based dynamics model used for simulation studies and deployment validation.
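One way the observation/action interface could be laid out as a Gymnasium environment; `_ball_dynamics`, the observation ordering, bounds, and reward constants are placeholders rather than the project's actual values:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

MAX_TILT_DEG = 10.0  # actions are scaled into physical tilt setpoints of [-10 deg, 10 deg]

class BallOnPlateEnv(gym.Env):
    """Skeleton environment: 8D observation, 2D normalized action."""

    def __init__(self):
        # Assumed ordering: [ball_x, ball_y, vel_x, vel_y, target_x, target_y, angle_x, angle_y]
        high = np.ones(8, dtype=np.float32)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
        self.state = np.zeros(8, dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(-0.5, 0.5, size=8).astype(np.float32)
        return self.state, {}

    def step(self, action):
        tilt_deg = np.clip(action, -1.0, 1.0) * MAX_TILT_DEG  # scale to physical setpoints
        self.state = self._ball_dynamics(self.state, tilt_deg)
        dist = float(np.linalg.norm(self.state[:2] - self.state[4:6]))
        # Illustrative shaping: per-step cost + distance shaping + collection bonus.
        reward = -0.01 - 0.1 * dist + (5.0 if dist < 0.05 else 0.0)
        terminated = dist < 0.05
        return self.state, reward, terminated, False, {}

    def _ball_dynamics(self, state, tilt_deg):
        # Placeholder: the project plugs in its nonlinear friction-based model here.
        return state
```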
Implementation Architecture
Key modules include grid discretization helpers, PID-integrated transition simulation, MDP building/value iteration logic, RL environment/training scripts, and live projection overlays.
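A condensed view of what the training script setup could look like with Stable-Baselines3, reusing the `BallOnPlateEnv` sketch above; paths, hyperparameters, and timestep counts are illustrative, not the exact project configuration:

```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback

# 8 parallel environments, as in the training setup described above.
env = make_vec_env(BallOnPlateEnv, n_envs=8)
eval_env = make_vec_env(BallOnPlateEnv, n_envs=1)

# Periodic evaluation and checkpointing of the best model so far.
eval_cb = EvalCallback(eval_env, eval_freq=10_000, best_model_save_path="./checkpoints/")

model = SAC("MlpPolicy", env, buffer_size=1_000_000, verbose=1)
model.learn(total_timesteps=2_000_000, callback=eval_cb)
model.save("sac_ball_on_plate")
```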
Visualization and Explainability
Two live overlays make controller behavior visible during experiments:
- DP value-function heatmap with obstacle mask and selected waypoints.
- RL neural-network visualization showing input sliders, hidden-layer activations, top-k edge contributions, and output actions.
This makes it possible to inspect how decisions evolve in real time instead of treating the controller as a black box.
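The project's overlays are projected live onto the plate; as a rough offline approximation, a DP value-function heatmap with an obstacle mask and waypoints could be rendered with matplotlib as below (all data here is synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_value_heatmap(V, obstacle_mask, waypoints):
    """Render a 30x30 value function with obstacle cells masked out and waypoints on top."""
    grid = np.ma.masked_where(obstacle_mask, V)  # hide cells covered by obstacles
    fig, ax = plt.subplots()
    im = ax.imshow(grid, origin="lower", cmap="viridis")
    fig.colorbar(im, ax=ax, label="V(s)")
    if waypoints:
        rows, cols = zip(*waypoints)  # waypoints given as (row, col) grid indices
        ax.plot(cols, rows, "w.-", label="waypoints")
        ax.legend()
    ax.set_title("DP value function")
    plt.show()

# Example usage with synthetic data:
V = np.random.rand(30, 30)
mask = np.zeros((30, 30), dtype=bool)
mask[10:15, 10:15] = True  # a square obstacle
plot_value_heatmap(V, mask, [(2, 2), (8, 6), (15, 20)])
```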

If you want to embed your recordings, place the files in /public/images/ball-on-plate/ and add:

```html
<video controls>
  <source src="/images/ball-on-plate/hardware-demo.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

<video controls>
  <source src="/images/ball-on-plate/simulation-demo.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>
```
Results
- Demonstrated both model-based and model-free control on the same physical setup.
- Built a reproducible pipeline for planning, training, deployment, and live diagnostics.
- Established a clear experimental baseline for future thesis work in robotics and learning-based control.
Future Work
- Extend DP with approximate methods that include velocity in the planner state.
- Improve sim-to-real transfer with domain randomization for SAC.
- Add quantitative benchmarking (success rate, time-to-goal, smoothness, energy) across controllers.