Ball-on-Plate Control: Dynamic Programming and Reinforcement Learning
An academic implementation project (for credit) comparing model-based and model-free control on a physical Ball-on-Plate system.
Overview
This project studies how two different decision-making paradigms behave on the same hardware platform:
- Dynamic Programming (DP) with a known model and explicit planning.
- Reinforcement Learning (RL) with a learned policy using Soft Actor-Critic (SAC).
The system objective is to move a ball to targets (or through maze-like paths) while avoiding obstacles and keeping trajectories smooth. The project emphasizes not only control performance, but also interpretability through live visual diagnostics.
Control Concept
Both controllers optimize the discounted return:
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
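For intuition, the return of a finite episode can be computed by folding the reward sequence backwards; a minimal sketch with a made-up reward sequence:

```python
GAMMA = 0.99  # discount factor (illustrative value)

def discounted_return(rewards, gamma=GAMMA):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode: three step costs followed by a goal bonus.
print(discounted_return([-1.0, -1.0, -1.0, 10.0]))
```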
For DP, planning is based on the Bellman optimality principle. The value iteration update is:
$$V_{k+1}(s) = \max_{a \in \mathcal{A}(s)} \big[\, r(s,a) + \gamma\, V_k(s') \,\big]$$
where s′ denotes the successor state under action a, and the greedy policy is extracted as:
$$\pi^*(s) = \arg\max_{a \in \mathcal{A}(s)} \big[\, r(s,a) + \gamma\, V(s') \,\big].$$
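A compact tabular value-iteration sketch matching these updates, assuming deterministic successor states (as in the simulated transitions described below); the function names are illustrative:

```python
def value_iteration(states, actions, step, reward, gamma=0.95, tol=1e-6):
    """Tabular value iteration for a deterministic MDP.

    actions(s)   -> iterable of actions available in state s
    step(s, a)   -> successor state s' (assumed to be contained in `states`)
    reward(s, a) -> scalar reward r(s, a)
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(reward(s, a) + gamma * V[step(s, a)] for a in actions(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy extraction: pi*(s) = argmax_a [ r(s, a) + gamma * V(s') ]
    policy = {
        s: max(actions(s), key=lambda a: reward(s, a) + gamma * V[step(s, a)])
        for s in states
    }
    return V, policy
```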
For RL (SAC), the agent learns a stochastic policy that trades off expected return against policy entropy (exploration):
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t} \gamma^t \big( r_t + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \big) \right].$$
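For a plain diagonal Gaussian policy the entropy bonus has a closed form; the snippet below only illustrates the shape of the entropy-regularized per-step reward (SAC in practice uses a squashed Gaussian, and all numbers here are made up):

```python
import math

def gaussian_entropy(stds):
    """Differential entropy of a diagonal Gaussian: 0.5 * sum(log(2*pi*e*sigma^2))."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s**2) for s in stds)

alpha = 0.2        # entropy temperature (illustrative)
r_t = -0.05        # hypothetical per-step environment reward
stds = [0.3, 0.3]  # per-dimension action standard deviations
print(r_t + alpha * gaussian_entropy(stds))  # entropy-regularized reward term
```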
Dynamic Programming (Model-Based)
The DP controller is formulated as a discrete MDP and acts as a planner on top of a lower-level PID layer.
- State: grid cell on a 30×30 discretization of the plate.
- Action: relative waypoint shift (Δi,Δj) with Δi,Δj∈{−5,…,5}.
- Transition model: each candidate action is evaluated by simulating the closed-loop PID + nonlinear ball dynamics.
- Planning time scales: outer step Δt_outer = 0.5 s, inner simulation step Δt_inner = 0.008 s.
- Reward design: obstacle penalties, near-obstacle penalty, step cost, stay-still penalty, zero terminal cost at goal.
To keep runtime responsive, transition rollouts are precomputed and cached (with a hash-based cache key derived from the full parameter settings), and value iteration runs in a background thread.
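A minimal sketch of this caching and background-planning idea; `simulate_pid_rollout` is a hypothetical stand-in for the closed-loop PID + nonlinear dynamics rollout, and the sweep budget is arbitrary:

```python
import hashlib
import json
import threading

def cache_key(params: dict) -> str:
    """Hash the full parameter settings (assumed JSON-serializable) so cached rollouts
    are only reused when every relevant setting matches."""
    return hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()

def precompute_transitions(cells, shifts, params, simulate_pid_rollout):
    """Simulate each (grid cell, waypoint shift) pair once; returns {(cell, shift): (next_cell, reward)}."""
    return {
        (cell, shift): simulate_pid_rollout(cell, shift, params)
        for cell in cells
        for shift in shifts
    }

def plan_in_background(table, gamma, sweeps, on_done):
    """Run value iteration over the cached transition table without blocking the control loop."""
    def worker():
        cells = {cell for cell, _ in table}
        V = {cell: 0.0 for cell in cells}
        for _ in range(sweeps):  # fixed sweep budget; a convergence check would also work
            new_V = {cell: float("-inf") for cell in cells}
            for (cell, _shift), (next_cell, r) in table.items():
                new_V[cell] = max(new_V[cell], r + gamma * V.get(next_cell, 0.0))
            V = new_V
        on_done(V)
    threading.Thread(target=worker, daemon=True).start()
```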
Reinforcement Learning (SAC)
The RL controller directly outputs plate-angle commands and is trained in simulation.
- Observation (8D): ball position, velocity, target position, and current plate angles.
- Action (2D): normalized commands in [−1, 1]², scaled to physical tilt setpoints in [−10°, 10°].
- Reward: combines target collection bonus, distance shaping, per-step cost, distance penalty, and stuck penalty.
- Training setup: Stable-Baselines3 SAC, 8 parallel environments, replay buffer, periodic evaluation/checkpointing.
The environment reuses the same nonlinear friction-based dynamics model used for simulation studies and deployment validation.
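One way the observation/action interface could be laid out as a Gymnasium environment; `_ball_dynamics`, the observation ordering, bounds, and reward constants are placeholders rather than the project's actual values:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

MAX_TILT_DEG = 10.0  # actions are scaled into physical tilt setpoints of [-10 deg, 10 deg]

class BallOnPlateEnv(gym.Env):
    """Skeleton environment: 8D observation, 2D normalized action."""

    def __init__(self):
        # Assumed ordering: [ball_x, ball_y, vel_x, vel_y, target_x, target_y, angle_x, angle_y]
        high = np.ones(8, dtype=np.float32)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
        self.state = np.zeros(8, dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.uniform(-0.5, 0.5, size=8).astype(np.float32)
        return self.state, {}

    def step(self, action):
        tilt_deg = np.clip(action, -1.0, 1.0) * MAX_TILT_DEG  # scale to physical setpoints
        self.state = self._ball_dynamics(self.state, tilt_deg)
        dist = float(np.linalg.norm(self.state[:2] - self.state[4:6]))
        # Illustrative shaping: per-step cost + distance shaping + collection bonus.
        reward = -0.01 - 0.1 * dist + (5.0 if dist < 0.05 else 0.0)
        terminated = dist < 0.05
        return self.state, reward, terminated, False, {}

    def _ball_dynamics(self, state, tilt_deg):
        # Placeholder: the project plugs in its nonlinear friction-based model here.
        return state
```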
Implementation Architecture
Key modules include grid discretization helpers, PID-integrated transition simulation, MDP building/value iteration logic, RL environment/training scripts, and live projection overlays.
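A condensed view of what the training script setup could look like with Stable-Baselines3, reusing the `BallOnPlateEnv` sketch above; paths, hyperparameters, and timestep counts are illustrative, not the exact project configuration:

```python
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback

# 8 parallel environments, as in the training setup described above.
env = make_vec_env(BallOnPlateEnv, n_envs=8)
eval_env = make_vec_env(BallOnPlateEnv, n_envs=1)

# Periodic evaluation and checkpointing of the best model so far.
eval_cb = EvalCallback(eval_env, eval_freq=10_000, best_model_save_path="./checkpoints/")

model = SAC("MlpPolicy", env, buffer_size=1_000_000, verbose=1)
model.learn(total_timesteps=2_000_000, callback=eval_cb)
model.save("sac_ball_on_plate")
```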
Visualization and Explainability
Two live overlays make controller behavior visible during experiments:
- DP value-function heatmap with obstacle mask and selected waypoints.
- RL neural-network visualization showing input sliders, hidden-layer activations, top-k edge contributions, and output actions.
This makes it possible to inspect how decisions evolve in real time instead of treating the controller as a black box.
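The project's overlays are projected live onto the plate; as a rough offline approximation, a DP value-function heatmap with an obstacle mask and waypoints could be rendered with matplotlib as below (all data here is synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_value_heatmap(V, obstacle_mask, waypoints):
    """Render a 30x30 value function with obstacle cells masked out and waypoints on top."""
    grid = np.ma.masked_where(obstacle_mask, V)  # hide cells covered by obstacles
    fig, ax = plt.subplots()
    im = ax.imshow(grid, origin="lower", cmap="viridis")
    fig.colorbar(im, ax=ax, label="V(s)")
    if waypoints:
        rows, cols = zip(*waypoints)  # waypoints given as (row, col) grid indices
        ax.plot(cols, rows, "w.-", label="waypoints")
        ax.legend()
    ax.set_title("DP value function")
    plt.show()

# Example usage with synthetic data:
V = np.random.rand(30, 30)
mask = np.zeros((30, 30), dtype=bool)
mask[10:15, 10:15] = True  # a square obstacle
plot_value_heatmap(V, mask, [(2, 2), (8, 6), (15, 20)])
```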

If you want to embed your recordings, place the files in /public/images/ball-on-plate/ and add:

```html
<video controls>
  <source src="/images/ball-on-plate/hardware-demo.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

<video controls>
  <source src="/images/ball-on-plate/simulation-demo.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>
```
Results
- Demonstrated both model-based and model-free control on the same physical setup.
- Built a reproducible pipeline for planning, training, deployment, and live diagnostics.
- Established a clear experimental baseline for future thesis work in robotics and learning-based control.
Future Work
- Extend DP with approximate methods that include velocity in the planner state.
- Improve sim-to-real transfer with domain randomization for SAC.
- Add quantitative benchmarking (success rate, time-to-goal, smoothness, energy) across controllers.