Ball-on-Plate Control: DP + RL on Hardware
Personal · March 2026


Implemented and compared Dynamic Programming and Soft Actor-Critic control on a physical ball-on-plate system with live visual overlays.

Quick Facts

Role
Control and software implementation (DP planner, RL policy, and live visualizations)
Team
1 person (solo)
Domains
Control · Robotics · RL · Embedded · Hardware

Key Decisions

  • Used a waypoint-based DP planner above PID to keep hardware actuation stable while still enabling global planning.
  • Trained SAC in simulation with the same nonlinear dynamics model to reduce transfer gap before deployment.
  • Added real-time value-function and neural-network overlays so controller decisions are inspectable during demos.

Results

  • Validated both controllers on physical hardware for maze navigation and target-tracking tasks.
  • Built reusable tooling for MDP precomputation, reward tuning, and policy diagnostics.
  • Produced a clear side-by-side baseline between model-based planning and model-free reinforcement learning.

Tech Stack

Python · NumPy · Gymnasium · Stable-Baselines3 · Numba · PID Control

About This Project

Ball-on-Plate Control: Dynamic Programming and Reinforcement Learning

Academic implementation project (for credit) comparing model-based and model-free control on a physical Ball-on-Plate system.


Overview

This project studies how two different decision-making paradigms behave on the same hardware platform:

  • Dynamic Programming (DP) with a known model and explicit planning.
  • Reinforcement Learning (RL) with a learned policy using Soft Actor-Critic (SAC).

The system objective is to move a ball to targets (or through maze-like paths) while avoiding obstacles and keeping trajectories smooth. The project emphasizes not only control performance, but also interpretability through live visual diagnostics.


Control Concept

Both controllers maximize the expected discounted return:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

For DP, planning is based on Bellman optimality. Value iteration updates:

$$V_{k+1}(s) = \max_{a \in \mathcal{A}(s)} \left[ r(s,a) + \gamma V_k(s') \right],$$

where $s'$ is the deterministic successor of $(s, a)$ under the simulated dynamics,

and extracts the greedy policy:

$$\pi_*(s) = \arg\max_{a \in \mathcal{A}(s)} \left[ r(s,a) + \gamma V_*(s') \right].$$

For RL (SAC), the agent learns a stochastic policy that balances return and exploration:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_t \gamma^t \left( r_t + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right) \right],$$

where $\alpha$ is the entropy temperature that trades off reward against policy entropy.
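
As a quick numeric illustration of the discounted return $G_t$ above (synthetic numbers, not project data), the sum can be evaluated directly:

```python
import numpy as np

# Illustrative only: discounted return for a short four-step episode.
gamma = 0.99
rewards = np.array([-1.0, -1.0, -1.0, 10.0])  # step costs, then a goal bonus

G0 = np.sum(gamma ** np.arange(len(rewards)) * rewards)
print(G0)  # ~6.73: the goal bonus dominates despite discounting
```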

Dynamic Programming (Model-Based)

The DP controller is formulated as a discrete MDP and acts as a waypoint planner layered above a PID tracking controller.

  • State: grid cell on a $30 \times 30$ discretization of the plate.
  • Action: relative waypoint shift $(\Delta i, \Delta j)$ with $\Delta i, \Delta j \in \{-5, \ldots, 5\}$.
  • Transition model: each candidate action is evaluated by simulating the closed-loop PID + nonlinear ball dynamics.
  • Planning time scales: outer step $\Delta t_{\text{outer}} = 0.5\,\mathrm{s}$, inner simulation step $\Delta t_{\text{inner}} = 0.008\,\mathrm{s}$.
  • Reward design: obstacle penalties, near-obstacle penalty, step cost, stay-still penalty, zero terminal cost at goal.

To keep runtime responsive, transition rollouts are precomputed and cached (the cache key is a hash of the full parameter settings), and value iteration runs in a background thread.
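
A minimal sketch of that value-iteration loop, assuming the cached rollouts have been collapsed into lookup arrays (`next_state` and `reward` are hypothetical names, filled here with random placeholders instead of PID + dynamics rollouts):

```python
import numpy as np

# Value iteration over the flattened 30x30 grid (names hypothetical).
# next_state[s, a] and reward[s, a] stand in for the precomputed rollouts.
N_STATES = 30 * 30           # flattened grid cells
N_ACTIONS = 11 * 11          # waypoint shifts (di, dj) in {-5, ..., 5}^2
GAMMA = 0.95

rng = np.random.default_rng(0)
next_state = rng.integers(0, N_STATES, size=(N_STATES, N_ACTIONS))
reward = rng.uniform(-1.0, 0.0, size=(N_STATES, N_ACTIONS))

V = np.zeros(N_STATES)
for _ in range(1000):
    # Bellman optimality backup, vectorized over all states and actions.
    Q = reward + GAMMA * V[next_state]    # shape (N_STATES, N_ACTIONS)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-6:  # stop once values converge
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy waypoint shift for each grid cell
```

The hash-keyed caching and background-thread execution mentioned above are omitted from the sketch.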


Reinforcement Learning (SAC)

The RL controller directly outputs plate-angle commands and is trained in simulation.

  • Observation (8D): ball position, velocity, target position, and current plate angles.
  • Action (2D): normalized commands in $[-1, 1]^2$, scaled to physical tilt setpoints in $[-10^\circ, 10^\circ]$.
  • Reward: combines target collection bonus, distance shaping, per-step cost, distance penalty, and stuck penalty.
  • Training setup: Stable-Baselines3 SAC, 8 parallel environments, replay buffer, periodic evaluation/checkpointing.

The environment reuses the same nonlinear friction-based dynamics model used for simulation studies and deployment validation.
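
A condensed sketch of how such a training script can look with Stable-Baselines3; the environment below is a stub, and the dynamics, reward terms, and hyperparameters are placeholders rather than the project's actual values:

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env


class BallOnPlateEnv(gym.Env):
    """Skeleton of the training environment (dynamics omitted)."""

    def __init__(self):
        # 8D observation: ball x/y, velocity x/y, target x/y, plate angles.
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(8,))
        # 2D action in [-1, 1]^2, scaled to tilt setpoints in [-10, 10] deg.
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,))

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros(8, dtype=np.float32), {}

    def step(self, action):
        tilt_deg = 10.0 * action  # the real env feeds this into the dynamics
        obs = np.zeros(8, dtype=np.float32)   # placeholder next observation
        reward, terminated, truncated = 0.0, False, False
        return obs, reward, terminated, truncated, {}


if __name__ == "__main__":
    vec_env = make_vec_env(BallOnPlateEnv, n_envs=8)  # 8 parallel envs
    eval_cb = EvalCallback(BallOnPlateEnv(), eval_freq=10_000,
                           best_model_save_path="./checkpoints")
    model = SAC("MlpPolicy", vec_env, buffer_size=1_000_000, verbose=1)
    model.learn(total_timesteps=2_000_000, callback=eval_cb)
```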


Implementation Architecture

Key modules include grid discretization helpers, PID-integrated transition simulation, MDP building/value iteration logic, RL environment/training scripts, and live projection overlays.
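
For orientation, one plausible layout of those modules (file names are illustrative, not the actual repository structure):

```
ball_on_plate/
├── grid.py          # grid discretization helpers
├── transitions.py   # PID-in-the-loop rollout simulation and caching
├── mdp.py           # MDP construction and value iteration
├── rl_env.py        # Gymnasium environment for SAC
├── train_sac.py     # Stable-Baselines3 training entry point
└── overlays.py      # live value-map and network visualizations
```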


Visualization and Explainability

Two live overlays make controller behavior visible during experiments:

  1. DP value-function heatmap with obstacle mask and selected waypoints.
  2. RL neural-network visualization showing input sliders, hidden-layer activations, top-$k$ edge contributions, and output actions.

This makes it possible to inspect how decisions evolve in real time instead of treating the controller as a black box.
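
As a rough idea of the first overlay, a static version of the value heatmap with an obstacle mask can be drawn with matplotlib (all data here is synthetic; the live system projects the map onto the plate instead of a figure window):

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the converged value function and obstacle mask.
V = np.random.default_rng(1).random((30, 30))
obstacles = np.zeros((30, 30), dtype=bool)
obstacles[10:15, 10:20] = True  # hypothetical obstacle block

fig, ax = plt.subplots()
im = ax.imshow(np.ma.masked_where(obstacles, V),   # hide V under obstacles
               cmap="viridis", origin="lower")
ax.imshow(np.ma.masked_where(~obstacles, obstacles.astype(float)),
          cmap="gray", origin="lower", vmin=0.0, vmax=1.0)  # obstacle layer
fig.colorbar(im, ax=ax, label="V(s)")
ax.set_title("DP value function with obstacle mask")
plt.show()
```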



Results

  • Demonstrated both model-based and model-free control on the same physical setup.
  • Built a reproducible pipeline for planning, training, deployment, and live diagnostics.
  • Established a clear experimental baseline for future thesis work in robotics and learning-based control.

Future Work

  • Extend DP with approximate methods that include velocity in the planner state.
  • Improve sim-to-real transfer with domain randomization for SAC.
  • Add quantitative benchmarking (success rate, time-to-goal, smoothness, energy) across controllers.