Teaching Robots to Pour Latte Art

Using π0 VLA + Kinesthetic Teaching on OpenDroid R2D3

A weekend hackathon project that trained a dual-arm robot to pour heart patterns using vision-language-action models and flow matching.

📊 View Dataset 💻 GitHub Repo
  • 40 Demonstrations
  • 20Hz Control Frequency
  • 12D Action Space
  • 15k Training Steps

The Process

From human demonstration to autonomous pouring


1. Kinesthetic Teaching

Human guides robot arms through pouring motions


2. Data Collection

3 cameras + 12D joint states at 20Hz


3. Heart Formation

Learning subtle wrist movements and timing
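The 20Hz collection loop behind step 2 can be sketched as follows. This is a minimal sketch, not the project's actual recorder: `read_cameras()` and `read_joint_states()` are hypothetical helpers standing in for the three camera streams and the dual-arm proprioception.

```python
import time

CONTROL_HZ = 20
DT = 1.0 / CONTROL_HZ  # 50 ms per control step

def read_cameras():
    # Hypothetical helper: returns frames for top, left_wrist, right_wrist
    ...

def read_joint_states():
    # Hypothetical helper: returns the 12D state (2 arms x 6 joints)
    ...

def record_episode(duration_s=20.0):
    """Collect one ~20-second demonstration at 20Hz."""
    frames = []
    t_next = time.monotonic()
    while len(frames) < int(duration_s * CONTROL_HZ):
        frames.append({
            "images": read_cameras(),
            "observation.state": read_joint_states(),
            "timestamp": time.monotonic(),
        })
        t_next += DT
        time.sleep(max(0.0, t_next - time.monotonic()))  # hold the 20Hz cadence
    return frames
```

Scheduling against `t_next` rather than sleeping a fixed `DT` keeps the loop from drifting when a camera read runs long, so the saved timestamps stay synchronized across streams.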

The Setup

Hardware and workspace configuration


OpenDroid R2D3 with dual Realman RM65 arms and kinesthetic teaching backpack. The backpack allows demonstrators to physically guide the robot's movements naturally.


Data Collection Props: Milk pitchers, espresso cups, milk frother, and all tools needed for demonstrating professional latte art pours.


The Goal: Professional heart patterns requiring smooth, coordinated pouring

Dataset

Publicly available on HuggingFace Hub

📊 Dataset Stats

  • 40 episodes
  • ~20 seconds per episode
  • LeRobot v3.0 format
  • Parquet + MP4 videos

👁️ Visual Data

  • 3 camera views
  • 640×480 resolution
  • top, left_wrist, right_wrist
  • 20Hz synchronized

🤖 Proprioception

  • 12D joint states
  • Dual RM65 arms (6 DOF each)
  • Position + velocity
  • Gripper state included
# Load the dataset
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("ridxm/latte-pour-demos")
frame = dataset[0]  # indexing yields a single 20Hz frame, not a whole episode

print(f"State shape: {frame['observation.state'].shape}")  # [12]
print(f"Action shape: {frame['action'].shape}")            # [12]
View on HuggingFace →

Technical Details

Built on π0 VLA with flow matching

🧠

π0 Policy

Pre-trained VLA from Physical Intelligence with 10k+ hours of robot data

🌊

Flow Matching

Generates smooth, continuous action sequences for fluid pouring motions
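At inference time, flow matching produces an action chunk by integrating a learned velocity field from Gaussian noise toward the data distribution. A toy sketch of that sampling loop, where `velocity_field` is a hypothetical stand-in for the trained network (here just a field that contracts toward zero, so the dynamics are easy to check):

```python
import numpy as np

ACTION_DIM = 12   # dual 6-DOF arms
CHUNK_SIZE = 20   # 1 s of actions at 20Hz

def velocity_field(x, t, obs):
    # Hypothetical stand-in for the learned network v_theta(x, t | obs);
    # here, a toy field that flows every sample toward the zero chunk.
    return -x

def sample_action_chunk(obs, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t | obs) from t=0 (noise) to t=1 (actions)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((CHUNK_SIZE, ACTION_DIM))  # Gaussian noise init
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_field(x, i * dt, obs)  # explicit Euler step
    return x

chunk = sample_action_chunk(obs=None)
print(chunk.shape)  # (20, 12)
```

The real policy replaces the toy field with a transformer conditioned on images, language, and state; the few-step ODE integration is what makes the sampled 12D trajectories smooth enough for continuous pouring.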

H100 Training

15k training steps with bfloat16 mixed precision

🎯

Temporal Ensemble

Overlapping action chunks with exponential weighting for smooth execution
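Temporal ensembling averages the predictions that successive, overlapping chunks make for the same timestep, down-weighting newer predictions with an exponential decay (the ACT-style scheme). A minimal sketch; the decay constant `m` is an assumption, not the project's tuned value:

```python
import numpy as np

class TemporalEnsembler:
    """Blend overlapping action chunks with exponential weights
    w_i = exp(-m * i), where i = 0 is the oldest prediction
    still covering the current timestep."""

    def __init__(self, chunk_size=20, m=0.1):
        self.chunk_size = chunk_size
        self.m = m
        self.history = []  # (start_step, chunk) pairs, oldest first

    def step(self, t, new_chunk=None):
        if new_chunk is not None:
            self.history.append((t, np.asarray(new_chunk, dtype=float)))
        # drop chunks that no longer cover timestep t
        self.history = [(s, c) for s, c in self.history
                        if s <= t < s + self.chunk_size]
        preds = np.stack([c[t - s] for s, c in self.history])
        weights = np.exp(-self.m * np.arange(len(preds)))  # oldest -> weight 1
        weights /= weights.sum()
        return (weights[:, None] * preds).sum(axis=0)
```

Because every executed action is a weighted blend of several chunks' opinions, chunk boundaries stop producing visible jerks in the pour.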

Architecture

  • Input: 3 cameras + 12D state
  • Output: 12D actions (chunk_size=20)
  • Pretrained: lerobot/pi0_base
  • Fine-tuned on 40 demos

Training Config

  • Batch size: 4
  • Steps: 15,000
  • Chunk size: 20 (1s lookahead)
  • dtype: bfloat16
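The configuration above maps roughly to an invocation like the one below. This is a hedged sketch: `--dataset.repo_id` appears in the project's own instructions, but the other flag names are assumptions that may differ across lerobot versions, and the repo's `scripts/train.py` may hard-code them.

```shell
# Hypothetical flags — check your lerobot version / the repo's scripts/train.py
python scripts/train.py \
  --dataset.repo_id=YOUR_USERNAME/your-dataset \
  --policy.path=lerobot/pi0_base \
  --batch_size=4 \
  --steps=15000
```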

Deployment

  • Inference: 20Hz on robot
  • Action smoothing: EMA filter
  • Temporal ensembling enabled
  • ROS2 Humble integration
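The EMA action filter in the deployment stack can be sketched as below; the smoothing factor `alpha` is an assumption, not the deployed value.

```python
import numpy as np

class EMAActionFilter:
    """Exponential moving average over successive 12D action vectors:
    smoothed_t = alpha * action_t + (1 - alpha) * smoothed_{t-1}."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.smoothed = None

    def __call__(self, action):
        action = np.asarray(action, dtype=float)
        if self.smoothed is None:
            self.smoothed = action  # first action passes through unchanged
        else:
            self.smoothed = self.alpha * action + (1 - self.alpha) * self.smoothed
        return self.smoothed
```

At 20Hz, each (already temporally ensembled) action passes through the filter before being sent to the arms, trading a small amount of lag for suppression of high-frequency jitter in the wrist joints.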

Replicate This Project

Complete guide to train your own latte art robot

1. Data Collection

  • Set up dual-arm robot with 3 cameras
  • Collect 30-50 kinesthetic demonstrations
  • Each demo: ~20 seconds of pouring
  • Upload to HuggingFace Hub

2. Training Setup

  • Spin up H100 GPU (VESSL, Lambda, RunPod)
  • Install: pip install lerobot torch wandb
  • Clone repo and set HF_TOKEN

3. Training

  • Edit scripts/train.py with your dataset
  • Run: python scripts/train.py
  • Monitor on W&B dashboard
  • Download checkpoint

4. Deployment

  • Transfer checkpoint to robot
  • Run: python scripts/deploy.py --model your-model
  • Test with temporal ensembling
# Quick start training
git clone https://github.com/ridxm/latte-art-robot
cd latte-art-robot
pip install lerobot torch wandb

# Set your dataset
# Edit scripts/train.py: --dataset.repo_id=YOUR_USERNAME/your-dataset

python scripts/train.py
View Full Documentation →