Teaching Robots to Pour Latte Art

Using π0 VLA + Kinesthetic Teaching on OpenDroid R2D3

A weekend hackathon project that trained a dual-arm robot to pour heart patterns using vision-language-action models and flow matching.

📊 View Dataset 💻 GitHub Repo
  • 40 Demonstrations
  • 20Hz Control Frequency
  • 12D Action Space
  • 15k Training Steps

The Process

From human demonstration to autonomous pouring


1. Kinesthetic Teaching

Human guides robot arms through pouring motions


2. Data Collection

3 cameras + 12D joint states at 20Hz


3. Heart Formation

Learning subtle wrist movements and timing
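The 20Hz collection loop behind step 2 can be sketched as follows. This is a minimal sketch, not the project's actual recorder: `read_cameras()` and `read_joint_states()` are hypothetical helpers standing in for the three camera streams and the dual-arm proprioception.

```python
import time

CONTROL_HZ = 20
DT = 1.0 / CONTROL_HZ  # 50 ms per control step

def read_cameras():
    # Hypothetical helper: returns frames for top, left_wrist, right_wrist
    ...

def read_joint_states():
    # Hypothetical helper: returns the 12D state (2 arms x 6 joints)
    ...

def record_episode(duration_s=20.0):
    """Collect one ~20-second demonstration at 20Hz."""
    frames = []
    t_next = time.monotonic()
    while len(frames) < int(duration_s * CONTROL_HZ):
        frames.append({
            "images": read_cameras(),
            "observation.state": read_joint_states(),
            "timestamp": time.monotonic(),
        })
        t_next += DT
        time.sleep(max(0.0, t_next - time.monotonic()))  # hold the 20Hz cadence
    return frames
```

Scheduling against `t_next` rather than sleeping a fixed `DT` keeps the loop from drifting when a camera read runs long, so the saved timestamps stay synchronized across streams.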

The Setup

Hardware and workspace configuration


OpenDroid R2D3 with dual Realman RM65 arms and kinesthetic teaching backpack. The backpack allows demonstrators to physically guide the robot's movements naturally.


Data Collection Props: Milk pitchers, espresso cups, milk frother, and all tools needed for demonstrating professional latte art pours.


The Goal: Professional heart patterns requiring smooth, coordinated pouring

Dataset

Publicly available on HuggingFace Hub

📊 Dataset Stats

  • 40 episodes
  • ~20 seconds per episode
  • LeRobot v3.0 format
  • Parquet + MP4 videos

👁️ Visual Data

  • 3 camera views
  • 640×480 resolution
  • top, left_wrist, right_wrist
  • 20Hz synchronized

🤖 Proprioception

  • 12D joint states
  • Dual RM65 arms (6 DOF each)
  • Position + velocity
  • Gripper state included
# Load the dataset
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("ridxm/latte-pour-demos")
frame = dataset[0]  # indexing yields a single 20Hz frame, not a whole episode

print(f"State shape: {frame['observation.state'].shape}")  # [12]
print(f"Action shape: {frame['action'].shape}")            # [12]
View on HuggingFace →

Technical Details

Built on π0 VLA with flow matching

🧠

π0 Policy

Pre-trained VLA from Physical Intelligence with 10k+ hours of robot data

🌊

Flow Matching

Generates smooth, continuous action sequences for fluid pouring motions
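At inference time, flow matching produces an action chunk by integrating a learned velocity field from Gaussian noise toward the data distribution. A toy sketch of that sampling loop, where `velocity_field` is a hypothetical stand-in for the trained network (here just a field that contracts toward zero, so the dynamics are easy to check):

```python
import numpy as np

ACTION_DIM = 12   # dual 6-DOF arms
CHUNK_SIZE = 20   # 1 s of actions at 20Hz

def velocity_field(x, t, obs):
    # Hypothetical stand-in for the learned network v_theta(x, t | obs);
    # here, a toy field that flows every sample toward the zero chunk.
    return -x

def sample_action_chunk(obs, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t | obs) from t=0 (noise) to t=1 (actions)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((CHUNK_SIZE, ACTION_DIM))  # Gaussian noise init
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_field(x, i * dt, obs)  # explicit Euler step
    return x

chunk = sample_action_chunk(obs=None)
print(chunk.shape)  # (20, 12)
```

The real policy replaces the toy field with a transformer conditioned on images, language, and state; the few-step ODE integration is what makes the sampled 12D trajectories smooth enough for continuous pouring.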

H100 Training

15k training steps with bfloat16 mixed precision

🎯

Temporal Ensemble

Overlapping action chunks with exponential weighting for smooth execution
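Temporal ensembling averages the predictions that successive, overlapping chunks make for the same timestep, down-weighting newer predictions with an exponential decay (the ACT-style scheme). A minimal sketch; the decay constant `m` is an assumption, not the project's tuned value:

```python
import numpy as np

class TemporalEnsembler:
    """Blend overlapping action chunks with exponential weights
    w_i = exp(-m * i), where i = 0 is the oldest prediction
    still covering the current timestep."""

    def __init__(self, chunk_size=20, m=0.1):
        self.chunk_size = chunk_size
        self.m = m
        self.history = []  # (start_step, chunk) pairs, oldest first

    def step(self, t, new_chunk=None):
        if new_chunk is not None:
            self.history.append((t, np.asarray(new_chunk, dtype=float)))
        # drop chunks that no longer cover timestep t
        self.history = [(s, c) for s, c in self.history
                        if s <= t < s + self.chunk_size]
        preds = np.stack([c[t - s] for s, c in self.history])
        weights = np.exp(-self.m * np.arange(len(preds)))  # oldest -> weight 1
        weights /= weights.sum()
        return (weights[:, None] * preds).sum(axis=0)
```

Because every executed action is a weighted blend of several chunks' opinions, chunk boundaries stop producing visible jerks in the pour.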

Architecture

  • Input: 3 cameras + 12D state
  • Output: 12D actions (chunk_size=20)
  • Pretrained: lerobot/pi0_base
  • Fine-tuned on 40 demos

Training Config

  • Batch size: 4
  • Steps: 15,000
  • Chunk size: 20 (1s lookahead)
  • dtype: bfloat16
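The configuration above maps roughly to an invocation like the one below. This is a hedged sketch: `--dataset.repo_id` appears in the project's own instructions, but the other flag names are assumptions that may differ across lerobot versions, and the repo's `scripts/train.py` may hard-code them.

```shell
# Hypothetical flags — check your lerobot version / the repo's scripts/train.py
python scripts/train.py \
  --dataset.repo_id=YOUR_USERNAME/your-dataset \
  --policy.path=lerobot/pi0_base \
  --batch_size=4 \
  --steps=15000
```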

Deployment

  • Inference: 20Hz on robot
  • Action smoothing: EMA filter
  • Temporal ensembling enabled
  • ROS2 Humble integration
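The EMA action filter in the deployment stack can be sketched as below; the smoothing factor `alpha` is an assumption, not the deployed value.

```python
import numpy as np

class EMAActionFilter:
    """Exponential moving average over successive 12D action vectors:
    smoothed_t = alpha * action_t + (1 - alpha) * smoothed_{t-1}."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.smoothed = None

    def __call__(self, action):
        action = np.asarray(action, dtype=float)
        if self.smoothed is None:
            self.smoothed = action  # first action passes through unchanged
        else:
            self.smoothed = self.alpha * action + (1 - self.alpha) * self.smoothed
        return self.smoothed
```

At 20Hz, each (already temporally ensembled) action passes through the filter before being sent to the arms, trading a small amount of lag for suppression of high-frequency jitter in the wrist joints.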

Replicate This Project

Complete guide to train your own latte art robot

1. Data Collection

  • Set up dual-arm robot with 3 cameras
  • Collect 30-50 kinesthetic demonstrations
  • Each demo: ~20 seconds of pouring
  • Upload to HuggingFace Hub

2. Training Setup

  • Spin up H100 GPU (VESSL, Lambda, RunPod)
  • Install: pip install lerobot torch wandb
  • Clone repo and set HF_TOKEN

3. Training

  • Edit scripts/train.py with your dataset
  • Run: python scripts/train.py
  • Monitor on W&B dashboard
  • Download checkpoint

4. Deployment

  • Transfer checkpoint to robot
  • Run: python scripts/deploy.py --model your-model
  • Test with temporal ensembling
# Quick start training
git clone https://github.com/ridxm/latte-art-robot
cd latte-art-robot
pip install lerobot torch wandb

# Set your dataset
# Edit scripts/train.py: --dataset.repo_id=YOUR_USERNAME/your-dataset

python scripts/train.py
View Full Documentation →