Cohort starts May 15, 2026 · Limited seats

Reinforcement Learning
in Production

RL Actually Works. We'll Show You.

A 4-week intensive workshop. Train your own personal AI assistant with GRPO. Deploy RL in embodied AI and LLMs in production. Explore the frontier of RL research.

14
Core Lectures
8
Capstone Projects
9+
Guest Speakers
4
Weeks Live
Speakers & researchers from
OpenAI · DeepMind · NVIDIA · Meta · Microsoft · Apple · Tesla · MIT · Caltech · CMU · Princeton · Harvard · Perplexity · Sierra · IIT Madras
Curriculum

Two Phases. Zero to Production.

Phase 1 builds your RL foundations from scratch. Phase 2 throws you into real production projects.

Week 1 · Lectures 1-2

Fundamentals of RL

MDPs, Bellman equations, value functions, exploration vs exploitation. Building intuition from first principles.

MDPs · Bellman Equations · Value Iteration · Policy Iteration
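A taste of what Week 1 covers — value iteration on a toy MDP. The two-state chain below (its states, actions, rewards, and transitions) is invented purely for illustration:

```python
# Value iteration: V(s) <- max_a sum_s' P(s'|s,a) * (R + gamma * V(s'))
# Toy 2-state MDP: P[s][a] = list of (prob, next_state, reward).
GAMMA = 0.9
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def value_iteration(P, gamma=GAMMA, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: take the best action-value.
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                 for outcomes in P[s].values()]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration(P)
```

Staying in state 1 forever yields 2 per step, so V(1) converges to 2/(1 − γ) = 20, and V(0) to 1 + γ·20 = 19 — the kind of first-principles intuition Lecture 1 builds.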
Week 1 · Lectures 3-4

Deep Q-Networks (DQN)

From tabular Q-learning to deep Q-networks. Experience replay, target networks, Double DQN, Dueling DQN, and Rainbow.

Q-Learning · DQN · Double DQN · Rainbow
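The core trick behind DQN's stability is experience replay — sampling past transitions uniformly to break temporal correlation. A minimal buffer sketch (this simplified version is ours, not a specific library's API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates the minibatch from the trajectory order.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

In the labs you pair this with a target network that is only periodically synced — the second ingredient that keeps Q-learning stable with function approximation.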
Week 2 · Lectures 5-6

Policy Gradients & Actor-Critic

REINFORCE, advantage estimation (GAE), A2C/A3C. The policy gradient theorem and variance reduction.

REINFORCE · GAE · A2C · Actor-Critic
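Variance reduction is the heart of this session. Generalized Advantage Estimation (GAE) blends TD errors with an exponentially-weighted lookahead — a minimal sketch, assuming `values` holds the critic's estimate V(s_t) for each step:

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """GAE: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t);
    A_t = delta_t + gamma*lam*A_{t+1}, computed backwards."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else last_value
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With λ = 0 this collapses to the one-step TD error (low variance, high bias); with λ = 1 it becomes the full Monte Carlo return minus the baseline (high variance, low bias).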
Week 2 · Lectures 7-8

PPO, TRPO & RLHF

Trust regions, clipped objectives, KL penalty. Why PPO is the backbone of RLHF. The full RLHF pipeline — reward modeling, PPO training, and alignment.

PPO · TRPO · RLHF · Reward Modeling · Alignment
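The clipped objective fits in a few lines — here it is for a single (state, action) sample, with `ratio` = π_new(a|s) / π_old(a|s):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).
    Taking the min makes the update pessimistic: the policy gets no extra
    credit for moving the ratio outside the trust region."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)
```

For a positive advantage, pushing the ratio above 1 + ε buys nothing; for a negative one, the clip stops the policy from over-correcting — the simple mechanism that makes PPO stable enough to anchor the RLHF pipeline.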
Week 3 · Lectures 9-10

GRPO & Its Variations

Group Relative Policy Optimization — the algorithm behind DeepSeek-R1. Online GRPO, Mini-batch GRPO, DAPO, and Dr. GRPO.

GRPO · DAPO · Dr. GRPO · Online GRPO
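GRPO's key move is replacing the learned value baseline with a group-relative one: sample several responses per prompt, score each, and normalize rewards within the group. A sketch of that advantage computation:

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Advantage of each response = (reward - group mean) / group std.
    No critic network needed: the group itself is the baseline."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Variants like Dr. GRPO adjust exactly this step (e.g. dropping the std normalization to remove a length bias) — which is why we walk through the formula before the frameworks.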
Week 3 · Lectures 11-12

DPO, SimPO & Preference Optimization

Direct Preference Optimization and its successors. SimPO's length-normalized formulation, IPO, KTO, ORPO.

DPO · SimPO · IPO · KTO · ORPO
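DPO skips the reward model entirely: the loss for one preference pair depends only on log-probabilities under the policy and a frozen reference model. A numerically naive single-pair sketch (production code would use a stable log-sigmoid):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * implicit reward margin), where each implicit
    reward is the policy-vs-reference log-prob ratio for that response."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    logits = beta * margin
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)
```

When the policy equals the reference the margin is zero and the loss is log 2; training pushes the chosen response's log-prob ratio above the rejected one's. SimPO's change is to drop the reference terms and length-normalize the log-probs.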
Week 4 · Lectures 13-14

Agentic RL — DeepEyes & Beyond

RL for autonomous agents. DeepEyes for visual reasoning, SWE-RL for code generation, RLEF for multi-turn feedback. The frontier of RL + LLMs.

DeepEyes · SWE-RL · RLEF · Agentic RL
Week 1 · Lectures 1-2

RL Training at Scale

Distributed RL training with veRL and OpenRLHF. Multi-GPU GRPO, Ray integration, vLLM rollout workers, FSDP pipelines.

veRL · OpenRLHF · Ray · Distributed RL
Week 1 · Lectures 3-4

Environments & Simulation

Building custom RL environments. Gymnasium, MetaDrive for driving, MuJoCo for robotics, Docker-based execution environments.

Gymnasium · MetaDrive · MuJoCo · OpenEnv
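Every simulator in this session speaks the same reset/step contract. A minimal Gymnasium-style environment — this toy "walk right to the goal" chain world is invented for illustration, not taken from any of the libraries above:

```python
class ChainEnv:
    """5-cell chain: start at cell 0, reward 1.0 for reaching the last cell.
    Follows the Gymnasium-style 5-tuple step return:
    (obs, reward, terminated, truncated, info)."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self, seed=None):
        self.pos = 0
        return self.pos, {}  # (observation, info)

    def step(self, action):  # action: 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.pos = max(0, min(self.length - 1, self.pos + move))
        terminated = self.pos == self.length - 1
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, False, {}
```

Because MetaDrive, MuJoCo wrappers, and your own Docker-based environments all expose this same interface, agents trained in the labs swap between them with no algorithm changes.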
Week 2 · Lectures 5-6

Autonomous Driving with RL

MetaDrive-Arena deep dive. PPO racing agents, multi-agent competition, ELO leaderboards, sim-to-real transfer.

MetaDrive-Arena · PPO Racing · Multi-Agent · ELO
Week 2 · Lectures 7-8

Agentic RL for Software Engineering

DeepSWE + rLLM + R2E-Gym stack. RL-powered coding agents that fix real GitHub issues — 59% on SWE-Bench with pure RL.

DeepSWE · rLLM · R2E-Gym · SWE-Bench
Week 3 · Lectures 9-10

Embodied RL & Humanoid Control

Embodied RL for robotics. Humanoid walking, OpenClaw manipulation, SmolVLA for robot learning, sim-to-real transfer, and reward shaping.

Embodied RL · Humanoid · OpenClaw · SmolVLA · Sim2Real
Week 3 · Lectures 11-12

World Models & Imagination

IRIS world model — act in imagined environments. Latent dynamics, Dreamer architectures, model-based RL for sample efficiency.

IRIS · World Models · Dreamer · Model-Based RL
Week 4 · Lectures 13-14

Production Deployment & Evaluation

Shipping RL systems. Reward hacking detection, safety constraints, evaluation pipelines, monitoring, RLHF/RLAIF stack.

Production RL · Safety · Monitoring · RLHF Stack
Frameworks

The Production RL Stack

Hands-on with the frameworks that power RL at scale.

veRL

Hybrid parallel RLHF training. Ray + FSDP integration.

Used in: OpenClaw, RL2F
github.com/volcengine/verl →

OpenRLHF

PPO, DPO, GRPO with Ray + vLLM for 70B+ models.

Used in: OpenClaw, Agentic SWE
github.com/OpenRLHF →

OpenEnv

Build and standardize custom RL environments.

Used in: MetaDrive Arena, Humanoid
github.com/openenv →

Gymnasium

Standard RL environment API. Farama Foundation.

Used in: MetaDrive Arena, Humanoid, IRIS
github.com/Farama-Foundation →

CleanRL

Single-file RL implementations for understanding.

Used in: Phase 1 Labs, MetaDrive Arena
github.com/vwxyzjn/cleanrl →

Stable-Baselines3

Reliable PyTorch RL. PPO, SAC, DQN.

Used in: MetaDrive Arena, Humanoid
github.com/DLR-RM/sb3 →

Ray RLlib

Distributed multi-agent RL at scale.

Used in: Agentic SWE, OpenClaw
github.com/ray-project/ray →

MetaDrive

Driving simulator. 1000+ FPS. Bullet physics.

Used in: MetaDrive Racing Arena
github.com/metadriverse →

MuJoCo

Gold standard physics for robotics RL.

Used in: Humanoid Walking, SmolVLA
github.com/deepmind/mujoco →

rLLM

RL for language agents. GRPO/PPO + Ray + vLLM.

Used in: Agentic RL for SWE
github.com/agentica/rllm →

TRL

HuggingFace Transformer RL. SFT, DPO, PPO.

Used in: OpenClaw, RL2F
github.com/huggingface/trl →

Weights & Biases

Experiment tracking and visualization.

Used in: All projects
wandb.ai →
Capstone Projects

Build Real Systems. Not Toy Examples.

Every project ships a working system. These are the portfolio pieces that get you hired.

Project 01

MetaDrive Racing Arena

Train PPO agents for competitive 1v1 autonomous racing. Multi-agent environments, ELO leaderboard, sim-to-real transfer.

PPO · MetaDrive · Multi-Agent · Competition
Agentic RL for Software Engineering — Issue to Patch to PR
Project 02

Agentic RL for SWE

Build an RL-powered coding agent using DeepSWE + rLLM + R2E-Gym. Train on 8.1K real GitHub issues. Target: 59% on SWE-Bench Verified.

DeepSWE · GRPO · rLLM · SWE-Bench
Project 03

OpenClaw: WhatsApp AI with GRPO

Build an open-source WhatsApp AI gateway trained with GRPO on your own conversations. Real-time dashboard, Process Reward Model scoring, asynchronous training on H100 GPUs via RunPod. The model improves while serving responses live.

GRPO · WhatsApp · PRM Scoring · H100 · RunPod
Project 04

SmolVLA Robot Learning

Vision-Language-Action models for robotic control. RL-tuned inference — making small models perform like large ones through targeted RL fine-tuning.

SmolVLA · VLA · Robot Learning · Inference
RL2F: Train on OMNI Math, Transfer on Coding
Project 05

Implementing RL2F: RL with Language Feedback

Implement the RL2F paper from Google DeepMind — a framework that treats in-context learning from feedback as a trainable skill. Build teacher-student didactic interactions, train with multi-turn RL, and reproduce the result where Gemini Flash nearly matches Gemini Pro on HardMath2. Achieve cross-domain generalization to ARC-AGI and Codeforces.

RL2F · DeepMind · Multi-Turn RL · Self-Improvement · Paper Repro
Project 06

Teaching Humanoids to Walk

Train a simulated humanoid to walk using RL. Reward shaping, curriculum learning, MuJoCo environments, and locomotion policy transfer.

Humanoid · MuJoCo · Locomotion · Reward Shaping
IRIS Atari
Project 07

IRIS World Model

Implement the IRIS world model for imagination-based RL. Learn latent dynamics, generate training data from imagined trajectories, and benchmark on Atari.

World Models · IRIS · Model-Based RL · Atari
Project 08

CaP-X RL: The First Coding Agent for Robotics

Reproduce CaP-X RL — the first framework to turn frontier LLMs into coding agents that control real robots. Build CaP-Gym (program-synthesis robot environment), benchmark VLMs on CaP-Bench, run CaP-Agent0 on real embodiments, and train CaP-RL with verifiable rewards for sim-to-real transfer with near-zero gap. Outperforms specialized VLA models on perturbed manipulation tasks.

CaP-X RL · Coding Agents · Robot Manipulation · Verifiable Rewards · Sim2Real
Who Is This For

Built for Builders

💻

ML Engineers

You've trained models but never an RL agent. Understand PPO, GRPO, and the RLHF stack powering LLM alignment.

🎓

Graduate Students

You know the theory but haven't shipped production RL. Bridge the gap between papers and real systems.

🎯

Targeting Top AI Labs

Interviewing at OpenAI, DeepMind, Anthropic, NVIDIA? RL systems knowledge is the differentiator.

🤖

Robotics Engineers

You build hardware. Now train the brains. Sim-to-real, humanoid locomotion, dexterous manipulation.

🧠

LLM Practitioners

Understand the RL layer — RLHF, DPO, GRPO — that turns base models into aligned systems.

🔬

Aspiring Researchers

Research roadmaps, paper reading lists, and mentorship to get your first RL paper published.

From Our Team

Deep Dives We've Written

Read our research and explainers on Substack before the workshop begins.

Guest Speakers

Learn from the Best in RL

Speaker candidates from the top RL labs and universities in the world.

Rishabh Agarwal

Fmr. Staff Research Scientist
Google DeepMind
Prafulla Dhariwal

Technical Fellow, co-creator of PPO
OpenAI
Trapit Bansal

Researcher, ex-OpenAI o1
Meta Superintelligence
Anima Anandkumar

Professor · ex-Sr. Director ML
Caltech / NVIDIA
Ankur Handa

Principal Research Scientist
NVIDIA Robotics
Harshit Sikchi

Research Scientist, GPT-5
OpenAI
Dhruv Batra

Research Director
Meta FAIR
Aviral Kumar

Asst. Professor + Researcher
CMU / DeepMind
Pulkit Agrawal

Assoc. Prof, Improbable AI Lab
MIT CSAIL
Deepak Pathak

Faculty, Curiosity-driven RL
CMU
Karthik Narasimhan

Head of Research, ReAct
Sierra / Princeton
Aravind Srinivas

CEO, ex-OpenAI · DeepMind
Perplexity AI

Speaker candidates — final lineup confirmed closer to cohort start.

Why Vizuara

We Wrote the Book on RL. Literally.

When Manning Publications needed an author for the RL chapter in their DeepSeek book, they came to Vizuara. That's the depth of expertise behind this workshop.

DeepSeek Book — RL Chapter by Vizuara AI Labs
Manning Publications

The RL Chapter in the DeepSeek Book

Dr. Rajat Dandekar authored the reinforcement learning chapter in Manning's DeepSeek book — covering the algorithms, training pipelines, and production techniques that power state-of-the-art reasoning models.

This isn't a team that learned RL from tutorials. Vizuara has the research depth to write the textbook and the engineering experience to ship production systems. When you enroll in this workshop, you're learning from the people publishers trust to explain RL to the world.

Published by
Manning Publications
Topic
RL for Reasoning Models
Coverage
GRPO, PPO, RLHF, DeepSeek-R1
Your Instructor

Meet Dr. Rajat Dandekar

Dr. Rajat Dandekar

Founder, Vizuara AI Labs · Purdue PhD

Dr. Dandekar has successfully taught the acclaimed "Reasoning LLM from Scratch" course, helping hundreds of students master complex AI concepts through practical, hands-on learning.

With extensive research experience in reinforcement learning and deep learning at top-tier institutions, Dr. Dandekar brings cutting-edge knowledge directly to the classroom. This workshop is born from the conviction that RL actually works — and the gap isn't in the algorithms, it's in knowing how to ship them.

200+
Engineers Taught
50K+
YouTube Subs
Purdue
PhD
Enroll

Choose Your Path

Each phase is self-contained. Bundle for the full journey.

Phase 1: RL Foundations

DQN, Policy Gradients, PPO, TRPO, GRPO, DPO, SimPO, Agentic RL. 7 lectures + labs.

₹45,000

Phase 2: Production RL

8 capstone projects — racing agents, humanoid walking, agentic SWE, robotics coding agents, world models.

₹55,000

Both Phases — Complete Workshop

Full 4-week journey. 14 lectures, 8 projects, all labs. Best value.

₹85,000 (regular ₹1,00,000)

+ Guest Speaker Pass

All industry guest sessions from OpenAI, DeepMind, NVIDIA, Meta FAIR, and more.

+ ₹50,000

+ Research Starter Kit

Personalized roadmap, curated reading list (12-15 papers), code template, draft outline.

+ ₹15,000

+ 1:1 Mentorship (2 Months)

Bi-weekly 1:1 with Dr. Rajat or senior mentors. Research, career, publication support.

+ ₹70,000

Your Selection

Phase 1: RL Foundations · ₹45,000
Total · ₹45,000

EMI available · Lifetime recording access

FAQ

Common Questions

Do I need prior RL experience?

No. Phase 1 starts from fundamentals — MDPs, value functions, Q-learning. Python + basic ML knowledge is enough. Phase 2 assumes Phase 1 knowledge.

What hardware do I need?

A laptop with a modern CPU for Phase 1. We do not provide cloud GPU access — for large-scale training (veRL, OpenRLHF), you will need to provision your own GPUs via RunPod, Lambda, or similar providers. Robotics simulation (MuJoCo, MetaDrive) runs on CPU. Some projects use Google Colab.

Can I enroll in just one phase?

Yes. Each phase is self-contained. Phase 1 = foundations, Phase 2 = production projects. The bundle is the best value.

Are sessions recorded?

Yes. All live sessions are recorded with lifetime access — core lectures, labs, and guest sessions (with Speaker Pass).

How is this different from a self-paced online course?

You build real production systems — racing agents, humanoid walking, coding agents. Live instruction with industry speakers. The actual production stack — veRL, OpenRLHF, Ray — not just algorithms.

What is the refund policy?

Due to the limited cohort size, there are no refunds after enrollment. Email raj@vizuara.ai with questions before enrolling.