Cohort starts May 15, 2026 · Limited seats

Reinforcement Learning
in Production

RL Actually Works. We'll Show You.

A 4-week intensive workshop. Train your own personal AI assistant with GRPO. Deploy RL in embodied AI and LLMs in production. Explore the frontier of RL research.

14
Core Lectures
8
Capstone Projects
9+
Guest Speakers
4
Weeks Live
Speakers & researchers from
OpenAI · DeepMind · NVIDIA · Meta · Microsoft · Apple · Tesla · MIT · Caltech · CMU · Princeton · Harvard · Perplexity · Sierra · IIT Madras
Curriculum

Two Phases. Zero to Production.

Phase 1 builds your RL foundations from scratch. Phase 2 throws you into real production projects.

Week 1 · Lectures 1-2

Fundamentals of RL

MDPs, Bellman equations, value functions, exploration vs exploitation. Building intuition from first principles.

MDPs · Bellman Equations · Value Iteration · Policy Iteration
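A taste of what Week 1 covers — value iteration on a toy MDP. The two-state chain below (its states, actions, rewards, and transitions) is invented purely for illustration:

```python
# Value iteration: V(s) <- max_a sum_s' P(s'|s,a) * (R + gamma * V(s'))
# Toy 2-state MDP: P[s][a] = list of (prob, next_state, reward).
GAMMA = 0.9
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def value_iteration(P, gamma=GAMMA, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: take the best action-value.
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                 for outcomes in P[s].values()]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration(P)
```

Staying in state 1 forever yields 2 per step, so V(1) converges to 2/(1 − γ) = 20, and V(0) to 1 + γ·20 = 19 — the kind of first-principles intuition Lecture 1 builds.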
Week 1 · Lectures 3-4

Deep Q-Networks (DQN)

From tabular Q-learning to deep Q-networks. Experience replay, target networks, Double DQN, Dueling DQN, and Rainbow.

Q-Learning · DQN · Double DQN · Rainbow
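The core trick behind DQN's stability is experience replay — sampling past transitions uniformly to break temporal correlation. A minimal buffer sketch (this simplified version is ours, not a specific library's API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates the minibatch from the trajectory order.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

In the labs you pair this with a target network that is only periodically synced — the second ingredient that keeps Q-learning stable with function approximation.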
Week 2 · Lectures 5-6

Policy Gradients & Actor-Critic

REINFORCE, advantage estimation (GAE), A2C/A3C. The policy gradient theorem and variance reduction.

REINFORCE · GAE · A2C · Actor-Critic
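Variance reduction is the heart of this session. Generalized Advantage Estimation (GAE) blends TD errors with an exponentially-weighted lookahead — a minimal sketch, assuming `values` holds the critic's estimate V(s_t) for each step:

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """GAE: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t);
    A_t = delta_t + gamma*lam*A_{t+1}, computed backwards."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else last_value
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With λ = 0 this collapses to the one-step TD error (low variance, high bias); with λ = 1 it becomes the full Monte Carlo return minus the baseline (high variance, low bias).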
Week 2 · Lectures 7-8

PPO, TRPO & RLHF

Trust regions, clipped objectives, KL penalty. Why PPO is the backbone of RLHF. The full RLHF pipeline — reward modeling, PPO training, and alignment.

PPO · TRPO · RLHF · Reward Modeling · Alignment
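The clipped objective fits in a few lines — here it is for a single (state, action) sample, with `ratio` = π_new(a|s) / π_old(a|s):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).
    Taking the min makes the update pessimistic: the policy gets no extra
    credit for moving the ratio outside the trust region."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)
```

For a positive advantage, pushing the ratio above 1 + ε buys nothing; for a negative one, the clip stops the policy from over-correcting — the simple mechanism that makes PPO stable enough to anchor the RLHF pipeline.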
Week 3 · Lectures 9-10

GRPO & Its Variations

Group Relative Policy Optimization — the algorithm behind DeepSeek-R1. Online GRPO, Mini-batch GRPO, DAPO, and Dr. GRPO.

GRPO · DAPO · Dr. GRPO · Online GRPO
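GRPO's key move is replacing the learned value baseline with a group-relative one: sample several responses per prompt, score each, and normalize rewards within the group. A sketch of that advantage computation:

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Advantage of each response = (reward - group mean) / group std.
    No critic network needed: the group itself is the baseline."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Variants like Dr. GRPO adjust exactly this step (e.g. dropping the std normalization to remove a length bias) — which is why we walk through the formula before the frameworks.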
Week 3 · Lectures 11-12

DPO, SimPO & Preference Optimization

Direct Preference Optimization and its successors. SimPO's length-normalized formulation, IPO, KTO, ORPO.

DPO · SimPO · IPO · KTO · ORPO
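DPO skips the reward model entirely: the loss for one preference pair depends only on log-probabilities under the policy and a frozen reference model. A numerically naive single-pair sketch (production code would use a stable log-sigmoid):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * implicit reward margin), where each implicit
    reward is the policy-vs-reference log-prob ratio for that response."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    logits = beta * margin
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)
```

When the policy equals the reference the margin is zero and the loss is log 2; training pushes the chosen response's log-prob ratio above the rejected one's. SimPO's change is to drop the reference terms and length-normalize the log-probs.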
Week 4 · Lectures 13-14

Agentic RL — DeepEyes & Beyond

RL for autonomous agents. DeepEyes for visual reasoning, SWE-RL for code generation, RLEF for multi-turn feedback. The frontier of RL + LLMs.

DeepEyes · SWE-RL · RLEF · Agentic RL
Week 1 · Lectures 1-2

RL Training at Scale

Distributed RL training with veRL and OpenRLHF. Multi-GPU GRPO, Ray integration, vLLM rollout workers, FSDP pipelines.

veRL · OpenRLHF · Ray · Distributed RL
Week 1 · Lectures 3-4

Environments & Simulation

Building custom RL environments. Gymnasium, MetaDrive for driving, MuJoCo for robotics, Docker-based execution environments.

Gymnasium · MetaDrive · MuJoCo · OpenEnv
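Every simulator in this session speaks the same reset/step contract. A minimal Gymnasium-style environment — this toy "walk right to the goal" chain world is invented for illustration, not taken from any of the libraries above:

```python
class ChainEnv:
    """5-cell chain: start at cell 0, reward 1.0 for reaching the last cell.
    Follows the Gymnasium-style 5-tuple step return:
    (obs, reward, terminated, truncated, info)."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self, seed=None):
        self.pos = 0
        return self.pos, {}  # (observation, info)

    def step(self, action):  # action: 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.pos = max(0, min(self.length - 1, self.pos + move))
        terminated = self.pos == self.length - 1
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, False, {}
```

Because MetaDrive, MuJoCo wrappers, and your own Docker-based environments all expose this same interface, agents trained in the labs swap between them with no algorithm changes.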
Week 2 · Lectures 5-6

Autonomous Driving with RL

MetaDrive-Arena deep dive. PPO racing agents, multi-agent competition, ELO leaderboards, sim-to-real transfer.

MetaDrive-Arena · PPO Racing · Multi-Agent · ELO
Week 2 · Lectures 7-8

Agentic RL for Software Engineering

DeepSWE + rLLM + R2E-Gym stack. RL-powered coding agents that fix real GitHub issues — 59% on SWE-Bench with pure RL.

DeepSWE · rLLM · R2E-Gym · SWE-Bench
Week 3 · Lectures 9-10

Embodied RL & Humanoid Control

Embodied RL for robotics. Humanoid walking, OpenClaw manipulation, SmolVLA for robot learning, sim-to-real transfer, and reward shaping.

Embodied RL · Humanoid · OpenClaw · SmolVLA · Sim2Real
Week 3 · Lectures 11-12

World Models & Imagination

IRIS world model — act in imagined environments. Latent dynamics, Dreamer architectures, model-based RL for sample efficiency.

IRIS · World Models · Dreamer · Model-Based RL
Week 4 · Lectures 13-14

Production Deployment & Evaluation

Shipping RL systems. Reward hacking detection, safety constraints, evaluation pipelines, monitoring, RLHF/RLAIF stack.

Production RL · Safety · Monitoring · RLHF Stack
Frameworks

The Production RL Stack

Hands-on with the frameworks that power RL at scale.

veRL

Hybrid parallel RLHF training. Ray + FSDP integration.

Used in: OpenClaw, RL2F
github.com/volcengine/verl →

OpenRLHF

PPO, DPO, GRPO with Ray + vLLM for 70B+ models.

Used in: OpenClaw, Agentic SWE
github.com/OpenRLHF →

OpenEnv

Build and standardize custom RL environments.

Used in: MetaDrive Arena, Humanoid
github.com/openenv →

Gymnasium

Standard RL environment API. Farama Foundation.

Used in: MetaDrive Arena, Humanoid, IRIS
github.com/Farama-Foundation →

CleanRL

Single-file RL implementations for understanding.

Used in: Phase 1 Labs, MetaDrive Arena
github.com/vwxyzjn/cleanrl →

Stable-Baselines3

Reliable PyTorch RL. PPO, SAC, DQN.

Used in: MetaDrive Arena, Humanoid
github.com/DLR-RM/sb3 →

Ray RLlib

Distributed multi-agent RL at scale.

Used in: Agentic SWE, OpenClaw
github.com/ray-project/ray →

MetaDrive

Driving simulator. 1000+ FPS. Bullet physics.

Used in: MetaDrive Racing Arena
github.com/metadriverse →

MuJoCo

Gold standard physics for robotics RL.

Used in: Humanoid Walking, SmolVLA
github.com/deepmind/mujoco →

rLLM

RL for language agents. GRPO/PPO + Ray + vLLM.

Used in: Agentic RL for SWE
github.com/agentica/rllm →

TRL

HuggingFace Transformer RL. SFT, DPO, PPO.

Used in: OpenClaw, RL2F
github.com/huggingface/trl →

Weights & Biases

Experiment tracking and visualization.

Used in: All projects
wandb.ai →
Capstone Projects

Build Real Systems. Not Toy Examples.

Every project ships a working system. These are the portfolio pieces that get you hired.

Project 01

MetaDrive Racing Arena

Train PPO agents for competitive 1v1 autonomous racing. Multi-agent environments, ELO leaderboard, sim-to-real transfer.

PPO · MetaDrive · Multi-Agent · Competition
Agentic RL for Software Engineering — Issue to Patch to PR
Project 02

Agentic RL for SWE

Build an RL-powered coding agent using DeepSWE + rLLM + R2E-Gym. Train on 8.1K real GitHub issues. Target: 59% on SWE-Bench Verified.

DeepSWE · GRPO · rLLM · SWE-Bench
Project 03

OpenClaw: WhatsApp AI with GRPO

Build an open-source WhatsApp AI gateway trained with GRPO on your own conversations. Real-time dashboard, Process Reward Model scoring, asynchronous training on H100 GPUs via RunPod. The model improves while serving responses live.

GRPO · WhatsApp · PRM Scoring · H100 · RunPod
Project 04

SmolVLA Robot Learning

Vision-Language-Action models for robotic control. RL-tuned inference — making small models perform like large ones through targeted RL fine-tuning.

SmolVLA · VLA · Robot Learning · Inference
RL2F: Train on OMNI Math, Transfer on Coding
Project 05

Implementing RL2F: RL with Language Feedback

Implement the RL2F paper from Google DeepMind — a framework that treats in-context learning from feedback as a trainable skill. Build teacher-student didactic interactions, train with multi-turn RL, and reproduce the result where Gemini Flash nearly matches Gemini Pro on HardMath2. Achieve cross-domain generalization to ARC-AGI and Codeforces.

RL2F · DeepMind · Multi-Turn RL · Self-Improvement · Paper Repro
Project 06

Teaching Humanoids to Walk

Train a simulated humanoid to walk using RL. Reward shaping, curriculum learning, MuJoCo environments, and locomotion policy transfer.

Humanoid · MuJoCo · Locomotion · Reward Shaping
IRIS Atari
Project 07

IRIS World Model

Implement the IRIS world model for imagination-based RL. Learn latent dynamics, generate training data from imagined trajectories, and benchmark on Atari.

World Models · IRIS · Model-Based RL · Atari
Project 08

CaP-X RL: The First Coding Agent for Robotics

Reproduce CaP-X RL — the first framework to turn frontier LLMs into coding agents that control real robots. Build CaP-Gym (program-synthesis robot environment), benchmark VLMs on CaP-Bench, run CaP-Agent0 on real embodiments, and train CaP-RL with verifiable rewards for sim-to-real transfer with near-zero gap. Outperforms specialized VLA models on perturbed manipulation tasks.

CaP-X RL · Coding Agents · Robot Manipulation · Verifiable Rewards · Sim2Real
Who Is This For

Built for Builders

💻

ML Engineers

You've trained models but never an RL agent. Understand PPO, GRPO, and the RLHF stack powering LLM alignment.

🎓

Graduate Students

You know the theory but haven't shipped production RL. Bridge the gap between papers and real systems.

🎯

Targeting Top AI Labs

Interviewing at OpenAI, DeepMind, Anthropic, NVIDIA? RL systems knowledge is the differentiator.

🤖

Robotics Engineers

You build hardware. Now train the brains. Sim-to-real, humanoid locomotion, dexterous manipulation.

🧠

LLM Practitioners

Understand the RL layer — RLHF, DPO, GRPO — that turns base models into aligned systems.

🔬

Aspiring Researchers

Research roadmaps, paper reading lists, and mentorship to get your first RL paper published.

From Our Team

Deep Dives We've Written

Read our research and explainers on Substack before the workshop begins.

Guest Speakers

Learn from the Best in RL

Speaker candidates from the top RL labs and universities in the world.

Rishabh Agarwal

Fmr. Staff Research Scientist
Google DeepMind
Prafulla Dhariwal

Technical Fellow, co-creator of PPO
OpenAI
Trapit Bansal

Researcher, ex-OpenAI o1
Meta Superintelligence
Anima Anandkumar

Professor · ex-Sr. Director ML
Caltech / NVIDIA
Ankur Handa

Principal Research Scientist
NVIDIA Robotics
Harshit Sikchi

Research Scientist, GPT-5
OpenAI
Dhruv Batra

Research Director
Meta FAIR
Aviral Kumar

Asst. Professor + Researcher
CMU / DeepMind
Pulkit Agrawal

Assoc. Prof, Improbable AI Lab
MIT CSAIL
Deepak Pathak

Faculty, Curiosity-driven RL
CMU
Karthik Narasimhan

Head of Research, ReAct
Sierra / Princeton
Aravind Srinivas

CEO, ex-OpenAI · DeepMind
Perplexity AI

Speaker candidates — final lineup confirmed closer to cohort start.

Why Vizuara

We Wrote the Book on RL. Literally.

When Manning Publications needed an author for the RL chapter in their DeepSeek book, they came to Vizuara. That's the depth of expertise behind this workshop.

DeepSeek Book — RL Chapter by Vizuara AI Labs
Manning Publications

The RL Chapter in the DeepSeek Book

Dr. Rajat Dandekar authored the reinforcement learning chapter in Manning's DeepSeek book — covering the algorithms, training pipelines, and production techniques that power state-of-the-art reasoning models.

This isn't a team that learned RL from tutorials. Vizuara has the research depth to write the textbook and the engineering experience to ship production systems. When you enroll in this workshop, you're learning from the people publishers trust to explain RL to the world.

Published by
Manning Publications
Topic
RL for Reasoning Models
Coverage
GRPO, PPO, RLHF, DeepSeek-R1
Your Instructor

Meet Dr. Rajat Dandekar

Dr. Rajat Dandekar

Founder, Vizuara AI Labs · Purdue PhD

Dr. Dandekar has successfully taught the acclaimed "Reasoning LLM from Scratch" course, helping hundreds of students master complex AI concepts through practical, hands-on learning.

With extensive research experience in reinforcement learning and deep learning at top-tier institutions, Dr. Dandekar brings cutting-edge knowledge directly to the classroom. This workshop is born from the conviction that RL actually works — and the gap isn't in the algorithms, it's in knowing how to ship them.

200+
Engineers Taught
50K+
YouTube Subs
Purdue
PhD
Enroll

Choose Your Path

Each phase is self-contained. Bundle for the full journey.

Phase 1: RL Foundations

DQN, Policy Gradients, PPO, TRPO, GRPO, DPO, SimPO, Agentic RL. 7 lectures + labs.

₹45,000

Phase 2: Production RL

8 capstone projects — racing agents, humanoid walking, agentic SWE, robotics coding agents, world models.

₹55,000

Both Phases — Complete Workshop

Full 4-week journey. 14 lectures, 8 projects, all labs. Best value.

₹85,000 (regular ₹1,00,000)

+ Guest Speaker Pass

All industry guest sessions from OpenAI, DeepMind, NVIDIA, Meta FAIR, and more.

+ ₹50,000

+ Research Starter Kit

Personalized roadmap, curated reading list (12-15 papers), code template, draft outline.

+ ₹15,000

+ 1:1 Mentorship (2 Months)

Bi-weekly 1:1 with Dr. Rajat or senior mentors. Research, career, publication support.

+ ₹70,000

Your Selection

Phase 1: RL Foundations · ₹45,000
Total · ₹45,000

EMI available · Lifetime recording access

FAQ

Common Questions

Do I need prior RL experience?

No. Phase 1 starts from fundamentals — MDPs, value functions, Q-learning. Python + basic ML knowledge is enough. Phase 2 assumes Phase 1 knowledge.

What hardware do I need?

A laptop with a modern CPU for Phase 1. We do not provide cloud GPU access — for large-scale training (veRL, OpenRLHF), you will need to provision your own GPUs via RunPod, Lambda, or similar providers. Robotics simulation (MuJoCo, MetaDrive) runs on CPU. Some projects use Google Colab.

Can I enroll in just one phase?

Yes. Each phase is self-contained. Phase 1 = foundations, Phase 2 = production projects. The bundle is the best value.

Are sessions recorded?

Yes. All live sessions are recorded with lifetime access — core lectures, labs, and guest sessions (with Speaker Pass).

How is this different from a self-paced online course?

You build real production systems — racing agents, humanoid walking, coding agents. Live instruction with industry speakers. The actual production stack — veRL, OpenRLHF, Ray — not just algorithms.

What is the refund policy?

Due to the limited cohort size, there are no refunds after enrollment. Email raj@vizuara.ai with questions before enrolling.