Reinforcement Learning Grid World
Compare the Q-Learning and SARSA algorithms
Grid Configuration:
Beginner (8×8)
Advanced (10×10)
Local Minima (10×10)
Random Binomial (10×10)
Episode: 0
Total Reward: 0
Run 1 Episode
10 Episodes
25 Episodes
50 Episodes
100 Episodes
Run 10
Reset Environment
Environment
🤖 Agent | 🚀 Start | 🎯 Goal | 💀 Hazard | 💰 Reward
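For concreteness, a minimal sketch of a grid environment with these cell types; the layout, reward values, and names below are illustrative assumptions, not the page's actual presets:

```python
# Minimal grid world using the legend's cell types. All values here
# (layout, rewards, penalties) are illustrative assumptions.
HAZARD_PENALTY, STEP_COST = -10.0, -0.1
GOAL_REWARD, COIN_REWARD = 10.0, 1.0
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

class GridWorld:
    def __init__(self, size=8, start=(0, 0), goal=(7, 7),
                 hazards=frozenset({(3, 3), (5, 2)}),
                 coins=frozenset({(2, 5)})):
        self.size, self.start, self.goal = size, start, goal
        self.hazards, self.coins = hazards, coins
        self.pos = start  # 🤖 agent begins at 🚀 start

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        """Apply one move; return (next_state, reward, done)."""
        dr, dc = MOVES[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        if self.pos == self.goal:        # 🎯 goal ends the episode
            return self.pos, GOAL_REWARD, True
        if self.pos in self.hazards:     # 💀 hazard ends the episode
            return self.pos, HAZARD_PENALTY, True
        if self.pos in self.coins:       # 💰 reward cell, episode continues
            return self.pos, COIN_REWARD, False
        return self.pos, STEP_COST, False
```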
Q-Values:
High Positive
Medium Positive
Low Positive
Zero
Negative
Numbers show each cell's maximum valid Q-value; arrows show the best valid action
Tap cells to see Q-values. Green intensity shows learning progress.
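A hedged sketch of how that per-cell overlay could be computed; `valid_actions` is a hypothetical helper that filters out moves leading off the grid:

```python
# Per-cell overlay: the number is the max Q-value over valid actions,
# the arrow is the corresponding argmax action.
ARROWS = {"up": "↑", "down": "↓", "left": "←", "right": "→"}

def cell_overlay(Q, state, valid_actions):
    best = max(valid_actions(state), key=lambda a: Q[(state, a)])
    return f"{Q[(state, best)]:.2f}", ARROWS[best]
```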
🏆 Hall of Fame
🌍 Global (0)
💾 Local (0)
Filter:
All (0)
Easy (0)
Complex (0)
Local Minima (0)
📊 No scores yet. Be the first to submit!
Learning Algorithm
Show Info
Q-Learning (off-policy)
SARSA (on-policy)
Key Differences
Q-Learning: updates toward the maximum Q-value over next-state actions (off-policy), which can produce more aggressive behavior near hazards
SARSA: updates toward the Q-value of the next action actually taken (on-policy), which is generally more conservative
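A minimal sketch of that difference, assuming a tabular Q stored in a dict keyed by (state, action); the names Q, alpha, gamma, and epsilon_greedy are illustrative, not this page's implementation:

```python
import random
from collections import defaultdict

Q = defaultdict(float)                   # Q[(state, action)] -> estimated return
ACTIONS = ["up", "down", "left", "right"]
alpha, gamma, epsilon = 0.1, 0.95, 0.1   # illustrative defaults

def epsilon_greedy(state):
    """Explore with probability epsilon, otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next):
    # Off-policy: bootstrap from the best next action, regardless of
    # which action the behavior policy actually takes.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap from the next action actually chosen, so
    # exploration risk is reflected in the learned values.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

The single line that differs between the two updates is the bootstrap target, and that is exactly what the on-policy and off-policy labels above refer to: SARSA needs to know a_next before it can update, while Q-Learning always assumes the greedy continuation.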
Learning Parameters
Core
Directional Heuristics
Advanced
Episode Rewards
No episodes completed yet. Run some episodes to see the reward history.