Of Q‑Tables and Three‑in‑a‑Rows: Training an RL Knight in Tic‑Tac‑Toe

Reinforcement Learning is fairly popular at the moment. In this chronicle, we embark on a quest to forge an RL model for the noble game of Tic‑Tac‑Toe: we’ll write our own environment, summon a DQN sorcerer from Stable Baselines 3, and ultimately witness our AI crush the humblest of human challengers (or at least draw more than half the time).

1. Summoning the Tic‑Tac‑Toe Environment

First things first: let us craft our battlefield. In tictactoe_env.py, we define a Gymnasium environment where:

  • The board is a flat array of length 9 (0 = empty, 1 = our agent’s “X”, –1 = opponent’s “O”).
  • The opponent always strikes first with a random move (easy fodder for training).
  • Illegal moves incur a –10 penalty (a harsh tutor, indeed).

# tictactoe_env.py (excerpt)
import gymnasium as gym
from gymnasium import spaces
import numpy as np

class TicTacToeEnv(gym.Env):
    def __init__(self, opponent='random'):
        super().__init__()
        self.observation_space = spaces.Box(-1, 1, (9,), np.int8)
        self.action_space = spaces.Discrete(9)
        self.opponent = opponent
        self.reset()

    def reset(self, **kwargs):
        self.board = np.zeros(9, dtype=np.int8)
        self.done = False
        # Opponent (-1) moves first
        self._opponent_move()
        return self.board.copy(), {}

    def step(self, action):
        if self.done:
            raise RuntimeError("Episode is done")
        # Agent move: place 1 at action; an illegal move ends the episode with -10
        if self.board[action] != 0:
            self.done = True
            return self.board.copy(), -10, True, False, {}
        self.board[action] = 1
        winner = self._check_winner(self.board)
        if winner != 0:
            self.done = True
            return self.board.copy(), (1 if winner == 1 else -1), True, False, {}

        # Opponent's random reply
        self._opponent_move()
        winner = self._check_winner(self.board)
        if winner != 0:
            self.done = True
            return self.board.copy(), (1 if winner == 1 else -1), True, False, {}

        # Draw?
        if np.all(self.board != 0):
            self.done = True
            return self.board.copy(), 0, True, False, {}
        return self.board.copy(), 0, False, False, {}

    # … plus render(), _check_winner(), _opponent_move() …

Under the hood, _check_winner scans rows, columns, and diagonals for a sum of ±3. A sum of 3 means our agent wins; –3 means the foe triumphs.
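
The two helpers are elided in the excerpt above, but with this board encoding they are short. Here is a minimal sketch of how _check_winner and the random _opponent_move could look inside TicTacToeEnv (an illustration consistent with the description above, not necessarily the repo’s exact code):

# Hypothetical sketch of the elided helpers (methods of TicTacToeEnv)
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]

def _check_winner(self, board):
    # A line summing to +3 is three agent marks (win); -3 is three opponent marks (loss)
    for a, b, c in WIN_LINES:
        s = board[a] + board[b] + board[c]
        if s == 3:
            return 1
        if s == -3:
            return -1
    return 0

def _opponent_move(self):
    # The random opponent places -1 in a uniformly chosen empty cell
    empty = np.flatnonzero(self.board == 0)
    if empty.size > 0:
        self.board[np.random.choice(empty)] = -1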

2. Conjuring the DQN Agent

With battlefield in place, we enlist the services of a Deep Q‑Network. In train.py, we:

  1. Instantiate train & evaluation environments.
  2. Call forth a DQN with MLP policy.
  3. Set learning rate, buffer size, γ, and other arcane hyperparameters.
  4. Attach an EvalCallback to record our best model.
  5. Train for 200,000 timesteps.
  6. Save the champion’s weights.

# train.py (excerpt)
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import EvalCallback
from tictactoe_env import TicTacToeEnv

def main():
    train_env = TicTacToeEnv(opponent='random')
    eval_env = TicTacToeEnv(opponent='random')

    model = DQN(
        "MlpPolicy", train_env,
        learning_rate=1e-3,
        buffer_size=50_000,
        learning_starts=1_000,
        batch_size=64,
        gamma=0.99,
        train_freq=4,
        target_update_interval=1_000,
        verbose=1,
    )

    eval_callback = EvalCallback(
        eval_env,
        best_model_save_path="./logs/best_model",
        log_path="./logs/eval",
        eval_freq=5_000,
        deterministic=True,
        render=False,
    )

    model.learn(total_timesteps=200_000, callback=eval_callback, progress_bar=True)
    model.save("saved_model/tictactoe_dqn")

if __name__ == "__main__":
    main()

After a few minutes (or hours, depending on your GPU’s mood), you’ll have a tictactoe_dqn.zip artifact ready to deploy.
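
Before wiring up a CLI, you can sanity-check the saved artifact with Stable Baselines 3’s evaluate_policy helper. A minimal sketch (the script name and episode count here are my own choices):

# evaluate.py (sketch) – quick check of the saved model against the random opponent
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from tictactoe_env import TicTacToeEnv

model = DQN.load("saved_model/tictactoe_dqn")
eval_env = Monitor(TicTacToeEnv(opponent='random'))

# Average episode reward over 500 games: +1 win, 0 draw, -1 loss, -10 illegal move
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=500)
print(f"mean reward: {mean_reward:.2f} ± {std_reward:.2f}")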

3. The CLI Tournament

What good is a champion if it cannot demonstrate its prowess? Enter cli.py, a simple terminal interface that:

  • Draws the 3×3 board with emojis (🔴 for AI, 🟢 for Player, ⚪ for empty).
  • Prompts the human hero for moves (0–8).
  • Alternates turns until someone wins or the board is full.

# cli.py (excerpt)
from stable_baselines3 import DQN
import numpy as np
from colorama import Fore

model = DQN.load("saved_model/tictactoe_dqn.zip")

def draw_board(board):
    symbols = {1: '🔴', -1: f'{Fore.GREEN}🟢{Fore.RESET}', 0: '⚪'}
    bf = board.reshape(3, 3)
    print("\n".join("".join(symbols[x] for x in row) for row in bf))
    print("-" * 10)

# …input loop, check_winner, predict(), etc.…

Boot up python cli.py, choose whether you want the first move, and prepare to be dazzled (or at least forced into a draw).
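
The input loop itself is elided in the excerpt; one possible shape, assuming a check_winner(board) helper that mirrors the environment’s logic, looks like this (a hypothetical sketch, not the repo’s exact code):

# Hypothetical play loop for cli.py – human is -1 (🟢), the DQN agent is 1 (🔴)
board = np.zeros(9, dtype=np.int8)

while True:
    # Human move
    move = int(input("Your move (0-8): "))
    if board[move] != 0:
        print("That square is taken, try again.")
        continue
    board[move] = -1
    draw_board(board)
    if check_winner(board) != 0 or np.all(board != 0):
        break

    # AI reply: greedy action from the learned Q-network
    action, _ = model.predict(board, deterministic=True)
    board[int(action)] = 1
    draw_board(board)
    if check_winner(board) != 0 or np.all(board != 0):
        break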

4. The Results

After training, our DQN agent reaches a mean evaluation reward of roughly 0.6 against the random opponent. That breaks down approximately as:

  • Win rate ≈ 50%.
  • Draw rate ≈ 45%.
  • Loss rate ≈ 5% (illegal moves are swiftly punished).
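
If you want to reproduce this breakdown yourself, a small tally loop over the environment does the trick (a sketch, assuming the saved model and environment from the earlier sections):

# Hypothetical tally script: play 1,000 games against the random opponent and count outcomes
from collections import Counter
from stable_baselines3 import DQN
from tictactoe_env import TicTacToeEnv

model = DQN.load("saved_model/tictactoe_dqn")
env = TicTacToeEnv(opponent='random')
outcomes = Counter()

for _ in range(1000):
    obs, _ = env.reset()
    done = False
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(int(action))
        done = terminated or truncated
    # The final reward encodes the outcome: +1 win, 0 draw, negative loss/illegal move
    outcomes['win' if reward > 0 else 'loss' if reward < 0 else 'draw'] += 1

print(outcomes)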

Against a human who knows only “center first,” the AI will happily force a draw every time, and will exploit any sloppy corner openings.

[Image: Tic‑Tac‑Toe board]

5. Attachments & Repository

All code lives in this repo:

  • tictactoe_env.py – Custom Gymnasium environment
  • train.py – Training script with DQN
  • cli.py – Play against your model

You may download the trained weights here:
saved_model/tictactoe_dqn.zip

Feel free to fork, tweak hyperparameters, or swap in a more fearsome opponent (Minimax, Monte Carlo, your grandma’s intuition, etc.).
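
As a concrete example of a “more fearsome opponent”, here is what a drop-in minimax _opponent_move could look like inside TicTacToeEnv (hypothetical, not part of the repo; it replaces the random reply with perfect play):

# Hypothetical minimax opponent (opponent plays -1, agent plays 1)
def _minimax(self, board, player):
    winner = self._check_winner(board)
    if winner != 0:
        return winner, None              # +1: agent wins, -1: opponent wins
    empty = np.flatnonzero(board == 0)
    if empty.size == 0:
        return 0, None                   # draw
    best_move = None
    best_score = -2 if player == 1 else 2
    for cell in empty:
        board[cell] = player
        score, _ = self._minimax(board, -player)
        board[cell] = 0
        if (player == 1 and score > best_score) or (player == -1 and score < best_score):
            best_score, best_move = score, cell
    return best_score, best_move

def _opponent_move(self):
    # The opponent (-1) minimizes the agent's outcome instead of moving at random
    if np.any(self.board == 0):
        _, move = self._minimax(self.board.copy(), -1)
        self.board[move] = -1

Training against perfect play is harsher: since the opponent moves first and never blunders, the best the agent can hope for is a draw, so expect the mean reward ceiling to drop to 0.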

Conclusion

Tic‑Tac‑Toe is a humble playground, but it teaches us the essence of RL: exploration, exploitation, and the delicate dance of rewards. From a blank 3×3 grid, our agent learned to block, to fork, and to force a draw against random foes, and even to pester strategic humans. So next time someone challenges you to “noughts and crosses,” send out your RL knight. And remember, in the world of reinforcements, even a simple game can yield grand adventures.

May your episodes be endless and your Q‑values ever convergent!
