Of Q‑Tables and Three‑in‑a‑Rows: Training an RL Knight in Tic‑Tac‑Toe

Reinforcement Learning is fairly popular at the moment. In this chronicle, we embark on a quest to forge an RL model for the noble game of Tic‑Tac‑Toe: we’ll write our own environment, summon a DQN sorcerer from Stable Baselines 3, and ultimately witness our AI crush the humblest of human challengers (or at least draw more than half the time).

1. Summoning the Tic‑Tac‑Toe Environment

First things first: let us craft our battlefield. In tictactoe_env.py, we define a Gymnasium environment where:

  • The board is a flat array of length 9 (0 = empty, 1 = our agent’s “X”, –1 = opponent’s “O”).
  • The opponent always strikes first with a random move (easy fodder for training).
  • Illegal moves incur a –10 penalty (a harsh tutor, indeed).

# tictactoe_env.py (excerpt)
import gymnasium as gym
from gymnasium import spaces
import numpy as np

class TicTacToeEnv(gym.Env):
    def __init__(self, opponent='random'):
        super().__init__()
        self.observation_space = spaces.Box(-1, 1, (9,), np.int8)
        self.action_space = spaces.Discrete(9)
        self.opponent = opponent
        self.reset()

    def reset(self, **kwargs):
        self.board = np.zeros(9, dtype=np.int8)
        self.done = False
        # Opponent (-1) moves first
        self._opponent_move()
        return self.board.copy(), {}

    def step(self, action):
        if self.done:
            raise RuntimeError("Episode is done")
        # Agent move: place 1 at action; an illegal move ends the episode with -10
        if self.board[action] != 0:
            self.done = True
            return self.board.copy(), -10, True, False, {}
        self.board[action] = 1
        winner = self._check_winner(self.board)
        if winner != 0:
            self.done = True
            return self.board.copy(), (1 if winner == 1 else -1), True, False, {}

        # Opponent's random reply
        self._opponent_move()
        winner = self._check_winner(self.board)
        if winner != 0:
            self.done = True
            return self.board.copy(), (1 if winner == 1 else -1), True, False, {}

        # Draw?
        if np.all(self.board != 0):
            self.done = True
            return self.board.copy(), 0, True, False, {}
        return self.board.copy(), 0, False, False, {}

    # … plus render(), _check_winner(), _opponent_move() …

Under the hood, _check_winner scans rows, columns, and diagonals for a sum of ±3. A sum of 3 means our agent wins; –3 means the foe triumphs.
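
The two helpers are elided in the excerpt above, but with this board encoding they are short. Here is a minimal sketch of how _check_winner and the random _opponent_move could look inside TicTacToeEnv (an illustration consistent with the description above, not necessarily the repo’s exact code):

# Hypothetical sketch of the elided helpers (methods of TicTacToeEnv)
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]

def _check_winner(self, board):
    # A line summing to +3 is three agent marks (win); -3 is three opponent marks (loss)
    for a, b, c in WIN_LINES:
        s = board[a] + board[b] + board[c]
        if s == 3:
            return 1
        if s == -3:
            return -1
    return 0

def _opponent_move(self):
    # The random opponent places -1 in a uniformly chosen empty cell
    empty = np.flatnonzero(self.board == 0)
    if empty.size > 0:
        self.board[np.random.choice(empty)] = -1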

2. Conjuring the DQN Agent

With battlefield in place, we enlist the services of a Deep Q‑Network. In train.py, we:

  1. Instantiate train & evaluation environments.
  2. Call forth a DQN with MLP policy.
  3. Set learning rate, buffer size, γ, and other arcane hyperparameters.
  4. Attach an EvalCallback to record our best model.
  5. Train for 200,000 timesteps.
  6. Save the champion’s weights.

# train.py (excerpt)
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import EvalCallback
from tictactoe_env import TicTacToeEnv

def main():
    train_env = TicTacToeEnv(opponent='random')
    eval_env = TicTacToeEnv(opponent='random')

    model = DQN(
        "MlpPolicy", train_env,
        learning_rate=1e-3,
        buffer_size=50_000,
        learning_starts=1_000,
        batch_size=64,
        gamma=0.99,
        train_freq=4,
        target_update_interval=1_000,
        verbose=1,
    )

    eval_callback = EvalCallback(
        eval_env,
        best_model_save_path="./logs/best_model",
        log_path="./logs/eval",
        eval_freq=5_000,
        deterministic=True,
        render=False,
    )

    model.learn(total_timesteps=200_000, callback=eval_callback, progress_bar=True)
    model.save("saved_model/tictactoe_dqn")

if __name__ == "__main__":
    main()

After a few minutes (or hours, depending on your GPU’s mood), you’ll have a tictactoe_dqn.zip artifact ready to deploy.
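
Before wiring up a CLI, you can sanity-check the saved artifact with Stable Baselines 3’s evaluate_policy helper. A minimal sketch (the script name and episode count here are my own choices):

# evaluate.py (sketch) – quick check of the saved model against the random opponent
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from tictactoe_env import TicTacToeEnv

model = DQN.load("saved_model/tictactoe_dqn")
eval_env = Monitor(TicTacToeEnv(opponent='random'))

# Average episode reward over 500 games: +1 win, 0 draw, -1 loss, -10 illegal move
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=500)
print(f"mean reward: {mean_reward:.2f} ± {std_reward:.2f}")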

3. The CLI Tournament

What good is a champion if it cannot demonstrate its prowess? Enter cli.py, a simple terminal interface that:

  • Draws the 3×3 board with emojis (🔴 for AI, 🟢 for Player, ⚪ for empty).
  • Prompts the human hero for moves (0–8).
  • Alternates turns until someone wins or the board is full.

# cli.py (excerpt)
from stable_baselines3 import DQN
import numpy as np
from colorama import Fore

model = DQN.load("saved_model/tictactoe_dqn.zip")

def draw_board(board):
    symbols = {1: '🔴', -1: f'{Fore.GREEN}🟢{Fore.RESET}', 0: '⚪'}
    bf = board.reshape(3, 3)
    print("\n".join("".join(symbols[x] for x in row) for row in bf))
    print("-" * 10)

# …input loop, check_winner, predict(), etc.…

Boot up python cli.py, choose whether you want the first move, and prepare to be dazzled (or at least forced into a draw).
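
The input loop itself is elided in the excerpt; one possible shape, assuming a check_winner(board) helper that mirrors the environment’s logic, looks like this (a hypothetical sketch, not the repo’s exact code):

# Hypothetical play loop for cli.py – human is -1 (🟢), the DQN agent is 1 (🔴)
board = np.zeros(9, dtype=np.int8)

while True:
    # Human move
    move = int(input("Your move (0-8): "))
    if board[move] != 0:
        print("That square is taken, try again.")
        continue
    board[move] = -1
    draw_board(board)
    if check_winner(board) != 0 or np.all(board != 0):
        break

    # AI reply: greedy action from the learned Q-network
    action, _ = model.predict(board, deterministic=True)
    board[int(action)] = 1
    draw_board(board)
    if check_winner(board) != 0 or np.all(board != 0):
        break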

4. The Results

After training, our DQN agent reaches a mean evaluation reward of roughly 0.6 against the random opponent. That breaks down approximately as:

  • Win rate ≈ 50%.
  • Draw rate ≈ 45%.
  • Loss rate ≈ 5% (illegal moves are swiftly punished).
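
If you want to reproduce this breakdown yourself, a small tally loop over the environment does the trick (a sketch, assuming the saved model and environment from the earlier sections):

# Hypothetical tally script: play 1,000 games against the random opponent and count outcomes
from collections import Counter
from stable_baselines3 import DQN
from tictactoe_env import TicTacToeEnv

model = DQN.load("saved_model/tictactoe_dqn")
env = TicTacToeEnv(opponent='random')
outcomes = Counter()

for _ in range(1000):
    obs, _ = env.reset()
    done = False
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(int(action))
        done = terminated or truncated
    # The final reward encodes the outcome: +1 win, 0 draw, negative loss/illegal move
    outcomes['win' if reward > 0 else 'loss' if reward < 0 else 'draw'] += 1

print(outcomes)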

Against a human who knows only “center first,” the AI will happily force a draw every time, and will exploit any sloppy corner openings.

[Image: Tic‑Tac‑Toe board]

5. Attachments & Repository

All code lives in this repo:

  • tictactoe_env.py – Custom Gymnasium environment
  • train.py – Training script with DQN
  • cli.py – Play against your model

You may download the trained weights here:
saved_model/tictactoe_dqn.zip

Feel free to fork, tweak hyperparameters, or swap in a more fearsome opponent (Minimax, Monte Carlo, your grandma’s intuition, etc.).
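
As a concrete example of a “more fearsome opponent”, here is what a drop-in minimax _opponent_move could look like inside TicTacToeEnv (hypothetical, not part of the repo; it replaces the random reply with perfect play):

# Hypothetical minimax opponent (opponent plays -1, agent plays 1)
def _minimax(self, board, player):
    winner = self._check_winner(board)
    if winner != 0:
        return winner, None              # +1: agent wins, -1: opponent wins
    empty = np.flatnonzero(board == 0)
    if empty.size == 0:
        return 0, None                   # draw
    best_move = None
    best_score = -2 if player == 1 else 2
    for cell in empty:
        board[cell] = player
        score, _ = self._minimax(board, -player)
        board[cell] = 0
        if (player == 1 and score > best_score) or (player == -1 and score < best_score):
            best_score, best_move = score, cell
    return best_score, best_move

def _opponent_move(self):
    # The opponent (-1) minimizes the agent's outcome instead of moving at random
    if np.any(self.board == 0):
        _, move = self._minimax(self.board.copy(), -1)
        self.board[move] = -1

Training against perfect play is harsher: since the opponent moves first and never blunders, the best the agent can hope for is a draw, so expect the mean reward ceiling to drop to 0.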

Conclusion

Tic‑Tac‑Toe is a humble playground, but it teaches us the essence of RL: exploration, exploitation, and the delicate dance of rewards. From a blank 3×3 grid, our agent learned to block, to fork, and to force a draw against random foes, and even to pester strategic humans. So next time someone challenges you to “noughts and crosses,” send out your RL knight. And remember, in the world of reinforcements, even a simple game can yield grand adventures.

May your episodes be endless and your Q‑values ever convergent!
