Connect Me If You Can

Introduction

A self-learning Connect-4 game with a simple GUI. I used the blog on reinforcement learning to come up with my Q-Learning solution; it provides an excellent step-by-step approach to implementing a solution with RL. However, Connect-4 has a huge state space that renders tabular Q-Learning ineffective, and an approach using Deep Q-Learning would make a better self-learning AI agent for the game.

A game with a trained AI agent

Approach
  • Despite Connect-4 being a popular game, I had not played the game before I started working on this assignment.
  • I wanted to gain a deeper understanding of the game, so I started by designing its graphical user interface (GUI).
  • I used Pygame for the development of the GUI by following a YouTube tutorial. The tutorial helped me understand the various scenarios in which a player can win or a tie can occur.
  • The various stages of the GUI development can be seen in the GUI Development section.
  • Note that game modes were added in the final stage of GUI development; in particular, a single-player mode, namely 'vs Computer', was added to play against the learning agent.
  • After the GUI for the game was designed, I defined a general player class with three types of computer players: a random agent, a trained agent, and a learning agent (sketched after this list).
  • The random agent plays any of the valid columns for the current state of the game. The trained agent has been programmed to be an easy-level player using the following tutorial.
  • The following article helped me understand the state and action spaces and decide on a learning algorithm.
  • I employed Q-learning to train my learning agent because both the state space and the action space of the game environment are discrete (a minimal sketch of the update rule follows Table 1).
  • The reward system employed for the training of the learning agent is given in Table-1.
  • Note that the reward is zero while the game is in progress; only when the game ends in a win, draw, or loss is the corresponding reward assigned to the learning agent.
  • The learning agent was first trained against the random agent and then against the trained agent for 10,000 episodes.
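
As a rough illustration of the player structure described above, here is a minimal sketch of the base class and the random agent. The names (Player, RandomAgent, get_move) and the board representation (a 6x7 grid of 0/1/2 values, row 0 at the top) are assumptions for illustration, not necessarily the exact ones in the repository.

    import random

    ROWS, COLS = 6, 7  # standard Connect-4 board

    class Player:
        """Base class: every player chooses a column given the current board."""
        def __init__(self, token):
            self.token = token  # 1 or 2

        def get_move(self, board):
            raise NotImplementedError

    class RandomAgent(Player):
        """Plays a uniformly random valid column (one whose top cell is empty)."""
        def get_move(self, board):
            valid = [c for c in range(COLS) if board[0][c] == 0]
            return random.choice(valid)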

GUI Development Stages
Stage 1 of GUI Development – Create Connect-4 grid with token movement space
Stage 2 of GUI Development – Drop token of each player
Stage 3 of GUI Development – Check for winning and tie conditions (a check of this kind is sketched after this list)
Stage 4 of GUI Development – Add single-player and multi-player modes
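
As a rough sketch of the Stage 3 logic, a win/draw check over a 6x7 board stored as a 2-D list might look like the following. The function names and board encoding (0 for empty, otherwise the player's token, row 0 at the top) are illustrative assumptions; the actual GUI follows the tutorial's implementation.

    def winning_move(board, token):
        """True if `token` has four in a row horizontally, vertically, or diagonally."""
        rows, cols = len(board), len(board[0])
        directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
        for r in range(rows):
            for c in range(cols):
                for dr, dc in directions:
                    if all(
                        0 <= r + i * dr < rows
                        and 0 <= c + i * dc < cols
                        and board[r + i * dr][c + i * dc] == token
                        for i in range(4)
                    ):
                        return True
        return False

    def is_draw(board):
        """Draw when the top row is full and no one has won."""
        return all(cell != 0 for cell in board[0])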

Rewards Table
  Reward Type     Reward
  Win             +5.0
  Draw            +0.2
  Loss            -1.0
  During Game      0.0
Table 1 – Rewards system for the AI agent
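
To make the update concrete, here is a minimal sketch of a tabular Q-learning step using the rewards from Table 1. The state encoding (a string key per board position), the dictionary-based Q-table, and the values of the learning rate and discount factor are illustrative assumptions rather than the exact settings used in training.

    from collections import defaultdict

    ALPHA = 0.1   # learning rate (illustrative value)
    GAMMA = 0.9   # discount factor (illustrative value)

    # Q-table: Q[state][action] -> value, with board states serialized to strings
    Q = defaultdict(lambda: defaultdict(float))

    def update_q(state, action, reward, next_state, next_valid_actions, done):
        # At terminal states the target is just the Table 1 reward
        # (+5.0 win, +0.2 draw, -1.0 loss); otherwise bootstrap from
        # the best estimated value of the next state.
        if done or not next_valid_actions:
            target = reward
        else:
            target = reward + GAMMA * max(Q[next_state][a] for a in next_valid_actions)
        Q[state][action] += ALPHA * (target - Q[state][action])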

Results & Analysis
  • Figure-1 illustrates the winning rate per 100 episodes and the average moves per episode for the learning agent playing against the random agent.
  • In this run, no decay in exploration is applied to the learning agent. As a result, its winning rate is inconsistent, and the learning agent still loses games despite being trained for more than 200 epochs.
  • Figure-2 illustrates the winning rate per 100 episodes and the average moves per episode for the learning agent playing against the random agent with decay in exploration (a decay-schedule sketch follows this list).
  • We can clearly see the advantage the learning agent gains from balancing exploration and exploitation.
  • Figure-3 illustrates the winning rate per 100 episodes and the average moves per episode for the learning agent playing against the trained agent.
  • Again, no decay in exploration is applied. As a result, the learning agent wins inconsistently against the trained agent despite being pre-trained against the random agent.
  • Figure-4 illustrates the winning rate per 100 episodes and the average moves per episode for the learning agent playing against the trained agent with decay in exploration.
  • Again, we can clearly see the advantage the learning agent gains from balancing exploration and exploitation.
  • Despite its consistent winning percentage against the trained agent, the learning agent performs poorly against a human player. (See figure)
  • The learning agent did manage to learn that playing in the center column is a better opening move than any other available move.
  • I believe the reason for the poor performance of the learning agent is that the trained agent does not cover enough of the state space for the learning agent to form a good Q-table. To tackle this, I am planning to program a trained agent using the Minimax algorithm.
  • Another reason for the poor performance might be the huge state space of the Connect-4 game. Since Q-learning essentially provides the learning agent with a cheat sheet of state-action pairs, a state space on the order of 10^14 cannot be stored in the form of a table.
  • A Deep Q-Network can be trained to tackle the huge state space problem stated above. I am going to work on the proposed solutions during the summer.
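
For reference, a decay-in-exploration schedule like the one discussed above could be sketched as follows. The starting value, floor, and decay factor are illustrative assumptions, not the exact hyperparameters used in these runs.

    import random

    EPSILON_START = 1.0    # explore heavily at first (illustrative)
    EPSILON_MIN = 0.05     # keep a small amount of exploration (illustrative)
    EPSILON_DECAY = 0.999  # multiplicative decay per episode (illustrative)

    def choose_action(Q, state, valid_actions, epsilon):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit Q
        if random.random() < epsilon:
            return random.choice(valid_actions)
        return max(valid_actions, key=lambda a: Q[state][a])

    epsilon = EPSILON_START
    for episode in range(10_000):
        # ... play one game, selecting moves with choose_action and updating Q ...
        epsilon = max(EPSILON_MIN, epsilon * EPSILON_DECAY)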

Figure 1 – Training against Random Agent without exploration decay

Figure 2 – Training against Random Agent with exploration decay

Figure 3 – Training against Trained Agent without exploration decay

Figure 4 – Training against Trained Agent with exploration decay


GitHub

Check out the source code of the project here.
