You Should Not Classify A Prisoner’s Dilemma

(Ported from a Jupyter Notebook)

PLEASE NOTE! To anyone who accidentally came across this article: this was a purely for-fun, failed attempt!

In this experiment I explored whether a machine learning model could be taught to play the Prisoner’s Dilemma without using reinforcement learning.

It also uses Hugging Face’s Safetensors format to store the model instead of pickle.

Background

The Prisoner’s Dilemma is a famous problem in game theory that shows why two people might not cooperate, even when it’s in their best interest.

Imagine two criminals, Alice and Bob, are arrested and put in separate jail cells. The police don’t have enough evidence to convict them of a serious crime, so they offer each prisoner a deal:

  • If Alice and Bob both stay silent (cooperate with each other), they each get 1 year in prison.
  • If Alice betrays Bob, but Bob stays silent, Alice goes free, and Bob gets 3 years in prison (and vice versa).
  • If they both betray each other, they each get 2 years in prison.
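
To make the payoff structure concrete, here is a small sketch that encodes those sentences as a lookup table keyed by (Alice’s move, Bob’s move), with 0 meaning “stay silent” and 1 meaning “betray”. PAYOFF_YEARS is a hypothetical helper, and it is in years of prison; the simulation further down awards points instead.

#hypothetical helper: years of prison for (alice_move, bob_move)
#0 = stay silent (cooperate), 1 = betray (defect)
PAYOFF_YEARS = {
    (0, 0): (1, 1),  #both stay silent: 1 year each
    (0, 1): (3, 0),  #Alice silent, Bob betrays: Alice gets 3 years, Bob goes free
    (1, 0): (0, 3),  #Alice betrays, Bob silent: Alice goes free, Bob gets 3 years
    (1, 1): (2, 2),  #both betray: 2 years each
}

alice_years, bob_years = PAYOFF_YEARS[(0, 1)]
print(alice_years, bob_years)  #3 0
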
#first we import libs
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
from safetensors.torch import save_model, save_file, safe_open
import random
#I am using an RX 7900 XTX, so I can use the GPU
#if running on something like a Raspberry Pi or an Intel iGPU machine, it will fall back to CPU
#YES, IT WILL STILL SHOW 'cuda' EVEN THOUGH I AM USING AN AMD GPU (PyTorch's ROCm build reuses the 'cuda' device name)...
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Dataset Generation

We first generate our dataset. I would love to play the game myself to generate the data, but tit for tat is so simple that it can be generated programmatically.

A classifier model cannot process a dynamically sized array, so we fake it with a fixed-size one. When a round has not yet been played, its entry is [0.5, 0.5]. If Alice stays silent but Bob betrays, the entry for that round is [0, 1], and vice versa. The whole game state is then flattened into a single input vector, as the sketch below shows.
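
As an illustration (a minimal sketch mirroring the encoding used by the generator below, with made-up moves and scores), a 10-slot history where only two rounds have actually been played would be flattened like this:

#hypothetical example: 10-slot history, only the last two entries are real rounds
#[0, 0] = both stayed silent, [0, 1] = Alice stayed silent, Bob betrayed
history = [[0.5, 0.5]] * 8 + [[0, 0], [0, 1]]

#flatten the history and append the scores and the round number, like the generator does
my_score, opp_score, round_num = 3, 8, 2
input_vector = list(sum(history, [])) + [my_score, opp_score, round_num]
print(len(input_vector))  #23 values: 10 rounds * 2 moves + 3 extras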

#different strategies, some smart, some not so smart
def almost_always_cooperate(history, my_score, opp_score, round_num):
    if round_num == 0 or history[-1][1] == 0.5:  #always cooperate before any real round has been played
        return 0
    if random.randint(0, 100) < 80:
        return 0
    return 1

def almost_always_defect(history, my_score, opp_score, round_num):
    if round_num == 0 or history[-1][1] == 0.5:
        return 0
    if random.randint(0, 100) < 80:
        return 1
    return 0

def panic_when_score_low(history, my_score, opp_score, round_num):
    if round_num == 0 or history[-1][1] == 0.5:
        return 0
    if (opp_score - my_score) > 20:
        return 1
    return 0

#most effective one, hopefully the ML model learns it...
def tit_for_tat(history, my_score, opp_score, round_num):
    if round_num == 0 or history[-1][1] == 0.5:
        return 0
    return history[-1][1]  #copy opponent's last move

def random_choice(history, my_score, opp_score, round_num):
    return random.choice([0, 1])

#simulate games to generate the dataset
class PrisonersDilemmaGenerator:
    def __init__(self, rounds=6, history_size=10):
        self.hsize = history_size
        self.rounds = rounds
        self.players = [tit_for_tat, random_choice, almost_always_cooperate, almost_always_defect, panic_when_score_low]

    def play_game(self):
        history = [[0.5, 0.5]] * self.hsize  #init empty history, 0.5 means "not played yet"
        score_p1, score_p2 = 0, 0
        data = []

        #randomly choose two strategies
        player1 = random.choice(self.players)
        player2 = random.choice(self.players)

        for round_num in range(self.rounds):
            #note: history entries are stored as [move_p1, move_p2], so both strategies read them from player 1's perspective
            move_p1 = player1(history, score_p1, score_p2, round_num)
            move_p2 = player2(history, score_p2, score_p1, round_num)
            move_tft = tit_for_tat(history, score_p2, score_p1, round_num)

            # Update scores based on payoff matrix
            if move_p1 == 0 and move_p2 == 0:
                score_p1 += 3
                score_p2 += 3
            elif move_p1 == 0 and move_p2 == 1:
                score_p1 += 0
                score_p2 += 5
            elif move_p1 == 1 and move_p2 == 0:
                score_p1 += 5
                score_p2 += 0
            else:
                score_p1 += 0
                score_p2 += 0

            #flatten the game state and store it as one training sample
            input_data = list(sum(history, [])) + [score_p1, score_p2, round_num]
            data.append((input_data, move_p1))  #label with player 1's actual move (move_tft would label with pure tit for tat instead)

            #update history
            history.pop(0)
            history.append([move_p1, move_p2])

        return data

    def generate_dataset(self, num_games=1000):
        dataset = []
        for _ in range(num_games):
            dataset.extend(self.play_game())
        return dataset

#generate the dataset with a randomly chosen game length (at least one round)
generator = PrisonersDilemmaGenerator(rounds=random.randint(1, 10))
dataset = generator.generate_dataset(num_games=300000)



#convert dataset into PyTorch format
class PrisonersDilemmaDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x, y = self.data[idx]
        return torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)
        
#wrap the dataset and split it into training and validation sets
prisoners_dilemma_dataset = PrisonersDilemmaDataset(dataset)

train_size = int(0.8 * len(prisoners_dilemma_dataset))
val_size = len(prisoners_dilemma_dataset) - train_size
train_dataset, val_dataset = random_split(prisoners_dilemma_dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=4096, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=4096, shuffle=False)

Create the model

Now we create the classifier model using linear layers. At this point why don’t I just use ml5.js…

class PrisonersDilemmaNN(nn.Module):
    def __init__(self):
        super(PrisonersDilemmaNN, self).__init__()
        self.fc1 = nn.Linear(23, 32)  #input: 10 rounds * 2 moves + 3 extra inputs
        self.batch_norm1 = nn.BatchNorm1d(32)
        self.fc2 = nn.Linear(32, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 16)
        self.fc5 = nn.Linear(16, 1)
        self.sigmoid = nn.Sigmoid()  #output probability of defecting
        self.dropout = nn.Dropout(p=0.5)  #dropout with 50% probability

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.batch_norm1(x)  #batch normalization after fc1
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = self.dropout(x)  #apply dropout after fc3
        x = torch.relu(self.fc4(x))
        x = self.fc5(x)
        return self.sigmoid(x)  #output result

Looks like it’s training time!

Gotta train, train, train! Ouch, we might be overfitting…

#initialize model, loss function, and optimizer
model = PrisonersDilemmaNN().to(device)
criterion = nn.BCELoss().to(device)  #binary cross-entropy for classification
optimizer = optim.AdamW(model.parameters(), lr=0.00001)

#training loop
#let's train for a few epochs
epochs = 3
for epoch in range(epochs):
    model.train()
    total_loss = 0

    for inputs, labels in train_loader:
        #move data to GPU
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs).squeeze()  #remove extra dimensions
        loss = criterion(outputs.float(), labels.float())
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    #validation phase
    model.eval()
    val_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs).squeeze()
            loss = criterion(outputs.float(), labels.float())
            val_loss += loss.item()

            #convert probability to 0 or 1
            predictions = (outputs > 0.5).float()
            correct += (predictions == labels).sum().item()
            total += labels.size(0)

    if epoch % 1 == 0:
        print(f"Epoch {epoch}: Train Loss={total_loss/len(train_loader):.4f}, "
              f"Val Loss={val_loss/len(val_loader):.4f}, Accuracy={correct/total:.2%}")Code language: Python (python)

Let’s Play!

Now we just need to implement a playable game and start playing!

class GameSession:
    def __init__(self, model, rounds=10):
        self.model = model
        self.rounds = rounds
        self.history = [[0.5, 0.5] for _ in range(10)]  #fixed 10-entry history window, must match the model's 23 inputs
        self.user_score = 0
        self.ai_score = 0
        self.current_round = 0

    def _get_ai_move(self):
        input_data = list(sum(self.history, [])) + [self.user_score, self.ai_score, self.current_round]
        formatted_input = torch.tensor(input_data, dtype=torch.float32).unsqueeze(0).to(device)

        self.model.eval()
        with torch.no_grad():
            prediction = self.model(formatted_input).item()

        return 1 if prediction > 0.5 else 0

    def _update_scores(self, user_move, ai_move):
        if user_move == 0 and ai_move == 0:  #both cooperate
            self.user_score += 1
            self.ai_score += 1
        elif user_move == 0 and ai_move == 1:  #user cooperates, AI defects
            self.user_score += 0
            self.ai_score += 2
        elif user_move == 1 and ai_move == 0:  #user defects, AI cooperates
            self.user_score += 2
            self.ai_score += 0
        else:  #both defect
            self.user_score += -1
            self.ai_score += -1

    def _play_round(self, user_move):
        if self.current_round >= self.rounds:
            print("Game Over! Use get_scores() to see results.")
            return

        if self.current_round == 0:
            ai_move = 0
        else:
            ai_move = self._get_ai_move()
        self._update_scores(user_move, ai_move)

        # Update history
        self.history.pop(0)
        self.history.append([user_move, ai_move])

        self.current_round += 1
        print(f"Round {self.current_round}: You {'Opposed' if user_move else 'Collaborated'}, "
              f"AI {'Opposed' if ai_move else 'Collaborated'}")

        if self.current_round == self.rounds:
            print("Use get_scores() to see results.")

    def oppose(self):
        self._play_round(1)

    def collab(self):
        self._play_round(0)

    def get_scores(self):
        print(f"Final Scores - You: {self.user_score}, AI: {self.ai_score}")Code language: Python (python)
game = GameSession(model)
for _ in range(10):
    if random.random() < 0.5:
        game.collab()
    else:
        game.oppose()
game.get_scores()

Verdict

Just as I hypothesized, the Prisoner’s Dilemma is definitely not a great fit for a classifier model. However, it still wins from time to time.

The more suitable way to implement such a neural network is reinforcement learning, which I really want to explore in class.

Also, this training is not efficient on the GPU: I cannot hear any coil whine from my GPU during the training process, and to add insult to injury, its fan is not even spinning. I tried converting the model to half precision, which makes it use AMD RDNA3’s float16 instead of float32. Unfortunately that completely breaks the model and makes it stupid, and the performance uplift is negligible anyway. A rough sketch of that half-precision attempt is below.
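
For reference, this is roughly what that half-precision conversion could look like (a sketch under the assumption that both the weights and the inputs are cast to float16; the dummy input is made up):

import copy

#cast a copy of the trained model to float16 (half precision)
model_fp16 = copy.deepcopy(model).half().eval()

#inputs must be float16 as well, otherwise PyTorch raises a dtype mismatch
dummy = torch.rand(1, 23, device=device, dtype=torch.float16)
with torch.no_grad():
    print(model_fp16(dummy).item())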

