Abstract
The Prisoner’s Dilemma is a game invented by Merrill Flood and Melvin Dresher in the 1950s and later popularized by Robert Axelrod’s Iterated Prisoner’s Dilemma experiments. The Prisoner’s Dilemma is a classic prototype for studying evolutionary behaviour. The Iterated Prisoner’s Dilemma (IPD) is widely studied in Artificial Intelligence and in other fields such as biology, because it models cooperation between two individuals. In this paper I mainly use a genetic algorithm to test the payoff matrix and compare the genetic algorithm with other methods to find the best possible strategies; I show how the genetic algorithm performs in comparison to established strategies such as TFT, Pavlov, ALLC and ALLD. I simulated a test using Wake software and a website online [1], running 10,000 generations with a mutation rate of 0.001, a recombination rate of 0.8 and a random intervention of 5%, and finally determine whether the obtained strategy is efficient based on the results of the simulation.
Introduction
Before we progress into the various aspects of the Prisoner's Dilemma, we need to learn the basics.
To start, consider this summary adapted from Wikipedia. Two criminals, Ann and Bob, are caught and jailed. Each criminal is held in solitary confinement, which means they cannot communicate with each other. The police lack the evidence to imprison the pair on the crimes committed, but wish to sentence both to at least a year in prison. Each prisoner is given the opportunity either to defect, by testifying that the other person committed the crime, or to cooperate with the other by remaining silent. The offer is:
If Ann and Bob each defect against the other, each of them serves 10 years in prison
If Ann defects against Bob but Bob remains silent, Ann is set free and Bob serves 20 years in prison (and vice versa)
If Ann and Bob both remain silent, each of them serves only 1 year in prison (on the lesser charge)
This gives an understanding of what the Prisoner's Dilemma is about. Over the course of this paper we will look at different optimization methods using different memory depths and see how they fare against established strategies such as TFT, Pavlov, ALLC and ALLD. With the help of machine learning I hope to find the most suitable strategy.
Commonly Used Strategies
Several strategies are used in the area of the Prisoner's Dilemma. The following are a few of those often used; I will consider only TFT, Pavlov, ALLD and ALLC for this research paper.
Tit-For-Tat (TFT) – The action chosen is based on the opponent’s last move.
On the first turn, the previous move cannot be known, so always cooperate on the first move. Thereafter, always choose the opponent’s last move as your next move.
Tit-For-Two-Tats (TF2T) – Same as Tit-For-Tat, but requires two consecutive defections before a defection is returned. Cooperate on the first two moves; if the opponent defects twice in a row, choose defection as the next move.
Suspicious Tit-For-Tat (STFT) – Always defect on the first move.
Thereafter, replicate the opponent’s last move.
Free Rider (ALLD) – Always choose to defect no matter what the opponent’s last turn was.
This is a dominant strategy against an opponent that has a tendency to cooperate.
Always Cooperate (ALLC) – Always choose to cooperate no matter what the opponent’s last turn was.
This strategy can be terribly abused by the Free Rider Strategy.
Pavlov (repeat last choice after a good outcome) – If 5 or 3 points were scored in the last round, repeat the last choice; otherwise, switch to the opposite move.
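The memory-one strategies above can be sketched as small functions. This is an illustrative sketch, not the paper's implementation: the function names, the "C"/"D" move encoding, and the way history is passed in are my own choices, and the 3/5 point values assume the standard payoffs used later in the paper.

```python
# Sketch of the strategies above. "C"/"D" mean cooperate/defect;
# my_last/opp_last are the moves from the previous round (None on round 1).
# Names and the history encoding are illustrative assumptions.

def tft(my_last, opp_last):
    # Tit-For-Tat: cooperate first, then copy the opponent's last move.
    return "C" if opp_last is None else opp_last

def stft(my_last, opp_last):
    # Suspicious TFT: defect first, then copy the opponent's last move.
    return "D" if opp_last is None else opp_last

def alld(my_last, opp_last):
    return "D"  # Free Rider: always defect.

def allc(my_last, opp_last):
    return "C"  # Always Cooperate.

def pavlov(my_last, opp_last, last_score=None):
    # Repeat the last move after a good outcome (3 or 5 points), else switch.
    if my_last is None:
        return "C"
    return my_last if last_score in (3, 5) else ("D" if my_last == "C" else "C")
```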
Iterated Prisoner's Dilemma
If two players play the game repeatedly and remember the previous actions of the other prisoner, the game is called the Iterated Prisoner's Dilemma, commonly referred to as “IPD”. This is a sample table representing the above definition.
In the table, the pair (X, Y), corresponding to the row and column choices of Ann and Bob respectively, indicates that Ann's payoff is X and Bob's payoff is Y. In defining a Prisoner’s Dilemma game, certain conditions have to hold, and the order of the payoffs is important. The best a player can do in this situation is to defect and receive the temptation payoff, “T”. The worst a player can do is to receive the sucker's payoff, “S”, for cooperating while the other defects. If the two players cooperate, then the reward for that mutual cooperation, “R”, should be better than the punishment for mutual defection, “P”. Therefore, the following must hold: T > R > P > S.
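The payoff ordering can be made concrete with a small sketch. The values T=5, R=3, P=1, S=0 are an assumption here, chosen to match the Pavlov scores (3, 1, 5, 0) cited later in this section; the check 2R > T + S is the standard additional condition for the iterated game, included here though not stated in the text above.

```python
# Assumed standard IPD payoffs (T=5, R=3, P=1, S=0, matching the Pavlov
# scores cited later). PAYOFF maps (my move, opponent's move) to my payoff.
T, R, P, S = 5, 3, 1, 0
PAYOFF = {("D", "C"): T, ("C", "C"): R, ("D", "D"): P, ("C", "D"): S}

# The defining condition for a Prisoner's Dilemma:
assert T > R > P > S
# Standard extra condition for the iterated game: mutual cooperation must
# beat alternating exploitation.
assert 2 * R > T + S
```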
As per the IPD Payoff matrix given in the project slide.
In the iterated game, player strategies are rules that determine, possibly stochastically, a player’s next move in any given game situation, which can include the history of the game to that point. Each player’s aim is to maximize his total payoff over the series. If you know how many times you are to play, one can argue that the game reduces to a one-shot Prisoner’s Dilemma. The argument is based on the observation that you, as a rational player, will defect on the last iteration, because you are in effect playing a single game. Your opponent will make the same deduction and therefore also defect on the last iteration. This backward induction can be applied all the way to the first iteration. Thus, both players are locked into a sequence of mutual defections.
One way to avoid this situation is to use a setup in which the players do not know when the game will end. If the players know the probability that the game continues, then from their point of view it is equivalent to an infinite game in which the payoffs in each successive round are discounted by a factor. Depending on the value of this factor and various other parameters, different Nash equilibria are possible in which both players play the same strategy. Some strategies have no advantage over the single game: a player who cooperates regardless of previous behaviour (ALLC), or who always defects (ALLD), will score no better than their memory-less counterpart. Much research suggests, however, that the Tit-For-Tat (TFT) strategy is very successful. This strategy simply states that a player should repeat the opponent’s move of the previous round. In earlier research, TFT has been shown to outperform most other strategies [2]. Another strategy shown to perform well against a wide range of opponents is the Pavlov strategy. Pavlov players cooperate after a history of mutual cooperation or mutual defection: since mutual cooperation rewards them with a score of 3, they continue to cooperate, and after a history of DD they also choose to cooperate in the next round, since they have just been punished with a low score of 1. On the other hand, Pavlov players defect after a history of DC, since they have just been rewarded with the best score of 5 points for defection, and also defect after a history of CD, since they have just been severely punished with a score of 0 for cooperating. This strategy was shown to perform as well as or better than any other strategy in the memory-one Iterated Prisoner’s Dilemma [5].
Genetic Algorithm and Simulation (GA)
Genetic algorithms lend themselves well to studying strategies in the Prisoner’s Dilemma. Each player is represented by its strategy. In the memory-three game used in this study, each player’s strategy must specify a move for each of sixty-four possible histories. We use this set of moves to create a 64-character string which represents each player in the algorithm. The following is the simplest outline of how a genetic algorithm works:
Start: Generate random population of n chromosomes (suitable solutions for the problem)
Fitness: Evaluate the fitness f(x) of each chromosome x in the population
New population: Create a new population by repeating following steps until the new population is complete
Selection: Select two parent chromosomes from a population according to their fitness (the better fitness, the bigger chance to be selected)
Crossover: With a crossover probability, cross over the parents to form new offspring (children). If no crossover is performed, the offspring is an exact copy of the parents.
Mutation: With a mutation probability mutate new offspring at each locus (position in chromosome).
Accepting: Place new offspring in the new population
Replace: Use new generated population for a further run of the algorithm
Test: If the end condition is satisfied, stop, and return the best solution in current population
Loop: Go to step 2
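The steps above can be sketched as a minimal GA loop. This is an illustrative sketch, not the study's implementation: `fitness` is a placeholder for the tournament scoring described later, and the single-point crossover and per-locus bit-flip mutation are assumed details, with default rates matching those in the abstract (0.8 and 0.001).

```python
import random

# Minimal sketch of the GA loop above. fitness(individual) -> positive score
# is a placeholder for the tournament scoring described in the paper.
def evolve(fitness, n=20, length=64, generations=100,
           crossover_rate=0.8, mutation_rate=0.001):
    # Start: random population of n strategy strings over {C, D}.
    pop = [[random.choice("CD") for _ in range(length)] for _ in range(n)]
    for _ in range(generations):
        # Fitness: evaluate each chromosome.
        scores = [fitness(ind) for ind in pop]
        new_pop = []
        while len(new_pop) < n:
            # Selection: fitness-proportionate choice of two parents.
            p1, p2 = random.choices(pop, weights=scores, k=2)
            # Crossover: single-point, with probability crossover_rate.
            if random.random() < crossover_rate:
                cut = random.randrange(1, length)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]  # exact copy of a parent
            # Mutation: flip each locus with probability mutation_rate.
            child = [("D" if g == "C" else "C")
                     if random.random() < mutation_rate else g
                     for g in child]
            # Accepting: place offspring in the new population.
            new_pop.append(child)
        # Replace: the new generation becomes the current population.
        pop = new_pop
    # Test/stop: return the best solution in the final population.
    return max(pop, key=fitness)
```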
The first issue is figuring out how to encode a strategy as a string. Suppose each player remembers one previous game. Then there are four possibilities for the previous game (memory depth = 1):
Case 1: CC
Case 2: CD
Case 3: DC
Case 4: DD
Here C means “cooperate” and D means “defect”, and each case lists player A's move followed by player B's: case 1 is when both players cooperated in the previous game, case 2 is when player A cooperated and player B defected, and so on. Reading A's move across the four cases therefore gives the encoding CCDD for A, and CDCD for B. A strategy is simply a rule that specifies an action in each of these cases, for example:
If CC (Case 1) Then C
If CD (Case 2) Then D
If DC (Case 3) Then C
If DD (Case 4) Then D
If the cases are ordered in this canonical way, this strategy can be expressed compactly as the string CDCD. To use the string as a strategy, the player records the moves made in the previous game (e.g., CD), finds the case number i by looking up that case in a table of ordered cases like the one given above (for CD, i = 2), and selects the letter in the ith position of the string as its move in the next game (for i = 2, the move is D). Now consider a tournament involving strategies that remember the three previous games; then there are 64 possibilities for the previous three games:
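The memory-one lookup just described can be sketched directly; the function name and 0-based indexing are illustrative choices.

```python
# Looking up the next move from the strategy string CDCD, using the
# canonical case order CC, CD, DC, DD described above.
CASES = ["CC", "CD", "DC", "DD"]

def next_move(strategy, prev_game):
    i = CASES.index(prev_game)  # case number (0-based here)
    return strategy[i]          # the letter in that position is the move

# For the history CD (case 2), the strategy CDCD plays D:
assert next_move("CDCD", "CD") == "D"
```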
CC CC CC (Case 1)
CC CC CD (Case 2)
CC CC DC (Case 3)
...
DD DD DC (Case 63)
DD DD DD (Case 64)
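This ordering has a convenient arithmetic reading: treating each move as a bit (C = 0, D = 1), the six-bit history is the case number itself. The sketch below assumes that reading; it is one way to index the 64 cases, not necessarily the paper's.

```python
# With memory depth 3 there are 2**6 = 64 histories. Treating each move as
# a bit (C=0, D=1) gives the case number directly, matching the ordering
# above (CC CC CC -> case 1, DD DD DD -> case 64).
def case_number(history):
    bits = "".join("0" if m == "C" else "1" for m in history.replace(" ", ""))
    return int(bits, 2) + 1  # 1-based case index

assert case_number("CC CC CC") == 1
assert case_number("CC CC CD") == 2
assert case_number("DD DD DD") == 64
```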
After calculating fitness, which is described in the next section, this study implements roulette wheel selection, also called stochastic sampling with replacement [3]. In this stochastic algorithm, the fitness of each individual is normalized. Based on their fitness, individuals are mapped to contiguous segments of a line, such that each individual's segment is equal in size to its fitness. A random number is generated, and the individual whose segment spans the random number is selected. The process is repeated until the required number of individuals is obtained.
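Roulette wheel selection as described above can be sketched as follows; the function name and argument shapes are illustrative, not taken from the study's code.

```python
import random

# Roulette-wheel (fitness-proportionate) selection: each individual
# occupies a segment of [0, total) equal in size to its fitness, and a
# uniform random number picks the segment it lands in.
def roulette_select(population, fitnesses, k):
    total = sum(fitnesses)
    selected = []
    for _ in range(k):
        r = random.uniform(0, total)
        acc = 0.0
        for ind, f in zip(population, fitnesses):
            acc += f
            if r < acc:
                selected.append(ind)
                break
        else:
            selected.append(population[-1])  # guard against r == total
    return selected
```

Sampling is with replacement, so a highly fit individual can be selected many times, which is exactly the selection pressure the GA relies on.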