Abstract
The Prisoner’s Dilemma is a game invented by Merrill Flood and Melvin Dresher in the 1950s; this paper focuses on the Iterated Prisoner’s Dilemma experiments of Robert Axelrod. The Prisoner’s Dilemma is a classic prototype for studying evolutionary behaviour. The Iterated Prisoner’s Dilemma (IPD) is widely studied in Artificial Intelligence and in other fields such as biology because it models cooperation between two individuals. In this paper I mainly use a genetic algorithm on the payoff matrix and compare it with other methods to find the best possible strategies, showing how the genetic algorithm performs against known strategies such as TFT, ALLC and ALLD. I simulated a test using Wake software and an online website [1], running 10,000 generations with a mutation rate of 0.001, a recombination rate of 0.8 and a random intervention of 5%, and finally determine whether the obtained strategy is efficient based on the simulation results.
Introduction
Before we progress into the various aspects of the Prisoner's Dilemma, we need to learn the basics.
To start, consider this scenario adapted from Wikipedia. Two criminals, Ann and Bob, are caught and jailed. Each is held in solitary confinement and cannot communicate with the other. The police lack the evidence to imprison the pair on the crimes committed, but wish to sentence both to at least a year in prison. Each prisoner is given the opportunity either to betray the other by stating that the other person committed the crime, or to cooperate with the other by remaining silent. The offer is:
If Ann and Bob each betray the other, each of them serves 10 years in prison
If Ann betrays Bob but Bob remains silent, Ann will be set free and Bob will serve 20 years in prison (and vice versa)
If Ann and Bob both remain silent, both of them will only serve 1 year in prison (on the lesser charge)
This gives an understanding of what the Prisoner's Dilemma is about. Over the course of the paper we will look at different optimization methods using different memory depths and see how they fare against other strategies such as TFT, TF2T and STFT. With the help of machine learning I hope to find the most suitable strategy.
Commonly Obtained Strategies
There are several strategies used in the area of the Prisoner's Dilemma. The following are a few of the most common; I will consider only TFT, ALLD and ALLC in this research paper.
Tit-For-Tat (TFT) – The action chosen is based on the opponent’s last move.
On the first turn, the previous move cannot be known, so always cooperate on the first move. Thereafter, always choose the opponent’s last move as your next move.
Tit-For-Two-Tat (TF2T) – Same as Tit for Tat, but requires two consecutive defections for a defection to be returned. Cooperate on the first two moves. If the opponent defects twice in a row, choose defection as the next move.
Suspicious Tit-For-Tat (STFT) – Always defect on the first move.
Thereafter, replicate opponent’s last move.
Free Rider (ALLD) – Always choose to defect no matter what the opponent’s last turn was.
This is a dominant strategy against an opponent that has a tendency to cooperate.
Always Cooperate (ALLC) – Always choose to cooperate no matter what the opponent's last turn was.
This strategy can be terribly abused by the Free Rider Strategy.
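As an illustration, the strategies above can be written as small functions of the opponent's move history. The function names and signatures here are my own sketch, not taken from any library.

```python
# Minimal sketches of the strategies above. Each function takes the
# opponent's move history (a list of 'C'/'D') and returns the next move.

def tft(opp_history):
    # Tit-For-Tat: cooperate on the first move, then copy the opponent's last move.
    return 'C' if not opp_history else opp_history[-1]

def tf2t(opp_history):
    # Tit-For-Two-Tat: defect only after two consecutive opponent defections.
    if len(opp_history) >= 2 and opp_history[-2:] == ['D', 'D']:
        return 'D'
    return 'C'

def stft(opp_history):
    # Suspicious Tit-For-Tat: defect on the first move, then copy the last move.
    return 'D' if not opp_history else opp_history[-1]

def alld(opp_history):
    # Free Rider: always defect.
    return 'D'

def allc(opp_history):
    # Always Cooperate.
    return 'C'
```

Playing ALLC against ALLD with these functions shows the abuse described above: the Free Rider defects every turn while the cooperator never retaliates.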
Iterated Prisoner's Dilemma
If two players play more than once in succession and remember the previous actions of their opponent, the game is called the Iterated Prisoner's Dilemma, commonly referred to as "IPD". The table below represents this definition.
                      BOB
              Cooperate    Defect
ANN Cooperate   R,R          S,T
    Defect      T,S          P,P
In the table, the pair (X, Y) in the row chosen by Ann and the column chosen by Bob indicates that Ann's payoff is X and Bob's payoff is Y. In defining a Prisoner's Dilemma game, certain conditions have to hold, and the order of the payoffs is important. The best a player can do in this situation is to defect and receive the temptation payoff, T. The worst a player can do is to receive the sucker's payoff, S. The reward for mutual cooperation, R, should be better than the punishment for mutual defection, P. Therefore the following must hold: T > R > P > S. (For the iterated game it is also usually required that 2R > T + S, so that taking turns exploiting each other does not pay better than mutual cooperation.)
As per the IPD Payoff matrix given in the project slide.
                      BOB
              Cooperate    Defect
ANN Cooperate   3,3          0,5
    Defect      5,0          1,1
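The payoff table above can be sketched as a small lookup keyed by the two simultaneous moves; the names PAYOFF and play_game are illustrative, not from the paper.

```python
# The IPD payoff matrix above, keyed by (Ann's move, Bob's move).
PAYOFF = {
    ('C', 'C'): (3, 3),  # mutual cooperation: R, R
    ('C', 'D'): (0, 5),  # Ann is the sucker: S, T
    ('D', 'C'): (5, 0),  # Ann exploits Bob: T, S
    ('D', 'D'): (1, 1),  # mutual defection: P, P
}

# The ordering condition T > R > P > S holds for these values.
T, R, P, S = 5, 3, 1, 0
assert T > R > P > S

def play_game(ann_move, bob_move):
    """Return (Ann's score, Bob's score) for one game."""
    return PAYOFF[(ann_move, bob_move)]
```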
In the iterated game, player strategies are rules that determine, possibly stochastically, a player's next move in any given game situation, which can include the history of the game to that point. Each player's aim is to maximize his total payoff over the series. If the players know how many times they are to play, one can argue that the game reduces to a one-shot Prisoner's Dilemma. The argument rests on backward induction: a rational player will defect on the last iteration, because that iteration is in effect a single game. Knowing that your opponent will defect on the last iteration, you will defect on the second-to-last iteration as well, and your opponent will make the same deduction. This logic can be applied all the way back to the first iteration. Thus, both players are locked into a sequence of mutual defections.
One way to avoid this situation is to use a setup in which the players do not know when the game will end. If the players know only the probability that the game continues, then from their point of view it is equivalent to an infinite game in which the payoffs in each successive round are discounted by a factor. Depending on the value of this factor and various other parameters, different Nash equilibria are possible in which both players play the same strategy. Some strategies have no advantage over the single game: a player who cooperates regardless of previous behavior (ALLC) or who always defects (ALLD) will score no better than their memory-less counterpart. Much research suggests, however, that the Tit For Tat (TFT) strategy is very successful. This strategy simply repeats the opponent's move of the previous round. In earlier research, TFT has been shown to outperform most other strategies [2]. Another strategy shown to perform well against a wide range of opponents is the Pavlov strategy.
Genetic Algorithm and Simulation (GA)
Genetic algorithms lend themselves well to studying strategies in the Prisoner's Dilemma. Each player is represented by its strategy. In the memory-three game used in this study, each player's strategy must address sixty-four possible histories, so the set of moves forms a 64-character string which represents each player in the algorithm. The simplest outline of how a genetic algorithm works is:
Start: Generate random population of n chromosomes (suitable solutions for the problem)
Fitness: Evaluate the fitness f(x) of each chromosome x in the population
New population: Create a new population by repeating the following steps until the new population is complete
Selection: Select two parent chromosomes from a population according to their fitness (the better fitness, the bigger chance to be selected)
Crossover: With a crossover probability, cross over the parents to form new offspring (children). If no crossover is performed, the offspring are exact copies of the parents.
Mutation: With a mutation probability mutate new offspring at each locus (position in chromosome).
Accepting: Place new offspring in the new population
Replace: Use new generated population for a further run of the algorithm
Test: If the end condition is satisfied, stop, and return the best solution in current population
Loop: Go to step 2
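The steps above can be sketched in Python. This is a minimal illustration rather than the exact tournament code used in the study; the fitness function here is a stand-in supplied by the caller, and all names are my own.

```python
import random

def genetic_algorithm(fitness, n=20, length=64, generations=100,
                      crossover_rate=0.8, mutation_rate=0.001):
    """Sketch of the GA loop above; `fitness` maps a 'C'/'D' string
    to a positive number."""
    # Start: generate a random population of n chromosomes.
    pop = [''.join(random.choice('CD') for _ in range(length)) for _ in range(n)]
    for _ in range(generations):
        # Fitness: evaluate each chromosome in the population.
        scores = [fitness(c) for c in pop]
        total = float(sum(scores))

        def select():
            # Selection: roulette wheel over cumulative fitness.
            r, acc = random.uniform(0, total), 0.0
            for chrom, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return chrom
            return pop[-1]

        new_pop = []
        while len(new_pop) < n:
            p1, p2 = select(), select()
            # Crossover: single-point, with probability crossover_rate.
            if random.random() < crossover_rate:
                point = random.randrange(1, length)
                p1, p2 = p1[:point] + p2[point:], p2[:point] + p1[point:]
            for child in (p1, p2):
                # Mutation: flip each locus with probability mutation_rate.
                child = ''.join(
                    ('D' if b == 'C' else 'C')
                    if random.random() < mutation_rate else b
                    for b in child)
                new_pop.append(child)  # Accepting
        pop = new_pop[:n]  # Replace
    # End condition reached: return the best solution in the final population.
    return max(pop, key=fitness)
```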
The first issue is figuring out how to encode a strategy as a string. Suppose each player remembers one previous game. There are then four possibilities for the previous game (memory depth = 1):
Case 1: CC
Case 2: CD
Case 3: DC
Case 4: DD
Therefore the encoding of A will be CDCD and the encoding of B will be CCDD, where C means "cooperate" and D means "defect". Case 1 is when both players cooperated in the previous game, case 2 is when player A cooperated and player B defected, and so on. A strategy is simply a rule that specifies an action in each of these cases:
If CC (Case 1) Then C
If CD (Case 2) Then D
If DC (Case 3) Then C
If DD (Case 4) Then D
If the cases are ordered in this canonical way, this strategy can be expressed compactly as the string CDCD. To use the string as a strategy, the player records the moves made in the previous game (e.g., CD), finds the case number i by looking up that case in a table of ordered cases like the one given above (for CD, i = 2), and selects the letter in the ith position of the string as its move in the next game (for i = 2, the move is D). If the tournament involves strategies that remember the three previous games, then there are 64 possibilities for the previous three games:
CC CC CC (Case 1)
CC CC CD (Case 2)
CC CC DC (Case 3)
...
DD DD DC (Case 63)
DD DD DD (Case 64)
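The case lookup can be sketched as follows. Here next_move handles memory depth 1, and case_index generalizes the canonical ordering to deeper memories by reading C as 0 and D as 1; both function names are my own.

```python
# Canonical case ordering for memory depth 1 (CC, CD, DC, DD), as above.
CASES = ['CC', 'CD', 'DC', 'DD']

def next_move(strategy, my_last, opp_last):
    """Look up the next move in a memory-one strategy string such as 'CDCD'."""
    i = CASES.index(my_last + opp_last)  # 0-based case number
    return strategy[i]

def case_index(history):
    """For deeper memories, read the history of (my_move, opp_move) pairs,
    oldest first, as a binary number with C = 0 and D = 1."""
    bits = ''.join('0' if m == 'C' else '1' for game in history for m in game)
    return int(bits, 2)
```

With three remembered games, case_index maps CC CC CC to 0 and DD DD DD to 63, matching the 64-case listing above (0-based).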
After calculating fitness, which is described in the next section, this study implements roulette wheel selection, also called stochastic sampling with replacement [3]. In this stochastic algorithm, the fitness of each individual is normalized. Based on their fitness, individuals are mapped to contiguous segments of a line, such that each individual's segment is equal in size to its fitness. A random number is generated and the individual whose segment spans that number is selected. The process is repeated until the required number of individuals is obtained. A sample encoding strategy for the IPD is shown below.
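Roulette wheel selection as described above can be sketched as follows; the helper name roulette_select is illustrative.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def roulette_select(population, fitnesses, k):
    """Stochastic sampling with replacement: map individuals to contiguous
    segments of a line sized by fitness, then pick k random points."""
    cum = list(accumulate(fitnesses))  # right endpoint of each segment
    total = cum[-1]
    chosen = []
    for _ in range(k):
        r = random.uniform(0, total)   # random point on the line
        chosen.append(population[bisect_left(cum, r)])
    return chosen
```

An individual with zero fitness owns a zero-length segment, so it is (almost surely) never selected, matching the zero-fitness remark below.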
The table below gives the entire list of possibilities.
Using the table above we can now find different encodings for different memory depths. As seen earlier, with memory depth = 1 the encoding of A resulted in CDCD.
When we change the memory depth to 2 we get
Case 2: CD
Case 3: DC
Case 4: DD
Case 5: CC
Now the new encoding of A will be DCDC and B will be CDDC.
When we change the memory depth, all bits move one position to the left and a new bit enters at the right end, creating a new set of genes.
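The sliding window described above can be sketched as a small helper; the name update_history is my own, and games are written as two-letter strings like 'CD'.

```python
def update_history(history, new_game, depth):
    """Slide the remembered window: drop the oldest game, append the newest,
    and keep only the last `depth` games."""
    return (history + [new_game])[-depth:]
```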
The table below shows a sample population with calculated and normalized fitness.
Person 1 has a normalized fitness of approximately 0.20, which gives it a 1 in 5 chance of being selected. Person 10 has the lowest fitness, with a normalized fitness of 0.02. If a person had a fitness of zero, that person would have no chance of being selected to propagate into the new population. Random points are selected on this line to choose individuals to reproduce. Children's chromosomes are produced by single-point crossover at a random point in the parents' chromosomes. The mutation rate was 0.001, which produced approximately one mutation in the population per generation, and the recombination rate was set at 0.8.
Person    Fitness    Normalized Fitness
1         27         0.20
2         22         0.17
3         18         0.14
4         15         0.11
5         17         0.10
6         12         0.10
7         9          0.07
8         8          0.06
9         4          0.03
10        3          0.02
As per the payoff table given in the IPD section above, simulations in this study used a genetic algorithm to evolve strategies for the Prisoner's Dilemma. Each simulation began with an initial population of twenty players represented by their strategies. Several terms are used in this section: a game refers to one "turn" of the Prisoner's Dilemma, in which both players move simultaneously and each is awarded points based on the outcome; a round is a set of games between two players (rounds in this study are 64 games long); and a cycle is completed when every player has played one round against every other player. To determine fitness, each player was paired with every other for one round of 64 games. Players did not compete against themselves. Since there are sixty-four possible histories, this number of games ensures that each reachable history is visited at least once. After each game, the players' scores are tallied and their histories are updated. Players maintain a performance score, which is the sum of the points they receive in every game against every player. The maximum possible performance score is 6,080: if a player defected in every game and his opponents all cooperated in every game, he would receive 5 points × 64 games × 19 opponents. For a player who manages mutual cooperation in every game, the performance score would be 3,648 (3 points × 64 games × 19 opponents). After a full cycle of play, players are ranked according to their performance score and selected to reproduce. Recombination occurs, children replace parents, and the cycle repeats. At the end of each generation, the total score for the population is tallied; this value is the sum of the scores of all members of the population.
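The round-robin fitness cycle described above can be sketched as follows. The player functions and names are illustrative, and the payoff dictionary matches the IPD table given earlier.

```python
def cycle_scores(players, payoff, games=64):
    """One cycle: every player plays one round (64 games) against every
    other player; players never face themselves. `players` maps names to
    move functions f(my_history, opp_history) -> 'C' or 'D'. Returns the
    total performance score per player."""
    scores = {name: 0 for name in players}
    names = list(players)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            hist_a, hist_b = [], []
            for _ in range(games):
                move_a = players[a](hist_a, hist_b)
                move_b = players[b](hist_b, hist_a)
                pts_a, pts_b = payoff[(move_a, move_b)]
                scores[a] += pts_a
                scores[b] += pts_b
                hist_a.append(move_a)
                hist_b.append(move_b)
    return scores
```

With one always-defecting and one always-cooperating player, the defector earns 5 points in each of the 64 games of their round, illustrating the per-opponent term of the 6,080 maximum above.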