
Essay: The Prisoner’s Dilemma

Essay details and download:

  • Subject area(s): Science essays
  • Published: 15 October 2019*
  • Last Modified: 22 July 2024
  • Words: 3,243 (approx)
  • Number of pages: 13 (approx)


Abstract
The Prisoner’s Dilemma is a game invented by Merrill Flood and Melvin Dresher in the 1950s; this paper focuses mainly on Robert Axelrod’s Iterated Prisoner’s Dilemma experiments. The Prisoner’s Dilemma is a classic prototype of a game that is responsive to evolutionary behaviours. The Iterated Prisoner’s Dilemma (IPD) is widely studied in Artificial Intelligence and in other fields such as biology because it models cooperation between two individuals. In this paper I mainly use a genetic algorithm to test the payoff matrix, and I compare the genetic algorithm with other methods to find the best possible strategies, showing how it performs against well-known strategies such as TFT, Pavlov, ALLC and ALLD. I simulated a test using the Wake software and an online website [1], running 10,000 generations with a mutation rate of 0.001, a recombination rate of 0.8 and a random intervention of 5%, and finally determine whether the obtained strategy is efficient based on the simulation results.


Introduction

Before we progress into various aspects of Prisoner’s Dilemma we need to learn the basics.
To start off with we look at this snippet from Wikipedia. Two criminals are caught and jailed (Ann and Bob). Each criminal is in solitary confinement which means they cannot communicate with the other. The police lack evidence to prison the pair on the crimes committed. The police wish to sentence both to prison for at least a year. Each prisoner is given the opportunity either to: defect the other by stating that the other person committed the crime, or to cooperate with the other by remaining silent. The offer is:

  • If Ann and Bob each defect against the other, each of them serves 10 years in prison
  • If Ann defects against Bob but Bob remains silent, Ann will be set free and Bob will serve 20 years in prison (and vice versa)
  • If Ann and Bob both remain silent, both of them will only serve 1 year in prison (on the lesser charge)
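The offer can be written down as a small lookup table of prison sentences keyed by the pair of choices; this is just a sketch of the bullets above, and the variable name is illustrative:

```python
# Years in prison for (Ann, Bob), given each pair of choices.
# "D" = defect (testify against the other), "C" = cooperate (stay silent).
SENTENCES = {
    ("D", "D"): (10, 10),  # both defect: 10 years each
    ("D", "C"): (0, 20),   # Ann defects, Bob stays silent: Ann free, Bob 20 years
    ("C", "D"): (20, 0),   # vice versa
    ("C", "C"): (1, 1),    # both stay silent: 1 year each on the lesser charge
}

print(SENTENCES[("C", "C")])  # -> (1, 1)
```

Note that each prisoner individually does better by defecting whatever the other chooses, even though mutual silence is best for the pair.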

This gives an understanding of what the Prisoner’s Dilemma is about. Over the course of the paper we will look at different optimization methods using different memory depths and see how they stand up against strategies such as TFT, Pavlov, ALLC and ALLD. With the help of machine learning I hope to find the most suitable strategy.
Common Obtained Strategies used
There are several strategies used in the area of the Prisoner’s Dilemma; the following are a few of those most often used. I will consider only TFT, Pavlov, ALLD and ALLC for this research paper.
Tit-For-Tat (TFT) – The action chosen is based on the opponent’s last move.
On the first turn, the previous move cannot be known, so always cooperate on the first move. Thereafter, always choose the opponent’s last move as your next move.
Tit-For-Two-Tat (TF2T) – Same as Tit for Tat, but requires two consecutive defections for a defection to be returned. Cooperate on the first two moves. If the opponent defects twice in a row, choose defection as the next move.
Suspicious Tit-For-Tat (STFT) – Always defect on the first move.
Thereafter, replicate opponent’s last move.
Free Rider (ALLD) – Always choose to defect no matter what the opponent’s last turn was.
This is a dominant strategy against an opponent that has a tendency to cooperate.
Always Cooperate (ALLC) – Always choose to cooperate no matter what the opponent’s last turn was.
This strategy can be terribly abused by the Free Rider Strategy.
Pavlov (repeat last choice after a good outcome) – If the last round scored 5 or 3 points, repeat the last choice; otherwise switch.
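As a sketch, the four strategies considered in this paper can be written as small functions; the signatures are illustrative (TFT, ALLC and ALLD see only the opponent's history, while Pavlov sees its own last move and last payoff), and the "good outcome" scores of 5 and 3 follow the description above:

```python
# Illustrative sketches of the four strategies considered in this paper.

def tft(opp_history):
    """Tit-For-Tat: cooperate on the first move, then copy the opponent's last move."""
    return "C" if not opp_history else opp_history[-1]

def allc(opp_history):
    """Always Cooperate, regardless of the opponent."""
    return "C"

def alld(opp_history):
    """Free Rider: always defect, regardless of the opponent."""
    return "D"

def pavlov(last_move, last_payoff):
    """Pavlov: repeat the last choice after a good outcome (5 or 3 points),
    otherwise switch to the other move."""
    if last_payoff in (5, 3):
        return last_move
    return "D" if last_move == "C" else "C"

print(tft(["C", "D"]))   # -> "D"
print(pavlov("C", 3))    # -> "C"
```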
Iterated Prisoner’s Dilemma
If two players play the game repeatedly and remember the previous actions of both prisoners, the game is called the Iterated Prisoner’s Dilemma, commonly referred to as “IPD”. A sample table representing this definition is given below.
In the table, the pair (X, Y) corresponding to Ann’s row and Bob’s column indicates that Ann’s payoff is X and Bob’s payoff is Y. In defining a Prisoner’s Dilemma game, certain conditions have to hold, and the order of the payoffs is important. The best a player can do in this situation is to receive the temptation payoff for defecting, “T”. The worst a player can do is to receive the sucker’s payoff, “S”. The reward for mutual cooperation, “R”, should be better than the punishment for mutual defection, “P”. Therefore, the following must hold: T > R > P > S.
As per the IPD payoff matrix given in the project slide, this paper uses T = 5, R = 3, P = 1 and S = 0.
In the iterated game, player strategies are rules that determine, possibly stochastically, a player’s next move in any given game situation, which can include the history of the game to that point. Each player’s aim is to maximize his total payoff over the series. If you know how many times you are to play, one can argue that the game reduces to a one-shot Prisoner’s Dilemma. The argument rests on the observation that a rational player will defect on the last iteration, because that iteration is in effect a single game, and the opponent will make the same deduction and also defect on the last iteration. Knowing this, both players will defect on the second-to-last iteration as well, and the logic can be applied all the way back to the first iteration. Thus, both players are locked into a sequence of mutual defections.
One way to avoid this situation is to use a routine in which the players do not know when the game will end. If the players know only the probability that the game continues, then from their point of view it is equivalent to an infinite game in which the payoffs in each successive round are discounted by a factor. Depending on the value of this factor and various other parameters, different Nash equilibria are possible in which both players play the same strategy. Some strategies have no advantage over the single game: a player who cooperates regardless of previous behavior (AllC) or who always defects (AllD) will score no better than their memory-less counterpart. Much research suggests, however, that the Tit For Tat (TFT) strategy is very successful. This strategy simply states that a player should repeat the opponent’s move of the previous round, and in earlier research TFT has been shown to outperform most other strategies [2]. Another strategy shown to perform well against a wide range of opponents is the Pavlov strategy. Pavlov players cooperate on a history of mutual cooperation or mutual defection: since they are rewarded with a score of 3 for mutual cooperation, Pavlov players continue to cooperate, and with a history of DD they will also choose to cooperate in the next round, since they were punished with a low score of 1. On the other hand, Pavlov players defect on a history of DC, since they were just rewarded with the best score of 5 points for defection, and also defect on a history of CD, since they were just severely punished with a score of 0 for cooperating. This strategy was shown to perform as well as or better than any other strategy in the memory-one Iterated Prisoner’s Dilemma [5].
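A minimal sketch of an iterated match under these payoffs (T = 5, R = 3, P = 1, S = 0), assuming simple strategy functions that see the opponent's history; the function and variable names are illustrative, not the paper's own code:

```python
# Iterated match under the memory-one payoffs used in this paper:
# T = 5, R = 3, P = 1, S = 0.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play_match(strat_a, strat_b, rounds=10):
    """Play `rounds` games; each strategy sees the opponent's full history."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strat_a(hist_b)
        move_b = strat_b(hist_a)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a += pa
        score_b += pb
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

tft = lambda opp: "C" if not opp else opp[-1]   # Tit-For-Tat
alld = lambda opp: "D"                          # Free Rider

# TFT is exploited only in the first game, then retaliates every round.
print(play_match(tft, alld, rounds=10))  # -> (9, 14)
```

Against ALLD, TFT loses 0–5 in the first game and then both sides score 1 per game, illustrating why TFT "defends" against defectors without initiating defection.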
Genetic Algorithm and Simulation (GA)
Genetic algorithms lend themselves well to studying strategies in the Prisoner’s Dilemma. Each player is represented by its strategy. In the memory-three game used in this study, each player’s strategy must address sixty-four possible histories, so we use the set of moves to create a 64-bit string that represents each player in the algorithm. The simplest explanation of how a genetic algorithm works is the following sequence of steps:
Start: Generate random population of n chromosomes (suitable solutions for the problem)
Fitness: Evaluate the fitness f(x) of each chromosome x in the population
New population: Create a new population by repeating following steps until the new population is complete
Selection: Select two parent chromosomes from a population according to their fitness (the better fitness, the bigger chance to be selected)
Crossover: With a crossover probability cross over the parents to form new offspring (children). If no crossover was performed, offspring is the exact copy of parents.
Mutation: With a mutation probability mutate new offspring at each locus (position in chromosome).
Accepting: Place new offspring in the new population
Replace: Use new generated population for a further run of the algorithm
Test: If the end condition is satisfied, stop, and return the best solution in current population
Loop: Go to step 2
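The steps above can be sketched as a short genetic-algorithm loop. The toy fitness function, population size and loop bounds below are illustrative placeholders, not the paper's (the paper's fitness comes from playing IPD rounds); only the crossover and mutation rates match the values used in this study:

```python
import random

def genetic_algorithm(fitness, n=20, length=64, generations=50,
                      p_crossover=0.8, p_mutation=0.001):
    # Start: random population of n chromosomes (strings of C/D).
    pop = [[random.choice("CD") for _ in range(length)] for _ in range(n)]
    for _ in range(generations):
        # Fitness: evaluate and normalize each chromosome's fitness.
        scores = [fitness(ind) for ind in pop]
        total = sum(scores)
        weights = [s / total for s in scores]
        new_pop = []
        while len(new_pop) < n:
            # Selection: fitter individuals are more likely to be parents.
            p1, p2 = random.choices(pop, weights=weights, k=2)
            child = p1[:]
            # Crossover: single-point, with probability p_crossover.
            if random.random() < p_crossover:
                cut = random.randrange(1, length)
                child = p1[:cut] + p2[cut:]
            # Mutation: flip each locus with probability p_mutation.
            for i in range(length):
                if random.random() < p_mutation:
                    child[i] = "D" if child[i] == "C" else "C"
            new_pop.append(child)   # Accepting
        pop = new_pop               # Replace, then Loop
    return max(pop, key=fitness)    # Test: best solution in final population

# Toy fitness: reward cooperation (the +1 avoids an all-zero population).
best = genetic_algorithm(lambda ind: ind.count("C") + 1)
print(best.count("C"))
```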
The first issue is figuring out how to encode a strategy as a string. Suppose the memory of each player is one previous game. There are four possibilities for the previous game (memory depth = 1):
Case 1: CC
Case 2: CD
Case 3: DC
Case 4: DD
A strategy is simply a rule that specifies an action in each of these cases, where C means “cooperate” and D means “defect”. Case 1 is when both players cooperated in the previous game, case 2 is when player A cooperated and player B defected, and so on. As an example, the encoding of A will be CDCD and the encoding of B will be CCDD; A’s strategy reads:
If CC (Case 1) Then C
If CD (Case 2) Then D
If DC (Case 3) Then C
If DD (Case 4) Then D
If the cases are ordered in this canonical way, this strategy can be expressed compactly as the string CDCD. To use the string as a strategy, the player records the moves made in the previous game (e.g., CD), finds the case number i by looking up that case in a table of ordered cases like the one given above (for CD, i = 2), and selects the letter in the i-th position of the string as its move in the next game (for i = 2, the move is D). If the tournament instead involves strategies that remember three previous games, there are 64 possibilities for those three games:
CC CC CC (Case 1)
CC CC CD (Case 2)
CC CC DC (Case 3)
...
DD DD DC (Case 63)
DD DD DD (Case 64)
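The lookup described above can be sketched as follows, treating C as bit 0 and D as bit 1 so that the 64 memory-three cases enumerate in exactly the order listed; the function names are illustrative:

```python
def case_index(history):
    """Map a memory-three history to its case index (0..63).

    history: list of 3 games, each a (my_move, opp_move) pair of "C"/"D".
    Case 1 (CC CC CC) maps to index 0, Case 64 (DD DD DD) to index 63.
    """
    bits = ""
    for my, opp in history:
        bits += ("0" if my == "C" else "1") + ("0" if opp == "C" else "1")
    return int(bits, 2)

def next_move(strategy, history):
    """strategy: a 64-character string of C/D indexed by case."""
    return strategy[case_index(history)]

print(case_index([("C", "C")] * 3))  # Case 1  -> 0
print(case_index([("D", "D")] * 3))  # Case 64 -> 63
```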
After calculating fitness, which is described in the next section, this study implements roulette wheel selection, also called stochastic sampling with replacement [3]. In this stochastic algorithm, the fitness of each individual is normalized. Based on their fitness, individuals are mapped to contiguous segments of a line, such that each individual’s segment is equal in size to its normalized fitness. A random number is generated and the individual whose segment spans that number is selected. The process is repeated until the required number of individuals is obtained. A sample encoding strategy for IPD is shown in the table below.
The table below gives the entire list of possibilities.
Using the table above, we can now find different encodings for different memory depths. As seen earlier, with memory depth = 1 the encoding of A resulted in CDCD.
When we change the memory depth to 2 we get
Case 2: CD
Case 3: DC
Case 4: DD
Case 5: CC
Now the new encoding of A will be DCDC and that of B will be CDDC.
When we change the memory depth in this way, every bit moves one position to the left and a new bit is inserted at the right end, creating a new set of genes.
Table below shows a sample population with calculated and normalized fitness.
Person 1 has a normalized fitness of approximately 0.20, which gives it a 1-in-5 chance of being selected. Person 10 has the lowest fitness, with a normalized fitness of 0.02. If a person had a fitness of zero, that person would have no chance of being selected to propagate into the new population. Random points are selected on this line to choose individuals to reproduce. Children’s chromosomes are produced by single-point crossover at a random point in the parents’ chromosomes. The mutation rate was 0.001, which produced approximately one mutation in the population per generation, and the recombination rate was set at 0.8.
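Roulette wheel selection as described above (segments on a line, proportional to normalized fitness) can be sketched as follows; the population and fitness values are toy data, not the sample table's:

```python
import random

def roulette_select(population, fitnesses, k):
    """Stochastic sampling with replacement: draw k individuals, each with
    probability proportional to its normalized fitness."""
    total = sum(fitnesses)
    cumulative, acc = [], 0.0
    for f in fitnesses:
        acc += f / total          # segment boundaries on the line [0, 1]
        cumulative.append(acc)
    cumulative[-1] = 1.0          # guard against floating-point round-off
    chosen = []
    for _ in range(k):
        r = random.random()       # a random point on the line
        for ind, bound in zip(population, cumulative):
            if r <= bound:        # first segment spanning the point wins
                chosen.append(ind)
                break
    return chosen

pop = ["A", "B", "C", "D"]
fit = [4.0, 3.0, 2.0, 1.0]        # "A" occupies 40% of the line
picks = roulette_select(pop, fit, k=1000)
print(picks.count("A") / 1000)    # roughly 0.4
```

An individual with fitness zero occupies a zero-length segment and thus, as the text notes, can never be selected.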
As per the payoff table given in the IPD section above, simulations in this study used a genetic algorithm to evolve strategies for the Prisoner’s Dilemma. Each simulation began with an initial population of twenty players represented by their strategies. Several terms are used in this section. A game refers to one “turn” in the Prisoner’s Dilemma: both players make simultaneous moves, and each is awarded points based on the outcome. A round is a set of games between two players; rounds in this study are 64 games long. A cycle is completed when every player has played one round against every other player. To determine fitness, each player was paired with every other player for one round of 64 games. Players did not compete against themselves. Since there are sixty-four possible histories, this number of games ensures that each reachable history is visited at least once. After each game, the players’ scores are tallied and their histories are updated. Players maintain a performance score, which is the sum of the points that they receive in every game against every player. The maximum possible performance score is 6080: if a player defected in every game and his opponents all cooperated in every game, he would receive 5 points X 64 games X 19 opponents. For a player who is able to mutually cooperate in every game, the performance score would be 3,648 (3 points X 64 games X 19 opponents). After a full cycle of play, players are ranked according to their performance score and selected to reproduce. Recombination occurs, children replace parents, and the cycle repeats. At the end of each generation, the total score for the population is tallied; this value is the sum of the scores of all members of the population.
While the maximum score for an individual is 6080, the maximum score for a population of 20 cannot be 20 times that: for one individual to score the maximum, all others must score very low. The highest cumulative score achievable in an individual game is 6, when both players receive 3 points for mutual cooperation. Mutual defection gives a total game score of 2 (1 point each), and mixed plays, with one cooperator and one defector, give a game total of 5 (5 for the defector plus 0 for the cooperator). Thus, the highest score that a population can achieve is 72,960 (3 points X 64 games X 19 opponents gives 3,648 per player, X 20 players in total). In the end, the fitness of a population is measured by the percentage of this highest possible score that it achieves. A population with a total score of 36,480, for example, would have a population fitness of 50% (36,480 / 72,960).
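The score bounds above follow directly from the payoff values; a quick arithmetic check, with the variable names chosen here for illustration:

```python
# Score bounds with payoffs T=5, R=3, P=1, S=0,
# 64 games per round and 19 opponents per cycle.
GAMES, OPPONENTS, PLAYERS = 64, 19, 20

max_individual = 5 * GAMES * OPPONENTS   # pure defector vs. all-cooperators
mutual_coop = 3 * GAMES * OPPONENTS      # mutual cooperation in every game
max_population = mutual_coop * PLAYERS   # every pair cooperates

print(max_individual)           # -> 6080
print(mutual_coop)              # -> 3648
print(max_population)           # -> 72960
print(36480 / max_population)   # -> 0.5
```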
Results of Genetic Algorithm
To test whether or not a population had each of the two traits described in the hypothesis, players’ behavior in these experiments is compared to the behavior of Tit-For-Tat. Consider a population that has evolved Tit-for-Tat-like behavior. That population is likely using only a small percentage of its genes, because many of the possible histories are not reachable by a Tit-for-Tat player (e.g. CDCDCD, where a player keeps cooperating against an always-defecting opponent). This means that such an individual might look very little like an unevolved Tit-For-Tat player.
Five distinct populations were used to compare behavior before and after evolution. Tit For Tat and Pavlov, as discussed previously, were the two control populations for this experiment. Both have the inherent ability to exploit mutual cooperation and defend against defectors. Three other populations were respectively comprised of AllC players, of AllD players, and of independently randomly initialized players.
To measure the performance of populations, the average fitness over the last 10,000 generations of each simulation was studied. Starting with the five initial populations, each was evolved for 200,000 generations. This evolution was simulated several hundred times for each initial population. Significance was calculated by the standard two-tailed t-test for data sets with unequal variance. Each population was compared to the Tit for Tat and Pavlov strategies.
Evolved populations of players develop the ability to defend against defectors, and the ability to take advantage of mutual cooperation.
After a period of evolution as described earlier, the average performances of the five populations were statistically equal. This equality came about as a result of “random drift” of the populations. Random drift occurs when strategies are recombined and mutated without selection. Each specific gene has occurred simply by chance mutation or recombination, and the performance of such a population is generally low. By turning off the selection mechanism in the genetic algorithm, results for a random drift population were generated. The evolved populations all performed well above the level of the random drift population, indicating that they exhibit evolutionarily preferred traits.
The first experiment looks for the ability to defend against defectors. In this experiment, the five unevolved initial populations were mixed with a small set of AllD players, and the fitness of those populations was calculated over the first 10,000 generations immediately following inoculation. TFT and Pavlov performed well, with scores around 80%. None of Random, ALLC or ALLD came near this level; by the standard t-test, all were significantly lower than Pavlov and TFT with p = 0.01.
The same experiment was performed with the five populations after evolution. After 200,000 generations, the populations were mixed with a small group of AllD players. Fitness was calculated over the next 10,000 generations. Looking at the average fitness of all five populations, it was found that there was no statistical difference in performance among them with p=.01. Additionally, comparing these results to the performance of unevolved Tit for Tat and Pavlov players, there was no statistical difference. Additionally, there was no statistical difference between performance of the inoculated populations, and the uninoculated evolved populations, indicating that defectors had no effect on performance of evolved populations.
Repeating the same experimental structure above, the five unevolved populations were mixed with a small set of AllC players. Tit for Tat again performed at nearly 80% of the maximum fitness, as did the initial population of AllC players. AllC players always cooperate by their nature. In an initial population made up entirely of AllC players, mutual cooperation is the norm, and introducing more AllC players to that population obviously does not change it. The prevalence of mutual cooperation explains the excellent performance of the unevolved AllC population.

Conclusion

These results lead to several conclusions. Our first experiment shows that defectors affect all five of the evolved populations in the same way. The populations react identically, but does this necessarily indicate that they all have a defensive ability?
Since the populations that were initially unable to defend against defectors show that behaviour after evolution, they must have evolved the ability over time. The second set of conclusions that can be drawn concerns mutual cooperation.
Results here show that evolved populations are able to cooperate among themselves, since they perform the same as the control populations in the presence of cooperators. Further, one can conclude that populations exhibit this behaviour even without the experimental conditions, since there is no difference between performance in the natural, evolved environment and performance in the presence of pure cooperators.
With the results outlined above, it follows that in this experiment, evolved populations performed equivalently to Tit for Tat. Specifically, these experiments show that evolved populations are able to oppose the defectors and mutually cooperate with other evolved individuals.
Since these populations did not originally have such abilities, it follows that evolution introduced this behaviour over time. Some preliminary simulations have been run to study this phenomenon in probabilistic strategies. Initial results show no difference between deterministic and probabilistic populations.

About this essay:

If you use part of this page in your own work, you need to provide a citation, as follows:

Essay Sauce, The Prisoner’s Dilemma. Available from:<https://www.essaysauce.com/science-essays/2016-3-21-1458542944/> [Accessed 14-04-26].


* This essay may have been previously published on EssaySauce.com and/or Essay.uk.com at an earlier date than indicated.