Bruno BEAUFILS - Jean-Paul DELAHAYE - Philippe MATHIEU
Laboratoire d'Informatique Fondamentale de Lille
U.R.A. 369 C.N.R.S. - Université des Sciences et Technologies de
Lille
U.F.R. d'I.E.E.A. Bât. M3
59655 Villeneuve d'Ascq Cedex - FRANCE
{beaufils,delahaye,mathieu}@lifl.fr
Introduced by Merill M. FLOOD and Melvin DRESHER in the RAND Corporation in 1952, see [3], who tried to introduce some irrationality in the game theory of John VON NEUMANN and Oskar MORGENSTERN [8], the Classical Iterated Prisoner's Dilemma (CIPD) is based on this simple story quoted by Albert TUCKER for instance in [5, pages 117-118]:
Two men, charged with a joint violation of law, are held separately by the police. Each is told that
(1) if one confesses and the other does not, the former will be given a reward...and the latter will be fined
(2) if both confess, each will be fined...
At the same time, each has a good reason to believe that
(3) if neither confesses, both will go clear.
It seems clearly obvious that the most reasonable choice is to betray its partner. More formally the CIPD is represented, using game theory, as a two-person non-zero-sum non-cooperative and simultaneous game where each player has to choose between two moves:
The payoff of each player depends on the moves played by the two people. Table 1 provides the score in each case.
Cooperate | Defect | |
Cooperate |
R=3, R=3
Reward for mutual cooperation |
S=0, T=5
Sucker's payoff Temptation to defect |
Defect |
T=5, S=0
Temptation to defect Sucker's payoff |
P=1, P=1
Punishment for mutual defection |
To have a dilemma, the following inequation has to be respected:
S<P<R<T | (1) |
The dilemma stands on the fact that individual interests differ from collective ones.
As the one shot game is solved by the NASH equilibrium, which is to
always betray its partner, the model is extended: in the iterated version
players meet each other more than one time, without knowing if it is the last
time or not.
The payoff of a player is then simply the sum of each of its meeting's payoff.
To favour the cooperation, and also to keep this difference between individual
and collective interest the following inequation has to be respected:
S+T<2R | (2) |
The classical choice of values for the four parameters is given in Table 1.
Of course, with such an iterated game, what the opponent did on the past moves may influence the way a player will choose their next one. It is then possible to define more strategies than in the one shot version.
Let us define some simples ones:
Now the main problem is to evaluate strategies for the CIPD, in order to compare them. Two kinds of experimentation could be used for this purpose. The basic one, is to make a pairwise round-robin tournament between some different strategies. The payoff to each one would be the total sum of each iterated game. A ranking could then be computed according to the score of each strategy. The higher a strategy is ranked, the better it is. Good strategies in round-robin tournament are well adapted to their environments, but often are not very robust to environment modifications.
The second kind of experimentation is a kind of imitation of the natural selection process, and is closely related to population dynamics. Let us consider a population of N players, each one adopting a particular strategy. At the beginning we consider that each strategy is equally represented in the population. Then a tournament is made, and good strategies are favoured, whereas bad ones are disadvantaged, by a proportional population redistribution. This redistribution process, also called a generation, is repeated until an eventual population stabilisation, i.e. no changes between two generations. A good strategy is then a strategy which stays alive in the population for the longest possible time, and in the biggest possible proportion.
An example of evolution between all strategies described in the previous section is shown in Figure 1. The x-axis represents the generation number, whereas the y-axis represents the size of the population for each strategy. For simplicity we make our computation with a fixed global population size.
Classical results in this field, which were presented by AXELROD in [1], show that to be good a strategy must:
The well-known tit_for_tat strategy, which satisfies all those criteria, has, since [1], been considered by a lot of people using the dilemma -but not by game theorist- to be one of the best strategies not only for cooperation but also for evolution of cooperation. We think that the simplicity criterium is not so good, and have thus introduced a strategy called gradual which illustrates our point of view.
Gradual cooperates on the first move, then after the first opponent's defection defects one time, and cooperates two times, after the second opponent's defection defects two times and cooperates two times, ..., after the n^{th} opponent's defection defects n times, let us call it the punishment period, and cooperates two times, let us name it the lull time. Gradual had better results than tit_for_tat in almost all our experiments, see [2].
It is easy to imagine strategies derived from gradual, for instance in modifying the function of punishment, which is the identity in the gradual case (n defections punishment of length n).
Our main research direction to establish our ideas about a strategy's complexity is to
To make our research process easier, we have to find a descriptive method to define strategies, which is less risky than an exhaustive method, which are never objective, nor complete. One of the ways we have chosen is to use a genetic approach. We describe a structure (a genotype), which can be decoded in a particular behaviour (a phenotype). Then a way to have a strategy, is simply to fill this structure. A manner to have a lot of strategies is to consider all the ways of filling this genotype, i.e. to consider all possible individuals based on a given genotype. Let us call the set of all strategies described by a particular genotype, the complete class of strategies issued by this genotype.
We have described three genotypes based on the same simple idea, in order to remain objective. This idea is to consider the observable length of the game's history. Such idea of strategies has already been studied in [4,7]. Those three genotypes are:
Despite the apparently simplicity of the strategies described by those genotypes, many classical ones are included in those classes, tit_for_tat for instance.
Let us describe the behavior of one of those strategies, in order to understand how the genotype works. We consider a strategy of the complete class of memory with and .
One of the individuals of this class plays on the first move
then:
- if on the previous move I played | and the opponent played | then I play | |||
- if on the previous move I played | and the opponent played | then I play | |||
- if on the previous move I played | and the opponent played | then I play | |||
- if on the previous move I played | and the opponent played | then I play |
The genotype of the strategy is then:
This strategy is one way of coding spiteful.
We have conducted some experiments using those complete classes. The main purpose was to evaluate other strategies in big ecological evolution, but also to try to find some new good strategies. In all our experiments we have computed one ecological evolution between all strategies of a class, and then another with gradual added to the population. Thus we have been able to partially confirm our ideas on the strength of this strategy, as shown in Table 2. In all the results of Table 2, gradual is better evaluated than tit_for_tat. We, however, do not show results of the evaluation of tit_for_tat here since it is included in some of the classes explored, thus its evaluation is only partial.
w
20.7cm | 20.7cm | 21.3cmclass size | evaluation | |||
gradual | tit_for_tat | spiteful | ||||
43cmmemory | 0 | 1 | 8 | 1 | 1 | 1 |
0 | 2 | 64 | 5 | 2 | 21 | |
1 | 1 | 32 | 2 | 3 | 1 | |
1 | 2 | 1024 | 6 | 13 | 37 | |
23cmbinary_memory | 0 | 1 | 32 | 1 | 2 | 1 |
1 | 1 | 512 | 1 | 7 | 13 | |
13cmmemory_automata | 0 | 1 | 512 | 1 | 31 | 32 |
We performed many experiments on those classes, since they are the most simple and objective.
Evolution of two classes are presented on Figures 2 and 3.
Noticed that when or , the population is no longer uniform since, there are more naughty strategies than nice ones, due to the starting moves.
We then compute some subclasses, including the same number of naughty than nice strategies. In order to create such classes we limit the starting moves to those containing only moves or only moves.
In almost all cases when we add gradual in those kinds of population it is well evaluated too.
When gradual is not the winner of the evolution, we almost always find a variation of it, like gradual_n4 (ndefections punishment of length n^{4})) which does win. It is not exactly the case when those gradual's variation are added in complete classes as shown in Figure 3.
One example of evolution of a binary_memory class is given in Figure 4.
Due to the explosion of memory_automata class size relatively to parameters and , we just present the result of one of those classes, the case with and which contains 512 strategies.
As we can see on Figure 5 the best strategy of the memory_automata class with and , is a strategy we called str01e_c_0111_cccd, which has the following genotype
0 | 1 | 1 | 1 |
The behavior of the strategy is then:
It begins (in state 0) to cooperate (playing ) then:
This automata is represented by the scheme of Figure 6.
This strategy, which acts in a manner very close to the one of tit_for_tat is beaten by gradual, when the latter is added in the population.
The genetic method we use to define strategies for the CIPD offers two big advantages:
Evaluation of strategies based on complete classes cannot be considered as subjective as long as the genotype used is sufficently general, and based on basic ideas. Evaluation can, however, not be based only on the results of complete classes evolution, since a strategy could have a behavior well adapted to this kind of environment, and not well adapted to a completely different environment, with mixed strategy, in the game theory sense, for instance.
The results we obtained mainly confirm our ideas about a strategy's complexity, which is, there may exist an infinite gradient of complexity in the definition of strategy, each level defining a new criterium of quality.
We plan to include this kind of evaluation, in a more general, objective and complete evaluation method. We also plan to explore larger classes of strategy, which implies to find optimization method of computation, enabling us to describe and to use very large sets of strategies.
A simulation software with many strategies is already available for Unix, DOS or Windows on the World Wide Web at http://www.lifl.fr/~mathieu/ipd or by anonymous ftp on the site ftp.lifl.fr in pub/users/mathieu/soft.