GEP Book

  Home
  News
  Author
  Q&A
  Tutorials
  Downloads
  GEP Biblio
  Contacts

  Visit Gepsoft

 

C. FERREIRA Invited Tutorial Presented at WSC6, 2001

Gene Expression Programming in Problem Solving

Solving a Simple Problem with GEP
 

The aim of this section is to study a successful run in its entirety in order to understand how populations of GEP individuals evolve towards a perfect or good solution.

In symbolic regression or function finding the goal is to find an expression that satisfactorily explains the dependent variable. The input into the system is a set of fitness cases in the form (a(i,0), a(i,1), ..., a(i,n-1), yi) where a(i,0) - a(i,n-1) are the independent variables and yi is the dependent variable. The set of fitness cases consists of the adaptation environment where solutions adapt, discovering, in the process, solutions to problems.

In the example of this section, a simple test function was chosen, being therefore the fitness cases computer generated. Thus, in this case, we know exactly which function we are aiming at (remember, however, that in real-world problems the function is obviously unknown). So, suppose we are given a sampling of the numerical values from the curve:

y = 3a2 + 2a + 1

(2.11)

over 10 randomly chosen points in the real interval [-10, +10] and we wanted to find a function fitting those values within a certain error. In this case, we are given a sample of data in the form of 10 pairs (ai, yi), where ai is the value of the independent variable in the given interval and yi is the respective value of the dependent variable (Table 1). These 10 pairs are the fitness cases (the input) that will be used as the adaptation environment. The fitness of a particular program will depend on how well it performs in this environment.

Table 1
Set of 10 random fitness cases used in the simple problem of symbolic regression.


There are five major steps in preparing to use gene expression programming, and the first is to choose the fitness function. For this problem we could measure the fitness fi of an individual program i by the following expression:

(2.12)

where M is the range of selection, C(i,j) the value returned by the individual chromosome i for fitness case j (out of Ct fitness cases) and Tj is the target value for fitness case j. If |C(i,j) - Tj| (the precision) less than or equal to 0.01, then the precision is equal to zero, and fi = fmax = Ct*M. For this problem, we will use an M = 100 and, therefore, fmax = 1000. The advantage of this kind of fitness function is that the system can find the optimal solution for itself (Ferreira 2001).

The second major step consists in choosing the set of terminals T and the set of functions F to create the chromosomes. In this problem, the terminal set consists obviously of the independent variable, i.e., T = {a}. The choice of the appropriate function set is not so obvious, but a good guess can always be done in order to include all the necessary functions. In this case, to make things simple, we will use the four basic arithmetic operators. Thus, F = {+, -, *, /}.

The third major step is to choose the chromosomal architecture, i.e., the length of the head and the number of genes. In this problem we will use an h = 6 and three genes per chromosome.

The fourth major step in preparing to use gene expression programming is to choose the linking function. In this case we will link the sub-ETs by addition.

And finally, the fifth major step is to choose the set of genetic operators that cause variation and their rates. In this case we will use a combination of all genetic operators (mutation, the three kinds of transposition, and the three kinds of recombination) (see Table 2).

The parameters used per run are summarized in Table 2. I chose a small population of 20 individuals for this problem in order to simplify the analysis of the evolutionary process and not fill this text with pages of encoded individuals. However, one of the advantages of GEP is that it is capable of solving relatively complex problems using small population sizes and, thanks to the compact Karva notation, it is possible to fully analyze the evolutionary history of a run.

Table 2
Parameters for the simple symbolic regression problem.

Number of generations 50
Population size 30
Number of fitness cases 10 (Table 1)
Function set + - * /
Gene length 13
Number of genes 3
Linking function +
Chromosome length 39
Mutation rate 0.051
One-point recombination rate 0.3
Two-point recombination rate 0.3
Gene recombination rate 0.1
IS transposition rate 0.1
IS elements length 1,2,3
RIS transposition rate 0.1
RIS elements length 1,2,3
Gene transposition rate 0.1
Selection range 100
Precision 0.01


Figure 6 shows the progression of average fitness and the fitness of the best individual of a successful run. In this run, a perfect solution was found in generation 3.

Figure 6. Progression of average fitness of the population and the fitness of the best individual for a successful run of the experiment summarized in Table 2.


The initial population of this run, together with the fitness of each individual, is shown in Figure 7. Note that three of the 20 individuals are nonviable and thus have fitness zero. The best of generation individual, chromosome 19, has fitness 661.5933. Its expression and the corresponding mathematical equation are shown in Figure 8. Note that gene 2 returns zero and, therefore, might be considered a pseudogene. Note also how the algorithm created constants in all sub-ETs on its own.


Generation N: 0
012345678901201234567890120123456789012
+**/*/aaaaaaa/+a/a*aaaaaaa/a-*a+aaaaaaa-[ 0] = 577.3946
--aa++aaaaaaa+-/a*/aaaaaaa/--a-aaaaaaaa-[ 1] = 0
/***/+aaaaaaa*+/+-aaaaaaaa++aa/aaaaaaaa-[ 2] = 463.6533
-/+/++aaaaaaa+-//+/aaaaaaa+-/a/*aaaaaaa-[ 3] = 546.4241
++a/*aaaaaaaa+-+a*-aaaaaaa-a/-*aaaaaaaa-[ 4] = 460.8625
*+*a-*aaaaaaa*a/aa/aaaaaaa//+*a/aaaaaaa-[ 5] = 353.2168
*/**+aaaaaaaa+a/**+aaaaaaa----+/aaaaaaa-[ 6] = 492.6827
*aa-+-aaaaaaa+a/-+/aaaaaaa***/-*aaaaaaa-[ 7] = 560.9289
+/-*//aaaaaaa*+*//+aaaaaaa-/**+*aaaaaaa-[ 8] = 363.4358
--a+*/aaaaaaa+a++--aaaaaaa+a+aa+aaaaaaa-[ 9] = 386.7576
+-*-**aaaaaaa*/-+**aaaaaaa*+--++aaaaaaa-[10] = 380.6484
/a-**/aaaaaaa/-a/a/aaaaaaa+/a/-*aaaaaaa-[11] = 0
+--+//aaaaaaa+*+/*-aaaaaaa/*-a-+aaaaaaa-[12] = 551.2066
-a/+a/aaaaaaa*/--/aaaaaaaa*-+/a+aaaaaaa-[13] = 308.1296
/+/-+-aaaaaaa+-a/aaaaaaaaa**+-*-aaaaaaa-[14] = 0
//-*+/aaaaaaa//*a+aaaaaaaa/a++a*aaaaaaa-[15] = 489.5392
*a-a*-aaaaaaa+*+-a/aaaaaaa*/*aa*aaaaaaa-[16] = 399.2122
-a++*/aaaaaaa+/aa-*aaaaaaa---/**aaaaaaa-[17] = 317.6631
--a/*aaaaaaaa++*+-aaaaaaaa+-/*+-aaaaaaa-[18] = 597.8777
*+++-/aaaaaaa/--///aaaaaaa+-+aaaaaaaaaa-[19] = 661.5933

Figure 7. Initial population (generation 0) for the simple problem of symbolic regression. For each problem, such an initial, totally random population is generated. The value after each chromosome indicates the fitness for the set of fitness cases shown in Table 1.


Figure 8. Best individual of generation 0 (chromosome 19 of Figure 7). It has a fitness of 661.5933. a) The chromosome of the individual. b) The sub-ETs codified by each gene. c) The corresponding mathematical expression after linking with addition (the contribution of each sub-ET is shown in brackets).


The descendants of the individuals of the initial population are shown in Figure 9. Note that chromosome 0 is the clone of the best individual of the previous generation. In this generation, a new individual was created, chromosome 7, considerably better than the best individual of the initial population. This chromosome has a fitness of 961.8512 and its expression is shown in Figure 10.


Generation N: 1
012345678901201234567890120123456789012
*+++-/aaaaaaa/--///aaaaaaa+-+aaaaaaaaaa-[ 0] = 661.5933
-a++*/aaaaaaa+//a--aaaaaaa---/**aaaaaaa-[ 1] = 0
+-*-**aaaaaaa*/-+**aaaaaaa*+--++aaaaaaa-[ 2] = 380.6484
+-*-**aaaaaaa*/-+**aaaaaaa*/*a**aaaaaaa-[ 3] = 356.9471
+-+aaaaaaaaaa*+++-/aaaaaaa/--///aaaaaaa-[ 4] = 661.5933
*aa-+-aaaaaaa+a/++/aaaaaaa***+-*aaaaaaa-[ 5] = 567.9289
*a-a*-aaaaaaa+/*-a/aaaaaaa*+-*++aaaaaaa-[ 6] = 449.802
*aa-+-aaaaaaa+a/-+/aaaaaaa*+--++aaaaaaa-[ 7] = 961.8512
/***/+aaaaaaa*+/+-aaaaaaaa-a/-*aaaaaaaa-[ 8] = 470.5862
+--+//aaaaaaa+*+/*-aaaaaaa/*-a-+aaaaaaa-[ 9] = 551.2066
*+++-/aaaaaaa-//--/aaaaaaa+-+aaaaaaaaaa-[10] = 0
--+a*-aaaaaaa++a/*aaaaaaaa-a/-*aaaaaaaa-[11] = 487.3099
-a++*/aaaaaaa+/aa-*aaaaaaa---/**aaaaaaa-[12] = 317.6631
++a/*aaaaaaaa+-+a*-aaaaaaa++aa/aaaaaaaa-[13] = 451.464
+--+/-aaaaaaa+a/**+aaaaaaa----+/aaaaaaa-[14] = 493.5336
*/-a++aaaaaaa+/aa-*aaaaaaa---/**aaaaaaa-[15] = 356.4241
+/-*//aaaaaaa*+a//+aaaaaaa-/+*+*aaaaaaa-[16] = 493.9218
*/**+aaaaaaaa+*+/*aaaaaaaa***/-*aaaaaaa-[17] = 448.4805
+-*-**aaaaaaa*/-+**aaaaaaa*+--++aaaaaaa-[18] = 380.6484
++a/*aaaaaaaa+-+a*+aaaaaaa--/-*aaaaaaaa-[19] = 380.8585

Figure 9. The descendants of the individuals of the initial population of Figure 7. The value after each chromosome indicates the fitness for the set of fitness cases shown in Table 1. Note that chromosome 0 is the clone of the best individual of the previous generation. In fact, this position is always occupied by the clone of the best individual of the previous generation.


Figure 10. Best individual of generation 1 (chromosome 7 of Figure 9) with a fitness of 961.8512. a) Its chromosome. b) The sub-ETs codified by each gene. c) The corresponding mathematical expression after linking with addition (the contribution of each sub-ET is shown in brackets).


The descendants of the individuals of this generation are shown in Figure 11 (generation 2). Note that despite the global improvement in fitness (compare the average fitness of both populations in Figure 6), none of the descendants surpassed the best individual of the previous generation.


Generation N: 2
012345678901201234567890120123456789012
*aa-+-aaaaaaa+a/-+/aaaaaaa*+--++aaaaaaa-[ 0] = 961.8512
*/**+aaaaaaaa*/-+**aaaaaaa***/-*aaaaaaa-[ 1] = 446.2061
+-*-**aaaaaaa*+a//-aaaaaaa-/+*+*aaaaaaa-[ 2] = 323.1036
+--+//aaaaaaa+*+/*-aaaaaaa/*-*-+aaaaaaa-[ 3] = 551.2066
*aa-+-aaaaaaa+a/++/aaaaaaa***+-*aaaaaaa-[ 4] = 567.9289
++a/*aaaaaaaa*/-+-*aaaaaaa*+--++aaaaaaa-[ 5] = 0
+-*-**aaaaaaa+*+/*aaaaaaaa*/*a**aaaaaaa-[ 6] = 386.6484
++a/*aaaaaaaa+-+/*-aaaaaaa+aa++aaaaaaaa-[ 7] = 466.1533
+-*-a*aaaaaaa*/-+**aaaaaaa*a*a**aaaaaaa-[ 8] = 194.0452
/***/+aaaaaaa*+/+-aaaaaaaa-a--*aaaaaaaa-[ 9] = 541.4829
+-*-+*aaaaaaa+-+a*-aaaaaaa***/-*aaaaaaa-[10] = 346.2235
--*+*-aaaaaaa*aa-+-aaaaaaaaa/-+/aaaaaaa-[11] = 467.0862
*/-+**aaaaaaa+-*-*+aaaaaaa*/*a**aaaaaaa-[12] = 672.877
*aa+*/aaaaaaa+a/-+/aaaaaaa*+--++aaaaaaa-[13] = 961.8512
*+++/+aaaaaaa*++/+-aaaaaaa-a/-*aaaaaaaa-[14] = 395.858
/***-/aaaaaaa/--///aaaaaaa+-+a-aaaaaaaa-[15] = 467.0862
*aa-+-aaaaaaa+a/++/aaaaaaa***+-*aaaaaaa-[16] = 567.9289
+-+aaaaaaaaaa*+++-/aaaaaaa/--///aaaaaaa-[17] = 661.5933
+/-*//aaaaaaa*/a+**aaaaaaa*+--++aaaaaaa-[18] = 903.8886
*/**+aaaaaaaa+*+/*aaaaaaaa+/aa/aaaaaaaa-[19] = 423.885


Generation N: 3
012345678901201234567890120123456789012
*aa+*/aaaaaaa+a/-+/aaaaaaa*+--++aaaaaaa-[ 0] = 961.8512
*aa-+-aaaaaaa+a/-+/aaaaaaa/--///aaaaaaa-[ 1] = 560.9289
*aa-+-aaaaaaa-++/+-aaaaaaa-a/-*aaaaaaaa-[ 2] = 558.2066
*+++/+aaaaaaa*+a/-+aaaaaaa++--++aaaaaaa-[ 3] = 569.0469
/+++/+aaaaaaa*++/+-aaaaaaa-a/-*aaaaaaaa-[ 4] = 699.5153
+-+aa/aaaaaaa++++-/aaaaaaa***+-*aaaaaaa-[ 5] = 466.1533
*aa-+-aaaaaaaaa--**aaaaaaa*+--++aaaaaaa-[ 6] = 957.9443
--++*-aaaaaaa*a+/*-aaaaaaa+aa++aaaaaaaa-[ 7] = 337.7807
*aaa*/aaaaaaa+a+-+/aaaaaaa*+-/++aaaaaaa-[ 8] = 953.9443
/***/-aaaaaaa*+/+-aaaaaaaa-a--*aaaaaaaa-[ 9] = 0
*aa-+-aaaaaaa+a/-+/aaaaaaa*/--++aaaaaaa-[10] = 560.9289
*aa-+-aaaaaaa+a/++/aaaaaaa/--///aaaaaaa-[11] = 567.9289
+-+a-aaaaaaaa/***-/aaaaaaa*+--++aaaaaaa-[12] = 676.0663
+/**//aaaaaaa*/a+**aaaaaaa*+--++aaaaaaa-[13] = 1000
*/-+**aaaaaaa+-*-*+aaaaaaa*/*a**aaaaaaa-[14] = 672.877
/***/+aaaaaaa/+*+/+aaaaaaa-a*/--aaaaaaa-[15] = 498.3734
+/-*//aaaaaaa*/a+-*aaaaaaa*+--++aaaaaaa-[16] = 0
--*+--aaaaaaa*/a-+-aaaaaaa/a/-+/aaaaaaa-[17] = 506.1233
++a/*aaaaaaaa+-a-+-aaaaaaa-a*-+/aaaaaaa-[18] = 815.7772
*+a//-aaaaaaa+a/-+/aaaaaaa-/+*+*aaaaaaa-[19] = 412.5237

Figure 11. The chromosomes of two populations for the simple problem of symbolic regression. The value after each chromosome indicates the fitness for the set of fitness cases shown in Table 1. In generation 2, none of the individuals surpassed the best of the previous generation. In generation 3, a perfect solution with maximum fitness was found (chromosome 13).


And finally, in the next generation (generation 3 of Figure 11), an individual with maximum fitness was created. Note that this chromosome is a descendant, via mutation, of chromosome 18 of the previous generation: their chromosomes differ only in one position (the ‘-’ at position 2 of gene 1 was replaced by ‘*’). The expression of this chromosome shows that it codes for a perfect solution (Figure 12).


Figure 12. Perfect solution found in generation 3 (chromosome 13 of Figure 11). It has the maximum value 1000 of fitness. a) The chromosome of this individual. b) The sub-ETs codified by each gene. c) The corresponding mathematical expression after linking with addition (the contribution of each sub-ET is shown in brackets).

Home | Contents | Previous | Next