GEP: Q&A from Correspondence with GeneXproTools Users


	Questions & Answers from Correspondence with GeneXproTools Users
Home News Author Q&A Tutorials Downloads GEP Biblio Contacts Visit Gepsoft

Q&A from Correspondence with GeneXproTools Users

Some of GeneXproTools users are also interested in the workings of GEP and are even students of the field – they read my papers and my book and acquired a hands-on knowledge of the algorithm through experimentation. They sometimes ask questions of our support staff that are very GEP-specific and in those cases I like to step in and answer them myself. I think that these questions are also general enough and worth sharing with you all. As usual, for privacy sake, I removed any references that could identify the people involved.

I have some doubts about how the selection process and genetic modification work. Is the entire set of new individuals selected at once and then mutation, inversion, etc. applied by selecting from this new population? And how are the rates applied? For example, are they picked up randomly n times (where n is the number of individuals in this selected population)... then apply mutation, inversion etc and then put back into the population? And for the recombination operators, are they chosen 2 at a time and put the 2 new children back into the population, removing the parents?

The selection and reproduction processes are done as shown in the GEP flowchart, that is, first, individuals are selected by roulette-wheel to be reproduced (so, in your words, they are all selected at once). And when they reproduce, depending on the rates and number of the genetic operators being used, their genomes might get modified or not.

There are, however, basically two different kinds of modification rates: (1) rates like mutation rates which relate to points (or individual targets) in the chromosomes; and (2) rates like inversion or transposition rates which relate to individual chromosomes.

This means that when you are evaluating the mutation rate (the first kind of modification rate) the probability of mutation is evaluated relatively to the number of points in the chromosome. But since chromosomes in artificial evolutionary systems are usually very small (comparatively to chromosomes in nature), it is advisable to use all the chromosomes in the population in order to determine the number of points to mutate. For instance, for 30 chromosomes of size 50, there are a total of 1500 potential targets, so, in this case, a mutation rate of say 0.05 means that a total of 75 points are randomly mutated each generation.

When, for example, you are evaluating a transposition rate or an inversion rate (the second type of modification rate), the probability of transposition/inversion is evaluated relatively to the number of chromosomes in the population. For instance, for a population with 30 individuals, a transposition rate of 0.3 means that, in this case, a total of 9 different chromosomes are randomly chosen to undergo transposition.

Recombination is similar to the latter, with the difference that in this case you’ll need an even number of chromosomes to recombine (if, for instance, the total number of chromosomes to recombine gives 9, only 8 are in truth recombined). And for all kinds of genetic modification, when a chromosome is modified, say chromosome 2, the new version replaces the old one (recombination is no different and, for instance, if chromosomes 5 and 8 were selected to undergo recombination, then the two new versions would replace the old chromosomes 5 and 8).

So, as you can see, different modifications might accumulate on a chromosome during its reproduction or, in other words, a chromosome might be modified more than once by different genetic operators during its reproduction.

I have some questions about elitism. Theoretically, elitism means that from the current population the n elite members of the population are moved to the next generation (n is likely to be 1 but I’m not sure... not even really sure if elitism is used in GeneXproTools). If this is done are these elite individuals not subjected to further genetic manipulation (mutation, recombination, etc.)?

GeneXproTools uses simple elitism (or the cloning of the single best individual of the population). This means that, each generation, the genome of the best individual is copied unchanged into the next generation. However, like all individuals of the population, this individual is also subjected to selection and, being the very best, it has the highest probability of being selected to reproduce with modification. So, this individual not only gets special treatment (cloning) but also is selected according to its fitness like everybody else. And its lucky descendants (besides the clone, of course) are subjected to the genetic operators of mutation, inversion, transposition, and recombination in order to create genetic diversity.

Is there a way to turn on a record of all genomes that are produced in each generation? This would be most useful for teaching and for understanding details of the evolutionary process. Students could use this to see what's happening in their own small problems.

The amount of information required to keep track of all the programs created during a run is quite voluminous and therefore is not saved in any way in GeneXproTools. However, since more and more people are using GeneXproTools for teaching purposes we’ll try and include this feature in version 5.0. Meanwhile, there are at the GEP website some old executables that show the complete history of a run for two simple problems (symbolic regression and sequence induction) that might be interesting for your purposes.

In problems with multiple variables, one often finds models with high fitness, but these models often contain only a small portion of independent variables instead of the entire set. Is this reasonable? Can this be avoided?

Yes, this happens frequently and although it can be avoided, in real-world problems it is not good practice to avoid this as most datasets contain noisy information (including noisy variables) and one usually uses the algorithm to find exactly what is relevant and what is not. However, for some problems (for instance, simple problems with just a few relevant variables or logic synthesis problems devoid of noisy data) one can force the algorithm to include all the variables in the evolved models by attributing zero fitness to all the models that do not comply with this criterion (I must say, however, that this option is not accessible in GeneXproTools for the entire population, although you might apply pressure by ensuring that the very best includes all the variables through a suitable custom fitness). So, as you can see, it all depends on your goals.

For my data it seems to me that absolute error fitness function outperforms R-square fitness function. Is some fitness function better than others in symbolic regression? Is there any theory on this?

This will depend on what you want and on the nature of your problem. As you know, different fitness functions travel the fitness landscape differently and it is always interesting to see the kind of models the algorithm creates using different fitness functions. In my book I use a rather limited selection of fitness functions, but several others can be easily implemented. GeneXproTools implements many more and you can see their definitions in the Online Knowledge Base. Although their merits are not discussed there, you can easily compare their behavior by experimenting a little with GeneXproTools. Also worth experimenting with are the different fitness functions for Classification and Logic Synthesis; check also their descriptions in the Knowledge Base.

This is the main difficulty I met with my data which, I admit, are a bit challenging: briefly, the training records I want to classify consist of 50 groups of 10 records, each group containing 3 "1" records, so the training set has 150 "1" and 350 "0" records. If I wait until I have the maximum of correctly classified training records to apply the model against the test set, practically 99 times out of 100 I get a fitness of 0 (because the threshold of all the "0" records correctly classified has not been reached). So, I must stop the evolutionary process, try the model until some interesting fitness value has been reached and then evolve again. I have to say that I got the best results in the test set after dozens and dozens, if not hundreds, of trials and because I was lucky enough to stop at the right time and get an unusually high and rare fitness value! Is there a more rigorous way?

Well, since version 3.0 GeneXproTools has a new feature that makes this process less painful and more interesting. This new feature records the evolutionary history of a run, and then allows evaluating the performance of all the best-of-generation models in the test set after which you can choose the model with better performance. And obviously you can then do with this model all the usual things: generate code, use it as seed, make predictions, etc.

Questions & Answers

***

Last update: 23/July/2013