Q&A from Correspondence with GeneXproTools Users
Some of
GeneXproTools users are also interested in the workings of GEP and
are even students of the field – they read my
papers and my
book and acquired a hands-on knowledge of the algorithm through
experimentation. They sometimes ask questions of our support staff that
are very GEP-specific and in those cases I like to step in and answer
them myself. I think that these questions are also general enough and
worth sharing with you all. As usual, for privacy sake, I removed any
references that could identify the people involved.
I have some doubts about how the selection process and
genetic modification work. Is the entire set of new individuals selected
at once and then mutation, inversion, etc. applied by selecting from
this new population? And how are the rates applied? For example, are
they picked up randomly n times (where n is the number of individuals in
this selected population)... then apply mutation, inversion etc and then
put back into the population? And for the recombination operators, are
they chosen 2 at a time and put the 2 new children back into the
population, removing the parents?
The selection and reproduction processes are done as shown in the
GEP
flowchart, that is, first, individuals are selected by
roulette-wheel to be reproduced (so, in your words, they are all
selected at once). And when they reproduce, depending on the rates and
number of the genetic operators being used, their genomes might get
modified or not.
There are, however, basically two different kinds of modification rates:
(1) rates like mutation rates which relate to points (or individual
targets) in the chromosomes; and (2) rates like inversion or
transposition rates which relate to individual chromosomes.
This means that when you are evaluating the mutation rate (the first
kind of modification rate) the probability of mutation is evaluated
relatively to the number of points in the chromosome. But since
chromosomes in artificial evolutionary systems are usually very small
(comparatively to chromosomes in nature), it is advisable to use all the
chromosomes in the population in order to determine the number of points
to mutate. For instance, for 30 chromosomes of size 50, there are a
total of 1500 potential targets, so, in this case, a mutation rate of
say 0.05 means that a total of 75 points are randomly mutated each
generation.
When, for example, you are evaluating a transposition rate or an
inversion rate (the second type of modification rate), the probability
of transposition/inversion is evaluated relatively to the number of
chromosomes in the population. For instance, for a population with 30
individuals, a transposition rate of 0.3 means that, in this case, a
total of 9 different chromosomes are randomly chosen to undergo
transposition.
Recombination is similar to the latter, with the difference that in this
case you’ll need an even number of chromosomes to recombine (if, for
instance, the total number of chromosomes to recombine gives 9, only 8
are in truth recombined). And for all kinds of genetic modification, when a chromosome is
modified, say chromosome 2, the new version replaces the old one
(recombination is no different and, for instance, if chromosomes 5 and 8
were selected to undergo recombination, then the two new versions would
replace the old chromosomes 5 and 8).
So, as you can see, different modifications might accumulate on a
chromosome during its reproduction or, in other words, a chromosome
might be modified more than once by different genetic operators during
its reproduction.
I have some questions about elitism. Theoretically,
elitism means that from the current population the n elite members of
the population are moved to the next generation (n is likely to be 1 but
I’m not sure... not even really sure if elitism is used in
GeneXproTools). If this is done
are these elite individuals not subjected to further genetic
manipulation (mutation, recombination, etc.)?
GeneXproTools uses simple elitism
(or the cloning of the single best individual of the population). This
means that, each generation, the genome of the best individual is copied
unchanged into the next generation. However, like all individuals of the
population, this individual is also subjected to selection and, being
the very best, it has the highest probability of being selected to
reproduce with modification. So, this individual not only gets special
treatment (cloning) but also is selected according to its fitness like
everybody else. And its lucky descendants (besides the clone, of course)
are subjected to the genetic operators of mutation, inversion,
transposition, and recombination in order to create genetic diversity.
Is there a way to turn on a record of all genomes that
are produced in each generation? This would be most useful for teaching
and for understanding details of the evolutionary process. Students
could use this to see what's happening in their own small problems.
The amount of information required to keep track of all the programs
created during a run is quite voluminous and therefore is not saved in
any way in GeneXproTools. However,
since more and more people are using
GeneXproTools for teaching purposes we’ll try and include this
feature in version 5.0. Meanwhile, there are at the GEP website some old
executables that show the complete history of a run for two simple
problems (symbolic
regression and
sequence induction) that might be interesting for your purposes.
In problems with multiple variables, one often finds
models with high fitness, but these models often contain only a small
portion of independent variables instead of the entire set. Is this
reasonable? Can this be avoided?
Yes, this happens frequently and although it can be avoided, in
real-world problems it is not good practice to avoid this as most
datasets contain noisy information (including noisy variables) and one
usually uses the algorithm to find exactly what is relevant and what is
not. However, for some problems (for instance, simple problems with just
a few relevant variables or logic synthesis problems devoid of noisy
data) one can force the algorithm to include all the variables in the
evolved models by attributing zero fitness to all the models that do not
comply with this criterion (I must say, however, that this option is not
accessible in GeneXproTools for
the entire population, although you might apply pressure by ensuring
that the very best includes all the variables through a suitable custom
fitness). So,
as you can see, it all depends on your goals.
For my data it seems to me that absolute error fitness
function outperforms R-square fitness function. Is some fitness function
better than others in symbolic regression? Is there any theory on this?
This will depend on what you want and on the nature of your problem. As
you know, different fitness functions travel the fitness landscape
differently and it is always interesting to see the kind of models the
algorithm creates using different fitness functions. In my
book I use a rather limited selection of fitness functions, but
several others can be easily implemented.
GeneXproTools implements many more
and you can see their definitions in the
Online Knowledge Base. Although their merits are not discussed
there, you can easily compare their behavior by experimenting a little
with GeneXproTools. Also worth
experimenting with are the different fitness functions for
Classification and Logic Synthesis; check also their descriptions in the
Knowledge Base.
This is the main difficulty I met with my data which, I
admit, are a bit challenging: briefly, the training records I want to
classify consist of 50 groups of 10 records, each group containing 3 "1"
records, so the training set has 150 "1" and 350 "0" records. If I wait
until I have the maximum of correctly classified training records to
apply the model against the test set, practically 99 times out of 100 I
get a fitness of 0 (because the threshold of all the "0" records
correctly classified has not been reached). So, I must stop the
evolutionary process, try the model until some interesting fitness value
has been reached and then evolve again. I have to say that I got the
best results in the test set after dozens and dozens, if not hundreds,
of trials and because I was lucky enough to stop at the right time and
get an unusually high and rare fitness value! Is there a more rigorous
way?
Well, since version 3.0 GeneXproTools
has a new feature that makes this process less painful and more
interesting. This new feature records the evolutionary history of a run,
and then allows evaluating the performance of all the best-of-generation
models in the test set after which you can choose the model with better
performance. And obviously you can then do with this model all the usual
things: generate code, use it as seed, make predictions, etc.
Questions &
Answers
***
|