Q&A from a 2001 Interview
This email interview was done in March 26, 2001, but since it didn't see the
light of day I took the liberty of updating some facts, you know, things
like what-are-the-next-steps or what-are-you-hoping-to-achieve, now that
they are done deals. And now is exactly March 8, 2007, six years after I
was asked to answer these questions, which, by the way, were not
modified:
What gave you the
idea to create Gene Expression Programming?
I developed the basic ideas of Gene Expression Programming (GEP)
in September and October of 1999 almost unaware of their uniqueness. I
was reading Melanie Mitchell’s book
An Introduction to Genetic Algorithms
and meticulously solving all the computer exercises provided at the end
of each chapter. When the Genetic Programming (GP) technique presented
to discover Kepler’s third law was introduced, I implemented an
algorithm in C++ to solve it. And, surprisingly, this new algorithm
found the solution in the first run I made using an initial population
of only 10 individuals and iterated for 10 generations. I was so
impressed and so happy with these results that I proceeded immediately
to apply it to more challenging problems. I bought John Koza’s first
book then (Genetic Programming: On the Programming of Computers by Means of Natural Selection) and applied my algorithm to several
problems and the comparisons consistently showed that it tremendously
surpassed GP. By then the details of the GP technique were unknown to me
and I couldn’t understand the reasons for this difference in
performance. Only later did I realize that, as a biologist, I had
created a new, more powerful algorithm borrowing several new ideas from
biology, especially the creation of a perfect genotype-phenotype mapping.
How exactly does the
algorithm work? Can you describe it so I can picture it?
Gene Expression Programming starts by randomly creating
the chromosomes of the individuals of the initial population. Something
like this:
|
Generation 0 or Initial
Population:
--+/*aaaaaa//+/*aaaaaa+a+--aaaaaa-[0] = 0.0305998
*+-aaaaaaaa-*-//aaaaaa/*-*-aaaaaa-[1] = 0
/+/--aaaaaa**-a-aaaaaa*a-/aaaaaaa-[2] = 0.0306187
--aa*aaaaaa/*-a-aaaaaa-*-aaaaaaaa-[3] = 0
-*+/-aaaaaa+/*-+aaaaaa*/a--aaaaaa-[4] = 0
/-*/*aaaaaa/a+-*aaaaaa+aaa/aaaaaa-[5] = 0.0307552
-/*+-aaaaaa*aa++aaaaaa//-*aaaaaaa-[6] = 0
+**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[7] = 0.5177571
/*a-aaaaaaa*+a/*aaaaaa+*+a+aaaaaa-[8] = 0.0330132
/*-++aaaaaa-//a-aaaaaa**a+aaaaaaa-[9] = 0 |
These strings code for nonlinear entities or expression trees (ETs). The
ETs are simple diagram representations and their structure is easily
inferred from the respective chromosome. So, after the creation of the
initial population, each chromosome is expressed as an ET (the phenotype
or the ‘organism’). The expression trees are in fact simple computer
programs and it is possible to test them against a selection environment
(the set of fitness cases). Using an appreciable number of individuals
and an appreciable number of fitness cases, the probability of finding
one or a few programs that do something useful is relatively high (in
the population above, the values after each chromosome are a measure of
their fitness). These are selected to reproduce, and the higher the
fitness the higher the probability of leaving more offspring. Therefore
the genetic information of the selected individuals is passed on to the
next generation. But here, as in nature, the genetic information suffers
a little mutation and recombination before it is transmitted and
therefore the descendants of these individuals, most of the time, are
not identical copies of their parents. For instance, all the individuals
of the population below are descendants of chromosomes 0, 2, 5, 7, and 8
above (note that chromosomes 1 and 4 are better than their parents):
|
Generation 1:
+**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[0] = 0.5177571
***/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[1] = 7.3964260
+**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[2] = 0.5177571
+/*/aaaaaaa*+a/*aaaaaa*/aa*aaaaaa-[3] = 0.0328448
+**/*aaaaaa+**/aaaaaaa+a***aaaaaa-[4] = 0.6016823
/*a-aaaaaaa**+a*aaaaaa+*+a+aaaaaa-[5] = 0.0349026
+**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[6] = 0.5177571
*/-a+aaaaaa+a***aaaaaa*/aa*aaaaaa-[7] = 0.4814101
+**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[8] = 0.5177571
+**/aaaaaaa+a***aaaaaa*/aa*aaaaaa-[9] = 0.5177571 |
So, these new chromosomes are subjected to the same developmental
process: expression of the genetic information, confrontation of the
selection environment, followed by selection and reproduction with
modification. This process is repeated for a certain number of
generations or until a solution is found. For this simple example, after
10 generations a perfect solution with maximum fitness (1000 in this
case) was found
(chromosome 4):
|
Generation 10:
*/a/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[0] = 888.98006
*/a/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[1] = 888.98006
*/aa*aaaaaa*+*-aaaaaaa+a**+aaaaaa-[2] = 0.0374567
*/a/*aaaaaa+*/a-aaaaaa+a***aaaaaa-[3] = 0.4814390
*/-/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[4] = 1000
*/a/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[5] = 888.98006
*/a/*aaaaaa*aaa+aaaaaa+a***aaaaaa-[6] = 0.5146930
*/a/*aaaaaa*+*/aaaaaaa+a***aaaaaa-[7] = 888.98006
*/a/+aaaaaa*+*/aaaaaaa+a***aaaaaa-[8] = 666.66575
*/a/*aaaaaa*+-/aaaaaaa+a***aaaaaa-[9] = 0.4812073 |
You describe the algorithm as having
linear chromosomes that are afterwards expressed as nonlinear entities.
What is the advantage to this, and how does it make the algorithm
different from Genetic Algorithms and Genetic Programming?
The advantage is that individuals are encoded in simple structures
(chromosomes) and all the genetic modifications (mutation,
recombination, transposition and so on) that occur during reproduction
are made on these structures and, therefore, one only needs to pass on
these structures to the next generation. On the other hand, the
expression trees that develop from the chromosomes have an independent
existence and may assume various forms, and therefore assume various
functions. In nature, organisms are also encoded in linear chromosomes,
and when they reproduce only the genetic material is transmitted. Then
this material is expressed as a particular body that assumes various
forms and functions. The body is what selection sees, and according to
its fitness the individual might be chosen to reproduce or not. And
during reproduction, only the genetic material is transmitted to the
next generation. If not for the blueprint of the organism, the
reproduction of complex beings would involve the copy of all the body
constituents. Well, it’s not hard to see that this is unfeasible and, in
fact, it is known that for life to move beyond a very rudimentary stage
the phenotype threshold should be crossed. Imagine our DNA going around
naked: it could do a couple of things, but not certainly walk, fly or
think. The interplay of the genotype and the phenotype allows the
evolution of different functionalities and greater complexity.
This is the fundamental difference between GEP and GAs or GP, for they
use only one kind of entity (linear chromosomes in GAs and ramified
structures or parse trees in GP) that works both as genotype and
phenotype, that is, there is no distinction between genotype and
phenotype. The entity is the body that goes around doing things and when
it reproduces everything must be copied. In addition, and especially for
GP, the modifications that occur during reproduction are greatly
constrained because there are lots of impossibilities at this level (it
is impossible to make an orange tree produce mangos only by grafting and
pruning) and therefore evolution is greatly limited. This kind of
limitation is common in GP, for it exhibits already a certain structural
complexity. This is one of the reasons GP uses almost exclusively a
special kind of recombination that permits the exchange of structurally
concise branches or building blocks. And evolution is, in this case, the
result of moving these blocks around, never really creating new
material. In GAs there are no such problems as their structure is very
simple; but then they are also less versatile in terms of
functionalities.
So, in GEP, every conceivable modification is allowed in the genome (I
use currently eight different genetic operators, but others may be
easily implemented) always producing syntactically correct expression
trees (in nature there is no such thing as a syntactically incorrect
protein). And therefore no time or resources are lost in editing or
ensuring the validity of the programs as happens in all GP
implementations. In GEP the chromosomes are so simple that they are
modified and reproduced as easily as the linear chromosomes of GAs. And
the expression trees they code for are more complex than the parse trees
of GP because GEP chromosomes code for multiple genes, each gene encoding a
sub-expression tree. And besides, no forbidden paths exist in GEP when
it comes to exploring the solution space.
How exactly does the
algorithm create computer programs? What's the input, what's the output
and what happens in between? Can you take me step-by-step through a
simple example?
The input is a set of fitness cases (the selection environment), for
instance, the production data of several firms (each row is a firm):
|
Labor |
Material |
Capital |
Production |
|
0.542228376455 |
0.337519500366 |
0.405527109223 |
0.400102626024 |
|
0.691221863614 |
0.488785181029 |
0.513851630110 |
0.548400414897 |
|
0.716405370429 |
0.537658971190 |
0.600028325136 |
0.587799795213 |
|
..... |
|
|
|
And you believe that the Production can be explained in
terms of Labor, Material, and Capital. So, you randomly create an
initial population of GEP chromosomes, express them and see which ones
return a value for Production within the limits you previously
established (say, a 10% error) for one or more fitness cases. These
individuals will have a fitness greater than zero and therefore they
might be selected to reproduce with modification. And this is all that
is necessary for getting the evolutionary process started. If you are
lucky, after a certain number of generations, you will end up with a
chromosome (an encoded computer program) that is a function of Labor,
Material, and Capital that satisfactorily explains Production. So, the
best chromosome (the chromosome with higher fitness) evolved in such an
experiment is the output, or in other words, the output is a computer
program written in Karva notation (the language of GEP genes) that can
be automatically converted into a more conventional language, such as
C++ or Visual Basic. Indeed, the latest release of
GeneXproTools (a modeling tool
from Gepsoft that uses Gene Expression Programming) creates computer
programs in a total of 16 different programming languages, which, at the
end of the day, are what you call the output.
Can you give me quick
definitions for the ways the chromosomes are modified and how this works
in algorithm vs. biology – mutation, inversion, transposition, root
transposition, gene transposition, gene recombination and one-point and
two-point recombination?
Mutation – as in biology, mutations are point mutations (substitutions
of a symbol by another at a particular position) and occur throughout
the chromosomes. With the difference that GEP uses different alphabets
which are used in the different regions of genes (the heads, the tails,
and special domains for handling random numerical constants).
Inversion – as the name suggests, the inversion operator inverts
sequences in the genes. In nature organisms also explore inversion and
it is known that during evolution some sequences or even entire genes
were inverted.
Transposition (the three kinds) – the mechanism of GEP transposition is
considerably different from biological transposition. Nonetheless, they
have a couple of things in common: some sequences in the chromosomes
are activated and jump to another place in the chromosome and, as a
result, they might get copied in the process (except for gene
transposition where the transposon – an entire gene – is deleted at the
place of origin). This kind of operators is very interesting because
they modify the chromosomes using elements that were already present in
the chromosome.
Recombination (the three kinds) – the mechanism is very similar to
biological recombination, for in both cases parent chromosomes are
paired, cut at one or two points (exactly the same position in both
parents), and exchange some genetic material between them. But in GEP
the paired chromosomes are not homologous and therefore the daughter
chromosomes are most of the time completely different from the parents.
Of the three types of GEP recombination, the more conservative is gene
recombination, as in this case entire genes are exchanged between the
parent chromosomes. In one-point and two-point recombination though, the
parent chromosomes are split at random positions and the daughter
chromosomes are usually very different from their parents.
Your paper (Ferreira,
C., 2001. Gene Expression Programming: A New Adaptive Algorithm for
Solving Problems. Complex Systems, Vol. 13, issue 2: 87-129)
seems to imply that the algorithm has more in common with biological
selection than Genetic Algorithms and Genetic Programming. Can you
explain this in detail?
Do you really mean biological selection here or do you mean biology?
Well, I use the same old roulette-wheel selection method used earlier
both in GAs and GP. So, nothing new or different here, and it mimics
nature quite accurately. But if you meant biology, I think that I have
already explained that (a complex genome composed of multiple genes and
the two separate entities – genotype and phenotype – that allow an
efficient adaptation and evolution of complex bodies encoded as simple
strings).
Can you give me quick
definitions for the computer problems you mentioned in your
paper – symbolic regression, sequence induction, block stacking and
density classification?
Symbolic regression is also called function finding and consists of
finding a mathematical expression that describes a certain property
(like the Production above). Sequence
induction is a special case of symbolic regression where the domain of
the independent variable are non-negative integers. Block stacking is a
toy planning problem much used in computer science. Cellular automata
rules for the density classification task is also a toy problem, but one
very complex and difficult for which only a few very powerful algorithms
could discover good rules. And there were also the 11-multiplexer and
the GP-rule problems which are complex problems of logic synthesis or
Boolean concept learning.
How does your work
fit into the body of work on evolutionary algorithms, and what is
different about it?
Like a GA or GP system, Gene Expression Programming is also a genetic
algorithm because it uses populations of individuals, selects them
according to fitness, and introduces genetic variation using one or more
genetic operators. However, it is closer to GP because the phenotype of
GEP individuals is structurally similar to the nonlinear entities of GP
(called parse trees in GP and expression trees in GEP). However, due to
the genotype/phenotype interplay, GEP is much more efficient: problems
that took weeks or months to solve using other learning algorithms can
now be solved in seconds or minutes by GEP. With GEP, thanks to the
multigenic genotype, it is also possible to evolve more complex
structures, for instance, multi-subunit expression trees composed of
several sub-expression trees. In their turn, these multigenic systems
are at the core of other, higher levels of organization such as the
unicellular and multicellular GEP systems for evolving automatically
defined functions. But there are many more GEP systems: systems for
parameter optimization that explore both the multigenic organization and
special gene domains for handling random numerical constants; systems
for inducing polynomials, neural networks, and decision trees, all of
which also explore the concerted action of different gene domains for
fine-tuning all kinds of numerical constants: polynomials' coefficients,
the weights and thresholds of neural networks, and the numerical
constants of decision trees with mixed/numeric attributes.
What are the next
steps in your research? What are you ultimately aiming for?
Continue to create as many new learning algorithms as I can and then
make them known both through publications (in many forms – papers,
books, and web publications) and software applications. Ultimately, I
hope this will contribute to make Gene Expression Programming very well
established in the field of Evolutionary Computation and Machine
Learning.
How could this
research eventually be applied practically? What types of uses might it
have or lead to?
We are at Gepsoft developing
commercial software for predictive modeling (software for
Classification/Logistic Regression, Function Finding, Time Series
Prediction, Logic Synthesis, and Parameter Optimization).
GeneXproTools is our flagship
application and is being used for mining huge databases with thousands
of variables, creating almost instantaneously computer models in a total
of 16 programming languages (Ada, C, C++, C#, Fortran, Java, Java
Script, Matlab, Pascal, Perl, PHP, Python, Visual Basic, VB.Net,
Verilog, and VHDL). Our mission is to continue to make this package a
very sophisticated and efficient AI lab through the integration of a
wide set of learning algorithms (all the algorithms of the GEP family,
including evolvable neural networks and decision trees, polynomials and
automatically defined functions, parameter optimization algorithms and
many others). Version 4.0 of
GeneXproTools was released in July of 2006 and implements already
two different learning algorithms (the GEP and GEP-RNC algorithms) and
supports Function Finding, Classification/Logistic Regression, Time
Series Prediction, and Logic Synthesis.
Questions &
Answers
***
|