A Quick Introduction to Gene Expression Programming
Introduction
Gene Expression Programming (GEP) is a learning algorithm and what it learns specifically is about relationships between variables in sets
of data and then builds models to
explain these relationships.
How learning algorithms build models or discover solutions to
problems varies, with some simulating networks of neurons and others
simulating evolution through natural selection. Gene Expression
Programming belongs to the latter group, the so called evolutionary
algorithms. And like all evolutionary algorithms, natural or
otherwise, GEP uses populations of individuals (in this case,
populations of candidate solutions to the problem at hand), selects and reproduces them
according to fitness, and introduces genetic variation using one or
more genetic operators such as mutation or recombination. And these
are obviously the prerequisites for evolution to occur.
Gene Expression Programming can be used to evolve very different
things: from conventional mathematical models [1] to sophisticated
decision trees [2] and neural networks [3].
However, the strategies it uses to create all of them are similar
and very easy to understand. For example, if we want to create
mathematical models, then we start by creating an initial population
of models and then allow it to evolve. This means that each
generation, we measure the fitness of all the models of the
population and then reproduce them accordingly. This reproduction,
however, is not perfect and therefore slightly different models will
start to appear and, given enough time, some of them will be better
than the ones we started with.
So these are the general principles of all evolutionary algorithms
and, in fact, there are several artificial evolutionary algorithms
that explore them. Of special interest to the GEP technique are the Genetic
Algorithms (GAs) and Genetic Programming (GP), for they
serve to illustrate some of the fundamental characteristics of the
GEP technique and why GEP surpasses the old GP technique by a factor
of 100-60,000.
First of all, both GAs and GP are simple replicator systems, with
the latter considerably more complex than the former. This means
that the things or entities these systems evolve go about their
business doing what has to be done and when their time comes, they
must reproduce their bodies (hopefully with some variation) to
perpetuate their line. But reproducing bodies with modification
might not be an easy task, especially if the bodies are complex
structures like the parse trees that GP evolves. Indeed, the canonical GP
technique is limited to the use of very conservative operators
(almost exclusively a tree-based crossover) that
change the trees in a way reminiscent of the grafting and pruning
familiar to gardeners and farmers around the world [4]. And when we
compare the achievements of grafting and pruning (and I'm not saying they are
not considerable) with the achievements of the DNA/protein
system of life on Earth (the most sophisticated
genotype/phenotype system known to science), it becomes clear why genotype/phenotype
systems are much more powerful than simple replicator systems:
firstly, in genotype/phenotype systems there are no restrictions
concerning the type and number of modifications in the genome and
therefore all the paths and crannies of the fitness landscape are
accessible to meander through,
and secondly, these systems are free to explore different levels of
organization (for instance, genes are expressed into proteins;
proteins form ribosomes, microtubules, membranes, etc.; these
macromolecules and structures then form cells; cells get organized
into tissues; tissues into organs and so forth, culminating with the
organism itself). Obviously these higher levels of complexity are
completely inaccessible to simple replicator systems, for no matter
how complex, they continue to be a single structure, forever
incapable of becoming a part of a much more complex being.
The GEP system is a full-fledged genotype/phenotype system
with expression trees of different sizes and shapes encoded in
linear chromosomes of fixed length. Also important is that GEP
chromosomes are multigenic, encoding multiple expression trees or
sub-programs that can be organized into a much more complex program.
So, like the DNA/protein system of life on Earth, the
gene/tree system of GEP can not only explore all the
crannies and paths of the solution space but it's also free to explore
higher levels of organization. But how is this achieved? In order to
answer this question we need to take a closer look at the structures
of the main players of GEP the chromosomes and the expression
trees and see how they work together.
The Architecture of GEP Programs
So, there are two main players in GEP: the
chromosomes and the expression trees (ETs), being the latter the expression of the genetic information encoded in the former. As in nature, the process of information decoding is called
translation. And this translation implies obviously a kind of code and a set of rules. The
genetic code of GEP is very simple: a one-to-one relationship between the symbols of the
genes and the nodes they represent in the trees. The
rules are also very simple: they determine the spatial organization of nodes in the expression trees and the type of interaction between sub-ETs. Therefore, there are two languages in GEP: the language of genes and the language of expression trees and, thanks to the simple rules that determine the structure of ETs and their interactions, it is possible to infer immediately the expression tree given the sequence of a gene, and vice versa. This means that
we can choose to have a very complex program represented by its compact genome without losing in meaning. This unequivocal bilingual notation is called
Karva language. Its details are explained in the remainder of this
tutorial.
The structural organization of GEP genes is better understood in terms of
open reading frames (ORFs). In biology, an ORF or coding sequence of a gene begins with the start codon, continues with the amino acid codons, and ends at a termination codon. However, a gene is more than the respective ORF, with sequences upstream of the start codon and sequences downstream of the stop codon.
And although in GEP the start site is always the first position of a gene, the termination point does not always coincide with the last position of a gene. Consequently, it is common for GEP genes to have
noncoding regions downstream of the termination point. These noncoding
regions obviously do not interfere with expression but, nonetheless, they play a crucial role in evolution, for they alone allow the creation of valid programs no matter how profoundly their chromosomes are
modified (well get back to this later).
Consider, for example, the algebraic expression:
|
(1) |
It can also be represented as a diagram or an expression tree:
where Q represents the square root function.
This kind of diagram representation is what is called the phenotype
in Gene Expression Programming. And the
genotype can be easily inferred from the phenotype as follows:
which is the straightforward reading of the expression tree from left to right and from top to bottom (exactly as we read a page of text). The
expression (2) is an ORF, starting at Q (position 0) and terminating at d (position 7). These ORFs were named
K-expressions from Karva notation.
Consider another ORF, the following K-expression:
01234567890 |
|
Q*b**+baQba |
(3) |
Its expression as an ET is also very simple and straightforward. In order to express the ORF correctly, we must follow the rules governing the spatial distribution of functions and terminals. First, the start of a gene corresponds to the
root of the ET which is placed in the topmost line. Second, in the next line, below each function, are placed as many branch nodes as there are arguments to that function. Third, from left to right, the nodes are filled consecutively with the next elements of the K-expression. Fourth, the process is repeated until a line containing only terminals is formed. In this case, the following expression tree is
obtained:
Just by looking at the structure of GEP K-expressions, it is difficult or even impossible to see the advantages of such a representation, except perhaps for its simplicity and elegance. However, when
K-expressions are analyzed in the context of a gene, the advantages of this representation become obvious. As previously stated, GEP chromosomes have fixed length, and they are composed of one or more genes of equal length. Consequently, the length of a gene is also fixed. Thus, in
GEP, what varies is not the length of genes which is constant, but the length of the
K-expressions. Indeed, the length of a K-expression may be equal to or less than the length of the gene. In the
former case, the termination point coincides with the end of the gene, and in the latter, the termination point is somewhere upstream of the end of the gene.
What is the function of these noncoding regions of GEP genes? We will see that they are the essence of
Gene Expression Programming and evolvability, for they allow the modification of the genome using several genetic operators without restrictions, always producing syntactically correct programs. Thus, in
GEP, the fundamental property of genotype/phenotype systems
syntactic closure is intrinsic, allowing the totally unconstrained restructuring of the genotype and, consequently, an efficient evolution.
Lets analyze then the structural organization of GEP genes in order to understand how they invariably code for syntactically correct programs and why they allow an unconstrained application of virtually any genetic operator.
The genes of Gene Expression Programming are composed of a head and a
tail. The head contains symbols that represent both functions and terminals, whereas the tail contains only terminals. For each problem, the length of the head
h is chosen, whereas the length of the tail t is a function of
h and the number of arguments n of the function with more arguments (also called
maximum arity) and is evaluated by the equation:
Consider a gene for which the set of functions F = {Q, *, /, -, +} and the
set of terminals T = {a, b}. In this case n = 2; if we chose an
h = 15, then t = 15 (2 - 1) + 1 = 16; thus, the length of the gene
g is 15 + 16 = 31. One such gene is shown below (the head is shown in
blue):
0123456789012345678901234567890 |
|
*b+a-aQab+//+b+babbabbbababbaaa |
(5) |
It codes for the following ET:
In this case, the K-expression ends at position 7, whereas the gene ends at position
30.
Suppose now a mutation occurred at position 6, changing the Q into *. Then the following gene is obtained:
0123456789012345678901234567890 |
|
*b+a-a*ab+//+b+babbabbbababbaaa |
(6) |
And its expression gives:
In this case, the termination point shifts one position to the right (position
8), changing slightly the daughter tree.
Consider another mutation in chromosome (5) above, the substitution of a at position 5 by +,
resulting in the following chromosome:
0123456789012345678901234567890 |
|
*b+a-+Qab+//+b+babbabbbababbaaa |
(7) |
And its expression gives:
In this case, the termination point shifts 12 positions to the right (position
19), enlarging and changing significantly the daughter tree.
Obviously the opposite might also happen, and the daughter tree might shrink. For example, consider again
gene (5) above, and suppose a mutation occurred at position 2, changing the + into Q,
giving:
0123456789012345678901234567890 |
|
*bQa-aQab+//+b+babbabbbababbaaa |
(8) |
And now its expression results in the following ET:
In this case, the ORF ends at position 3, shortening the original ET in
four nodes.
So, despite their fixed length, each gene has the potential to code for expression trees of different sizes and shapes, where the simplest is composed of only one node (when the first element of a gene is a terminal) and the largest is composed of as many nodes as the length of the gene (when all the elements of the head are functions with maximum arity).
It is evident from the examples above, that any modification made in the genome, no matter how profound, always results in a structurally correct program.
Now this is unique to GEP and is obviously at the heart of its
superior performance: for instance, in the simple replicator system
of GP, most modifications result in syntactically invalid programs
(imagine, for instance, what would happen if a terminal in a GP tree is
replaced by a function), which is why most GP implementations rely
exclusively on inefficient tree-specific crossover to create genetic
diversity [4].
Multigenic Chromosomes and Linking Functions
Genotype/phenotype systems are such well-oiled machines that their
expansion into more complex systems is rather easy, and the
introduction of multiple genes in Gene Expression Programming
clearly illustrates this.
So, the chromosomes of Gene Expression Programming are usually composed of more than one gene of equal length. For each problem or run, the number of genes, as well as the
size of the head, are a priori chosen. Each gene codes for a sub-ET and the sub-ETs interact with one another forming a more complex multi-subunit expression tree.
Consider, for example, the following chromosome with length 39, composed of three genes, each with length
13 (the
heads are shown in blue):
012345678901201234567890120123456789012 |
|
QaQ+-Qbbaaaba+Q+ab+abababa*-**b+aabbaba |
(9) |
It has obviously three K-expressions and each one of them is
expressed independently, resulting therefore in three different sub-ETs:
For the sake of simplicity, in the linear representation
above, the start of each K-expression is always given by position 0; the end of each
K-expression, though, is only evident upon construction of the corresponding sub-ET. As shown in the
Figure above, the first K-expression ends at position 1; the second at position
7; and the last at position 10. Thus, the multigenic chromosomes of
GEP contain multiple K-expressions of different sizes, each one of
them coding for a structurally and functionally unique sub-ET.
Obviously, these sub-ETs or sub-programs must interact with one another
and, in the canonical GEP system, they interact
through special functions, the so called linking functions.
These functions link the sub-ETs together one after the other in an
orderly fashion. For instance, the linking by addition of all the three sub-ETs shown in the
Figure above is illustrated
below (the linking functions are shown in gray):
Note that the final program represented in the Figure above could be linearly encoded as the following
K-expression:
01234567890123456789012 |
|
++*Q+-*aQ+*b+aab+abbaab |
(10) |
However, the use of multigenic chromosomes is more appropriate to evolve solutions to complex problems, for they permit the modular construction of complex, hierarchical structures, where each gene codes for a smaller and simpler building block. These smaller building blocks are separated from
one another and are therefore free to evolve independently, allowing
for the creation of concrete new units that might prove handy in a
new situation.
GEP with Random Numerical Constants
The implementation of a facility for handling random numerical
constants (RNCs) in Gene Expression Programming is another
example of how easy it is to create and explore higher levels of
complexity in genotype/phenotype systems and, indeed, RNCs are
elegantly and efficiently
implemented in GEP.
This elegant construct involves an extra
terminal ? that is used for representing the RNCs
and an additional domain Dc at the end of GEP genes. Structurally, the Dc comes after the
tail, has a length t equal to
the length of the tail, and is composed of the symbols used to represent the random constants. Therefore, another region (besides the head and
the tail) with defined boundaries and its own alphabet is created in the gene.
Consider the single-gene chromosome with a head size h = 7 (the Dc is shown in
blue):
01234567890123456789012 |
|
+?*+?**aaa??aaa68083295 |
(11) |
where the terminal ? represents the random constants. The expression of this kind of
gene is exactly done as
explained above for genes with just the head/tail
construct, giving:
Then the ?s in the expression tree are replaced from left to right and from top to bottom by the symbols (numerals, for simplicity) in Dc, obtaining:
The values corresponding to these symbols are kept in an array. For simplicity, the number represented by the numeral indicates the order in the array. For instance, for the 10 elements
zero-based array:
A = {0.611, 1.184, 2.449, 2.98, 0.496, 2.286, 0.93, 2.305, 2.737, 0.755}
chromosome (11) above gives:
Obviously, the GEP-RNC algorithm can also be used to evolve highly sophisticated programs composed of multiple sub-programs through the use of
multigenic chromosomes.
Obviously in this case the genes of the multigenic GEP-RNC system carry
the extra Dc domain for encoding the random numerical constants.
Higher Levels of Complexity
When the mapping is perfect, genotype/phenotype systems are
extremely efficient, and Gene Expression Programming is the only
GP-style algorithm with such a mapping. And this means obviously
that these full-fledged systems can grow: not ΰ la GP which can only bloat, but
rather grow in order to explore higher levels of complexity. And
we've already seen here different levels of organization: unigenic
systems, both with and without a Dc domain for handling numerical
constants, and multigenic systems also with and without Dc domains.
But there are many more GEP systems: unicellular and
multicellular systems for evolving automatically defined functions;
systems for parameter optimization that explore both the multigenic
organization and the Dc domain; systems for inducing polynomials,
neural networks, and decision trees, which also explore the
concerted action of different gene domains for fine-tuning all kinds
of numerical constants: polynomials' coefficients, the weights and
thresholds of neural networks, and the numerical constants of
decision trees with mixed/numeric attributes. You can learn all
about these systems in Ferreira's book
Gene Expression Programming: Mathematical Modeling by an Artificial
Intelligence [2].
References
- Ferreira, C., 2001.
Gene Expression Programming: A New Adaptive Algorithm for
Solving Problems. Complex Systems, Vol. 13, issue 2: 87-129.
- Ferreira, C., 2006.
Gene Expression Programming:
Mathematical Modeling by an Artificial Intelligence. 2nd Edition, Springer-Verlag, Germany.
- Ferreira, C., 2006.
Designing Neural Networks Using Gene Expression Programming.
In A. Abraham, B. de Baets, M. Kφppen, and B. Nickolay, eds., Applied Soft Computing Technologies: The Challenge of Complexity, pages
517-536, Springer-Verlag.
- Koza, J. R., 1992.
Genetic Programming: On the Programming of Computers by Means of Natural Selection.
Cambridge, MA: MIT Press.
Last modified: August 27, 2007
Cite this as:
Ferreira, C. "A Quick Introduction to Gene Expression Programming." From
GEP
Tutorials: A Gepsoft Web Resource.
https://www.gene-expression-programming.com/tutorial001.htm
More Tutorials
| GEP
Mailing List
***
|