Questions & Answers
about the GEP Book

Q&A about the GEP Book

These are Q&A from my personal correspondence with readers of my book, both of the first and second editions. It’s been great getting your feedback and your emails! Thank you all so much! I know the second edition of my book is a little bit pricey (but if it’s any consolation, it’s a huge, not to mention substantial :), book with 478 pages), but for those of you who cannot afford it, I made the first edition freely available online in html. So please take a look and join in on the discussions! As usual, for privacy sake, I removed any references that could identify the people involved.

I am impressed how much comes out of the treatment of constants. It seems you have taken them much farther than in GP. I can't remember, but I don't think GP has any analogue of the neat optimization techniques. After reading that and the decision tree chapter, I can really see how important the "Dc" part of the chromosome can be!

I’m really glad you understood the importance of the Dc in GEP (most people choose to very conveniently ignore it). To my knowledge GP has no analogue of the optimization techniques I propose but there are people that nevertheless claim that GP can be used to do parameter optimization by running a typical symbolic regression run – a perfect example of a Rube Goldberg machine, in my opinion.

In the chapter on numerical constants, I wondered why you have both the R and C vectors. In other words, why not just put the C values (or a subset of them) in the Dc part? Why the extra step of referencing C using R? I can see that some extra variability may result, but since you mutate C as well as R, the search space is larger. I'm sure there is something I missed here.

I don’t remember having weighted the pros and cons of this representation. I first tried to implement a system similar to the one used in GP but I wasn’t satisfied with the results and had a feeling that something better could be created. I then came up with this system and now that you ask, I think the use of both R and C has indeed some advantages. First, it gives more control over constant variability (the Dc length depends both on the head size and max arity and therefore in a direct representation the variability will be dictated by the Dc size). Second, it allows the appearance of duplicated constants (it’s much harder to evolve exactly the same floating-point constants with the direct representation) which seems to be very important as it is common to see the same constant being used in different places. And finally, I think it is more elegant.

I wasn't sure to what degree your criticism of homogenizing was directed properly at GP, though. GP does not use mutation usually, but Koza points out that the exchange of subtrees results in constant innovation in the population. For instance, even if all individuals are identical, a GP cross can still produce an individual that is different from the rest. What do you think?

That’s true, but there is however some homogenization going on there otherwise they wouldn’t say that in GP it is useless to continue evolving beyond 51 generations. I’ve never implemented a Koza-style GP so I cannot tell you exactly what goes on within all those huge numbers of individuals that he usually uses, but one can have an idea of what takes place there by running a simple experiment in GEP: by using just gene recombination and gene transposition both at 100% and a chromosome with say 10 genes, one is just recombining mathematically concise blocks as is done in GP and since genes are moving around in the chromosome you’ll also avoid chromosomes becoming equal. And indeed populations also stop evolving after a few generations. So perhaps what happens in GP is that innovation stops because it is rediscovering again and again old combinations of building blocks.

I keep coming back to how in GEP the average fitness is way below the best individual fitness, and your use of relatively high mutation rates. In GA, one says that you can't mutate too fast or as you approach the optimum, you'll "mutate back" (disadvantageously) more often than forward, and won't reach the optimum. Is it because you use elitism that you can mutate so fast, and also so disruptively?

In GEP the behavior of average fitness will depend, among other things, on the genetic operators you are using and, yes, if you use all the genetic operators at the rates I usually use, you will get populations with best fitness considerably higher than average. But if you use just recombination (and even add a little bit of mutation), you’ll get what one usually does for a GA. Obviously elitism plays a small part in all this, and of course the larger the percentage of individuals that are preserved from generation to generation the closer the gap between average and best fitness. And since in GEP I only clone one individual (the very best), this might also contribute to accentuate the gap even more. (I know Koza usually keeps a fair amount of individuals unchanged from generation to generation as I guess he cannot afford to lose genetic diversity, but I remember doing some early experiments and concluding that in GEP it is a waste of resources to keep so many unchanged individuals as the system works equally well by keeping just one of the best.)

About “mutating back” in GAs I have my doubts as people seem to be so fixated on crossover. From my experience with GAs, I don’t see that they behave so much differently than GEP. In all systems, there is a point where you start applying too much mutation and you are just delaying things or making things almost random, which obviously is not the goal. And the GA is no exception. One can call it “mutate back” but I guess this is just common sense.

I guess what surprises me is that the average fitness, after an initial rise, stays pretty much constant, whereas the best fitness keeps rising. Why doesn't average fitness follow best fitness up? I guess the answer is that there is so much mutation (of various kinds). But then, why does the best keep rising? I guess that's because it can't fall! (due to elitism) and is occasionally bettered. Well, I'm not very expert on this.

You got that right! These systems are usually optimized to solve problems quickly and efficiently and, thanks to elitism, one can use mutation rates considerably higher than the mutation rates used in nature as what counts in nature (and in some artificial evolutionary systems) is the performance of the overall population. Here, one is just interested in the performance of the best and the rest of the population is always being sacrificed for the benefit of the best. Obviously, without elitism you would have to be more careful and if you were to design good solutions you no doubt would have to be more conservative and use much smaller mutation rates (as is done in nature). In this case, the best and average fitnesses would be very close and both would slowly and painstakingly increase with time (as happens in nature).

I found the last chapter “Evolutionary Studies” stimulating, and I'm not sure I've drawn all the implications yet. I guess the analysis in Goldberg's recent book (The Design of Innovation) is based mainly on what he rather pompously calls "selectorecombinative" GAs – i.e., ones with crossover only. And he derives a lot of stuff having to do with the sizing of populations in order to assure a solution is found. From your analysis, this would seem to be simply that he wants the populations big enough so that there will be a significant founder effect, in order that the solution is found before the homogenizing process of crossover-only takes over. Is this how you see that?

I think you are correct. If he only uses recombination he will need big populations both to find good solutions and avoid premature convergence.

The combinatorial optimization results are startling indeed. Your point about the abandonment of inversion in GAs is interesting.

Why on earth do they continue fixated on recombination? Is it the allure of the schema theorem and the building block hypothesis? Or is this a case of trying to make the facts fit the theory?

I found the section on neutrality very nice, as it counters any possible objection to "wasted" elements in the chromosome.

I hope you are right because I’ve received the most astounding comments from supposed specialists in the EC field who think they know everything there is to know about non-coding sequences in the genome. This jewel is from a reviewer to a paper I submitted to a conference:

  “Also, it's difficult to consider the sequences at the end of the GEP genes to be ‘junk’. In the biological sense (unless I'm mistaken) the large amount of junk in gene sequences is really that – either they have shown it doesn't code for any proteins, or they have not managed to assign function to it yet. Whereas, you say yourself that your junk sequences in GEP genes are ‘extremely important’."  

The beautiful thing about the non-coding regions of GEP genes is that we know exactly why they exist, that is, they exist to ensure syntactic closure. That they may provide other benefits – such as provide places for safely exchanging genes (more precisely, ORFs or K-expressions) or places for the accumulation of neutral mutations – is just an extra bonus. Indeed, like so many times in evolutionary systems, natural or otherwise, a structure first appears to solve a particular problem but because it is also handy at solving others it brings unexpected benefits, making the system much more robust as a whole.

The only chapter I found a bit confusing was Chapter 7. I didn't know anything about the Kolmogorov-Gabor polynomials or STROGANOFF. I couldn't understand where you were going – I felt left out. Then I saw at the end how you solved those problems much more simply with GEP, and I realized that was the point of the chapter.

This is also the chapter I like less and I intend to rewrite it completely the next time around because this algorithm is very interesting especially if one uses simpler polynomials requiring just 2-3 different coefficients (the complete K-G series requires 6, making things excessively complex). And since there is no limit to the variety of functions one can implement, it is possible to create a varied enough set and just play with them as one does with any other mathematical function (you can even mix them all together and it works even better). I don’t think I stressed this enough in my book but anyone could reach this conclusion just by analyzing the kinds of functions that were being selected to build the best models (I actually found this out while rewriting this chapter for the 2nd edition but at the time I was unfortunately short on time to once again rewrite it).

Why do you weight all the functions in the function set twice in almost all classification problems? And what is the criterion for choosing the function set?

Well, what I found to work better for a classification problem such as this (diagnosis of breast cancer with 9 variables) is to have a function set with at least twice as many functions as there are variables. So, in this case, weighting the 10 functions two times does more or less this. Usually if overall we have less functions than terminals, we will be making the evolutionary process less efficient as the evolved models will have a tendency to become very simplistic since the terminals will have a greater probability of being integrated not only during the creation of the initial population but also during genetic modification. So, it is important to find a balance between the number of functions and variables. As for the functions chosen, my personal experience is that most of the time a simple function set with only the basic arithmetic operators (weighted a certain number of times) is a good starting point for classification problems. But you know how people are (myself included): we like to use more fancy functions and since GeneXproTools deals with more complex functions equally well, I just like to spice the evolutionary process a bit and see what happens.

Which values should be used for the 0/1 rounding threshold? Will the current model not adjust itself to any value you have chosen?

You are right: the plasticity of GEP is extraordinary and it can, indeed, find a way around this (you see the same plasticity operating with different function sets, linking functions, random numerical constants, program structures and so on). But for most problems there is a certain 0/1 rounding threshold that works best. It will depend mainly on three factors: (1) on the kind of data you have (only positive values, only negative values, a mixture of the two, normalized or non-normalized data, etc.); (2) on the kind of functions you are using (for instance, if you are using functions that by themselves already round the values they return to 0 or 1, in this case a rounding threshold between 0 and 1 will do); and (3) on the presence of random numerical constants (when a diverse range of random numerical constants is available it is always possible for the algorithm to work its way around even the most adverse situations, such as a negative rounding threshold combined with only positive inputs and a function set with only addition and multiplication).

Concerning the diagnosis of breast cancer example, you write that you got a 340 fitness value (number of hits) against the training set of 350 records and a 173 fitness value against a 174 records testing set. Did you wait until the model has provided the best training fitness to apply it against the test set and get this optimum result, or did you evolve many times, through a "stop and go again" process until the test set fitness has reached its best?

As you know, GeneXproTools allows you to stop the evolutionary process after which you can take a look at the performance in the test set. And if you are not satisfied with the generalization results you might choose to continue by using the evolved model as seed. Well, I do this a lot whenever I’m trying to find the best possible model to solve a real-world problem but, in the breast cancer example you mentioned, I did several runs from scratch using different settings (different functions, different rounding thresholds, different architectures and so on) as I went along and then I chose the best model.

Questions & Answers


Last update: 23/July/2013
© Candida Ferreira
All rights reserved.