In this classification problem, the goal is to classify irises based on four measurements: sepal length, sepal width, petal length, and petal width. The iris dataset contains fifty examples each of three types of iris: Iris setosa, Iris versicolor, and Iris virginica.
Classification problems with more than two classes can also be solved by GEP, but the data must be rearranged. Classifying data into n distinct classes C1, ..., Cn requires decomposing it into n separate 0/1 classification problems, as follows:
1. C1 versus NOT C1
2. C2 versus NOT C2
...
n. Cn versus NOT Cn
Then n different models are evolved separately and afterwards combined in order to make the final decision.
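The rearrangement described above can be sketched as a relabeling step. This is only a minimal illustration, assuming the classes are stored as integer labels; the function name make_binary_targets and its parameters are hypothetical and do not come from the original text.

```c
#include <stddef.h>

/* Hypothetical helper (not from the text): build the 0/1 targets for the
 * sub-problem "target_class versus NOT target_class" from integer labels. */
void make_binary_targets(const int *labels, size_t n_samples,
                         int target_class, int *targets)
{
    for (size_t i = 0; i < n_samples; ++i) {
        /* 1 marks a member of the chosen class, 0 marks every other class. */
        targets[i] = (labels[i] == target_class) ? 1 : 0;
    }
}
```

Calling this helper once per class yields the n separate 0/1 datasets on which the n models are evolved.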
For the iris data, we are going to decompose the problem into three separate 0/1 classification problems: the first is Iris setosa versus NOT Iris setosa; the second is Iris versicolor versus NOT Iris versicolor; and the last is Iris virginica versus NOT Iris virginica.
For this problem, the function set was F = {+, -, *, /} and the set of terminals included all four attributes, represented by d0 - d3, which correspond, respectively, to sepal length, sepal width, petal length, and petal width. The 0/1 rounding threshold was set to 0.5 and the fitness was evaluated by equation (4.28).
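The role of the 0/1 rounding threshold can be sketched as follows. The helper names round_to_class and count_hits are hypothetical, and count_hits is only an illustrative stand-in that counts correct predictions; it is not a reproduction of equation (4.28).

```c
#include <stddef.h>

/* Convert a raw model output into a 0/1 class using a rounding threshold. */
int round_to_class(double raw_output, double threshold)
{
    return (raw_output >= threshold) ? 1 : 0;
}

/* Illustrative stand-in, NOT equation (4.28): count how many thresholded
 * outputs agree with the 0/1 targets. */
size_t count_hits(const double *raw_outputs, const int *targets,
                  size_t n_samples, double threshold)
{
    size_t hits = 0;
    for (size_t i = 0; i < n_samples; ++i) {
        if (round_to_class(raw_outputs[i], threshold) == targets[i]) {
            ++hits;
        }
    }
    return hits;
}
```

With a threshold of 0.5, a raw output of exactly 0.5 is mapped to class 1, which matches the `>=` comparison used in the evolved models.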
For all the sub-problems, I started with three-genic chromosomes with h = 8 and sub-ETs linked by addition. The first dataset (setosa versus NOT setosa) was classified almost instantaneously without errors, and I soon found that only a very simple structure is required to classify this dataset correctly. The model below perfectly classifies all the irises into setosa and NOT setosa:
    double APSCfunction(double d[])
    {
        double dblTemp = 0;
        dblTemp += (d[1]-d[2]);
        return (dblTemp >= 0.5 ? 1 : 0);
    }
(4.31)
As you can see, only the difference between the sepal width and the petal length is needed to distinguish Iris setosa from the other two irises.
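As a quick check of this rule, the sketch below applies it to one sample of each species. The function name is_setosa and the array names are hypothetical; the measurements are the first setosa, versicolor, and virginica records of the standard iris dataset, in the order sepal length, sepal width, petal length, petal width.

```c
/* Model (4.31) reduced to its single term: sepal width minus petal length. */
int is_setosa(const double d[4])
{
    return ((d[1] - d[2]) >= 0.5) ? 1 : 0;
}

/* One sample per species (cm): sepal length, sepal width,
 * petal length, petal width. */
static const double setosa_sample[4]     = {5.1, 3.5, 1.4, 0.2};
static const double versicolor_sample[4] = {7.0, 3.2, 4.7, 1.4};
static const double virginica_sample[4]  = {6.3, 3.3, 6.0, 2.5};
```

For the setosa sample the difference is 3.5 - 1.4 = 2.1, which is above the threshold, whereas for the other two samples the petal is longer than the sepal is wide and the difference is negative.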
The classification of the remaining datasets was also extremely accurate, but in both cases only 149 out of 150 samples were correctly classified. The model below distinguishes Iris versicolor from the other two irises:
    double APSCfunction(double d[])
    {
        double dblTemp = 0;
        dblTemp += (d[3]*(((d[0]*d[3])-d[1])+((d[1]*d[2])-d[2])));
        dblTemp += (((d[2]-(d[2]+d[2]))-(d[0]/d[0]))/d[3]);
        dblTemp += (((d[0]-(d[2]*d[3]))*(d[2]-d[1]))*d[0]);
        return (dblTemp >= 0.5 ? 1 : 0);
    }
(4.32)
And the next model distinguishes Iris virginica from setosa and
versicolor:
    double APSCfunction(double d[])
    {
        double dblTemp = 0;
        dblTemp += (d[1]/(d[0]*(d[0]/d[3])));
        dblTemp += (d[2]-d[1]);
        dblTemp += ((d[2]-(((d[0]+d[3])/d[1])/d[3]))-d[2]);
        return (dblTemp >= 0.5 ? 1 : 0);
    }
(4.33)
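As a reading aid, the raw sums computed by models (4.32) and (4.33) can be simplified algebraically, assuming d0, d1, and d3 are nonzero so that every division is defined. The symbols s2 and s3 are introduced here only for illustration; they denote the value of dblTemp just before the 0.5 threshold is applied.

```latex
\begin{aligned}
s_2 &= d_3\,(d_0 d_3 - d_1 + d_1 d_2 - d_2) \;-\; \frac{d_2 + 1}{d_3} \;+\; d_0\,(d_0 - d_2 d_3)(d_2 - d_1),\\[4pt]
s_3 &= \frac{d_1 d_3}{d_0^{2}} \;+\; (d_2 - d_1) \;-\; \frac{d_0 + d_3}{d_1 d_3}.
\end{aligned}
```

In the second term of (4.32), d0/d0 reduces to 1 and d2-(d2+d2) to -d2; in the third term of (4.33), the leading and trailing d2 cancel.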
So, by combining the three models above and representing them by y1, y2, and y3, the following classification rules are obtained:
    IF (y1 = 1 AND y2 = 0 AND y3 = 0) THEN setosa;
    IF (y1 = 0 AND y2 = 1 AND y3 = 0) THEN versicolor;
    IF (y1 = 0 AND y2 = 0 AND y3 = 1) THEN virginica;
(4.34)
These rules correctly classify 149 out of 150 irises. With a classification accuracy of 99.33% and a classification error of 0.67%, this is one of the best models obtained for the iris dataset by machine learning algorithms.
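The three models and the rules in (4.34) can be put together as in the sketch below. The function names, the enum, and in particular the IRIS_UNDECIDED fallback are assumptions made for illustration: the rules in the text do not say what to do when none of the three patterns matches. The code also assumes nonzero measurements, since the evolved expressions divide by d[0], d[1], and d[3].

```c
/* Species codes. IRIS_UNDECIDED is a hypothetical fallback for output
 * patterns not covered by rules (4.34); it is not part of the text. */
enum iris_class {
    IRIS_UNDECIDED = -1,
    IRIS_SETOSA,
    IRIS_VERSICOLOR,
    IRIS_VIRGINICA
};

/* y1: model (4.31), setosa versus NOT setosa. */
int y1_setosa(const double d[4])
{
    double t = 0;
    t += (d[1]-d[2]);
    return (t >= 0.5 ? 1 : 0);
}

/* y2: model (4.32), versicolor versus NOT versicolor. */
int y2_versicolor(const double d[4])
{
    double t = 0;
    t += (d[3]*(((d[0]*d[3])-d[1])+((d[1]*d[2])-d[2])));
    t += (((d[2]-(d[2]+d[2]))-(d[0]/d[0]))/d[3]);
    t += (((d[0]-(d[2]*d[3]))*(d[2]-d[1]))*d[0]);
    return (t >= 0.5 ? 1 : 0);
}

/* y3: model (4.33), virginica versus NOT virginica. */
int y3_virginica(const double d[4])
{
    double t = 0;
    t += (d[1]/(d[0]*(d[0]/d[3])));
    t += (d[2]-d[1]);
    t += ((d[2]-(((d[0]+d[3])/d[1])/d[3]))-d[2]);
    return (t >= 0.5 ? 1 : 0);
}

/* Apply rules (4.34): exactly one model must answer 1. */
enum iris_class classify_iris(const double d[4])
{
    int y1 = y1_setosa(d);
    int y2 = y2_versicolor(d);
    int y3 = y3_virginica(d);

    if (y1 == 1 && y2 == 0 && y3 == 0) return IRIS_SETOSA;
    if (y1 == 0 && y2 == 1 && y3 == 0) return IRIS_VERSICOLOR;
    if (y1 == 0 && y2 == 0 && y3 == 1) return IRIS_VIRGINICA;
    return IRIS_UNDECIDED; /* assumption: no rule fired */
}
```

Passing an array such as {7.0, 3.2, 4.7, 1.4} (sepal length, sepal width, petal length, petal width) to classify_iris returns IRIS_VERSICOLOR, since only y2 evaluates to 1 for that sample.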