In this section we are going to analyze two different approaches to the problem of constant creation in symbolic regression by comparing the performance of two different algorithms. The first uses the facility to manipulate random constants directly and the second does not include this facility. The comparison between the two approaches will be made on three different problems. The first is an artificial problem of sequence induction requiring integer constants; the second is a problem of function finding requiring floating-point constants; and the third is a real-world time series prediction problem also requiring floating-point constants.
For the sequence induction problem, the following test sequence was chosen:
$$a_n = 4n^4 + 3n^3 + 2n^2 + n \tag{4.9}$$
where n consists of the nonnegative integers. This sequence was chosen because it can be exactly solved by both algorithms and therefore can provide an accurate measure of their performance in terms of success rate.
For the function finding problem, the following “V” shaped function was chosen:
$$y = 4.251a^2 + \ln(a^2) + 7.243e^a \tag{4.10}$$
where a is the independent variable and e is the irrational number 2.71828183. Problems of this kind cannot be exactly solved by evolutionary algorithms and, therefore, the performance of both approaches will be compared in terms of average best-of-run fitness and average best-of-run R-square.
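R-square is not defined in this section. A common convention, assumed in the sketch below, is the square of the Pearson correlation coefficient between the target values and the model output. The function name `r_square` is hypothetical and is used only for illustration.

```python
def r_square(targets, predictions):
    """Squared Pearson correlation between targets and predictions.

    This definition is an assumption; the section itself does not define R-square.
    """
    n = len(targets)
    mean_t = sum(targets) / n
    mean_p = sum(predictions) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(targets, predictions))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    if var_t == 0 or var_p == 0:
        # The correlation is undefined for a constant series; treat it as no fit.
        return 0.0
    return cov * cov / (var_t * var_p)
```

Under this definition a model whose output is any increasing linear function of the target scores 1, so R-square complements the error-based fitness rather than replacing it.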
For the time series prediction task, 100 observations of the Wolfer sunspots series were used (Table 4.5) with an embedding dimension of 10 and a delay time of one (see section 4.4 for more details). Once again, the performance of both approaches will be compared in terms of average best-of-run fitness and R-square.
Table 4.5
Wolfer sunspots series (read by rows).

| 101 | 82 | 66 | 35 | 31 | 7 | 20 | 92 | 154 | 125 |
| 85 | 68 | 38 | 23 | 10 | 24 | 83 | 132 | 131 | 118 |
| 90 | 67 | 60 | 47 | 41 | 21 | 16 | 6 | 4 | 7 |
| 14 | 34 | 45 | 43 | 48 | 42 | 28 | 10 | 8 | 2 |
| 0 | 1 | 5 | 12 | 14 | 35 | 46 | 41 | 30 | 24 |
| 16 | 7 | 4 | 2 | 8 | 17 | 36 | 50 | 62 | 67 |
| 71 | 48 | 28 | 8 | 13 | 57 | 122 | 138 | 103 | 86 |
| 63 | 37 | 24 | 11 | 15 | 40 | 62 | 98 | 124 | 96 |
| 66 | 64 | 54 | 39 | 21 | 7 | 4 | 23 | 55 | 94 |
| 96 | 77 | 59 | 44 | 47 | 30 | 16 | 7 | 37 | 74 |

For the sequence induction problem, the first 10 positive integers n and their corresponding terms were used as fitness cases (Table 4.6). The fitness function was based on the relative error and the fitness was evaluated by equation (3.1b). A selection range of 25% and maximum precision (0% error) were chosen, giving $f_{max} = 250$. This experiment, with its two different approaches, is summarized in Table 4.7.
Table 4.6
Set of fitness cases for the sequence induction task.

| n | a_n |
|---|---|
| 1 | 10 |
| 2 | 98 |
| 3 | 426 |
| 4 | 1252 |
| 5 | 2930 |
| 6 | 5910 |
| 7 | 10738 |
| 8 | 18056 |
| 9 | 28602 |
| 10 | 43210 |

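The fitness cases of Table 4.6 and the stated maximum fitness can be reproduced with a short sketch. Equation (3.1b) is not shown in this section, so the code assumes the usual relative-error form, in which each fitness case contributes the selection range M minus the absolute percentage error. The names `sequence_term` and `relative_error_fitness` are hypothetical.

```python
def sequence_term(n):
    """Target sequence (4.9): a_n = 4n^4 + 3n^3 + 2n^2 + n."""
    return 4 * n**4 + 3 * n**3 + 2 * n**2 + n


def relative_error_fitness(model, cases, selection_range):
    """Sum over the fitness cases of (M - |percentage relative error|).

    This is an assumed form of equation (3.1b), not a quotation of it.
    """
    total = 0.0
    for x, target in cases:
        percent_error = abs((model(x) - target) / target * 100.0)
        total += selection_range - percent_error
    return total


# Fitness cases of Table 4.6: n = 1..10 paired with their terms.
cases = [(n, sequence_term(n)) for n in range(1, 11)]

# A perfect model has zero error on every case, so it scores
# 10 cases * 25 = 250, the maximum fitness quoted in the text.
f_max = relative_error_fitness(sequence_term, cases, selection_range=25)
```

Under the same assumption, the maximum fitness is simply the number of fitness cases multiplied by the selection range.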
Table 4.7
General settings used in the sequence induction problem with (SI*) and without (SI) random constants.

| | SI* | SI |
|---|---|---|
| Number of runs | 100 | 100 |
| Number of generations | 100 | 100 |
| Population size | 100 | 100 |
| Number of fitness cases | 10 (Table 4.6) | 10 (Table 4.6) |
| Function set | + - * / | + - * / |
| Terminal set | a, ? | a |
| Random constants array length | 10 | -- |
| Random constants range | {0, 1, 2, 3} | -- |
| Head length | 6 | 6 |
| Number of genes | 5 | 5 |
| Linking function | + | + |
| Chromosome length | 100 | 65 |
| Mutation rate | 0.044 | 0.044 |
| One-point recombination rate | 0.3 | 0.3 |
| Two-point recombination rate | 0.3 | 0.3 |
| Gene recombination rate | 0.1 | 0.1 |
| IS transposition rate | 0.1 | 0.1 |
| IS elements length | 1,2,3 | 1,2,3 |
| RIS transposition rate | 0.1 | 0.1 |
| RIS elements length | 1,2,3 | 1,2,3 |
| Gene transposition rate | 0.1 | 0.1 |
| Random constants mutation rate | 0.01 | -- |
| Dc specific transposition rate | 0.1 | -- |
| Dc specific IS elements length | 1,2,3 | -- |
| Selection range | 25% | 25% |
| Precision | 0% | 0% |
| Average best-of-run fitness | 195.308 | 249.982 |
| Average best-of-run R-square | 0.798698299 | 0.9999999996 |
| Success rate | 24% | 98% |

For the “V” shaped function problem, a set of 20 random fitness cases chosen from the interval [-1, 1] was used (Table 4.8). The fitness function was also evaluated by equation (3.1b), but in this case a selection range of 100% was used, giving $f_{max} = 2000$. This experiment, with its two different approaches, is summarized in Table 4.9.
Table 4.8
Set of fitness cases used in the “V” function problem.

| a | f(a) |
|---|---|
| -0.2639725157548 | 3.19498066265276 |
| 0.0578905532656938 | 1.99052001725998 |
| 0.334025290109634 | 8.39663703997286 |
| -0.236334577564462 | 3.07088976972825 |
| -0.855744382566804 | 5.87946763695703 |
| -0.0194437136332785 | -0.775326322328458 |
| -0.192134388183304 | 2.83470225774408 |
| 0.529307910124627 | 12.2154726642137 |
| -0.00788974118728459 | -2.49803983418635 |
| 0.438969804950631 | 10.4071734858808 |
| -0.107559292698039 | 2.09413635645908 |
| -0.274556994377163 | 3.23927278010839 |
| -0.0595333219604528 | 1.19701284767347 |
| 0.384492993958352 | 9.35580769189855 |
| -0.874923020736333 | 6.00642453001302 |
| -0.236546636250546 | 3.07189729043837 |
| -0.167875941704557 | 2.67440053130986 |
| 0.950682181822091 | 22.4819639844149 |
| 0.946979159577362 | 22.3750161187355 |
| 0.639339910059591 | 14.5701285332337 |

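As a quick consistency check, the target function (4.10) can be evaluated at a few of the points in Table 4.8. The sketch below uses the standard library's `math.log` and `math.exp`; the function name `v_function` is hypothetical, and only three of the 20 rows are copied here.

```python
import math


def v_function(a):
    """Target function (4.10): y = 4.251 a^2 + ln(a^2) + 7.243 e^a.

    The ln(a^2) term makes the function undefined at a = 0.
    """
    return 4.251 * a**2 + math.log(a**2) + 7.243 * math.exp(a)


# Three rows copied from Table 4.8 as (a, f(a)).
table_rows = [
    (-0.2639725157548, 3.19498066265276),
    (0.0578905532656938, 1.99052001725998),
    (0.950682181822091, 22.4819639844149),
]

# Largest absolute difference between the formula and the tabulated values.
max_abs_diff = max(abs(v_function(a) - y) for a, y in table_rows)
```

The logarithmic term falls steeply as a approaches zero, which is why the smallest values of f(a) in the table occur at the points closest to the origin.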
Table 4.9
General settings used in the “V” function problem with (V*) and without (V) random constants.

| | V* | V |
|---|---|---|
| Number of runs | 100 | 100 |
| Number of generations | 5000 | 5000 |
| Population size | 100 | 100 |
| Number of fitness cases | 20 (Table 4.8) | 20 (Table 4.8) |
| Function set | + - * / L E K ~ S C | + - * / L E K ~ S C |
| Terminal set | a, ? | a |
| Random constants array length | 10 | -- |
| Random constants range | [-1,1] | -- |
| Head length | 6 | 6 |
| Number of genes | 5 | 5 |
| Linking function | + | + |
| Chromosome length | 100 | 65 |
| Mutation rate | 0.044 | 0.044 |
| One-point recombination rate | 0.3 | 0.3 |
| Two-point recombination rate | 0.3 | 0.3 |
| Gene recombination rate | 0.1 | 0.1 |
| IS transposition rate | 0.1 | 0.1 |
| IS elements length | 1,2,3 | 1,2,3 |
| RIS transposition rate | 0.1 | 0.1 |
| RIS elements length | 1,2,3 | 1,2,3 |
| Gene transposition rate | 0.1 | 0.1 |
| Random constants mutation rate | 0.01 | -- |
| Dc specific transposition rate | 0.1 | -- |
| Dc specific IS elements length | 1,2,3 | -- |
| Selection range | 100% | 100% |
| Precision | 0% | 0% |
| Average best-of-run fitness | 1896.25 | 1953.057 |
| Average best-of-run R-square | 0.95129456 | 0.99647004 |

For the time series prediction problem, using an embedding dimension of 10 and a delay time of one, the sunspots series presented in Table 4.5 results in 90 fitness cases (see section 4.4 for more details). In this case, a wider selection range of 1000% was chosen, giving $f_{max} = 90{,}000$. This experiment, with its two different approaches, is summarized in Table 4.10.
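The count of 90 fitness cases follows from the embedding itself. The sketch below assumes the usual delay-embedding convention, in which 10 consecutive observations serve as inputs (the terminals a through j) and the next observation is the target; section 4.4 remains the authoritative description. The function name `embed` is hypothetical.

```python
# Wolfer sunspots series of Table 4.5, read by rows.
SUNSPOTS = [
    101, 82, 66, 35, 31, 7, 20, 92, 154, 125,
    85, 68, 38, 23, 10, 24, 83, 132, 131, 118,
    90, 67, 60, 47, 41, 21, 16, 6, 4, 7,
    14, 34, 45, 43, 48, 42, 28, 10, 8, 2,
    0, 1, 5, 12, 14, 35, 46, 41, 30, 24,
    16, 7, 4, 2, 8, 17, 36, 50, 62, 67,
    71, 48, 28, 8, 13, 57, 122, 138, 103, 86,
    63, 37, 24, 11, 15, 40, 62, 98, 124, 96,
    66, 64, 54, 39, 21, 7, 4, 23, 55, 94,
    96, 77, 59, 44, 47, 30, 16, 7, 37, 74,
]


def embed(series, dimension, delay=1):
    """Build (inputs, target) pairs from a series.

    Assumed convention: `dimension` lagged values, spaced `delay` apart,
    are used to predict the observation that follows them.
    """
    span = dimension * delay  # distance from the oldest input to the target
    cases = []
    for t in range(span, len(series)):
        inputs = [series[t - span + k * delay] for k in range(dimension)]
        cases.append((inputs, series[t]))
    return cases


cases = embed(SUNSPOTS, dimension=10, delay=1)

# 100 observations minus 10 inputs per case leaves 90 fitness cases;
# with a selection range of 1000 the maximum fitness is 90 * 1000.
f_max = len(cases) * 1000
```

Under this convention the first fitness case uses the first row of Table 4.5 as inputs and the first value of the second row as its target.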
Table 4.10
General settings used in the sunspots prediction task with (SS*) and without (SS) random constants.

| | SS* | SS |
|---|---|---|
| Number of runs | 100 | 100 |
| Number of generations | 5000 | 5000 |
| Population size | 100 | 100 |
| Number of fitness cases | 90 (Table 4.5) | 90 (Table 4.5) |
| Function set | 4 (+ - * /) | 4 (+ - * /) |
| Terminal set | a - j, ? | a - j |
| Random constants array length | 10 | -- |
| Random constants range | [-1,1] | -- |
| Head length | 7 | 7 |
| Number of genes | 3 | 3 |
| Linking function | + | + |
| Chromosome length | 69 | 45 |
| Mutation rate | 0.044 | 0.044 |
| One-point recombination rate | 0.3 | 0.3 |
| Two-point recombination rate | 0.3 | 0.3 |
| Gene recombination rate | 0.1 | 0.1 |
| IS transposition rate | 0.1 | 0.1 |
| IS elements length | 1,2,3 | 1,2,3 |
| RIS transposition rate | 0.1 | 0.1 |
| RIS elements length | 1,2,3 | 1,2,3 |
| Gene transposition rate | 0.1 | 0.1 |
| Random constants mutation rate | 0.01 | -- |
| Dc specific transposition rate | 0.1 | -- |
| Dc specific IS elements length | 1,2,3 | -- |
| Selection range | 1000% | 1000% |
| Precision | 0% | 0% |
| Average best-of-run fitness | 86182.05 | 89009.66 |
| Average best-of-run R-square | 0.706437 | 0.801144 |
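
The chromosome lengths listed in Tables 4.7, 4.9, and 4.10 can be recovered from the head length and the number of genes. The sketch below uses the standard tail-length rule t = h(n - 1) + 1, where n is the largest function arity (2 for all the function sets used here). It further assumes that the extra Dc domain used with random constants has the same length as the tail; this is an assumption, although it does reproduce the tabulated values. The function name `chromosome_length` is hypothetical.

```python
def chromosome_length(head, max_arity, genes, with_constants):
    """Chromosome length from head length, maximum arity, and gene count."""
    tail = head * (max_arity - 1) + 1  # t = h(n - 1) + 1
    gene = head + tail
    if with_constants:
        # Assumption: the Dc domain is as long as the tail.
        gene += tail
    return genes * gene


# Sequence induction and "V" function settings: head 6, 5 genes.
si_plain = chromosome_length(6, 2, 5, with_constants=False)
si_const = chromosome_length(6, 2, 5, with_constants=True)

# Sunspots settings: head 7, 3 genes.
ss_plain = chromosome_length(7, 2, 3, with_constants=False)
ss_const = chromosome_length(7, 2, 3, with_constants=True)
```

In all three experiments the variant with random constants therefore searches a chromosome roughly 50% longer than the variant without them, while population size and number of generations are held equal.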