©1995 Brian M. Winn
2.1 Multi-Layer Perceptron and Backpropagation
2.2 Problems with Backpropagation
4.0 Evolving Neural Network Topologies
4.1 Chromosome Representation
4.2 Genetic Operations and Fitness Function
5.1 Neural Network Simulator
5.2 Genetic Algorithm Module
5.2.1 Version 1
5.2.2 Version 2
5.2.3 Version 3
5.2.4 Version 4
8.1 Backpropagation Learning Algorithm
8.2 Glossary
8.3 References
There are many complex computational systems that contain adjustable parameters. For many of these systems, there are no known heuristics to deterministically set these parameters a priori. These parameters must be set or "tweaked" to optimal values to maximize performance of the system or achieve the desired results. The setting of these parameters can be viewed as an optimization problem.
Several techniques have been devised to approximate optimal solutions for these optimization problems, such as hill climbing, the Monte Carlo method, and genetic algorithms [1]. A genetic algorithm is an approach that incorporates the theory of natural selection and survival of the fittest in order to breed optimal solutions. Genetic algorithms include "genetic-like" operations such as mating, crossover, mutations, and inversion.
In this project, a genetic algorithm approach to optimization was utilized to determine effective neural network topologies.
Handwritten-character reading, continuous speech processing, and complex pattern recognition are problems which have traditionally been difficult to solve with computers. It is virtually impossible to devise a sequential algorithm, a step-by-step list of instructions, to solve these problems. However, all of these problems, with the possible exception of reading handwriting, are easily solvable by humans. The question then becomes: what makes these problems different from traditional computational problems, which computers solve easily and humans have a hard time solving? The answer lies in the fact that all of these tasks involve "...associating objects in one set with objects in another set. These associations are known as mappings or transformations" [9]. For example, if a person sees a small, flying mammal in the night, he or she instantly processes the image and associates it with an animal known as a "bat." The individual probably learned this association while watching old Dracula movies as a child. The human mind is brilliant at forming these mappings, while algorithms are not. A new form of artificial intelligence, known as neural networks, holds the potential of solving transformation problems.
A fundamental difference between the philosophies of traditional artificial intelligence (AI) and neural networks involves the emphasis on logic versus perceptive abilities. Rules and their logical combinations usually drive AI expert systems. Neural networks, however, use the ability to recognize patterns through experience, rather than the deduction and recollection of rules [4].
The human brain is made up of approximately one hundred billion neurons (figure 1). Many cognitive scientists believe the massively interconnected neurons hold the collective consciousness of the human mind [7]. Although this has been disputed [6], it is known that the brain's neural structure is a key component of the visual and auditory systems in the brain [13].
Figure 1 - Neuron anatomy. Dendritic tree extends from the cell nucleus or soma. Longer branches are axons. Synapses bridge axons from other neurons to the cell membrane or its dendrites [9].
Considerable biological research has been done to unlock the mysteries of the neuron. Synapses are the bridges that connect a neuron to several other neurons. Through the synapses (approximately 1,000 per neuron), electrical signals in a "firing" neuron are transmitted as chemical signals to other neurons. These signals are classified as either excitatory or inhibitory. Excitatory signals increase the likelihood that a neuron will fire, while inhibitory signals decrease the likelihood that a neuron will fire. If the sum of the incoming signals in a neuron reaches a predefined threshold (i.e. there is enough excitation), the neuron will fire, sending signals out to other neurons. Collectively these neurons "process data" in a massively parallel architecture [7].
Artificial neural networks borrow many general concepts from the basic neural structure of the brain. In a neural network, the processing elements are the neurons and the interconnections are the synapses. A processing element generally receives input through interconnections, processes the input, and decides what value it should output, using an activation function. This output is then transferred as the input to other processing elements functioning in parallel execution. Although neural networks utilize these primitive neurobiological concepts, they cannot be considered a replication or even a simulation of the brain.
The processing elements in a network have a method of adapting or learning as the network processes its inputs. This is usually done by making adjustments to the processing elements and their interconnections as the network is being trained.
One of the first neural networks, the simple perceptron (figure 2), was introduced in 1962 by Rosenblatt. The perceptron, which is based on McCulloch and Pitts' model, generated a great deal of excitement in the artificial intelligence community because of its ability to learn the mapping between input and output data. Further, the Perceptron Learning Rule (figure 3), proposed by Rosenblatt, was proven to converge to a solution, if a solution existed for a given problem. In 1969, however, Minsky and Papert published Perceptrons, which proved that the simple perceptron was unable to solve some basic problems, such as the XOR function, because the simple perceptron cannot solve nonlinearly separable problems.
Figure 2 - Binary Threshold Simple Perceptron with three inputs.
A linearly separable problem is one in which a single hyperplane can be found in the pattern space separating the output +1 patterns from the output -1 patterns. If there are several output units, one must be able to find one such hyperplane for each output unit. Unfortunately, most problems are nonlinearly separable. This led to the near death of neural network research until the late 1970s.
Figure 3 - Simple Perceptron Learning Rule [10].
The multi-layer perceptron is a descendent of the simple perceptron. Unlike the simple perceptron, however, the multi-layer perceptron is capable of classifying nonlinearly separable data sets.
Figure 4 - Typical two-layer backpropagation architecture.
A typical multi-layer perceptron is illustrated in figure 4. This network is composed of several processing elements or nodes (figure 5) arranged in hierarchical layers. The input layer is where the input pattern is originally presented. This input is propagated up to the next layer, often a hidden layer, along its interconnections. Since there is no actual processing done in the input layer, it is generally not counted as a "layer". Each processing element in the hidden layer receives its given input and multiplies it by a predetermined weight (often initialized randomly at the onset of the training), which creates a weighted value. Much like the human neuron, positive weighted values will be excitatory and negative weighted values will be inhibitory. The processing element then sums all the weighted values. The resulting value plus a bias unit is presented to an activation function to determine the output from the processing element. This value is then propagated to the next layer. There may be several hidden layers, but typically there are only one or two. The output from the last hidden layer propagates to the processing elements in the output layer. These processing elements perform the same function as those in the hidden layer. The resulting output is the final output pattern from the network. Hence, the neural network has created a mapping from the original input pattern to an output pattern.
Figure 5 - Details of a Multi-layer Perceptron Processing Element, a descendent of the Simple Perceptron.
The processing elements in multi-layer perceptrons commonly utilize either a linear or a nonlinear (e.g. sigmoidal) activation function:
« f(x) = x (linear)
« f(x) = 1 / (1 + e^(-x)) (sigmoidal)
The sigmoidal function is often used when binary (1 or 0) output is desired. The sigmoidal function is well suited for this purpose "...since the sigmoid is output-limiting and quasi-bistable but is also differentiable" [8]. Nonlinear activation functions should be used when the transformation problem itself is nonlinear.
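As a concrete illustration, the following is a minimal C sketch of the computation performed by a single processing element as described above: a weighted sum of the inputs plus a bias, passed through the sigmoidal activation. The function and variable names are illustrative only and are not taken from the project code.

    #include <math.h>

    /* Sigmoidal activation: squashes any real input into the range (0, 1). */
    double sigmoid(double x)
    {
        return 1.0 / (1.0 + exp(-x));
    }

    /* One processing element: sum the weighted inputs plus a bias, then apply
       the activation function.  Positive weights act as excitatory inputs,
       negative weights as inhibitory inputs. */
    double processing_element(const double input[], const double weight[],
                              int n_inputs, double bias)
    {
        double sum = bias;
        for (int i = 0; i < n_inputs; i++)
            sum += weight[i] * input[i];
        return sigmoid(sum);
    }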
The multi-layer perceptron learns by using an enhancement of the generalized delta rule called the backpropagation algorithm. This method iteratively adjusts the weights of the processing elements in the network in order to achieve a mapping from input to output data. It does this by comparing the network's output to the target output for a given pattern. The error is computed and then backpropagated through the network, from the output layer to the first hidden layer. The learning rule adjusts the weights by determining how much each weight contributed to the final error. The process of forward propagation of data and backpropagation of error is repeated for every input-output pattern in the data set. This is called a cycle of training.
Unfortunately the backpropagation algorithm will not always converge to the global error minimum as does the Simple Perceptron Learning Rule [10]. As learning progresses in backpropagation, the network will often get "stuck" in a local minimum. In order to get out of this gully, a momentum value is often added to the backpropagation algorithm. The momentum keeps the network adjusting the weights in the direction they were moving before reaching the local minimum. If the network is in fact passing through the global minimum, the momentum value will decrease and eventually reverse. The network will then hopefully settle back into the global minimum. The enhanced delta rule, along with the entire backpropagation algorithm, is presented in the Appendix.
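For reference, one common way of writing the momentum-augmented weight update of the generalized delta rule (the exact algorithm used in this project is reproduced in the Appendix) is:

    \Delta w_{ij}(t) = \eta \, \delta_j \, o_i + \alpha \, \Delta w_{ij}(t-1)

where w_{ij} is the weight from unit i to unit j, \eta is the learning rate, \delta_j is the backpropagated error term at unit j, o_i is the output of unit i, and \alpha is the momentum constant applied to the previous weight change.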
Figure 6 - A cross-section of an error surface. The arrows along the curve illustrate the momentum.
A multi-layer perceptron using backpropagation cannot be looked upon as a black box. There are many parameters in the network that must be set "optimally" in order for the network to learn the input-to-output mapping successfully. These parameters include the following:
« the initial weights of the connections
« the learning rate and momentum constants
« the number of hidden layers
« the number of processing elements (nodes) in each hidden layer
« the interconnections between layers
« the activation function(s) for the processing elements
« the number of training cycles to perform on the network to achieve convergence
Most of these are data set specific, meaning the nature of the data set itself will determine how the parameters are defined. Often, however, the data set is not understood well enough to define these parameters. Further, the knowledge of how each parameter affects the outcome of the entire system is not well enough understood to develop standard heuristics for specifying the parameters. In this case, the neural network designer often has to "grope in the dark", performing test after test to try to find an optimal configuration of parameters.
Genetic algorithms (GA) were pioneered by John Holland during the mid-seventies [11]. The algorithm functions on the basic concepts of Darwinian evolution. There is a population of chromosomes (strings of data) each containing a number of genes (elements within the string). The genes are used to code the possible solutions to the problem. The algorithm assigns a fitness to each chromosome using some criteria defined by the domain of the problem. The chromosomes that are more fit are more likely to survive and create offspring. This basic process is known as survival of the fittest [5]. Offspring chromosomes are created by performing some operation on the surviving parent chromosomes. The operations include selection and crossover (recombination), followed by mutation and inversion [3].
The algorithm searches for an optimal solution in an iterative process. It begins with an initial random population of chromosomes. A fitness value is then assigned to each chromosome, and the GA sorts them in decreasing order of fitness. The GA then selects the parent chromosomes with a probability proportional to their fitness, so the more fit chromosomes are more likely to be paired together. The operators are then applied to the parent chromosomes in order to create the offspring. The offspring then become the population for the next generation. This process is repeated until some condition is met, such as a predetermined number of generations being executed or a good enough solution being obtained.
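The following is a minimal, self-contained C sketch of this generational loop. The toy fitness function (counting one-bits) merely stands in for the expensive network-training fitness used in this project, and the constants and rank-based pairing scheme are illustrative rather than the exact GAPROP settings.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define POP_SIZE      50
    #define GENES         28
    #define GENERATIONS   100
    #define MUTATION_RATE 0.001

    typedef struct {
        int    gene[GENES];          /* binary genes */
        double fitness;
    } Chromosome;

    /* Toy fitness: number of one-bits.  In GAPROP this step would build and
       train a network and return the reciprocal of the error polynomial. */
    static double evaluate(const Chromosome *c)
    {
        int ones = 0;
        for (int g = 0; g < GENES; g++)
            ones += c->gene[g];
        return (double)ones;
    }

    /* qsort comparator: highest fitness first. */
    static int by_fitness(const void *a, const void *b)
    {
        double fa = ((const Chromosome *)a)->fitness;
        double fb = ((const Chromosome *)b)->fitness;
        return (fa < fb) - (fa > fb);
    }

    int main(void)
    {
        Chromosome pool[POP_SIZE], next[POP_SIZE];
        srand((unsigned)time(NULL));

        for (int i = 0; i < POP_SIZE; i++)             /* random initial pool */
            for (int g = 0; g < GENES; g++)
                pool[i].gene[g] = rand() % 2;

        for (int gen = 0; gen < GENERATIONS; gen++) {
            for (int i = 0; i < POP_SIZE; i++)         /* assign fitness */
                pool[i].fitness = evaluate(&pool[i]);
            qsort(pool, POP_SIZE, sizeof pool[0], by_fitness);  /* sort by fitness */

            for (int i = 0; i < POP_SIZE; i += 2) {    /* fitter ranks breed */
                Chromosome a = pool[i / 2], b = pool[i / 2 + 1];
                int cut = rand() % GENES;              /* single-point crossover */
                for (int g = cut; g < GENES; g++) {
                    int t = a.gene[g]; a.gene[g] = b.gene[g]; b.gene[g] = t;
                }
                next[i] = a;
                if (i + 1 < POP_SIZE)
                    next[i + 1] = b;
            }

            for (int i = 0; i < POP_SIZE; i++)         /* mutation (noise) */
                for (int g = 0; g < GENES; g++)
                    if (rand() / (double)RAND_MAX < MUTATION_RATE)
                        next[i].gene[g] = 1 - next[i].gene[g];

            for (int i = 0; i < POP_SIZE; i++)         /* offspring become new pool */
                pool[i] = next[i];
        }

        for (int i = 0; i < POP_SIZE; i++)
            pool[i].fitness = evaluate(&pool[i]);
        qsort(pool, POP_SIZE, sizeof pool[0], by_fitness);
        printf("best fitness after %d generations: %g\n", GENERATIONS, pool[0].fitness);
        return 0;
    }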
There are several considerations that need to be addressed in the use of genetic algorithms to evolve neural network topologies. These considerations include the chromosome representation of the topology, the genetic operators, and the fitness function used to gauge the performance of the network.
Our initial problem was to find a suitable representation for encoding a multi-layer perceptron topology in a chromosome. Two representations were investigated, an approach called L-Systems [2] and a direct encoding approach [1].
An L-System is a parallel string rewriting mechanism designed by Aristid Lindenmayer in 1968 in order to model the biological growth of plants. The system has also been applied to model the growth of neural networks [2]. The approach involves encoding production rules in the chromosomes. These production rules are evolved with the genetic algorithm and then run through an L-System to create a neural network topology. The system will create multi-layer perceptrons with varying numbers of connections (not necessarily fully connected). Basically, the production rules represent a recipe (genotype) for a given network topology instead of the topology itself (phenotype). The topology is then tested to determine its fitness, and this fitness is mapped back to the original production rule(s). The idea is that each production rule represents a chunk or module of a neural network. The rules will breed to combine modules. Hence, the network is constructed in a modular fashion.
The direct encoding approach consists of encoding the actual topology specifically in the chromosome [1]. In this way, the parameters of the neural network topology are actual fields of genes in the chromosome. For example, the number of hidden nodes could be encoded as a bit string.
Due to our limited temporal resources, we chose the direct encoding method because of its simplicity and ease of implementation. The L-Systems approach is complex and was originally a two-student master's thesis. However, we feel that the L-Systems approach may be a better way to represent the network topology. When a genetic operation is performed on a direct-representation chromosome, there is no guarantee that any features of the original network topology are preserved, because the actual parameters which created the network topology are being altered. In the L-System, however, only production rules are being altered, not the parameters themselves.
The general procedure of any genetic algorithm begins by generating an initial pool of chromosomes, usually setting each gene to a random value in some range. In this project, genes assume binary values (0 or 1).
A cycle of operations is then performed a predetermined number of times. Each cycle is referred to as a generation. The first step in each generation is to compute the fitness of each chromosome. In the present application, this operation is performed by constructing the neural network prescribed by each chromosome, and training the network on the target data set. A test data set may also be applied to the network.
Figure 7 - Fitness Function
The fitness of each chromosome and its resulting neural network is measured in four parameters: training error, test error (if applicable), complexity of the network (number of units), and number of training cycles required for convergence. The actual fitness function consists of a polynomial with four independent terms, one for each of the above parameters (figure 7). These terms are weighted using coefficients that can be tweaked as the user sees fit. The optimal coefficients are likely to be domain specific. For example, for a given target data set, the system may create networks which are very good at reducing training error but do not reduce the complexity of the network to an acceptable level. It is then advisable to increase the coefficient of the complexity term relative to the training error term. Since we are interested in minimizing each of the four critical parameters, the reciprocal of the polynomial is taken to be the fitness. The task of the system is then to maximize the fitness function. This is actually an arbitrary choice but it is conceptually helpful to view the task in terms of maximizing fitness.
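Figure 7 is not reproduced in this section, but a fitness function of the kind described would take roughly the following form, where c_1 through c_4 are the user-tunable coefficients (this is an illustrative form; the exact polynomial of figure 7 may differ):

    \text{fitness} = \frac{1}{c_1 E_{\text{train}} + c_2 E_{\text{test}} + c_3 N_{\text{units}} + c_4 N_{\text{cycles}}}

Raising any one coefficient penalizes that term more heavily; for example, increasing c_3 favors smaller networks even at the cost of slightly higher training error.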
The next step of each generation is to sort the chromosome pool in order of increasing fitness. The sort operation is performed, in place, using a conventional quicksort procedure. By sorting the chromosome pool, chromosomes that are likely to recombine are brought into close proximity. The actual pairs of chromosomes that will undergo the recombination process are determined by the selection task.
Selection refers to designating two chromosomes, both fit enough according to some criterion and of similar fitness (figure 8), to undergo the recombination process. To keep some element of chance, selection is stochastic. Chromosomes that are not fit enough are eliminated from the pool, i.e. they become extinct. In the present application, selection proceeds in two steps. In both steps, only chromosomes which have not yet been selected in the current generation and are sufficiently fit (described in more detail later in this section) are considered. In step (a), the fittest chromosome in the pool is deterministically selected. In step (b), another chromosome is selected stochastically by proceeding through the pool, beginning with the fittest chromosome, and selecting with some probability called the selection rate. This rate was established at 80% early in the project and was not modified. The selection task results in a set of chromosome pairs that will undergo recombination.
Recombination is simulated by the crossover operation. In crossover, the chromosomes of a selection pair exchange some contiguous range of genes in a stochastic manner. This exchange maintains the position of each gene, i.e. if the 5th gene (numbered from left to right) of one chromosome is designated for crossover, it is exchanged with the gene at position 5 of the other chromosome.
Figure 8 - Genetic Operations
The final operations to be performed during any generation introduce noise into the system. Introducing noise is important in any optimization task to avoid becoming trapped in local maxima or minima. The most common method of introducing noise in genetic algorithms is via mutation and inversion. In mutation, there is a specified probability that any given gene will flip over to its opposite binary value. In inversion, there is a specified probability that each gene in an entire chromosome will flip over to its opposite binary value. In both cases of mutation and inversion, the probabilities were established at 0.1% early in the project and were not modified.
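The two noise operators can be sketched in C as follows. The prose above leaves some room for interpretation; this sketch treats mutation as an independent flip of each gene and inversion as a flip of every gene in the chromosome, each triggered with the stated 0.1% probability.

    #include <stdlib.h>

    #define NOISE_RATE 0.001   /* the 0.1% probability used in this project */

    static double frand(void)
    {
        return rand() / (double)RAND_MAX;
    }

    /* Mutation: each gene independently flips to its opposite binary value. */
    void mutate(int gene[], int n_genes)
    {
        for (int g = 0; g < n_genes; g++)
            if (frand() < NOISE_RATE)
                gene[g] = 1 - gene[g];
    }

    /* Inversion: with the same probability, every gene in the chromosome flips. */
    void invert(int gene[], int n_genes)
    {
        if (frand() < NOISE_RATE)
            for (int g = 0; g < n_genes; g++)
                gene[g] = 1 - gene[g];
    }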
The chromosomes that result from the above cycle constitute a new gene pool and the process is repeated unless the specified number of generations has been reached.
The objective of this project is to build a basic system that evolves backpropagation topologies for a specific task using genetic algorithms.
The system is composed of a genetic algorithm engine and a neural network simulator. The system also contains an intermediate interface to connect the GA to the neural network simulator (figure 9). This interface decodes a chromosome into a backpropagation topology recognizable by the simulator. The simulator then trains the neural network until either a predetermined number of iterations has been run or a stopping condition is reached. The simulator then determines the fitness of the network based on its training error, testing error, complexity, and convergence time. This value is passed back to the interface and assigned to the chromosome. The process is repeated for all chromosomes in the population. The genetic algorithm is then called to generate a new population of chromosomes. The entire loop continues for a predetermined number of generations.
Figure 9 - System Diagram
The module will build, train, and test fully connected multi-layer perceptrons using the backpropagation learning algorithm described in the Appendix.
The module input includes:
« the number of inputs and outputs
For efficiency, the data set files for training and optional testing patterns are loaded only once at the onset of the GAPROP code. The output patterns are scaled between 0 and 1 in the GAPROP code because the processing elements use a sigmoidal activation function. The backpropagation module uses this data each time it is called.
« the number of hidden layers (zero, one, or two)
The module was implemented so it could handle any size network, with the only limitations due to memory constraints. However, because a three-layer network (two hidden layers) can form any arbitrary decision boundary [10], the GAPROP code specifies either zero, one, or two hidden layers to the backprop module.
« the number of nodes per layer
The module will handle any specified number of nodes in each layer.
« the learning rate
« the momentum term
« an error limit
The error limit is essentially a stopping condition. If the network reaches the error limit before the maximum number of cycles is run, the training will terminate. The maximum number of cycles, which defaults to 10,000 cycles, can be defined by the user by using command-line arguments.
The module will then build the network from the topology definition. The weights are initialized to values between -1 and 1. Training of the network then occurs. During training, the module presents the input patterns incrementally in a pseudo-random fashion to make the path through the weight-space stochastic. "This allows a wider exploration of the cost surface" [10].
The code was written to maximize performance. In all functions except the backpropagation-of-error function (where it could not be avoided), the innermost loops iterate over the second dimension of the two-dimensional arrays. This reduces the number of memory jumps the code has to perform, since C stores two-dimensional arrays in row-major order.
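To make the access pattern concrete, here is a small hypothetical example of the loop ordering: the rightmost (second) index is the one that varies in the innermost loop, so successive iterations touch adjacent memory locations. The layer sizes are illustrative, not GAPROP's.

    #define N_TO   4
    #define N_FROM 8

    /* Net input to each unit of a layer: w[j][i] connects input i to unit j.
       Iterating i in the inner loop walks w[j][0..N_FROM-1] contiguously,
       since C lays the array out in row-major order. */
    void net_inputs(const double w[N_TO][N_FROM], const double in[N_FROM],
                    double net[N_TO])
    {
        for (int j = 0; j < N_TO; j++) {
            net[j] = 0.0;
            for (int i = 0; i < N_FROM; i++)
                net[j] += w[j][i] * in[i];
        }
    }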
After the stopping condition is reached, the module propagates the training data through the network one last time to compute the final training error or resubstitution error. This gauges how well the network learned the given data set. The module will then test the network for generalization if a test data set is available.
The module will output a variety of information as it learns and tests networks. The amount of information that the module generates can be defined by the user with command-line arguments.
The network output includes:
« the number of cycles the network trained
« the training error
The training error is the least-squared error between the network's output after training and the desired output when run on the training data set.
« the test error
The testing error is the least-squared error between the network's output after training and the desired output when run on the test data set. If no testing data is available, the test error is equal to zero.
« the complexity
The complexity of the network is defined as the number of nodes in the network, including input and output nodes.
« the fitness value
The fitness function is defined in the previous section.
Upon exiting the module, all memory that has been allocated for the current network topology is freed. This was deemed essential since many thousands of network topologies may be constructed during one execution of GAPROP.
The development and effective utilization of a genetic algorithm approach to solving optimization problems is, at best, an inexact science. The exact details of any practical genetic algorithm are highly domain specific, and difficult to determine in a deterministic fashion prior to actual implementation.
It was therefore deemed that the best approach to developing an effective genetic algorithm for the task at hand was to implement a trial version of the system, and modify the design as needed to improve the performance of the system. As a result, three versions of the system were developed, each providing improved performance over its predecessor. Further, modifications of the final version are proposed, forming the basis of a hypothetical version 4.
The initial version of the genetic algorithm module used in this project used a chromosome pool of size 50. Each chromosome comprised a string of 28 genes, where in particular, 8 genes were used to specify the learning rate eta, 8 genes to specify the momentum alpha, 4 genes to specify the number of units in hidden layer 1, 4 genes to specify the number of units in hidden layer 2, and 4 genes to specify the minimum training error. This training error constitutes the stopping condition for training the network. Once this error threshold is achieved, the network is considered to have converged.
Eta and alpha are identically computed from the chromosome in a quasi-binary-coded-decimal fashion. The 8 bits (genes) of each parameter are interpreted as a pair of 4-bit fields. In each case, the first 4 bits represent the tenths decimal place, while the second set of four bits represents the hundredths decimal place. Since each field can range in decimal value from 0 to 15, the first field of four bits contributes a value between 0.0 and 1.5 to eta or alpha, while the second field contributes a value between 0.00 and 0.15. These two contributions are summed to obtain values between 0.00 and 1.65 (precision down to the hundredths place).
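A sketch of this decoding in C, assuming the more significant gene of each 4-bit field comes first (the actual bit ordering used in GAPROP is not specified here):

    /* Decode an 8-gene field into eta or alpha: the first four genes give the
       tenths contribution (0..15 -> 0.0..1.5), the last four the hundredths
       contribution (0..15 -> 0.00..0.15); the two contributions are summed. */
    double decode_rate(const int gene[8])
    {
        int tenths = 0, hundredths = 0;
        for (int b = 0; b < 4; b++) {
            tenths     = (tenths     << 1) | gene[b];
            hundredths = (hundredths << 1) | gene[b + 4];
        }
        return tenths / 10.0 + hundredths / 100.0;   /* range 0.00 .. 1.65 */
    }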
The number of units in each hidden layer is computed via direct binary-to-decimal conversion. Since four bits (genes) are devoted to each layer, the number of possible units in each layer is constrained to be between 0 and 15. A value of 0 implies the absence of that hidden layer. If exactly one of the two fields is 0, the chromosome is interpreted as a network with a single hidden layer containing the number of units given by the non-zero field. The case of both hidden layers having zero units is interpreted as a one-layer network (output layer only). Of course, the number of units in the input and output layers is determined directly from the given data set.
The training error threshold constituting convergence is computed by direct binary to decimal conversion of its four representative bits (genes). This value is divided by ten so that the training error threshold can assume a value between 0.0 and 1.5 (precision down to tenths place).
Crossover in version 1 consisted of stochastically selecting one position on the chromosomes (a different position for each pair), and exchanging all genes to the right of and including that position. All chromosomes in the pool were eligible for crossover and were recombined exactly once for each generation. Every chromosome in the pool was consequently replaced by one of the two resulting chromosomes from its recombination.
Five major problems were identified in the performance of version 1.
1. Eta could attain a value greater than unity, which conventional wisdom considers too large for effective learning in multi-layer perceptrons.
To solve this problem, the more significant field representing eta was decreased to 3 bits. This reestablished the range of possible learning rates to 0.00 to 0.85. The same was done to the momentum term alpha (see discussion in section 5.2.4). This decreased the overall size of the chromosomes to 26 genes (figure 10).
Figure 10 - Chromosome Representation
2. Good solutions were found and then discarded by the system during a subsequent generation. This was due to the nature of the crossover process: every chromosome was replaced during each generation by a new chromosome.
To solve the problem, the pocket algorithm was introduced to keep track of the 5 fittest chromosomes seen through all generations.
3. The most unfit chromosomes were carried from one generation to the next and allowed to recombine. It is unlikely, since these chromosomes constituted the very bottom of the pool and were characterized by very low fitness values, that these samples contributed anything useful to future generations. Moreover, for this very reason, most useful genetic algorithms allow some amount of extinction to occur during each generation.
For this reason, it was decided to render extinct the lower 50% of the chromosome pool during each generation. These chromosomes were replenished as described in item 4 below.
4. There was concern about two aspects of the crossover operation. First, there was some concern about the possibility of good parameters being destroyed during the crossover operation, by performing crossover at a position other than one on the boundary between two parameters.
Second, there was concern about all genes being exchanged from the crossover position to the end of the chromosome. While a chromosome may benefit from modifying or exchanging a parameter in the middle of the chromosome, it may lose significant performance by mandatorily exchanging all parameters to the right. Expressed slightly differently, it was formerly impossible to improve one parameter that needs improvement, say for example alpha, and maintain parameters to the right of alpha on the chromosome, such as the number of units in hidden layer 2, though these parameters may already be optimized.
Two modifications of the crossover operation were used to overcome these difficulties. First, an additional type of crossover was introduced that exchanged exactly one parameter chosen at random. That is, two crossover positions were selected, together delimiting one parameter field. All genes between these positions are exchanged while all other genes are maintained.
The second modification was to perform conventional crossover, only now by selecting two crossover positions at random, and exchanging only the genes between them. This type of crossover, however, does not necessarily select crossover positions that delimit a parameter field.
During each generation, both of these crossover types are performed for each selected pair. Thus, beginning with two parent chromosomes, we perform both types of crossover resulting in 4 daughter chromosomes. This exactly replenishes the lower 50% of the pool that was rendered extinct at the onset of the current generation.
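The two crossover types might be sketched in C as follows. The field boundaries are an assumption based on the 26-gene layout of figure 10 (7 genes for eta, 7 for alpha, 4 each for the two hidden-layer sizes and the error threshold); the actual ordering on the chromosome may differ.

    #include <stdlib.h>

    #define GENES    26
    #define N_FIELDS 5
    /* Assumed field boundaries: eta | alpha | hidden1 | hidden2 | error limit. */
    static const int field_start[N_FIELDS + 1] = { 0, 7, 14, 18, 22, 26 };

    static void swap_genes(int a[], int b[], int lo, int hi)   /* range [lo, hi) */
    {
        for (int g = lo; g < hi; g++) {
            int t = a[g]; a[g] = b[g]; b[g] = t;
        }
    }

    /* Type 1: exchange exactly one whole parameter field, chosen at random. */
    void crossover_field(int a[GENES], int b[GENES])
    {
        int f = rand() % N_FIELDS;
        swap_genes(a, b, field_start[f], field_start[f + 1]);
    }

    /* Type 2: pick two positions at random and exchange the genes between them;
       the positions need not fall on a parameter boundary. */
    void crossover_two_point(int a[GENES], int b[GENES])
    {
        int p1 = rand() % GENES, p2 = rand() % GENES;
        if (p1 > p2) { int t = p1; p1 = p2; p2 = t; }
        swap_genes(a, b, p1, p2 + 1);
    }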
5. The immediate consequence of the new type of crossover was that we had no way of knowing the relative benefit of each crossover type. That is, which crossover type was most likely, if any, to improve the overall population? For that matter, how useful were mutation and inversion?
As a result of these intriguing questions, tags were added to each chromosome tracking the average number of each operation seen throughout the evolution of the chromosome.
Three major problems were identified in the performance of version 2.
1. After testing the system on the boolean AND data set, it became clear that the system was not producing networks with no hidden layers, an ability that should be considered mandatory in any system that purports to find optimal network architectures. It was realized that this was due to a fundamental property of the crossover operation. Since crossover is strictly orthogonal, i.e. genes can only be exchanged for one another if they are at the same position, it cannot independently create parameter fields consisting of all zero bits unless a zero sample of each bit in the parameter field already exists in the population. Even then, these samples must find their way into the same chromosome to produce the desired result.
Mutation could theoretically result in a parameter field consisting of all zero bits, but the probability of this occurring would be very small. Moreover, even if it did occur, one additional mutation within that field could disrupt the zero field. At any rate, even if mutation did result in a field consisting of all zero bits, this would be due to dumb luck rather than the employment of an intelligent problem solving heuristic. We demand more than dumb luck from our system.
Note that inversion could also produce a field of zero bits only if there is at least one sample of a one bit at each position in the parameter field somewhere in the population, which of course is just as unlikely as having one sample of a zero bit at each position. Moreover, once again, such a zero field would be the product of dumb luck.
To remedy this problem, we introduced a third type of noise into the system. As in the case of mutation and inversion, this type of noise occurs with probability 0.1% per chromosome generation. This noise we affectionately call "zapping" for lack of a better term. If a "zap" occurs on a chromosome, one parameter field, chosen at random, is set to all zero bits or all one bits with 50% probability either way. In general, this type of noise assures that the extreme limits of the possible range of values for any parameter are adequately represented during evolution.
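A sketch of the zap operator in C, reusing the assumed field boundaries from the crossover sketch above (again an assumption about the gene layout):

    #include <stdlib.h>

    #define GENES    26
    #define N_FIELDS 5
    static const int field_start[N_FIELDS + 1] = { 0, 7, 14, 18, 22, 26 };

    /* Zap: choose one parameter field at random and set every gene in it to
       all zeros or all ones, with equal probability either way. */
    void zap(int gene[GENES])
    {
        int f = rand() % N_FIELDS;
        int value = rand() % 2;
        for (int g = field_start[f]; g < field_start[f + 1]; g++)
            gene[g] = value;
    }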
2. It was judged that the level of extinction exacted in version 2 seemed more like the result of a plague than natural extinction. Deterministically eliminating the bottom 50% of the pool was replaced by the milder stochastic extinction of 20% of the population. This extinction technique was implemented by traversing the chromosome pool starting at the bottom (lowest fitness) and rendering each chromosome extinct with 50% probability. This continues until 20% of the pool is rendered extinct. Extinction implies two things: (a) the chromosome will definitely be replaced by a new chromosome before the end of the generation, and (b) it may not be selected for recombination.
3. The average fitness of the chromosome population improved over the generations, but not in a monotonically increasing fashion. The average fitness might improve one generation and then decrease the next.
In an attempt to improve the stability of the system, and achieve monotonically increasing average fitness of the overall population, it was decided to increase the pocket size to 20% of the size of the population and maintain this pocket in the population through all generations. Thus, the very best breeding stock over the course of the entire evolution remains available at all times to pass their features on through genetic operations.
The results of this modification appear to be exactly what we expected: monotonically increasing average population fitness.
Note that the extinction process described in item 2 above does not impinge upon the pocket. If the pool traversal process reaches the pocket before marking its quota of 20%, it wraps around to the bottom of the pool again and continues until it has its quota.
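A sketch of this extinction pass in C. It assumes the pool is indexed from least fit (index 0) upward, with the pocket occupying the fittest 20 slots; the wrap-around behaviour follows the description above.

    #include <stdlib.h>

    #define POP_SIZE 100
    #define POCKET    20   /* fittest chromosomes, never rendered extinct */
    #define QUOTA     20   /* 20% of the pool is marked each generation   */

    /* Walk upward from the least fit chromosome, marking each unmarked one
       with 50% probability, until the quota is reached.  If the walk reaches
       the pocket, it wraps around to the bottom of the pool and continues. */
    void mark_extinct(int extinct[POP_SIZE])
    {
        int marked = 0, i = 0;
        for (int j = 0; j < POP_SIZE; j++)
            extinct[j] = 0;
        while (marked < QUOTA) {
            if (!extinct[i] && rand() % 2) {
                extinct[i] = 1;
                marked++;
            }
            if (++i >= POP_SIZE - POCKET)   /* reached the pocket: wrap around */
                i = 0;
        }
    }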
In addition to the above modifications, the chromosome population size was increased to 100 for version 3. All of these modifications called for a drastic change in selection and crossover strategy.
The new and current strategy is as follows:
a. Mark 20 of the 100 chromosomes for extinction using the process described in item 2 above. Once again, despite the fact that extinction is stochastic, with probability inversely proportional to fitness, no chromosome in the pocket can ever be rendered extinct; a pocketed chromosome may only be replaced if an unpocketed chromosome achieves a fitness great enough to displace it from the pocket.
b. Select chromosome pairs for recombination in the same manner as described in version 1 (although chromosomes marked for extinction may never be selected). If the search for the stochastic counterpart of a recombination pair reaches the bottom of the pool, the search wraps around to the top and continues.
c. For each pair, either crossover of type 1 or type 2 is performed with equal probability either way. This results in two daughter chromosomes, which replace two chromosomes that are not in the pocket and that have not yet recombined (if eligible to do so).
Thus, 80 out of 100 of the fittest chromosomes are stochastically selected with a deterministic constraint of selecting all chromosomes in the pocket. The recombination of these selected chromosomes result in 80 daughter chromosomes which deterministically replace the lowest 80 chromosomes (figure 11).
Figure 11 - Chromosome Pool
The following ideas were compiled for version four (not implemented):
1. Increasing the number of genes dedicated to the momentum term alpha back to eight, thus allowing it to assume values from 0.00 to 1.65 rather than merely the range 0.00 to 0.85 [10]. In retrospect, though it made sense to constrain eta to 0.85, we do not know why alpha was included in this modification decision since momentum up to 1.65 is more than acceptable in multi-layer perceptrons, and in fact sometimes useful.
2. It may have been fun to allow the amount of noise in the system to vary depending on the current state of the solution search in the problem space. That is, if the overall improvement rate of the average population fitness fell below a certain threshold or the pocket did not change for some number of generations, then the probability of injecting noise into the system using mutation and inversion would increase for a period of time. This technique may serve to pop the system out of local maxima and minima.
3. Another idea was to maintain two separate populations which could follow their independent fates and then combine the two at the end of the process. Each population would hopefully bring beneficial features from different points in the problem space [12].
4. Obviously, both genetic algorithms and multi-layer perceptrons lend themselves to parallel implementation. It might have been interesting to attempt one. In fact, a parallel implementation is a necessity when the code is run on large data sets for large numbers of generations.
The summarized results obtained from several experiments run with Version 3 of the code follow. The actual batch outputs are included in the Appendix.
Boolean operators were used to test the validity of our methods. If the genetic algorithm could not evolve optimal AND and XOR network topologies, there was no hope of solving more complex problems. In particular, AND is well known to be linearly separable and hence can be classified with a single-layer perceptron. XOR, on the other hand, is nonlinearly separable and therefore requires at least one hidden unit for proper classification.
Best network evolved for the boolean AND data set:

    Topology          No Hidden Layers
    eta               0.85
    alpha             0.85
    Error             0.0
    Training error    0.204527
    Training Cycles   1001
    Complexity        3
    Fitness           0.001651
    Generation        89
    Inversions        2
    Mutations         9
    Zaps              6
    Cross1            11
    Cross2            14
Best network evolved for the boolean XOR data set:

    Topology          1 Hidden Layer with 2 nodes
    eta               0.84
    alpha             0.85
    Error             0.0
    Training error    0.185278
    Training Cycles   1001
    Complexity        5
    Fitness           0.001272
    Generation        44
    Inversions        0
    Mutations         13
    Zaps              6
    Cross1            14
    Cross2            13
The 80x data set and the IRIS data set are nonlinearly separable. Due to the limitations of the sequential implementation of our code (slow execution time), 20% subsets of 80x and IRIS were used in testing. The subset of 80x was, however, linearly separable.
Best network evolved for the 80x subset:

    Topology          No Hidden Layers
    eta               0.03
    alpha             0.81
    Error             0.0
    Training error    0.202332
    Training Cycles   1001
    Complexity        9
    Fitness           0.000831
    Generation        45
    Inversions        1
    Mutations         4
    Zaps              2
    Cross1            17
    Cross2            18
Best network evolved for the IRIS subset:

    Topology          1 Hidden Layer with 2 nodes
    eta               0.77
    alpha             0.71
    Error             0.100
    Training error    0.315290
    Training Cycles   180
    Complexity        7
    Fitness           0.000967
    Generation        51
    Inversions        2
    Mutations         11
    Zaps              5
    Cross1            10
    Cross2            10
The 80x data set and the IRIS data set are nonlinearly separable. 80% of the entire data set was used for training and 20% was used for testing the generalization ability of the network topologies. Unfortunately, only the first 24 to 50 generations are shown below. A rough estimate determined that running the data set for the desired 1,000 generations would take over 10 days of processing time on a Sparc 20.
1. This study has emphasized the importance of several attributes of genetic algorithms in achieving good performance on an optimization task. The implementation of these concepts yielded much improved performance between versions. Among them:
a) Stochastic extinction seems to be an important step during every generation. The number of chromosomes to render extinct is probably a particularly important consideration.
b) Performing crossover between two points on the chromosome, selected at random, seems to yield better performance than exchanging all genes from one point to the end of the chromosome. The two-crossover-point approach affords a more robust exchange of information between chromosomes, since all possible substrings of the chromosome can be exchanged rather than only a suffix of arbitrary length.
c) Using the pocket algorithm and reinjecting the pocket into every generation seems to add stability to the system, and keeps the evolution process moving towards the goal of optimization.
d) Adding domain-specific elements to the general genetic algorithm can be both necessary and useful. This was demonstrated by the use of the zapping operator to facilitate wider exploration of possible network topologies, in particular those with no hidden units.
2. The necessity of a parallel implementation of a system such as this is now very clear. Although the system ran very well, and reasonably fast, on boolean data sets, data sets of significant size or dimension (such as 80x and IRIS) can result in impractical running times. This is mainly a result of the explosive temporal complexity of the backpropagation training required to obtain the fitness value.
3. It appears that either the initial weight set or local minima play a far more significant role in determining the fate of a multi-layer perceptron than was previously believed. This was evident from the resulting neural networks: some classified the data sets very well while others performed very poorly, yet these networks often had nearly identical eta, alpha, network topologies, and training cycles. The only parameters which could account for the differences in performance are the initial weights and the values returned by the random number generator in association with stochastic operations. It may be useful to encode a prescription of the initial weight set on the chromosome. See section 8.5 for actual batch results. Another possible solution would be to provide a means for evolving less than fully-connected networks. This method was used by Boers and Kuiper in their master's thesis. The idea is that creating different non-fully-connected network topologies can mould the weight-space in such a way that all local minima disappear, so that no matter where training starts, there will always be a clear path from that starting point to a global minimum in weight-space.
4. Although some of the "fittest" networks actually performed very poorly when built and tested again with different initial weights, there were always good networks somewhere in the pocket. This suggests another benefit of using genetic algorithms: a genetic algorithm can be seen as constituting a batch trial-and-error approach in which one can ultimately select any one of the networks in the final population if he or she finds the "fittest" one unacceptable for some reason.
5. There seemed to be some merit in using the second crossover type, where intact parameters are exchanged between chromosomes rather than arbitrary, randomly determined substrings. This can be determined by examining the history of some of the fittest chromosomes as indicated by their tags, which tracked the average number of operations seen over the generations. This suggests that encoding parameters in binary may not be necessary and that genes with decimal values may work very well. This is supported by Sergio Margarita: "The popular belief that binary representations is necessary in genetic algorithms is a myth" [14].
6. The system was very good at determining optimal network topologies, as evidenced by obtaining solutions with no hidden units for boolean AND and a subset of 80x, and solutions with one hidden layer containing a minimum of units for boolean XOR and a subset of the IRIS data set.
7. In all cases, the rate of improvement of fitness was very high at first, followed by a rapid decrease in the rate of change. This implies that the genetic algorithm's overall population fitness was approaching some sort of maximum, be it local or global, and continued to approach that maximum asymptotically. The fitness never decreased, however, throughout the generations. This result supports our enhancement idea for Version 4, in which we increase the rate of mutation when the rate of change of fitness decreases in order to introduce chromosomes in a different part of the problem space [12].
Backpropagation - Backpropagation is the most commonly used network learning algorithm because of its flexibility, ease of use, and reliability. This network is composed of several processing elements or nodes arranged in hierarchical layers. The input layer is where the input pattern is originally presented. This input is propagated up to the next layer, often a hidden layer, along its interconnections. Each processing element in this layer receives its given input and multiplies it by a predetermined weight (often initialized randomly at the onset of the training), which creates a weighted value. Much like the human neuron, positive weighted values will be excitatory and negative weighted values will be inhibitory. The processing element then sums all the weighted values. The resulting value plus a bias unit is presented to an activation function to determine the output from the processing element. This value is then propagated to the next layer. There may be several hidden layers, but typically there is only one or two. The output from the last hidden layer propagates to the processing elements in the output layer. These processing elements perform the same function as those in the hidden layer. The resulting output is the final output pattern from the network. Hence, the neural network has created a mapping from the original input pattern to the output pattern.
Binary - Binary is a term for the base 2 number set (1 or 0) often used in the computer world.
Chromosome - A chromosome is a bit string representing various aspects of a problem. The genetic algorithm performs operations on the chromosomes. Each chromosome has a fitness associated with it.
Connection - A connection is a signal transmission pathway between processing elements, corresponding to the axons and synapses of neurons in a human brain, that connects the processing elements into a network.
Crossover - Crossover is a process in which a certain set of features of two chromosomes are exchanged.
Cycle - A Cycle represents one complete iteration of the entire data set through the forward and backward propagation in a multi-layer perceptron.
Daughter Chromosome - A daughter chromosome is a chromosome that results from the process of crossover.
Delta Rule - The Delta Rule is the mathematical basis for the backpropagation learning algorithm. It was originally derived from the Least Mean Squared (LMS) rule. It attempts to iteratively adjust processing elements' weights in a network in order to achieve a minimum error between the network's output (given the input parameters) and the target output.
Derivative - A method used in Calculus to find the rate of change of a function.
Extinction - The death of certain unfit chromosomes in a chromosome pool.
Generation - A generation is one complete iteration of a genetic algorithm creating an offspring population from a parent population.
Genotype - Genotype is the set of genes describing a recipe for the Phenotype (ex: DNA).
Learning Algorithm - A learning algorithm is a means by which a neural network adapts or learns. It can be classified as a supervised or unsupervised algorithm. There are hundreds of different learning algorithms. The backpropagation is an example of a supervised learning algorithm.
Learning Rate - A learning rate is a coefficient (between 0 and 1.0) used in the Generalized Delta Rule (backpropagation learning algorithm) to control how fast the network's weights are changed.
Momentum - A modification to the Generalized Delta Rule to prevent the network from getting caught in local error minima. The momentum coefficient (between 0 and 1.0) controls how large the momentum adjustment will be.
Mutation - The process of flipping a bit randomly in a chromosome bit string.
Neural networks - Neural networks are a form of artificial intelligence that models the basic structure of the human brain's neural structure. Neural networks use the ability to recognize patterns through experience (learning), rather than the deduction and recollection of rules (programming).
Noise - Noise refers to a slight corruption of data. It is analogous to a TV channel coming in with a little static on your TV set. You can recognize what the image is supposed to be, yet it is not perfectly clear.
Phenotype - The phenotype is the manifestation of the genotype and the environment.
Population - The population is the complete set of chromosomes the genetic algorithm is working on.
Pool - see Population.
Processing Element - A processing element (PE) is an artificial neuron in a neural network, consisting of a small amount of local memory and processing power. The output from a PE is transferred through the PE's output connections to other PEs, where it becomes input.
Recombination - see Crossover.
Sigmoidal - A logistic function that has the basic shape of an S-shaped curve. A sigmoidal function is often used in neural networks as the activation function of a processing element.
Sequential Algorithm - A sequential algorithm is a step-by-step list of instructions. It is processed sequentially one step at a time by a computer.
Weights - A weight is an adaptive coefficient associated with a single input connection between two processing elements. The weight determines the intensity of the connection, depending on the network's design and the information it has learned.
[1] Bellwood Research Center. "Optimizing Neural Network Architecture with Genetic Algorithms." Bellwood Research Center, August 1993.
[2] Boers, Egbert; Kuiper, Herman. "Biological Metaphors and the Design of Modular Artificial Neural Networks." Unpublished master's thesis, Leiden University, the Netherlands, August 1992.
[3] Booker, L.B.; Goldberg, D.E.; Holland, J.H. "Classifier Systems and Genetic Algorithms." Artificial Intelligence, Vol. 40, September 1989.
[4] Burke, Laura Ignizio. Introduction to Artificial Neural Systems for Pattern Recognition. Bethlehem, PA: Lehigh University, 1989.
[5] Darwin, Charles. On the Origin of Species. London: John Murray, 1859.
[6] Donald, Merlin. Origins of the Modern Mind. Cambridge, MA: Harvard University Press, 1991.
[7] Fetzer, James H. Philosophy and Cognitive Science. New York, NY: Paragon House, 1991.
[8] Freeman, James A.; Skapura, David M. Neural Networks: Algorithms, Applications, and Programming Techniques. Reading, MA: Addison-Wesley, 1991.
[9] Hecht-Nielsen, Robert. "Neurocomputing: Picking the Human Brain." IEEE Spectrum, Vol. 25, No. 3, March 1988.
[10] Hertz, John; Krogh, Anders; Palmer, Richard G. Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley, 1991.
[11] Holland, John H. Adaptation in Natural and Artificial Systems. Ann Arbor, MI: The University of Michigan Press, 1975.
[12] Kennedy, Scott A. "Five Ways to a Smarter Genetic Algorithm." AI Expert, December 1993.
[13] Kuehn, David P.; Lemme, Margaret L.; Baumgartner, John M. Neural Bases of Speech, Hearing, and Language. Boston, MA: Little, Brown, 1989.
[14] Margarita, Sergio. "The Towers of Hanoi: A New Approach." AI Expert, March 1993.