### Introduction

Randomized optimization is a collection of optimization techniques for computing global minima of functions that are otherwise difficult to optimize. What sets these algorithms apart is their use of randomness to broaden the search, which keeps them from halting at the first local minimum they find. In this paper, we will discuss four common randomized optimization algorithms: randomized hill climbing, simulated annealing, the genetic algorithm, and the MIMIC algorithm.

We will begin by evaluating the first three algorithms’ performance in determining neural network weights, compared to backpropagation. In order to assess the accuracy of our neural network, we will utilize a multilayer perceptron with the hyperparameters we calculated in Assignment 1 (Table 1), using the Phishing Websites data set [1]. We will continue by introducing three interesting optimization problems, and determining which algorithms are best-suited to solve each one.

| Hyperparameter | Value |
| --- | --- |
| Hidden Layers | 1 |
| Hidden Nodes | 5 |
| Epoch Count | 200 |

Table 1. Optimal neural network hyperparameters on the Phishing Websites dataset, found in Assignment 1.

Randomized Hill Climbing

Perhaps the simplest of the randomized optimization algorithms, Randomized Hill Climbing consists of a simple greedy hill climbing search with random restarts after each iteration. As such, with enough iterations, the algorithm will cover the entire search space rather than confining itself to a single local minimum. Due to the straightforward nature of the algorithm, there aren’t any hyperparameters to tune. We use a 70/30 train/test split in our experiment.
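The idea can be sketched in a few lines. This is a minimal illustration, assuming a bit-string search space and a fitness score to maximize (the experiments in this paper instead search over real-valued network weights). The function name, the single-bit-flip neighborhood, and restarting after a fixed number of steps are all illustrative assumptions, not details of the implementation used here.

```python
import random

def randomized_hill_climb(fitness, n_bits, restarts=10, steps=200, seed=0):
    """Greedy single-bit-flip hill climbing with random restarts (maximizes fitness)."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(restarts):
        # Each restart begins from a fresh random point in the search space.
        current = [rng.randint(0, 1) for _ in range(n_bits)]
        score = fitness(current)
        for _ in range(steps):
            neighbor = current[:]
            neighbor[rng.randrange(n_bits)] ^= 1  # flip one randomly chosen bit
            neighbor_score = fitness(neighbor)
            if neighbor_score >= score:  # greedy: never accept a worse neighbor
                current, score = neighbor, neighbor_score
        if score > best_score:
            best, best_score = current, score
    return best, best_score

# Toy usage: maximize the number of 1 bits in a 20-bit string.
solution, score = randomized_hill_climb(sum, n_bits=20)
```

Because a worse neighbor is never accepted, the only way this sketch leaves a local optimum is through a restart.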

We applied the algorithm to calculate the weights of a neural network using multilayer perceptron hyperparameters listed in Table 1; a curve relating accuracy to iterations can be seen in Figure 1. Although the curve itself displays a clear upwards trend, there are several interesting phenomena to unpack. Firstly, the algorithm’s accuracy remains fixed to 44% for approximately the first 100 iterations; this is an indication that the algorithm’s search space was isolated to a positioning of weights that did not influence the accuracy of the classifier during this time. Furthermore, after the curve began to increase near iteration 100, there were still some large fluctuations – one of which decreased the model’s accuracy to 44%. These outliers likely involve cases where the algorithm’s search space allowed for modification of crucial network weights with greater effect on the model.

Figure 1. NN feed-forward accuracy over 1,000 iterations of the Randomized Hill Climbing algorithm.

Through 1,000 iterations, the algorithm was unable to reach the testing accuracy of 97.38% achieved using backpropagation in Assignment 1; the testing curve plateaus just above 90%. Given exponentially more iterations the algorithm could likely converge to the same accuracy as the backpropagated model, but this would be infeasible (especially as only 200 iterations were necessary with backpropagation). Further, 1,000 iterations of this algorithm took 21.27 seconds, slightly longer than the 19.56 seconds needed by Simulated Annealing. Aside from increased iterations, one could potentially measure a higher accuracy by averaging results over multiple runs, but due to the simplicity of the Randomized Hill Climbing algorithm and its lack of hyperparameters to tune, the results likely do not have much room for improvement.

Simulated Annealing

Simulated Annealing follows a strategy somewhat similar to that of Randomized Hill Climbing; random positions are chosen at the start of each iteration to spread out the search, but rather than greedily choosing adjacent neighbors, the algorithm picks a neighbor based on a probability distribution governed by a ‘temperature’ value (hence the name ‘annealing’) [2]. The algorithm will always move to ‘better’ neighbors, but will also move to worse neighbors with a certain acceptance probability (calculated from the temperature). As the temperature cools, the acceptance probability decreases until the algorithm essentially resembles basic hill climbing and a local (ideally global) minimum is found.
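A minimal sketch of this acceptance rule follows, assuming the common Metropolis form `exp(delta / T)` for a fitness score that is maximized, and a multiplicative cooling step. The function names, the bit-string neighborhood, and the default values are illustrative assumptions rather than the implementation used in these experiments.

```python
import math
import random

def acceptance_probability(current_score, neighbor_score, temperature):
    """Probability of moving to the neighbor when maximizing fitness."""
    if neighbor_score >= current_score:
        return 1.0  # better (or equal) neighbors are always accepted
    return math.exp((neighbor_score - current_score) / temperature)

def simulated_annealing(fitness, n_bits, initial_temp=100.0, cooling=0.95,
                        iterations=1000, seed=0):
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n_bits)]
    score = fitness(current)
    temperature = initial_temp
    for _ in range(iterations):
        neighbor = current[:]
        neighbor[rng.randrange(n_bits)] ^= 1  # flip one random bit
        neighbor_score = fitness(neighbor)
        if rng.random() < acceptance_probability(score, neighbor_score, temperature):
            current, score = neighbor, neighbor_score
        # Multiplicative cooling; the floor avoids dividing by zero later on.
        temperature = max(temperature * cooling, 1e-12)
    return current, score
```

At a high temperature nearly every move is accepted, while at a low temperature the same worsening move is almost never accepted, which is the hill-climbing limit described above.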

Two important hyperparameters are involved in tuning a simulated annealing model: the initial temperature of the annealing algorithm, and a ‘cooling factor’ (also referred to as a ‘cooling schedule’) which determines the rate of temperature decrease. Using a 70/30 train/test split, we ran a grid search on these two hyperparameters over 1,000 iterations, the results of which can be seen in Table 2. Note that we started the initial temperature at 1,000, as too low a temperature would prevent the algorithm from expanding its search to the full scope of the problem, leaving it to operate as basic hill climbing, as mentioned previously.

| Cooling Factor \ Initial Temperature | 1000 | 1.0E5 | 1.0E7 | 1.0E9 |
| --- | --- | --- | --- | --- |
| 0.25 | 0.909 | **0.913** | 0.901 | 0.909 |
| 0.5 | 0.894 | 0.908 | 0.910 | 0.890 |
| 0.75 | 0.847 | 0.898 | 0.845 | 0.879 |
| 0.99 | 0.441 | 0.441 | 0.441 | 0.441 |

Table 2. Grid search on Simulated Annealing cooling factor and initial temperature. Entries give the final testing accuracy (as a fraction) after 1,000 iterations. Best accuracy marked in bold.

Upon running the grid search, it is clear that an initial temperature of 100,000 and cooling factor of 0.25 offer optimal performance of the model. Next, we generated a curve demonstrating accuracy of our model’s training and testing set classification over 1,000 iterations (Figure 2).

Figure 2. NN feed-forward accuracy over 1,000 iterations of the Simulated Annealing algorithm.

The curve generated for our Simulated Annealing model looks strikingly similar to that generated using Randomized Hill Climbing; the algorithm runs into the same delayed reaction and accuracy anomalies, though at different iterations. This can be attributed to the algorithm’s similarity to Randomized Hill Climbing, and it suggests that interactions among the attributes within the neural network’s hidden layer are largely responsible for the behavior.

Running 1,000 iterations of the Simulated Annealing algorithm took 19.56 seconds, placing the algorithm among the fastest of the randomized optimization techniques. Looking forward, the accuracy would likely converge higher if the algorithm were given more iterations to run. Aside from this, a marginally higher accuracy could be measured by averaging results over multiple trials, or by performing a grid search with smaller steps to find a better hyperparameter combination. Ultimately, although it achieves slightly higher accuracy than Randomized Hill Climbing, the Simulated Annealing algorithm still falls far short of backpropagation’s impressive 97.38% testing accuracy.

Genetic Algorithm

The Genetic Algorithm applies biological principles to optimization problems, using a system of mating populations in order to find the fittest member of a search space. From a high level, the algorithm encodes relevant information about a search space into the ‘chromosomes’ of population individuals, and randomly crosses over these chromosomes in each generation using ratios encoded as constants. In each generation, the best members are carried over and the worst are discarded. Over enough iterations, the population will contain optimal members of the search space.

Three important hyperparameters are necessary in order to fine-tune a Genetic Algorithm optimization model: the mutation count, or the number of individuals who have their chromosomes spontaneously mutated, the mating count, or the number of individuals who mate to produce offspring, and the population size, or the number of individuals to maintain each generation. Too many mutating individuals can prevent the algorithm from maintaining a genetically superior population. These mutations are crucial, however, as they prevent the algorithm from halting at local minima by applying randomness. As such, we will confine the mutation count to be no more than ¼ the population size. Constraining our population size to be 200, we perform a grid search on the mating and mutation count to find an optimal set of hyperparameters (Table 3). Like before, we use a grid search with a 70/30 train/test split.

| Mating Count \ Mutation Count | 12 | 24 | 36 | 48 |
| --- | --- | --- | --- | --- |
| 50 | 0.865 | 0.891 | **0.916** | 0.902 |
| 100 | 0.900 | 0.896 | 0.882 | 0.906 |
| 150 | 0.893 | 0.886 | 0.911 | 0.881 |
| 200 | 0.875 | 0.901 | 0.889 | 0.905 |

Table 3. Grid search on Genetic Algorithm mating count and mutation count. Population size fixed to 200. Entries give the final testing accuracy (as a fraction) after 1,000 iterations. Best accuracy marked in bold.
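To make the roles of the three hyperparameters concrete, the generational loop can be sketched as follows. This is an illustrative sketch on bit strings, assuming tournament selection, single-point crossover, and single-bit mutation; these operators and all names here are assumptions, not the operators of the implementation used in the experiments.

```python
import random

def tournament(population, fitness, rng):
    """Pick the fitter of two randomly drawn individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def genetic_algorithm(fitness, n_bits, population_size=200, mate_count=50,
                      mutate_count=36, generations=100, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Offspring: single-point crossover between tournament-selected parents.
        children = []
        for _ in range(mate_count):
            mother = tournament(population, fitness, rng)
            father = tournament(population, fitness, rng)
            cut = rng.randrange(1, n_bits)
            children.append(mother[:cut] + father[cut:])
        # Survivors: the fittest members fill the remaining slots.
        population.sort(key=fitness, reverse=True)
        population = population[:population_size - mate_count] + children
        # Mutation: flip one random bit in mutate_count randomly chosen individuals.
        for individual in rng.sample(population, mutate_count):
            individual[rng.randrange(n_bits)] ^= 1
    return max(population, key=fitness)

# Toy usage: maximize the number of 1 bits in a 20-bit string.
best = genetic_algorithm(sum, n_bits=20, generations=30)
```

The fitness function is evaluated for many individuals in every generation, which is why each generation costs far more than a single hill climbing or annealing step.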

Table 3 shows that a mutation count of 36 and a mating count of 50 provide the highest accuracy. We will express these values as ratios of the population size, and use these ratios as we run a linear search for the optimal population size. As seen in Figure 3, a population size of 300 is optimal. The curve rises from a population size of 100 to 300 and then plateaus, indicating that the population becomes large enough to represent the majority of the hypothesis space. Next, we use these hyperparameters to generate a curve comparing accuracy to the number of genetic algorithm iterations (or rather, generations); the results can be seen in Figure 4.

Figure 3. NN feed-forward accuracy compared to the Genetic Algorithm’s population size. Using the hyperparameters gathered from Table 3.

Figure 4. NN feed-forward accuracy over 1,000 iterations of the Genetic Algorithm.

The Genetic Algorithm’s train/test curve is strikingly different from those of Randomized Hill Climbing and Simulated Annealing; during the first ~300 generations, the training and testing curves are very turbulent as the population mutates away from baseline accuracy. Interestingly, upon reaching the 300th iteration, the algorithm remains fairly consistent around 90% accuracy. Genetic algorithms are not known to scale well to large search spaces [3]. When a search space contains many local minima, these algorithms can frequently halt before finding the global optimum. The Phishing Websites data set likely matches this description with its more than 30 attributes; this could be the cause of the convergence at such a low accuracy.

With enough computing power, one could attempt to further optimize hyperparameters by running a three-dimensional grid search over the population size, mutation count and mating count, with smaller step sizes. However, as we utilize ratios for our mutation and mating rates, it is likely that a two-dimensional grid search over these two hyperparameters would be acceptable. Of course, one could also run more iterations to further fine-tune the model.

The Genetic Algorithm was by far the longest-running of our randomized optimization strategies, and converges at an accuracy just below 90%, the lowest of our tested algorithms. Each iteration takes especially long because the fitness function must be evaluated for every individual in every generation. Despite this extra computation, the algorithm still doesn’t come close to the accuracy levels achieved by backpropagation, and it required much more time (186.39 seconds for 1,000 Genetic Algorithm iterations vs 22.58 seconds for 200 backpropagation iterations).

Optimization Problems

Traveling Salesman Problem – Genetic Algorithm

The Traveling Salesman Problem is a famous NP-complete problem involving the generation of the shortest route connecting nodes within a graph, with the condition of starting and stopping at the same node. Given the problem’s classification as NP-complete, there is no known polynomial-time algorithm which can solve this problem exactly. However, randomized optimization can be utilized to calculate approximate solutions. We will apply our collection of optimization algorithms (Randomized Hill Climbing, Simulated Annealing, Genetic Algorithms, and a newcomer, MIMIC) in order to determine the best-performing optimization algorithm for this specific problem.

In order to determine the optimal algorithm, we designed two experiments: one to see how each algorithm’s accuracy scales with increased problem complexity, and another to observe each algorithm’s optimization efficiency (by determining how accuracy converges over a fixed number of iterations). We used the default testing hyperparameters provided by our ABAGAIL implementation, which are listed in Table 4.

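| Algorithm | Hyperparameter | Value |
| --- | --- | --- |
| SA | Starting Temp | 1E12 |
| SA | Cooling Factor | 0.95 |
| GA | Pop. Size | 200 |
| GA | # to Mate | 150 |
| GA | # to Mutate | 20 |
| MIMIC | Sample Count | 200 |
| MIMIC | # to Keep | 100 |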

Table 4. Optimization algorithm hyperparameters, pulled from ABAGAIL’s Traveling Salesman testing implementation. Randomized Hill Climbing not listed, as no hyperparameters are applicable.

We ran our complexity experiment over Traveling Salesman problems with sizes from 50 to 250 (in steps of 50), where the size N represents the number of nodes in the graph (Figure 5). It’s evident that for all algorithms, more graph nodes result in poorer fitness. This isn’t indicative of the algorithms failing to scale, however; the fitness function evaluates the inverse of the calculated path’s distance, and that distance is expected to be larger for a larger graph. The Simulated Annealing and Genetic Algorithm curves slope downward only gently, as these algorithms tend to perform similarly across differently-sized search spaces. Of the four, the Genetic Algorithm clearly wins this test, maintaining the highest fitness throughout.

Figure 5. Optimization algorithm fitness compared to Traveling Salesman problem graph nodes (N). Using hyperparameters listed in Table 4 and 2 seconds of iterations.
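The inverse-distance fitness described above can be illustrated with a short sketch. It assumes nodes are 2D points with Euclidean distances; the function names and the example coordinates are hypothetical.

```python
import math

def tour_length(points, order):
    """Length of the closed tour that visits the points in the given order."""
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )

def tsp_fitness(points, order):
    """Inverse tour length, so shorter tours score higher."""
    return 1.0 / tour_length(points, order)

square = [(0, 0), (0, 1), (1, 1), (1, 0)]
print(tsp_fitness(square, [0, 1, 2, 3]))  # perimeter tour of length 4 -> 0.25
print(tsp_fitness(square, [0, 2, 1, 3]))  # tour using both diagonals is longer, so it scores lower
```

Any change that lengthens the best achievable tour, such as adding nodes, lowers the best achievable fitness under this definition, independent of how well an algorithm searches.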

Next, we ran our efficiency experiment, running each algorithm for 5,000 iterations on the Traveling Salesman problem with a fixed graph node count (N) of 50 (Figure 6). The results show an even more decisive victory for Genetic Algorithms than before; it appears that an optimum is reached within the first 100 iterations, and the algorithm remains at this point throughout the remainder of execution. Such a quick convergence suggests that the algorithm may have discovered an abnormally fit local optimum, or perhaps the global optimum, as a Genetic Algorithm’s population early on in execution tends to fluctuate rapidly as random individuals mutate and mate. The population was likely quickly filled up with similarly fit hypotheses.

The domination of the Genetic Algorithm within the Traveling Salesman problem could perhaps be attributed to the ABAGAIL engineers’ domain knowledge; the algorithm’s crossover function was specifically tailored to efficiently create ‘offspring’ of two paths. With such an efficient crossover function, the most efficient sub-paths of two parent paths could be merged multiple times through each generation, allowing the algorithm to quickly converge to an optimal solution. Ultimately, in problems like the Traveling Salesman problem where the search space is not well defined (for example, a random graph), Genetic Algorithms tend to be most effective.

Figure 6. Traveling Salesman fitness results compared to optimization algorithm iterations.
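As an illustration of how a path-aware crossover can preserve sub-paths from both parents, here is a simple order-based crossover. It is a generic textbook-style operator written for this sketch, not ABAGAIL's crossover function, whose details are not reproduced here; the function name is hypothetical.

```python
import random

def order_crossover(parent_a, parent_b, rng):
    """Keep a contiguous slice of parent_a; fill the rest in parent_b's visiting order."""
    n = len(parent_a)
    i, j = sorted(rng.sample(range(n + 1), 2))  # slice boundaries with i < j
    kept = set(parent_a[i:j])
    # Cities not in the kept slice, in the order parent_b visits them.
    filler = [city for city in parent_b if city not in kept]
    return filler[:i] + parent_a[i:j] + filler[i:]

rng = random.Random(0)
child = order_crossover([0, 1, 2, 3, 4, 5], [5, 3, 1, 0, 4, 2], rng)
```

The child is always a valid tour (every city appears exactly once), which a naive single-point crossover on paths would not guarantee.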

Flip Flop Problem – Simulated Annealing

The Flip Flop problem is, by far, the simplest optimization problem used throughout our analysis. At its core, Flip Flop involves a rudimentary fitness function which counts the total number of consecutive bit alternations within a bit string. In other words, while a bit string of ‘000’ would score 0, a bit string of ‘101’ would score 2. As the optimal configuration of bits within a bit string of length N would consist of continuously alternating bits, the global optimum of such a problem would be exactly N – 1.
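The fitness function just described is small enough to write out directly. This is a minimal sketch following the definition above; the function name is an assumption.

```python
def flip_flop_fitness(bits):
    """Count adjacent positions whose bits differ."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)

print(flip_flop_fitness("000"))     # 0
print(flip_flop_fitness("101"))     # 2
print(flip_flop_fitness("01" * 4))  # 7, which is N - 1 for N = 8
```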

Our goal is to determine which optimization algorithm performs best on this problem. Like before, we ran two experiments: one to observe each algorithm’s ability to scale to larger search spaces (increasing the size of the bit string), and another to determine optimization efficiency. We utilized ABAGAIL’s default hyperparameters in our testing, which can be seen in Table 5.

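| Algorithm | Hyperparameter | Value |
| --- | --- | --- |
| SA | Starting Temp | 100 |
| SA | Cooling Factor | 0.95 |
| GA | Pop. Size | 200 |
| GA | # to Mate | 100 |
| GA | # to Mutate | 20 |
| MIMIC | Sample Count | 200 |
| MIMIC | # to Keep | 5 |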

Table 5. Optimization algorithm hyperparameters, pulled from ABAGAIL’s Flip Flop testing implementation. Randomized Hill Climbing not listed, as no hyperparameters are applicable.

For the complexity test, we ran five experiments per algorithm, comparing N values (from 100 to 500, steps of 100) to resulting accuracies (Figure 7). Each algorithm was allowed exactly 2 seconds to run. It is immediately evident that Simulated Annealing scales the most effectively; aside from its curve staying remarkably flat throughout the bit count increases, no other algorithm comes close to its accuracy – remaining near 100% throughout the experiment.

It’s apparent that Simulated Annealing (and Randomized Hill Climbing too) scales well with the size of a problem’s search space. In contrast, the Genetic Algorithm and MIMIC both show a strong inverse correlation with the increase in the problem’s bit count (or search space). This is likely because these algorithms rely on keeping a certain number of hypotheses in memory (the population size for Genetic Algorithms and the sample count for MIMIC), and are therefore unable to scale their scope to larger search spaces. Different results would likely ensue if we were to tweak these parameters, but this would not be productive given the staggering effectiveness of Simulated Annealing in this case.

Figure 7. Optimization algorithm fitness compared to Flip Flop problem bit count (N). Using hyperparameters listed in Table 5 and 2 seconds of iterations.

Next, we ran our efficiency experiment. We fixed the number of iterations to 5,000 and ran each of our algorithms on a Flip Flop problem with N = 100 (Figure 8). At first glance, it may seem as though MIMIC has the upper hand, especially in the beginning. However, given that MIMIC’s runtime throughout this experiment was 12.871 seconds compared to Simulated Annealing’s 0.009 seconds, its lead is far less impressive, especially as the algorithms converge to very similar values toward the final iterations. Ultimately, Simulated Annealing’s strategy of effective early search space exploration gives it an advantage in optimizing the Flip Flop problem, where a few bad bit flips could make it hard to escape a local optimum.

Figure 8. Flip Flop fitness results compared to optimization algorithm iterations.

Knapsack Problem – MIMIC

The Knapsack problem is another NP-complete optimization problem, which asks for the most valuable way to place a set of objects with weights and values into a ‘knapsack’ with a specified weight limit. Again, as there’s no known polynomial-time solver for this problem, we are forced to rely on optimization to approximate a solution. Below, we will determine the best algorithm for doing so.
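A candidate packing can be scored with a fitness function along these lines. This is a sketch under stated assumptions: each entry of `counts` is how many copies of an item are packed, and an overweight packing scores 0 (the penalty used by the actual implementation may differ). All names and the example numbers are hypothetical.

```python
def knapsack_fitness(counts, weights, values, capacity):
    """Total value of the packed items, or 0 if they exceed the weight limit."""
    total_weight = sum(c * w for c, w in zip(counts, weights))
    total_value = sum(c * v for c, v in zip(counts, values))
    # Assumption: infeasible (overweight) packings receive zero fitness.
    return total_value if total_weight <= capacity else 0

weights = [10, 20, 30]
values = [60, 100, 120]
print(knapsack_fitness([0, 1, 1], weights, values, capacity=50))  # 220 (weight 50 fits)
print(knapsack_fitness([1, 1, 1], weights, values, capacity=50))  # 0 (weight 60 is over the limit)
```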

We follow the same system of experiments used previously: one to determine each algorithm’s performance under increased complexity, and another to determine its fitness over increased iterations. Our optimization algorithms will utilize the hyperparameters predetermined by ABAGAIL, which are listed in Table 6. Further, we will set a knapsack item’s max value and max weight to 50, creating 4 copies of each item to attempt to place in the bag.

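| Algorithm | Hyperparameter | Value |
| --- | --- | --- |
| SA | Starting Temp | 100 |
| SA | Cooling Factor | 0.95 |
| GA | Pop. Size | 200 |
| GA | # to Mate | 150 |
| GA | # to Mutate | 25 |
| MIMIC | Sample Count | 200 |
| MIMIC | # to Keep | 100 |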

Table 6. Optimization algorithm hyperparameters, pulled from ABAGAIL’s Knapsack testing implementation. Randomized Hill Climbing not listed, as no hyperparameters are applicable.

For our first experiment, we varied the number of possible knapsack items from 40 to 200 in steps of 40 and, for each case, ran each optimization algorithm for 2 seconds (Figure 9). While the results show a clear positive association between fitness and the problem’s complexity, this again can be misleading, as the fitness function evaluates the total value of the items placed in the knapsack (which will naturally scale with the number of possible items). However, it appears that the MIMIC curve begins to diverge from the others as it approaches the higher item counts; this suggests that MIMIC’s modeling of the solution space gives it an advantage over the other algorithms.

Figure 9. Optimization algorithm fitness compared to Knapsack problem item count (N). Using hyperparameters listed in Table 6 and 2 seconds of iterations.

Finally, we ran our efficiency experiment, using a fixed item count (N) of 40 over 5,000 iterations (Figure 10). Here, MIMIC dominates the other three algorithms, quickly converging to a fitness result of roughly 4,000. Interestingly, all the algorithms converge rather quickly, suggesting this problem consists of a search space with many isolated local optima that are hard to escape. The convergence also suggests that additional iterations would not increase the model’s fitness; in fact, it appears that fewer than 800 iterations were needed for all four algorithms to converge.

Figure 10. Knapsack fitness results compared to optimization algorithm iterations.

The high performance of the MIMIC algorithm on this type of problem can be attributed to its underlying mechanism. MIMIC attempts to model a search space’s probability distribution, finding an inner ‘structure’ within the data. This allows it to use these assumptions as it operates to find points it deems ‘good’ with a level of insight, rather than randomly selecting them as done by the other algorithms. In a problem like Knapsack, where certain combinations of items result in local optima with poor fitness, the algorithm is able to avoid these spots and converge to a better local (or global) optimum.
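The sample, keep, and re-estimate loop behind this idea can be sketched as follows. This is a deliberately simplified stand-in: it treats every bit as independent, whereas MIMIC also models dependencies between variables, so it is closer to a univariate estimation-of-distribution algorithm than to MIMIC itself. The function name, the clamping bounds, and the defaults (`samples` and `keep` loosely mirror the ‘Sample Count’ and ‘# to Keep’ settings) are assumptions.

```python
import random

def univariate_eda(fitness, n_bits, samples=200, keep=100, iterations=50, seed=0):
    """Sample from a per-bit distribution, keep the fittest, re-estimate, repeat."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # start from a uniform distribution over bit strings
    best, best_score = None, float("-inf")
    for _ in range(iterations):
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(samples)]
        population.sort(key=fitness, reverse=True)
        elite = population[:keep]
        top_score = fitness(elite[0])
        if top_score > best_score:
            best, best_score = elite[0], top_score
        # Re-estimate each bit's marginal from the elite, clamped away from 0 and 1
        # so that no bit value becomes impossible to sample.
        probs = [min(0.95, max(0.05, sum(ind[i] for ind in elite) / keep))
                 for i in range(n_bits)]
    return best, best_score

# Toy usage: maximize the number of 1 bits in a 12-bit string.
best, score = univariate_eda(sum, n_bits=12, samples=50, keep=10, iterations=20)
```

Each round spends many fitness evaluations and a model update, which is consistent with MIMIC's long runtime per iteration compared to Simulated Annealing.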

Conclusion

In this paper, we have looked at four unique randomized optimization algorithms, and compared their performance on entirely different problems. To summarize, we will compare where each of our algorithms excelled, and further, what the solutions to our problems tell us about the overall application of randomized optimization.

Optimization Algorithms

Randomized Hill Climbing comes with two distinct advantages: the algorithm itself is incredibly simple, and further, given enough iterations, a properly-implemented algorithm will eventually find a global optimum. In general, such an algorithm can be effective in environments without much computing power, especially when the environment is ‘bumpy’ with many local optima.

Simulated Annealing is an impressively fast optimization algorithm with an effective randomization technique which is capable of rapidly probing a search space. The algorithm shines on relatively simple problems (like Flip Flop), especially when finding the true global optimum is not the ultimate concern.

Genetic Algorithms add a novel biological approach to randomized optimization. When faced with relatively unknown, non-complex search spaces, Genetic Algorithms can locate fit minima quickly.

Finally, the MIMIC algorithm adds a whole new level of complexity to randomized optimization, building a ‘model’ of a search space’s probability distribution in order to make informed neighbor choices. Given domain knowledge that the data contains such structure (and enough computing power), MIMIC can prove to be a very powerful tool.

Optimization Problems

The application of Genetic Algorithms, Randomized Hill Climbing and Simulated Annealing to the calculation of neural network weights, while inviting interesting analysis, ultimately suggests that backpropagation remains the better trainer for these models; none of the algorithms were able to come close to backpropagation’s impressive results from Assignment 1. However, our discussion of further optimization problems shows important uses of these algorithms; the Flip Flop problem demonstrates the effectiveness of Simulated Annealing in almost instantaneously locating optima in large, simple search spaces. Furthermore, the successful application of MIMIC and the Genetic Algorithm to challenging NP-complete problems shows that these algorithms are valuable for developing accurate approximations when exact solutions cannot be computed efficiently.

### References

[1] Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[2] What Is Simulated Annealing? (n.d.). Retrieved from https://www.mathworks.com/help/gads/what-is-simulated-annealing.html

[3] Jyoti, & Gupta, N. (2012). GENETIC ALGORITHMS : A PROBLEM SOLVING APPROACH. Retrieved from https://pdfs.semanticscholar.org/b021/b5b778d5e28f421aba79e411b2a0858b01b8.pdf
