Optimization Without Gradients
This is a quick run-through of some popular metaheuristics that try to find the global maximum of a real-valued function \( f \) when we are unable to calculate gradients for gradient methods. There are countless variations of these algorithms. The simple versions I describe here are just one particular way of implementing each algorithm.
This article is actually a short summary of chapter 4 of the textbook Artificial Intelligence: A Modern Approach, 4th Edition, which does a fantastic job of describing these algorithms in an accessible way.
One last thing to point out is that one of the No Free Lunch Theorems proved that none of these algorithms can be said to be strictly better than their peers in general. However, if you know some specifics about the function you are trying to optimize, you might have a reason to use one of them over the others. I wrote about this in an article here.
Hill Climbing
The idea of this algorithm is to start at a random input location, then keep looking in the local neighborhood for better locations to move to.
- Start at a random input location \( x_1 \).
- Calculate \( f(x_1) \) as well as the values of \( f \) in some local neighborhood of \( x_1 \).
- Pick a different location \( x_2 \) in this neighborhood that has a higher \( f \) value. That is, \( f(x_2) > f(x_1) \).
- Keep repeating this process until we can no longer find better locations to move to. This means that all points in the local neighborhood result in lower values of \( f \).
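The steps above can be sketched in a few lines of Python. The neighborhood here is a set of random perturbations of size `step` around the current point, which is just one choice among many; the parameter values are arbitrary.

```python
import random

def hill_climb(f, x0, step=0.1, n_neighbors=8, max_iters=1000):
    """Greedy hill climbing: repeatedly move to the best nearby point.

    The "local neighborhood" is modeled as n_neighbors random
    perturbations of size at most `step` (one simple choice among many).
    """
    x = x0
    for _ in range(max_iters):
        neighbors = [x + random.uniform(-step, step) for _ in range(n_neighbors)]
        best = max(neighbors, key=f)
        if f(best) <= f(x):
            return x  # no neighbor improves on x: a local maximum
        x = best
    return x
```

On a unimodal function such as \( f(x) = -(x - 2)^2 \), this reliably climbs to the single peak; on a multimodal function it simply stops at whichever local maximum it happens to reach first.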
You might be wondering, wouldn’t we get stuck at a local optimum? And you would be correct. The hill climbing algorithm returns a local maximum, which need not be the global maximum. This will be a recurring pattern among all the algorithms described here.
Simulated Annealing
- We have a temperature variable \( T \) that starts at some arbitrary number and decreases to \( 0 \) over time.
- Start at a random input location \( x_1 \).
- If \( T = 0 \), return the current location.
- Select a new random input location \( x_2 \).
- Move to the new location if \( f(x_2) > f(x_1) \).
- Else, move to the new location with probability \( \exp\left(\frac{f(x_2) - f(x_1)}{T}\right) \).
Note that \( f(x_2) \le f(x_1) \implies f(x_2) - f(x_1) \le 0 \). So, \( 0 < \exp\left(\frac{f(x_2) - f(x_1)}{T}\right) \le 1 \), and this quantity can be used as a probability.
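Here is a minimal Python sketch of the procedure. The geometric cooling schedule (`T *= cooling`) and the candidate-generation rule (a random perturbation of the current point) are assumptions on my part; the algorithm only requires that \( T \) decrease toward \( 0 \).

```python
import math
import random

def simulated_annealing(f, x0, T0=1.0, cooling=0.995, step=0.5, min_T=1e-3):
    """Simulated annealing with a geometric cooling schedule
    (one arbitrary schedule; the article leaves this choice open)."""
    x, T = x0, T0
    while T > min_T:
        x_new = x + random.uniform(-step, step)  # random nearby candidate
        delta = f(x_new) - f(x)
        # Always accept improvements; accept worse moves with
        # probability exp((f(x_new) - f(x)) / T), which shrinks as T -> 0.
        if delta > 0 or random.random() < math.exp(delta / T):
            x = x_new
        T *= cooling
    return x
```

Early on, when \( T \) is large, the acceptance probability for bad moves is close to \( 1 \), so the search wanders freely; near the end it behaves almost like hill climbing.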
In contrast to hill climbing, simulated annealing attempts to avoid getting stuck at a local maximum by randomly exploring the search space. As the temperature \( T \) approaches \( 0 \) over time, the algorithm becomes less likely to jump to suboptimal points. We have the freedom to choose how the temperature changes over time.

There are various results showing that with the right temperature schedule, simulated annealing will find the global optimum with probability \( 1 \). But do note that it might still take a tremendously long amount of time to do so!
Local Beam Search
- Start with \( k \) random input locations.
- At each step, generate successor locations from those \( k \) locations.
- Select the best \( k \) input locations from the entire pool of current locations and successors.
- Repeat the previous two steps.
The successor locations in the second step could be found by searching the local neighborhoods, as in hill climbing, or by any other method.
One problem with this approach is that because we are always selecting the \( k \) best input locations from the entire pool, all \( k \) search threads could get stuck at the same local maximum. Instead of always selecting the \( k \) best, we could select \( k \) locations at random, weighted by their \( f(x) \) values, similar to the random jumps in simulated annealing.
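A small Python sketch of the greedy variant, assuming (as with hill climbing above) that successors are random perturbations of each beam member and that the search space is a bounded interval:

```python
import random

def local_beam_search(f, k=5, step=0.2, n_iters=200, lo=-10.0, hi=10.0):
    """Local beam search: keep the k best locations from the pooled
    successors of the current k. Successor generation by random
    perturbation is one choice among many."""
    beam = [random.uniform(lo, hi) for _ in range(k)]
    for _ in range(n_iters):
        pool = list(beam)  # current locations stay in the pool
        for x in beam:
            pool.extend(x + random.uniform(-step, step) for _ in range(4))
        beam = sorted(pool, key=f, reverse=True)[:k]  # keep the k best overall
    return beam[0]
```

Because the \( k \) slots share one pool, information flows between the threads: a thread stuck in a poor region is quickly replaced by successors of the better ones, which is both the strength of the method and the cause of the clustering problem described above.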
Evolutionary and Genetic Algorithms
Evolutionary algorithms are a massive topic containing many different algorithms inspired by the process of biological evolution. Again, do keep in mind that the No Free Lunch Theorems proved that there is no general advantage to be gained from using one of them over another. Broadly speaking, the steps in most evolutionary algorithms are roughly as follows:
- Start with a population of \( n \) agents.
- Measure the fitness of each of these agents using a fitness function.
- Let agents produce children with a probability that increases with their fitness.
- Randomly choose children to mutate.
- Replace previous low-fitness agents with these children.
Genetic algorithms are a type of evolutionary algorithm where agents are binary strings. This representation tries to mimic DNA sequences.
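A toy genetic algorithm on bitstrings might look like the sketch below. The specific operators — fitness-proportional parent selection, one-point crossover, per-bit mutation, and full generational replacement — are common textbook choices, not the only ones; all parameter values are arbitrary. The fitness function must return nonnegative values for the weighted selection to work.

```python
import random

def genetic_algorithm(fitness, n_bits=16, pop_size=20, n_gens=100, p_mut=0.05):
    """Toy genetic algorithm on bitstrings: fitness-weighted parent
    selection, one-point crossover, and per-bit mutation."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(n_gens):
        weights = [fitness(ind) for ind in pop]
        children = []
        for _ in range(pop_size):
            # Fitter agents are more likely to be chosen as parents.
            p1, p2 = random.choices(pop, weights=weights, k=2)
            cut = random.randint(1, n_bits - 1)       # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Flip each bit independently with probability p_mut.
            child = [b ^ 1 if random.random() < p_mut else b for b in child]
            children.append(child)
        pop = children  # the new generation replaces the old one
    return max(pop, key=fitness)
```

A classic sanity check is the "one-max" problem, where fitness is simply the number of \( 1 \) bits: the population drifts toward the all-ones string over the generations.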