The No Free Lunch Theorems
For Supervised Learning, Optimization, & Search
Sections
- The Papers and What They Say
- Supervised Learning
- Optimization and Search
- Coevolutionary Free Lunch
- When Can One Algorithm Outperform Another?
- Follow-Up Research
This article summarizes several key points about the no free lunch (NFL) theorems.
These are a set of mathematical results about the efficacy of methods in search, optimization, and supervised learning, especially metaheuristics such as evolutionary or nature-inspired swarm algorithms. Do these methods really “work”? What does it even mean to say that they “work”?
The Papers and What They Say
The no free lunch theorems are frequently brought up in online discussions, but often without any depth or mention of the nuances, which is part of my motivation for writing this article. These discussions tend to cite the NFL theorems as saying “there is no best algorithm” whenever someone asks if one algorithm is better than another. This is, of course, too simplistic a description. What I am hoping to do here is to go just one step further and summarize the key points of each paper.
Another key reason for the existence of this article is that the authors themselves later singled out coevolutionary algorithms as a setting where free lunches do exist.
Before we dig into the materials, I would like to point out that the authors of these theorems have a website dedicated to this topic. However, do note that some of their links are broken at the time of writing.
Supervised Learning
The very first paper on the NFL theorems was for supervised learning:
- David Wolpert 1996 - “The Lack of A Priori Distinctions Between Learning Algorithms”.
This paper is infamous for being “often cited but rarely read” because it consists of 52 pages and appears to be a rather difficult read. I must admit that I have not read it in its entirety either. In fact, the author later wrote a version that is much easier to read:
- David Wolpert 2001 - “The Supervised Learning No-Free-Lunch Theorems”.
The author introduced a new “Extended Bayesian Formalism” framework and analyzed the expected error or cost \( \mathbb{E}(C | d) \) of a supervised learning algorithm, given training dataset \( d \).
Key Points
- The expected error must be calculated on “off-training” data points: points that do not appear in the training dataset. This strict assumption prevents algorithms that simply memorize the training data from gaining an advantage.
- Let \( d \) be the training dataset. Then an algorithm is equivalent to a distribution \( P(h \mid d) \), where \( h \) (for “hypothesis”) is the algorithm’s estimate of the true objective function \( f \).
- The author uses \( P(f \mid d) \) to express our uncertainty about the true objective function \( f \) that the algorithm is trying to learn.
- Finally, the author showed that \( \mathbb{E}(C | d) \) is the same for any two algorithms when uniformly averaged over all objective functions \( f \)! The same holds when uniformly averaged over all priors \( P(f) \), over all training sets \( d \), and over all training set sizes!
In English, this means that if we do not know anything about the objective function \( f \) that we are trying to learn, then any function is equally likely, and any two algorithms will have the same expected error on new, unseen data points.
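This averaging argument can be checked by brute force on a tiny domain. The sketch below is my own illustration, not code from the paper: it compares two toy learners, one that always predicts 0 and one that copies its single training label, and averages their off-training 0-1 error over all \( |Y|^{|X|} = 8 \) possible functions.

```python
from itertools import product

X = [0, 1, 2]        # finite input space
Y = [0, 1]           # finite output space
train_x = [0]        # training inputs; 1 and 2 are "off-training" points

def learner_zero(d):
    # ignores the data and always predicts 0
    return lambda x: 0

def learner_copy(d):
    # predicts the single label it saw during training, for every input
    return lambda x: d[0][1]

def off_training_error(learner, f):
    h = learner([(x, f[x]) for x in train_x])
    off = [x for x in X if x not in train_x]
    return sum(h(x) != f[x] for x in off) / len(off)

avg = {}
for learner in (learner_zero, learner_copy):
    errs = [off_training_error(learner, dict(zip(X, ys)))
            for ys in product(Y, repeat=len(X))]
    avg[learner.__name__] = sum(errs) / len(errs)

print(avg)  # both learners have the same average off-training error
```

Both averages come out to 0.5: neither learner has an edge once we average over every possible \( f \).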
Optimization and Search
Two more papers contain NFL theorems for search and optimization:
- Wolpert & Macready 1995 - “No Free Lunch Theorems for Search”
- Wolpert & Macready 1997 - “No Free Lunch Theorems for Optimization”
I have not read the one for search algorithms, but the reasoning should be similar to optimization, which can be seen as searching for the maximum or minimum of an objective function.
Key Points
- Let \( f : X \rightarrow Y \) be an objective function on finite sets \( X \) and \( Y \). The paper considers the space of all possible objective functions on these two sets, which has size \( |Y|^{|X|} \).
- Let \( d^y_m = \{ d^y_m(1), d^y_m(2), \ldots, d^y_m(m) \} \) be a time-ordered sequence of \( m \) \( Y \)-values. Note that \( d^y_m(i) = f(d^x_m(i)) \), where \( d^x_m(i) \) is the \( X \)-value visited by the algorithm at time \( i \).
- For any pair of algorithms \( a_1 \) and \( a_2 \), this version of the NFL theorem says that \( \sum_f P(d^y_m \mid f, m, a_1) = \sum_f P(d^y_m \mid f, m, a_2) \).
In English, this means that any two algorithms will have the same probability of generating any particular outcome when averaged over all possible objective functions!
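Again, this is easy to verify exhaustively on a tiny search space. The sketch below is my own toy setup, not code from the papers: it runs two deterministic non-repeating algorithms, a fixed left-to-right sweep and an adaptive rule that branches on the last observed value, over all \( |Y|^{|X|} = 8 \) objective functions and tallies the resulting \( d^y_m \) sequences.

```python
from itertools import product
from collections import Counter

X = [0, 1, 2]
Y = [0, 1]
m = 2  # each algorithm samples m distinct points

def fixed_order(history):
    # deterministic sweep: visit 0, then 1, then 2
    return len(history)

def adaptive(history):
    # start at 2, then pick the next unvisited point based on the last y-value
    if not history:
        return 2
    seen = {x for x, _ in history}
    _, last_y = history[-1]
    candidates = X if last_y == 0 else list(reversed(X))
    for x in candidates:
        if x not in seen:
            return x

def run(algo, f):
    history = []
    for _ in range(m):
        x = algo(history)
        history.append((x, f[x]))
    return tuple(y for _, y in history)  # the sequence d^y_m

counts = {}
for algo in (fixed_order, adaptive):
    counts[algo.__name__] = Counter(run(algo, dict(zip(X, ys)))
                                    for ys in product(Y, repeat=len(X)))

print(counts)
```

Both algorithms generate each possible \( y \)-sequence the same number of times, exactly as the theorem predicts.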
Coevolutionary Free Lunch
Finally, the authors wrote a paper much later showing the existence of free lunches for coevolutionary algorithms:
- Wolpert & Macready 2005 - “Coevolutionary Free Lunches”
I was very excited to read this at first, but it turns out that this result has a massive caveat and is not the discovery of an actual free lunch.
Key Points
- There are two players: the agent and an opponent. Let \( \underline{x} \) denote the agent’s moves (choices of input value \( x \)) and \( \bar{x} \) the opponent’s responses to those moves.
- Play is sequential: the agent selects a move, then the opponent selects a response.
- Let \( f: X \times X \rightarrow Y \) be the payoff (or objective) function of the agent, with \( (\underline{x}, \bar{x}) \in X \times X \) denoting an input element. The agent wants to maximize their payoff as calculated by \( f \).
- The conclusion section of the paper explains succinctly why free lunches exist in this setting: summing over all payoff functions \( f \) is not the same as summing over all functions \( \min_{\bar{x}} f( \cdot , \bar{x}) \).
In English, this means that for algorithms whose payoffs depend on the opponent’s response, some algorithms can do better than others. Trivially, an algorithm that ignores the opponent’s moves will tend to perform worse than one that extracts information from them.
This Stack Exchange discussion has a very nice one-line summary of why free lunches exist in this setting:
“Lunches are free because the opponent is buying.”
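The counting argument from the paper’s conclusion can be seen directly on a tiny example. Assuming the agent is scored by its worst-case payoff \( g(x) = \min_{\bar{x}} f(x, \bar{x}) \) (my own illustration of the point, not code from the paper), the sketch below shows that a uniform average over all payoff functions \( f \) induces a decidedly non-uniform distribution over the functions \( g \), which is exactly why the NFL averaging argument breaks down here.

```python
from itertools import product
from collections import Counter

X = [0, 1]   # moves available to both players
Y = [0, 1]   # payoff values
pairs = [(a, b) for a in X for b in X]

induced = Counter()
for vals in product(Y, repeat=len(pairs)):
    f = dict(zip(pairs, vals))                         # one payoff function
    g = tuple(min(f[(x, xb)] for xb in X) for x in X)  # worst-case payoff per move
    induced[g] += 1

print(induced)  # the distribution over g is far from uniform
```

Low worst-case payoffs are far more common than high ones (9 of the 16 payoff functions induce \( g = (0, 0) \), while only 1 induces \( g = (1, 1) \)), so uniformly averaging over \( f \) does not uniformly average over \( g \).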
When Can One Algorithm Outperform Another?
The NFL theorems say that the performance of all algorithms is the same when averaged over all possible objective functions. However, it is still possible for one algorithm to beat another in several ways.
- The NFL theorems do not compare computational time or resource usage. One algorithm can outperform another by running much faster or using far fewer resources.
- If we have an inductive bias about the structure of our dataset or objective function, some algorithms can exploit that structure and outperform others.
- The NFL theorems are not known to hold for “higher-order” algorithms that use information such as the function’s gradient (1st order) or Hessian (2nd order). This Stack Exchange discussion contains a great explanation.
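As a quick toy example of the second and third points (my own illustration): when the objective is known to be smooth and convex, an algorithm that exploits gradient information closes in on the minimizer far faster than one that samples blindly.

```python
import random

def f(x):
    # a smooth, convex objective with minimizer at x = 3
    return (x - 3.0) ** 2

def gradient_descent(steps=50, lr=0.1):
    # exploits 1st-order (gradient) information about f
    x = 0.0
    for _ in range(steps):
        grad = 2 * (x - 3.0)  # analytic gradient of f
        x -= lr * grad
    return x

def random_search(steps=50, seed=0):
    # blind uniform sampling over the same interval
    rng = random.Random(seed)
    return min((rng.uniform(-10, 10) for _ in range(steps)), key=f)

print(abs(gradient_descent() - 3.0))  # very close to the minimizer
print(abs(random_search() - 3.0))     # typically much farther away
```

On this quadratic, 50 gradient steps land within about \( 10^{-4} \) of the minimizer, while 50 blind samples typically remain orders of magnitude farther away.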
Follow-Up Research
There are also numerous directions for future research.
- For what kinds of datasets and objective functions do the NFL theorems fail to hold? For example, Schumacher, Vose, and Whitley (2001) showed that the NFL theorems hold if and only if the set of objective functions we are averaging over is closed under permutation: for every function \( f \) in the set and every permutation \( \sigma \) of the inputs, the permuted function \( f \circ \sigma \) must also be in the set. For instance, if the set contains an \( f \) with \( f(x_1) = 0 \) and \( f(x_2) = 1 \), it must also contain a \( g \) with \( g(x_1) = 1 \) and \( g(x_2) = 0 \). The inputs have been permuted while the multiset of outputs is unchanged.
- Which algorithms perform better on these datasets and objective functions? Why do they outperform?
- When do metaheuristics have an advantage? For example, instead of dismissing all genetic algorithms as useless because of the NFL theorems, we could research the conditions under which genetic algorithms have an advantage.
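The permutation-closure condition from the first point can be checked mechanically. A minimal sketch (my own illustration; each function on a three-point domain is represented as the tuple of its outputs):

```python
from itertools import permutations

X = (0, 1, 2)  # a three-point input space

def closed_under_permutation(fset):
    # fset: a set of functions, each represented as the tuple (f(x) for x in X)
    for f in fset:
        for perm in permutations(range(len(X))):
            if tuple(f[i] for i in perm) not in fset:
                return False
    return True

# all rearrangements of (0, 0, 1): closed under permutation
closed_set = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
# a single function on its own: not closed
open_set = {(0, 0, 1)}

print(closed_under_permutation(closed_set))  # True
print(closed_under_permutation(open_set))    # False
```

The set of all rearrangements of \( (0, 0, 1) \) is closed under permutation, so the NFL result applies when averaging over it; the singleton set is not closed, so it escapes the theorem’s hypothesis.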
Here is a review paper not written by the original authors, but one that I find very useful to keep up with the various developments in this area:
- Adam et al. 2019 - “No Free Lunch Theorem: A Review”