What are the "no free lunch" theorems?

3 min read

Suggest changes in Google Docs

“No free lunch” theorems assert that, on average, every learning algorithm does equally well over all possible learning tasks. An algorithm that does better than chance at predicting some sequences must "pay for lunch" by doing worse at some other sequences.

Some have argued that these theorems imply that fully general intelligence is impossible, and therefore worries about AGI are overblown. This argument proves too much since it also implies human intelligence is impossible.

"No free lunch" holds only on the entire set of all theoretically possible sequences. The ones our algorithm does worse at may just be fully random, or designed to trick it. But if we start out knowing that the environment that our algorithm operates in has a certain structure, then the “no free lunch” results are not an impediment to designing algorithms with superior predictive or optimizing abilities.

Therefore, for practical AI design purposes, these theorems are often irrelevant. We aren't interested in "predicting" completely random sequences, and we don't mind if another algorithm outperforms us on that "task". No system can be so general as to perform well in every possible universe, but AGI is only required to perform well in one universe¹ - ours. Our universe is far from maximally random, and its laws provide a lot of structure that lets us make good predictions while "paying for lunch" in other possible universes without that structure.

"No free lunch" hasn't prevented humans from exploiting the universe's structure for research and development, and won't prevent artificial systems from doing so much more effectively. The generality needed for AGI to exceed human abilities across the board is not the same kind of generality forbidden by these theorems.

David Dalrymple points out that another “interpretation of the NFL theorems is that solving the relevant problems under worst-case assumptions is too easy, so easy it's trivial: a brute-force search satisfies the criterion of worst-case optimality.” The fact that designing problem solvers for extremely general environments is too easy crops up elsewhere. E.g., in Solomonoff induction, the problem is to assume a computable environment and predict what will happen next. The best algorithm is just to run through every possible computable function and average their predictions. That’s 5 lines of pseudocode. Likewise, extremely specialized problems, such as designing a solver for tic-tac-toe, are trivial. But between total generality and total specialization? This is where things become difficult. But it is also where we find most of the questions that matter to us. ↩︎

What is the orthogonality thesis?