What is scaffolding?
"Scaffolding" is a fuzzy term referring to code built up around an LLM in order to augment its capabilities1. This does not typically include code which alters the LLM's internals, such as fine-tuning or activation steering.2
People use scaffolding because it can allow LLMs to use tools, reduce error rates, search for info, etc., all of which make LLMs more capable.
Scaffolding is an important aspect of safety evaluations because, once an LLM is deployed, users will inevitably build scaffolding around it to make it more powerful. Dedicated effort should be put into pushing the functionality of the LLM to its limits during these evaluations, as the latent capabilities of LLMs are a) what's relevant for danger and b) hard to tease out with prompts alone.
Another reason to use scaffolding is interpretability. Most of an LLM’s reasoning happens inside the neural network itself, in streams of gigantic tensors which humans don’t yet know how to interpret. But reasoning done via the scaffold typically happens in plain sight, via text or code, which is much easier to inspect. E.g., two LLMs working together on a problem might express their reasoning to each other in English.
Common types of scaffolds include:
- Prompt templates: Perhaps the simplest scaffolds, “prompt templates” are just prompts with some blank sections to be filled in at runtime using some local data. E.g., a template like “You are a good assistant. The date is {INSERT DATE HERE}. Tell me how many days are left in the month.” can be filled in with the current date before being used to prompt an LLM (see the sketch after this list).
- Retrieval-Augmented Generation (RAG): For tasks which have some associated data, RAG is a way of automatically adding task-relevant info to prompts. RAG does this by searching a database for snippets of text/data relevant to the user's prompt, and then adding those snippets to the prompt (see the sketch after this list).
- Search engines: Similar to RAG, giving an LLM access to a search engine lets it find info relevant to its prompt. Unlike RAG, the LLM decides what to search for and when (see the sketch after this list).
- Agent scaffolds: Scaffolds which try to turn LLMs into goal-directed agents, i.e., giving an LLM some goal/task to perform, and ways to observe and act on the world. Usually, the LLM is run in a loop with a history of its observations and actions, until its task is done (see the loop sketch after this list). Some agent frameworks give LLMs access to much more powerful high-level actions, which saves the AI from repeating many simple primitive actions and therefore speeds up planning.[3]
- Function calling: Giving an LLM the ability to call functions in a programming language while it is running, e.g., giving it access to addition and multiplication functions to reduce arithmetic errors, or access to a spreadsheet API for tasks like accounting (see the sketch after this list).
- “Bureaucracies” of LLMs: Running multiple LLMs and giving them different roles in solving a task, somewhat like in a team or larger organization. One common variant is to task one AI with coming up with solutions, and another with critiquing these proposed solutions (see the proposer/critic sketch after this list). This can be generalized into an entire tree of thoughts.[4]
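As an illustration of the first item, here is a minimal prompt-template sketch; the template text and helper name are just for illustration, not any particular library's API:

```python
from datetime import date

TEMPLATE = (
    "You are a good assistant. The date is {today}. "
    "Tell me how many days are left in the month."
)

def build_prompt() -> str:
    # Fill the blank section with local data at runtime.
    return TEMPLATE.format(today=date.today().isoformat())

print(build_prompt())  # The filled-in prompt is then sent to the LLM.
```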
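For RAG, here is a toy sketch in which simple word overlap stands in for a real embedding model and vector database, to keep the example self-contained:

```python
DOCS = [
    "Expense reports are due on the first Friday of each month.",
    "The office is closed on public holidays.",
    "The cafeteria serves lunch from 11:30 to 14:00.",
]

def similarity(query: str, doc: str) -> int:
    # Stand-in for embedding similarity: count shared lowercase words.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, k: int = 2) -> list[str]:
    # Return the k snippets most relevant to the user's prompt.
    return sorted(DOCS, key=lambda d: similarity(query, d), reverse=True)[:k]

def rag_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using the context."

print(rag_prompt("When are expense reports due?"))
```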
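For the search-engine variant, a sketch in which the model itself decides when to search by emitting a special line; `llm` and `web_search` are hypothetical stand-ins for a model API and a search API:

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # Stand-in for a real model API call.

def web_search(query: str) -> str:
    raise NotImplementedError  # Stand-in for a real search API call.

def answer(question: str, max_searches: int = 3) -> str:
    prompt = (
        f"Question: {question}\n"
        "Answer directly, or write 'SEARCH: <query>' to look something up."
    )
    for _ in range(max_searches):
        reply = llm(prompt)
        if not reply.startswith("SEARCH:"):
            return reply  # The model chose to answer without searching.
        results = web_search(reply.removeprefix("SEARCH:").strip())
        prompt += f"\n{reply}\nResults: {results}\nAnswer or search again."
    return llm(prompt + "\nAnswer now with what you have.")
```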
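For agent scaffolds, a minimal version of the observe-act loop described above; `llm`, `observe`, and `act` are hypothetical hooks into a model and an environment:

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # Stand-in for a real model API call.

def observe() -> str:
    raise NotImplementedError  # Stand-in for reading the environment.

def act(action: str) -> None:
    raise NotImplementedError  # Stand-in for acting on the environment.

def run_agent(goal: str, max_steps: int = 20) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        history.append(f"Observation: {observe()}")
        prompt = (
            f"Goal: {goal}\n" + "\n".join(history) +
            "\nNext action (or DONE if the goal is achieved):"
        )
        action = llm(prompt)
        history.append(f"Action: {action}")
        if action.strip() == "DONE":
            break
        act(action)
```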
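For function calling, a sketch in which the model emits a structured call and the scaffold executes it; the JSON format here is an assumption for illustration, not any particular provider's tool-call schema:

```python
import json

# Tools the scaffold exposes to the model, e.g., to reduce arithmetic errors.
TOOLS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def dispatch(model_output: str) -> float:
    # The model is prompted to emit calls like {"name": "mul", "args": [6, 7]};
    # the scaffold parses and executes them, returning the result to the model.
    call = json.loads(model_output)
    return TOOLS[call["name"]](*call["args"])

print(dispatch('{"name": "mul", "args": [6, 7]}'))  # 42
```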
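And for “bureaucracies”, a proposer/critic sketch; `llm` is again a hypothetical stand-in, and the two roles could equally be played by two different models:

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # Stand-in for a real model API call.

def solve(task: str, rounds: int = 3) -> str:
    solution = llm(f"Task: {task}\nPropose a solution.")
    for _ in range(rounds):
        # One role critiques, the other revises, in plain text we can inspect.
        critique = llm(f"Task: {task}\nSolution: {solution}\nList flaws.")
        solution = llm(
            f"Task: {task}\nSolution: {solution}\n"
            f"Critique: {critique}\nRevise the solution to address the critique."
        )
    return solution
```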
In practice, for most applications, a simple scaffold will capture most of the value. For instance, the same simple scaffold can reduce the error rates of an AI's output in many types of tasks by a factor of 2 or greater.[5] More specialized scaffolding has allowed LLMs to do some very impressive things in narrow areas.[6]
While we have only been building scaffolding for a few years, some people initially believed that progress in scaffolding would be much faster, almost trivial, and that it would rapidly improve the capabilities of LLMs, perhaps even resulting in AGI.[7] This has not yet been the case. That said, some people are still working on scaffolding as an alternative path to AGI, as compared to scaling. This includes Conjecture, who are working on their CoEM agenda because they believe it has a shot at working, in addition to being safer than simply scaling LLMs.
Ultimately, it is unknown what the limits of scaffolding are.[8]
1. The term “scaffolding” is based on a metaphor of an LLM as a building with structures built up around it, but the origin of its use in the AI context is unclear. One early use of the term in connection with LLMs is from the paper “PromptChainer: Chaining Large Language Model Prompts through Visual Programming”. ↩︎
2. Another way to phrase this is that a scaffold is any structure which calls an LLM as a subroutine. ↩︎
3. For instance, in Minecraft Voyager, GPT-4 was given access to the Mineflayer API, which let it automate in-game actions, and a module that turned the game state into text so the AI could see what was happening. This worked very well: Voyager can make diamond pickaxes from scratch in Minecraft. In 2023, this was considered a very difficult task for an AI to do without being explicitly trained to do so. ↩︎
4. This approach works well when a task is too hard for a single LLM to solve, but can be factored into smaller tasks that single LLMs can solve. This is essentially the assumption of Factored Cognition, which underlies the alignment technique of HCH. ↩︎
5. Several features make a task well-suited to LLMs: 1) observations are not noisy, 2) feedback from actions is fast and noiseless, 3) all of the state required to perform the task is easily compressible into the LLM’s modalities, and 4) the interfaces remain stable. ↩︎
6. There are some very narrow areas which benefit greatly from specialized scaffolding that exploits the structure of the area. An example in mathematics is the “cap set problem”, a central problem in extremal combinatorics. FunSearch is an LLM scaffold which generates new solutions to this problem. These solutions significantly improve on human-generated solutions, both in finite dimensions and asymptotically, and are one of the first examples of a scientific discovery which critically depended on an LLM’s capabilities. ↩︎
7. See Zvi Mowshowitz’s post “On AutoGPT”. ↩︎
8. One potential limit to the use of scaffolding is that its metrics can be subject to Goodhart’s law: more computationally intensive scaffolds can perform worse because they apply greater optimization pressure to the gap between what we want and what the metric measures. ↩︎