What is Infra-Bayesianism?
Infra-Bayesianism tries to address problems in agent foundations, a research agenda that tries to understand the nature of agents and their properties.
It aims to construct a realistic model of agents and a mathematical structure that would point at agents aligned with humans, such that these agents could be found by means of gradient descent.
With these goals, the research starts by solving some problems with traditional reinforcement learning (a machine learning method in which the machine gets rewards based on its actions and is adjusted to be more likely to take actions that lead to high reward).
One such problem is non-realizability: the true environment may not be among the hypotheses the agent can represent. Infra-Bayesianism is a theory of imprecise probability that addresses this by considering hypotheses in the form of convex sets of probability distributions. In practice, this means that a hypothesis can be, for example, “every odd-positioned bit in the string of bits is 1”. (This is the set of all probability distributions over bit strings that assign positive probability only to strings with 1s in every odd position. Any mixture of two such distributions also assigns no probability to strings with a 0 in an odd position, so it is also in the set, which makes the set convex.)
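The convexity argument can be checked numerically on a toy case. The sketch below is only an illustration, not part of the infra-Bayesian formalism itself: it assumes bit strings of length 4 (so they can be enumerated), and the helper names `satisfies_hypothesis`, `in_hypothesis`, `uniform_over` and `mix` are made up for this example.

```python
from itertools import product

N = 4  # assumed string length, kept tiny so all 2**N strings can be enumerated
STRINGS = list(product([0, 1], repeat=N))


def satisfies_hypothesis(bits):
    """True if every odd-positioned bit (1st, 3rd, ...) is 1."""
    return all(b == 1 for b in bits[0::2])


def in_hypothesis(dist, tol=1e-12):
    """A distribution belongs to the set if it puts no mass on violating strings."""
    return all(p <= tol for s, p in dist.items() if not satisfies_hypothesis(s))


def uniform_over(support):
    """Uniform distribution on the given strings, zero everywhere else."""
    support = list(support)
    return {s: (1 / len(support) if s in support else 0.0) for s in STRINGS}


def mix(p, q, lam):
    """Mixture lam * p + (1 - lam) * q of two distributions."""
    return {s: lam * p[s] + (1 - lam) * q[s] for s in STRINGS}


allowed = [s for s in STRINGS if satisfies_hypothesis(s)]
p = uniform_over(allowed)      # spreads mass over every allowed string
q = uniform_over(allowed[:1])  # point mass on a single allowed string
m = mix(p, q, 0.3)             # a mixture of the two

print(in_hypothesis(p), in_hypothesis(q), in_hypothesis(m))
print(in_hypothesis(uniform_over(STRINGS)))  # uniform over everything is outside the set
```

Note that `p` and `q` disagree about how likely each allowed string is, yet both belong to the same hypothesis. The hypothesis only constrains the odd-positioned bits and says nothing about the rest.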
If a problem can be solved, but we can’t specify how we’d solve it given unlimited compute, we’re just confused about it. Going from thinking that chess was impossible for machines to understanding minimax was a really good step forward for designing chess AIs, even though, in practice, calculating the minimax solution of chess is computationally intractable.
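To make the minimax point concrete, here is a minimal sketch on a game small enough to solve exactly. The game is an assumption chosen for illustration, standing in for chess: a pile of stones, each player removes 1 to 3 stones per turn, and whoever takes the last stone wins. The same recursion is what is intractable for chess in practice.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def minimax(stones, maximizing):
    """Game value from the maximizing player's point of view: +1 win, -1 loss."""
    if stones == 0:
        # The player who just moved took the last stone and won,
        # so the player now "to move" has lost.
        return -1 if maximizing else 1
    values = [
        minimax(stones - take, not maximizing)
        for take in (1, 2, 3)
        if take <= stones
    ]
    return max(values) if maximizing else min(values)


for pile in range(1, 9):
    print(pile, minimax(pile, True))
```

In this toy game the player to move loses exactly when the pile size is a multiple of 4, which the exhaustive search recovers without being told.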
Thus, we should seek to figure out how alignment might look in theory, and then try to bridge the theory-practice gap by making our proposal ever more efficient. The first step along this path is to figure out a universal RL setting that we can place our formal agents in, and then prove regret bounds in.
A key problem in doing this is embeddedness. An AI can't have a perfect self-model: this would be like imagining your entire brain inside your brain, and memory is finite. Infra-Bayesianism allows agents to have abstract models of themselves, and thus works in an embedded setting.
Infra-Bayesian Physicalism (IBP) is an extension of this to reinforcement learning (RL). It allows us to:
Figure out which agents are running (by evaluating the counterfactual in which the agent's computation outputs something different, and checking whether the physical universe would be different).
Given a program, classify it as an agent or a non-agent, and then find its utility function.
Researcher Vanessa Kosoy uses this formalism to describe PreDCA, an alignment proposal based on IBP. This proposal assumes that an agent is an IBP agent, meaning that it is an RL agent with fuzzy probability distributions (along with some other things). The general outline of this proposal is as follows:
Find all of the agents that preceded the AI.
Discard those agents that are too powerful or not human-like.
Find the utility functions of the remaining agents.
Use a combination of all of these utilities as the agent's utility function.
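The four steps above can be pictured as a filter-then-aggregate pipeline. The following is a loose toy sketch and not Kosoy's formalism: the `CandidateAgent` fields, the `max_power` threshold and the choice of a plain average as the "combination" are all assumptions made for illustration. In particular, the hard parts (detecting agents and inferring their utility functions) are taken as given inputs here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Outcome = str


@dataclass
class CandidateAgent:
    name: str
    preceded_ai: bool                    # hypothetical flag: existed before the AI
    power: float                         # hypothetical "how powerful" score
    human_like: bool                     # hypothetical human-likeness judgement
    utility: Callable[[Outcome], float]  # assumed to be already inferred


def combined_utility(
    agents: List[CandidateAgent], max_power: float
) -> Tuple[Callable[[Outcome], float], List[CandidateAgent]]:
    # Step 1: keep only agents that preceded the AI.
    preceding = [a for a in agents if a.preceded_ai]
    # Step 2: discard agents that are too powerful or not human-like.
    kept = [a for a in preceding if a.power <= max_power and a.human_like]
    if not kept:
        raise ValueError("no agents left after filtering")

    # Steps 3 and 4: combine the remaining utilities (here: a plain average).
    def utility(outcome: Outcome) -> float:
        return sum(a.utility(outcome) for a in kept) / len(kept)

    return utility, kept


agents = [
    CandidateAgent("alice", True, 1.0, True, lambda o: {"park": 1.0, "mall": 0.0}.get(o, 0.0)),
    CandidateAgent("bob", True, 2.0, True, lambda o: {"park": 0.5, "mall": 1.0}.get(o, 0.0)),
    CandidateAgent("optimizer", True, 100.0, False, lambda o: 1.0),
    CandidateAgent("latecomer", False, 1.0, True, lambda o: -1.0),
]

u, kept = combined_utility(agents, max_power=10.0)
print([a.name for a in kept], u("park"), u("mall"))
```

The source only says that the utilities are combined, so the average is just one possible choice. A weighted sum or another aggregation rule would fit the same outline.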
Kosoy models an AI as a model-based RL system with a world model (a system’s internal representation of its environment, which it uses to predict what will happen, including as a result of its own possible actions).
(A sharp left turn is an event in which an AI’s capabilities suddenly generalize but its alignment does not.)
Whether this proposal also solves inner alignment (the problem of an AI system ending up pursuing a different objective than the one that was specified) remains an open question.
This approach deviates from MIRI's plan, which is to focus on a narrow task to perform a pivotal act (an action that ensures a good long-term outcome for humanity; typically, this means a way to use future advanced AI to prevent catastrophes, including those caused by other AI).