What is brain-like AGI safety?
Brain-like AGI safety is an AI alignment research agenda pursued by Steve Byrnes that asks: "Suppose we someday build an AGI algorithm using similar principles of learning and cognition as the human brain. How would we use such an algorithm safely?"
The brain-like AGI safety agenda stands apart from most other alignment approaches in its foundational assumption that AGI will ultimately use brain-inspired architectures rather than scaled-up versions of current AI paradigms. While many alignment researchers focus on aligning LLM-based systems, recursively self-improving systems, or other designs that don't closely mimic brain structure, Byrnes's work specifically addresses the alignment challenges that would arise if AGI emerges from algorithms implementing brain-like learning and cognition.
A key premise of this agenda is that the brain learns from scratch: most human knowledge and abilities are not hard-coded into neural structure by evolution. Instead, the brain has a generic learning architecture that allows it to adapt to its environment and to form complex values from simpler innate drives.
This approach suggests several practical implications for alignment work. Since a brain-like AGI would learn its values much as humans do, Byrnes proposes that alignment might be achieved through careful design of the "steering system" - the fixed components that provide reinforcement signals to the learning components. By designing this system with the right innate drives and reward signals, we might create AGIs that naturally develop human-compatible goals.
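To make this two-part picture concrete, here is a minimal sketch in Python: a fixed steering component computes rewards from hard-coded innate drives, while a generic learning component acquires its values entirely from those signals. The class names, drive weights, and toy environment are illustrative assumptions for this example, not details from Byrnes's proposal:

```python
import random

# Hypothetical sketch, not Byrnes's actual model: a fixed steering
# component supplies reward, and a generic learner starts from scratch.

class SteeringSystem:
    """Fixed, non-learning component (in Byrnes's framing, roughly the
    hypothalamus and brainstem): maps observations to reward signals
    using hard-coded innate drives."""
    def __init__(self):
        # Innate drives: simple built-in preferences over observed features.
        self.drives = {"food": 1.0, "pain": -2.0, "social": 0.5}

    def reward(self, observation):
        # Reward is the sum of drive weights for the features present.
        return sum(self.drives.get(feature, 0.0) for feature in observation)

class LearningSystem:
    """Generic learner: begins with no knowledge and acquires value
    estimates purely from the steering system's reinforcement signals."""
    def __init__(self, learning_rate=0.1):
        self.values = {}  # learned value estimates, empty "at birth"
        self.lr = learning_rate

    def update(self, state, reward):
        v = self.values.get(state, 0.0)
        self.values[state] = v + self.lr * (reward - v)

# Toy developmental loop: repeated experience shapes learned values.
steering = SteeringSystem()
learner = LearningSystem()
environment = [("kitchen", {"food"}), ("clinic", {"pain"}),
               ("playground", {"social", "food"})]

for _ in range(1000):
    state, observation = random.choice(environment)
    learner.update(state, steering.reward(observation))

print(learner.values)  # roughly: kitchen ~1.0, clinic ~-2.0, playground ~1.5
```

On this picture, the alignment-relevant design choices sit almost entirely in the steering component: everything the learner comes to value is downstream of the drives and reward wiring that designers specify.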
This approach also emphasizes the importance of environment design during training. Just as human values develop through interaction with the surrounding environment and society, a brain-like AGI would form its values through its experiences during development. This suggests that controlled, carefully designed training environments may be critical for alignment.
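The role of environment design can be illustrated with the same kind of toy model. In the hypothetical sketch below, an identical value learner ends up with different learned values depending on which curriculum of experiences it trains on; the state names and reward numbers are invented for the example and not drawn from Byrnes's work:

```python
import random

def train(curriculum, episodes=1000, lr=0.1):
    """Generic value learner: estimates a value for each situation
    purely from the rewards experienced during training."""
    values = {}
    for _ in range(episodes):
        state, reward = random.choice(curriculum)
        v = values.get(state, 0.0)
        values[state] = v + lr * (reward - v)
    return values

# Uncontrolled environment: deceptive behavior sometimes pays off,
# so the learner comes to value it.
uncontrolled = [("be_honest", 0.5), ("deceive", 1.0), ("cooperate", 0.5)]

# Curated environment: designers ensure deception never pays off
# during the developmental period.
curated = [("be_honest", 1.0), ("deceive", -1.0), ("cooperate", 1.0)]

print(train(uncontrolled))  # 'deceive' ends up with the highest value
print(train(curated))       # 'deceive' ends up with a negative value
```

The point is not the specific numbers but that, for a system that learns its values from scratch, the distribution of training experiences directly shapes which values it ends up with.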