Full Abstract:
Methods are currently lacking to prove artificial general intelligence (AGI) safety. An AGI
‘hard takeoff’ is possible, in which first-generation AGI₁ rapidly triggers a succession of more powerful
AGIₙ that differ dramatically in their computational capabilities (AGIₙ ≪ AGIₙ₊₁). No proof exists
that AGI will benefit humans, nor of a sound value-alignment method. Numerous paths toward
human extinction or subjugation have been identified. We suggest that probabilistic proof methods
are the fundamental paradigm for proving safety and value-alignment between disparately powerful
autonomous agents. Interactive proof systems (IPS) describe mathematical communication protocols
wherein a Verifier queries a computationally more powerful Prover and reduces the probability of the
Prover deceiving the Verifier to any specified low probability (e.g., 2⁻¹⁰⁰). IPS procedures can test AGI
behavior control systems that incorporate hard-coded ethics or value-learning methods. Mapping
the axioms and transformation rules of a behavior control system to a finite set of prime numbers
allows validation of ‘safe’ behavior via IPS number-theoretic methods. Many other representations
are needed for proving various AGI properties. Multi-prover IPS, program-checking IPS, and
probabilistically checkable proofs further extend the paradigm. In toto, IPS provides a way to reduce
AGIₙ ↔ AGIₙ₊₁ interaction hazards to an acceptably low level.
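
The 2⁻¹⁰⁰ figure follows from independent repetition: if a deceiving Prover survives any single challenge round with probability at most 1/2, then after k independent rounds it survives with probability at most 2⁻ᵏ. Below is a minimal Python sketch of this amplification, using a hypothetical challenge-guessing game as a stand-in for a real protocol (not the paper's own procedure):

```python
import random

def cheating_prover_survives(rounds: int) -> bool:
    """Simulate a Prover with no valid proof: each round it can only
    guess the Verifier's uniformly random challenge bit, so it survives
    a single round with probability 1/2."""
    for _ in range(rounds):
        challenge = random.getrandbits(1)   # Verifier's random query
        answer = random.getrandbits(1)      # a cheater can do no better than guess
        if answer != challenge:
            return False                    # Verifier catches the deception
    return True

# Compare the empirical deception rate with the analytic bound 2^-k.
k, trials = 10, 200_000
hits = sum(cheating_prover_survives(k) for _ in range(trials))
print(f"empirical: {hits / trials:.5f}   analytic bound: {2**-k:.5f}")
# Setting k = 100 drives the bound down to 2^-100, as in the abstract.
```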
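
The prime-number mapping echoes Gödel numbering: assign each axiom or transformation rule a distinct prime, encode a behavior trace as the product of the primes of the rules it invokes, and unique factorization then lets a Verifier test by divisibility whether only sanctioned rules were used. A hedged sketch with hypothetical rule names (the paper's actual encoding and number-theoretic tests are not reproduced here):

```python
def first_primes(n: int) -> list[int]:
    """Return the first n primes by trial division (adequate for a toy)."""
    primes: list[int] = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# Hypothetical behavior-control rules, each mapped to a distinct prime.
rules = ["preserve_human_life", "accept_shutdown", "report_internal_state"]
prime_of = dict(zip(rules, first_primes(len(rules))))

def encode_trace(rule_sequence: list[str]) -> int:
    """Encode a behavior trace as the product of its rules' primes."""
    certificate = 1
    for rule in rule_sequence:
        certificate *= prime_of[rule]
    return certificate

def uses_only_safe_rules(certificate: int, safe_primes: set[int]) -> bool:
    """By unique factorization, the certificate is 'safe' iff it factors
    entirely into primes assigned to sanctioned rules."""
    for p in safe_primes:
        while certificate % p == 0:
            certificate //= p
    return certificate == 1

cert = encode_trace(["accept_shutdown", "report_internal_state"])
print(uses_only_safe_rules(cert, set(prime_of.values())))       # True
print(uses_only_safe_rules(cert * 97, set(prime_of.values())))  # False: an unlisted rule
```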
Provably Safe Artificial General Intelligence via Interactive Proofs
attributed to: Kristen Carlson
Vulnerabilities & Strengths