Only use AI agents that pursue aspiration-based goals rather than maximization goals
attributed to: Jobst Heitzig
Design and exclusively use AI agents that:
(1) Do not aim to maximize some form of objective function (such as expected total reward / return) but rather aim to fulfill goals specified via "aspirations", which are constraints on the expected value of one or more measurable features of the world trajectory (such as pieces of some good produced, financial costs, energy use, time, etc.).
(2) Learn these quantitative aspirations from humans by actively and continuously inquiring about their goals, and refuse to pursue aspirations that appear unsafe.
(3) Learn a world model that allows predicting the aspiration- and safety-related consequences of possible policies.
(4) Use a number of generic safety criteria (such as avoiding extreme actions, avoiding changes in the environment, not seeking power, being predictable, etc.) to choose among the typically very many possible policies that meet the given aspiration.
(5) Whenever circumstances change, adjust aspirations to what is feasible and appears safe enough under the new circumstances.
(6) Collude and cooperate with each other and with humans to prevent other agents from pursuing maximization goals.
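To make points (1) and (3) concrete, here is a minimal sketch of an aspiration as an interval constraint on an expected feature value, and of filtering candidate policies by a world model's predictions. All names (`Aspiration`, `feasible_policies`, the `world_model` interface) are my own illustration, not part of any existing implementation.

```python
from dataclasses import dataclass

@dataclass
class Aspiration:
    """A goal given as bounds on the expected value of one
    measurable feature of the world trajectory."""
    feature: str   # e.g. "units_produced", "energy_use"
    lower: float   # lower bound on the expected value
    upper: float   # upper bound on the expected value

    def is_met_by(self, expected_value: float) -> bool:
        return self.lower <= expected_value <= self.upper

def feasible_policies(policies, world_model, aspirations):
    """Keep every candidate policy whose predicted expected feature
    values satisfy all aspirations; the world model maps a policy
    to a dict of feature -> predicted expected value."""
    result = []
    for policy in policies:
        predictions = world_model(policy)
        if all(a.is_met_by(predictions[a.feature]) for a in aspirations):
            result.append(policy)
    return result
```

The essential difference from a maximizer is that `feasible_policies` returns a whole set of acceptable policies rather than a single argmax, leaving room for further safety-based selection.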
What part of the alignment problem does this plan aim to solve?
This plan is incomplete in the sense that it does not aim to completely "solve" a part of the problem; it is rather meant to be a helpful (and potentially necessary) ingredient of safe AI systems. It is based on the working hypothesis that "alignment" in the strict sense (identifying and implementing an objective function so well aligned with human values that it is safe to maximize it) might be impossible to achieve, or at least impossible to verify.
Why has that part of the alignment problem been chosen?
Because many doom scenarios appear closely related to the idea of maximization, and non-maximizing agent designs appear to be under-researched.
How does this plan aim to solve the problem?
See above. It does not solve the problem but is meant to make AI systems generically much safer. The rationale for this expectation is that pursuing goals specified as concrete, finite constraints on the expected value of one or more measurable features of the world trajectory has a number of generically safer properties than pursuing maximization goals:
(1) The actual outcomes are more predictable.
(2) The actions required to fulfill the goal are generically less "extreme" (in whatever sense) and thus less dangerous.
(3) The set of policies that fulfill the goal is much larger (it has positive rather than zero measure in policy space), so one can use additional safety criteria to choose from it, and can in addition use a small amount of randomization to avoid accidentally choosing a policy that is extreme in some respect not explicitly taken care of by the criteria used.
What evidence is there that the methods will work?
For one thing, common sense suggests that everyday plans get safer when goals are more modest and better defined.
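Point (3) above can be sketched as follows: rank the feasible policies by a generic safety score, keep only the safest fraction, and then randomize uniformly among them. The function name, the score interface, and the cutoff fraction are all hypothetical choices for illustration.

```python
import random

def choose_policy(feasible, safety_score, keep_fraction=0.2, rng=None):
    """Illustrative selection step: among policies already known to
    meet the aspiration, keep the safest fraction according to a
    generic safety score, then pick one uniformly at random so as
    not to commit deterministically to a policy that might be
    extreme in some respect the explicit criteria do not cover."""
    rng = rng or random.Random()
    ranked = sorted(feasible, key=safety_score, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return rng.choice(ranked[:k])
```

The randomization is only meaningful because the feasible set has positive measure; for a maximizer, the "feasible set" would typically collapse to a single optimal policy.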
On the other hand, preliminary work by my lab suggests that designing aspiration-based agents of the required type is not hard, because many existing designs and algorithms can readily be adapted to pursue aspiration-based goals instead of maximization goals.
What are the most likely causes of this not working?
I believe the main obstacle is competition between aspiration-based and maximizing agents: in principle, even with slightly misspecified objective functions, maximizing agents might individually outperform aspiration-based agents in terms of market value or fitness in the short run (just as generically "unsafer" agents might outperform safer ones in the short run). Even though one can show in theory (using evolutionary game theory in stylized environments) that there exist evolutionarily stable strategies by which aspiration-based agents can suppress the rise of maximizing agents, this finding might not generalize well to more complex environments or to situations where maximizing agents are already prevalent. For this reason, an AI ecosystem of aspiration-based agents must likely be additionally protected by regulation.
Further remarks
(1) Even though the "aspiration" framework includes quantilizers and other forms of satisficers as a special case, where the aspiration is "make the expected value of X at least Y", this type of unbounded aspiration is *not* what this plan is about, since the missing upper bound will not sufficiently rule out extreme behavior. If a user specified such an aspiration, the system would reject it as unsafe and suggest an alternative aspiration with an upper bound sufficiently below the estimated maximum.
(2) It is crucial that aspirations are stated in terms of *expected* values. This allows the system to accept individual realizations that violate the constraints, as long as the constraints are met in expectation (ex ante).
This appears necessary in order to be able to react to stochastic events by adjusting aspirations, so that extreme actions are avoided even after significant good or bad "luck". In particular, a properly designed aspiration-based agent will not aim to "make 100% certain that some variable X gets value Y" (since that would require unsafe actions such as removing all stochasticity from the environment by killing everyone) but rather to "make X equal Y in expectation, with a preferably small (but not minimal!) variance".
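The vetting step from remark (1) above might look like the following sketch: refuse any aspiration with no upper bound (or a bound too close to the estimated maximum) and suggest a bounded alternative. The function name, the absolute safety margin, and the suggestion rule are my own hypothetical choices.

```python
def vet_aspiration(lower, upper, estimated_max, margin=20):
    """Reject unsafe aspirations of the form "make E[X] at least Y"
    (upper is None) or ones whose upper bound sits too close to the
    estimated maximum achievable value; in either case, propose a
    bounded interval sufficiently below that maximum instead."""
    safe_upper = estimated_max - margin
    if upper is None or upper > safe_upper:
        return False, (min(lower, safe_upper), safe_upper)
    return True, (lower, upper)
```

A real system would of course derive the margin from uncertainty estimates rather than a fixed constant; the point is only that satisficer-style one-sided aspirations get rejected rather than executed.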
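Why the expectation-based formulation of remark (2) matters can be seen in a toy fragment (my own simplification, not the author's algorithm): if two available actions have expected outcomes that bracket the aspiration, mixing them probabilistically makes the *expected* outcome hit the aspiration exactly, while individual realizations are allowed to over- or undershoot.

```python
import random

def mix_to_aspiration(low_mean, high_mean, aspiration, rng):
    """Choose between a 'low' and a 'high' action whose expected
    outcomes bracket the aspiration, with exactly the mixing
    probability that makes the expected outcome equal the
    aspiration; no attempt is made to force every realization
    onto the target."""
    p_high = (aspiration - low_mean) / (high_mean - low_mean)
    return "high" if rng.random() < p_high else "low"
```

An agent built this way has no incentive to stamp out stochasticity in its environment: randomness in outcomes is compatible with meeting the goal ex ante.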