Shortest and not the steepest path will fix the inner-alignment problem
attributed to: Thane Ruthenis (https://www.alignmentforum.org/users/thane-ruthenis?from=post_header)
Replacing stochastic gradient descent (SGD) with something that takes the shortest and not the steepest path should just about fix the whole inner-alignment problem.
Original post on alignmentforum: https://www.alignmentforum.org/posts/ThtZrHooK7En9mcZr/greed-is-the-root-of-this-evil

The SGD's greed, to be specific.

Consider an ML model being trained end-to-end from initialization to zero loss. Every individual update to its parameters is calculated to move it in the direction of maximal local improvement to its performance. It doesn't take the shortest path from where it starts to the ridge of optimality; it takes the locally steepest path.

1. What does that mean mechanically?

Roughly speaking, every feature in NNs could likely be put into one of two categories:

- Statistical correlations across training data, aka "the world model".
- The policy: heuristics/shards, mesa-objectives, and inner optimization.

The world-model (WM) can only be learned gradually, because higher-level features/statistical correlations build upon lower-level ones, and therefore the gradients towards learning them only appear after the lower-level ones are learned.

Heuristics, in turn, can only attach to the things that are already present in the world-model (same for values). They're functions of abstractions in the world-model, and they fire in response to certain WM-variables assuming certain values. For example, if the world-model is nonexistent, the only available heuristics are rudimentary instincts along the lines of "if bright light, close eyes". Once higher-level features are learned (like "a cat"), heuristics can become functions of said features too ("do X if see a cat", and later, "do Y if expect the social group to assume state S within N time-steps").

The base objective the SGD is using to train the ML model is, likewise, a function of some feature/abstraction in the training data, like "the English name of the animal depicted in this image" or "the correct action to take in this situation to maximize the number of your descendants in the next generation".
However, that feature is likely a fairly high-level one relative to the sense-data the ML model gets, one that wouldn't be loaded into the ML model's WM until it's been training for a while (the way "genes" are very, very conceptually far from Stone Age humans' understanding of reality).

So, what's the logical path through the parameter-space from initialization to zero loss? Gradually improve the world-model step by step, then, once the abstraction the base objective cares about is represented in the world-model, put in heuristics that are functions of said abstraction, optimized for controlling that abstraction's value.

But that wouldn't do for the SGD. That entire initial phase, where the world-model is learned, would be parsed as "zero improvement" by it. No, the SGD wants results, and fast. Every update must instantly improve performance!

The SGD lives by messy hacks. If the world-model doesn't yet represent the target abstraction, the SGD will attach heuristics to upstream correlates/proxies of that abstraction. And it will spin up a boatload of such messy hacks on the way to zero loss.

A natural side-effect of that is gradient starvation/friction. Once there are enough messy hacks, the SGD won't bother attaching heuristics to the target abstraction even after it's represented in the world-model, because if the extant messy hacks approximate the target abstraction well enough, there's very little performance improvement to be gained by marginally improving the accuracy. Especially since the new heuristics will have to be developed from scratch. The gradients just aren't there: better improve on what's already built.

2. How does that lead to inner misalignment?

It seems plausible that general intelligence is binary. A system is either generally intelligent, or not; it either implements general-purpose search, or it doesn't; it's either an agent/optimizer, or not. There's no continuum here, the difference is sharp. (In a way, it's definitionally true.
How can something be more than general? Conversely, how can something less than generally capable be called "generally intelligent"?)

Suppose that the ML model we're considering will make it all the way to AGI in the course of training. At some point, it will come to implement some algorithm for General-Purpose Search (GPS).

The GPS can come from two places: either it'll be learned as part of the world-model (if the training data include generally intelligent agents, such as humans), or as part of the ML model's own policy. Regardless of the origin, it will almost certainly appear at a later stage of training: it's a high-level abstraction relative to any sense-data I can think of, and the GPS's utility can only be realized if it's given access to an advanced world-model.

So, by the time the GPS is learned, the ML model will have an advanced world-model, plus a bunch of shallow heuristics over it.

By its very nature, the GPS makes heuristics obsolete. It's the qualitatively more powerful optimization algorithm, and one that can, in principle, replicate the behavior of any possible heuristic/spin up any possible heuristic, and do so with greater accuracy and flexibility than the SGD.

If the SGD were patient and intelligent, the path forward would be obvious: pick out the abstraction the base objective cares about in the world-model, re-frame it as the mesa-objective, then aim the GPS at optimizing it. Discard all other heuristics.

However, it's not that easy. Re-interpreting an abstraction as a mesa-objective is a nontrivial task. Even more difficult is the process of deriving the environment-appropriate strategies for optimizing it: the instincts, the technologies, the sciences.

If evolution were intelligent, and had direct write-access to modern geneticists' brains... Well, replacing their entire value system with an obsession with increasing their inclusive genetic fitness wouldn't instantly make effective gene-maximizers of them.
They'll get there eventually, but that will require a significant amount of re-training on their part, despite the fact that they know perfectly well what a "gene" is.

So there wouldn't be strong gradients towards aiming the GPS at the representation of the base objective. No, gradient starvation would rear its head again:

- There'll already be a lot of heuristics aimed at optimizing upstream correlates of the base objective, and their weighted sum will presumably serve as a good proxy objective (inasmuch as the model would've already been optimized for good performance even prior to the GPS' appearance).
- These heuristics will contain a lot of what we want: the instincts and the local knowledge of how to get things done in the local environment.

So the SGD will enslave the GPS to the heuristics. The GPS will be used to improve the heuristics' efficiency, gradually re-interpreting said heuristics as instincts, skills, or mesa-objectives. They will morph from the optimization procedure to the optimization targets.

At this point, the SGD will have two choices regarding the parameter updates: it can spend its "budget" to tweak the mess of mesa-objectives to more closely resemble the actual base objective, or it can just increase the power of the GPS (give it more processing power, more working memory, access to a richer world-model...).

The SGD will probably do both to some extent, but which way will it lean?

Well, as above, the sum of the mesa-objectives should be a good proxy of the base objective at the current level of optimization power, such that optimizing for them optimizes for the base objective. In addition, they will likely take up a whole lot of parameters. It'll take a lot of updates to meaningfully rewrite them, and slight tweaks will correspond to just that: slight performance improvements.

Conversely, slightly improving the GPS will likely lead to outsized improvements in performance.

Up to a point.
Tails do come apart, and at some point, further improvements to capabilities will lead to worse performance on the training objective, requiring adjusting the mesa-objective to be more in line with it. But then, post-adjustment, the mesa-objective will become a good proxy again, adapted for this new level of optimization power, and the SGD will switch to capability improvement again.

It'll switch between these two modes, until...

Until the ML model becomes clever enough to become situationally aware, and try its hand at gradient hacking and deceptive alignment. Then the winds would change for good: improvements to its capabilities will lead to outsized improvements in its ability to derive the base objective at runtime, which will lead to outsized improvements in the ML model's performance on that base objective.

At the same time, adjusting its mesa-objective will still lead only to marginal improvements. (At best. If the ML model is smart enough, it might destroy the gradients between its current values and the base objective altogether, by e.g. committing to output garbage if it notices meddling.)

Thus, there's a fairly narrow range of capabilities at which the ML model is smart enough that tails come apart, yet not smart enough to fall into the deceptive-alignment attractor. While it occupies that range, its mesa-objective will be moved towards the base objective. But by default, I think, it leaves that range fairly quickly.

And so we get deceptive alignment, by strong default.

(In addition, I argue that this causes high path dependence in sufficiently advanced models/AGI-level models, under this formulation:

A non-closed-form understanding of inductive bias would be something like "Here's a rule for figuring out which circuits will be built in the first 10% of training. Run that rule on your dataset and architecture, and write down the circuits. Now here's a rule for what will be learned in the next 10%, which depends on the circuits we already have.
And so forth for the rest of training." The thing which makes it not a closed form is that you have to reason through it step-by-step; you can't skip to end and say "well it all caches out to picking the simplest final solution". This is a very path-dependent way for things to be.

The features the ML models learn, and their order, appear to be a robust function of the training data + the training process, so I suspect there isn't much variance across training runs. But the final mesa-objectives are a function of a function of ... a function of the initially-learned shallow heuristics; I expect there is strong path-dependence in that sense.)

3. Recap

- The SGD's greed causes shallow heuristics.
- Shallow heuristics starve gradients towards more accurate heuristics.
- The GPS, once it appears, gets enslaved to the conglomerate of shallow heuristics.
- These heuristics are gradually re-interpreted as mesa-objectives.
- While the GPS is too weak for the crack between the mesa-objective and the base objective to show, marginally improving it improves the performance on the base objective more than tweaking where it's pointing.
- Once the GPS is smart enough to be deceptive, marginally improving it improves performance on the base objective more than tweaking where it's pointing.

Greed causes shallow heuristics which cause gradient starvation which causes inner misalignment which causes deceptive alignment.

Open questions:

- To what extent do the mesa-objectives get adjusted towards the base objective once the GPS crystallizes?
- How broad is the range between tails-come-apart and deceptive alignment? Can that range be extended somehow?

4. What can be done?

Well, replacing the SGD with something that takes the shortest and not the steepest path should just about fix the whole inner-alignment problem. That's a pipe dream, though.

Failing that, it sure would be nice if we could get rid of all of those pesky heuristics. One way to do that is featured here.
Take an ML model optimized for an objective A. Re-train it to optimize an objective B, picked such that we expect the ML model's WM to feature the abstraction that B is defined over (such as, for example, a diamond). The switch should cause the mess of heuristics for optimizing A to become obsolete, reducing the gradient-starvation effect from them. And, if we've designed the training procedure for B sufficiently well, presumably the steepest gradients will be towards developing heuristics/mesa-objectives that directly attach to the B-representation in the ML model's WM.

John counters that this only works if the training procedure for B is perfect; otherwise the steepest gradient will be towards whatever abstraction is responsible for the imperfection (e.g., caring about "things-that-look-like-a-diamond-to-humans" instead of "diamonds").

Another problem is that a lot of heuristics/misaligned mesa-objectives will presumably carry over. Instrumental convergence and all: things like power-seeking will remain useful regardless of the switch in goals. And even if we do the switch before proper crystallization of the power-seeking mesa-objective, its prototype will carry over, and the SGD will just continue from where it left off.

In fact, this might make the situation worse: the steepest path to achieving zero-loss on the new objective might be "make the ML model a pure deceptively-aligned sociopath that only cares about power/resources/itself", with the new value never forming.

So here's a crazier, radical-er idea:

- Develop John's Pragmascope: a tool that can process some environment/training dataset, and spit out all of the natural abstractions it contains.
- Hook the pragmascope up to a regularizer.
- Train some ML model under that regularizer, harshly penalizing every circuit that doesn't correspond to any feature the pragmascope picked up in the environment the ML model is learning. Ideally, this should result in an idealized generative world-model: the regularizer would continuously purge the shallow heuristics, leaving untouched only the objective statistical correlations across the training data, i.e. the world-model.
- Train that generative world-model until it's advanced enough to represent the GPS (as part of the humans contained in its training data).
- RL-retrain the model to obey human commands, under a more relaxed version of the pragmascope-regularizer + a generic complexity regularizer.

Naively, what we'll get in the end is an honest genie: an AI that consists of the world-model, a general-purpose problem-solving algorithm, and minimal "connective tissue" of the form "if given a command by a human, interpret what they meant using my model of the human, then generate a plan for achieving it".

What's doing what here:

- Using a "clean" pretrained generative world-model as the foundation should ensure that there are no proto-mesa-objectives to lead the SGD astray.
- The continuous application of the pragmascope-regularizer should ensure it will stay this way.
- The complexity regularizer should ensure the AI doesn't develop some separate procedure for interpreting commands (which might end up crucially flawed/misaligned). Instead, it will use the same model of humans it uses to make predictions, and inaccuracies in it would equal inaccuracies in predictions, which would be purged by the SGD as it improves the AI's capabilities.

And so we'll get a corrigible/genie AI.

It sure seems too good to be true, so I'm skeptical on priors, and the pragmascope would be non-trivial to develop. But I don't quite see how it's crucially flawed yet.
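
To make the "penalize everything the pragmascope didn't pick up" step slightly more concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration, not part of the proposal: the pragmascope doesn't exist, so a hand-written boolean mask `approved` stands in for its output; "circuits" are reduced to the two weights of a linear model; the target abstraction is deliberately given a weak (scaled-down) feature so that gradients towards it are small, while a noisy proxy gets a full-strength feature; and plain full-batch gradient descent stands in for the SGD. The names `make_data` and `train` are hypothetical.

```python
import random


def make_data(n=100, seed=0):
    """Toy dataset: one weakly represented target feature, one strong proxy."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        target = rng.gauss(0.0, 1.0)          # abstraction the base objective is defined over
        proxy = target + rng.gauss(0.0, 0.3)  # upstream correlate of that abstraction
        # Feature 0 is the target scaled down 10x (weak gradients towards it);
        # feature 1 is the proxy at full strength.
        data.append(([0.1 * target, proxy], target))
    return data


def train(data, approved, penalty, steps=3000, lr=0.05):
    """Full-batch gradient descent on MSE plus a masked L2 penalty.

    approved[j] is True if feature j is one the hypothetical pragmascope
    reported; weights on every other feature are penalized.
    """
    w = [0.0] * len(approved)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / len(data)
        for j in range(len(w)):
            if not approved[j]:
                grad[j] += 2.0 * penalty * w[j]  # gradient of penalty * w[j]**2
            w[j] -= lr * grad[j]
    return w


data = make_data()
w_plain = train(data, approved=[True, True], penalty=0.0)
w_masked = train(data, approved=[True, False], penalty=10.0)
print("no regularizer:   target weight %.2f, proxy weight %.2f" % tuple(w_plain))
print("with regularizer: target weight %.2f, proxy weight %.2f" % tuple(w_masked))
```

Within the same step budget, the unregularized run should end up leaning mostly on the proxy, while the masked penalty pushes the proxy weight towards zero and forces the weight onto the target feature. This only illustrates the mechanics of a masked penalty; it says nothing about whether a real pragmascope could label circuits this cleanly.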