Introduction
In the crowded space of LLM applications (often dismissively referred to as "wrappers"), most tools follow a predictable pattern: feed text to an LLM and receive a straightforward response. But what if we want to extract structured, hierarchical knowledge from documents? What if we need to organize complex information into an intuitive format that captures relationships between concepts? This was the fundamental challenge that drove the development of the Mindmap Generator.
The goal was ambitious: create a system that could take any text document, regardless of length, complexity, or domain, and generate a comprehensive, hierarchical mindmap that accurately represents its content. This wouldn't be a simple summarization task, but rather a complex knowledge extraction process that maintains fidelity to the source while organizing information into meaningful structures.
The project is completely open-source, and you can look at the code here. For simplicity, it's implemented as a single ~4,500-line Python file.
The Problem Space
Traditional LLM applications typically operate with a simple "prompt → response" paradigm. The Mindmap Generator needed something fundamentally different— an intelligent exploratory system that could:
- Detect the document's type and adapt extraction strategies accordingly
- Identify main topics without missing critical content
- Extract subtopics that meaningfully develop each main topic
- Uncover specific details that support understanding
- Maintain accuracy without introducing fictional content
- Eliminate redundancy across the entire structure
- Balance breadth and depth of analysis
- Handle documents beyond LLM context windows
- Visualize results in multiple useful formats
Solving these problems required a departure from conventional approaches to LLM application development. What emerged was a sophisticated system more akin to a cognitive exploration engine than a typical LLM application.
Architecture: Beyond the Linear Pipeline
The fundamental insight that drove the architecture was that document analysis isn't a linear process; it's an iterative exploration with complex dependencies and feedback loops. Unlike most LLM applications that follow a simple pipeline pattern, the Mindmap Generator employs what I call a "non-linear exploration model."
Whereas traditional LLM applications typically follow a simple pattern:
Input → LLM Prompt → Output
Or perhaps a pipeline:
Input → LLM Prompt 1 → Output 1 → LLM Prompt 2 → Output 2 → Final Result
The Mindmap Generator operates as a multi-dimensional exploration system, where:
- Multiple parallel processes explore different aspects of the document simultaneously
- Feedback loops evaluate the quality and uniqueness of extracted information
- Heuristic-guided decisions determine when to explore deeper or stop exploration
- Verification mechanisms ensure factual accuracy throughout
This approach allows the system to efficiently navigate the vast conceptual space of the document while maintaining coherence and accuracy.
At its core, the system employs a multi-phase approach:
- Document Type Detection: The system begins by sampling the document to determine its fundamental structure and purpose
- Intelligent Chunking: Documents are split into overlapping segments to handle length constraints
- Parallel Topic Extraction: Multiple processes simultaneously identify potential main concepts
- Hierarchical Descent: For each topic, the system recursively explores subtopics and details
- Content Verification: Generated content is verified against the source document
- Redundancy Elimination: Multiple passes detect and remove duplicative content
- Hierarchical Organization: The final structure is assembled into a coherent mindmap
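To make the flow concrete, here is a deliberately simplified, sequential sketch of how these phases might fit together (the method names and dictionary layout here are illustrative, not the project's actual API, and the real system runs many of these steps concurrently):
# Illustrative orchestration sketch only - not the project's actual method names,
# and the real implementation runs most phases concurrently rather than in sequence.
async def generate_mindmap(self, document: str) -> str:
    doc_type = await self.detect_document_type(document)           # 1. Document type detection
    chunks = self.create_overlapping_chunks(document)              # 2. Intelligent chunking
    topics = await self.extract_main_topics(chunks, doc_type)      # 3. Parallel topic extraction
    for topic in topics:                                           # 4. Hierarchical descent
        topic['subtopics'] = await self.extract_subtopics(topic, chunks, doc_type)
        for subtopic in topic['subtopics']:
            subtopic['details'] = await self.extract_details(subtopic, chunks, doc_type)
    mindmap = {'central_theme': {'subtopics': topics}}
    mindmap = await self.verify_mindmap_against_source(mindmap, document)   # 5. Content verification
    mindmap = await self.remove_redundant_content(mindmap)         # 6. Redundancy elimination
    return self._generate_mermaid_mindmap(mindmap)                 # 7. Hierarchical organization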
Let's explore each of these phases in detail.
Document Type Detection: The Foundation of Intelligent Extraction
The cornerstone of the system is its ability to recognize different document types. This isn't simply a classification task; it's about understanding the fundamental structure and organization of information.
The code implements a sophisticated detection system that distinguishes between:
class DocumentType(Enum):
    """Enumeration of supported document types."""
    TECHNICAL = auto()
    SCIENTIFIC = auto()
    NARRATIVE = auto()
    BUSINESS = auto()
    ACADEMIC = auto()
    LEGAL = auto()
    MEDICAL = auto()
    INSTRUCTIONAL = auto()
    ANALYTICAL = auto()
    PROCEDURAL = auto()
    GENERAL = auto()
For each document type, the system employs specialized prompt templates optimized for that particular structure. This allows the LLM to extract information in alignment with how that type of document naturally organizes concepts.
For example, technical documents focus on components, interfaces, and implementations, while narrative documents emphasize plot elements, character development, and themes. This adaptive approach significantly enhances extraction quality by aligning with the document's inherent organization.
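Conceptually, the detected type simply selects which family of prompts drives each extraction stage. A minimal sketch of that dispatch (the helper name is illustrative; the type_specific_prompts dictionary it reads from is shown later in this article):
# Minimal sketch: the detected type indexes into the prompt dictionary built in
# _initialize_prompts() (shown later). The helper name here is illustrative.
def _get_extraction_prompt(self, doc_type: DocumentType, level: str) -> str:
    prompts = self.type_specific_prompts.get(doc_type, self.type_specific_prompts[DocumentType.GENERAL])
    return prompts[level]  # level is 'topics', 'subtopics', or 'details'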
Implementing this detection system wasn't trivial. The challenge was developing a prompt that could reliably distinguish between document types without being overly complicated. The solution involved carefully constructed discriminative features for each document type, combined with a detection prompt that encourages the LLM to reason through the classification (note that this just shows a portion of all the document types from the prompt):
You are analyzing a document to determine its primary type and structure. This document requires the most appropriate conceptual organization strategy.
Key characteristics of each document type:
TECHNICAL
- Contains system specifications, API documentation, or implementation details
- Focuses on HOW things work and technical implementation
- Uses technical terminology, code examples, or system diagrams
- Structured around components, modules, or technical processes
Example indicators: API endpoints, code blocks, system requirements, technical specifications
SCIENTIFIC
- Presents research findings, experimental data, or scientific theories
- Follows scientific method with hypotheses, methods, results
- Contains statistical analysis or experimental procedures
- References prior research or scientific literature
Example indicators: methodology sections, statistical results, citations, experimental procedures
NARRATIVE
- Tells a story or presents events in sequence
- Has character development or plot progression
- Uses descriptive language and scene-setting
- Organized chronologically or by story elements
Example indicators: character descriptions, plot developments, narrative flow, dialogue
BUSINESS
- Focuses on business operations, strategy, or market analysis
- Contains financial data or business metrics
- Addresses organizational or market challenges
- Includes business recommendations or action items
Example indicators: market analysis, financial projections, strategic plans, ROI calculations
The prompt includes extensive differentiation criteria to help the LLM make fine-grained distinctions, such as the difference between technical and procedural documents or scientific and academic papers.
The Challenge of Infinite Loops and Early Termination
One of the most vexing problems in developing the system was preventing "pathological" runs: those that either never terminate or stop prematurely without extracting sufficient content.
This challenge manifested in several ways:
- Unbounded Exploration: The system could potentially continue finding new concepts indefinitely
- Circular References: Topics referring back to each other created cycles in the exploration
- Premature Satisfaction: The system might decide it had "enough" content before adequately covering the document
- Stalled Progress: The system might get stuck generating repetitive content
Solving this required implementing multiple control mechanisms:
Token Limits and Budgeting
At the technical level, the system implements sophisticated token tracking:
class TokenUsageTracker:
    def __init__(self):
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.total_cost = 0
        self.call_counts = {}
        self.token_counts_by_task = {}
        self.cost_by_task = {}
        # Categorize tasks for better reporting
        self.task_categories = {
            'topics': ['extracting_main_topics', 'consolidating_topics', 'detecting_document_type'],
            'subtopics': ['extracting_subtopics', 'consolidate_subtopics'],
            'details': ['extracting_details', 'consolidate_details'],
            'similarity': ['checking_content_similarity'],
            'verification': ['verifying_against_source'],
            'emoji': ['selecting_emoji'],
            'other': []  # Catch-all for uncategorized tasks
        }
This tracking serves both cost control and termination purposes. By monitoring consumption across different categories, the system can enforce limits:
# Set strict LLM call limits with increased bounds
max_llm_calls = {
'topics': 20, # Increased from 15
'subtopics': 30, # Increased from 20
'details': 40 # Increased from 24
}
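These budgets act as hard guards around each category of extraction call; conceptually something like this (the guard helper shown here is illustrative, while self._llm_calls is the per-category counter the real code increments):
# Illustrative guard: stop issuing calls in a category once its budget is spent.
def _within_call_budget(self, category: str) -> bool:
    return self._llm_calls.get(category, 0) < max_llm_calls[category]

# ...later, inside the extraction loop:
if not self._within_call_budget('subtopics'):
    logger.info("Subtopic call budget exhausted - skipping further subtopic extraction")
    break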
Minimum Content Requirements
To prevent premature termination, the system implements minimum thresholds:
# Set minimum content requirements with better distribution
min_requirements = {
'topics': 4, # Minimum topics to process
'subtopics_per_topic': 2, # Minimum subtopics per topic
'details_per_subtopic': 3 # Minimum details per subtopic
}
These thresholds ensure the system doesn't terminate until it has extracted sufficient content to create a meaningful mindmap.
Adaptive Exploration Strategy
Perhaps the most sophisticated mechanism is the adaptive exploration strategy. The system dynamically balances breadth (covering more topics) and depth (exploring topics in greater detail) based on document complexity.
This is implemented through a "sufficient content" heuristic:
def has_sufficient_content():
    if completion_status['processed_topics'] < min_requirements['topics']:
        return False
    if completion_status['total_topics'] > 0:
        avg_subtopics_per_topic = (completion_status['processed_subtopics'] /
                                   completion_status['processed_topics'])
        if avg_subtopics_per_topic < min_requirements['subtopics_per_topic']:
            return False
    # Process at least 75% of available topics before considering early stop
    if completion_status['total_topics'] > 0:
        topics_processed_ratio = completion_status['processed_topics'] / completion_status['total_topics']
        if topics_processed_ratio < 0.75:
            return False
    return True
This function checks multiple criteria to determine if the current state of extraction is sufficient. It considers both absolute minimums and relative completeness, preventing both premature termination and endless exploration.
Word Count Monitoring
Another critical mechanism is monitoring the overall word count:
# Calculate document word count and set limit
doc_words = len(document_content.split())
word_limit = min(doc_words * 0.9, 8000) # Cap at 8000 words
current_word_count = 0
# Enhanced word limit check with buffer
if current_word_count > word_limit * 0.95: # Increased from 0.9 to ensure more completion
logger.info(f"Approaching word limit at {current_word_count}/{word_limit:.0f} words")
break
This ensures the system doesn't generate more content than the original document, which would suggest potential confabulation.
The Redundancy Problem: Fighting Against Information Duplication
One of the persistent challenges in developing the system was preventing redundancy, or the same information appearing multiple times in different parts of the mindmap. This is particularly difficult because:
- Similar concepts might be expressed using different language
- The same information might be relevant to multiple topics
- Subtopics across different main topics might overlap
- LLMs tend to repeat themselves when generating multiple items
Solving this required implementing multiple layers of redundancy detection:
Multilevel Fuzzy Matching
The first line of defense is traditional fuzzy string matching:
async def is_similar_to_existing(self, name: str, existing_names: Union[dict, set], content_type: str = 'topic') -> bool:
"""Check if name is similar to any existing names using stricter fuzzy matching thresholds."""
# Lower thresholds to catch more duplicates
base_threshold = {
'topic': 75, # Lower from 85 to catch more duplicates
'subtopic': 70, # Lower from 80 to catch more duplicates
'detail': 65 # Lower from 75 to catch more duplicates
}[content_type]
# Get threshold for this content type
threshold = base_threshold
# Adjust threshold based on text length - be more lenient with longer texts
if len(name) < 10:
threshold = min(threshold + 10, 95) # Stricter for very short texts
elif len(name) > 100:
threshold = max(threshold - 15, 55) # More lenient for long texts
The system adjusts thresholds based on content type and text length, recognizing that different types of content have different natural levels of similarity. It then applies multiple similarity metrics:
# Calculate multiple similarity metrics
basic_ratio = fuzz.ratio(name, existing_clean)
partial_ratio = fuzz.partial_ratio(name, existing_clean)
token_sort_ratio = fuzz.token_sort_ratio(name, existing_clean)
token_set_ratio = fuzz.token_set_ratio(name, existing_clean)
Each metric captures a different aspect of similarity, from character-level matching to token-level comparisons.
LLM-Based Semantic Similarity
Fuzzy matching alone isn't sufficient to catch all redundancies, especially those expressed with different words but conveying the same concept. This is where LLM-based similarity detection comes in:
async def check_similarity_llm(self, text1: str, text2: str, context1: str, context2: str) -> bool:
"""LLM-based similarity check between two text elements with stricter criteria."""
prompt = f"""Compare these two text elements and determine if they express similar core information, making one redundant in the mindmap.
Text 1 (from {context1}):
"{text1}"
Text 2 (from {context2}):
"{text2}"
A text is REDUNDANT if ANY of these apply:
1. It conveys the same primary information or main point as the other text
2. It covers the same concept from a similar angle or perspective
3. The semantic meaning overlaps significantly with the other text
4. A reader would find having both entries repetitive or confusing
5. One could be safely removed without losing important information
A text is DISTINCT ONLY if ALL of these apply:
1. It focuses on a clearly different aspect or perspective
2. It provides substantial unique information not present in the other
3. It serves a fundamentally different purpose in context
4. Both entries together provide significantly more value than either alone
5. The conceptual overlap is minimal
When in doubt, mark as REDUNDANT to create a cleaner, more focused mindmap.
Respond with EXACTLY one of these:
REDUNDANT (overlapping information about X)
DISTINCT (different aspect: X)
This prompt instructs the LLM to perform a deep semantic comparison, looking not just at textual similarity but at conceptual overlap. The criteria are deliberately strict, favoring false positives (marking distinct content as redundant) over false negatives (missing redundancies).
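The model's one-line verdict is then collapsed into a boolean; roughly like this (generate_completion is a stand-in name for the project's actual LLM wrapper):
# Sketch: reduce the "REDUNDANT (...)" / "DISTINCT (...)" verdict to a boolean.
response = await self._retry_with_exponential_backoff(
    lambda: self.generate_completion(prompt)  # stand-in name for the LLM call
)
verdict = response.strip().upper()
return verdict.startswith("REDUNDANT")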
Batch Redundancy Processing
To make this process efficient, the system employs batch processing:
async def _process_content_batch(self, content_items: List[ContentItem]) -> Set[int]:
"""Process a batch of content items to identify redundant content with parallel processing."""
redundant_indices = set()
comparison_tasks = []
comparison_counter = 0
# Create cache of preprocessed texts to avoid recomputing
processed_texts = {}
for idx, item in enumerate(content_items):
# Normalize text for comparison
text = re.sub(r'\s+', ' ', item.text.lower().strip())
text = re.sub(r'[^\w\s]', '', text)
processed_texts[idx] = text
# Limit concurrent API calls
semaphore = asyncio.Semaphore(10) # Adjust based on API limits
This batching approach allows the system to efficiently process large numbers of comparisons in parallel, while still maintaining control over API usage through semaphores.
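Schematically, the queued comparisons are then awaited together, with the semaphore capping how many LLM calls are in flight at once (the pair selection and attribute names below are simplified stand-ins, not the exact implementation):
# Simplified continuation: run pairwise comparisons under the semaphore and collect
# the indices judged redundant. candidate_pairs stands in for the pre-filtered list
# of (i, j) pairs that survive the cheaper fuzzy-matching checks.
async def compare_pair(i: int, j: int) -> None:
    async with semaphore:
        if await self.check_similarity_llm(
            content_items[i].text, content_items[j].text,
            content_items[i].path, content_items[j].path  # path attribute assumed
        ):
            redundant_indices.add(j)  # keep the earlier item, mark the later one redundant

await asyncio.gather(*(compare_pair(i, j) for i, j in candidate_pairs))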
Multi-Pass Filtering
The system implements multiple filtering passes at different stages:
- Early redundancy checks during initial extraction
- Topic-level deduplication to ensure main topics are distinct
- Subtopic filtering across all topics
- Detail-level redundancy checks within each subtopic
- Final full-hierarchy pass to catch cross-level redundancies
This multi-layered approach ensures redundancy is caught at every level of the hierarchy.
The Confabulation Problem: Ensuring Factual Accuracy
Perhaps the most insidious problem in LLM-based document analysis is what I would call "confabulation": the generation of plausible-sounding content that is not supported by the source document. This is slightly different from the better-known "hallucination" problem, because in many cases these details might very well be true; they simply aren't contained in, or derivable from, the particular source document we are using. This happens because LLMs naturally fill in gaps based on their training data; they already have all sorts of knowledge about the world, and sometimes can't help piping up when they think they can usefully add something. But this leads to some highly undesirable outcomes. For example, when we processed a document from 1914 (see below), before adding the safeguards against confabulation, the system would often insert statistics from 2023 or other anachronistic information.
To combat this, the Mindmap Generator implements a comprehensive verification system:
Reality Checking Against Source
The core of this system is a verification process that checks every generated node against the original document:
async def verify_mindmap_against_source(self, mindmap_data: Dict[str, Any], original_document: str) -> Dict[str, Any]:
"""Verify all mindmap nodes against the original document with lenient criteria and improved error handling."""
try:
logger.info("\n" + "="*80)
logger.info(colored("🔍 STARTING REALITY CHECK TO IDENTIFY POTENTIAL CONFABULATIONS", "cyan", attrs=["bold"]))
logger.info("="*80 + "\n")
The verification prompt carefully balances fact-checking against the need for reasonable interpretation:
prompt = f"""You are an expert fact-checker verifying if information in a mindmap can be reasonably derived from the original document.
Task: Determine if this {node_type} is supported by the document text or could be reasonably inferred from it.
{node_type.title()}: "{node_text}"
Path: {path_str}
Document chunk:
{chunk}
VERIFICATION GUIDELINES:
1. The {node_type} can be EXPLICITLY mentioned OR reasonably inferred from the document, even through logical deduction
2. Logical synthesis, interpretation, and summarization of concepts in the document are STRONGLY encouraged
3. Content that represents a reasonable conclusion or implication from the document should be VERIFIED
4. Content that groups, categorizes, or abstracts ideas from the document should be VERIFIED
5. High-level insights that connect multiple concepts from the document should be VERIFIED
6. Only mark as unsupported if it contains specific claims that DIRECTLY CONTRADICT the document
7. GIVE THE BENEFIT OF THE DOUBT - if the content could plausibly be derived from the document, verify it
8. When uncertain, LEAN TOWARDS VERIFICATION rather than rejection - mindmaps are meant to be interpretive, not literal
9. For details specifically, allow for more interpretive latitude - they represent insights derived from the document
10. Consider historical and domain context that would be natural to include in an analysis
This prompt allows for reasonable interpretation and synthesis (a mindmap isn't meant to be a direct quote of the document) while still rejecting claims that contradict or aren't supported by the source.
Verification Statistics Tracking
The system tracks detailed verification statistics:
# Track verification statistics
verification_stats = {
'total': len(all_nodes),
'verified': 0,
'not_verified': 0,
'by_type': {
'topic': {'total': 0, 'verified': 0},
'subtopic': {'total': 0, 'verified': 0},
'detail': {'total': 0, 'verified': 0}
}
}
These statistics help assess the overall accuracy of the mindmap and identify potential issues.
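As each node's verdict comes back, the counters are updated along both dimensions. A sketch of the bookkeeping:
# Sketch of the bookkeeping as verification verdicts come back for each node.
for node in all_nodes:
    node_type = node.get('type', 'detail')
    if node_type not in verification_stats['by_type']:
        continue  # e.g., the root node isn't tracked by type
    verification_stats['by_type'][node_type]['total'] += 1
    if node.get('verified', False):
        verification_stats['verified'] += 1
        verification_stats['by_type'][node_type]['verified'] += 1
    else:
        verification_stats['not_verified'] += 1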
Parallel Processing with Rate Limiting
To optimize throughput while controlling costs, the system employs parallel processing with rate limiting:
# Limit concurrent API calls
semaphore = asyncio.Semaphore(MAX_CONCURRENT_TASKS)
async def process_chunk(chunk: str) -> List[Dict[str, Any]]:
"""Process a single chunk with semaphore control."""
async with semaphore:
return await self._retry_with_exponential_backoff(
lambda: extract_from_chunk(chunk)
)
# Process chunks concurrently
chunk_results = await asyncio.gather(
*(process_chunk(chunk) for chunk in content_chunks)
)
This approach maximizes throughput while staying within API rate limits.
Progressive Exploration with Early Stopping
Rather than exhaustively analyzing the entire document, the system uses progressive exploration with early stopping:
# Don't stop early if we haven't processed minimum topics
should_continue = (topic_idx <= min_requirements['topics'] or
not has_sufficient_content() or
completion_status['processed_topics'] < len(main_topics) * 0.75)
if not should_continue:
logger.info(f"Stopping after processing {topic_idx} topics - sufficient content gathered")
break
This heuristic-based approach ensures the system focuses computational resources where they provide the most value.
The Emoji Selection Subsystem
A seemingly minor but surprisingly complex component of the system is the emoji selection subsystem. This enhances the visual representation of concepts in the mindmap:
async def _select_emoji(self, text: str, node_type: str = 'topic') -> str:
"""Select appropriate emoji for node content with persistent cache."""
cache_key = (text, node_type)
# First check in-memory cache
if cache_key in self._emoji_cache:
return self._emoji_cache[cache_key]
# If not in cache, generate emoji
try:
prompt = f"""Select the single most appropriate emoji to represent this {node_type}: "{text}"
Requirements:
1. Return ONLY the emoji character - no explanations or other text
2. Choose an emoji that best represents the concept semantically
3. For abstract concepts, use metaphorical or symbolic emojis
4. Default options if unsure:
- Topics: 📄 (document)
- Subtopics: 📌 (pin)
- Details: 🔹 (bullet point)
5. Be creative but clear - the emoji should intuitively represent the concept
This system includes:
- Persistent caching to disk for reuse across runs
- Node type-specific defaults
- Fallback strategies for error conditions
- Asynchronous cache saving to avoid blocking
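A minimal sketch of what that persistent cache might look like (the file name and helper names here are assumptions, not the project's actual code):
import asyncio
import json
import os

# Assumed file name and helper names - a sketch of a persistent emoji cache.
EMOJI_CACHE_PATH = "emoji_cache.json"

def load_emoji_cache() -> dict:
    """Load {(text, node_type): emoji} mappings from disk, if a cache file exists."""
    if not os.path.exists(EMOJI_CACHE_PATH):
        return {}
    with open(EMOJI_CACHE_PATH, "r", encoding="utf-8") as f:
        raw = json.load(f)
    # JSON keys must be strings, so each tuple key is stored as "text\x1fnode_type"
    return {tuple(key.split("\x1f", 1)): emoji for key, emoji in raw.items()}

async def save_emoji_cache(cache: dict) -> None:
    """Write the cache to disk without blocking the event loop."""
    raw = {"\x1f".join(key): emoji for key, emoji in cache.items()}
    def _write() -> None:
        with open(EMOJI_CACHE_PATH, "w", encoding="utf-8") as f:
            json.dump(raw, f, ensure_ascii=False)
    await asyncio.to_thread(_write)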
The result is a visually enhanced mindmap where each concept has an intuitive emoji representation.
Asynchronous Design Patterns
An important architectural decision was implementing the system using asynchronous programming patterns. This wasn't just for performance— it was essential for handling the complex dependencies and parallel operations:
async def _retry_with_exponential_backoff(self, func, *args, **kwargs):
    """Enhanced retry mechanism with jitter and circuit breaker."""
    retries = 0
    max_retries = self.retry_config['max_retries']
    base_delay = self.retry_config['base_delay']
    max_delay = self.retry_config['max_delay']
    while retries < max_retries:
        try:
            return await func(*args, **kwargs)
        except Exception as e:
            retries += 1
            if retries >= max_retries:
                raise
            delay = min(base_delay * (2 ** (retries - 1)), max_delay)
            actual_delay = random.uniform(0, delay)
            logger.warning(f"Attempt {retries}/{max_retries} failed: {str(e)}. "
                           f"Retrying in {actual_delay:.2f}s")
            await asyncio.sleep(actual_delay)
This exponential backoff pattern with jitter is a classic approach to handling transient failures in distributed systems. It allows the system to gracefully recover from API errors or rate limiting.
The system also uses semaphores to control concurrency:
# Initialize concurrent processing controls
semaphore = asyncio.Semaphore(MAX_CONCURRENT_TASKS)
And employs event-based early stopping:
early_stop = asyncio.Event()
async def process_chunk(chunk: str) -> List[Dict[str, Any]]:
"""Process a single chunk with semaphore control."""
if early_stop.is_set():
return []
async with semaphore:
chunk_details = await self._retry_with_exponential_backoff(
lambda: extract_from_chunk(chunk)
)
# Check if we've reached minimum details
if len(self._current_details) >= MINIMUM_VALID_DETAILS:
early_stop.set()
return chunk_details
These asynchronous patterns allow the system to efficiently manage complex dependencies while maximizing throughput.
Output Formats and Visualization
The final challenge was presenting the extracted knowledge in useful formats. The system generates three complementary outputs:
Mermaid Mindmap Syntax
The primary output is a Mermaid mindmap diagram (to see a real example, click here):
def _generate_mermaid_mindmap(self, concepts: Dict[str, Any]) -> str:
"""Generate complete Mermaid mindmap syntax from concepts.
Args:
concepts (Dict[str, Any]): The complete mindmap concept hierarchy
Returns:
str: Complete Mermaid mindmap syntax
"""
mindmap_lines = ["mindmap"]
# Start with root node - ignore any name/text for root, just use document emoji
self._add_node_to_mindmap({'name': ''}, mindmap_lines, indent_level=1)
# Add all main topics under root
for topic in concepts.get('central_theme', {}).get('subtopics', []):
self._add_node_to_mindmap(topic, mindmap_lines, indent_level=2)
return "\n".join(mindmap_lines)
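The emitted syntax ends up looking something like this (a schematic fragment with made-up node names, purely to illustrate the shape and importance-marker conventions described later in this article):
mindmap
    ((📄))
        ((⚙️ First Main Topic))
            (📌 A Supporting Subtopic)
                [♦️ A high-importance detail drawn from the document]
                [🔹 A lower-importance supporting detail]
        ((🌍 Second Main Topic))
            (📌 Another Subtopic)
                [🔸 A medium-importance detail]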
This syntax is then transformed into an interactive HTML visualization:
def generate_mermaid_html(mermaid_code):
# Remove leading/trailing triple backticks if present
mermaid_code = mermaid_code.strip()
if mermaid_code.startswith('```') and mermaid_code.endswith('```'):
mermaid_code = mermaid_code[3:-3].strip()
# Create the data object to be encoded
data = {
"code": mermaid_code,
"mermaid": {"theme": "default"}
}
json_string = json.dumps(data)
compressed_data = zlib.compress(json_string.encode('utf-8'), level=9)
base64_string = base64.urlsafe_b64encode(compressed_data).decode('utf-8').rstrip('=')
edit_url = f'https://mermaid.live/edit#pako:{base64_string}'
# Now generate the HTML template
html_template = f'''<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Mermaid Mindmap</title>
<!-- Tailwind CSS -->
<link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/tailwind.min.css" rel="stylesheet">
<!-- Mermaid JS -->
<script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/mermaid.min.js"></script>
...
Markdown Outline
The system also generates a markdown outline for easy reference (to see a real example, click here):
def _convert_mindmap_to_markdown(self, mermaid_syntax: str) -> str:
"""Convert Mermaid mindmap syntax to properly formatted Markdown outline.
Args:
mermaid_syntax (str): The Mermaid mindmap syntax string
Returns:
str: Properly formatted Markdown outline
"""
markdown_lines = []
# Split into lines and process each (skip the 'mindmap' header)
lines = mermaid_syntax.split('\n')[1:]
for line in lines:
# Skip empty lines
if not line.strip():
continue
# Count indentation level (number of 4-space blocks)
indent_level = len(re.match(r'^\s*', line).group()) // 4
# Extract the content between node shapes
content = line.strip()
# Handle different node types based on indent level
if indent_level == 1 and '((📄))' in content: # Root node
continue # Skip the document emoji node
elif indent_level == 2: # Main topics
# Extract content between (( and ))
node_text = re.search(r'\(\((.*?)\)\)', content)
if node_text:
if markdown_lines: # Add extra newline between main topics
markdown_lines.append("")
current_topic = node_text.group(1).strip()
markdown_lines.append(f"# {current_topic}")
markdown_lines.append("") # Add blank line after topic
elif indent_level == 3: # Subtopics
# Extract content between ( and )
node_text = re.search(r'\((.*?)\)', content)
if node_text:
if markdown_lines and not markdown_lines[-1].startswith("#"):
markdown_lines.append("")
current_subtopic = node_text.group(1).strip()
markdown_lines.append(f"## {current_subtopic}")
markdown_lines.append("") # Add blank line after subtopic
elif indent_level == 4: # Details
# Extract content between [ and ]
node_text = re.search(r'\[(.*?)\]', content)
if node_text:
detail_text = node_text.group(1).strip()
markdown_lines.append(detail_text)
markdown_lines.append("") # Add blank line after each detail
These multiple formats ensure the extracted knowledge is accessible in different contexts and for different purposes.
Performance and Results
The end result is a system that can efficiently process documents of arbitrary complexity and generate insightful mindmaps. Testing across different document types and LLM providers yielded some interesting insights:
- Provider Strengths: Different LLM providers excel at different document types. For technical content, GPT models often produced more precise hierarchies, while Claude models excelled at narrative documents.
- Cost vs. Quality: The system's optimizations allow it to work effectively with more affordable models, though premium models naturally produce higher quality results.
- Size Handling: The system effectively handles documents of any size, from short articles to book-length manuscripts, maintaining consistent quality.
- Domain Adaptation: The document-type detection system allows effective adaptation to various domains without requiring specialized training.
The most compelling validation came from the Durnovo Memo test case, a complex historical document that predicted World War I with remarkable accuracy. The system successfully extracted a comprehensive mindmap that captured the memo's key predictions and analyses, demonstrating its ability to handle sophisticated content.
The Durnovo Memo: A Test Case Across LLM Providers
This repository includes a fascinating historical document as a test case - the famous Durnovo memo from 1914, which remarkably predicted World War I and the Russian Revolution. For more about this incredible document, see my article about it here.
Historical Significance
The Durnovo Memorandum was written by Pyotr Durnovo, a Russian statesman, to Tsar Nicholas II in February 1914, months before the outbreak of World War I. With astonishing prescience, Durnovo warned about:
- The inevitability of war between Germany and Russia if European tensions continued
- How such a war would lead to social revolution in Russia
- The collapse of monarchies across Europe
- The specific dangers Russia faced in a prolonged European conflict
The memo has been hailed as one of the most accurate political predictions in modern history (indeed, the title of my article about it is "The Most Impressive Prediction of All Time"). It's long, complex, and makes sophisticated arguments based on evidence, logic, and analysis, all of which makes it an excellent test document for putting the mindmap generator through its paces.
Cross-Provider Comparison
We've processed this document using all four supported LLM providers to demonstrate how each handles the complex historical content. The results showcase each provider's strengths and unique approaches to concept extraction and organization.
OpenAI (GPT-4o-mini)
OpenAI's model produced a concise, well-structured mindmap with clear hierarchical organization:
- Mermaid Syntax (2.8 KB)
- Interactive HTML (5.7 KB)
- Markdown Outline (2.5 KB)
GPT-4o-mini excels at producing compact, efficient mindmaps that capture essential concepts without redundancy. Its output is characterized by clear categorization and precise language. Best of all, it's incredibly fast and cheap.
Anthropic (Claude)
Claude produced a more detailed mindmap with richer contextual information:
- Mermaid Syntax (4.1 KB)
- Interactive HTML (7.3 KB)
- Markdown Outline (3.8 KB)
Claude's approach tends to include more nuanced historical context and captures subtle relationships between concepts. Its output is particularly strong in preserving the memo's analytical reasoning. The only problem is cost: it is many times more expensive than all the other options, but the results are not necessarily better.
DeepSeek
DeepSeek generated the most comprehensive mindmap with extensive subtopics and details:
- Mermaid Syntax (9.0 KB)
- Interactive HTML (15 KB)
- Markdown Outline (8.4 KB)
DeepSeek's output is notable for its thoroughness and depth of analysis, although I didn't like how it kept referencing "the text" in its extracted details, unlike the other models (this could easily be fixed by tweaking the prompts, though). It extracted more subtleties from the document and was generally very strong. The only issues I noticed in my testing were some performance and reliability problems with their API; it seemed to run much more slowly than, say, the OpenAI model.
Google Gemini
Gemini created a balanced mindmap with strong thematic organization:
- Mermaid Syntax (5.5 KB)
- Interactive HTML (9.6 KB)
- Markdown Outline (5.0 KB)
Although I am generally critical of Google's AI offerings when it comes to difficult coding challenges, I was pleasantly surprised by the quality of the output from Gemini, and it is the cheapest by far of the four models, with quite good performance. The main annoyance is how complicated and time-consuming it is to figure out how to get an API key that isn't rate-limited to death, which requires wading through endless, complex screens in their Google Cloud Console.
If someone from Google is reading this: you guys really need to simplify this process and make a simple, standalone API key available for people who just want to use the Gemini API without all the extra stuff. This isn't rocket science: you can literally just copy the interface from OpenAI, DeepSeek, or Anthropic for this. They've all figured out that you just need to make it really simple and obvious. You just need to get a credit card (or even use Google Pay) and generate an API key. It does not need to be integrated with the global system for all Google Cloud services on the front end (it can be on the backend, but that's an internal implementation detail and of no interest to potential users). You are preventing people from even trying your service by throwing up all these pointless barriers, presumably because of dysfunctional internal politics and bureaucracy.
Key Observations from Cross-Provider Testing
This multi-provider approach reveals interesting patterns:
- Content Organization Differences: Each provider structures the document's concepts differently, revealing their unique approaches to conceptual organization
- Detail Granularity Variance: The level of detail varies significantly, with DeepSeek providing the most comprehensive extraction and OpenAI the most concise
- Emoji Selection Patterns: Each model has distinct tendencies in selecting representative emojis for concepts
- Historical Context Sensitivity: Models differ in how they handle historical context, with Claude showing particular strength in preserving historical nuance
- Structured Knowledge Representation: The differences highlight various approaches to knowledge organization from the different AI research teams
The sample outputs serve as both demonstration of the tool's capabilities and an interesting comparative study of how different LLM providers approach the same complex historical document.
Integrated Logging System for Process Visualization
The Mindmap Generator implements a sophisticated logging system that provides clear visibility into the complex internal processes. This system transforms diagnostic output into a functional interface that streamlines development and debugging:
def get_logger():
"""Mindmap-specific logger with colored output for generation stages."""
logger = logging.getLogger("mindmap_generator")
if not logger.handlers:
handler = logging.StreamHandler()
# Custom formatter that adds colors specific to mindmap generation stages
def colored_formatter(record):
message = record.msg
# Color-code specific mindmap generation stages and metrics
if "Starting mindmap generation" in message:
message = colored("🚀 " + message, "cyan", attrs=["bold"])
elif "Detected document type:" in message:
doc_type = message.split(": ")[1]
message = f"📄 Document Type: {colored(doc_type, 'yellow', attrs=['bold'])}"
# ...more color-coding logic...
The color-coded logging serves several practical purposes:
- Creating visual separation between concurrent processes
- Using color conventions to signal errors (red), warnings (yellow), and successes (green)
- Employing emoji prefixes to mark different stages of mindmap generation
- Visualizing progress metrics to track system advancement
In a system with multiple concurrent processes and non-linear execution flow, this logging approach creates a comprehensible narrative of the system's operation. It transforms raw diagnostic information into a readable process visualization, significantly improving debugging efficiency and system monitoring capabilities. To see what the logging output looks like in practice, see here.
Semantic Boundary Preservation in Text Chunking
To avoid overwhelming the model's context window (and even when a document fits within it, these "value priced" models perform drastically worse the more of the context you use), we employ a text chunking strategy that goes well beyond simple character-count splitting. This approach recognizes that proper handling of document boundaries is crucial for maintaining semantic coherence:
# Create overlapping chunks with boundary optimization
chunk_size = min(8000, len(content) // 3) if len(content) > 6000 else 4000
overlap = 250 # Characters of overlap between chunks
# Create overlapping chunks
content_chunks = []
start = 0
while start < len(content):
end = min(start + chunk_size, len(content))
# Extend to nearest sentence end if possible
if end < len(content):
next_period = content.find('.', end)
if next_period != -1 and next_period - end < 200: # Don't extend too far
end = next_period + 1
chunk = content[start:end]
content_chunks.append(chunk)
start = end - overlap if end < len(content) else end
This chunking implementation includes several technical innovations:
- Adaptive sizing that adjusts chunk size based on document length
- Boundary detection that extends chunks to coincide with sentence endings
- Controlled overlap to maintain context between adjacent chunks
- Safeguards against excessive chunk growth when seeking sentence boundaries
By preserving semantic units and maintaining context across chunk boundaries, this approach significantly improves concept extraction quality compared to fixed-size chunking methods. It demonstrates how fundamental operations like text splitting require careful engineering in LLM applications to maintain semantic integrity.
Multi-Metric Similarity Detection for Redundancy Elimination
We also implement a fairly sophisticated redundancy detection system that combines multiple similarity metrics to identify conceptual duplicates even when they use different phrasing:
# Calculate multiple similarity metrics
basic_ratio = fuzz.ratio(name, existing_clean)
partial_ratio = fuzz.partial_ratio(name, existing_clean)
token_sort_ratio = fuzz.token_sort_ratio(name, existing_clean)
token_set_ratio = fuzz.token_set_ratio(name, existing_clean)
# For numbered items, compare without numbers
existing_without_number = numbered_pattern.sub(r'\1', existing_clean)
if name_without_number != name or existing_without_number != existing_clean:
number_ratio = fuzz.ratio(name_without_number, existing_without_number)
basic_ratio = max(basic_ratio, number_ratio)
# Weight ratios differently based on content type
if content_type == 'topic':
final_ratio = max(
basic_ratio,
token_sort_ratio * 1.1, # Increased weight
token_set_ratio * 1.0 # Increased weight
)
The system incorporates several advanced techniques:
- Content-specific thresholds that adjust based on whether comparisons involve topics, subtopics, or details
- Length-adaptive thresholds that apply different standards to short versus long text
- Special case handling for numbered items to properly compare content like "1. Introduction" vs "2. Overview"
- Weighted combinations of different similarity metrics optimized for each content type
- Multi-stage filtering that uses computationally efficient methods before more expensive semantic comparisons
This approach addresses the complex challenge of identifying when differently worded concepts express the same fundamental idea. It demonstrates the technical sophistication required to identify semantic redundancy across varying phrasings and contexts.
Incremental Processing with Adaptive Early Termination
The Mindmap Generator implements an efficient incremental processing system that continuously evaluates when additional computation would yield diminishing returns:
# Continue processing based on multiple criteria
should_continue = (topic_idx <= min_requirements['topics'] or
not has_sufficient_content() or
completion_status['processed_topics'] < len(main_topics) * 0.75)
if not should_continue:
logger.info(f"Stopping after processing {topic_idx} topics - sufficient content gathered")
break
The early termination logic incorporates several technical considerations:
- Minimum content requirements ensure basic coverage across topics and subtopics:
min_requirements = {
    'topics': 4,                 # Minimum topics to process
    'subtopics_per_topic': 2,    # Minimum subtopics per topic
    'details_per_subtopic': 3    # Minimum details per subtopic
}
- Content balance evaluation tracks whether extracted content maintains appropriate distribution:
def has_sufficient_content():
    if completion_status['processed_topics'] < min_requirements['topics']:
        return False
    if completion_status['total_topics'] > 0:
        avg_subtopics_per_topic = (completion_status['processed_subtopics'] /
                                   completion_status['processed_topics'])
        if avg_subtopics_per_topic < min_requirements['subtopics_per_topic']:
            return False
    # ... more checks ...
- Proportional coverage ensures processing of at least 75% of identified topics before considering termination
- Document-proportional word limits prevent generating content that exceeds appropriate bounds:
# Calculate document word count and set limit
doc_words = len(document_content.split())
word_limit = min(doc_words * 0.9, 8000)  # Cap at 8000 words
This adaptive approach optimizes computational resource allocation by focusing processing on areas that provide the greatest value. It exemplifies the classic computer science tradeoff between exploration (examining more content) and exploitation (analyzing specific content in greater depth).
Structural Integrity Preservation During Verification
The verification system in the Mindmap Generator incorporates sophisticated logic to maintain structural coherence even when strict verification would otherwise create fragmented outputs:
# Apply structure preservation when verification would remove too much content
min_topics_required = 3
min_verification_ratio = 0.4 # Lower threshold - only filter if less than 40% verified
# Count verified topics
verified_topics = len([n for n in all_nodes if n.get('type') == 'topic' and n.get('verified', False)])
# If verification removed too much content, we need to preserve structure
if verified_topics < min_topics_required or verification_percentage < min_verification_ratio * 100:
logger.warning(f"Verification would remove too much content (only {verified_topics} topics verified). Using preservation mode.")
# Mark important structural nodes as verified to preserve mindmap structure
for node in all_nodes:
# Always keep root and topic nodes
if node.get('type') in ['root', 'topic']:
node['verified'] = True
# Keep subtopics with a high enough importance
elif node.get('type') == 'subtopic' and not node.get('verified', False):
# Keep subtopics if they have verified details or are needed for structure
has_verified_details = any(
n.get('verified', False) and n.get('type') == 'detail' and n.get('path') == node.get('path', []) + [node.get('text', '')]
for n in all_nodes
)
if has_verified_details:
node['verified'] = True
This implementation provides several technical advantages:
- Minimum viable structure requirements (at least 3 topics, at least 40% verified content)
- Hierarchical verification influence, where verified details can preserve parent subtopics
- Graceful degradation that maintains structural integrity even with partial verification
- Explicit thresholds that make preservation behavior predictable and adjustable
- Transparent logging when preservation mode activates
This approach illustrates the engineering challenge of balancing competing requirements - in this case, factual accuracy versus structural coherence. The system resolves this tension through nuanced preservation strategies that maintain output utility even when strict verification cannot be fully achieved.
Cost-Efficiency Implementation
Another challenge was balancing the quality of extraction against computational cost. LLM API calls can get expensive, especially when you're making hundreds or even thousands of them, and naive implementations could easily rack up substantial costs.
The system implements multiple cost-optimization strategies:
Provider-Specific Adjustments
First, we estimate the rough cost of each LLM API call based on the provider's pricing model and the purpose of the call within the larger system:
# Cost tracking (prices in USD per token)
OPENAI_INPUT_TOKEN_PRICE = 0.15/1000000 # GPT-4o-mini input price
OPENAI_OUTPUT_TOKEN_PRICE = 0.60/1000000 # GPT-4o-mini output price
ANTHROPIC_INPUT_TOKEN_PRICE = 0.80/1000000 # Claude 3.5 Haiku input price
ANTHROPIC_OUTPUT_TOKEN_PRICE = 4.00/1000000 # Claude 3.5 Haiku output price
DEEPSEEK_CHAT_INPUT_PRICE = 0.27/1000000 # Chat input price (cache miss)
DEEPSEEK_CHAT_OUTPUT_PRICE = 1.10/1000000 # Chat output price
DEEPSEEK_REASONER_INPUT_PRICE = 0.14/1000000 # Reasoner input price (cache miss)
DEEPSEEK_REASONER_OUTPUT_PRICE = 2.19/1000000 # Reasoner output price (includes CoT)
GEMINI_INPUT_TOKEN_PRICE = 0.075/1000000 # Gemini 2.0 Flash Lite input price estimate
GEMINI_OUTPUT_TOKEN_PRICE = 0.30/1000000 # Gemini 2.0 Flash Lite output price estimate
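From these constants, each call's cost is just a linear function of its token counts; schematically (the helper name here is illustrative):
# Schematic per-call cost calculation (the function name is illustrative).
def estimate_call_cost(input_tokens: int, output_tokens: int, provider: str = "openai") -> float:
    prices = {
        "openai": (OPENAI_INPUT_TOKEN_PRICE, OPENAI_OUTPUT_TOKEN_PRICE),
        "anthropic": (ANTHROPIC_INPUT_TOKEN_PRICE, ANTHROPIC_OUTPUT_TOKEN_PRICE),
        "deepseek": (DEEPSEEK_CHAT_INPUT_PRICE, DEEPSEEK_CHAT_OUTPUT_PRICE),
        "gemini": (GEMINI_INPUT_TOKEN_PRICE, GEMINI_OUTPUT_TOKEN_PRICE),
    }
    input_price, output_price = prices[provider]
    return input_tokens * input_price + output_tokens * output_price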
The Mindmap Generator implements precise cost tracking to maximize output quality while minimizing API expenses. It employs multiple cost-optimization techniques:
- Provider-specific cost calculations that account for different pricing models
- Extensive caching to prevent redundant API calls (similar caching is implemented for topics, subtopics, details, and even emoji selections):
# Check cache first for document type with strict caching
doc_type_key = hashlib.md5(document_content[:1000].encode()).hexdigest()
if doc_type_key in self._content_cache:
doc_type = self._content_cache[doc_type_key]
else:
doc_type = await self.detect_document_type(document_content, request_id)
self._content_cache[doc_type_key] = doc_type
self._llm_calls['topics'] += 1
- Concurrency control using semaphores to optimize throughput while respecting rate limits:
# Limit concurrent API calls
semaphore = asyncio.Semaphore(MAX_CONCURRENT_TASKS)
- Statistical tracking that breaks down token usage by task category:
self.task_categories = {
    'topics': ['extracting_main_topics', 'consolidating_topics', 'detecting_document_type'],
    'subtopics': ['extracting_subtopics', 'consolidate_subtopics'],
    'details': ['extracting_details', 'consolidate_details'],
    'similarity': ['checking_content_similarity'],
    'verification': ['verifying_against_source'],
    'emoji': ['selecting_emoji'],
    'other': []  # Catch-all for uncategorized tasks
}
- Early stopping mechanisms that terminate processing when additional computation would yield minimal improvements
This comprehensive approach to cost management enables efficient utilization of LLM capabilities while maintaining predictable and reasonable expenses. The detailed token tracking also provides transparency into how computational resources are allocated across different system components.
Detailed Token Usage Reporting
We include a comprehensive token usage reporting system that provides fine-grained analytics on system operation:
def print_usage_report(self):
"""Print a detailed usage report to the console."""
summary = self.get_enhanced_summary()
# Helper to format USD amounts
def fmt_usd(amount):
return f"${amount:.6f}"
# Helper to format percentages
def fmt_pct(percentage):
return f"{percentage:.2f}%"
# Helper to format numbers with commas
def fmt_num(num):
return f"{num:,}"
# Find max task name length for proper column alignment
max_task_length = max([len(task) for task in summary['calls_by_task'].keys()], default=30)
task_col_width = max(max_task_length + 2, 30)
report = [
"\n" + "="*80,
colored("📊 TOKEN USAGE AND COST REPORT", "cyan", attrs=["bold"]),
"="*80,
"",
f"Total Tokens: {fmt_num(summary['total_tokens'])} (Input: {fmt_num(summary['total_input_tokens'])}, Output: {fmt_num(summary['total_output_tokens'])})",
f"Total Cost: {fmt_usd(summary['total_cost_usd'])}",
f"Total API Calls: {fmt_num(summary['total_calls'])}",
"",
colored("BREAKDOWN BY CATEGORY", "yellow", attrs=["bold"]),
"-"*80,
"Category".ljust(15) + "Calls".rjust(10) + "Call %".rjust(10) + "Tokens".rjust(12) + "Token %".rjust(10) + "Cost".rjust(12) + "Cost %".rjust(10),
"-"*80
]
The reporting system provides several analytical capabilities:
- Category-based metrics that group related operations (topics, subtopics, details, verification)
- Properly formatted financial calculations showing precise costs to six decimal places
- Proportional analysis showing percentage breakdowns of calls, tokens, and costs
- Sorting by cost to highlight the most expensive operations first
- Consistent tabular formatting with careful alignment for readability
The enhanced summary generation function provides a structured data representation:
def get_enhanced_summary(self) -> Dict[str, Any]:
"""Get enhanced usage summary with category breakdowns and percentages."""
total_calls = sum(self.call_counts.values())
total_cost = sum(self.cost_by_task.values())
# Calculate percentages for call counts by category
call_percentages = {}
for category, count in self.call_counts_by_category.items():
call_percentages[category] = (count / total_calls * 100) if total_calls > 0 else 0
# Calculate percentages for token counts by category
token_percentages = {}
for category, counts in self.token_counts_by_category.items():
total_tokens = counts['input'] + counts['output']
token_percentages[category] = (total_tokens / (self.total_input_tokens + self.total_output_tokens) * 100) if (self.total_input_tokens + self.total_output_tokens) > 0 else 0
# Calculate percentages for cost by category
cost_percentages = {}
for category, cost in self.cost_by_category.items():
cost_percentages[category] = (cost / total_cost * 100) if total_cost > 0 else 0
return {
"total_input_tokens": self.total_input_tokens,
"total_output_tokens": self.total_output_tokens,
"total_tokens": self.total_input_tokens + self.total_output_tokens,
"total_cost_usd": round(self.total_cost, 6),
"total_calls": total_calls,
"calls_by_task": dict(self.call_counts),
"token_counts_by_task": self.token_counts_by_task,
"cost_by_task": {task: round(cost, 6) for task, cost in self.cost_by_task.items()},
"categories": {
category: {
"calls": count,
"calls_percentage": round(call_percentages[category], 2),
"tokens": self.token_counts_by_category[category],
"tokens_percentage": round(token_percentages[category], 2),
"cost_usd": round(self.cost_by_category[category], 6),
"cost_percentage": round(cost_percentages[category], 2)
}
for category, count in self.call_counts_by_category.items()
}
}
This detailed reporting serves multiple practical purposes:
- Cost optimization by identifying the most expensive operations
- Performance tuning by highlighting operations with high token consumption
- Architectural refinement by revealing patterns in API usage
- Budget planning by providing accurate cost projections
The token usage reporting system exemplifies how instrumentation and analytics can provide valuable insights into complex system operations, enabling data-driven optimization and performance tuning. You can see an example of what the token tracking looks like in practice here.
Node Shape System for Visual Hierarchy
The Mindmap Generator implements a formal node shape system that enhances the visual hierarchy of the generated mindmaps:
class NodeShape(Enum):
    """Enumeration of node shapes for the mindmap structure."""
    ROOT = '(())'     # Double circle for root node (📄)
    TOPIC = '(())'    # Double circle for main topics
    SUBTOPIC = '()'   # Single circle for subtopics
    DETAIL = '[]'     # Square brackets for details

    def apply(self, text: str) -> str:
        """Apply the shape to the text."""
        return {
            self.ROOT: f"(({text}))",
            self.TOPIC: f"(({text}))",
            self.SUBTOPIC: f"({text})",
            self.DETAIL: f"[{text}]"
        }[self]
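A quick usage example showing what each shape produces:
# Example usage of NodeShape.apply() with sample node text:
NodeShape.TOPIC.apply("Strategic Risks")           # -> "((Strategic Risks))"
NodeShape.SUBTOPIC.apply("Alliance Dynamics")      # -> "(Alliance Dynamics)"
NodeShape.DETAIL.apply("Naval expansion details")  # -> "[Naval expansion details]"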
This shape system is applied systematically when generating the Mermaid mindmap syntax:
def _format_node_line(self, node: Dict[str, Any], indent_level: int) -> str:
    """Format a single node in Mermaid syntax."""
    indent = '    ' * indent_level  # Four spaces per level, matching the Markdown converter
    # For root node, always return just the document emoji
    if indent_level == 1:
        return f"{indent}((📄))"
    # Get the node text and escape it
    if 'text' in node:
        # For detail nodes
        importance = node.get('importance', 'low')
        marker = {'high': '♦️', 'medium': '🔸', 'low': '🔹'}[importance]
        text = self._escape_text(node['text'])
        return f"{indent}[{marker} {text}]"
    else:
        # For topic and subtopic nodes
        node_name = self._escape_text(node['name'])
        emoji = node.get('emoji', '')
        if emoji and node_name:
            node_name = f"{emoji} {node_name}"
        # For main topics (level 2)
        if indent_level == 2:
            return f"{indent}(({node_name}))"
        # For subtopics (level 3)
        return f"{indent}({node_name})"
The node shape system provides several visual benefits:
- Clear hierarchical distinction between different levels of information
- Visual grouping through consistent shape usage at each level
- Importance indicators for details through differentiated markers (♦️, 🔸, 🔹)
- Consistent indentation that reinforces the hierarchical structure
- Special handling for the root node to create a visually distinct starting point
The system also includes careful text escaping to ensure proper rendering:
def _escape_text(self, text: str) -> str:
"""Replace parentheses with Unicode alternatives and handle other special characters."""
# Replace regular parentheses in content text with Unicode alternatives
for original, replacement in self.paren_replacements.items():
text = text.replace(original, replacement)
# Handle percentages
text = self.percentage_regex1.sub(r'\1%', text)
text = self.percentage_regex2.sub('%', text)
# Replace special characters while preserving needed symbols
text = self.special_chars_regex.sub('', text)
# Clean up multiple backslashes
text = self.backslash_regex.sub(r'\\', text)
return text
This escaping logic prevents syntax conflicts between content and Mermaid's structural markers, ensuring reliable rendering regardless of the content's complexity.
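The paren_replacements mapping referenced above isn't shown in this excerpt; presumably it swaps ASCII parentheses for look-alike characters that Mermaid won't interpret as node delimiters, something along these lines:
# Assumed contents of paren_replacements (not shown in the excerpt above):
# full-width Unicode parentheses that Mermaid does not treat as node syntax.
self.paren_replacements = {
    '(': '（',
    ')': '）',
}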
The result is a visually structured mindmap where the hierarchy is immediately apparent through consistent shape conventions, enhancing readability and information access.
Document Type-Specific Prompt Templates
The Mindmap Generator implements a comprehensive system of document type-specific prompts that adapt extraction strategies based on document structure:
def _initialize_prompts(self) -> None:
"""Initialize type-specific prompts from a configuration file or define them inline."""
self.type_specific_prompts = {
DocumentType.TECHNICAL: {
'topics': """Analyze this technical document focusing on core system components and relationships.
First, identify the major architectural or technical components that form complete, independent units of functionality.
Each component should be:
- A distinct technical system, module, or process
- Independent enough to be understood on its own
- Critical to the overall system functionality
- Connected to at least one other component
Avoid topics that are:
- Too granular (implementation details)
- Too broad (entire system categories)
- Isolated features without system impact
- Pure documentation elements
Think about:
1. What are the core building blocks?
2. How do these pieces fit together?
3. What dependencies exist between components?
4. What are the key technical boundaries?
Format: Return a JSON array of component names that represent the highest-level technical building blocks.""",
This template system includes specialized prompts for multiple document types, including:
- Technical documents focusing on system components and interfaces
- Scientific documents emphasizing research methodologies and results
- Narrative documents highlighting plot elements and character development
- Business documents extracting strategic initiatives and market opportunities
- Academic documents focusing on theoretical frameworks and scholarly arguments
- Legal documents identifying principles, rights, and obligations
- Medical documents emphasizing clinical approaches and treatment protocols
- Instructional documents extracting learning objectives and skill development
Each document type has tailored prompts for three extraction levels:
- Topics: The main conceptual areas of the document
- Subtopics: Supporting elements for each main topic
- Details: Specific facts, examples, or explanations for each subtopic
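To make the three levels concrete, here is a hypothetical sketch of the shapes their outputs tend to take. The names and values are invented; the text and importance fields on details mirror what the Mermaid generation code shown earlier expects:
# Hypothetical shapes of the three extraction levels (content is illustrative only)
topics = ["Storage Engine", "Query Planner"]

subtopics = {
    "Storage Engine": ["Write-Ahead Log", "Page Cache"],
}

details = {
    "Write-Ahead Log": [
        {"text": "Entries are fsynced before acknowledgement", "importance": "high"},
        {"text": "Segment files roll over at 64 MB", "importance": "low"},
    ],
}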
The system automatically falls back to general prompts when needed:
# Add default prompts for any missing document types
for doc_type in DocumentType:
if doc_type not in self.type_specific_prompts:
self.type_specific_prompts[doc_type] = self.type_specific_prompts[DocumentType.GENERAL]
This approach enables the system to adapt its extraction strategy based on the document's structure and purpose. A technical document is analyzed differently from a narrative document, with extraction focused on the elements most relevant to that document type.
The type-specific templates represent a form of domain knowledge encoding - they embed understanding of different document structures directly into the extraction process, significantly improving the quality and relevance of the generated mindmaps.
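As a rough, self-contained sketch of this lookup-and-fallback pattern (the enum members and prompt text here are trimmed placeholders, not the project's real values):
from enum import Enum, auto

class DocumentType(Enum):
    TECHNICAL = auto()
    NARRATIVE = auto()
    GENERAL = auto()

# Trimmed placeholder prompts; the real table holds the full type-specific text shown earlier.
type_specific_prompts = {
    DocumentType.TECHNICAL: {'topics': "Analyze this technical document focusing on core system components..."},
    DocumentType.GENERAL: {'topics': "Identify the main topics this document covers..."},
}

# Any document type without its own prompts falls back to the general ones.
for doc_type in DocumentType:
    type_specific_prompts.setdefault(doc_type, type_specific_prompts[DocumentType.GENERAL])

def build_topics_prompt(doc_type: DocumentType, chunk: str) -> str:
    """Combine the type-specific instructions with a document chunk before sending it to the LLM."""
    return f"{type_specific_prompts[doc_type]['topics']}\n\nDocument:\n{chunk}"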
JSON Response Parsing and Error Recovery
The Mindmap Generator implements a robust system for parsing and normalizing JSON responses from LLMs, with extensive error recovery mechanisms:
def _clean_json_response(self, response: str) -> str:
"""Enhanced JSON response cleaning with advanced recovery and validation."""
if not response or not isinstance(response, str):
logger.warning("Empty or invalid response type received")
return "[]" # Return empty array as safe default
try:
# First try to find complete JSON structure
def find_json_structure(text: str) -> Optional[str]:
# Look for array pattern
array_match = re.search(r'\[[\s\S]*?\](?=\s*$|\s*[,}\]])', text)
if array_match:
return array_match.group(0)
# Look for object pattern
object_match = re.search(r'\{[\s\S]*?\}(?=\s*$|\s*[,\]}])', text)
if object_match:
return object_match.group(0)
return None
# Handle markdown code blocks first
if '```' in response:
code_blocks = re.findall(r'```(?:json)?([\s\S]*?)```', response)
if code_blocks:
for block in code_blocks:
if json_struct := find_json_structure(block):
response = json_struct
break
else:
if json_struct := find_json_structure(response):
response = json_struct
The system includes multiple layers of error handling:
- Character cleaning for problematic control characters and quotes:

# Remove control characters while preserving valid whitespace
text = self.control_chars_regex.sub('', text)
# Normalize quotes and apostrophes
text = text.replace('“', '"').replace('”', '"')  # Smart double quotes to straight double quotes
text = text.replace('‘', "'").replace('’', "'")  # Smart single quotes to straight single quotes
text = text.replace("'", '"')  # Convert single quotes to double quotes

- JSON syntax fixing for common structural issues:

# Fix trailing/multiple commas
text = re.sub(r',\s*([\]}])', r'\1', text)  # Remove trailing commas
text = re.sub(r',\s*,', ',', text)  # Remove multiple commas
# Fix missing quotes around keys
text = re.sub(r'(\{|\,)\s*([a-zA-Z_][a-zA-Z0-9_]*)\s*:', r'\1"\2":', text)

- Bracket balancing for unclosed structures:

# Ensure proper array/object closure
brackets_stack = []
for char in text:
    if char in '[{':
        brackets_stack.append(char)
    elif char in ']}':
        if not brackets_stack:
            continue  # Skip unmatched closing brackets
        if (char == ']' and brackets_stack[-1] == '[') or (char == '}' and brackets_stack[-1] == '{'):
            brackets_stack.pop()
# Close any unclosed brackets
while brackets_stack:
    text += ']' if brackets_stack.pop() == '[' else '}'

- Structure normalization to ensure consistent formats:

def normalize_structure(text: str) -> str:
    try:
        # Try parsing to validate
        parsed = json.loads(text)
        # Ensure we have an array
        if isinstance(parsed, dict):
            # Convert single object to array
            return json.dumps([parsed])
        elif isinstance(parsed, list):
            return json.dumps(parsed)
        else:
            return json.dumps([str(parsed)])
    except json.JSONDecodeError:
        # If still invalid, attempt emergency recovery
        if text.strip().startswith('{'):
            return f"[{text.strip()}]"  # Wrap object in array
        elif not text.strip().startswith('['):
            return f"[{text.strip()}]"  # Wrap content in array
        return text

- Final validation with safe fallbacks:

# Final validation
try:
    json.loads(response)  # Verify we have valid JSON
    return response
except json.JSONDecodeError as e:
    logger.warning(f"Final JSON validation failed: {str(e)}")
    # If all cleaning failed, return empty array
    return "[]"
This comprehensive approach to JSON parsing addresses the common challenge of inconsistent LLM outputs. Rather than failing on imperfect responses, the system implements multiple recovery strategies that transform problematic text into usable data structures. This error recovery capability is essential for maintaining system reliability when working with inherently variable LLM outputs.
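As a compressed, standalone sketch of the overall flow (it covers only fence stripping, trailing-comma removal, and the safe fallback, not every recovery layer described above):
import json
import re

def clean_json_response(response: str) -> str:
    """Minimal sketch: strip markdown fences, fix trailing commas, fall back to an empty array."""
    # Prefer the contents of a ```json ... ``` block if the model wrapped its answer in one
    blocks = re.findall(r'```(?:json)?([\s\S]*?)```', response)
    if blocks:
        response = blocks[0]
    # Remove trailing commas before a closing bracket or brace
    response = re.sub(r',\s*([\]}])', r'\1', response)
    try:
        json.loads(response)
        return response.strip()
    except json.JSONDecodeError:
        return "[]"  # safe default, mirroring the full implementation's fallback

messy = '```json\n["Topic A", "Topic B",]\n```'
print(clean_json_response(messy))  # ["Topic A", "Topic B"]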
Conclusion: Beyond Traditional LLM Applications
The Mindmap Generator represents a departure from conventional approaches to LLM applications. Rather than treating LLMs as simple text-in/text-out systems, it leverages them as components in a sophisticated cognitive architecture.
The key insights from this project have broader implications for LLM application development:
- Non-Linear Architectures: Complex tasks benefit from non-linear architectures with feedback loops and adaptive exploration.
- Specialized Task Decomposition: Breaking complex tasks into highly specialized subtasks allows for more precise control and better results.
- Verification as First-Class Concern: Fact-checking and verification should be core components, not afterthoughts.
- Cost as Design Constraint: API costs should be treated as fundamental design constraints, driving architectural decisions.
- Rich Visualization: The value of LLM-extracted knowledge is amplified through appropriate visualization and presentation.
Perhaps most importantly, this project demonstrates that with careful engineering, LLMs can be guided to perform complex structural tasks that go far beyond the simple text generation they're typically used for. The future of LLM applications lies not in prompting wizardry, but in sophisticated architectures that combine LLM capabilities with traditional software engineering principles.
Thanks for reading this blog post! I hope you enjoyed it. If you did, I would really appreciate it if you checked out my web app, FixMyDocuments.com. It's a very useful service that leverages powerful AI tools to transform your documents from poorly formatted or scanned PDFs into beautiful, markdown formatted versions that can be easily edited and shared. Once you have processed a document, you can generate all sorts of derived documents from it with a single click, including:
- Real interactive multiple choice quizzes you can take and get graded on (and share with anyone using a publicly accessible custom hosted URL).
- Anki flashcards for studying, with a slick, interactive interface (and which you can also share with others).
- A slick HTML presentation slide deck based on your document, or a PDF presentation formatted using LaTeX.
- A really detailed and penetrating executive summary of your document.
- Comprehensive "mindmap" diagrams and outlines that explore your document thoroughly.
- Readability analysis and grade level versions of your original document.
- Lesson plans generated from your document, where you can choose the level of the target audience.
It's useful for teachers, tutors, business people, and more. When you sign up using a Google account, you get enough free credits to process several documents. Give it a try!