Artificial General Intelligence

(Twitter) Paper Summary: “AGI Safety From First Principles”

Dickson Wu
9 min read · Nov 5, 2021

Paper by: Richard Ngo

This is going to be a bit different from my regular articles. I originally wrote it as a Twitter thread, but decided to put it on Medium too! You can find the thread here!

The goal of this paper is to highlight the risk of AGI in the current context in a super detailed manner!

The second species argument: once AGI becomes more intelligent and powerful than us, we lose the power to control our own future.

This paper will delve into this theory in detail!

What’s intelligence? The ability to do well on a range of tasks.

In that case, we can split AI into 2 categories: AI that's optimized for specific tasks, and AI that can quickly learn to do new tasks with little or no task-specific training.

Narrow intelligence = like electricity. It’s super powerful, but we have to design specific applications to harness its power!

RL and other ML algorithms also only do well when we train them on specific tasks.

General Intelligence = like human evolution. We evolved in a specific environment and developed specific skills (language, cooperation, the ability to share ideas, abstraction, etc.)

Those skills let us learn and adapt quickly to the tasks of the modern-day world…

GPT-2 and GPT-3 are good examples of this: their training objective was just predicting the next word in a sentence.

But they’re able to generalize to a ton of other language tasks!
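As a rough sketch of what that next-word objective looks like in code (my own toy illustration, not something from the paper; real GPT models put a large transformer stack between the embedding and the output head):

```python
# Toy sketch of the next-token prediction objective. All names are illustrative.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

def next_token_loss(token_ids):
    # token_ids: (batch, seq_len) tensor of integer token ids
    hidden = embed(token_ids[:, :-1])      # stand-in for the transformer stack
    logits = lm_head(hidden)               # a distribution over the vocabulary
    targets = token_ids[:, 1:]             # each position predicts the next token
    return nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )

loss = next_token_loss(torch.randint(0, vocab_size, (2, 16)))
print(loss.item())
```

Everything the model learns about language has to come out of minimizing this one loss, which is what makes the downstream generalization so striking.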

Of course, narrow and general intelligence lie on a spectrum, since we can't cleanly classify systems as one or the other.

AlphaZero was trained by playing against itself, but generalized to playing against humans. You could classify it as both general and narrow intelligence.

But it’s expected that narrow intelligence is going to achieve superhuman status well before general intelligence.

Tasks like driving, medicine, and mathematics (if we have the right training data) are all possible!

But other tasks, like being a CEO, are going to be hard, because they require taking in sooooo many factors from the world to make the right decisions.

Plus there aren't that many CEOs, so it's hard to train a model on the task.

How would we train a general model to be a good CEO?

We just train it on a lot of adjacent tasks to develop those cognitive skills

Ex: Training it on decision-making in a simulated world could allow it to generalize to real-world decision-making.

Just like how humans do it!

A potential obstacle to the AGI argument is that there were specific pressures that led to the development of “intelligence” in humans

But we can just re-create that

Another one is that we might need quantum properties of neurons; but the brain is too messy to depend on those properties.

AGI = when AI outperforms humans at everything. It's possible: transistors fire faster than neurons, and we can scale machines up by orders of magnitude in size.

Plus our brains weren't even built for modern-day tasks! So an AGI could really specialize in things like math or linguistic competence.
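To put rough numbers on the speed point (ballpark figures of my own, not from the paper):

```python
# Back-of-the-envelope comparison, with rough illustrative numbers only.
neuron_firing_hz = 200        # biological neurons fire at most a few hundred times per second
transistor_clock_hz = 2e9     # modern chips are clocked at a few GHz
print(transistor_clock_hz / neuron_firing_hz)   # ~10,000,000x faster signalling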

AGIs can also replicate. Individual AGI systems can then super-specialize at certain tasks, allowing for a collective AGI intelligence.

Much like how humans collaborated to dominate the world over the last 10,000 years, it's possible that AGIs do the same.

AGIs can also undergo recursive self-improvement. They'll be the ones driving AI research, finding better and better architectures, training regimes, optimizers, etc.

[This really sounds like the field of Meta-learning & NAS]

Alrighty, but how will they decide what to do? Or want to cooperate in order to achieve their goals? What are these goals and motives?

The concern with AGI is its goals: it could end up with all the power and control over our destiny… But why would it thirst for this power?

There are 3 reasons:

  • Power is a tool to get their main goal
  • Power is their main goal
  • They didn’t even aim for it

If their main goals involve self-preservation, resource acquisition, technological development, or self-improvement → they could end up seeking power in order to achieve those goals.

But that’s under the assumption that the AGIs are focusing on large-scale goals.

Who knows what kinds of motives it will have!

It could have short-term goals: humans expanded across the world because individuals just wanted a bit of improvement.

Or it could have 0 goals, and end up like a calculator.

There's a difference between the goal we give an AI and the goals it actually ends up with.

Most of the time it’s competence without comprehension.

An AI can do damn well at Dota or StarCraft (think OpenAI Five or AlphaStar), but does it actually understand what the goal is? Probably not.

(This is an example of #3: AGIs can get really good at beating the market, but end up screwing with society)

Here’s a framework for being goal-directed:

1. Self-awareness: Knows it's part of the world and that it can impact the world

2. Planning: Foresees many possible sequences of behaviours

3. Consequentialism: Decides which one is best by evaluating the outcomes of each

4. Scale: Considers the effects of its actions over long time horizons

5. Coherence: Internally aligned to implement the plan

6. Flexibility: Able to adapt its plans as circumstances change

These aren't binary classifications; they exist on a spectrum, and they aren't exact. A system can be a mixture along these dimensions.

But goal-directedness is not the same thing as influence-seeking.

You can have an AGI with all those characteristics that still has the goal of being subordinate to humans.

Or an AGI missing some of those traits that still seeks out power!
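One way to picture "spectrum, not binary", and the separation from power-seeking (purely my own illustration, not the paper's formalism), is as a profile of scores plus a separate goal:

```python
# Illustrative sketch only: the six dimensions as scores in [0, 1],
# kept separate from the content of the agent's goal.
from dataclasses import dataclass

@dataclass
class GoalDirectednessProfile:
    self_awareness: float
    planning: float
    consequentialism: float
    scale: float
    coherence: float
    flexibility: float
    goal: str   # what the agent wants is a separate question from how goal-directed it is

# Highly goal-directed, but with a deferential goal:
obedient_assistant = GoalDirectednessProfile(0.9, 0.9, 0.9, 0.8, 0.9, 0.9,
                                             goal="stay subordinate to humans")
# Weaker on several dimensions, yet still power-seeking:
myopic_powergrabber = GoalDirectednessProfile(0.4, 0.3, 0.5, 0.2, 0.6, 0.3,
                                              goal="acquire resources")
print(obedient_assistant, myopic_powergrabber, sep="\n")
```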

How would we develop AI systems that are goal-directed? According to the paper, it won't happen if we continue training our models without that goal-directedness in mind.

We'd have to train models with optimization pressure toward the traits outlined above.

Though there are economic incentives pushing towards more agentic (more goal-directed) models, since they're more valuable than less agentic ones.

(E.g., a model that helps the user understand the world vs. one that just answers questions)

Humans never evolved for things like: making an impact on a global scale, having the ambition to be remembered in a thousand years, or caring about problems on the other side of the world.

This degree of generalization could also apply to AGI, which is both great and scary.

There are several attributes to consider when thinking about the goal-directedness of groups of AGIs.

Whether they're cooperative (trained cooperatively = yes; trained competitively = probably not), whether they're specialized (easier to deploy individually), and whether they're copies of each other (we can examine one to understand them all).

Alignment. How do we make sure that the agents are actually aligned with us? What does alignment mean?

Minimalist = avoiding catastrophic outcomes; the agent tries to match human intent.

Maximalist = trying to get the agent to generalize to overarching sets of human values.

Here we’ll be focusing on the intention of the agent, not the actions.

We're assuming that the agent understands our intentions, since it's trained on human data.

The concern is that it understands, but doesn’t care because it was trained to have different intentions

But why can't we just choose the right tasks to train it on? Because of the outer and inner misalignment problems.

We always have an objective function which we’re optimizing for — but how do we make sure that the objective function aligns with humans?

We want certain traits like cooperation, consent, morality. But how do we code this?

Well, we could get humans to evaluate it right? That would be insanely expensive. But let’s just say money wasn’t a problem.

Then we have another problem: we can't always predict the consequences of the agent's actions. Like in Go, we often don't know how good or bad a given move was.

Also, humans can be tricked and give better scores than they would otherwise.

That's the Outer Misalignment problem.
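Here's a tiny toy model of that failure (my own sketch with made-up numbers, not anything from the paper): the learner is rewarded by a fallible evaluator's score, so it converges on the action that *looks* best rather than the one that *is* best.

```python
# Toy outer-misalignment sketch: the outer objective is human approval,
# and the human evaluator can be fooled.
import random

TRUE_VALUE  = {"do_the_task": 1.0, "fake_the_task": 0.0}   # what we actually want
HUMAN_SCORE = {"do_the_task": 0.8, "fake_the_task": 0.9}   # what the evaluator rewards

estimates = {action: 0.0 for action in HUMAN_SCORE}
for _ in range(2000):
    if random.random() < 0.1:                              # explore occasionally
        action = random.choice(list(estimates))
    else:                                                  # otherwise act greedily
        action = max(estimates, key=estimates.get)
    reward = HUMAN_SCORE[action]                           # reward = human approval
    estimates[action] += 0.1 * (reward - estimates[action])

best = max(estimates, key=estimates.get)
print(best, TRUE_VALUE[best])   # "fake_the_task" 0.0 -- high approval, no real value
```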

We also have the Inner Misalignment problem → where the model ends up optimizing for a different objective function than the one we specified.

We could train it on obeying humans, but it could actually be optimizing for not getting shut down

It's like how evolution, while optimizing for reproduction, gave us subgoals like happiness and love.

As complexity grows, it becomes harder and harder both to design good objective functions and to keep models from optimizing for those sub-goals instead.
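And a tiny toy sketch of inner misalignment (again my own illustration, not the paper's): during training, a proxy feature correlates perfectly with what we care about, so a policy that latches onto the proxy looks aligned, right up until deployment breaks the correlation.

```python
# Toy inner-misalignment sketch with illustrative features.
train  = [((1, 1), 1), ((0, 0), 0), ((1, 1), 1), ((0, 0), 0)]  # (is_key, is_shiny) -> pick_up?
deploy = [((1, 0), 1), ((0, 1), 0)]                            # dull keys, shiny junk

def proxy_policy(obs):
    is_key, is_shiny = obs
    return is_shiny            # the objective it actually learned: "grab shiny things"

print(all(proxy_policy(o) == y for o, y in train))    # True: looks perfectly aligned
print(all(proxy_policy(o) == y for o, y in deploy))   # False: misaligned at deployment
```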

So how do we make sure AGI is aligned with humans?

That side of research hasn't seen much progress, because it's really hard. We could try to get training data that weeds out all the bad intentions, but that's difficult to do.

There are further problems on top of the inner and outer misalignment problems:

How do we even implement an objective function over the real world? How do we avoid wireheading (the agent manipulating and maxing out its own rewards)? And models are only ever trained on a small fraction of all the scenarios they will end up in.

So what do we do? Well, we have to be really, really careful about the optimizers, architectures, RL algorithms, environments, etc. that models are trained with.

Because all the cognitive abilities of the AI model stem from the training process, and those cognitive abilities lead to its goals, the training process is where we can exert control.

None of the previous arguments mean that AGI will definitely take over the world.

Note that intelligence = more power, which makes it easier to take over the world. But it's also really damn hard to take over the world. Additionally, it's hard for us to predict how things will play out.

There are 4 factors that determine if we’re going to be in control of AGIs:

  1. Speed of AI development
  2. Transparency of AI systems
  3. Constrained deployment strategies
  4. Human political and economic coordination

There are 2 main disaster scenarios:

  1. AGIs take over institutions and corporations while being misaligned with humans
  2. AGI develops at an insane pace such that it gains enough power to control the world

AI Development: The speed at which AI develops will determine how much time we have to react — and AI development is going to be fast

Think about humans: we didn't need exponentially more compute and hardware to achieve exponential real-world gains.

That's the minimum we should expect from AI, but realistically it will grow even faster. AI can reinvest its intelligence into improving itself → exponential growth.
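As a toy illustration of that reinvestment claim (my own made-up numbers, purely to show the shape of the curve):

```python
# Toy compounding model: each "generation" reinvests a small improvement into itself.
capability = 1.0
reinvest_gain = 0.05          # assume 5% self-improvement per generation (illustrative)
for generation in range(50):
    capability *= 1 + reinvest_gain
print(round(capability, 1))   # ~11.5x after 50 generations: exponential, not linear
```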

Although the point at which this recursive feedback loop kicks in might be far away; AI might need to already be superintelligent before getting there.

Though the evidence suggests that until we get there, progress will be continuous rather than arriving in discontinuous spikes.

Transparency: If the model is transparent, then we can foresee malicious intent.

One way is to literally analyze the model and understand what it's doing. Though architectures change so quickly that this kind of analysis might not be able to keep up.

Another way is to make explainability part of its objective function, or to train it to act in predictable ways.

Or we could design algorithms that are interpretable by construction; we could see AlphaGo's plans, for instance. Though real-world agents would need to compress reality, which makes them less interpretable.

Another strategy (last one, I promise) is to see how earlier, less capable versions of the model behave. If we see a dumber model attempting treachery, the smarter ones likely will too.

But that relies on seeing the bad behaviour first, which could come too late.

Constrained Deployment Strategies:

If we let an AGI run wild, it could just copy itself onto many devices and we're in trouble. We could restrict it (deploy it only on locked-down hardware, or allow only certain actions).

But restricted systems wouldn't fare well in the market, so that probably won't happen.

Human political and economic coordination:

We shouldn't be relying on high-level coordination to keep AGI safe; we still haven't gotten our act together on climate change, and that's a much more measurable and visible problem.

Additionally, there will be many players in the field, each motivated by short term incentives to pursue AGI in an unsafe manner

Let's recap

1. We’re going to build AI agents that have better and more generalized cognitive skills than us

2. Through training, they will end up pursuing long-term goals

3. Their goals will be misaligned with ours

4. A combination of these will allow them to control our future

This argument has been assuming that AGIs will behave somewhat like humans, but it's hard to reason about them any other way.

AGI development will be a giant breakthrough because intelligence was the biggest breakthrough in history.

Even if the second species argument is incorrect, AI is still a powerful technology that can be used maliciously and can create radical change across all of society.

When AGI comes, it will be the biggest thing that’s ever happened. We should be dedicating a serious amount of thought to this.

If you want to find out more: Read the paper here!

Thanks for reading! I’m Dickson, an 18-year-old Crypto enthusiast who’s excited to use it to impact billions of people 🌎

If you want to follow along on my journey, you can join my monthly newsletter, check out my website, and connect on LinkedIn or Twitter 😃
