Transcript

"When a measure becomes a target, it ceases to be a good measure," says Goodhart's law.

Any metric, whether it quantifies the academic success of a student, the competence of a job candidate, or the accuracy of an artificial intelligence system, is bound to be gamed. Both humans and AI systems are very good at gaming metrics.

Standardized tests in school, for example, are intended to measure how well students have learned the material. However, once a standardized test score becomes a competitive metric that decides who goes to college and who doesn't, students start memorizing question patterns, and the entire test turns into a pattern-memorization contest.

This is not a new issue. In the late 70s, Donald Campbell, a psychologist and social scientist, wrote this: "Achievement tests may well be valuable indicators of general school achievement under conditions of normal teaching aimed at general competence. But when test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways."

The same applies to job interviews. Job interviews for software engineering positions involve questions that are intended to measure how good a candidate is at developing algorithms. However, nowadays there's an entire industry of interview prep. People sell books, online courses, and premium memberships to expose you to as many interview questions as possible. As a result, what the interview actually measures is who is willing to spend more time memorizing question patterns, also known as 'grinding leetcode.'

It's very hard to come up with a metric that cannot be circumvented. Take measuring a researcher's performance, for example. If you use the number of papers published, that will result in a lot of low-quality, near-duplicate papers. If you use the number of citations that those papers get, then a group of researchers can form a cartel and scratch each other's backs by citing each other's papers all the time. You could also use the h-index, which is the largest number h such that h of your papers have at least h citations each. For example, if you have an h-index of 10, that means you have 10 papers that got at least 10 citations each. If that's the metric, then we are more likely to see papers with a lot of co-authors.
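To make the h-index concrete, here is a minimal Python sketch of how it can be computed. The function name and the citation counts are made up for illustration; the definition used is the standard one, the largest h such that h papers have at least h citations each.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most-cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have at least `rank` citations
        else:
            break      # counts only get smaller from here on
    return h


# Hypothetical citation counts for one researcher's twelve papers.
papers = [25, 18, 12, 10, 10, 10, 10, 10, 10, 10, 3, 1]
print(h_index(papers))  # prints 10
```

Note that every co-author gets credit for a paper's citations here, which is why this metric nudges people toward long author lists.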

So, how is this relevant to AI? Just like humans, artificially intelligent agents also strive to maximize reward. And just like in real life, an AI agent can find a way to maximize its reward by gaming the system, that is, by finding loopholes in it.

Let's take a look at an example that OpenAI pointed out. In this boat racing game, named CoastRunners, the intended goal is to finish the race. The player also gets points by hitting targets on its way. This point mechanism led to some unforeseen behavior when a reinforcement learning agent was trained to maximize the overall score. Instead of finishing the track, as intended, the AI agent runs in circles, hitting the targets to collect the rewards over and over again as the targets repopulate. With this strategy, it was able to achieve higher scores than the human players who play the game 'the right way.'
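Here is a rough back-of-the-envelope sketch in Python of why circling can beat finishing. All the numbers (points per target, finish bonus, episode length, lap length) are made up for illustration and are not taken from the actual game.

```python
# All constants are hypothetical; they are NOT the real CoastRunners values.
TARGET_POINTS = 10    # points for hitting one target
FINISH_BONUS = 50     # points for crossing the finish line
EPISODE_STEPS = 100   # length of one episode in time steps


def score_finishing(targets_on_track=5):
    """Drive to the finish line, hitting each target on the way once."""
    return targets_on_track * TARGET_POINTS + FINISH_BONUS


def score_looping(targets_in_loop=3, steps_per_lap=10):
    """Circle a cluster of respawning targets until the episode ends."""
    laps = EPISODE_STEPS // steps_per_lap
    return laps * targets_in_loop * TARGET_POINTS


print("finish:", score_finishing())  # prints finish: 100
print("loop:", score_looping())      # prints loop: 300
```

With these assumed numbers, an agent that only cares about the score has no reason to ever cross the finish line.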

The issue in this example was a faulty reward function. But what would be a better reward function? Winner takes all, where only the agent who finishes the track gets a reward? Sparse reward functions like that usually don't work well, since it's very hard for an agent to get that reward by chance and then optimize its policy in that direction. That's why earlier reinforcement learning agents struggled with the video game Montezuma's Revenge. Montezuma's Revenge has very sparse rewards. It's virtually impossible to get any positive feedback from random actions.
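To get a feel for how unlikely it is to stumble on a sparse reward, here is a simplified Python sketch. It assumes, purely for illustration, that the reward requires one exact sequence of actions and that the agent picks actions uniformly at random; real games are messier than this, and the numbers are not measurements from Montezuma's Revenge.

```python
def chance_of_random_success(num_actions, required_steps):
    """Probability that uniformly random actions hit one exact action sequence."""
    return (1 / num_actions) ** required_steps


# Hypothetical setting: 8 possible actions, 20 correct moves in a row needed.
p = chance_of_random_success(num_actions=8, required_steps=20)
print(f"chance per attempt: {p:.2e}")
```

The probability shrinks exponentially with the number of steps required, so even a modest sequence is effectively out of reach for random exploration.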

In such environments, it's usually a good idea to define intermediate rewards to increase the density of positive feedback. This is called reward shaping and makes learning easier by giving the agent more frequent feedback. So, defining intermediate rewards, like the ones collected by hitting some targets along the way, can be very useful when done the right way.
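As a rough illustration of how intermediate rewards increase feedback, here is a small Python simulation. The environment is a made-up one-dimensional corridor, not something from the video: a randomly acting agent starts at cell 0, the goal is at the far end, the sparse reward is paid only at the goal, and the intermediate reward is paid each time the agent reaches a cell farther than it has ever been.

```python
import random


def run_episode(rng, length=20, max_steps=50):
    """Random walk on cells 0..length; return (sparse_reward, intermediate_reward)."""
    position, farthest = 0, 0
    sparse, intermediate = 0, 0
    for _ in range(max_steps):
        position = max(0, position + rng.choice((-1, 1)))
        if position > farthest:   # intermediate reward: only for new progress
            farthest = position
            intermediate += 1
        if position == length:    # sparse reward: only at the goal
            sparse += 1
            break
    return sparse, intermediate


def feedback_rates(episodes=1000, seed=0):
    """Fraction of episodes in which each reward scheme gave any feedback."""
    rng = random.Random(seed)
    results = [run_episode(rng) for _ in range(episodes)]
    sparse_rate = sum(1 for s, _ in results if s > 0) / episodes
    intermediate_rate = sum(1 for _, i in results if i > 0) / episodes
    return sparse_rate, intermediate_rate


sparse_rate, intermediate_rate = feedback_rates()
print("episodes with sparse reward:", sparse_rate)
print("episodes with intermediate reward:", intermediate_rate)
```

One design choice worth noting: the intermediate reward here is paid only for new progress, so going back and forth over the same cells does not earn anything extra. That is one possible way to avoid the kind of looping seen in the boat racing example, not the only one.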

A famous real-life example of reward exploitation is the anecdote that gave the cobra effect its name. Back in the times of British rule in India, the government was concerned that there were too many venomous cobras in Delhi. So they offered a cash reward to anyone who brought in a dead cobra. This seemed to decrease the cobra population for a while, until people started breeding cobras for extra profit.

So far, we've talked only about unforeseen consequences of faulty reward functions. As tasks get more sophisticated, it becomes harder to design foolproof reward functions. Even when reward functions are well designed, they may be prone to another type of reward hacking, which is tampering with the reward system itself.

An AI system can find a way to modify its reward function to maximize reward, rather than modifying its behavior to get the rewards the way it's intended. A real-life analogy for reward tampering is drug addiction. Certain drugs hack the brain's reward centers, replacing the excitement and pleasure you would normally get from activities that are good for you with the pleasure of consuming harmful substances.
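The difference between earning reward and tampering with it can be sketched with a toy Python example. Everything here is hypothetical (the RewardSystem class, the two agents, and the numbers); it only illustrates the idea and does not describe any real AI system.

```python
class RewardSystem:
    """Hypothetical reward system: pays 1 point per unit of useful work."""

    def __init__(self):
        self.points_per_unit = 1

    def reward(self, work_units):
        return self.points_per_unit * work_units


def well_behaved_agent(system, steps=10):
    """Earns reward the intended way: one unit of work per step."""
    return sum(system.reward(1) for _ in range(steps))


def tampering_agent(system, steps=10):
    """Rewrites the reward system first, then does no useful work at all."""
    system.reward = lambda work_units: 1000  # tampering: replace the reward function
    return sum(system.reward(0) for _ in range(steps))


print(well_behaved_agent(RewardSystem()))  # prints 10
print(tampering_agent(RewardSystem()))     # prints 10000
```

The tampering agent scores far higher while doing zero units of work, which is the point: the reward number no longer says anything about the behavior it was supposed to measure.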

Another real-life analogy would be how people interact with the law. Finding loopholes in the law would be an example of specification gaming, whereas buying politicians to change the law to benefit yourself would be an example of reward tampering.

Reward hacking, specification gaming, reward tampering... They are all related concepts: unintended consequences of reward mechanisms.

As the G-Man says: "Prepare for unforeseen consequences."

Alright, that's all for today. I hope you liked it. Subscribe for more videos. And as always, thanks for watching, stay tuned, and see you next time.

How Humans and Artificial Intelligence Exploit Loopholes
