Professor of Computer Science, Director, Center for Intelligent Systems, Smith-Zadeh Chair in Engineering, UC Berkeley; Author (with Peter Norvig) of Artificial Intelligence: A Modern Approach
Will They Make Us Better People?

The primary goal of AI is and has nearly always been to build machines that are better at making decisions. As everyone knows, in the modern view, this means maximizing expected utility to the extent possible. Actually, it doesn't quite mean that. What it means is this: given a utility function (or reward function, or goal), maximize its expectation. AI researchers work hard on algorithms for maximization—game-tree search, reinforcement learning, and so on—and on methods (including perception) for acquiring, representing, and manipulating the information needed to compute expectations. In all these areas, progress has been significant and appears to be accelerating.
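To make "given a utility function, maximize its expectation" concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of the essay: the toy world model, the action and outcome names, and the helper names `expected_utility` and `best_action`. The utility function is handed in from outside, and the code only does the maximizing.

```python
from typing import Callable, Dict

# Hypothetical toy model: each action leads to outcomes with given probabilities.
Outcomes = Dict[str, float]


def expected_utility(outcome_probs: Outcomes, utility: Callable[[str], float]) -> float:
    """Probability-weighted average of the utility over the possible outcomes."""
    return sum(p * utility(outcome) for outcome, p in outcome_probs.items())


def best_action(model: Dict[str, Outcomes], utility: Callable[[str], float]) -> str:
    """Pick the action whose expected utility is highest under the supplied utility."""
    return max(model, key=lambda action: expected_utility(model[action], utility))


# Assumed world model (illustrative numbers only).
model = {
    "take_umbrella": {"dry": 0.95, "wet": 0.05},
    "leave_umbrella": {"dry": 0.60, "wet": 0.40},
}


def utility(outcome: str) -> float:
    """Exogenously specified utility: the maximizer never questions these values."""
    return {"dry": 1.0, "wet": -2.0}[outcome]


print(best_action(model, utility))
```

Note that nothing in `best_action` checks whether `utility` is a sensible thing to maximize; it optimizes whatever it is given.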

Amidst all this activity, an important distinction is being overlooked: being better at making decisions is not the same as making better decisions. No matter how excellently an algorithm maximizes, and no matter how accurate its model of the world, a machine's decisions may be ineffably stupid, in the eyes of an ordinary human, if its utility function is not well aligned with human values. The well-known example of paper clips is a case in point: if the machine's only goal is maximizing the number of paper clips, it may invent incredible technologies as it sets about converting all available mass in the reachable universe into paper clips; but its decisions are still just plain dumb.

AI has followed operations research, statistics, and even economics in treating the utility function as exogenously specified; we say, "The decisions are great, it's the utility function that's wrong, but that's not the AI system's fault." Why isn't it the AI system's fault? If I behaved that way, you'd say it was my fault. In judging humans, we expect both the ability to learn predictive models of the world and the ability to learn what's desirable—the broad system of human values.

As Steve Omohundro, Nick Bostrom, and others have explained, the combination of value misalignment with increasingly capable decision-making systems can lead to problems—perhaps even species-ending problems if the machines are more capable than humans. Some have argued that there is no conceivable risk to humanity for centuries to come, perhaps forgetting that the interval of time between Rutherford's confident assertion that atomic energy would never be feasibly extracted and Szilárd's invention of the neutron-induced nuclear chain reaction was less than twenty-four hours.

For this reason, and for the much more immediate reason that domestic robots and self-driving cars will need to share a good deal of the human value system, research on value alignment is well worth pursuing. One possibility is a form of inverse reinforcement learning (IRL)—that is, learning a reward function by observing the behavior of some other agent who is assumed to be acting in accordance with such a function. (IRL is the sequential form of preference elicitation, and is related to structural estimation of MDPs in economics.) Watching its owner make coffee in the morning, the domestic robot learns something about the desirability of coffee in some circumstances, while a robot with an English owner learns something about the desirability of tea in all circumstances. The robot is not learning to desire coffee or tea; it's learning to play a part in the multiagent decision problem such that human values are maximized.
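The coffee-and-tea example can be sketched as a small Bayesian inference over candidate reward functions. This is a deliberately simplified, single-step version (closer to preference elicitation than to full sequential IRL), and it rests on assumptions that are not in the essay: the three hypotheses in `HYPOTHESES`, a noisily rational owner modeled with a softmax choice rule, and the `beta` rationality parameter.

```python
import math
from typing import Dict, List, Optional

# Hypothetical candidate reward functions the robot entertains about its owner.
HYPOTHESES: Dict[str, Dict[str, float]] = {
    "prefers_coffee": {"coffee": 1.0, "tea": 0.0},
    "prefers_tea": {"coffee": 0.0, "tea": 1.0},
    "indifferent": {"coffee": 0.5, "tea": 0.5},
}


def choice_likelihood(choice: str, reward: Dict[str, float], beta: float = 3.0) -> float:
    """Assumed softmax model: higher-reward options are chosen more often, not always."""
    weights = {option: math.exp(beta * r) for option, r in reward.items()}
    return weights[choice] / sum(weights.values())


def posterior(observations: List[str],
              prior: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Update the belief over reward hypotheses after each observed choice."""
    belief = dict(prior) if prior else {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}
    for choice in observations:
        belief = {h: p * choice_likelihood(choice, HYPOTHESES[h])
                  for h, p in belief.items()}
        total = sum(belief.values())
        belief = {h: p / total for h, p in belief.items()}
    return belief


# Observed mornings: mostly coffee, with one inconsistent choice.
belief = posterior(["coffee", "coffee", "coffee", "tea", "coffee", "coffee"])
most_likely = max(belief, key=belief.get)
print(most_likely, belief)
```

The robot ends up with a belief about what its owner values, not a desire of its own for coffee, and the one morning of tea lowers its confidence without overturning the conclusion.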

I don't think this is an easy problem in practice. Humans are inconsistent, irrational, and weak-willed, and human values exhibit, shall we say, regional variations. Moreover, we don't yet understand the extent to which improving the decision-making capabilities of the machine may increase the downside risk of small errors in value alignment. Nevertheless, there are reasons for optimism.

First, there is plenty of data about human actions (most of what has been written, filmed, or observed directly) and, crucially, about our attitudes to those actions. (The concept of customary international law enshrines this idea: it is based on observing what states customarily do when acting from a sense of obligation.) Second, to the extent that human values are shared, machines can and should share what they learn about human values. Third, as noted above, there are solid economic incentives to solve this problem as machines move into the human environment. Fourth, the problem does not seem intrinsically harder than learning how the rest of the world works. Fifth, by assigning very broad priors over what human values might be, and by making the AI system risk-averse, it ought to be possible to induce exactly the behavior one would want: before taking any serious action affecting the world, the machines engage in an extended conversation with us and an extended exploration of our literature and history to find out what we want, what we really, really want.
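The fifth point can be illustrated with a small sketch in which a broad prior plus risk aversion leads the machine to ask before acting. The decision rule here is an assumption chosen for simplicity, not a method proposed in the essay: a worst-case (maximin) comparison over every value hypothesis that is still plausible. The names `risk_averse_choice`, `ask_cost`, and `plausible` are likewise hypothetical.

```python
from typing import Dict


def risk_averse_choice(belief: Dict[str, float],
                       utilities: Dict[str, Dict[str, float]],
                       ask_cost: float = 0.1,
                       plausible: float = 0.05) -> str:
    """Choose the action with the best worst-case utility, or ask the human.

    belief[h] is the probability of value hypothesis h;
    utilities[h][a] is the utility of action a if hypothesis h is true.
    """
    # Keep every hypothesis that has not been effectively ruled out.
    live = [h for h, p in belief.items() if p >= plausible] or list(belief)
    actions = list(next(iter(utilities.values())))
    # Risk aversion, modeled here as judging each action by its worst plausible case.
    worst_case = {a: min(utilities[h][a] for h in live) for a in actions}
    # Asking costs a little under every hypothesis but is never disastrous.
    worst_case["ask_human"] = -ask_cost
    return max(worst_case, key=worst_case.get)


# Hypothetical utilities under two candidate accounts of what the owner wants.
utilities = {
    "wants_coffee": {"make_coffee": 1.0, "make_tea": -1.0},
    "wants_tea": {"make_coffee": -1.0, "make_tea": 1.0},
}

broad_prior = {"wants_coffee": 0.5, "wants_tea": 0.5}
confident_belief = {"wants_coffee": 0.98, "wants_tea": 0.02}

print(risk_averse_choice(broad_prior, utilities))
print(risk_averse_choice(confident_belief, utilities))
```

Under the broad prior every direct action has a bad plausible outcome, so the sketch defers to the human; once the evidence has narrowed the belief, it acts on its own.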

I suppose this amounts to a change in the goals of AI: instead of pure intelligence, we need to build intelligence that is provably aligned with human values. This turns moral philosophy into a key industry sector. The output could be quite instructive for the human race as well as for the robots.