2015 : WHAT DO YOU THINK ABOUT MACHINES THAT THINK?

Information Scientist and Professor of Electrical Engineering and Law, University of Southern California; Author, Noise, Fuzzy Thinking
Thinking Machines = Old Algorithms On Faster Computers

 

Machines don't think. They approximate functions. They turn inputs into outputs. A pocket calculator's square-root button turns the number 9 into the number 3. A well-trained convolutional neural network turns an image with your face in it into the output 1. It turns an image without your face in it into the output 0.

A multilayered or "deep" enough neural net maps any image to the probability that your face is in that image. So the trained net approximates a probability function. The process takes a staggering amount of computation to come even close to getting it right. But the result still just maps inputs to outputs. It still approximates a function even if the result resembles human perception or thinking. It just takes a lot of computer power.
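A minimal sketch of that input-to-output view, assuming a hypothetical one-layer stand-in for the trained net. The function names, the 4-pixel "image," and the weights are all made up for illustration. A real deep net stacks many such layers, but the signature stays the same: pixels in, one probability out.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def face_probability(pixels: list[float], weights: list[float], bias: float) -> float:
    """Hypothetical one-layer 'net': weighted sum of pixels, then a squashing function."""
    score = sum(w * x for w, x in zip(weights, pixels)) + bias
    return sigmoid(score)


# Made-up 4-pixel "image" and made-up weights, for illustration only.
p = face_probability([0.2, 0.9, 0.4, 0.7], [1.5, -0.5, 2.0, 0.3], -1.0)
print(f"estimated probability that the face is present: {p:.3f}")
```

Nothing in the function does more than multiply, add, and squash.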

"Intelligent" machines approximate complex functions that deal with patterns. The patterns can be of speech or images or of any other signals. Image patterns tend to consist of many pixels or voxels. So they can have very high dimension. The patterns involved can easily exceed what the human mind can grasp. That will only increase as computers improve.

The real advance has been in the number-crunching power of digital computers. That has come from the steady Moore's-law doubling of circuit density every two years or so. It has not come from any fundamentally new algorithms. That exponential rise in crunch power lets ordinary-looking computers tackle tougher problems of big data and pattern recognition.
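A back-of-the-envelope sketch of what that doubling implies. The 50-year span is an assumed example, and the helper name `growth_factor` is hypothetical. The two-year doubling period is the rough figure quoted above, not a measured value.

```python
def growth_factor(years: float, doubling_period: float = 2.0) -> float:
    """Multiplicative gain after `years` if capacity doubles every `doubling_period` years."""
    return 2.0 ** (years / doubling_period)


# Assumed example: 50 years of doubling every 2 years is 25 doublings.
factor = growth_factor(50)
print(f"{factor:,.0f}x")
```

Twenty-five doublings multiply the starting capacity by 2^25, or roughly 33 million.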

Consider the most popular algorithms in big data and in machine learning. One algorithm is unsupervised (requires no teacher to label data). The other is supervised (requires a teacher). They account for a great deal of applied AI.

The unsupervised algorithm is called k-means clustering. It is arguably the most popular algorithm for working with big data. It clusters like with like and underlies Google News. Start with a million data points. Group them into 10 or 50 or 100 clusters or patterns. That's a computationally hard problem. But k-means clustering has been an iterative way to form the clusters since at least the 1960s. What has changed has been the size of the problems that current computers can handle. The algorithm itself has gone under different AI-suggestive names such as self-organizing maps or adaptive vector quantization. It's still just the old two-step iterative algorithm from the 1960s.
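The two-step iteration can be sketched in a few lines of plain Python. This is an illustrative version under simple assumptions (squared Euclidean distance, random starting centroids drawn from the data, a fixed iteration cap), not the code behind any particular product. The function name `kmeans` and its parameters are hypothetical.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Cluster tuples of floats into k groups; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random data points
    for _ in range(iters):
        # Step 1: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Step 2: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                mean = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # nothing moved, so the iteration has converged
            break
        centroids = new_centroids
    return centroids


# Toy data: two well-separated groups of three points each.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(sorted(kmeans(pts, 2)))
```

Each pass only measures distances and averages. Scaling from six points to a million changes the running time, not the steps.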

The supervised algorithm is the neural-net algorithm called backpropagation. It is without question the most popular algorithm in machine learning. Backpropagation got its name in the 1980s. It had appeared at least a decade before that. Backpropagation learns from samples that a user or supervisor gives it. The user presents input images both with and without your face in them. These feed through several layers of switch-like neurons until they emit a final output. That output can be a single number. The teacher wants the number 1 as output if your face is in an input image. The teacher wants 0 otherwise. The net learns the pattern of your face as it sweeps back and forth like this over thousands or millions of iterations. At no one step or sweep does any intelligence or thought occur. Nor does the update of any of the hundreds or thousands of network parameters resemble how real synapses learn new patterns of neural stimulation. Changing a network parameter is instead akin to someone choosing their next action based on the minuscule downstream effect that their action would have on the interest rate of a 10-year U.S. bond.
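The forward and backward sweeps can be sketched on a toy problem. The version below assumes a tiny net with one hidden layer of sigmoid units, squared-error loss, and per-sample gradient updates. The two-number inputs stand in for images and the 0/1 targets stand in for the teacher's labels. All names (`train`, `hidden`, `lr`) and the toy data are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, hidden=2, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer net; returns (mean loss per epoch, predict function)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
        y = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
        return h, y

    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            # Forward sweep: input -> hidden layer -> single output number.
            h, y = forward(x)
            total += 0.5 * (y - target) ** 2
            # Backward sweep: the chain rule pushes the output error back a layer.
            delta_out = (y - target) * y * (1 - y)
            delta_hid = [delta_out * w2[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            # Parameter updates: each weight takes a small step against its gradient.
            for j in range(hidden):
                w2[j] -= lr * delta_out * h[j]
                b1[j] -= lr * delta_hid[j]
                for i in range(n_in):
                    w1[j][i] -= lr * delta_hid[j] * x[i]
            b2 -= lr * delta_out
        losses.append(total / len(samples))
    return losses, lambda x: forward(x)[1]


# Toy "teacher": the label is 1 if either input is on, else 0.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
losses, predict = train(data)
print(f"mean loss: first epoch {losses[0]:.4f}, last epoch {losses[-1]:.4f}")
```

Every update is a handful of multiplications and subtractions. The learning shows up only in the aggregate, as the loss falls over many sweeps.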

Punchline: Both of these popular AI algorithms are special cases of the same standard algorithm of modern statistics—the expectation-maximization (EM) algorithm. So any purported intelligence involved is just ordinary statistics after all.

EM is a two-step iterative scheme for climbing a hill of probability. EM does not always get to the top of the highest hill of probability. It does almost always get to the top of the nearest hill. That may be the best any learning algorithm can do in general. Carefully injected noise and other tweaks can speed up the climb. But all paths still end at the top of the hill in a maximum-likelihood equilibrium. They all end in a type of machine-learning nirvana of locally optimal pattern recognition or function approximation. Those hilltop equilibria will look ever more impressive and intelligent as computers get faster. But they involve no more thinking than calculating some sums and then picking the biggest sum.
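The hill climb can be made concrete with a textbook case: EM for a mixture of two one-dimensional Gaussians. This is a sketch under stated assumptions (two components, hand-picked starting values, a small variance floor added as a numerical safeguard), and the function names are hypothetical. It tracks the log-likelihood so the climb is visible.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, mu=(-1.0, 1.0), var=(1.0, 1.0), mix=0.5, iters=50):
    """Fit a two-component 1-D Gaussian mixture; returns (means, log-likelihood history)."""
    mu0, mu1 = mu
    v0, v1 = var
    history = []
    for _ in range(iters):
        # E-step: how responsible is component 0 for each data point?
        resp = []
        loglik = 0.0
        for x in data:
            p0 = mix * normal_pdf(x, mu0, v0)
            p1 = (1 - mix) * normal_pdf(x, mu1, v1)
            loglik += math.log(p0 + p1)
            resp.append(p0 / (p0 + p1))
        history.append(loglik)
        # M-step: re-estimate each parameter as a responsibility-weighted average.
        n0 = sum(resp)
        n1 = len(data) - n0
        mu0 = sum(r * x for r, x in zip(resp, data)) / n0
        mu1 = sum((1 - r) * x for r, x in zip(resp, data)) / n1
        # The 1e-6 floor is an assumed safeguard against a collapsing variance.
        v0 = max(sum(r * (x - mu0) ** 2 for r, x in zip(resp, data)) / n0, 1e-6)
        v1 = max(sum((1 - r) * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1, 1e-6)
        mix = n0 / len(data)
    return (mu0, mu1), history


# Toy data: five points near -2 and five points near +2.
samples = [-2.1, -1.9, -2.0, -1.8, -2.2, 2.0, 2.1, 1.9, 2.2, 1.8]
means, history = em_two_gaussians(samples)
print(means, history[0], history[-1])
```

The log-likelihood in `history` never goes down from one iteration to the next. That is the climb, and the sums and averages in the two steps are all that drive it.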

Thus much of machine thinking is just machine hill climbing.

Marvin Minsky's 1961 review paper "Steps Toward Artificial Intelligence" makes for a humbling read in this context because so little has changed algorithmically since he wrote it over a half century ago. He even predicted the tendency to see computer-intensive hill climbing as something cognitively special: "perhaps what amounts to straightforward hill climbing on one level may sometimes appear (on a lower level) as the sudden jumps of 'insight.'"

There are other AI algorithms. But most fall into categories that Minsky wrote about. One example is running Bayesian probability algorithms on search trees or graphs. They have to grapple with exponential branching or some related form of the curse of dimensionality. Another example is convex or other nonlinear constrained optimization for pattern classification. French mathematician Joseph-Louis Lagrange found the general solution algorithm that we still use today. He came up with it in 1811. Clever tricks and tweaks will always help. But progress here depends crucially on running these algorithms on ever-faster computers.
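For reference, the method of Lagrange multipliers in its simplest equality-constrained form can be written as follows. The symbols $f$, $g$, and $\lambda$ are generic notation chosen here, not taken from any specific classifier.

```latex
\begin{align}
  &\min_{x} \; f(x) \quad \text{subject to} \quad g(x) = 0, \\
  &\mathcal{L}(x, \lambda) = f(x) + \lambda \, g(x), \\
  &\nabla_x \mathcal{L} = \nabla f(x) + \lambda \, \nabla g(x) = 0,
  \qquad
  \frac{\partial \mathcal{L}}{\partial \lambda} = g(x) = 0 .
\end{align}
```

Solving the last line for $x$ and $\lambda$ gives the candidate optima. Pattern classifiers built on constrained optimization solve large systems of this kind numerically.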

The algorithms themselves consist mainly of vast numbers of additions and multiplications. So they are not likely to suddenly wake up one day and take over the world. They will instead get better at learning and recognizing ever richer patterns simply because they add and multiply faster.

It's a good bet that tomorrow's thinking machines will look a lot like today's—old algorithms running on faster computers.