Deep Learning, Semantics, And Society

Deep learning neural networks are the most exciting recent technological and scientific development. Technologically, they are soundly beating competing approaches in a wide variety of contests including speech recognition, image recognition, image captioning, sentiment analysis, translation, drug discovery, and video game performance. This has led to massive investments by the big technology companies and the formation of more than 300 deep learning startups with more than $1.5 billion of investment.

Scientifically, these networks are shedding new light on the most important scientific question of our time: "How do we represent and manipulate meaning?" Many theories of meaning have been proposed that involve mapping phrases, sounds, or images into logical calculi with formal rules of manipulation. For example, Montague semantics tries to map natural language phrases into a typed lambda calculus.

Deep learning networks naturally map input words, sounds, or images into vectors of neural activity. These vector representations exhibit a curious "algebra of meaning." For example, after training on a large English language corpus, Mikolov's Word2Vec exhibits this strange relationship: "King - Man + Woman = Queen." More precisely, the word vector closest to the result of that arithmetic is the one for "queen." His network tries to predict words from their surrounding context (or vice versa). The shift of context from "The king ate his lunch" to "The queen ate her lunch" is the same as from "The man ate his lunch" to "The woman ate her lunch." The statistics of many similar sentences lead to the vector from "king" to "queen" being nearly the same as the vector from "man" to "woman." The same offset also maps "prince" to "princess," "hero" to "heroine," and many other similar pairs. Other "meaning equations" include "Paris - France + Italy = Rome," "Obama - USA + Russia = Putin," and "Architect - Building + Software = Programmer." In this way, these systems discover important relational information purely from the statistics of training examples.
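The "algebra of meaning" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-dimensional vectors below are hand-made, hypothetical values (loosely royalty, maleness, femaleness), not anything learned by Word2Vec, and the helper names `cosine` and `analogy` are invented for this example. The sketch shows the arithmetic itself: subtract one vector, add another, and look for the nearest remaining word.

```python
import math

# Hand-made toy vectors (hypothetical values, NOT learned from a corpus).
# Coordinates loosely stand for: royalty, maleness, femaleness.
VECTORS = {
    "king":     [0.9, 0.8, 0.1],
    "queen":    [0.9, 0.1, 0.8],
    "prince":   [0.6, 0.8, 0.1],
    "princess": [0.6, 0.1, 0.8],
    "man":      [0.1, 0.8, 0.1],
    "woman":    [0.1, 0.1, 0.8],
    "lunch":    [0.0, 0.3, 0.3],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c).

    The three query words are excluded from the candidates, which is
    the usual convention when evaluating word analogies.
    """
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


print(analogy("king", "man", "woman"))
print(analogy("prince", "man", "woman"))
```

With real learned embeddings the result of the arithmetic rarely lands exactly on a word vector, which is why the sketch returns the nearest neighbor by cosine similarity rather than testing for equality.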

The success of these networks can be thought of as a triumph of "distributional semantics," first proposed in the 1950s. Meaning, relations, and valid inference all arise from the statistics of experiential contexts. Similar phenomena were found in the visual domain in Radford, Metz, and Chintala's deep networks for generating images. The vector representing a smiling woman minus the woman with a neutral expression plus a neutral man produces an image of the man smiling. A man with glasses minus the man without glasses plus a woman without glasses produces an image of the woman with glasses.
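The distributional claim above, that meaning arises from the statistics of contexts, can be illustrated with raw context counts and no neural network at all. The sketch below is a deliberate simplification, not Word2Vec's training procedure: the six-sentence corpus is made up for the example (it extends the lunch sentences used earlier), and `context_counts`, `offset`, and `cosine` are hypothetical helper names. It counts which words share a sentence with a target word and then compares the change in counts from "king" to "queen" with the change from "man" to "woman."

```python
import math
from collections import Counter

# A tiny made-up corpus (an assumption for illustration only).
CORPUS = [
    "the king ate his lunch",
    "the queen ate her lunch",
    "the man ate his lunch",
    "the woman ate her lunch",
    "the king wore his crown",
    "the queen wore her crown",
]


def context_counts(word):
    """Count every other word that shares a sentence with `word`."""
    counts = Counter()
    for sentence in CORPUS:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts


def offset(src, dst):
    """Context counts of dst minus those of src, keeping nonzero entries."""
    a, b = context_counts(src), context_counts(dst)
    return {w: b[w] - a[w] for w in set(a) | set(b) if b[w] != a[w]}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


royal = offset("king", "queen")
plain = offset("man", "woman")
print(royal)
print(plain)
print(cosine(royal, plain))
```

In this toy corpus both offsets consist of losing "his" and gaining "her," so they point in the same direction even though their magnitudes differ. Learned embeddings compress such counts into dense vectors, but the source of the regularity is the same.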

Deep learning neural networks are now being applied to hundreds of important applications. A classical challenge for industrial robots is to use vision to find and pick up a desired part from a bin of disorganized parts. An industrial robot company recently reported success at this task using a deep neural network with eight hours of training. A drone company recently described a deep neural network that autonomously flies drones in complex real-world environments.

Why are these advances happening now? For these networks to learn effectively, they require large training sets, often with millions of examples. This, combined with the large size of the networks, means that they also require large amounts of computational power. These systems are having a big impact now because the web is a source of large training sets and modern computers with graphics coprocessors have the power to train them.

Where is this going? Expect these networks to soon be used in every conceivable application. Several recent university courses on deep learning have posted their students' class projects. In just a few months, hundreds of students were able to use these technologies to solve a wide variety of problems that would have been regarded as major research programs a decade ago. We are in a kind of "Cambrian explosion" of these networks right now. Groups all over the world are experimenting with different sizes, structures, and training techniques, and other groups are building hardware to make them more efficient.

All of this is very exciting, but it also means that artificial intelligence is likely to soon have a much bigger impact on our society. We must work to ensure that these systems have a beneficial effect and to create social structures that help to integrate the new technologies.

Many of the contest-winning networks are "feedforward" from input to output. These typically perform classification or evaluation of their inputs and don't invent or create anything. More recent networks include "recurrent nets," whose internal state carries over from one step to the next, and networks trained by "reinforcement learning" to take actions that best achieve rewards. Systems of this kind are better able to discover surprising or unexpected ways of achieving an outcome. The next generation of networks will create world models and do detailed reasoning to choose optimal actions. That class of system must be designed very carefully to avoid unexpected undesirable behaviors. We must very carefully choose the goals that we ask these systems to optimize. If we are able to develop the scientific understanding and social will to guide these developments in a beneficial direction, the future is very bright indeed!