Differentiable Programming

Over the past few years, a raft of classic challenges in artificial intelligence that had stood unsolved for decades has been conquered, almost without warning, through an approach long disparaged by AI purists for its "statistical" flavor: it's essentially about learning probability distributions from large volumes of data, rather than examining humans' problem-solving techniques and attempting to encode them in executable form. The formidable tasks it has solved range from object classification and speech recognition to generating descriptive captions for photos and synthesizing images in the style of famous artists, and even guiding robots to perform tasks for which they were never programmed!

This newly dominant approach, originally known as "neural networks," has been rebranded as "deep learning" to emphasize a qualitative advance over the neural nets of the past. Its recent success is often attributed to the availability of larger datasets and more powerful computing systems, or to large tech companies' sudden interest in the field. These growing resources have indeed been critical ingredients in the rapid advancement of the state of the art, but big companies have always thrown resources at a wide variety of machine learning methods, and it's deep learning in particular that has seen such unbelievable advances; many other methods have also improved, but to a far lesser extent.

So what is the magic that separates deep learning from the rest, and can crack problems for which no group of humans has ever been able to program a solution? The first ingredient, from the early days of neural nets, is a timeless algorithm, rediscovered again and again, known in this field as "back-propagation." It's really just the chain rule—a simple calculus trick—applied in a very elegant way. It's a deep integration of continuous and discrete math, enabling complex families of potential solutions to be autonomously improved with vector calculus.

The key is to organize the template of potential solutions as a directed graph (e.g., from a photo to a generated caption, with many nodes in between). Traversing this graph in reverse enables the algorithm to automatically compute a "gradient vector," which directs the search for increasingly better solutions. You have to squint at most modern deep learning techniques to see any structural similarity to traditional neural networks, but behind the scenes this back-propagation algorithm is crucial to both old and new architectures.
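
To make that concrete, here is a minimal sketch in Python of back-propagation on a toy computational graph. The function y = (a * b + c)^2 and its three-node layout are illustrative choices, not drawn from any particular architecture; the point is only that one forward traversal evaluates the graph, and one reverse traversal applies the chain rule at each node to produce the gradient vector.

```python
# A toy computational graph: y = (a * b + c) ** 2, evaluated node by node.
def forward(a, b, c):
    u = a * b            # node 1
    v = u + c            # node 2
    y = v ** 2           # node 3 (the output)
    return y, (a, b, v)  # keep the intermediate values the reverse pass needs

# Traverse the same graph in reverse: each node multiplies the gradient
# flowing back from the output by its own local derivative (the chain rule),
# yielding the gradient vector (dy/da, dy/db, dy/dc).
def backward(cache):
    a, b, v = cache
    dy_dv = 2 * v        # node 3: d(v ** 2) / dv
    dy_du = dy_dv        # node 2: d(u + c) / du = 1
    dy_dc = dy_dv        # node 2: d(u + c) / dc = 1
    dy_da = dy_du * b    # node 1: d(a * b) / da = b
    dy_db = dy_du * a    # node 1: d(a * b) / db = a
    return dy_da, dy_db, dy_dc

y, cache = forward(2.0, 3.0, 1.0)
print(y, backward(cache))  # 49.0 (42.0, 28.0, 14.0)
```

The same reverse sweep works unchanged on graphs with millions of nodes, which is what makes gradient-directed search through enormous families of potential solutions practical.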

But the original neural networks that used back-propagation fall far short of newer deep learning techniques, even when given today's hardware and datasets. The other key piece of magic in every modern architecture is another deceptively simple idea: components of a network can be used in more than one place at the same time. As the network is optimized, every copy of each component is forced to stay identical (this idea is called "weight-tying"). Weight-tying imposes an additional requirement on tied components: they must learn to be useful in many places all at once, rather than specializing to a particular location. This pushes the network toward a more generally useful function, since a word might appear at any location in a block of text, or a physical object might appear at any place in an image.
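
As a rough sketch of what weight-tying looks like in code (the sizes, the tanh nonlinearity, and the toy objective below are illustrative, not taken from any published model), a single weight matrix W is applied at every position of a sequence; its gradient is the sum of contributions from every position, so one optimization step updates every "copy" at once.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 4))      # the single, shared component
sequence = rng.normal(size=(10, 4))    # 10 positions, 4 features each

# Forward pass: the *same* W is applied at every position, the way a
# convolutional filter is applied at every patch of an image.
outputs = np.array([np.tanh(x @ W) for x in sequence])

# Toy objective and its gradient with respect to W. Because every position
# uses the same W, the gradient is the sum of per-position contributions:
# the tied component has to be useful everywhere at once.
loss = (outputs ** 2).sum()
grad_W = np.zeros_like(W)
for x, h in zip(sequence, outputs):
    grad_W += np.outer(x, 2 * h * (1 - h ** 2))

W -= 0.01 * grad_W                     # one step updates every "copy" of W
```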

Putting a generally useful component in many places of a network is analogous to writing a function in a program and calling it in multiple spots, an essential concept in a very different area of computer science: functional programming. This is more than just an analogy: weight-tied components really are the same concept of a reusable function found in programming. And it goes even deeper! Many of the most successful architectures of the past couple of years reuse components in exactly the same patterns of composition generated by common "higher-order functions" in functional programming. This suggests that other well-known operators from functional programming might be a good source of ideas for deep learning architectures.
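
One way to see the correspondence is through two familiar higher-order functions, sketched below with hypothetical stand-in components rather than real learned layers: map applies the same component independently at every position, which is the compositional pattern of a convolutional layer, while fold (reduce) threads the same component through a sequence while carrying a state, which is the compositional pattern of a recurrent layer.

```python
from functools import reduce

# Hypothetical stand-ins for small learned components; in a real network
# their numeric details would be adjusted by back-propagation.
def filter_fn(x):            # a shared per-position component (cf. convolution)
    return max(0.0, 2.0 * x - 1.0)

def cell(state, x):          # a shared stateful component (cf. recurrence)
    return 0.5 * state + x

sequence = [0.1, 0.9, 0.4, 0.7]

# map: the same component, applied independently at every position.
features = list(map(filter_fn, sequence))

# fold: the same component, threaded through the sequence with a carried state.
final_state = reduce(cell, sequence, 0.0)

print(features, final_state)
```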

The most natural playground for exploring functional structures trained as deep learning networks would be a new language that can run back-propagation directly on functional programs. As it turns out, hidden in the details of implementation, functional programs are actually compiled into a computational graph similar to what back-propagation requires. The individual components of the graph need to be differentiable too, but Grefenstette et al. recently published differentiable constructions of a few simple data structures (stack, queue, and deque), which suggests that further differentiable implementations are probably just a matter of clever math. Further work in this area may open up a new programming paradigm—differentiable programming. Writing a program in such a language would be like sketching a functional structure with the details left to the optimizer; the language would use back-propagation to automatically learn the details according to an objective for the whole program—just like optimizing weights in deep learning, but with functional programming as a more expressive generalization of weight-tying.
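
As a hint of what such a language might feel like, the sketch below leans on JAX, an existing Python library that can already run back-propagation through ordinary functional code. The program's structure is fixed as a fold over its input (jax.lax.scan), and the numeric details, the parameters, are the part left to the optimizer. The parameter shapes, the loss, and the target value are all illustrative choices, not a proposal for how such a language would actually look.

```python
import jax
import jax.numpy as jnp

def program(params, xs):
    # The program's shape is a fold over the input with a learned step
    # function: functional programming as a generalization of weight-tying.
    def step(state, x):
        new_state = jnp.tanh(params["W"] @ state + params["b"] * x)
        return new_state, new_state
    final_state, _ = jax.lax.scan(step, jnp.zeros(3), xs)
    return jnp.sum(final_state)

def loss(params, xs, target):
    # An objective for the whole program; the details inside `program`
    # are learned to satisfy it.
    return (program(params, xs) - target) ** 2

params = {"W": 0.1 * jnp.eye(3), "b": jnp.ones(3)}
xs = jnp.linspace(0.0, 1.0, 5)

# Back-propagation through the entire functional program in one call,
# followed by a single gradient-descent step on every parameter.
grads = jax.grad(loss)(params, xs, 2.0)
params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
```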

Deep learning may look like another passing fad, in the vein of "expert systems" or "big data." But it's based on two timeless ideas (back-propagation and weight-tying), and while differentiable programming is a very new concept, it's a natural extension of these ideas that may prove timeless itself. Even as specific implementations, architectures, and technical phrases go in and out of fashion, these core concepts will continue to be essential to the success of AI.