Datasets Over Algorithms

Perhaps the most important news of our day is that datasets, not algorithms, might be the key limiting factor in the development of human-level artificial intelligence.

At the dawn of the field of artificial intelligence, in 1966, two of its founders famously anticipated that solving the problem of computer vision would take only a single summer. Now, almost a half century later, machine learning software finally appears poised to achieve human-level performance on vision tasks and a variety of other grand challenges. What took the AI revolution so long?

A review of the timing of the most publicized AI advances over the past thirty years suggests a provocative explanation: perhaps many major AI breakthroughs have actually been constrained by the availability of high-quality training datasets rather than by algorithmic advances. For example, in 1994, human-level spontaneous speech recognition was achieved with a variant of a hidden Markov model algorithm initially published ten years earlier, but it relied on a dataset of spoken Wall Street Journal articles and other texts made available only three years earlier. In 1997, when IBM's Deep Blue defeated Garry Kasparov to become the world's top chess player, its core NegaScout planning algorithm was fourteen years old, whereas its key dataset of 700,000 Grandmaster chess games (known as "The Extended Book") was only six years old. In 2005, Google software achieved breakthrough performance at Arabic- and Chinese-to-English translation based on a variant of a statistical machine translation algorithm published seventeen years earlier, but it used a dataset of more than 1.8 trillion tokens from Google Web and News pages gathered the same year. In 2011, IBM's Watson became the world Jeopardy! champion using a variant of the mixture-of-experts algorithm published twenty years earlier, but it drew on a dataset of 8.6 million documents from Wikipedia, Wiktionary, Wikiquote, and Project Gutenberg updated one year prior. In 2014, Google's GoogLeNet software achieved near-human performance at object classification using a variant of the convolutional neural network algorithm proposed twenty-five years earlier, but it was trained on the ImageNet corpus of approximately 1.5 million labeled images across 1,000 object categories first made available only four years earlier. Finally, in 2015, Google DeepMind announced that its software had achieved human parity in playing twenty-nine Atari games by learning general control from video, using a variant of the Q-learning algorithm published twenty-three years earlier but trained on the Arcade Learning Environment dataset of over fifty Atari games made available only two years earlier.

Examined collectively, these advances show that the average elapsed time between the proposal of a key algorithm and the corresponding breakthrough was about eighteen years, whereas the average elapsed time between the availability of a key dataset and the corresponding breakthrough was less than three years, about six times shorter, suggesting that datasets may indeed have been the limiting factor in the advances. In particular, one might hypothesize that the key algorithms underlying AI breakthroughs are often latent, simply waiting to be mined out of the existing literature by large, high-quality datasets and then optimized for the available hardware of the day. Certainly, in a tragedy of the research commons, attention, funding, and career advancement have historically been associated more with algorithmic advances than with dataset advances.
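As a rough sanity check on those averages, the gaps cited above can be tallied directly. The short Python sketch below simply restates the six examples and the algorithm and dataset gaps as given in this essay; the exact figures shift depending on which publication and release dates one chooses to count, so it is illustrative rather than definitive.

    # Gaps (in years) between key algorithm/dataset availability and each
    # breakthrough, as cited in the examples above. Other reasonable choices
    # of publication or release year would shift these values slightly.
    advances = [
        # (breakthrough, year, algorithm gap, dataset gap)
        ("WSJ-scale speech recognition",   1994, 10, 3),
        ("Deep Blue defeats Kasparov",     1997, 14, 6),
        ("Google statistical translation", 2005, 17, 0),
        ("IBM Watson wins Jeopardy!",      2011, 20, 1),
        ("GoogLeNet on ImageNet",          2014, 25, 4),
        ("DeepMind Atari agents",          2015, 23, 2),
    ]

    algo_gaps = [a for _, _, a, _ in advances]
    data_gaps = [d for _, _, _, d in advances]

    mean_algo = sum(algo_gaps) / len(algo_gaps)  # about 18 years
    mean_data = sum(data_gaps) / len(data_gaps)  # just under 3 years

    print(f"Mean algorithm-to-breakthrough gap: {mean_algo:.1f} years")
    print(f"Mean dataset-to-breakthrough gap:   {mean_data:.1f} years")
    print(f"Dataset gaps are roughly {mean_algo / mean_data:.1f}x shorter")

Dropping or re-dating any single example moves the two means by a year or two, but the large contrast between them persists.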

If correct, this hypothesis would have foundational implications for future progress in AI. Most importantly, the prioritized cultivation of high-quality training datasets might allow an order-of-magnitude speedup in AI breakthroughs over purely algorithmic advances. For example, we might already possess the algorithms and hardware that will enable machines in a few years to author human-level long-form creative compositions, complete standardized human examinations, or even pass the Turing Test, if only we trained them with the right writing, examination, and conversational datasets. Additionally, the nascent problem of ensuring AI friendliness might be addressed by focusing on dataset rather than algorithmic friendliness—a potentially simpler approach.

Although new algorithms receive much of the public credit for ending the last AI winter, the real news might be that prioritizing the cultivation of new datasets and research communities around them could be essential to extending the present AI summer.