Google spent $44 million to acquire a company started by Dr. Hinton and his two students. And their system led to the creation of increasingly powerful technologies, including new chatbots like ChatGPT and Google Bard. Mr. Sutskever went on to become chief scientist at OpenAI.
It did stuff, and the theory to explain why came afterwards!
Hinton and Sejnowski described the Boltzmann Machine in a 1983 paper. “I read that paper when I was starting my graduate studies, and I said, ‘I absolutely have to talk to these guys—they’re the only people in the world who understand that we need learning algorithms,’ ” Yann LeCun told me. In the mid-eighties, Yoshua Bengio, a pioneer in natural-language processing and in computer vision who is now the scientific director at Mila, an A.I. institute in Quebec, trained a Boltzmann Machine to recognize spoken syllables as part of his master’s thesis.
“Geoff (Hinton) was one of the external reviewers,” he recalled. “And he wrote something like ‘This should not work.’ ” Bengio’s version of the Boltzmann Machine was more effective than Hinton expected; it took Bengio a few years to figure out why. This would become a familiar pattern. In the following decades, neural nets would often perform better than expected, perhaps because new structures had formed among the neurons during training. “The experimental part of the work came before the theory,” Bengio recalled. Often, it was a matter of trying new approaches and seeing what the networks came up with.
…
Hinton was in love with the Boltzmann Machine. He hoped that it, or something like it, might underlie learning in the actual brain. “It should be true,” he told me. “If I was God, I’d make it true.” But further experimentation revealed that as Boltzmann Machines grew they tended to become overwhelmed by the randomness that was built into them. “Geoff and I disagreed about the Boltzmann Machine,” LeCun said. “Geoff thought it was the most beautiful algorithm. I thought it was ugly. It was stochastic”—that is, based partly on randomness. By contrast, LeCun said, “I thought backprop was super clean.” “Backprop,” or backpropagation, was an algorithm that had been explored by a few different researchers beginning in the nineteen-sixties. Even as Hinton was working with Sejnowski on the Boltzmann Machine, he was also collaborating with Rumelhart and another computer scientist, Ronald Williams, on backprop. They suspected that the technique had untapped potential for learning; in particular, they wanted to combine it with neural nets that operated across many layers
…used neurons to learn; therefore, complex learning through neural networks must be possible. He would work twice as hard for twice as long. When networks were trained through backprop, they needed to be told when they were wrong and by how much; this required vast amounts of accurately labelled data, which would allow networks to see the difference between a handwritten “7” and a “1,” or between a golden retriever and a red setter. But it was hard to find well-labelled datasets that were big enough, and building more was a slog. LeCun and his collaborators developed a giant database of handwritten numerals, which they later used to train networks that could read sample Zip Codes provided by the U.S. Postal Service. A computer scientist named Fei-Fei Li, at Stanford, spearheaded a gargantuan effort called ImageNet; creating it required collecting more than fourteen million images and sorting them into twenty thousand categories by hand. As neural nets grew larger, Hinton devised a way of getting knowledge from a large network into a smaller one that might run on a device like a mobile phone. “It’s called distillation,” he explained, in his kitchen. “Back in school, the art teacher would show us some slides and say, ‘That’s a Rubens, and that’s a van Gogh, and this is William Blake.’ But suppose that the art teacher tells you, ‘O.K., this is a Titian, but it’s a peculiar Titian because aspects of it are quite like a Raphael, which is very unusual for a Titian.’ That’s much more helpful. They’re not just telling you the right answer—they’re telling you other plausible answers.” In distillation learning, one neural net provides another not just with correct answers but with a range of possible answers and their probabilities. It was a richer kind of knowledge.
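The distillation idea described above, a large "teacher" network passing a range of plausible answers and their probabilities to a smaller "student", can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and does not come from the text: the function names, the made-up scores for the three painters, the `temperature` value that softens the teacher's probabilities, and the `alpha` weight that mixes the soft and hard losses.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """Penalty for predicting predicted_probs when target_probs was wanted."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)

def distillation_loss(teacher_logits, student_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Mix a soft loss (match the teacher) with a hard loss (match the label)."""
    # Soft targets: the teacher's full spread of plausible answers.
    soft_targets = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = cross_entropy(soft_targets, student_soft)
    # Hard target: only the single correct answer.
    hard_targets = [1.0 if i == true_label else 0.0
                    for i in range(len(student_logits))]
    hard_loss = cross_entropy(hard_targets, softmax(student_logits))
    return alpha * soft_loss + (1 - alpha) * hard_loss

painters = ["Titian", "Raphael", "Rubens"]
teacher_logits = [4.0, 2.5, 0.5]  # hypothetical: "a Titian, but quite like a Raphael"
student_logits = [1.0, 0.8, 0.9]  # hypothetical untrained student, nearly undecided

soft_targets = softmax(teacher_logits, temperature=2.0)
loss = distillation_loss(teacher_logits, student_logits, true_label=0)

for name, p in zip(painters, soft_targets):
    print(f"{name}: {p:.2f}")
print(f"distillation loss: {loss:.3f}")
```

The soft targets carry the "peculiar Titian" information: Raphael receives a noticeably higher probability than Rubens, which a plain right-or-wrong label would hide. Training the student would mean adjusting its scores to reduce this loss.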
Hinton was not in love with backpropagation. “It’s so unsatisfying intellectually,” he told me. Unlike the Boltzmann Machine, “it’s all deterministic. Unfortunately, it just works better.” Slowly, as practical advances compounded, the power of backprop became undeniable. In the early seventies, Hinton told me, the British government had hired a mathematician named James Lighthill to determine if A.I. research had any plausible chance of success. Lighthill concluded that it didn’t—“and he was right,” Hinton said, “if you accepted the assumption, which everyone made, that computers might get a thousand times faster, but they wouldn’t get a billion times faster.” Hinton did a calculation in his head. Suppose that in 1985 he’d started running a program on a fast research computer, and left it running until now. If he started running the same program today, on the fastest systems currently used in A.I., it would take less than a second to catch up. In the early two-thousands, as multi-layer neural nets equipped with powerful computers began to train on much larger data sets, Hinton, Bengio, and LeCun started talking about the potential of “deep learning.” The work crossed a threshold in 2012, when Hinton, Alex Krizhevsky, and Ilya Sutskever came out with AlexNet, an eight-layer neural network that was eventually able to recognize objects from ImageNet with human-level accuracy. Hinton formed a company with Krizhevsky and Sutskever and sold it to Google. He and Jackie bought the island in Georgian Bay—“my one real indulgence,” Hinton said.
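Hinton's mental calculation about a program left running since 1985 can be checked with rough arithmetic. The sketch below assumes that "now" means 2023 (the text does not give the year) and that "less than a second" means a budget of exactly one second; under those assumptions it estimates the minimum speedup implied by the claim.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

start_year = 1985
assumed_now = 2023  # assumption: the year of the conversation is not stated

# Total running time of the hypothetical 1985 program, in seconds.
elapsed_seconds = (assumed_now - start_year) * SECONDS_PER_YEAR

# To redo that much work in one second, the new system must be at least
# this many times faster than the 1985 machine.
catch_up_budget_seconds = 1.0
required_speedup = elapsed_seconds / catch_up_budget_seconds

print(f"elapsed seconds: {elapsed_seconds:.2e}")
print(f"required speedup: at least {required_speedup:.2e}x")
```

Under these assumptions the elapsed time is roughly 1.2 billion seconds, so the claim amounts to a speedup of more than a billion times. That matches the "billion times faster" figure Hinton contrasts with Lighthill's assumed "thousand times faster."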
The first artificial neural network, Perceptron Mark I, was developed in 1957 and could learn to tell whether a card was marked on the left side or the right. It had 1,000 artificial neurons, and training it required around 700,000 operations. More than 65 years later, OpenAI released the large language model GPT-4. Training GPT-4 required an estimated 21 septillion operations.
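The gap between the two training figures above can be made concrete with a short calculation. It assumes "septillion" is used in the short-scale sense of 10^24, and it takes both operation counts at face value as quoted.

```python
perceptron_ops = 700_000   # Perceptron Mark I training, as quoted
gpt4_ops = 21 * 10**24     # 21 septillion, assuming short scale (10^24)

# How many Perceptron-sized training runs fit into one GPT-4 training run.
ratio = gpt4_ops // perceptron_ops

print(f"GPT-4 / Perceptron training operations: {ratio:.1e}")
```

On these figures, training GPT-4 took about 3 × 10^19 (thirty quintillion) times as many operations as training the Perceptron Mark I.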
Google pioneered much of the foundational research that has since led to the recent explosion in large language models. Google AI introduced the Transformer architecture in 2017, which serves as the basis for the company’s later model BERT and for OpenAI’s GPT-2 and GPT-3. BERT, as noted above, now also powers Google search, the company’s cash cow.
A Short History Of ChatGPT: How We Got To Where We Are Today