{"id":36,"date":"2024-09-27T06:51:28","date_gmt":"2024-09-27T04:51:28","guid":{"rendered":"https:\/\/blog.iese.edu\/artificial-intelligence-management\/?p=36"},"modified":"2024-09-27T09:01:35","modified_gmt":"2024-09-27T07:01:35","slug":"openai-new-models","status":"publish","type":"post","link":"https:\/\/blog.iese.edu\/artificial-intelligence-management\/2024\/openai-new-models\/","title":{"rendered":"A Breakthrough for Large Language Models Arrives"},"content":{"rendered":"<div>\n<p class=\"Default\"><span lang=\"EN-US\">The OpenAI models released last week are a game changer for large language models (LLMs), which have for the last year-and-a-half been rather stable in performance. Some had caught up with OpenAI&#8217;s ChatGPT-4, but nobody had significantly surpassed it. The frontier had not advanced far. <\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span lang=\"EN-US\">This had raised speculation, including my own, over whether the scaling of ever larger training datasets, ever larger numbers of parameters in the models, and ever longer training runs had reached some kind of limit. That limit might be related to operational issues surrounding <\/span><span lang=\"IT\">compute cost <\/span><span lang=\"EN-US\">and capacity that would be relaxed over time. Or maybe there was a more substantive limit in the underlying paradigm of the transformer model architecture. The conclusion was that something else, a novel innovation, could be needed for further gains. <\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span lang=\"EN-US\">One such innovation now appears to have emerged.<\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span lang=\"EN-US\">The OpenAI o1 model is a change of paradigm in the sense that it is not simply the same thing scaled bigger, but rather that it thinks (or computes) more at inference time (i.e. at the time of use). 
It is shifting the distribution of computation from being almost entirely at training time, when the model is developed, towards increased computation at inference time, when the model is used. <\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span lang=\"EN-US\">This model is not scaling more data, parameters, or training compute to increase its knowledge, but it is scaling more compute and data to train specifically for problem solving, and in particular it is scaling compute at the time of use. It has built-in steps to evaluate and improve its solutions, and it shows much better performance than GPT-4 in <\/span><span class=\"Hyperlink0\"><span lang=\"EN-US\"><a href=\"https:\/\/openai.com\/index\/learning-to-reason-with-llms\/\">solving problems<\/a><\/span><\/span><span class=\"None\"><span lang=\"EN-US\"> that involve complex, multistep reasoning.\u00a0It\u2019s much better at generating detailed code from ambiguous descriptions and at producing coherent, structured content over much longer texts, for instance.\u00a0 <\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">Before diving into this, here\u2019s a very brief glossary of key AI terms:<\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><b><span lang=\"EN-US\">Training Data<\/span><\/b><\/span><span lang=\"EN-US\">: The large, diverse text corpus used to teach AI models how to understand and generate language.<\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><b><span lang=\"EN-US\">Parameters<\/span><\/b><\/span><span lang=\"EN-US\">: Internal model values adjusted during training, capturing patterns in the data for generating language.<\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><b><span lang=\"EN-US\">Inference<\/span><\/b><\/span><span lang=\"EN-US\">: The process of generating outputs (like text) from a trained model based on new 
inputs.<\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><b><span lang=\"EN-US\">Compute<\/span><\/b><\/span><span lang=\"EN-US\">: The computational power required to train and run large AI models, typically involving high-performance hardware.<\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><b><span lang=\"EN-US\">Tokens<\/span><\/b><\/span><span lang=\"EN-US\">: Basic units of text (words, subwords, or characters) that models process when generating language.<\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><b><span lang=\"EN-US\">Autoregressive Models<\/span><\/b><\/span><span lang=\"EN-US\">: Models that generate text one token at a time, with each token depending on the preceding ones.<\/span><\/p>\n<\/div>\n<div>\n<p>&nbsp;<\/p>\n<p class=\"Default\"><span class=\"None\"><b><span lang=\"EN-US\">The Preliminaries<\/span><\/b><\/span><b><\/b><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">A key insight to get us started is in the classic 2019 article by the famous computer scientist Rich Sutton titled &#8220;<\/span><\/span><span class=\"Hyperlink0\"><span lang=\"EN-US\"><a href=\"http:\/\/www.incompleteideas.net\/IncIdeas\/BitterLesson.html\">The Bitter Lesson<\/a><\/span><\/span><span class=\"None\"><span lang=\"EN-US\">&#8221;:<\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">\u201c<\/span><\/span><span class=\"None\"><span lang=\"EN-US\">The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin\u2026Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, <\/span><\/span><span class=\"None\"><b><span lang=\"EN-US\">but the only thing that matters in the long run is the leveraging of 
computation.<\/span><\/b><\/span><b><\/b><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great.\u00a0The two methods that seem to scale arbitrarily in this way are search and learning.\u201d<\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">This has the key implication that there is little secret sauce, little magic to Artificial Intelligence in general. It\u2019s computation and lots of it. It&#8217;s difficult to come up with useful, novel algorithms, and in the end, increasing computing power will trump any model advantage, which makes it hard to sustain a competitive advantage based on algorithms. <\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">It is possible to have an advantage based on unique data or being able to throw more money (i.e. 
computation) at the problem than anyone else.<\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"PT\">Prior <\/span><\/span><span class=\"Hyperlink1\"><span lang=\"EN-US\"><a href=\"https:\/\/arxiv.org\/abs\/2001.08361\">research<\/a><\/span><\/span><span class=\"None\"><span lang=\"EN-US\"> has found that for LLMs, the key dimensions of scaling are the size of the model in terms of the number of parameters, the size of the training dataset, and the amount of computation used in training the model.<\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">The <\/span><\/span><span class=\"None\"><span lang=\"FR\">graph<\/span><\/span><span class=\"None\"><span lang=\"EN-US\"> below from that research paper illustrates two key points. First, the relationship between <\/span><\/span><span class=\"Hyperlink1\"><span lang=\"EN-US\"><a href=\"https:\/\/gwern.net\/scaling-hypothesis\">scaling<\/a><\/span><\/span><span class=\"None\"><span lang=\"EN-US\"> and these dimensions is very stable and holds over a very large range of values. More is always better, it seems. Second, to get the same marginal improvement in performance, the increase in the inputs of training data, model size, and training compute has to be ever bigger. There are declining marginal returns. 
This will create at least temporary slowdowns in development as the cost of adding more inputs becomes prohibitively expensive, but eventually, as Sutton explains above, technological development in computation will drive the costs down again.<\/span><\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-37\" src=\"https:\/\/blog.iese.edu\/artificial-intelligence-management\/files\/2024\/09\/Picture-1-300x126.png\" alt=\"\" width=\"524\" height=\"220\" srcset=\"https:\/\/blog.iese.edu\/artificial-intelligence-management\/files\/2024\/09\/Picture-1-300x126.png 300w, https:\/\/blog.iese.edu\/artificial-intelligence-management\/files\/2024\/09\/Picture-1-768x322.png 768w, https:\/\/blog.iese.edu\/artificial-intelligence-management\/files\/2024\/09\/Picture-1-500x210.png 500w, https:\/\/blog.iese.edu\/artificial-intelligence-management\/files\/2024\/09\/Picture-1.png 964w\" sizes=\"auto, (max-width: 524px) 100vw, 524px\" \/><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><strong><span class=\"None\"><span lang=\"EN-US\">Imperfect Models <\/span><\/span><\/strong><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">There are some features and limitations of these LLMs that are important to understand:<\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">First, what they fundamentally do is predict the most likely next word (or more accurately, the next <\/span><\/span><span class=\"Hyperlink2\"><span lang=\"NL\"><a href=\"https:\/\/docs.ai21.com\/docs\/large-language-models\">token<\/a><\/span><\/span><span class=\"None\"><span lang=\"EN-US\">) given the prompt and previously predicted words. There is no real &#8220;intelligence&#8221; behind it. It is simply replicating patterns of language that it has &#8220;learned&#8221; from the training data. 
Given the enormous amounts of data, the enormously large models, and the enormous amounts of compute, these patterns can be very complex and appear to be very &#8220;intelligent&#8221;.<\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">Second, the models are <\/span><\/span><span class=\"Hyperlink3\"><span lang=\"PT\"><a href=\"https:\/\/medium.com\/@nbettencourt2020\/are-auto-regressive-large-language-models-here-to-stay-603baee46a3c\">auto-regressive<\/a><\/span><\/span><span class=\"None\"><span lang=\"EN-US\"> in the sense that each word the model outputs influences the following words, and thus the output is path dependent. The first word it predicts shapes everything else that follows. This means the model can get stuck on the wrong path and occasionally outputs an error when it&#8217;s unable to find a reasonable continuation.<\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">Third, there is no truth filter or understanding embedded in the model. The output is a probabilistic sequence of words that appears to be very reasonable given the immense complexity of the patterns embedded in the model. So the models can output completely non-factual information that appears reasonable, a process generally termed &#8220;hallucination&#8221;. 
This is a fundamental feature of the models themselves and not something that can ever be <\/span><\/span><span class=\"Hyperlink0\"><span lang=\"EN-US\"><a href=\"https:\/\/arxiv.org\/abs\/2409.05746\">eliminated<\/a><\/span><\/span><span class=\"None\"><span lang=\"EN-US\"> entirely.<\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">Fourth, prompting an LLM to &#8220;think&#8221; in steps, or generating a <\/span><\/span><span class=\"Hyperlink0\"><span lang=\"EN-US\"><a href=\"https:\/\/arxiv.org\/abs\/2201.11903\">chain of thought <\/a><\/span><\/span><span class=\"None\"><span lang=\"EN-US\">(CoT), can significantly improve its ability to do more complex \u201creasoning.\u201d Instead of asking simply for an answer, it is better to ask the model to think through the steps to reach the answer. Giving the model an example of a chain of thought is particularly helpful.<\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">This turns out to be immensely powerful. In May 2024, a group of researchers from Stanford, Toyota, and Google <\/span><\/span><span class=\"Hyperlink0\"><span lang=\"EN-US\"><a href=\"https:\/\/arxiv.org\/abs\/2402.12875\">showed<\/a> <\/span><\/span><span class=\"None\"><span lang=\"EN-US\">that transformers can solve any problem \u201c<\/span><\/span><span class=\"Hyperlink0\"><span lang=\"EN-US\"><a href=\"https:\/\/x.com\/denny_zhou\/status\/1835761801453306089\">provided<\/a><\/span><\/span><span class=\"None\"><span lang=\"EN-US\"> they are allowed to generate as many intermediate reasoning tokens as needed.\u201d<\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">In other words, if we were able to let the model run infinitely long, a transformer model could in principle solve any problem. In practice, infinity is a long time. 
And this is what the new o1 models improve: They search better for solutions to problems.<\/span><\/span><\/p>\n<\/div>\n<div>\n<p class=\"Default\"><span class=\"None\"><span lang=\"EN-US\">And therein lies the paradigm shift of the new OpenAI models, which I\u2019ll cover in the next blog post. <\/span><\/span><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>The OpenAI models released last week are a game changer for large language models (LLMs), which have for the last year-and-a-half been rather stable in performance. Some had caught up with OpenAI&#8217;s ChatGPT-4, but nobody had significantly surpassed it. The frontier had not advanced far. This had raised speculation, including my own, over whether the [&hellip;]<\/p>\n","protected":false},"author":2240,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[102155,120810,120566,102157],"class_list":["post-36","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-artificial-intelligence","tag-chain-of-thought","tag-llms","tag-openai"],"_links":{"self":[{"href":"https:\/\/blog.iese.edu\/artificial-intelligence-management\/wp-json\/wp\/v2\/posts\/36","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.iese.edu\/artificial-intelligence-management\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.iese.edu\/artificial-intelligence-management\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.iese.edu\/artificial-intelligence-management\/wp-json\/wp\/v2\/users\/2240"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.iese.edu\/artificial-intelligence-management\/wp-json\/wp\/v2\/comments?post=36"}],"version-history":[{"count":4,"href":"https:\/\/blog.iese.edu\/artificial-intelligence-management\/wp-json\/wp\/v2\/posts\/36\/revisions"}],"predecessor-version":[{"id":41,"href":"https:\/\/bl
og.iese.edu\/artificial-intelligence-management\/wp-json\/wp\/v2\/posts\/36\/revisions\/41"}],"wp:attachment":[{"href":"https:\/\/blog.iese.edu\/artificial-intelligence-management\/wp-json\/wp\/v2\/media?parent=36"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.iese.edu\/artificial-intelligence-management\/wp-json\/wp\/v2\/categories?post=36"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.iese.edu\/artificial-intelligence-management\/wp-json\/wp\/v2\/tags?post=36"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}