Chain of Thought Reasoning, the New LLM Breakthrough

In my previous blog post, I gave some background on large language models (LLMs) and their shortcomings, and described a recent breakthrough brought by new OpenAI models. Now, let’s take a deep dive into that paradigm shift.

Going back to the quote from The Bitter Lesson:

One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

The key difference in the o1 model is that it scales search (computation at the time of using the model) in addition to learning (computation at the time of training the model). As OpenAI explains in its introduction of o1, the model has been trained to use chain of thought reasoning automatically when creating its answers:

“Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem. Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason.”

In contrast to prompting an LLM to use chain of thought, that is, to think in steps, the o1 model has been trained to do that by itself. What this implies is that there still appears to be a base model, quite possibly GPT-4o, trained with a fixed set of training data, but that on top of it there is separate training for solving problems. Supporting this is the fact, shared by OpenAI research scientist Noam Brown, that o1 appears to perform about the same as GPT-4o when the task does not require step-by-step logical problem solving:

“Our o1 models aren’t always better than GPT-4o. Many tasks don’t need reasoning, and sometimes it’s not worth it to wait for an o1 response vs a quick GPT-4o response.”
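
To make the contrast concrete, here is a minimal sketch using the OpenAI Python SDK (my illustration, not OpenAI’s code; it assumes your API key has access to both models): with GPT-4o, chain of thought has to be requested in the prompt, while o1 reasons on its own before answering.

```python
# Minimal sketch (assumes the OpenAI Python SDK and API access to both models).
from openai import OpenAI

client = OpenAI()
question = (
    "A train leaves at 9:00 going 60 km/h; another leaves at 10:00 going 90 km/h "
    "on the same track. When does the second train catch up?"
)

# With GPT-4o, step-by-step reasoning has to be asked for explicitly in the prompt.
prompted = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"Think step by step, then answer: {question}"}],
)

# With o1, the prompt just states the problem; the model produces a hidden
# chain of thought on its own before responding.
built_in = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": question}],
)

print(prompted.choices[0].message.content)
print(built_in.choices[0].message.content)
```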

Some additional pieces of information were revealed in an OpenAI session for developers:

Notes from OpenAI Developers AMA [Ask Me Anything] session:

  • o1 is a single model; there are no multiple models, “monologue”, etc.
  • o1 is able to generate very long responses while keeping coherence, unlike all previous models
  • no amount of prompt engineering on GPT-4o can match o1’s performance
  • o1-preview and o1 are the same size
  • the o1-mini dataset had much more STEM data than other data

Training to solve problems could in principle be done by rewarding correct final solutions or by rewarding a correct solution process. The explicit statement above that “Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses” strongly suggests the latter. And this is also what a May 2023 research paper from OpenAI found to be the better alternative:

“In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. … We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset.”
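
The distinction is easy to picture in code. Here is a toy sketch (my own illustration, not the paper’s implementation) of the two kinds of feedback signal:

```python
# Toy illustration of outcome supervision vs. process supervision (not OpenAI's code).
from typing import List

def outcome_reward(final_answer: str, correct_answer: str) -> float:
    """Outcome supervision: one reward for the whole solution, based only on the final result."""
    return 1.0 if final_answer.strip() == correct_answer.strip() else 0.0

def process_rewards(step_labels: List[str]) -> List[float]:
    """Process supervision: one reward per intermediate step, e.g. from human labelers
    who mark each step positive, neutral, or negative."""
    mapping = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}
    return [mapping[label] for label in step_labels]

# Outcome supervision only sees the end of the solution; process supervision
# tells the model *which* step went wrong.
print(outcome_reward("x = 4", "x = 4"))                       # 1.0
print(process_rewards(["positive", "negative", "positive"]))  # [1.0, -1.0, 1.0]
```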

A natural way to train for a process to solve problems is to have a set of step-by-step solutions to problems as training material. And in January 2023, Semafor reported that OpenAI was collecting data that appears related:

“OpenAI, the company behind the chatbot ChatGPT, has ramped up its hiring around the world, bringing on roughly 1,000 remote contractors over the past six months in regions like Latin America and Eastern Europe, according to people familiar with the matter.

… The other 40% are computer programmers who are creating data for OpenAI’s models to learn software engineering tasks.

A software developer in South America who completed a five-hour unpaid coding test for OpenAI told Semafor he was asked to tackle a series of two-part assignments. First, he was given a coding problem and asked to explain in written English how he would approach it. Then, the developer was asked to provide a solution. If he found a bug, OpenAI told him to detail what the problem was and how it should be corrected, instead of simply fixing it.”

The above-mentioned research paper from May 2023 also details a process to collect feedback on solution steps created by a model for reinforcement learning:

“To collect process supervision data, we present human data-labelers with step-by-step solutions to MATH problems sampled by the large-scale generator. Their task is to assign each step in the solution a label of positive, negative, or neutral, as shown in Figure 1.

We label solutions exclusively from the large-scale generator in order to maximize the value of our limited human-data resource. We refer to the entire dataset of step-level labels collected as PRM800K. The PRM800K training set contains 800K step-level labels across 75K solutions to 12K problems.

To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.”
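
To give a feel for what such data might look like, here is a hypothetical record in the spirit of PRM800K; the field names and structure are my guesses for illustration, not the dataset’s actual schema:

```python
# Hypothetical PRM800K-style record (field names are illustrative, not the real schema).
example = {
    "problem": "Solve for x: 2x + 3 = 11",
    "steps": [
        {"text": "Subtract 3 from both sides: 2x = 8", "label": "positive"},
        {"text": "Divide both sides by 2: x = 4", "label": "positive"},
        {"text": "Therefore x = 5", "label": "negative"},
    ],
}

# The PRM800K training set contains 800K such step-level labels
# across 75K solutions to 12K problems.
for step in example["steps"]:
    print(step["label"], "-", step["text"])
```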

Furthermore, in July 2024, OpenAI published a blog post describing how they developed CriticGPT, a model based on GPT-4, to evaluate the output steps produced by ChatGPT, both to speed up this reinforcement learning and to improve the evaluation of model output:

“We’ve trained a model, based on GPT-4, called CriticGPT to catch errors in ChatGPT’s code output. We found that when people get help from CriticGPT to review ChatGPT code they outperform those without help 60% of the time. We are beginning the work to integrate CriticGPT-like models into our RLHF [Reinforcement Learning from Human Feedback] labeling pipeline, providing our trainers with explicit AI assistance. This is a step towards being able to evaluate outputs from advanced AI systems that can be difficult for people to rate without better tools. …

As we make advances in reasoning and model behavior, ChatGPT becomes more accurate and its mistakes become more subtle. This can make it hard for AI trainers to spot inaccuracies when they do occur, making the comparison task that powers RLHF much harder. This is a fundamental limitation of RLHF, and it may make it increasingly difficult to align models as they gradually become more knowledgeable than any person that could provide feedback.”

This additional training opens another avenue for scaling in the development of AI models, and OpenAI published graphs showing exactly that: the longer this reinforcement learning was carried out, the better the model learned to solve problems. This comes on top of scaling the dataset, the model size, and (pre-)training compute. The relationship appears to be log-linear, in that exponential increases in training compute are required for linear increases in performance, depending of course on the outcome measure.
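
As a toy illustration of what “log-linear” means here (my own made-up numbers, not OpenAI’s): if accuracy grows linearly in the logarithm of compute, then every fixed gain in accuracy costs a constant multiplicative increase in compute.

```python
import math

def toy_accuracy(compute_multiplier: float) -> float:
    """Hypothetical log-linear curve: accuracy = 0.40 + 0.05 * log10(relative compute)."""
    return 0.40 + 0.05 * math.log10(compute_multiplier)

for mult in [1, 10, 100, 1_000, 10_000]:
    print(f"{mult:>6}x compute -> accuracy {toy_accuracy(mult):.2f}")
# Each 10x in compute buys the same +0.05 accuracy: linear gains, exponential cost.
```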

When it actually solves problems, the o1 model then appears to perform chain of thought reasoning. What is different from simply prompting for chain of thought is that the model appears to iteratively evaluate each step and to redirect its own search through the problem’s solution space.

This last part is essential. LLMs in general are auto-regressive: each token they output influences the next token, so the model’s output can get locked into a wrong path.

But the o1 models are far less susceptible to this as they appear to redirect the search for a solution at each step. “We believe iterated CoT genuinely unlocks greater generalization,” writes Mike Knoop, co-founder of Zapier, on ARC Prize. “Automatic iterative re-prompting enables the model to better adapt to novelty… If we only do a single inference, we are limited to reapplying memorized programs. But by generating intermediate output CoTs, or programs, for each task, we unlock the ability to compose learned program components, achieving adaptation. This technique is one way to surmount the #1 issue of large language model generalization: the ability to adapt to novelty.”
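
One plausible shape of such an iterated search is sketched below; this is my illustration of the general idea, not OpenAI’s implementation, and `propose_steps` and `score_step` are hypothetical stand-ins for model calls.

```python
# Sketch of iterated chain of thought with step-level evaluation (illustrative only).
from typing import Callable, List

def iterated_cot(
    problem: str,
    propose_steps: Callable[[str, List[str]], List[str]],  # model proposes candidate next steps
    score_step: Callable[[str, List[str], str], float],    # verifier / reward model scores a step
    max_steps: int = 10,
    candidates_per_step: int = 4,
) -> List[str]:
    chain: List[str] = []
    for _ in range(max_steps):
        options = propose_steps(problem, chain)[:candidates_per_step]
        if not options:
            break
        # Evaluate and commit one step at a time, so a bad early token cannot
        # lock the whole answer onto a wrong path.
        best = max(options, key=lambda step: score_step(problem, chain, step))
        chain.append(best)
        if best.startswith("ANSWER:"):
            break
    return chain
```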

In other words, the o1 model is able to recombine solution steps on which it has been trained but is less likely to invent new methods of solving problems:

“In practice, o1 is significantly less likely to make mistakes when performing tasks where the sequence of intermediate steps is well-represented in the synthetic CoT [chain of thought] training data. … We expect to see systems like o1 do better on benchmarks that involve reusing well-known emulated reasoning templates (programs) but will still struggle to crack problems that require synthesizing brand new reasoning on the fly. … In summary, o1 represents a paradigm shift from “memorize the answers” to “memorize the reasoning” but is not a departure from the broader paradigm of fitting a curve to a distribution in order to boost performance by making everything in-distribution.”

If o1 is trying different combinations of steps to solve a problem, then the longer we let this process go on, the higher the chance of a correct solution. And this is exactly what OpenAI’s graphs show. Again, the relationship appears log-linear: exponential increases in computation are required for linear increases in performance:

“o1 is trained with RL to “think” before responding via a private chain of thought. The longer it thinks, the better it does on reasoning tasks. This opens up a new dimension for scaling. We’re no longer bottlenecked by pretraining. We can now scale inference compute too.”
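
The simplest way to see this trade-off in code is repeated sampling with a majority vote (self-consistency); this is not o1’s private chain of thought, just an illustration of spending more inference compute to get better answers. `noisy_sampler` is a made-up stand-in for a model call with temperature above zero.

```python
# Illustration only: more samples at inference time, better answers.
import random
from collections import Counter
from typing import Callable

def majority_vote(problem: str, sample_answer: Callable[[str], str], n_samples: int) -> str:
    """Sample n independent answers and return the most common one."""
    votes = Counter(sample_answer(problem) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def noisy_sampler(problem: str) -> str:
    """Dummy model: answers "42" 40% of the time, otherwise a random wrong digit."""
    return "42" if random.random() < 0.4 else str(random.randint(0, 9))

print(majority_vote("toy question", noisy_sampler, n_samples=1))    # often wrong
print(majority_vote("toy question", noisy_sampler, n_samples=101))  # almost always "42"
```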


At present, the length of this process is hard-coded:

“When OpenAI released o1 they could have allowed developers to specify the amount of compute or time allowed to refine CoT at test-time. Instead, they’ve “hard coded” a point along the test-time compute continuum and hid that implementation detail from developers.”

OpenAI has also chosen to hide the intermediate steps in the solution process:

“Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.”

And if you keep asking about the hidden deliberations, OpenAI may cut off your access, programmer and computational linguist Theia Vogel tweeted. And apparently it already has.

How much does scaling this test-time compute matter relative to scaling training-time compute? It turns out that it matters a lot: a little computation at test time saves a great deal of compute at training time:

“In 2016, AlphaGo beat Lee Sedol in a milestone for AI. But key to that was the AI’s ability to “ponder” for ~1 minute before each move. How much did that improve it? For AlphaGoZero, it’s the equivalent of scaling pretraining by ~100,000x (~5200 Elo with search, ~3000 without)”
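
For a sense of what a roughly 2,200-point Elo gap means, the standard Elo formula (my calculation, not from the tweet) gives the expected win probability of the higher-rated player:

```python
# Standard Elo expected-score formula.
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# ~0.999997: the search-augmented version is expected to win essentially every game.
print(elo_win_prob(5200, 3000))
```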

Others have found the same. A research paper from Google DeepMind in August 2024 found that, holding total computation constant, moving computation from training time to test time can allow a smaller model to outperform a 14x larger model:

“In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? … Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.”

And in another, recently revised research paper, researchers from Stanford and DeepMind find that this scaling holds for at least four orders of magnitude:

“Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage – the fraction of problems solved by any attempt – scales with the number of samples over four orders of magnitude.”
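
The “coverage” metric in the quote is simple to reproduce with toy numbers (my own, not the paper’s): if a single sample solves a problem with probability p, the chance that at least one of k independent samples solves it is 1 − (1 − p)^k, which keeps climbing as k grows across orders of magnitude.

```python
# Toy coverage calculation: fraction of problems solved by at least one of k samples.
def coverage(p_single_sample: float, k: int) -> float:
    return 1.0 - (1.0 - p_single_sample) ** k

for k in [1, 10, 100, 1_000, 10_000]:
    print(f"k={k:>6}: coverage = {coverage(0.01, k):.3f}")
# With p = 1% per attempt: 0.010, 0.096, 0.634, 1.000, 1.000 (rounded).
```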

Up next in the third post in this series: a look at the implications of this LLM breakthrough.