Learning to Reason with LLMs: OpenAI o1
By Sahaj Godhani, AI Engineer · Published 13 Sep 2024
OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces) and places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME). It exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users.
Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we continue to investigate them.
o1 performance improves smoothly with both train-time and test-time compute.
How OpenAI o1 works
We trained these models to spend more time thinking through problems before they respond, much like a person would. Through training, they learn to refine their thinking process, try different strategies, and recognize their mistakes.
In our tests, the next model update performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematical Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%. Their coding abilities were evaluated in contests and reached the 89th percentile in Codeforces competitions. You can read more about this in our technical research post.
As an early model, it doesn’t yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images. For many common cases, GPT-4o will be more capable in the near term.
But for complex reasoning tasks, this is a significant advancement and represents a new level of AI capability. Given this, we are resetting the counter back to 1 and naming this series OpenAI o1.
How to use OpenAI o1
ChatGPT Plus and Team users can access o1 models in ChatGPT starting today. Both o1-preview and o1-mini can be selected manually in the model picker, and at launch, weekly rate limits will be 30 messages for o1-preview and 50 for o1-mini. We are working to increase those rate limits and to enable ChatGPT to automatically choose the right model for a given prompt.
ChatGPT Enterprise and Edu users will get access to both models beginning next week.
Developers who qualify for API usage tier 5 can start prototyping with both models in the API today with a rate limit of 20 RPM. We’re working to increase these limits after additional testing. The API for these models currently doesn’t include function calling, streaming, support for system messages, and other features. To get started, check out the API documentation.
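For illustration, here is a minimal sketch of what a first request to o1-preview might look like, assuming the official openai Python SDK (v1+) and the standard Chat Completions endpoint; the prompt and parameter choices are illustrative assumptions, not part of the announcement.

```python
# Minimal sketch: calling o1-preview via the Chat Completions API.
# Assumes the official `openai` Python SDK (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# At launch these models don't support system messages, streaming, or
# function calling, so the request is just a single user message.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {"role": "user", "content": "Prove that the sum of two even integers is even."}
    ],
)

print(response.choices[0].message.content)
```

Because system messages and streaming aren’t available at launch, everything beyond the user message is left at the API defaults.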
We are also planning to bring o1-mini access to all ChatGPT Free users.
Evals
To highlight the reasoning improvement over GPT-4o, we tested our models on a diverse set of human exams and ML benchmarks. We show that o1 significantly outperforms GPT-4o on the vast majority of these reasoning-heavy tasks. Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting.
o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories. Seven are shown for illustration.
Coding
By initializing from o1 and training to further improve programming skills, we trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI). This model competed in the 2024 IOI under the same conditions as the human contestants: it had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.
For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.
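The exact selection procedure isn’t published, but a hypothetical sketch helps make the idea concrete: sample many candidate programs, score each one on the public tests, the model-generated tests, and a learned scorer, then submit only the 50 highest-ranked. The Candidate fields and weights below are illustrative assumptions, not OpenAI’s actual method.

```python
# Hypothetical sketch of a test-time selection strategy: rank many sampled
# candidate programs and keep the top 50. The weighting and the scoring
# signals here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    program: str
    public_test_pass_rate: float      # fraction of IOI public tests passed
    generated_test_pass_rate: float   # fraction of model-generated tests passed
    learned_score: float              # output of a learned scoring function, in [0, 1]

def rank_candidates(candidates, max_submissions=50):
    """Order candidates by a weighted combination of the three signals and truncate."""
    def score(c: Candidate) -> float:
        return (0.5 * c.public_test_pass_rate
                + 0.3 * c.generated_test_pass_rate
                + 0.2 * c.learned_score)
    return sorted(candidates, key=score, reverse=True)[:max_submissions]
```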
With a relaxed submission constraint, we found that model performance improved significantly. When allowed 10,000 submissions per problem, the model achieved a score of 362.14 — above the gold medal threshold — even without any test-time selection strategy.
Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model’s coding skill. Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors. The fine-tuned model far exceeded both GPT-4o and o1: it achieved an Elo rating of 1807, performing better than 93% of competitors.
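To put those Elo figures in perspective, the standard Elo formula converts a rating gap into an expected score. The snippet below applies it to the reported ratings purely as an illustration; the formula is standard, but this calculation is my own addition, not part of the evaluation.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo formula: expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Using the reported ratings as an example: a 1807-rated player is expected
# to score roughly 99.7% against an 808-rated one.
print(elo_expected_score(1807, 808))
```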
Further fine-tuning on programming competitions improves o1: the improved model ranked in the 49th percentile in the 2024 International Olympiad in Informatics under competition rules.
Human preference evaluation
In addition to exams and academic benchmarks, we also evaluated human preference for o1-preview vs GPT-4o on challenging, open-ended prompts in a broad spectrum of domains. In this evaluation, human trainers were shown anonymized responses to a prompt from o1-preview and GPT-4o and voted for which response they preferred. o1-preview is preferred to GPT-4o by a large margin in reasoning-heavy categories like data analysis, coding, and math. However, o1-preview is not preferred on some natural language tasks, suggesting that it is not well-suited for all use cases.
Safety
Chain of thought reasoning provides new opportunities for alignment and safety. We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context, we found evidence of reasoning capability directly benefiting model robustness: o1-preview achieved substantially improved performance on key jailbreak evaluations and our hardest internal benchmarks for evaluating our model’s safety refusal boundaries. We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking legibly, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios.
To stress-test our improvements, we conducted a suite of safety tests and red-teaming before deployment, in accordance with our Preparedness Framework. We found that chain of thought reasoning contributed to capability improvements across our evaluations. Of particular note, we observed interesting instances of reward hacking. Detailed results from these evaluations can be found in the accompanying System Card.
Conclusion
o1 significantly advances the state-of-the-art in AI reasoning. We plan to release improved versions of this model as we continue iterating. We expect these new reasoning capabilities will improve our ability to align models to human values and principles. We believe o1 — and its successors — will unlock many new use cases for AI in science, coding, math, and related fields. We are excited for users and API developers to discover how it can improve their daily work.
If you enjoyed this post and would like to read more, you can subscribe or follow here — https://sahajgodhani777.medium.com/ — to get an email whenever I publish a story.