OpenAI’s o1 “Strawberry” ChatGPT model can argue – and carries risks

An underappreciated fact about large language models (LLMs) is that they produce their responses “live.” You give them a prompt and they start talking, and they keep talking until they’re done. The result is like asking a person a question and getting back a monologue in which they improvise their answer sentence by sentence.

This explains some of what makes large language models so frustrating. The model sometimes contradicts itself within a single paragraph, saying one thing and then immediately claiming the exact opposite, because it is just “thinking out loud” and sometimes changes its mind midstream. As a result, AIs need a lot of help to do any kind of complex reasoning.
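To make this concrete, here is a minimal sketch of the token-by-token generation loop described above, using the Hugging Face transformers library. The model name and generation length are illustrative assumptions (a small open model, nothing specific to o1); the point is simply that each word is committed to before the next one is chosen.

```python
# Minimal sketch of autoregressive decoding: the model commits to one token
# at a time, and each choice conditions everything that comes after it.
# "gpt2" is used only as a small, freely available example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The biggest limitation of large language models is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(40):  # generate 40 tokens, one at a time
        logits = model(input_ids).logits  # scores for every candidate next token
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)  # no going back to revise

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```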


A well-known method for addressing this problem is chain-of-thought prompting. This involves asking the large language model to effectively “show its work” by thinking through the problem out loud and only giving an answer after it has laid out all its reasoning step by step.

With chain-of-thought prompting, language models behave much more intelligently, which is not surprising. Compare how you would respond to a question if someone shoved a microphone in your face and demanded an immediate answer, with the answer you would give if you had time to write a draft, review it, and then hit publish.
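As an illustration, here is a hedged sketch of what chain-of-thought prompting can look like in practice, using the OpenAI Python SDK. The model name, the sample question, and the exact wording of the instruction are illustrative assumptions; the only substantive difference between the two requests is asking the model to lay out its reasoning before answering.

```python
# Sketch: the same question asked directly vs. with a chain-of-thought instruction.
# Assumes the openai package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

question = (
    "A bat and a ball cost $1.10 together. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

# Direct prompt: the model has to commit to an answer right away.
direct = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought prompt: ask for the reasoning first, then the answer.
chain_of_thought = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": question
        + " Think through the problem step by step, then give your final answer on the last line.",
    }],
)

print("Direct:", direct.choices[0].message.content)
print("Step by step:", chain_of_thought.choices[0].message.content)
```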

The power of thinking and then responding

OpenAI’s latest model, o1 (nicknamed Strawberry), is the first major LLM version with an integrated “think, then answer” approach.

Unsurprisingly, the company reports that the method makes the model much smarter. In a blog post, OpenAI said that o1 “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in mathematics and programming. In a qualifying exam for the International Mathematical Olympiad (IMO), GPT-4o solved only 13 percent of problems correctly, while the reasoning model achieved 83 percent.”

This significant improvement in the model’s reasoning ability also enhances some of the dangerous capabilities that leading AI researchers have long been worried about. Before release, OpenAI tests its models for their ability to assist with chemical, biological, radiological, and nuclear weapons. These capabilities would be of particular interest to terrorist groups that lack the know-how to build such weapons with today’s technology.

As my colleague Sigal Samuel recently wrote, OpenAI o1 is the first model to be rated “medium” risk in this category. That is, while it may not be able to walk a novice through, say, developing a deadly pathogen, the reviewers found that it “can help experts operationally plan the reproduction of a known biological threat.”

These capabilities are one of the clearest examples of AI as a dual-use technology: a smarter model can be used in a variety of applications, both good and bad.

If the AI of the future becomes good enough to help university biology students recreate smallpox in a lab, the death toll could be catastrophic. At the same time, AIs that can assist humans with complex biology projects would do an enormous amount of good by accelerating life-saving research. Intelligence itself, artificial or not, is a double-edged sword.

The point of AI safety work that assesses these risks is to figure out how they can be mitigated through policy, so we get the good without the bad.

How to evaluate AI (and how not to)

Every time OpenAI or one of its competitors (Meta, Google, Anthropic) releases a new model, we have the same conversations. Some people find a question the AI handles impressively and share screenshots of it. Others find a question the AI flubs – like “How many ‘r’s are in ‘strawberry’?” or “How do you cross a river with a goat?” – and share it as proof that AI is still more hype than product.

Part of this pattern stems from the lack of good scientific measures of an AI system’s performance. There used to be benchmarks designed to measure AI’s language and reasoning abilities, but the rapid pace of AI improvement has overtaken them, and the benchmarks often “saturate”: the AI performs as well as a human on them, so they are no longer useful for measuring further gains in capability.

I highly recommend trying out AIs yourself to get a feel for how well they perform. (OpenAI o1 is currently only available to paying subscribers, and even then it is heavily rate-limited, but new top-of-the-line releases arrive all the time.) It’s still too easy to fall into the trap of trying to prove a new model “impressive” or “unimpressive” by selectively looking for tasks where it excels or where it embarrasses itself, rather than looking at the bigger picture.

The big picture is that AI systems continue to improve rapidly at almost every task we’ve invented for them, but that incredible benchmark performance hasn’t yet translated into many commercial applications. Companies are still struggling to figure out how to make money with LLMs. One big obstacle is the models’ inherent unreliability, and in principle an approach like OpenAI o1’s – giving the model more time to think before responding – could be a way to dramatically improve reliability without the cost of training a much larger model.

Sometimes small improvements can make a big difference

In all likelihood, there will be no silver bullet that suddenly fixes the long-standing limitations of large language models. Rather, I suspect they will be gradually chipped away at over a series of releases, with the unthinkable becoming achievable and then commonplace within a few years – which is exactly how AI progress has gone so far.

But as ChatGPT—which itself was only a moderate improvement over OpenAI’s previous chatbots but reached hundreds of millions of people overnight—shows, incremental technical progress doesn’t mean that societal impact is incremental, too. Sometimes, the painstaking work of improving different parts of how an LLM works—or improving its user interface so more people try it out, as with the chatbot itself—gets us over the threshold from “party trick” to “indispensable tool.”

And while OpenAI has recently come under fire for downplaying the safety implications of its work and silencing whistleblowers, with the o1 release it seems to be taking the policy implications seriously, including working with outside organizations to test what its model can do. I’m glad they’re making that work possible, and I suspect that as models keep improving, we’ll need this kind of conscientious work more than ever.

A version of this story originally appeared in the Future Perfect newsletter. Sign up here!
