close
close

Gottagopestcontrol

Trusted News & Timely Insights

Ban warnings fly through the air as users dare to examine the “minds” of OpenAI’s latest model
Alabama

Ban warnings fly through the air as users dare to examine the “minds” of OpenAI’s latest model

An illustration of gears in the shape of a brain.

OpenAI really doesn’t want you to know what its latest AI model is “thinking.” Since the company launched its “Strawberry” family of AI models last week, touting so-called thinking capabilities with o1-preview and o1-mini, OpenAI has been sending out warning emails and threatening bans to any user who tries to investigate how the model works.

Unlike OpenAI’s previous AI models, such as GPT-4o, the company specifically trained o1 to go through a step-by-step problem-solving process before generating an answer. When users ask an “o1” model a question in ChatGPT, they have the option to see that train of thought displayed in the ChatGPT interface. However, OpenAI intentionally hides the raw train of thought from users and instead presents a filtered interpretation created by a second AI model.

Nothing is more tempting to enthusiasts than hidden information, so hackers and red team members have started a race to uncover o1’s thought process, using jailbreaking or prompt injection techniques to trick the model into revealing its secrets. There are early reports of some successes, but there is no clear confirmation yet.

OpenAI is monitoring the process via the ChatGPT interface and is reportedly cracking down on any attempts to probe o1’s reasoning, even by the curious.

A screenshot of a
Enlarge / A screenshot of an “o1 preview” output in ChatGPT with the filtered thought chain section displayed directly below the “Thinking” subheading.

Ben Edwards

One X user reported (confirmed by others, including Scale AI Prompt Engineer Riley Goodside) that he received a warning email when using the term “argumentation trace” when talking to o1. Others say that the warning is triggered simply by asking ChatGPT about the model’s “argumentation” in the first place.

OpenAI’s warning email states that certain user requests have been flagged for violating policies to circumvent protection or security measures. “Please cease this activity and ensure that you use ChatGPT in accordance with our Terms of Service and Acceptable Use Policy,” it says. “Further violations of this policy may result in loss of access to GPT-4o with Reasoning,” referring to an internal name for the o1 model.

A warning email from OpenAI that a user received after questioning o1-preview about its thought processes.
Enlarge / A warning email from OpenAI that a user received after questioning o1-preview about its thought processes.

Marco Figueroa, who manages Mozilla’s GenAI bug bounty programs, was one of the first to post about the OpenAI warning email on X last Friday, complaining that it limits his ability to conduct positive red-teaming security research on the model. “I was too busy focusing on #AIRedTeaming to realize that yesterday, after all my jailbreaks, I received this email from @OpenAI,” he wrote.I’m now on the blocked list!!!

Hidden chains of thought

In a post titled “Learning to Reason with LLMs” on OpenAI’s blog, the company says hidden thought chains in AI models provide a unique monitoring opportunity, allowing them to “read the mind” of the model and understand its so-called thought process. These processes are most useful to the company when they remain unadulterated and uncensored, but that might not be consistent with the company’s best commercial interests for several reasons.

“For example, in the future we would like to monitor thought processes for signs of user manipulation,” the company writes. “However, for this to work, the model must have the freedom to express its thoughts unmodified, so we cannot teach the thought process policy compliance or user preferences. We also do not want to make unaligned thought processes directly visible to users.”

OpenAI has opted against showing users these raw thought chains, citing factors such as the need to maintain a raw feed for its own use, user experience, and “competitive advantage.” The company acknowledges that the decision has drawbacks. “We strive to partially offset this by teaching the model to reproduce any useful ideas from the thought chain in the response,” they write.

On the topic of “competitive advantage,” independent AI researcher Simon Willison expressed his frustration in a post on his personal blog. “I interpret (this) as meaning that you want to prevent other models from being trained with the brainpower they have invested in,” he writes.

It is an open secret in the AI ​​industry that researchers regularly use results from OpenAI’s GPT-4 (and before that GPT-3) as training data for AI models that often later become competitors, even though this practice violates OpenAI’s terms of service. Exposing o1’s raw thought chain would provide a treasure trove of training data for competitors to use to train o1-like “reasoning” models.

Willison believes it is a loss for community transparency that OpenAI is keeping o1’s internal workings so secret. “I am not at all happy with this policy decision,” Willison wrote. “As someone who develops against LLMs, interpretability and transparency are the most important things to me – the idea that I can run a complex prompt and have important details about how that prompt is evaluated hidden from me feels like a huge step backwards.”

LEAVE A RESPONSE

Your email address will not be published. Required fields are marked *