OpenAI has officially launched a new general-purpose agent within ChatGPT. The agent posts benchmark scores well above those of its predecessors, o3 and o4-mini, in some cases more than doubling them.
Record-Breaking Performance Metrics
The new ChatGPT agent model scored 41.6% on “Humanity’s Last Exam” (pass@1, meaning one attempt per question), a rigorous evaluation comprising thousands of questions across more than 100 subjects. That is a substantial improvement over the o3 and o4-mini models.
Furthermore, the agent demonstrated exceptional aptitude on FrontierMath, one of the most challenging mathematical benchmarks available. When equipped with tools such as a terminal for code execution, the ChatGPT agent secured a 27.4% score, significantly outperforming the previous state-of-the-art result of 6.3% held by o4-mini.
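To put the FrontierMath figures in perspective, the gap can be expressed both in percentage points and as a multiple. The short Python sketch below uses only the two scores quoted above; the function name `improvement` is illustrative and not taken from any OpenAI material.

```python
def improvement(new_score: float, old_score: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, ratio of new to old)."""
    return new_score - old_score, new_score / old_score


# Scores quoted in the article: ChatGPT agent with tools vs. o4-mini.
gain, ratio = improvement(27.4, 6.3)
print(f"+{gain:.1f} points, {ratio:.2f}x")  # prints "+21.1 points, 4.35x"
```

On these numbers the agent scores roughly 4.3 times the earlier result, a gain of about 21 percentage points.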
Prioritizing Safety in Agentic AI
Because of the model's advanced capabilities, OpenAI has implemented strict safety protocols to prevent misuse by malicious actors. In an official safety report, the company classified the agent as “high capability” in the biological and chemical weapons domains. OpenAI says it lacks direct evidence of exploitation, but it has adopted a precautionary stance to mitigate the risk of “amplifying existing pathways to severe harm.”
Advanced Safeguards and Feature Restrictions
To maintain security, OpenAI has integrated a real-time monitoring system. Every user prompt now passes through a classifier that detects biology-related queries. If the classifier flags a prompt, a second monitor evaluates the model’s response to ensure it cannot be used to facilitate biological threats.
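The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OpenAI's implementation: the keyword list, the `[hazardous]` marker, and the names `is_biology_related`, `response_is_safe`, and `guarded_reply` are all hypothetical stand-ins for trained classifiers whose details have not been published.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical keyword list; a production system would use a trained classifier.
BIO_TERMS = ("pathogen", "toxin", "virus", "bacteria")


def is_biology_related(prompt: str) -> bool:
    """Stage 1 (stand-in): flag prompts that mention biology-related terms."""
    text = prompt.lower()
    return any(term in text for term in BIO_TERMS)


def response_is_safe(response: str) -> bool:
    """Stage 2 (stand-in): reject responses carrying a placeholder marker."""
    return "[hazardous]" not in response.lower()


@dataclass
class Result:
    flagged: bool  # did the prompt classifier trigger?
    blocked: bool  # did the response monitor withhold the answer?
    text: str


def guarded_reply(prompt: str, model: Callable[[str], str]) -> Result:
    """Run the model, then apply the second monitor only to flagged prompts."""
    response = model(prompt)
    flagged = is_biology_related(prompt)
    if flagged and not response_is_safe(response):
        return Result(True, True, "Response withheld by safety monitor.")
    return Result(flagged, False, response)
```

In this sketch, a benign biology question is flagged but still answered, while only a flagged prompt paired with an unsafe response is blocked. That mirrors the article's description that the second monitor runs only when the first classifier triggers.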
Additionally, OpenAI has disabled the ChatGPT memory feature for this agent. The decision is meant to guard against prompt injection attacks that could trick the agent into exfiltrating sensitive data stored in memory. The company may reintroduce memory in the future, but the feature remains disabled for now.
The Road Ahead for Agentic Technology
While the technical specifications of the ChatGPT agent are impressive, the true test lies in real-world application. Historically, agent technology has struggled with the volatility of real-world interactions. OpenAI asserts that this new model is specifically engineered to overcome those limitations and deliver on the long-standing promise of truly capable AI agents.
