A new paper from the DAIR.AI team is raising questions about how AI models are evaluated today. Titled "The Evaluation Trap," the work argues that many current benchmarks measure correlated behaviors rather than true underlying capabilities, a concern that is especially relevant for agent leaderboards and model selection.

The topic has gained traction in the research community as more companies and researchers rely on benchmark results to choose models or showcase progress. The paper suggests rethinking how these tests are constructed.

In parallel, TechCrunch reports that Greg Brockman, OpenAI co-founder, is taking on broader responsibility for product strategy at the company. Brockman has been active on X recently, sharing thoughts on tokens as a universal input for problem-solving, Codex improvements, and even country-wide ChatGPT Plus access for Malta.

In funding news, Nectar Social, a Marketing OS platform, raised $30 million in a Series A led by Menlo Ventures. The investment reflects growing demand for AI-powered marketing solutions that can operate at scale.

TechCrunch also published an analysis of the widening gap between the AI "haves" and "have-nots": some players enjoy access to massive compute resources while others face budget constraints.

An older story also resurfaced in the conversation: Cerebras, the AI chip startup valued at $60 billion. According to the report, the company nearly collapsed in its early days while burning $8 million per month.

Finally, arXiv announced a new policy: authors who let AI do all the work on a paper, without meaningful human contribution, may face a one-year suspension. The move aims to preserve scientific quality and integrity.

The X conversation around these topics reflects a community grappling with deep questions about how progress is measured, how AI companies are funded, and how research standards are maintained in an era of rapidly improving tools.

Why it matters

The DAIR.AI paper and arXiv's new policy highlight the need for critical thinking about how the AI community evaluates itself. As the industry scales, so does the risk of relying on superficial metrics.

Bottom line

Today's AI discussion centered on questions of evaluation reliability, capital raises, and the evolving roles of senior founders at leading companies.