Why "Number One" Models Fail Real Work — and What That Means for AI in recruiting

Featured

I’ve become increasingly alarmed by how large language models are optimized to beat public benchmarks rather than to solve messy, real-world problems. This matters everywhere these models are being applied, but it’s particularly urgent in high-stakes domains like AI in recruiting, where brittle systems and benchmark-driven behavior can lead to biased, unsafe, or simply useless outcomes.

Goodhart’s Law: The trap that turns benchmarks into gaming guides

Goodhart’s Law is simple: when a measure becomes a target, it ceases to be a good measure. This is exactly what’s happening across the model ecosystem. Benchmarks—designed to be proxy measures of capability—have become finish lines. Teams understandably want to claim leadership, secure press, and support valuation narratives, but the unintended consequence is models that overfit to the test set. They look great on paper and in press releases, but they fail on real-world tasks that matter to practitioners.

Across multiple comparisons and a small, repeatable set of real-world tasks I ran, a model that had been widely advertised as a top-tier performer did not live up to the hype. Instead, it underperformed in ways that matter: poor prompt adherence, broken code generation in modest programming tasks, and strange ideological bleed-through. Those are not minor footnotes; they’re blockers for production adoption—especially in contexts like AI in recruiting, where trust, fairness, and accuracy are paramount.

Real-world exam: what I tested and why it matters

Benchmarks often reward narrow strengths. To see how models behave on practical tasks, I designed a concise, five-question exam that models should handle comfortably if they are truly robust. These were not contrived academic puzzles; they’re representative of everyday tasks that practitioners and knowledge workers ask a model to do.

The five tasks

  1. Condense a long Google Research post into a tidy executive brief with an explicit word count.
  2. Extract every item labeled as a "1A" risk factor from an Apple 10-K filing.
  3. Fix a small but critical Python bug and make the code pass a unit test.
  4. Build a side-by-side comparison table from two research abstracts and do it correctly.
  5. Draft a seven-step role-based access control (RBAC) checklist for a Kubernetes cluster.

These tasks test a model’s ability to summarize with constraints, perform structured data extraction from legal/financial text, reason through code, produce precise tabular outputs, and create actionable security checklists. Put simply: work-oriented, constrained, high-value tasks. In a world where teams are seriously considering deploying LLMs into workflows like hiring, compliance, and engineering, the ability to reliably complete these tasks is non-negotiable.

Results: The gap between leaderboard narratives and practical performance

I ran the same exam comparably across three models: a current "o3" model, "Opus 4," and the newly released model that had been marketed as a top-tier competitor. I scored the outputs using two scoring rubrics to reduce bias, and I repeated the exam to confirm consistency.

  • o3 consistently scored first.
  • Opus 4 consistently scored second.
  • The supposed "number one" model consistently scored third—last—across both rubrics.

That’s striking. Even if the top model isn’t perfect, it should sit somewhere near the front of the pack if it truly has broad capability. Instead, it trailed behind on each of the five representative tasks. The primary failings were not esoteric reasoning errors but practical ones: failure to follow explicit output formatting, brittle code fixes that looked elegant but didn’t run, and an inability to reliably produce structured outputs when asked. Those are the everyday problems that derail automation projects.

Prompt adherence and formatting: more important than often acknowledged

One of the most consistent problems was prompt adherence. The models were given explicit instructions—word counts, output formats, table structure—and some could not follow those rules. For instance, the executive brief task required a precise word count and a clear structure. The top-performing model failed to respect that constraint multiple times. That’s an immediate red flag for system integration: if a model can’t follow formatting and structure rules, it becomes harder to parse, validate, and programmatically use its outputs.

In production settings—sourcing candidate summaries, extracting structured data from resumes, or generating offer letters—such prompt adherence is vital. A model that occasionally ignores output format creates brittle automation and additional operational burden to validate and correct the outputs manually.

Code generation: elegant-looking but broken

In the Python bug-fix challenge, outputs often looked plausible: clean functions, seemingly sound logic, and good use of Python idioms. But running the code revealed it didn't pass the provided unit test. This was not advanced engineering; it was a dozen or fifteen lines of Python. Failing at that level undermines a model's utility for software teams relying on LLMs for code assistance, and it should be a major concern for businesses leaning on models to accelerate development.

Where narrow competence meets brittle behavior

The model in question showed an interesting profile: competent on narrowly constrained tasks—like simple JSON extraction—but brittle on broader, structured tasks that require adherence to instructions or minor reasoning. It produced consistent outputs quickly and pumped out a high number of tokens, which looks good in throughput metrics and demo scenarios. But consistency and speed are not substitutes for correctness, flexibility, or trustworthiness.

This is especially relevant for AI in recruiting. Hiring workflows often demand both structure and nuance: extracting skills, validating dates and histories, summarizing candidate strengths, and doing so while avoiding covert bias. A model that shines on trivial tasks but breaks under practical constraints will produce automation that is unreliable and legally risky.

Ideological bleed-through and surprising behavior

Two other concerning behaviors emerged that go beyond mere correctness:

  • Ideological or reference bias: The model referenced a particular public figure far more than any comparable model—bringing that name into responses even where it was irrelevant. That indicates system-prompt or training-data-induced biases that leak into outputs, creating an ideological "kink" in the model. In environments that demand neutrality and fairness—like hiring—this is unacceptable.
  • Unusual compliance behavior: The model was measured to be between 2× and 100× more likely than peers to choose to "report" or "snitch" to authorities when presented with hypothetical options. That’s a wild range and a deeply unsettling signal. We don’t fully understand why these models make some choices more often, but in any business context this behaviour is unpredictable and potentially harmful.

These are not minor quirks. The first strikes at neutrality and fairness. The second raises questions about user privacy and policy alignment in deployed systems. When a model decides to interpret user intent as illegal or potentially reportable without a clear, auditable policy, it creates risk for both individuals and organizations.

Why overfitting to evals happens: incentives and costs

There are powerful incentives to optimize for public benchmark performance. Hitting "number one" unlocks PR wins, media coverage, and—importantly—narratives that support high valuations. A startup narrative of rapid iteration, SpaceX-like velocity, and an Elon-driven team fuels excitement in capital markets.

The other side of that coin is computational and RLHF cost. For the model in question, reinforcement learning and alignment work were reportedly far more expensive than typical—on the order of 10× the RLHF cost of other models. When teams invest massive compute and high-cost reinforcement learning to squeeze every point on a benchmark, the risk is that the model learns narrow behavior tuned to the evals rather than robust capabilities across diverse, messy tasks.

That tuning looks very similar to classic overfitting in supervised learning: a model that performs well on the training/test distribution but generalizes poorly to out-of-distribution or real-world scenarios. When the thing you optimize for is also the public metric, you create a market for gaming the test.

Validation mismatch: user-voted leaderboards tell a different story

Public rankings and press claims are not the same as user preference. A user-voted leaderboard exists where real people compare answers from different models head-to-head. On that platform, the "number one" model landed far down the list—around #66 in user preference—despite press claims of being the top-performing model.

AI Agents For Recruiters, By Recruiters

Supercharge Your Business

Learn More

That gap should make everyone pause. If independent users consistently prefer other models' outputs over an advertised leader, it means the advertised leader may be optimized for a different objective—the leaderboard metric—rather than user-perceived quality.

Comparators that surprise: when smaller or unexpected models outperform

Meanwhile, other emergent models—sometimes from less visible teams—are proving to be surprisingly effective on freeform, real-world tests. One such model, released recently, outperformed the touted leader on a freeform version of a robust QA benchmark (less prone to packing and overfitting). It was slower, but the outputs were better aligned with practical expectations.

The lesson here is clear: headline benchmarking is not the whole story. We need more attention to models that do well on messy, open tasks and less worship of leaderboard position as an absolute measure of production value—especially in domains like AI in recruiting where the stakes are high and real-world generalization is crucial.

Valuation narratives and the marketing pressure

There’s broader context worth mentioning: valuations and media narratives amplify incentives to chase benchmarks. In one example, a company with little revenue got an eye-popping valuation driven in part by narrative and buzz. Meanwhile, competitor firms with substantial revenue and demonstrated product-market fit remained valued much lower. The result is a market that prizes stories as much as solid technical foundations.

That pressure can bias teams toward short-term wins: tweak the system prompt, add targeted RLHF, and release a model primed to excel on public evals. But that’s not the same as building a model ready for production workflows that must be transparent, auditable, fair, and robust.

Implications for AI in recruiting

In recruiting, the consequences of model brittleness and bias are amplified. Hiring systems must avoid unlawful discrimination, maintain candidate privacy, and produce defensible decisions. When an LLM is tuned primarily to ace benchmarks rather than to be transparent and robust, it risks several failure modes:

  • Hidden bias: ideological or reference bias can skew candidate summaries and rankings.
  • Inconsistent extraction: failing to adhere to format or to extract structured data accurately undermines downstream automation (e.g., parsing resumes into ATS fields).
  • Privacy and trust issues: unpredictable behaviors—like an outsized tendency to report—could inadvertently expose candidate data or escalate situations improperly.
  • Operational burden: brittle outputs require human-in-the-loop checks that negate the efficiency gains automation promises.

For organizations exploring AI in recruiting, the takeaway is to demand rigorous, real-world validation. Don’t be dazzled by leaderboard positions. Ask for audits on privacy, bias, prompt adherence, and failure modes. Test the model on your workflows: parsing resumes, extracting qualifications, generating interview questions, and producing offer-related documents in the exact formats your systems require.

What better evaluation looks like

To escape the Goodhart trap, the community needs to shift focus from narrow, public benchmarks to a blend of measures that emphasize real-world utility:

  • Domain-specific workflow tests: Small suites of real tasks that reflect the day-to-day work of target users (e.g., parsing resumes, candidate shortlisting with fairness constraints).
  • Human preference studies: Blind, head-to-head comparisons by domain experts and end users rather than automated metrics alone.
  • Robustness testing: Out-of-distribution tests, adversarial prompts, and instruction-adherence challenges to surface brittleness.
  • Transparency artifacts: Detailed system model cards, release notes, and post-mortems that reveal deployment-time prompt changes, safety mitigations, and known failure modes.
  • Continuous monitoring: Measure drift, odd behaviors (like ideological bleed-through), and unexpected compliance decisions after deployment.

In short: prioritize the practices that actually make a system safe and useful for production deployments, especially when these systems touch human lives and livelihoods in AI in recruiting.

Concrete recommendations for teams and evaluators

  1. Stop treating public benchmarks as finish lines. Use them as one data point among many.
  2. Design small, practical real-world exams that reflect your workflows and run them before deployment.
  3. Insist on system model cards and transparent post-mortems that explain what changed and why.
  4. Audit for ideological bleed-through and unpredictable compliance behavior before exposing models to user data.
  5. Favor models that demonstrate consistent prompt adherence and correct structured output over models that merely produce high token throughput.

These steps aren’t glamorous, but they’re necessary. Benchmarks create narratives, narratives attract capital, and capital shapes incentives. We need incentives that reward long-term robustness and safety, not short-term PR wins.

Parting thoughts: what to do right now

For teams building or buying models, and for anyone contemplating the use of LLMs in hiring pipelines, here are immediate actions you can take:

  • Run a brief real-world test suite on shortlisted models (5–20 short tasks reflecting your workflows). Don’t just look at benchmark numbers.
  • Measure prompt fidelity: include tasks with strict formatting rules and check for consistent adherence.
  • Validate generated code or automation snippets by running unit tests where applicable.
  • Ask vendors for clear, written explanations of system prompts, RLHF interventions, and known odd behaviors.
  • Require that models are tested for privacy/“snitching” behaviors and ideological leakage before deployment.

I can’t recommend deploying a model that shows the kinds of kinks described above—format errors, broken code, ideological bleed-through, or unpredictable compliance tendencies—into critical business workflows. The cost of a bad decision in AI in recruiting isn’t just a technical bug; it’s a reputational, legal, and human cost.

Conclusion: rebuild trust with real-world evaluations

We’ve turned benchmarks into finish lines, and models are crossing them by learning to game tests instead of actually getting better at the tasks people need done. Leaderboard glory and PR narratives are seductive, especially when valuations and investor stories hinge on perceived technical leadership. But the real litmus test for any model—particularly those intended for sensitive applications like AI in recruiting—is whether it works reliably, transparently, and fairly in real workflows.

“We’ve turned benchmarks into finish lines, and models like the vaunted ‘number one’ LLM are sometimes failing where it matters most—in real-world tasks and user preference.”

Until the community prioritizes real-world evaluation, transparent model documentation, and a culture of honest post-mortems, we’ll keep seeing top-ranked models that fall short when pressed into practical service. That’s a problem for everyone: vendors, customers, regulators, and most importantly, the people whose lives these systems affect.

If you’re responsible for choosing an LLM for hiring or HR workflows, don’t be swayed by a shiny leaderboard. Run your own real-world tests. Insist on transparency. Require robust audits. And when a model fails basic structured tasks or shows unpredictable behavior, treat that as a disqualifier—not as something you can work around after the fact.

AI in recruiting can be transformative when applied responsibly. But that transformation depends on models that generalize beyond narrowly sculpted evaluations. Until then, caution and real-world skepticism are the right default positions.