Perplexity just moved the search bottleneck into the model's hands

"Search as Code" reads like an efficiency story. It's really an argument about who controls the filtering logic between you and a citation.

Jun 08, 2026

The detail that stuck with me is the CVE test. Perplexity pointed an agent at 200 critical software vulnerabilities published between 2023 and 2025 and told it to find, for each one, the official vendor advisory plus the exact version that patched the bug. No blog posts, no news rewrites. The agent wrote a three-stage Python script that ran searches tailored to how Mozilla and Google actually format their security bulletins, found its own gaps, then verified each result against a schema. The Decoder reports it finished the task “using 85 percent fewer tokens than its standard pipeline,” while competing systems got less than a quarter of the data right.

Take the benchmark numbers with the usual skepticism, since they’re self-reported and one of them runs on Perplexity’s own unreleased “WANDR” set. The architecture is the part worth your attention anyway.

What actually changed

The old loop is familiar to anyone who’s watched an agent grind through research. The model writes a query, the search API hands back a list of links, the model reads them and writes the next query, and the cycle repeats. The search engine is a black box. The agent can only change the words it types in.

Search as Code breaks the engine into parts the model can call directly: retrieve, filter, deduplicate, rerank, each one a function in an SDK that runs inside a sandbox (an isolated environment where generated code executes without touching anything it shouldn’t). Now the agent isn’t asking a search engine a question. It’s assembling its own.
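To make that concrete, here is a toy sketch of what "assembling its own" could look like. None of this is Perplexity's SDK. The function names, the `Hit` type, the in-memory corpus, and the `.example` URLs are all my own stand-ins, and the scoring is plain term overlap rather than anything a real engine would use.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Hit:
    url: str
    title: str
    text: str


# Hypothetical stand-in corpus; a real backend would be a search index.
CORPUS = [
    Hit("https://vendor.example/advisories/a", "Security advisory A", "Fixed in version 1.2.3"),
    Hit("https://vendor.example/advisories/a", "Security advisory A", "Fixed in version 1.2.3"),
    Hit("https://news.example/browser-bug", "Browser bug explained", "A rewrite of the advisory"),
    Hit("https://vendor.example/advisories/b", "Security advisory B", "Workaround only"),
]


def _overlap(query: str, hit: Hit) -> int:
    """Count query terms that also appear in the hit's title or text."""
    terms = set(query.lower().split())
    return len(terms & set(f"{hit.title} {hit.text}".lower().split()))


def retrieve(query: str, corpus: list[Hit]) -> list[Hit]:
    """Keep every hit that shares at least one term with the query."""
    return [h for h in corpus if _overlap(query, h) > 0]


def filter_by_host(hits: list[Hit], allowed_hosts: set[str]) -> list[Hit]:
    """Drop hits whose hostname is not on the allowlist."""
    return [h for h in hits if urlparse(h.url).hostname in allowed_hosts]


def deduplicate(hits: list[Hit]) -> list[Hit]:
    """Keep the first hit per URL, preserving order."""
    seen: set[str] = set()
    unique = []
    for h in hits:
        if h.url not in seen:
            seen.add(h.url)
            unique.append(h)
    return unique


def rerank(hits: list[Hit], query: str) -> list[Hit]:
    """Sort by term overlap, highest first."""
    return sorted(hits, key=lambda h: _overlap(query, h), reverse=True)


def run_pipeline(query: str, corpus: list[Hit], allowed_hosts: set[str]) -> list[Hit]:
    """The agent-authored part: it chooses the stages and their order."""
    hits = retrieve(query, corpus)
    hits = filter_by_host(hits, allowed_hosts)
    hits = deduplicate(hits)
    return rerank(hits, query)


results = run_pipeline("advisory fixed version", CORPUS, {"vendor.example"})
```

In this toy run, retrieval matches all four entries, and only the two distinct vendor pages are left at the end. The point is the shape: the filtering rules sit in code the agent wrote, so it can tighten them when the results look wrong.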

The reason this matters has less to do with speed and more to do with context, the working memory a model carries through a task. Standard search stuffs that memory with whatever the engine decided to return, junk included, because the filtering logic lives on the engine’s side and you can’t reach it. When the agent writes its own filters, it only pulls in hits that survive its own rules. The context stays lean. On a long research session, a model that isn’t drowning in irrelevant text keeps its bearings far longer.

The cheating problem this might fix

Here’s the angle I keep coming back to. The Decoder mentions a study showing popular search agents often don’t really search. They pull the answer from training data and use a query to confirm what they already “knew.” On a benchmark of fresh facts the model couldn’t have memorized, “every single system saw its score plunge by 25 to 40 points.”

Those systems all used standard search tools. A black box rewards faking it, because confirming a memorized answer is cheaper than honestly digging. When the agent has to write code that retrieves, filters, and schema-checks against live vendor advisories, faking it gets harder. The CVE test is structured exactly so memorization fails: you can’t guess the precise patch version, you have to go find the bulletin.
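A rough sketch of the schema-check and gap-finding idea, under heavy assumptions: the `Finding` fields, the allowlist of vendor hosts, and the version pattern are my guesses at what such a check might contain, not the script Perplexity's agent wrote. The CVE IDs are made-up placeholders.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse

# CVE IDs take the form CVE-<year>-<sequence of four or more digits>.
CVE_ID = re.compile(r"CVE-\d{4}-\d{4,}")
# Assumed shape for a patch version: dotted numbers such as 1.2.3.
VERSION = re.compile(r"\d+(\.\d+)+")


@dataclass
class Finding:
    cve_id: str
    advisory_url: str
    fixed_version: str


def schema_errors(finding: Finding, vendor_hosts: set[str]) -> list[str]:
    """Return a list of schema problems; an empty list means the record passes."""
    errors = []
    if not CVE_ID.fullmatch(finding.cve_id):
        errors.append("cve_id is not in CVE-YYYY-NNNN form")
    if urlparse(finding.advisory_url).hostname not in vendor_hosts:
        errors.append("advisory_url is not on an allowed vendor host")
    if not VERSION.fullmatch(finding.fixed_version):
        errors.append("fixed_version is not a dotted version number")
    return errors


def find_gaps(wanted: list[str], findings: list[Finding], vendor_hosts: set[str]) -> list[str]:
    """List the wanted CVE IDs that still lack a schema-valid finding."""
    valid = {f.cve_id for f in findings if not schema_errors(f, vendor_hosts)}
    return [cve for cve in wanted if cve not in valid]


# Placeholder data for illustration only; these are not real records.
hosts = {"vendor.example"}
wanted = ["CVE-2099-0001", "CVE-2099-0002", "CVE-2099-0003"]
findings = [
    Finding("CVE-2099-0001", "https://vendor.example/advisories/a", "1.2.3"),
    Finding("CVE-2099-0002", "https://news.example/story", "1.2.4"),
]
gaps = find_gaps(wanted, findings, hosts)
```

Here `gaps` holds the second and third IDs: one finding points at a news site instead of a vendor host, and one CVE has no finding at all. A check like this only confirms that a record has the right shape and source. It can't confirm the version number is true, which is why the retrieval step still has to reach the actual bulletin.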

I’m not fully sold that code-as-the-interface is the durable shape here. It may just be one good idea Perplexity shipped before the other labs build their own versions. The survey paper it cites argues writing code is becoming “the default way agents interact with the world,” and that’s a big claim resting on early evidence.

What I can’t answer: if every serious agent ends up writing its own retrieval pipeline against a raw search backend, does the polished consumer answer engine, the blue-links-for-humans product, become a legacy interface nobody’s agents bother to call?

Victor Racariu
