Faster RegEx Search for AI Agents in Large Codebases

If you’ve used an AI coding assistant on a large codebase, you’ve probably sat there watching a spinner while the agent “searches for relevant files.” That wait isn’t just annoying — it’s a fundamental technical problem. And solving it properly requires going back to 1973, the year grep was born, and rethinking how search actually works.

This post explains how modern AI agents search through code, why brute-force search breaks down at scale, and the indexing techniques that make instant code search possible. (Based on research and engineering work published by Cursor)

Problem: AI Agents Are Obsessed With grep

When an AI coding agent needs to find something in your codebase — a function definition, a variable name, a pattern — its go-to tool is regex search. Specifically, most agent frameworks call ripgrep (rg), a fast, modern alternative to the classic Unix grep command.

ripgrep is genuinely excellent. It’s faster than almost anything else at matching regular expressions against file contents. But it has one unavoidable flaw: it has to look at every single file.

On a small project with 200 files, that’s fine. On a large enterprise monorepo with 200,000 files — the kind of repo that senior engineers at big tech companies live inside — a single rg call can take 15+ seconds. And agents don’t just call it once. They call it repeatedly, in parallel, as they investigate bugs and trace dependencies. That latency compounds into minutes.

The solution isn’t to make ripgrep faster. The solution is to avoid scanning most files in the first place, using an index.

What Is an Index, Really?

Before diving into code-specific techniques, it’s worth understanding what an index actually is, because the concept powers everything from Google to your database.

An inverted index flips the relationship between documents and their content. Instead of asking “what words are in this document?”, it asks “which documents contain this word?”

Here’s the simplest version: take a set of documents, split each one into individual words (this is called tokenization), and build a lookup table where each word maps to a list of document IDs (called a posting list). When you search for two words, you look up both posting lists and find the document IDs that appear in both — that intersection is your result set.
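That simplest version fits in a few lines of Python. The documents and the whitespace tokenizer here are placeholders, just enough to show the posting-list intersection:

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dog",
}

# Build the inverted index: word -> set of document IDs (the posting list).
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():  # trivial whitespace tokenization
        index[word].add(doc_id)

def search(*words):
    """Return the IDs of documents containing every query word."""
    postings = [index[w] for w in words]
    return set.intersection(*postings) if postings else set()

print(search("quick", "brown"))  # → {1, 3}
```

Real engines store sorted, compressed posting lists and intersect them with merge-style algorithms, but the shape of the lookup is exactly this.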

This is how Google, Elasticsearch, and almost every search engine works at its core. The complexity is in how you tokenize, how you compress the posting lists, and how you rank results — but the foundation is always the inverted index.

Now here’s the challenge: that approach works great for natural language. But source code isn’t natural language. You can’t tokenize MAX_FILE_SIZE in any meaningful way that would help you match the regex MAX_\w+_SIZE. You need a different tokenization strategy.

Trigrams: The Right Way to Tokenize Code

The insight that makes code search possible is using trigrams as your tokens. A trigram is any sequence of three consecutive characters in a string. For the string "the cat", the trigrams are "the", "he ", "e c", " ca", and "cat" (the space counts as a character).

Every overlapping window of three characters becomes a token. This is the key idea that powers code search indexes.
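Extracting trigrams is nearly a one-liner. A minimal Python sketch:

```python
def trigrams(s):
    """All overlapping 3-character windows of s."""
    return [s[i:i + 3] for i in range(len(s) - 2)]

print(trigrams("the cat"))
# → ['the', 'he ', 'e c', ' ca', 'cat']
```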

Why three characters specifically?

  • Use 2-character tokens (bigrams): too few unique keys, posting lists become enormous and slow to search
  • Use 4-character tokens (quadgrams): too many unique keys, the index itself becomes unwieldy
  • Three characters hits the right balance — enough specificity to keep posting lists short, few enough unique combinations to keep the index manageable

When you index a file, you extract all its trigrams and add that file’s ID to each trigram’s posting list. When you want to search for a regex pattern, you decompose it into the trigrams it must contain, look up those posting lists, and intersect them. The files that appear in all the posting lists are your candidates — files that might match. You then run the actual regex only against those candidates, not the entire codebase.

The index doesn’t give you answers; it eliminates work. You still need to run the regex to confirm matches, but you’ve narrowed thousands of files down to dozens.
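Here’s a toy Python sketch of that whole pipeline, using a made-up three-file “codebase” and a hand-picked literal fragment of the pattern. (Real engines extract the required literals from the regex automatically; passing them in by hand keeps the sketch short.)

```python
import re
from collections import defaultdict

files = {
    "a.py": "MAX_FILE_SIZE = 1024",
    "b.py": "def resize(img): ...",
    "c.py": "MAX_RETRY_COUNT = 3",
}

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

# Index every file's trigrams into posting lists.
index = defaultdict(set)
for path, text in files.items():
    for t in trigrams(text):
        index[t].add(path)

def search(literal, pattern):
    """Prune with the literal's trigrams, then confirm with the real regex."""
    candidates = set.intersection(*(index[t] for t in trigrams(literal)))
    return [p for p in candidates if re.search(pattern, files[p])]

# A regex like MAX_\w+_SIZE must contain the literal "MAX_".
print(search("MAX_", r"MAX_\w+_SIZE"))  # → ['a.py']
```

Only the two candidate files containing both "MAX" and "AX_" ever see the regex engine; b.py is never opened.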

When Trigrams Aren’t Enough

The trigram inverted index works well and is used in real tools like google/codesearch and sourcegraph/zoekt. But it has limitations.

If your regex is something like [a-z]+ — a very broad pattern — it can’t be decomposed into specific trigrams. The index can’t help, and you’re back to scanning everything.

More practically: the more specific your trigrams, the better the index works. The more generic your query, the worse. And there’s a tradeoff at query time — use too few trigrams and you get too many candidates; use too many and you’re doing nearly as much work as not having an index.

Researchers and engineers have spent decades improving on the basic trigram approach.

Here are the two most interesting directions.


Approach 1: Trigrams With Bloom Filter Augmentation

This technique, developed as part of GitHub’s “Project Blackbird” research effort, asks: what if we stored a little extra information alongside each trigram posting?

Instead of just recording “file X contains trigram ABC”, what if we also recorded “and in file X, the character that follows ABC is usually one of: {D, E, F}”?

You can’t store the exact list of following characters without making the index too large. But you can store a bloom filter — a compact probabilistic data structure that encodes whether certain values are present. With just 8 bits, you can encode a hash of all the characters that follow a given trigram in a given document.

The payoff is that you’ve essentially turned a trigram index into something closer to a quadgram (4-character) index, but at a fraction of the storage cost. When you query, you can filter out documents where the trigram is present but the following character doesn’t match — eliminating false candidates before you ever touch the file.

A second bloom filter encodes the positions where each trigram appears. This lets you verify that two required trigrams actually appear next to each other in the document, not just independently somewhere in the file.
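A minimal Python sketch of the follower-filter idea, with Python’s built-in hash() standing in for a real hash function and a single toy file as the “index”:

```python
text = "MAX_FILE_SIZE and MAX_RETRY_COUNT"

# For each trigram in the file, an 8-bit bloom filter of the
# characters that follow it somewhere in the file.
followers = {}
for i in range(len(text) - 3):
    tri = text[i:i + 3]
    bit = 1 << (hash(text[i + 3]) % 8)  # hash() stands in for a real hash
    followers[tri] = followers.get(tri, 0) | bit

def maybe_followed_by(tri, c):
    """False means 'definitely not here'; True means 'possibly here'."""
    return bool(followers.get(tri, 0) & (1 << (hash(c) % 8)))

# For a pattern like MAX_..., the trigram "MAX" must be followed by '_'.
print(maybe_followed_by("MAX", "_"))  # → True
```

Like any bloom filter, this can return false positives (two characters hashing to the same bit) but never false negatives — which is exactly what a candidate-pruning step needs.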

However, bloom filters can get “saturated.” If you add too many entries over time, all the bits get set to 1, the filter starts returning true for everything, and your carefully designed index degrades back to brute-force performance. This makes in-place updates painful.

Approach 2: Sparse N-grams (The Smart Approach)

Used in ClickHouse and GitHub’s current code search, sparse n-grams are the most elegant solution.

The problem with trigrams is redundancy. In the string "function", the trigrams are fun, unc, nct, cti, tio, ion — and they overlap heavily. You’re storing a lot of information that doesn’t help you discriminate between documents.

Sparse n-grams solve this by extracting only a representative subset of substrings, chosen deterministically based on a weight function.

Here’s how it works:

  1. Assign a “weight” to every pair of adjacent characters in the document. The weight is computed by hashing the character pair — deterministically, so the same pair always gets the same weight.
  2. Extract substrings (n-grams) wherever the weights at both ends of the substring are strictly greater than all the weights inside it. These are natural “boundary” points in the string.
  3. The resulting n-grams vary in length — some are 2 characters, some are 5 — but they’re non-overlapping and cover the entire string.

When you query, you run the same algorithm on your search string and look up only the resulting sparse n-grams. Because the algorithm is deterministic, query-time n-grams will match index-time n-grams for any file that genuinely contains the pattern.
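Steps 1 and 2 can be sketched directly in Python. This is an illustrative toy, using CRC32 as the deterministic weight function and a hard cap on n-gram length; real implementations tune both differently:

```python
import zlib

def pair_weight(a, b):
    """Deterministic weight for a pair of adjacent characters."""
    return zlib.crc32((a + b).encode())

def sparse_ngrams(s, max_len=6):
    # Step 1: weight every adjacent character pair.
    w = [pair_weight(s[i], s[i + 1]) for i in range(len(s) - 1)]
    grams = []
    # Step 2: keep a substring if both of its boundary pair-weights
    # strictly dominate every pair-weight strictly inside it.
    for i in range(len(w)):
        for j in range(i + 1, len(w)):
            if j - i + 2 > max_len:
                break
            if all(w[i] > x and w[j] > x for x in w[i + 1:j]):
                grams.append(s[i:j + 2])
    return grams
```

The crucial property is determinism: running the same function over a query string yields n-grams that are guaranteed to exist in the index for any file containing that string.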

Optimization tip: instead of a random hash function for computing weights, use a frequency table built from analyzing terabytes of open-source code. Character pairs that are rare in real code get high weights; common pairs get low weights. This means your sparse n-grams naturally anchor to the most distinctive parts of any string — the unusual character combinations that appear in very few files. The result is dramatically fewer posting list lookups per query, and far fewer candidate files to scan.

Where Should the Index Live?

All of the approaches above can be deployed server-side or client-side. The choice matters more than it might seem.

Server-side is the obvious approach — centralize the index, query it over the network. But there are real problems:

  • You need to sync all the files to the server anyway, which has privacy and security implications
  • Network round-trips add latency to an operation the agent performs constantly and in parallel
  • The index needs to be extremely fresh — if an agent writes a file and then immediately searches for something it just wrote, a stale index breaks the workflow

Client-side (running the index on the developer’s machine) sidesteps all of this. The files are already there, no sync needed. Latency is essentially zero. Privacy is preserved.

The engineering challenge is keeping a local index fresh and memory-efficient. One elegant solution: anchor the index state to a Git commit. The base index reflects the last committed state of the repository. Uncommitted changes (both user edits and agent writes) are tracked as a separate layer on top. This makes startup fast and incremental updates cheap.

For memory efficiency, the index can be split into two files: one containing all the posting lists written sequentially to disk, and a second containing a sorted lookup table of n-gram hashes pointing to offsets in the first file. Only the lookup table needs to live in memory (via memory-mapped I/O). Queries work by binary searching the lookup table to find the right offset, then reading just the relevant posting list directly from disk.
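A toy in-memory Python sketch of that two-file layout, with bytearrays standing in for the on-disk files and made-up n-gram hashes. Only the small sorted table would need to stay resident; each query touches one posting list:

```python
import bisect
import struct

# Pretend posting lists: n-gram hash -> sorted document IDs.
postings = {0xBEEF: [1, 5, 9], 0x00AA: [2, 3], 0xCAFE: [7]}

# "File" 1: all posting lists written back to back.
# "File" 2: a table of (hash, offset, count), sorted by hash.
data = bytearray()
table = []
for h in sorted(postings):
    table.append((h, len(data), len(postings[h])))
    for doc_id in postings[h]:
        data += struct.pack("<I", doc_id)

def lookup(h):
    """Binary-search the table, then read one posting list from 'disk'."""
    i = bisect.bisect_left(table, (h,))
    if i == len(table) or table[i][0] != h:
        return []
    _, off, n = table[i]
    return list(struct.unpack_from(f"<{n}I", data, off))

print(lookup(0xBEEF))  # → [1, 5, 9]
```

In the real layout, `data` is a file read with positioned reads and `table` is memory-mapped, so memory usage stays proportional to the number of distinct n-grams, not the total index size.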

All of this might sound like low-level infrastructure detail, but it has a direct impact on how useful AI coding agents are in practice.

An agent investigating a bug in a large codebase might call search 10-20 times in a single session. If each call takes 15 seconds without an index, that’s 3-5 minutes of dead time per task. With a local sparse n-gram index, those same searches complete in under a second.

That difference isn’t just speed — it changes the interaction model. When search is instant, agents can explore more aggressively, verify hypotheses faster, and make better decisions with less wasted effort (and fewer wasted tokens).

The broader lesson: as AI agents become standard tools in software development, the infrastructure they rely on needs to be engineered with the same care as the models themselves. A fast model paired with slow tools is still a slow agent.

Key Takeaways

  • AI agents use regex search constantly. On large codebases, brute-force search (even with fast tools like ripgrep) becomes a bottleneck.
  • Inverted indexes are the foundational data structure for fast search. They map tokens to document IDs.
  • Trigrams (3-character substrings) are the standard tokenization strategy for code search indexes. They’re the right balance between specificity and index size.
  • Bloom filter augmentation can make a trigram index behave like a quadgram index at a fraction of the storage cost, but degrades over time with updates.
  • Sparse n-grams are the current state of the art: deterministic, non-overlapping substrings anchored to distinctive character pairs, giving high search specificity with minimal index lookups.
  • Local indexing beats server-side for agent tools: lower latency, better privacy, and easier freshness guarantees.
