Finding all regex matches has always been O(n²)

lalitmaganti

Comments (44)

conartist6
@ievev Have you ever seen an implementation like @bablr/regex? https://github.com/bablr-lang/regex-vm It's an NFA system, so it isn't going to win any awards for throughput, but in this particular case it does seem to completely avoid the complexity blowup. It will run your heap out of memory on really big inputs, though.
The strategy this engine uses is just to evolve the state as a function of time. A match can be successfully completed yet not be emitted, because some other match could still supersede it by being longer or more leftmost.
I tried the pattern /\d+\s+/g on 10,000,000 digits followed by no space. It took 4 seconds to return no results. I tried it on 20,000,000 digits followed by no space. It took 8 seconds to return no results. I tried it on 100,000,000 and ran out of heap space.
Test setup: https://gist.github.com/conartist6/051838025af1e04d966e03aa9...
10000truths
Restricting regex features to guarantee time complexity works, but it requires sacrificing potentially useful features like backtracking (or in the article's case, constraining oneself to fixed-upper-bound-length needles).
In a real-world deployment where you want to run any arbitrary regex in an idiot/malice-proof manner, the best solution is the same solution you'd use for running any other kind of untrusted code - sandbox it! A good regex API should limit its execution time and memory consumption and return a timeout error in case those limits are exceeded. Ideally, those parameters would be configurable at the API level. Unfortunately, the only regex libraries I know of that get this right are .NET's standard library Regex API and the third-party regex package in Python.
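A minimal sketch of the sandboxing idea above, using only the Python standard library: the search runs in a child interpreter that is killed once a wall-clock budget is exceeded. The names `findall_with_timeout`, `_WORKER` and `timeout_s` are made up for this illustration and are not part of any library; memory limits are not covered here.

```python
import json
import subprocess
import sys

# Script executed in the child interpreter: it reads a JSON request on stdin
# and writes the match spans as JSON on stdout.
_WORKER = """
import json, re, sys
req = json.load(sys.stdin)
spans = [m.span() for m in re.finditer(req["pattern"], req["text"])]
json.dump(spans, sys.stdout)
"""


def findall_with_timeout(pattern, text, timeout_s=1.0):
    """Return all match spans, or raise TimeoutError if the budget is exceeded."""
    request = json.dumps({"pattern": pattern, "text": text})
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _WORKER],
            input=request,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
            check=True,         # an invalid pattern surfaces as CalledProcessError
        )
    except subprocess.TimeoutExpired:
        raise TimeoutError(f"regex search exceeded {timeout_s} s") from None
    return [tuple(span) for span in json.loads(proc.stdout)]
```

Starting an interpreter per search is expensive, so this only illustrates the shape of the API (limits as parameters, a timeout error as the result), not something you would put on a hot path.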
nine_k
> nearly everything that matters in practice: where the matches are, how long they are, and how many there are
I would say that regexes that matter in practice, e.g. when digging through logs, have clear boundaries that curb the pathological backtracking behavior. In particular, I find it difficult to imagine a practical need to find all matches of an expression like /.*a|b/, as shown in the article. Realistically you'd have to handle /\b.*a|b\b/, or similar, because when you need all matches, you don't want intersecting matches. This means you want to proceed past the end of the n-th match to look for the (n+1)-th match, and never want to use indeterminate prefixes like /.*a/.
This OTOH gives a reasonably useful heuristic if your regexp comes from an untrusted source and could be adversarial. Check that it does not start with a prefix with a Kleene star, like /a*/. Require at least one positive match (in each alternate branch). Of course, /a+b|c/ would still be quadratic if your text is long sequences of "a" interspersed with characters other than "b". But this, again, is more of a theoretical case, to my mind.
gpvos
I find it weird to have the Perl innovation (?:...) be called "traditional regex". Perl was rather innovative back then, even if it's more than 30 years ago now. Traditional regex is what came before it (grep -E being the most advanced form). I wonder what counts as nontraditional in the author's eyes.
adzm
Is there any reason that RE#'s two-pass approach couldn't be adopted by other regex engines?
Ah, there is a post with more detail about RE# and discussion here recently that I must have missed: https://news.ycombinator.com/item?id=47206647
ummonk
Great stuff.
I would argue that hardened mode should be the default though, similar to how SipHash is the default hashing function in Rust hash maps. Faster mode should be opt-in if the user is confident that the supplied data is nonmalicious and they need the speedup.
nitely
FWIW, nim-regex does achieve linear time in the rebar test[0], even if the regex includes capture groups. It's NFA based.
[0]: https://github.com/BurntSushi/rebar/pull/20#issuecomment-256...
babelfish
Cursor just wrote a great blog post on this - "Fast regex search: indexing text for agent tools" https://cursor.com/blog/fast-regex-search
uwais12
This is a great writeup. The fact that this has been hiding in plain sight for so long is wild. Most people assume regex engines are well-optimized at this point, but the overlapping match case is genuinely tricky to handle without quadratic behavior. The Thompson NFA approach mentioned at the end seems like the right direction but I can see why nobody's prioritized it when most real-world patterns don't hit this case.
thaumasiotes
> the problem we're talking about in this post (finding all longest matches without quadratic blowup)
Wait, what? I thought this was about finding all matches. With a minor tweak to the opening example:
We want to match `(.*a | b)` against `bbbbbabbbbb`.
I want to detect each `b` individually, and I also want to detect `bbbbba`, `bbbba`, `bbba`, `bba`, `ba`, and `a`. That's what it means to find all matches.
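To make the two readings concrete, here is a small brute-force sketch in Python. `all_matches` is a hypothetical helper written for this illustration (it tests every substring, so it is deliberately slow), and the pattern is written without the spaces used above.

```python
import re


def all_matches(pattern, text):
    """Every non-empty (start, end) span whose substring matches `pattern` in full."""
    compiled = re.compile(pattern)
    return [
        (start, end)
        for start in range(len(text))
        for end in range(start + 1, len(text) + 1)
        if compiled.fullmatch(text, start, end)
    ]


text = "bbbbbabbbbb"

# Reading 1: every matching substring, overlaps included.
overlapping = [text[s:e] for s, e in all_matches(r".*a|b", text)]

# Reading 2: what a typical find-all returns (leftmost, non-overlapping).
standard = [m.group() for m in re.finditer(r".*a|b", text)]

print(len(overlapping))  # 16: ten single b's plus the six substrings ending in a
print(standard)          # ['bbbbba', 'b', 'b', 'b', 'b', 'b']
```

The first reading can yield a quadratic number of results by itself, so no engine can report them all in linear time; the second is the one where the output is linear in size and the quadratic cost comes from the search.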
zahlman
> search a document for a pattern and it takes a second. search one a hundred times larger and it doesn't take a hundred seconds - it can take almost three hours.
Most of this is about quadratic-time find-all operations where a single search operation is linear. But it's also still possible to get quadratic behaviour out of a single search without catastrophic backtracking, more easily than you might expect. In late January to early February, Tim Peters was talking about an example of this on the Python forums (see e.g. https://discuss.python.org/t/add-re-prefixmatch-deprecate-re...) and also related the experience of trying to diagnose the issue with AI (see https://discuss.python.org/t/claude-code-how-much-hype-how-m... and onward). Peters' example was `\d+\s+` on a string containing only digits. A prefix match takes O(n) time, as it considers every possible end position for the digit run and immediately sees no following whitespace. But the search is quadratic because it has to repeat that O(n) work at every position; the regex engine can't track the fact that it's already examined the string and found no whitespace, so it re-tries each digit match length.
(This is arguably "backtracking" since it tries the longest match first, but clearly not in a catastrophic way; if you use `\d+?` instead then of course it only searches forward, but each attempt is still O(n). It is actually slower in my testing in the Python implementation; I don't exactly know why. As noted in the discussion, the possessive quantifier `\d++` is considerably faster, and of course doesn't backtrack, but still causes O(n^2) searching. The repeated attempts to match `\s+` aren't the problem; the problem is repeatedly looking for digits in places where digits were already found and rejected.)
The fix proposed in the discussion is to use a negative lookbehind assertion before the digits: `(?<!\d)\d+\s+`. This way, the regex engine can bail out early when it's in the middle of a digit string; if the previous character was a digit, then either `\d+\s+` doesn't match here, or it would have matched there.
A simpler idea is to just search for `\d\s+`, or even `\d\s`, since these will be present if and only if `\d+\s+` is. This way, though, you still need to do extra work with the partial match to identify the start and end of the full match. My first idea was to use positive lookbehind for the digits, since the lookbehind match doesn't need to backtrack. In fact, lookbehinds in Python's `re` require a fixed-length pattern, so this is really just a more complicated way to do the `\d\s+` simplification.
----
> Hyperscan (and its fork Vectorscan) is a true linear-time all-matches regex engine. it achieves this by using "earliest match" semantics - reporting a match the moment the DFA enters a match state, instead of continuing to find the longest one.
Is this not just equivalent to forcing "reluctant" quantifiers (`\d+?`) everywhere?
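The two workarounds described above can be sketched in Python as follows. This is an illustration, not code from the linked discussion: the function names are made up, and `re.ASCII` is an added assumption so that `\d` means exactly `[0-9]` and the manual walk-back agrees with the regex.

```python
import re

NAIVE = re.compile(r"\d+\s+", re.ASCII)           # quadratic search on long digit runs
GUARDED = re.compile(r"(?<!\d)\d+\s+", re.ASCII)  # only try at the start of a digit run
TAIL = re.compile(r"\d\s+", re.ASCII)             # last digit of the run plus whitespace


def spans_naive(text):
    return [m.span() for m in NAIVE.finditer(text)]


def spans_guarded(text):
    return [m.span() for m in GUARDED.finditer(text)]


def spans_from_tail(text):
    """Find `\\d\\s+`, then extend each hit left to the start of its digit run."""
    spans = []
    for m in TAIL.finditer(text):
        start = m.start()
        # The "extra work": walk back over the preceding ASCII digits.
        while start > 0 and text[start - 1] in "0123456789":
            start -= 1
        spans.append((start, m.end()))
    return spans


sample = "abc 123   def 45\t\n6 789"
print(spans_naive(sample))      # [(4, 10), (14, 18), (18, 20)]
print(spans_guarded(sample))    # same spans
print(spans_from_tail(sample))  # same spans
```

The walk-back stays linear overall because each hit only walks over its own digit run, and digit runs do not overlap.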