Skip to content

Conversation

@benhoyt
Copy link
Owner

@benhoyt benhoyt commented Dec 1, 2025

This is not for merging, but just for testing the new https://github.com/coregx/coregex library and seeing what tests fail.

Things I've found at a guick glance:

  • A bunch of regex-related tests fail. I haven't dug into the details yet.

The following issues are fixed in coregex v0.8.2:

  • coregex has no Regex.Longest(), which GoAWK uses
  • coregex has no QuoteMeta function (so I'm just using the regexp package's one, which seems fine for that)
  • Probably the worst issue is that some regexes seem to hang indefinitely and peg the CPU to 100% when compiling. For example, at first GoAWK didn't start up, because I have an init-time regexp.Compile which hangs when using coregex. The original was regex was in interp.go, ^([_a-zA-Z][_a-zA-Z0-9]*)=(.*) -- it seems to be the last .* part that's causing it to hang.

Other test failures should be able to be seen in the GitHub Actions output, for example.

@benhoyt
Copy link
Owner Author

benhoyt commented Dec 3, 2025

Some of the additional test failures are the following:

  • This regex should work: invalid regex "[^,]": meta: failed to compile: NFA compilation failed: character class too large (>256 characters). Similarly, this regex should work: invalid regex "[^a-z]": meta: failed to compile: NFA compilation failed: character class too large (>256 characters)
  • BEGIN { print "food"~/[oO]+d/ } prints 0 (no match) when it should print 1 (match), because "food" should MatchString the regex /[oO]+d/
  • It seems like the "s" flag (. matches newline \n) is not working, because one of my tests is that (?s:^a.*c$) should match a\nb\nc, and it doesn't

That's a start, and all I have time for today. :-)

@benhoyt
Copy link
Owner Author

benhoyt commented Dec 4, 2025

A few more:

  • echo -e '1\n22\n333' | go run . '/^.$/' prints 1\n22\n333\n instead of just 1\n. In other words, /^.$/ using MatchString is matching a 1-or-more character string, not just a 1-character string.
  • Perhaps similarly, echo -e 'a68' | go run . '/^[a-z]+$/' prints a68 when it should match nothing (alphabetic chars only). This and the above indicate that ^ might not be working.

@kolkov
Copy link

kolkov commented Dec 5, 2025

@benhoyt Word boundary assertions (\b and \B) are now fixed in coregex v0.8.5.

The issue was that FindString/Find returned empty while MatchString worked correctly. Root cause: missing word boundary checks in the DFA prefilter path.

Release: https://github.com/coregx/coregex/releases/tag/v0.8.5

@kolkov
Copy link

kolkov commented Dec 7, 2025

coregex v0.8.6 released 🎉

Fixed Issue #14: ^ anchor now works correctly in FindAllIndex/ReplaceAll.

What was broken: FindAllIndex sliced the input b[pos:] at each iteration, making the engine think each position was the start of input. So ^ matched at every position instead of just position 0.

The fix: Added FindAt(haystack, at) methods throughout the engine stack (PikeVM, DFA, Meta-engine). All FindAll* and ReplaceAll* functions now preserve absolute positions.

Release: https://github.com/coregx/coregex/releases/tag/v0.8.6

go get github.com/coregx/coregex@v0.8.6

This should fix the CI failures like:

  • ReplaceAllString("(^)", "12345", "x") = "x1x2x3x4x5" want "x12345"

Looking forward to the test results!

@kolkov
Copy link

kolkov commented Dec 7, 2025

Update: coregex v0.8.7 released

Fixed in v0.8.7:

  • Error message format now matches stdlib exactly
    • Was: regexp: error parsing regexp: error parsing regexp: ...
    • Now: error parsing regexp: ...

This fixes the back89, funstack, regexpbrack test failures.

Still investigating:

go get github.com/coregx/coregex@v0.8.7

Release: https://github.com/coregx/coregex/releases/tag/v0.8.7

@kolkov
Copy link

kolkov commented Dec 7, 2025

coregex v0.8.8 released 🎉

Fixed: Issue #15 - DFA.IsMatch with capture groups

Root cause: epsilonClosure in DFA builder didn't follow StateCapture transitions.

This fixes the datanonl test failure (empty output) and all other patterns with parentheses.

go get github.com/coregx/coregex@v0.8.8

Summary of fixes in v0.8.6-v0.8.8:

  • ✅ v0.8.6: ^ anchor in FindAllIndex/ReplaceAll
  • ✅ v0.8.7: Error message format (matches stdlib exactly)
  • ✅ v0.8.8: DFA.IsMatch with capture groups

Release: https://github.com/coregx/coregex/releases/tag/v0.8.8

@kolkov
Copy link

kolkov commented Dec 7, 2025

Hi @benhoyt! 👋

I've released v0.8.10 with inline flags support (fixes #8):

  • (?s:...) now correctly makes . match newlines within the scoped group
  • (?i:...) and combined (?is:...) also work

For AWK compatibility, since AWK expects . to match newlines by default, you could:

  1. Wrap user patterns with (?s:pattern) before compiling, or
  2. Use the global DotNL option if/when we add one to the Compile API

The gsub-related failures (gsubtest, gsubtst3, etc.) appear to be related to a separate issue with empty alternation branches (#16) - I'm investigating that next.

Let me know if you'd like to test v0.8.10!

@benhoyt
Copy link
Owner Author

benhoyt commented Dec 7, 2025

Yes, I do already wrap all user patterns with (?s:pattern).

I've updated the branch to v0.8.10. Getting very close now. Just 3 remaining tests that fail (I haven't yet looked into why):

$ go test -count=1 ./...
--- FAIL: TestAWK (0.43s)
    --- FAIL: TestAWK/t.gsub1 (0.00s)
        goawk_test.go:148: output differs, run: git diff testdata/output/t.gsub1
    --- FAIL: TestAWK/t.monotone (0.00s)
        goawk_test.go:148: output differs, run: git diff testdata/output/t.monotone
FAIL
FAIL	github.com/benhoyt/goawk	0.843s
--- FAIL: TestInterp (1.25s)
    --- FAIL: TestInterp/BEGIN_{_s_=_"#!a";_sub(/(#|#!)/,_"",_s);_print_s_} (0.00s)
        interp_test.go:1244: expected/got:
            "a\n"
            "!a\n"
FAIL
FAIL	github.com/benhoyt/goawk/interp	1.312s

@kolkov
Copy link

kolkov commented Dec 7, 2025

Thanks for testing v0.8.10!

Regarding the (#|#!) pattern - this is AWK vs Go regex semantics:

// Both stdlib and coregex return the same result:
regexp.MustCompile(`(#|#!)`).ReplaceAllString("#!a", "")  // "!a"
coregex.MustCompile(`(#|#!)`).ReplaceAllString("#!a", "") // "!a"
  • Go regex: leftmost-first semantics - first branch # wins
  • AWK regex: leftmost-longest semantics - #! should win as longer match

For AWK compatibility, you'd need to reorder alternations by length (longest first):

// Instead of (a|ab) use (ab|a)

This is beyond what coregex can fix since it's designed to match Go stdlib behavior exactly.

I'll look into the other 2 failing tests (t.gsub1, t.monotone) to see if they're similar semantic differences or actual bugs.

@kolkov
Copy link

kolkov commented Dec 7, 2025

Update: Checked Rust regex behavior

Rust regex also uses LeftmostFirst as default and doesn't implement LeftmostLongest:

// regex-automata/src/util/search.rs
pub enum MatchKind {
    All,
    #[default]
    LeftmostFirst,  // ← default
    // LeftmostLongest - NOT IMPLEMENTED
}

From Rust regex source comments:

"leftmost-longest can be emulated when using literals by sorting the literals by length in descending order. However, this won't work for arbitrary regexes. e.g., \w|\w\w will always match a in ab when using leftmost-first, but leftmost-longest would match ab."

Summary of modern regex engines:

Engine Default Semantics
Go stdlib LeftmostFirst
Rust regex LeftmostFirst
RE2 LeftmostFirst
PCRE LeftmostFirst
AWK/POSIX LeftmostLongest

So this is standard behavior for PCRE-style engines. AWK's POSIX semantics are the exception.

For GoAWK, possible solutions:

  1. Sort literal alternations by length (longest first) during pattern preprocessing
  2. Parse regex AST and reorder alternation branches
  3. Document as known limitation for complex alternations

@kolkov
Copy link

kolkov commented Dec 7, 2025

v0.8.11 released with additional bug fixes:

  • Fixed hexadecimal value escapes #24: a$ on "ab" returned true on first call, false on subsequent calls (lazy DFA caching bug)
  • Fixed empty string handling for end-anchored patterns
  • Fixed strategy selection for patterns with start anchors (^)

The t.monotone pattern (^a?$|^b?$) should work correctly now - it was incorrectly using ReverseAnchored strategy which doesn't work for patterns containing ^.

Would be great to re-run tests with v0.8.11 to see if any regex-related failures are resolved!

go get github.com/coregx/coregex@v0.8.11

@benhoyt
Copy link
Owner Author

benhoyt commented Dec 7, 2025

Just one failing test left:

--- FAIL: TestInterp (1.25s)
    --- FAIL: TestInterp/BEGIN_{_s_=_"#!a";_sub(/(#|#!)/,_"",_s);_print_s_} (0.00s)
        interp_test.go:1244: expected/got:
            "a\n"
            "!a\n"

This is a leftmost-longest test. But the coregex docs indicate it should be leftmost-longest by default? https://pkg.go.dev/github.com/coregx/coregex#Regex.Longest

@kolkov
Copy link

kolkov commented Dec 8, 2025

Great news - only one failing test left! 🎉

It's been really productive working through these issues together. Considering the scope of coregex's functionality, the number of bugs we found was surprisingly small, which is encouraging!

Regarding the Longest() / leftmost-longest issue:

You're right that our documentation was misleading - I've investigated this thoroughly. Here's what I found:

Engine Default Longest() support
Go stdlib leftmost-first ✅ Yes
Rust regex leftmost-first ❌ Not implemented
RE2 leftmost-first ❌ Not implemented
coregex leftmost-first ❌ Stub (no-op)

Neither Rust's regex crate nor Google's RE2 implement leftmost-longest semantics - they only support leftmost-first. However, for true stdlib compatibility, we should support it.

I've created issue #26 to track this. We're now conducting deep research into:

  1. Performance implications (will default mode be affected?)
  2. Implementation approach (PikeVM modifications needed)
  3. Benchmark comparisons before/after

We'll implement this in a separate branch (feat/longest-support) and run comprehensive benchmarks to ensure no regression in the default (leftmost-first) mode. I'll update you once we have results!

For now, the (#|#!) pattern workaround is to reorder alternatives by length: (#!|#) - longest first.

@kolkov
Copy link

kolkov commented Dec 8, 2025

@benhoyt v0.8.12 released with Longest() fix: https://github.com/coregx/coregex/releases/tag/v0.8.12

The Longest() method now correctly implements leftmost-longest (POSIX) semantics. Please update the dependency and test again. This should resolve the AWK compatibility issue you reported.

@benhoyt
Copy link
Owner Author

benhoyt commented Dec 8, 2025

Thanks. All the GoAWK tests pass now!

There's one final issue in case you care -- coregex doesn't compile on 32-bit systems (Linux i386). Here's the error:

go: downloading github.com/coregx/coregex v0.8.12
# github.com/coregx/coregex/internal/conv
Error: ../../../go/pkg/mod/github.com/coregx/coregex@v0.8.12/internal/conv/conv.go:15:18: math.MaxUint32 (untyped int constant 4294967295) overflows int

By the way, do you run coregex against the full suite of tests from the standard library's regex package? That would be worth doing -- I think it would have caught several of the issues here.

@kolkov
Copy link

kolkov commented Dec 8, 2025

v0.8.13 released - fixed the 386 build issue: https://github.com/coregx/coregex/releases/tag/v0.8.13

Also added CI job for 386 to catch this in future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants