Implements MySQL REGEXP_LIKE(expr, pattern [, match_type]) via a new variadic UDF. Introduces a shared regexp_compile() helper that translates MySQL match_type flags to PCRE modifiers and always uses UTF-8 mode, plus regexp_run() (suppresses preg_* warnings) and regexp_fail() (translates preg failures into MySQL-style messages). regexp_compile() rejects empty patterns to match MySQL ERROR 3685 and documents two known limitations of the emulation: collation-blind case-sensitivity defaulting and the always-on /u modifier diverging from the legacy REGEXP operator on binary data. The match_type loop accepts MySQL's c/i/m/n/u flags (last of the case flags wins; "u" — Unix-only line endings — is a no-op since PCRE's default already matches that semantics). Unknown flags raise "Invalid match_type flag: X.". Tests cover: data-driven match cases, NULL propagation, invalid patterns, multi-flag combinations, UTF-8 input errors via the PREG_BAD_UTF8_ERROR branch, the backtrack-limit branch, and that the legacy REGEXP operator still works alongside REGEXP_LIKE.
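The flag translation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `regexp_compile()`: the function name `sketch_regexp_compile`, the `$default_ci` parameter, the choice of `/` as delimiter, and the exception types are all assumptions made for the example.

```php
<?php
// Hypothetical sketch of a match_type -> PCRE modifier translation.
// Assumption: case-insensitive by default (the real default is collation-blind).
function sketch_regexp_compile(string $pattern, string $match_type = '', bool $default_ci = true): string
{
    if ($pattern === '') {
        // Mirrors the wording of MySQL ERROR 3685 for an empty pattern.
        throw new InvalidArgumentException('Illegal argument to a regular expression.');
    }

    $ci = $default_ci;
    $multiline = false;
    $dotall = false;
    for ($i = 0; $i < strlen($match_type); $i++) {
        switch ($match_type[$i]) {
            case 'c': $ci = false; break;       // case-sensitive; last case flag wins
            case 'i': $ci = true; break;        // case-insensitive; last case flag wins
            case 'm': $multiline = true; break; // ^ and $ match at line boundaries
            case 'n': $dotall = true; break;    // "." also matches line terminators
            case 'u': break;                    // Unix-only line endings: no-op under PCRE
            default:
                throw new InvalidArgumentException('Invalid match_type flag: ' . $match_type[$i] . '.');
        }
    }

    // Escape unescaped "/" so it cannot terminate the delimited pattern early.
    $escaped = '';
    $len = strlen($pattern);
    for ($i = 0; $i < $len; $i++) {
        $ch = $pattern[$i];
        if ($ch === '\\' && $i + 1 < $len) {
            $escaped .= $ch . $pattern[++$i]; // keep existing escape pairs intact
        } elseif ($ch === '/') {
            $escaped .= '\\/';
        } else {
            $escaped .= $ch;
        }
    }

    // UTF-8 mode ("u") is always on.
    $modifiers = 'u' . ($ci ? 'i' : '') . ($multiline ? 'm' : '') . ($dotall ? 's' : '');
    return '/' . $escaped . '/' . $modifiers;
}
```

With this sketch, `sketch_regexp_compile('a/b', 'ic')` yields a case-sensitive pattern because `c` comes last, and an unknown flag such as `x` raises the "Invalid match_type flag" error.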
Implements MySQL REGEXP_REPLACE(expr, pattern, replacement [, pos
[, occurrence [, match_type]]]) with three new private helpers:
- regexp_char_to_byte_offset() converts a 1-based character pos into a
byte offset, accepting char_count + 1 for the "start at end" case
that MySQL allows for REPLACE / SUBSTR.
- regexp_find_matches() walks the subject with preg_match and its
offset argument so that lookbehind assertions can see bytes before
pos. Skips UTF-8 continuation bytes after zero-width matches.
- regexp_expand_replacement() implements MySQL/ICU replacement
grammar: "$N" backreferences (with "$0" = full match and longest
valid digit-prefix wins), "\X" emits X literally, "${N}" is rejected
as invalid, and a trailing lone backslash is dropped. Errors mirror
MySQL's: "A capture group has an invalid name." (3887) and "Index
out of bounds in regular expression search." (3686).
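The replacement grammar listed above can be illustrated with a short sketch. The name `sketch_expand_replacement` and the `$groups` array layout are hypothetical. The mapping of errors to cases is an assumption: here any `$` not followed by a digit raises the invalid-name error, and a first digit that exceeds the group count raises the out-of-bounds error. The PR's real helper may split these differently.

```php
<?php
// Hypothetical sketch of MySQL/ICU-style replacement expansion.
// $groups[0] is the full match; $groups[N] is capture group N.
function sketch_expand_replacement(string $replacement, array $groups): string
{
    $out = '';
    $len = strlen($replacement);
    $max = count($groups) - 1; // highest valid group index

    for ($i = 0; $i < $len; $i++) {
        $ch = $replacement[$i];

        if ($ch === '\\') {
            // "\X" emits X literally; a trailing lone backslash is dropped.
            if ($i + 1 < $len) {
                $out .= $replacement[++$i];
            }
            continue;
        }
        if ($ch !== '$') {
            $out .= $ch;
            continue;
        }

        // "$" must be followed by a digit; "${N}" and a bare "$" are rejected.
        if ($i + 1 >= $len || !ctype_digit($replacement[$i + 1])) {
            throw new RuntimeException('A capture group has an invalid name.');
        }
        $n = (int) $replacement[++$i];
        if ($n > $max) {
            throw new RuntimeException('Index out of bounds in regular expression search.');
        }
        // Longest valid digit prefix wins: keep consuming digits only while
        // the resulting number still names an existing group.
        while ($i + 1 < $len && ctype_digit($replacement[$i + 1])) {
            $next = $n * 10 + (int) $replacement[$i + 1];
            if ($next > $max) {
                break;
            }
            $n = $next;
            $i++;
        }
        $out .= $groups[$n] ?? '';
    }
    return $out;
}
```

For a match with two capture groups, `$12` expands to group 1 followed by a literal `2`, since group 12 does not exist.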
REGEXP_REPLACE rebuilds the result by walking collected matches,
emitting the in-between bytes verbatim and substituting only the
targeted occurrence (or all when occurrence = 0). Negative occurrence
is clamped to 1 to match MySQL.
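A rough sketch of the match walk and the rebuild step follows, under simplifying assumptions: the function names are hypothetical, the position is already a byte offset, the replacement is inserted as a literal string (no backreference expansion), and a `preg_match` failure simply ends the walk instead of being reported the way `regexp_fail()` would.

```php
<?php
// Hypothetical sketch: collect [byte_start, text] pairs starting at $byte_offset.
// Passing the offset to preg_match (rather than slicing the subject) lets
// lookbehind assertions see the bytes before the starting position.
function sketch_find_matches(string $regex, string $subject, int $byte_offset): array
{
    $matches = [];
    $len = strlen($subject);
    while ($byte_offset <= $len) {
        $rc = @preg_match($regex, $subject, $m, PREG_OFFSET_CAPTURE, $byte_offset);
        if ($rc !== 1) {
            break; // no more matches (or an error, ignored in this sketch)
        }
        [$text, $start] = $m[0];
        $matches[] = [$start, $text];
        $end = $start + strlen($text);
        if ($text === '') {
            // Zero-width match: step over one whole character, skipping
            // UTF-8 continuation bytes (10xxxxxx), to avoid looping forever.
            $end++;
            while ($end < $len && (ord($subject[$end]) & 0xC0) === 0x80) {
                $end++;
            }
        }
        $byte_offset = $end;
    }
    return $matches;
}

// Hypothetical sketch of the rebuild: copy in-between bytes verbatim and
// substitute only the targeted occurrence (0 means all occurrences).
function sketch_regexp_replace(string $subject, string $regex, string $replacement, int $byte_pos = 0, int $occurrence = 0): string
{
    if ($occurrence < 0) {
        $occurrence = 1; // negative occurrence is clamped to 1
    }
    $out = '';
    $cursor = 0;
    foreach (sketch_find_matches($regex, $subject, $byte_pos) as $i => [$start, $text]) {
        $out .= substr($subject, $cursor, $start - $cursor);
        $out .= ($occurrence === 0 || $occurrence === $i + 1) ? $replacement : $text;
        $cursor = $start + strlen($text);
    }
    return $out . substr($subject, $cursor);
}
```

Calling `sketch_regexp_replace('abab', '/(?<=a)b/u', 'X', 1)` starts at byte 1, yet the lookbehind still sees the `a` at byte 0, so both `b` characters are replaced.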
Tests cover the data-driven happy path, NULL propagation, every
documented backreference form, lookbehind across pos, zero-width
matches, the pos = char_count + 1 edge, the negative-occurrence
clamp, and the ICU error branches.
Implements MySQL REGEXP_SUBSTR(expr, pattern [, pos [, occurrence [, match_type]]]). Returns the Nth matched substring at or after a given character position, or NULL if there is no match. Reuses regexp_compile() and regexp_fail() from REGEXP_LIKE, along with regexp_char_to_byte_offset() (with $allow_past_end so pos = char_count + 1 yields NULL) and regexp_find_matches() introduced with REGEXP_REPLACE. Negative or zero `occurrence` is clamped to 1, matching MySQL. Tests cover the data-driven happy path, NULL propagation, the occurrence clamp, the pos = char_count + 1 / pos > char_count + 1 boundary, multi-byte matches, invalid patterns, invalid flags, and a lookbehind whose context spans pos.
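The character-to-byte conversion with an "allow past end" switch might look like the sketch below. The function name, the exception type, and the error message used for an out-of-range `pos` are assumptions for illustration. The sketch also assumes the subject is valid UTF-8.

```php
<?php
// Hypothetical sketch: convert a 1-based character position into a byte offset.
// With $allow_past_end, pos = char_count + 1 maps to strlen($subject).
function sketch_char_to_byte_offset(string $subject, int $pos, bool $allow_past_end = false): int
{
    $len = strlen($subject);
    $chars_seen = 0;

    if ($pos >= 1) {
        for ($byte = 0; $byte < $len; $byte++) {
            // Continuation bytes (10xxxxxx) belong to the previous character.
            if ((ord($subject[$byte]) & 0xC0) === 0x80) {
                continue;
            }
            $chars_seen++;
            if ($chars_seen === $pos) {
                return $byte;
            }
        }
        // Here $chars_seen equals the total character count.
        if ($allow_past_end && $pos === $chars_seen + 1) {
            return $len;
        }
    }

    // Assumption: out-of-range positions reuse the out-of-bounds message.
    throw new OutOfRangeException('Index out of bounds in regular expression search.');
}
```

For the three-character subject `"a\u{1F600}b"` (six bytes), position 3 maps to byte 5, and position 4 maps to byte 6 only when `$allow_past_end` is true.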
Implements MySQL REGEXP_INSTR(expr, pattern [, pos [, occurrence [, return_option [, match_type]]]]). Returns the 1-based character position of the Nth match (or 0 if none), with return_option controlling whether to report the match start or the position one past its end. Adds the small regexp_byte_offset_to_char_index() helper that converts a byte offset returned by PCRE into a UTF-8 character index. `pos` greater than char_count is rejected even though SUBSTR / REPLACE allow char_count + 1, matching MySQL's stricter validation for INSTR. Negative or zero `occurrence` is clamped to 1, also matching MySQL. return_option must be 0 (start) or 1 (one past end); anything else raises "Incorrect arguments to regexp_instr: return_option must be 1 or 0." (matching MySQL's wording). The check runs before the occurrence clamp so the message is consistent. Tests cover the data-driven happy path, NULL propagation, the occurrence clamp, the straddling-match boundary, multi-byte return_option=1, the return_option validation (including under otherwise no-op occurrences), and a lookbehind whose context spans pos.
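The byte-offset-to-character-index conversion and the return_option handling can be sketched as below. Both function names are hypothetical, and the sketch only handles the first match from the start of the subject. It omits `pos`, `occurrence`, and `match_type`.

```php
<?php
// Hypothetical sketch: turn a PCRE byte offset into a 1-based character index
// by counting the non-continuation bytes that precede it.
function sketch_byte_offset_to_char_index(string $subject, int $byte_offset): int
{
    $chars = 0;
    $limit = min($byte_offset, strlen($subject));
    for ($i = 0; $i < $limit; $i++) {
        if ((ord($subject[$i]) & 0xC0) !== 0x80) {
            $chars++;
        }
    }
    return $chars + 1;
}

// Hypothetical, simplified REGEXP_INSTR: first match only, from the start.
function sketch_regexp_instr(string $subject, string $regex, int $return_option = 0): int
{
    if ($return_option !== 0 && $return_option !== 1) {
        throw new InvalidArgumentException(
            'Incorrect arguments to regexp_instr: return_option must be 1 or 0.'
        );
    }
    if (preg_match($regex, $subject, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return 0; // no match
    }
    $byte = $m[0][1];
    if ($return_option === 1) {
        $byte += strlen($m[0][0]); // one past the end of the match
    }
    return sketch_byte_offset_to_char_index($subject, $byte);
}
```

Because only lead bytes are counted, a 4-byte astral-plane character such as U+1F600 advances the character index by one, not four.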
Adds a final layer of tests that exercise behaviors which involve more than one of REGEXP_LIKE / _REPLACE / _SUBSTR / _INSTR at once and only become testable once all four functions are available:
- Empty pattern raises "Illegal argument to a regular expression." uniformly (MySQL ERROR 3685).
- Empty subject with a zero-width-matching pattern still produces a match (LIKE = 1, SUBSTR = "", INSTR = 1).
- Zero-width anchors ^ / $ report sensible 1-based positions and an empty-string match for SUBSTR rather than NULL.
- Astral-plane (4-byte UTF-8) characters are counted as one code point by both SUBSTR and INSTR.
- Negative pos is rejected consistently across REPLACE / SUBSTR / INSTR.
Summary
Implements the four MySQL 8.0 regular-expression functions on top of SQLite by registering PHP UDFs that translate MySQL/ICU semantics onto PCRE.
- REGEXP_LIKE: built on a shared `regexp_compile()` that maps MySQL match-type flags (c/i/m/n/u) onto PCRE modifiers with UTF-8 mode always on, plus `regexp_run()` / `regexp_fail()` that distinguish compile errors from backtrack-limit and invalid-UTF-8 failures.
- REGEXP_REPLACE: walks the subject using `preg_match` with its offset argument so lookbehind assertions retain context across `pos`. Implements MySQL/ICU replacement grammar: `$N` backreferences (with `$0` = full match and longest-valid-prefix wins), `\X` drops the backslash, `${N}` is rejected, trailing `\` is dropped. `occurrence = 0` replaces all; negatives clamp to 1.
- REGEXP_SUBSTR: returns the Nth match at or after `pos`. Accepts `pos = char_count + 1` (returns NULL); clamps `occurrence <= 0` to 1.
- REGEXP_INSTR: requires `return_option` to be strictly 0 or 1; rejects `pos > char_count` even where SUBSTR/REPLACE allow it.

MySQL fidelity
Behavior was verified against MySQL 8.0.45 in Docker across ~100 SQL cases. Error texts mirror MySQL's ICU wording where practical:
- `Illegal argument to a regular expression.` (empty pattern, ERROR 3685)
- `A capture group has an invalid name.` (invalid `$` in replacement, ERROR 3887)
- `Index out of bounds in regular expression search.` (ERROR 3686)
- `Incorrect arguments to regexp_instr: return_option must be 1 or 0.`

Known limitations
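Documented in `regexp_compile()`:
- Case-sensitivity defaulting is collation-blind.
- The always-on `/u` modifier diverges from the legacy REGEXP operator on binary data.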
Test plan
Addresses #47.