Add MySQL REGEXP_LIKE(), REGEXP_REPLACE(), REGEXP_SUBSTR(), REGEXP_INSTR() UDFs by JanJakes · Pull Request #368 · WordPress/sqlite-database-integration

JanJakes · 2026-04-21T15:31:26Z

Summary

Implements the four MySQL 8.0 regular-expression functions on top of SQLite by registering PHP UDFs that translate MySQL/ICU semantics onto PCRE.

REGEXP_LIKE(expr, pattern [, match_type]) — boolean matching. Introduces a shared regexp_compile() that maps MySQL match-type flags (c/i/m/n/u) onto PCRE modifiers with UTF-8 mode always on, plus regexp_run() / regexp_fail() that distinguish compile errors from backtrack-limit and invalid-UTF-8 failures.
REGEXP_REPLACE(expr, pattern, replacement [, pos [, occurrence [, match_type]]]) — uses a walker that calls preg_match with its offset argument so lookbehind assertions retain context across pos. Implements MySQL/ICU replacement grammar: $N backreferences (with $0 = full match and longest-valid-prefix wins), \X drops the backslash, ${N} is rejected, trailing \ is dropped. occurrence = 0 replaces all; negatives clamp to 1.
REGEXP_SUBSTR(expr, pattern [, pos [, occurrence [, match_type]]]) — returns the Nth match from pos. Accepts pos = char_count + 1 (returns NULL); clamps occurrence <= 0 to 1.
REGEXP_INSTR(expr, pattern [, pos [, occurrence [, return_option [, match_type]]]]) — returns a 1-based character position. return_option strictly 0 or 1; rejects pos > char_count even where SUBSTR/REPLACE allow it.
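The `pos` rules in the list above differ between functions, so a small illustration may help. The following is a minimal Python sketch, not the PR's PHP code: the function name `char_to_byte_offset`, the `allow_past_end` parameter spelling, and the choice of the "Index out of bounds" message for an out-of-range `pos` are assumptions made for illustration. It does not model how the PR treats an empty subject.

```python
def char_to_byte_offset(subject: bytes, pos: int, allow_past_end: bool) -> int:
    """Convert a 1-based character position into a byte offset (hypothetical helper).

    allow_past_end=True models the SUBSTR/REPLACE rule (pos may be char_count + 1);
    allow_past_end=False models the stricter INSTR rule (pos must be <= char_count).
    """
    text = subject.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    char_count = len(text)          # counts code points, not bytes
    limit = char_count + 1 if allow_past_end else char_count
    if pos < 1 or pos > limit:
        # Assumption: out-of-range positions map to this MySQL-style message.
        raise ValueError("Index out of bounds in regular expression search.")
    # Byte length of the first (pos - 1) characters is the byte offset of pos.
    return len(text[: pos - 1].encode("utf-8"))
```

With the subject `héllo` (5 characters, 6 bytes), `pos = 6` is accepted only when `allow_past_end` is true, which mirrors the SUBSTR/REPLACE versus INSTR difference described above.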

MySQL fidelity

Behavior was verified against MySQL 8.0.45 in Docker across ~100 SQL cases. Error texts mirror MySQL's ICU wording where practical:

Illegal argument to a regular expression. (empty pattern, ERROR 3685)
A capture group has an invalid name. (invalid `$` in replacement, ERROR 3887)
Index out of bounds in regular expression search. (ERROR 3686)
Incorrect arguments to regexp_instr: return_option must be 1 or 0.

Known limitations

Documented in `regexp_compile()`:

The UDF has no access to the session collation, so case-sensitivity defaults to insensitive (correct for `utf8mb4_0900_ai_ci`, wrong for `_bin`/`_cs`; callers on those collations should pass an explicit `c` match_type).
The `/u` PCRE modifier is always applied. Binary data with invalid UTF-8 that would match under the legacy `REGEXP` operator raises `Invalid UTF-8 data in regular expression input.` when routed through the new UDFs.

Test plan

`cd packages/mysql-on-sqlite && composer run test` passes (756 tests on my machine)
`composer run check-cs` clean on both changed files
Review: four new public methods at `packages/mysql-on-sqlite/src/sqlite/class-wp-sqlite-pdo-user-defined-functions.php` and seven new private helpers
Spot-check a few queries against MySQL 8 for any corner cases you care about

Addresses #47.

Implements MySQL REGEXP_LIKE(expr, pattern [, match_type]) via a new variadic UDF. Introduces a shared regexp_compile() helper that translates MySQL match_type flags to PCRE modifiers and always uses UTF-8 mode, plus regexp_run() (suppresses preg_* warnings) and regexp_fail() (translates preg failures into MySQL-style messages).

regexp_compile() rejects empty patterns to match MySQL ERROR 3685 and documents two known limitations of the emulation: collation-blind case-sensitivity defaulting and the always-on /u modifier diverging from the legacy REGEXP operator on binary data.

The match_type loop accepts MySQL's c/i/m/n/u flags. The last of the case flags wins, and "u" (Unix-only line endings) is a no-op since PCRE's default already matches that semantics. Unknown flags raise "Invalid match_type flag: X.".

Tests cover: data-driven match cases, NULL propagation, invalid patterns, multi-flag combinations, UTF-8 input errors via the PREG_BAD_UTF8_ERROR branch, the backtrack-limit branch, and that the legacy REGEXP operator still works alongside REGEXP_LIKE.
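The flag handling described above can be illustrated with a short Python sketch. This is not the PR's regexp_compile(): the name `compile_match_type` is hypothetical, and mapping `m` to `re.MULTILINE` and `n` to `re.DOTALL` is an assumed Python analogue of the PCRE modifiers, based on MySQL's documented meaning of those flags.

```python
import re


def compile_match_type(pattern: str, match_type: str = "") -> "re.Pattern[str]":
    """Translate MySQL match_type flags into Python re flags (illustrative only)."""
    if pattern == "":
        # Mirrors the empty-pattern rejection (MySQL ERROR 3685).
        raise ValueError("Illegal argument to a regular expression.")

    case_insensitive = True  # collation-blind default described in the PR
    flags = 0
    for flag in match_type:
        if flag == "c":
            case_insensitive = False  # last of c/i wins
        elif flag == "i":
            case_insensitive = True
        elif flag == "m":
            flags |= re.MULTILINE     # ^ and $ recognize line terminators
        elif flag == "n":
            flags |= re.DOTALL        # . matches line terminators
        elif flag == "u":
            pass                      # Unix-only line endings: treated as a no-op
        else:
            raise ValueError(f"Invalid match_type flag: {flag}.")

    if case_insensitive:
        flags |= re.IGNORECASE
    return re.compile(pattern, flags)
```

Resolving the case flag only after the loop is what makes "last of the case flags wins" fall out naturally: `"ic"` ends case-sensitive, `"ci"` ends case-insensitive.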

Implements MySQL REGEXP_REPLACE(expr, pattern, replacement [, pos [, occurrence [, match_type]]]) with three new private helpers:

- regexp_char_to_byte_offset() converts a 1-based character pos into a byte offset, accepting char_count + 1 for the "start at end" case that MySQL allows for REPLACE / SUBSTR.
- regexp_find_matches() walks the subject with preg_match and its offset argument so that lookbehind assertions can see bytes before pos. Skips UTF-8 continuation bytes after zero-width matches.
- regexp_expand_replacement() implements MySQL/ICU replacement grammar: "$N" backreferences (with "$0" = full match and longest valid digit-prefix wins), "\X" emits X literally, "${N}" is rejected as invalid, and a trailing lone backslash is dropped. Errors mirror MySQL's: "A capture group has an invalid name." (3887) and "Index out of bounds in regular expression search." (3686).

REGEXP_REPLACE rebuilds the result by walking collected matches, emitting the in-between bytes verbatim and substituting only the targeted occurrence (or all when occurrence = 0). Negative occurrence is clamped to 1 to match MySQL.

Tests cover the data-driven happy path, NULL propagation, every documented backreference form, lookbehind across pos, zero-width matches, the pos = char_count + 1 edge, the negative-occurrence clamp, and the ICU error branches.
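The replacement grammar described above can be sketched in a few lines of Python. This is an illustration, not the PR's regexp_expand_replacement(): the function name and the `groups` argument are hypothetical, and the split between the two errors (invalid name for a `$` not followed by a digit, index out of bounds when even the first digit exceeds the group count) is an assumption modeled on ICU's behavior.

```python
from typing import Optional, Sequence


def expand_replacement(template: str, groups: Sequence[Optional[str]]) -> str:
    """Expand a MySQL/ICU-style replacement template (illustrative only).

    groups[0] is the full match; groups[1:] are the capture groups.
    """
    max_group = len(groups) - 1
    out = []
    i, n = 0, len(template)
    while i < n:
        ch = template[i]
        if ch == "\\":
            # "\X" emits X literally; a trailing lone backslash is dropped.
            if i + 1 < n:
                out.append(template[i + 1])
            i += 2
        elif ch == "$":
            i += 1
            if i >= n or not ("0" <= template[i] <= "9"):
                # Covers a bare "$" and the rejected "${N}" form.
                raise ValueError("A capture group has an invalid name.")
            group, digits = 0, 0
            # Consume digits only while the number still names a valid group.
            while i < n and "0" <= template[i] <= "9":
                candidate = group * 10 + int(template[i])
                if candidate > max_group:
                    break
                group, digits, i = candidate, digits + 1, i + 1
            if digits == 0:
                # Assumption: the first digit alone is already out of range.
                raise ValueError("Index out of bounds in regular expression search.")
            out.append(groups[group] or "")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

With a single capture group, `"$12"` expands to group 1 followed by a literal `2`, which is the "longest valid digit-prefix wins" rule in action.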

Implements MySQL REGEXP_SUBSTR(expr, pattern [, pos [, occurrence [, match_type]]]), which returns the Nth matched substring at or after a given character position, or NULL if there is no match. Reuses regexp_compile(), regexp_char_to_byte_offset() (with $allow_past_end so pos = char_count + 1 yields NULL), regexp_find_matches(), and regexp_fail() introduced with REGEXP_REPLACE. Negative or zero `occurrence` is clamped to 1, matching MySQL.

Tests cover the data-driven happy path, NULL propagation, the occurrence clamp, the pos = char_count + 1 / pos > char_count + 1 boundary, multi-byte matches, invalid patterns, invalid flags, and a lookbehind whose context spans pos.
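Why an offset-based walk matters for lookbehind can be shown with a small Python sketch. It is an analogue rather than the PR's code: `find_nth_match` is a hypothetical name, Python's `Pattern.search(string, pos)` stands in for preg_match with its offset argument, and `pos` validation is left out. Python strings are indexed by code point, so the continuation-byte skipping the PR needs for UTF-8 bytes has no counterpart here.

```python
import re
from typing import Optional


def find_nth_match(pattern: str, subject: str, pos: int = 1, occurrence: int = 1) -> Optional[str]:
    """Return the Nth match at or after the 1-based character pos, else None."""
    occurrence = max(occurrence, 1)  # zero or negative occurrence clamps to 1
    regex = re.compile(pattern)
    start = pos - 1                  # convert to a 0-based index
    seen = 0
    while start <= len(subject):
        # Searching with a start index (not a slice) keeps the text before
        # `start` visible to lookbehind assertions.
        match = regex.search(subject, start)
        if match is None:
            return None
        seen += 1
        if seen == occurrence:
            return match.group(0)
        # Continue after the match; step one character past a zero-width
        # match so the walk always makes progress.
        start = match.end() if match.end() > match.start() else match.end() + 1
    return None
```

For example, `find_nth_match(r"(?<=a)b", "ab", pos=2)` finds `b` because the lookbehind still sees the `a` before `pos`, whereas running the same pattern on the sliced string `"b"` finds nothing.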

Implements MySQL REGEXP_INSTR(expr, pattern [, pos [, occurrence [, return_option [, match_type]]]]), which returns the 1-based character position of the Nth match (or 0 if none), with return_option controlling whether to report the match start or the position one past its end. Adds the small regexp_byte_offset_to_char_index() helper that converts a byte offset returned by PCRE into a UTF-8 character index.

`pos` greater than char_count is rejected even when SUBSTR / REPLACE allow it, matching MySQL's stricter validation for INSTR. Negative or zero `occurrence` is clamped to 1, also matching MySQL. return_option must be 0 (start) or 1 (one past end); anything else raises "Incorrect arguments to regexp_instr: return_option must be 1 or 0." (matching MySQL's wording). The check runs before the occurrence clamp so the message is consistent.

Tests cover the data-driven happy path, NULL propagation, the occurrence clamp, the straddling-match boundary, multi-byte return_option=1, the return_option validation (including under otherwise no-op occurrences), and a lookbehind whose context spans pos.
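The byte-offset-to-character conversion and the return_option rule can be sketched together in Python. This is an illustration under stated assumptions, not the PR's regexp_byte_offset_to_char_index(): the function name `instr_position` and its parameters are hypothetical, and the match's byte offsets are assumed to come from a byte-oriented regex engine.

```python
def instr_position(subject: bytes, match_start: int, match_end: int, return_option: int) -> int:
    """Turn a match's byte offsets into a 1-based character position.

    return_option 0 reports the match start; 1 reports one past its end.
    """
    if return_option not in (0, 1):
        raise ValueError(
            "Incorrect arguments to regexp_instr: return_option must be 1 or 0."
        )
    byte_offset = match_end if return_option == 1 else match_start
    # Decoding the prefix counts code points, so multi-byte characters
    # (including 4-byte astral-plane ones) each count as one position.
    return len(subject[:byte_offset].decode("utf-8")) + 1
```

For the subject `héllo`, a match on the first `l` occupies bytes 3 to 4 but is reported at character position 3 (or 4 with return_option = 1).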

Adds a final layer of tests that exercise behaviors which involve more than one of REGEXP_LIKE / _REPLACE / _SUBSTR / _INSTR at once and only become testable once all four functions are available:

- Empty pattern raises "Illegal argument to a regular expression." uniformly (MySQL ERROR 3685).
- Empty subject with a zero-width-matching pattern still produces a match (LIKE = 1, SUBSTR = "", INSTR = 1).
- Zero-width anchors ^ / $ report sensible 1-based positions and an empty-string match for SUBSTR rather than NULL.
- Astral-plane (4-byte UTF-8) characters are counted as one code point by both SUBSTR and INSTR.
- Negative pos rejects consistently across REPLACE / SUBSTR / INSTR.

JanJakes added 5 commits April 21, 2026 17:00

JanJakes wants to merge 5 commits into trunk from regex
