Add MySQL REGEXP_LIKE(), REGEXP_REPLACE(), REGEXP_SUBSTR(), REGEXP_INSTR() UDFs#368

Draft
JanJakes wants to merge 5 commits into trunk from regex

@JanJakes
Member

Summary

Implements the four MySQL 8.0 regular-expression functions on top of SQLite by registering PHP UDFs that translate MySQL/ICU semantics onto PCRE.

  1. REGEXP_LIKE(expr, pattern [, match_type]): boolean matching. Introduces a shared regexp_compile() that maps MySQL match-type flags (c/i/m/n/u) onto PCRE modifiers with UTF-8 mode always on, plus regexp_run() / regexp_fail(), which distinguish compile errors from backtrack-limit and invalid-UTF-8 failures.
  2. REGEXP_REPLACE(expr, pattern, replacement [, pos [, occurrence [, match_type]]]): uses a walker that calls preg_match with its offset argument so that lookbehind assertions can still see the text before pos. Implements the MySQL/ICU replacement grammar: $N backreferences ($0 is the full match, and the longest valid digit prefix wins), \X emits X without the backslash, ${N} is rejected, and a trailing \ is dropped. occurrence = 0 replaces all matches; negative values are clamped to 1.
  3. REGEXP_SUBSTR(expr, pattern [, pos [, occurrence [, match_type]]]): returns the Nth match at or after pos. Accepts pos = char_count + 1 (returns NULL) and clamps occurrence <= 0 to 1.
  4. REGEXP_INSTR(expr, pattern [, pos [, occurrence [, return_option [, match_type]]]]): returns a 1-based character position. return_option must be exactly 0 or 1, and pos > char_count is rejected even where SUBSTR/REPLACE allow it.

MySQL fidelity

Behavior was verified against MySQL 8.0.45 in Docker across ~100 SQL cases. Error texts mirror MySQL's ICU wording where practical:

  • Illegal argument to a regular expression. (empty pattern, ERROR 3685)
  • A capture group has an invalid name. (invalid `$` in replacement, ERROR 3887)
  • Index out of bounds in regular expression search. (ERROR 3686)
  • Incorrect arguments to regexp_instr: return_option must be 1 or 0.

Known limitations

Documented in `regexp_compile()`:

  • The UDF has no access to the session collation, so case-sensitivity defaults to insensitive (correct for `utf8mb4_0900_ai_ci`, wrong for `_bin`/`_cs`; callers on those collations should pass an explicit `c` match_type).
  • The `/u` PCRE modifier is always applied. Binary data with invalid UTF-8 that would match under the legacy `REGEXP` operator raises `Invalid UTF-8 data in regular expression input.` when routed through the new UDFs.

Test plan

  • `cd packages/mysql-on-sqlite && composer run test` passes (756 tests on my machine)
  • `composer run check-cs` clean on both changed files
  • Review: four new public methods and seven new private helpers in `packages/mysql-on-sqlite/src/sqlite/class-wp-sqlite-pdo-user-defined-functions.php`
  • Spot-check a few queries against MySQL 8 for any corner cases you care about

Addresses #47.

Implements MySQL REGEXP_LIKE(expr, pattern [, match_type]) via a new
variadic UDF. Introduces a shared regexp_compile() helper that
translates MySQL match_type flags to PCRE modifiers and always uses
UTF-8 mode, plus regexp_run() (suppresses preg_* warnings) and
regexp_fail() (translates preg failures into MySQL-style messages).

regexp_compile() rejects empty patterns to match MySQL ERROR 3685 and
documents two known limitations of the emulation: collation-blind
case-sensitivity defaulting and the always-on /u modifier diverging
from the legacy REGEXP operator on binary data.
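
For readers who want to see the flag translation in isolation, here is a minimal sketch. It is written in Python against the standard `re` module rather than PHP/PCRE, so it is not the PR's code: the function name `compile_match_type` is hypothetical, and the case-insensitive default is taken from the collation limitation described above.

```python
import re


def compile_match_type(pattern: str, match_type: str = "") -> re.Pattern:
    """Translate MySQL-style match_type flags into Python `re` flags.

    Illustrative only: the PR maps these flags onto PCRE modifiers in PHP.
    """
    if pattern == "":
        # Mirrors the empty-pattern rejection (MySQL ERROR 3685).
        raise ValueError("Illegal argument to a regular expression.")

    case_insensitive = True  # assumed default, since the collation is unknown
    flags = 0
    for flag in match_type:
        if flag == "c":
            case_insensitive = False  # the last of c / i wins
        elif flag == "i":
            case_insensitive = True
        elif flag == "m":
            flags |= re.MULTILINE  # ^ and $ match at line boundaries
        elif flag == "n":
            flags |= re.DOTALL  # "." also matches line terminators
        elif flag == "u":
            pass  # Unix-only line endings: treated as a no-op
        else:
            raise ValueError(f"Invalid match_type flag: {flag}.")

    if case_insensitive:
        flags |= re.IGNORECASE
    return re.compile(pattern, flags)
```

With this sketch, `compile_match_type("abc", "ci")` matches `"ABC"` while `compile_match_type("abc", "ic")` does not, because only the last case flag is kept.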

The match_type loop accepts MySQL's c/i/m/n/u flags. The last of the
case flags (c/i) wins, and "u" (Unix-only line endings) is a no-op
because PCRE's default already matches those semantics. Unknown flags
raise "Invalid match_type flag: X.".

Tests cover: data-driven match cases, NULL propagation, invalid
patterns, multi-flag combinations, UTF-8 input errors via the
PREG_BAD_UTF8_ERROR branch, the backtrack-limit branch, and that
the legacy REGEXP operator still works alongside REGEXP_LIKE.

Implements MySQL REGEXP_REPLACE(expr, pattern, replacement [, pos
[, occurrence [, match_type]]]) with three new private helpers:

- regexp_char_to_byte_offset() converts a 1-based character pos into a
  byte offset, accepting char_count + 1 for the "start at end" case
  that MySQL allows for REPLACE / SUBSTR.
- regexp_find_matches() walks the subject with preg_match and its
  offset argument so that lookbehind assertions can see bytes before
  pos. Skips UTF-8 continuation bytes after zero-width matches.
- regexp_expand_replacement() implements MySQL/ICU replacement
  grammar: "$N" backreferences (with "$0" = full match and longest
  valid digit-prefix wins), "\X" emits X literally, "${N}" is rejected
  as invalid, and a trailing lone backslash is dropped. Errors mirror
  MySQL's: "A capture group has an invalid name." (3887) and "Index
  out of bounds in regular expression search." (3686).
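
The replacement grammar is easier to follow as code. The sketch below is a Python illustration, not the PR's PHP helper. The name `expand_replacement` and the `groups` list layout (index 0 holds the full match) are hypothetical, and the choice of which error is raised in which case is an assumption based on the two messages quoted above.

```python
DIGITS = "0123456789"


def expand_replacement(replacement: str, groups: list) -> str:
    """Expand a MySQL/ICU-style replacement string.

    groups[0] is the full match; groups[1:] are the capture groups.
    """
    out = []
    i, n = 0, len(replacement)
    while i < n:
        ch = replacement[i]
        if ch == "\\":
            # "\X" emits X literally; a trailing lone backslash is dropped.
            if i + 1 < n:
                out.append(replacement[i + 1])
            i += 2
        elif ch == "$":
            start = i + 1
            # "$" must be followed by a digit, so "${N}" is rejected here.
            if start >= n or replacement[start] not in DIGITS:
                raise ValueError("A capture group has an invalid name.")
            group, end = None, start
            # Consume digits only while they still name an existing group,
            # so the longest valid digit prefix wins.
            while end < n and replacement[end] in DIGITS:
                candidate = int(replacement[start:end + 1])
                if candidate >= len(groups):
                    break
                group, end = candidate, end + 1
            if group is None:
                # Assumption: even the first digit names a missing group.
                raise ValueError(
                    "Index out of bounds in regular expression search."
                )
            out.append(groups[group] or "")  # unmatched group -> empty
            i = end
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, with `groups = ["ab", "a", "b"]`, the replacement `"$12"` expands to `"a2"`: group 12 does not exist, so only `$1` is consumed and the `2` is kept as literal text.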

REGEXP_REPLACE rebuilds the result by walking collected matches,
emitting the in-between bytes verbatim and substituting only the
targeted occurrence (or all when occurrence = 0). Negative occurrence
is clamped to 1 to match MySQL.
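
The walk-and-rebuild step can be sketched as follows. This is a Python illustration under stated assumptions rather than the PR's code: `find_matches` and `replace_matches` are hypothetical names, the replacement is treated as literal text (no `$N` expansion), and the wording of the out-of-range error is not taken from the PR. Python's `Pattern.search(string, pos)` plays the role of preg_match with an offset, because it starts searching at `pos` without hiding the earlier text from lookbehind assertions.

```python
import re


def find_matches(regex: re.Pattern, subject: str, start: int) -> list:
    """Collect successive matches beginning at character index `start`."""
    matches = []
    offset = start
    while offset <= len(subject):
        m = regex.search(subject, offset)
        if m is None:
            break
        matches.append(m)
        # After a zero-width match, step one character forward so the
        # loop always makes progress. Python strings are indexed by code
        # point, so there are no UTF-8 continuation bytes to skip here.
        offset = m.end() if m.end() > m.start() else m.end() + 1
    return matches


def replace_matches(subject, pattern, replacement, pos=1, occurrence=0):
    """Replace the Nth match at or after `pos`, or all when occurrence is 0."""
    if pos < 1 or pos > len(subject) + 1:
        raise ValueError("pos is out of range")  # wording is an assumption
    if occurrence < 0:
        occurrence = 1  # negative occurrence is clamped to 1

    out = []
    cursor = 0
    matches = find_matches(re.compile(pattern), subject, pos - 1)
    for index, m in enumerate(matches, start=1):
        if occurrence not in (0, index):
            continue
        out.append(subject[cursor:m.start()])  # text between matches
        out.append(replacement)
        cursor = m.end()
    out.append(subject[cursor:])  # remaining tail, emitted verbatim
    return "".join(out)
```

A short usage example:

```python
replace_matches("a b c", "[a-z]", "X")                # "X X X"
replace_matches("a b c", "[a-z]", "X", occurrence=2)  # "a X c"
replace_matches("ab", "(?<=a)b", "X", pos=2)          # "aX"
```

The last call shows why slicing the subject at pos would be wrong: the lookbehind needs to see the `a` that sits before the starting position.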

Tests cover the data-driven happy path, NULL propagation, every
documented backreference form, lookbehind across pos, zero-width
matches, the pos = char_count + 1 edge, the negative-occurrence
clamp, and the ICU error branches.

REGEXP_SUBSTR returns the Nth matched substring at or after a given character
position, or NULL if no match. Reuses regexp_compile(),
regexp_char_to_byte_offset() (with $allow_past_end so pos =
char_count + 1 yields NULL), regexp_find_matches(), and regexp_fail()
introduced with REGEXP_REPLACE.
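
As a rough illustration of the character-to-byte conversion, here is a minimal Python sketch. It is not the PR's PHP helper: the function name and the `allow_past_end` keyword mirror the description above, while the error message used for out-of-range positions and the special handling of pos = 1 on an empty subject are assumptions.

```python
def char_to_byte_offset(subject: str, pos: int,
                        allow_past_end: bool = False) -> int:
    """Convert a 1-based character position into a UTF-8 byte offset."""
    char_count = len(subject)
    # REPLACE / SUBSTR style callers may start one past the last character.
    limit = char_count + 1 if allow_past_end else char_count
    # Assumption: pos = 1 is always accepted so that an empty subject can
    # still be searched by a zero-width pattern.
    limit = max(limit, 1)
    if pos < 1 or pos > limit:
        # Assumption: the out-of-bounds message is reused for a bad pos.
        raise ValueError("Index out of bounds in regular expression search.")
    # The byte offset equals the encoded length of the preceding characters.
    return len(subject[:pos - 1].encode("utf-8"))
```

For instance, `char_to_byte_offset("héllo", 3)` is 3, because `h` takes one byte and `é` takes two, and `char_to_byte_offset("abc", 4, allow_past_end=True)` is 3, the offset just past the end.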

Negative or zero `occurrence` is clamped to 1, matching MySQL.

Tests cover the data-driven happy path, NULL propagation, the
occurrence clamp, the pos = char_count + 1 / pos > char_count + 1
boundary, multi-byte matches, invalid patterns, invalid flags, and a
lookbehind whose context spans pos.

REGEXP_INSTR returns the 1-based character position of the Nth match (or 0 if
none), with return_option controlling whether to report the match
start or the position one past its end. Adds the small
regexp_byte_offset_to_char_index() helper that converts a byte offset
returned by PCRE into a UTF-8 character index.
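
One way to do this conversion is to count the bytes before the offset that start a character, skipping UTF-8 continuation bytes (those of the form 10xxxxxx). The Python sketch below illustrates that idea; it is not the PR's PHP helper, the function name is hypothetical, and it assumes the input is valid UTF-8.

```python
import re


def byte_offset_to_char_index(subject: bytes, byte_offset: int) -> int:
    """Return the 1-based character index for a UTF-8 byte offset."""
    lead_bytes = sum(
        1 for byte in subject[:byte_offset]
        if (byte & 0xC0) != 0x80  # skip continuation bytes
    )
    return lead_bytes + 1


# Usage: a byte-oriented search reports byte offsets, like PCRE does.
data = "a\U0001F600b".encode("utf-8")   # "a", a 4-byte emoji, "b"
match = re.search(rb"b", data)
position = byte_offset_to_char_index(data, match.start())  # 3
```

The match starts at byte offset 5, but it is reported as character 3 because the emoji counts as a single code point.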

`pos` greater than char_count is rejected even when SUBSTR / REPLACE
allow it, matching MySQL's stricter validation for INSTR. Negative or
zero `occurrence` is clamped to 1, also matching MySQL.

return_option must be 0 (start) or 1 (one past end); anything else
raises "Incorrect arguments to regexp_instr: return_option must be 1
or 0." (matching MySQL's wording). The check runs before the
occurrence clamp so the message is consistent.
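
The order of the checks can be sketched as follows. This is a simplified Python illustration, not the PR's implementation: `regexp_instr_sketch` is a hypothetical name, match_type handling is omitted, and the message used for an out-of-range pos is an assumption.

```python
import re


def regexp_instr_sketch(subject, pattern, pos=1, occurrence=1,
                        return_option=0):
    """Return the 1-based position of the Nth match, or 0 if none."""
    # Validate return_option first so the error does not depend on
    # whether occurrence would have been clamped.
    if return_option not in (0, 1):
        raise ValueError(
            "Incorrect arguments to regexp_instr: "
            "return_option must be 1 or 0."
        )
    if occurrence < 1:
        occurrence = 1  # zero or negative occurrence is clamped to 1
    # INSTR is stricter than SUBSTR / REPLACE: pos may not exceed the
    # character count (pos = 1 is kept valid for an empty subject, which
    # is an assumption of this sketch).
    if pos < 1 or pos > max(len(subject), 1):
        raise ValueError("Index out of bounds in regular expression search.")

    regex = re.compile(pattern)
    offset, found = pos - 1, 0
    while offset <= len(subject):
        m = regex.search(subject, offset)
        if m is None:
            return 0
        found += 1
        if found == occurrence:
            # 0 reports the match start, 1 the position one past its end.
            return (m.end() if return_option == 1 else m.start()) + 1
        # Step past zero-width matches so the loop makes progress.
        offset = m.end() if m.end() > m.start() else m.end() + 1
    return 0
```

For example, `regexp_instr_sketch("dog cat dog", "dog", occurrence=2)` is 9, and with `return_option=1` the first match yields 4, the position just after `dog`.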

Tests cover the data-driven happy path, NULL propagation, the
occurrence clamp, the straddling-match boundary, multi-byte
return_option=1, the return_option validation (including under
otherwise no-op occurrences), and a lookbehind whose context spans
pos.

Adds a final layer of tests that exercise behaviors which involve more
than one of REGEXP_LIKE / _REPLACE / _SUBSTR / _INSTR at once and only
become testable once all four functions are available:

- Empty pattern raises "Illegal argument to a regular expression."
  uniformly (MySQL ERROR 3685).
- Empty subject with a zero-width-matching pattern still produces a
  match (LIKE = 1, SUBSTR = "", INSTR = 1).
- Zero-width anchors ^ / $ report sensible 1-based positions and an
  empty-string match for SUBSTR rather than NULL.
- Astral-plane (4-byte UTF-8) characters are counted as one code
  point by both SUBSTR and INSTR.
- Negative pos is rejected consistently across REPLACE / SUBSTR / INSTR.