
ext/uri: fast-path canonical URIs in get_normalized_uri #21726

Open
iliaal wants to merge 1 commit into php:master from iliaal:perf/uri-rfc3986-reads

Conversation

@iliaal
Contributor

@iliaal iliaal commented Apr 11, 2026

Summary

Fast-path get_normalized_uri() in ext/uri/uri_parser_rfc3986.c when the parsed URI is already in canonical form. A single call to uriNormalizeSyntaxMaskRequiredExA returns the dirty mask; a zero mask means we alias the raw uri instead of running uriCopyUriMmA plus a full uriNormalizeSyntaxExMmA(..., -1, ...) pass. The struct caches the dirty mask so multiple non-raw reads on the same instance only run the scan once.
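
The fast path described above can be modeled in a few lines of pure PHP. This is a conceptual sketch only, not the C implementation: `LazyNormalizedUri` and its toy dirty-mask scan are hypothetical stand-ins for the struct field and uriNormalizeSyntaxMaskRequiredExA, and "normalization" here is reduced to lowercasing the authority.

```php
<?php
// Illustrative model of the dirty-mask fast path. Everything here is a
// stand-in: the real code lives in C and calls
// uriNormalizeSyntaxMaskRequiredExA to compute the mask.
final class LazyNormalizedUri
{
    private ?int $dirtyMask = null; // cached after the first non-raw read

    public function __construct(private readonly string $raw) {}

    public function getNormalized(): string
    {
        // Scan once per instance; later reads reuse the cached mask.
        $this->dirtyMask ??= $this->scanDirtyMask();

        if ($this->dirtyMask === 0) {
            // Fast path: already canonical, alias the raw URI
            // (no deep copy, no full normalization pass).
            return $this->raw;
        }

        // Slow path: normalize only the flagged component
        // (here: lowercase the authority).
        return preg_replace_callback(
            '~^([a-z+.-]+://)([^/]+)~',
            fn (array $m): string => $m[1] . strtolower($m[2]),
            $this->raw
        );
    }

    private function scanDirtyMask(): int
    {
        // Toy scan: flag only an uppercase character in the authority.
        return preg_match('~^[a-z+.-]+://[^/]*[A-Z]~', $this->raw) ? 1 : 0;
    }
}
```

A canonical input like `http://example.com/a` takes the alias path; `http://EXAMPLE.com/a` takes the partial-normalize path.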

This commit replaces an earlier version that bundled four additional changes (port cache, inline digit accumulator in port_str_to_zend_long_checked, emalloc in uriparser_create_uris, guarded normalized_uri destroy). Per review feedback from @TimWolla, the PR now carries only the one change that stands on its own merits.

Benchmark

17 URL shapes: plain http/https, deep paths, query/fragment, userinfo, IPv4, IPv6, mailto, URN, data, file, relative, long paths with query. 100K iterations per run × 17 URLs = 1.7M parses per measurement, 10 runs per scenario, CPU pinned via taskset -c 0, same-session stash-pop A/B.

| scenario | baseline mean | optimized mean | delta |
| --- | --- | --- | --- |
| parse only | 0.3992s (4.26M/s) | 0.4083s (4.16M/s) | noise (parse path unchanged) |
| parse + 1 read | 0.6687s (2.54M/s) | 0.5464s (3.11M/s) | −18.3% / +22.4% throughput |
| parse + 7 reads | 0.8510s (2.00M/s) | 0.7305s (2.33M/s) | −14.2% / +16.5% throughput |

parse + 1 read isolates the first-read cost where this optimization lands. parse + 7 reads shows the realistic user pattern: the first getter pays the reduced normalization cost, the remaining six hit the cached normalized uri and cost the same as before.

hyperfine cross-check

15 runs of the full benchmark script per direction, CPU pinned:

Benchmark 1: baseline
  Time (mean ± σ):     20.471 s ±  1.052 s    [19.535 s … 22.985 s]
Benchmark 2: optimized
  Time (mean ± σ):     17.240 s ±  0.540 s    [16.556 s … 18.190 s]

Summary: optimized runs 1.19 ± 0.07 times faster than baseline

Reproducing

Save the script below as bench.php. Build a baseline and an optimized sapi/cli/php, then:

taskset -c 0 /path/to/php-baseline  -n bench.php
taskset -c 0 /path/to/php-optimized -n bench.php

Or with hyperfine:

hyperfine --warmup 3 --runs 15 \
  -n baseline  'taskset -c 0 /path/to/php-baseline  -n bench.php' \
  -n optimized 'taskset -c 0 /path/to/php-optimized -n bench.php'
bench.php
<?php
// Measures Uri\Rfc3986\Uri::parse() and parse-then-read under two
// access patterns: a single first-read (isolates the normalization
// cost) and a realistic seven-getter sequence.

$urls = [
    'simple_http'    => 'http://example.com/',
    'https_host'     => 'https://www.example.com',
    'deep_path'      => 'https://example.com/a/b/c/d/e/f/index.html',
    'with_query'     => 'https://example.com/search?q=test&lang=en&category=web',
    'with_fragment'  => 'https://example.com/docs/guide#section-3.2',
    'with_userinfo'  => 'https://user:pass@example.com:8080/secure/area',
    'long_query'     => 'https://api.example.com/v2/users?filter=name&sort=desc&page=1&per_page=100&include=profile,settings&fields=id,name,email,created_at',
    'ipv4'           => 'http://192.168.1.1:8080/admin/panel',
    'ipv6'           => 'http://[2001:db8::1]:8080/index',
    'unicode_path'   => 'https://example.com/%E6%97%A5%E6%9C%AC/path',
    'ftp'            => 'ftp://files.example.com/pub/release/latest.tar.gz',
    'mailto'         => 'mailto:user@example.com',
    'urn'            => 'urn:isbn:0451450523',
    'data'           => 'data:text/plain;base64,SGVsbG8gV29ybGQ=',
    'only_path'      => '/path/to/resource?x=1',
    'relative'       => '../relative/path.html',
    'long_url'       => 'https://www.example.com/very/long/path/with/many/segments/that/go/on/and/on/file.html?first=1&second=2&third=3&fourth=4&fifth=5#anchor-point',
];

$iterations = 100000;
$runs = 10;

function bench_parse_only(array $urls, int $iter): float {
    $t = hrtime(true);
    for ($i = 0; $i < $iter; $i++) {
        foreach ($urls as $u) {
            \Uri\Rfc3986\Uri::parse($u);
        }
    }
    return (hrtime(true) - $t) / 1e9;
}

function bench_parse_one_read(array $urls, int $iter): float {
    $t = hrtime(true);
    for ($i = 0; $i < $iter; $i++) {
        foreach ($urls as $u) {
            $o = \Uri\Rfc3986\Uri::parse($u);
            if ($o === null) continue;
            $o->getScheme();
        }
    }
    return (hrtime(true) - $t) / 1e9;
}

function bench_parse_full(array $urls, int $iter): float {
    $t = hrtime(true);
    for ($i = 0; $i < $iter; $i++) {
        foreach ($urls as $u) {
            $o = \Uri\Rfc3986\Uri::parse($u);
            if ($o === null) continue;
            $o->getScheme(); $o->getHost(); $o->getPort();
            $o->getPath(); $o->getQuery(); $o->getFragment();
            $o->getUserInfo();
        }
    }
    return (hrtime(true) - $t) / 1e9;
}

function stats(array $samples): array {
    sort($samples);
    $n = count($samples);
    $mean = array_sum($samples) / $n;
    $variance = 0.0;
    foreach ($samples as $s) {
        $variance += ($s - $mean) ** 2;
    }
    $stddev = $n > 1 ? sqrt($variance / ($n - 1)) : 0.0;
    $median = $n % 2 === 1
        ? $samples[(int)(($n - 1) / 2)]
        : ($samples[(int)($n / 2) - 1] + $samples[(int)($n / 2)]) / 2;
    return ['min' => $samples[0], 'mean' => $mean, 'median' => $median, 'stddev' => $stddev];
}

bench_parse_only($urls, 2000);
bench_parse_one_read($urls, 2000);
bench_parse_full($urls, 2000);

$benches = [
    'parse only     ' => 'bench_parse_only',
    'parse + 1 read ' => 'bench_parse_one_read',
    'parse + 7 reads' => 'bench_parse_full',
];

$results = [];
foreach ($benches as $label => $fn) {
    $samples = [];
    for ($r = 0; $r < $runs; $r++) {
        $samples[] = $fn($urls, $iterations);
    }
    $results[$label] = stats($samples);
}

$total = $iterations * count($urls);
printf("runs=%d  iterations=%d  urls=%d  calls/run=%d\n\n", $runs, $iterations, count($urls), $total);
printf("%-17s %9s %9s %9s %9s   %s\n", '', 'min', 'mean', 'median', 'stddev', 'mean ops/s');
foreach ($results as $label => $s) {
    printf("%-17s %8.4fs %8.4fs %8.4fs %8.4fs   %.2fM\n",
        $label, $s['min'], $s['mean'], $s['median'], $s['stddev'],
        ($total / $s['mean']) / 1e6);
}

Tests

All 309 tests in ext/uri/tests pass. The full normalize path still runs for URIs that need it (checked with http://EXAMPLE.com/A/%2e%2e/c resolving to /c) via the nonzero dirty mask.

@TimWolla
Member

> Five targeted changes in ext/uri/uri_parser_rfc3986.c; the full per-change breakdown is in the commit message.

Putting five changes into a single commit is not "targeted".

Benchmark

Please provide the benchmarking script for independent verification.

@iliaal
Contributor Author

iliaal commented Apr 11, 2026

Re: benchmark script. Save the script below as bench.php, run it from the build root:

taskset -c 0 sapi/cli/php -n -d memory_limit=2G bench.php

Repeat 5 or 6 times per direction and use a same-session A/B (stash the patch, rebuild, rerun in the same shell session) to keep CPU cache and scheduler state consistent between baseline and optimized runs. The warmup is 1000 iterations; the timed loop is 100K × 17 URLs = 1.7M parses per measurement.

bench.php
<?php
// Measures Uri\Rfc3986\Uri::parse(), Uri\WhatWg\Url::parse(), and
// full parse-then-read for both backends.

$urls = [
    'simple_http'    => 'http://example.com/',
    'https_host'     => 'https://www.example.com',
    'deep_path'      => 'https://example.com/a/b/c/d/e/f/index.html',
    'with_query'     => 'https://example.com/search?q=test&lang=en&category=web',
    'with_fragment'  => 'https://example.com/docs/guide#section-3.2',
    'with_userinfo'  => 'https://user:pass@example.com:8080/secure/area',
    'long_query'     => 'https://api.example.com/v2/users?filter=name&sort=desc&page=1&per_page=100&include=profile,settings&fields=id,name,email,created_at',
    'ipv4'           => 'http://192.168.1.1:8080/admin/panel',
    'ipv6'           => 'http://[2001:db8::1]:8080/index',
    'unicode_path'   => 'https://example.com/%E6%97%A5%E6%9C%AC/path',
    'ftp'            => 'ftp://files.example.com/pub/release/latest.tar.gz',
    'mailto'         => 'mailto:user@example.com',
    'urn'            => 'urn:isbn:0451450523',
    'data'           => 'data:text/plain;base64,SGVsbG8gV29ybGQ=',
    'only_path'      => '/path/to/resource?x=1',
    'relative'       => '../relative/path.html',
    'long_url'       => 'https://www.example.com/very/long/path/with/many/segments/that/go/on/and/on/file.html?first=1&second=2&third=3&fourth=4&fifth=5#anchor-point',
];

$iterations = 100000;

function bench_rfc3986(array $urls, int $iter): float {
    $t = hrtime(true);
    for ($i = 0; $i < $iter; $i++) {
        foreach ($urls as $u) {
            \Uri\Rfc3986\Uri::parse($u);
        }
    }
    return (hrtime(true) - $t) / 1e9;
}

function bench_whatwg(array $urls, int $iter): float {
    $t = hrtime(true);
    for ($i = 0; $i < $iter; $i++) {
        foreach ($urls as $u) {
            \Uri\WhatWg\Url::parse($u);
        }
    }
    return (hrtime(true) - $t) / 1e9;
}

function bench_rfc3986_full(array $urls, int $iter): float {
    $t = hrtime(true);
    for ($i = 0; $i < $iter; $i++) {
        foreach ($urls as $u) {
            $o = \Uri\Rfc3986\Uri::parse($u);
            if ($o === null) continue;
            $o->getScheme(); $o->getHost(); $o->getPort();
            $o->getPath(); $o->getQuery(); $o->getFragment();
            $o->getUserInfo();
        }
    }
    return (hrtime(true) - $t) / 1e9;
}

function bench_whatwg_full(array $urls, int $iter): float {
    $t = hrtime(true);
    for ($i = 0; $i < $iter; $i++) {
        foreach ($urls as $u) {
            $o = \Uri\WhatWg\Url::parse($u);
            if ($o === null) continue;
            $o->getScheme(); $o->getAsciiHost(); $o->getPort();
            $o->getPath(); $o->getQuery(); $o->getFragment();
            $o->getUsername(); $o->getPassword();
        }
    }
    return (hrtime(true) - $t) / 1e9;
}

bench_rfc3986($urls, 1000); // warmup
bench_whatwg($urls, 1000);

$runs = 5;
$best_rfc = INF; $best_wg = INF; $best_rfc_full = INF; $best_wg_full = INF;
for ($r = 0; $r < $runs; $r++) {
    $t = bench_rfc3986($urls, $iterations);      if ($t < $best_rfc)      $best_rfc      = $t;
    $t = bench_whatwg($urls, $iterations);       if ($t < $best_wg)       $best_wg       = $t;
    $t = bench_rfc3986_full($urls, $iterations); if ($t < $best_rfc_full) $best_rfc_full = $t;
    $t = bench_whatwg_full($urls, $iterations);  if ($t < $best_wg_full)  $best_wg_full  = $t;
}

$total = $iterations * count($urls);
printf("rfc3986 parse only:      %.4fs  (%.0f/s)\n", $best_rfc,      $total / $best_rfc);
printf("whatwg  parse only:      %.4fs  (%.0f/s)\n", $best_wg,       $total / $best_wg);
printf("rfc3986 parse + 7 reads: %.4fs  (%.0f/s)\n", $best_rfc_full, $total / $best_rfc_full);
printf("whatwg  parse + 8 reads: %.4fs  (%.0f/s)\n", $best_wg_full,  $total / $best_wg_full);

Re: "not targeted". The word choice isn't an argument for splitting. On the actual question: three of the five are separable from each other, but two pairs cannot be split without breaking the tree at intermediate commits.

  1. emalloc in uriparser_create_uris (4) and the guarded uriFreeUriMembersMmA on normalized_uri in _destroy (5) have to land in the same commit. The emalloc leaves normalized_uri as uninitialized bytes. Without the destroy guard in the same commit, destroy reads garbage and segfaults. With the guard added first and the emalloc later, the guard is dead code.

  2. The port cache (2) is one invariant spread across three sites: parse-time stash, read-time short-circuit, and write-time invalidate. Pull any one of them out and the others leave state inconsistent: a stash nothing reads, or reads that return stale values after a write.

@TimWolla
Member

> On the actual question: three of the five are separable from each other, but two pairs cannot be split without breaking the tree at intermediate commits.

Why did you initially list them as five individual "targeted" changes when some of them cannot stand on their own?

> but two pairs cannot be split

You list only one pair and one individual patch in the following list.


Either way, this PR needs to be split into actually targeted changes that each stand on their own and come with their own justification as to why the added complexity to the code is worth it. That includes a targeted benchmark for each change, ideally measured with Hyperfine. It is not useful to benchmark code that clearly is unaffected by the change (e.g. the WHATWG implementation, or calling multiple getters after the normalized URL is already cached).

When Uri\Rfc3986\Uri::parse() produces a URI already in canonical form
(the common case: http/https URLs with no uppercase host, no
percent-encoding in unreserved ranges, no ".." path segments),
get_normalized_uri() no longer deep-copies the parsed struct and runs
a full normalization pass. It calls uriNormalizeSyntaxMaskRequiredExA
once to compute the dirty mask; a zero mask means we alias the raw
uri. The struct caches the dirty mask, so multiple non-raw reads on
the same instance only run the scan once.

Fallback: when the mask is nonzero, we copy and normalize as before,
but only for the flagged components (uriNormalizeSyntaxExMmA(...,
dirty_mask, ...) instead of (..., -1, ...)).

Measurements on a 17-URL mix with a realistic parse-and-read workload
(10 runs of 1.7M parses each, CPU pinned via taskset, same-session
stash-pop A/B so both builds share machine state):

                        baseline mean    optimized mean    delta
    parse only         0.3992s (4.26M/s)  0.4083s (4.16M/s)  noise
    parse + 1 read     0.6687s (2.54M/s)  0.5464s (3.11M/s)  -18.3%
    parse + 7 reads    0.8510s (2.00M/s)  0.7305s (2.33M/s)  -14.2%

The "parse + 1 read" row isolates the first-read cost where this
change lands. The "parse + 7 reads" row shows the amortized effect
under a realistic user pattern: the first getter pays the reduced
normalization cost, and the remaining six getters hit the cached
normalized uri and cost the same as before.

hyperfine cross-check on the whole benchmark script, 15 runs each:

    baseline   20.471 s +/- 1.052 s  [19.535 .. 22.985]
    optimized  17.240 s +/- 0.540 s  [16.556 .. 18.190]
    optimized runs 1.19 +/- 0.07 times faster.

All 309 tests in ext/uri/tests pass. I checked that URIs needing
normalization (http://EXAMPLE.com/A/%2e%2e/c resolving to /c) still
hit the full normalize path through the nonzero dirty mask.
@iliaal iliaal force-pushed the perf/uri-rfc3986-reads branch from b5d6508 to 6c8cc4c Compare April 11, 2026 23:13
@iliaal iliaal changed the title from "ext/uri: speed up Uri\Rfc3986\Uri component reads (~22%)" to "ext/uri: fast-path canonical URIs in get_normalized_uri" on Apr 11, 2026
@iliaal
Contributor Author

iliaal commented Apr 11, 2026

> It is not useful to benchmark code that clearly is unaffected by the change... calling multiple getters after the normalized URL is already cached

Every iteration of the benchmark parses a fresh URI, so the normalization cache starts cold on each call. The first getter is where this optimization lands, and the subsequent six amortize that saving. That's a realistic read pattern. I've added a parse + 1 read variant to the new benchmark that isolates the first-read cost on its own, which is the clearest single number for this change (−18.3%).

I've also rewritten the commit to carry only the normalize-alias fast path. That's the single biggest contributor of the five original changes. The other four (port cache, inline digit accumulator in port_str_to_zend_long_checked, emalloc in uriparser_create_uris, guarded normalized_uri destroy) are marginal on their own and I'm keeping them in my own fork. The per-change review overhead doesn't make sense.

Updated PR description has the per-scenario numbers (three access patterns, min/mean/median/stddev across 10 runs) plus a hyperfine cross-check (1.19 ± 0.07x faster, 15 runs per direction, process-level). Bench script is in the body under <details> so you can reproduce.

@LamentXU123
Contributor

LamentXU123 commented Apr 12, 2026

> Bench script is in the body under <details> so you can reproduce.

It's in your PR description under the 'Reproducing' section I presume. It seems like your benchmark already has a timer function? No need to add it if you are running hyperfine.

Every iteration of the benchmark parses a fresh URI, so the normalization cache starts cold on each call. The first getter is where this optimization lands, and the subsequent six amortize that saving. That's a realistic read pattern.

It is actually cached.


I'll try to reproduce your changes later.

@iliaal
Contributor Author

iliaal commented Apr 12, 2026

> It's in your PR description under the 'Reproducing' section I presume. It seems like your benchmark already has a timer function? No need to add it if you are running hyperfine.

Admittedly a duplication; I prefer timers inside the code itself. hyperfine (a new tool I learned about) is fine (no pun intended 😆), but it measures the entire PHP stack, which in theory is the same across runs, while hrtime() (used here) and microtime(true) time specifically the code being evaluated and don't include any PHP init/shutdown overhead.

> Every iteration of the benchmark parses a fresh URI, so the normalization cache starts cold on each call. The first getter is where this optimization lands, and the subsequent six amortize that saving. That's a realistic read pattern.

> It is actually cached.

Uri\Rfc3986\Uri::parse() returns a new object each iteration. The normalized cache lives on the object instance (normalized_uri_initialized = false at construction). So yes, within one iteration getters 2-7 hit the warm cache, but the first getter on each freshly-parsed URI pays the full normalization cost. That's what parse + 1 read isolates. The "parse + 7 reads" variant is there to show the amortized pattern, not to claim all seven benefit from the optimization.
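
The cold-cache point can be shown with a self-contained sketch (`LazyValue` and its counter are illustrative, not part of ext/uri):

```php
<?php
// The lazily-computed value lives on the instance, and each parse()
// returns a fresh instance, so the cache starts cold every iteration.
final class LazyValue
{
    public static int $computeCount = 0; // instrumentation for the example
    private ?string $value = null;

    public function get(): string
    {
        if ($this->value === null) {
            self::$computeCount++;       // the first read pays the cost
            $this->value = 'normalized';
        }
        return $this->value;             // reads 2..n hit the warm cache
    }
}

for ($i = 0; $i < 3; $i++) {
    $o = new LazyValue(); // fresh instance per iteration: cold cache
    $o->get();            // cold read: computes
    $o->get();            // warm read: cached
    $o->get();            // warm read: cached
}
// One compute per fresh instance, regardless of how many reads follow:
// LazyValue::$computeCount is now 3.
```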

Comment on lines 25 to 32
struct php_uri_parser_rfc3986_uris {
UriUriA uri;
UriUriA normalized_uri;
unsigned int dirty_mask;
bool normalized_uri_initialized;
bool normalized_uri_is_alias;
bool dirty_mask_valid;
};
Member


I don't think we need all these additional fields. dirty_mask_valid is guaranteed to be identical to normalized_uri_initialized, and normalized_uri_is_alias is identical to dirty_mask == URI_NORMALIZED.

struct php_uri_parser_rfc3986_uris {
UriUriA uri;
UriUriA normalized_uri;
unsigned int dirty_mask;
Member


dirty_mask is a misleading name; it implies some fields need to be stored after changing them.

uriparser_uris->dirty_mask_valid = true;
}

if (uriparser_uris->dirty_mask == 0) {
Member


Should check against URI_NORMALIZED.
