Fix GH-21734: WHATWG URL parser accepts overlong UTF-8 and invalid continuation bytes #21735
Closed
iliaal wants to merge 1 commit into php:master
… continuation bytes

lxb_encoding_decode_valid_utf_8_single() skipped all UTF-8 validation (continuation byte range, overlong sequences, surrogates), trusting the caller to pass valid input. The URL parser calls it on untrusted user input at 7 sites, and the IDNA code calls it on percent-decoded hostname bytes at 2 more.

Overlong ASCII characters in hostnames passed through IDNA processing as their target codepoints, producing valid domains from byte sequences that look nothing like the canonical form (e.g., %C1%A5%C1%B6... → "evil.com"). Chrome, Firefox, and Safari reject these at the UTF-8 decode step.

Add the missing validation to decode_valid_utf_8_single:

- 2-byte: reject lead bytes < 0xC2 (overlong), validate continuation
- 3-byte: validate continuations, reject 0xE0 + < 0xA0 (overlong), reject 0xED + > 0x9F (surrogates)
- 4-byte: reject lead > 0xF4, validate continuations, reject 0xF0 + < 0x90 (overlong), reject 0xF4 + > 0x8F (> U+10FFFF)

On error, advance by 1 byte (not the full sequence length) so the next byte gets its own decode attempt, matching browser behavior.

Closes phpGH-21734
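The rules listed in the commit message can be sketched as a standalone decoder. This is an illustration only, not the patch itself and not lexbor's API: the function name sketch_decode_utf8, the helper sketch_is_continuation, and the SKETCH_DECODE_ERROR sentinel are all hypothetical, and the sketch assumes the caller passes a non-empty byte range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error sentinel, chosen outside the Unicode range. */
#define SKETCH_DECODE_ERROR 0x1FFFFFu

/* True if b has the continuation-byte form 10xxxxxx. */
static int sketch_is_continuation(uint8_t b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Decode one code point starting at *p (caller guarantees *p < end).
 * Success: returns the code point and advances *p past the sequence.
 * Failure: returns SKETCH_DECODE_ERROR and advances *p by exactly one
 * byte, so the following byte gets its own decode attempt.
 */
static uint32_t sketch_decode_utf8(const uint8_t **p, const uint8_t *end)
{
    assert(p != NULL && *p != NULL && *p < end);

    const uint8_t *s = *p;
    size_t avail = (size_t)(end - s);
    uint8_t b0 = s[0];

    *p = s + 1; /* default: consume one byte (ASCII or error) */

    if (b0 < 0x80) {
        return b0;
    }

    if (b0 >= 0xC2 && b0 <= 0xDF) { /* 2-byte; 0xC0 and 0xC1 are overlong */
        if (avail < 2 || !sketch_is_continuation(s[1])) {
            return SKETCH_DECODE_ERROR;
        }
        *p = s + 2;
        return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
    }

    if (b0 >= 0xE0 && b0 <= 0xEF) { /* 3-byte */
        if (avail < 3 || !sketch_is_continuation(s[1])
            || !sketch_is_continuation(s[2])) {
            return SKETCH_DECODE_ERROR;
        }
        if (b0 == 0xE0 && s[1] < 0xA0) return SKETCH_DECODE_ERROR; /* overlong */
        if (b0 == 0xED && s[1] > 0x9F) return SKETCH_DECODE_ERROR; /* surrogate */
        *p = s + 3;
        return ((uint32_t)(b0 & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6)
             | (uint32_t)(s[2] & 0x3F);
    }

    if (b0 >= 0xF0 && b0 <= 0xF4) { /* 4-byte; lead > 0xF4 is rejected below */
        if (avail < 4 || !sketch_is_continuation(s[1])
            || !sketch_is_continuation(s[2])
            || !sketch_is_continuation(s[3])) {
            return SKETCH_DECODE_ERROR;
        }
        if (b0 == 0xF0 && s[1] < 0x90) return SKETCH_DECODE_ERROR; /* overlong */
        if (b0 == 0xF4 && s[1] > 0x8F) return SKETCH_DECODE_ERROR; /* > U+10FFFF */
        *p = s + 4;
        return ((uint32_t)(b0 & 0x07) << 18)
             | ((uint32_t)(s[1] & 0x3F) << 12)
             | ((uint32_t)(s[2] & 0x3F) << 6)
             | (uint32_t)(s[3] & 0x3F);
    }

    /* Lone continuation bytes, 0xC0, 0xC1, and 0xF5..0xFF land here. */
    return SKETCH_DECODE_ERROR;
}
```

Fed the bytes C1 A5, this sketch reports an error for 0xC1, steps one byte, and then reports a second error for the stray continuation byte 0xA5, rather than skipping both bytes as one unit.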
Member
This needs to be fixed in the upstream library at https://github.com/lexbor/lexbor.
Fixes #21734.

lxb_encoding_decode_valid_utf_8_single() had no UTF-8 validation: no continuation byte range checks, no overlong sequence rejection, no surrogate rejection. It was written to assume the caller already verified the input. The URL parser calls it on untrusted user input at 7 sites, and the IDNA code calls it on percent-decoded hostname bytes at 2 more.

An attacker feeding overlong ASCII characters into a hostname could get them through IDNA processing as their target codepoints, producing valid domains from byte sequences that look nothing like the canonical form. %C1%A5%C1%B6%C1%A9%C1%AC.com resolved to evil.com. Chrome, Firefox, and Safari reject overlong sequences at the UTF-8 decode step.

Added the missing validation to decode_valid_utf_8_single. On error, the decoder advances by 1 byte (not the full sequence length) so the next byte gets its own decode attempt, matching browser behavior.
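To see why the example hostname collapses to "evil", it helps to work through the two-byte arithmetic by hand. The sketch below is not lexbor code; naive_decode_2byte, naive_decode_pairs, and is_overlong_2byte_lead are hypothetical names, and the naive function only imitates what an unchecked decoder does, assuming the input has already been percent-decoded to raw bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical unchecked two-byte decode: masks the payload bits and
 * combines them without asking whether a shorter encoding exists.
 * Example: 0xC1 0xA5 -> ((0xC1 & 0x1F) << 6) | (0xA5 & 0x3F)
 *                     = (0x01 << 6) | 0x25 = 0x65 = 'e'.
 */
static uint32_t naive_decode_2byte(uint8_t lead, uint8_t cont)
{
    return ((uint32_t)(lead & 0x1F) << 6) | (uint32_t)(cont & 0x3F);
}

/*
 * The check a strict decoder applies first: lead bytes 0xC0 and 0xC1
 * can only encode U+0000..U+007F, which must be written as one byte.
 */
static int is_overlong_2byte_lead(uint8_t lead)
{
    return lead == 0xC0 || lead == 0xC1;
}

/*
 * Decode a buffer made only of two-byte sequences into a NUL-terminated
 * string using the naive decoder. Returns the number of characters written.
 */
static size_t naive_decode_pairs(const uint8_t *in, size_t len,
                                 char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL && out_cap > 0 && len % 2 == 0);

    size_t n = 0;
    for (size_t i = 0; i + 1 < len && n + 1 < out_cap; i += 2) {
        out[n++] = (char)naive_decode_2byte(in[i], in[i + 1]);
    }
    out[n] = '\0';
    return n;
}
```

Running naive_decode_pairs over the bytes C1 A5 C1 B6 C1 A9 C1 AC yields the four characters e, v, i, l, while is_overlong_2byte_lead flags every lead byte in that input, which is the point at which a strict decoder stops.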