Files
wordpress/wp-includes/utf8.php
dmsnell 7342284ab7 Charset: Improve UTF-8 scrubbing ability via new UTF-8 scanning pipeline.
This is the fourth in a series of patches to modernize and standardize UTF-8 handling.

`wp_check_invalid_utf8()` has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string.

This patch updates the function to remove all dependency on the system running it. It defers to the `mbstring` extension if that’s available, falling back to the new UTF-8 scanning pipeline.

To support this work, `wp_scrub_utf8()` is created with a proper fallback so that the remaining logic inside of `wp_check_invalid_utf8()` can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees.

Developed in https://github.com/WordPress/wordpress-develop/pull/9498
Discussed in https://core.trac.wordpress.org/ticket/63837

Follow-up to: [60768].
Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25.
Fixes #63837, #29717.
See #63863.

Built from https://develop.svn.wordpress.org/trunk@60793


git-svn-id: http://core.svn.wordpress.org/trunk@60129 1a063a9b-81f0-0310-95a4-ce76da25c4cd
2025-09-23 03:36:32 +00:00

5.6 KiB