rfc:pcre2-migration

PHP RFC: PCRE2 migration

Version: 0.9

Date: 2017-10-16

Author: Anatol Belski, ab@php.net

Status: Implemented (in PHP 7.3)

First Published at: http://wiki.php.net/rfc/pcr2-migration

Introduction

PCRE is the base for many core functionalities in PHP. Currently it is based on 8.x series which is a legacy library version today. It is supported by the mainstream, however no new feature implementations flow in there, it is a bugfix version only. Still, as ext/pcre is the core functionality for PHP, and it is essential to keep it rolling. The up-to-date version is called PCRE2 and it pertains as currently actively supported, also it is where the new features are implemented. However, the API has certain differencies. As the original release announcement tells, PCRE2 should be taken as a new project. Nevertheless, it is the library with the same purpose, that inherits a lot from the original PCRE. Today it's over two years past, since PCRE2 was released. Yes, the API is different, but to the big part is reusable compared to PCRE. It contains already features with no analogue to 8.x series. The PCRE2 JIT lists a wider platform support than PCRE JIT

Proposal

Migrate PHP core to use PCRE2 with the focus on the maximum backward compatibility. The main goal is to bring PCRE2 into the core and have it stable, no new features are targeted by this RFC . If accepted, some new features can definitely find their way before the targeted PHP version is stable.

Backward Incompatible Changes

Internal library API has changed

The 'S' modifier has no effect, patterns are studied automatically. No real impact.

The 'X' modifier is the default behavior in PCRE2. The current patch reverts the behavior to the meaning of 'X' how it was in PCRE, but it might be better to go with the new behavior and have 'X' turned on by default. So currently no impact, too.

Some behavior change due to the newer Unicode engine was sighted. It's Unicode 10 in PCRE2 vs Unicode 7 in PCRE.

Some behavior change can be sighted with invalid patterns.

Proposed PHP Version(s)

PHP 7.3

Impact to SAPIs, existing extensions and Opcache

Code, that makes use of PCRE needs to be rewritten to match the PCRE2 API . Otherwise there's no impact. The current patch takes care of all the core items depending on PCRE. As PCRE2 is bundled with PHP, PHP can be compiled also on systems where libpcre2 is not available. External libpcre2 can be provided by a corresponding package system or compiled on the given system, if desired. As ext/pcre code can change significantly, cross patching between different ext/pcre versions can certainly make impact. However to expect were, that issues between these versions are in most case unrelated to each other.

Performance

So far no negative performance impacts could be sighted at least from the linked patch. The performance is of course pattern and input specific, the tests show at least same performance PCRE2 vs. PCRE. Some test suite runs with phpunit show even a faster operation on the side of PCRE2, when preg_* functions are involved.

New Regex Syntax

These and more are available with the upgrade to PCRE2 10.x, almost nothing to be done on the PHP side. Forward relative back-references, \g{+2} (mirroring the existing \g{-2})

Version check available via patterns such as (?(VERSION>=x)…)

(*NOTEMPTY) and (*NOTEMPTY_ATSTART) tell the engine not to return empty matches)

(*NO_JIT) disable JIT optimization

(*LIMIT_HEAP=d) set the heap size limit to d kilobytes

(*LIMIT_DEPTH=d) set the backtracking limit to d

(*LIMIT_MATCH=d) set the match limit to d More on the PCRE2 syntax vs PCRE syntax pages. In general, PCRE2 seems to have a more explicit pattern interpreter, so invalid patterns are checked more agressively.

New Constants

PCRE_VERSION_MINOR

PCRE_VERSION_MAJOR

Open Issues

None.

Unaffected PHP Functionality

The userland code is unaffected, whereby the pattern checking is done more precise in PCRE2. Invalid patterns are more likely to fail the compilation. The behavior of 'X' modifier was made same in the patch, whereby PCRE2 has 'X' on by default. Also, as mentioned in the impacts section, any C code not using PCRE is unaffected. The 'S' modifier can persist, but won't take effect. The current test suite passes with PCRE2 with almost no change to the tests. One test ext/pcre/tests/bug75207.phpt had to be adjusted because of the newer UNICODE engine. There can be of course behavior differences that teh current tests don't catch, thus it is all the more important to start the QA as early as possible.

Future Scope

PCRE2 has quite a few things to offer. Please check the compiled version of the API changes here http://www.rexegg.com/pcre-documentation.html. Specifically to mention were the following New pcre2_substitute() API

Serialization of compiled patterns

New compilation options that can be turned into modifiers or used in any way otherwise, such as PCRE2_LITERAL PCRE2_NO_DOTSTAR_ANCHOR PCRE2_NEVER_BACKSLASH_C PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES PCRE2_NO_AUTO_CAPTURE PCRE2_EXTENDED_MORE

Other APIs that might be of interest - pcre2_set_parens_nest_limit, pcre2_set_offset_limit, pcre2_set_heap_limit

'X' modifier behavior is default in PCRE2, one could decide to go by it in PHP, too

PCRE2 is context based. The current patch only uses global contexts for all the compiled patterns. This can be changed depending on the future needs to create separate contexts per pattern, or to reuse contexts, etc. PCRE2 also has better Unicode support and a new error reporting API , we might check whether our current UTF-8 sanity checks are still required. Beyond features coming in the new PCRE2 versions are also to take into account.

Vote

Migrate the PHP core to the most current PCRE2 version. PCRE2 migration Real name Yes No ashnazg (ashnazg) bukka (bukka) bwoebi (bwoebi) cmb (cmb) daverandom (daverandom) derick (derick) dm (dm) dmitry (dmitry) emir (emir) galvao (galvao) jhdxr (jhdxr) kalle (kalle) kelunik (kelunik) malukenho (malukenho) narf (narf) nikic (nikic) ocramius (ocramius) peehaa (peehaa) philip (philip) pollita (pollita) ramsey (ramsey) remi (remi) sammyk (sammyk) stas (stas) tpunt (tpunt) zeev (zeev) Final result: 26 0 This poll has been closed. 2/3 majority required. Voting starts on 2017-10-30 and closes no 2017-11-13.

Patches and Tests

Implementation

Merged into 7.3 http://git.php.net/?p=php-src.git;a=commitdiff;h=a5bc5aed71f7a15f14f33bb31b8e17bf5f327e2d

References