RE2 regular expressions in R
- Mentors
- Marek Gagolewski, Toby Dylan Hocking
- Organization
- R project for statistical computing
Base R provides two types of regular expressions: extended regular expressions (the default), implemented with the TRE library, and Perl-like regular expressions, enabled with perl = TRUE and implemented with PCRE.
PCRE includes useful features, such as named capture, but it uses a backtracking algorithm, so certain regular expressions can easily take exponential time or require arbitrary stack depth. Using PCRE in a service backend would leave it open to easy denial-of-service attacks.
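As a minimal sketch of the named capture feature mentioned above, the following uses only base R with perl = TRUE. The date string and the group names (year, month, day) are made up for illustration and are not part of the project itself.

```r
# Illustrative input and pattern (assumptions, not from the project description).
x <- "2017-03-15"
pattern <- "(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})"

# perl = TRUE selects PCRE, which understands the (?<name>...) syntax.
m <- regexpr(pattern, x, perl = TRUE)

# With perl = TRUE, regexpr() attaches the position of each capture group
# as attributes on the match object.
starts  <- as.vector(attr(m, "capture.start"))
lengths <- as.vector(attr(m, "capture.length"))

# Extract each group by position and label it with its group name.
fields <- substring(x, starts, starts + lengths - 1)
names(fields) <- attr(m, "capture.names")

# fields is now a named character vector with elements year, month and day.
print(fields)
```

Base R exposes the groups only as start positions and lengths, so the substring extraction has to be done by hand.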
TRE has a polynomial time complexity but does not include named capture.
stringi is an R package that uses the regular expression engine from the ICU library, which has exponential worst-case time complexity. The stringi package does not yet support named capture because the feature is still considered experimental in ICU.
RE2 is a primarily DFA-based regular expression engine from Google that supports named capture and is fast at matching large amounts of text. Users can build fast and scalable service backends with the RE2 library.
This project will create an R package interface to the RE2 library, providing the R community with the first regular expression package with both named capture and polynomial time complexity.