Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
-
None
Description
Cache compiled regular expressions (and other amortized work) in SQL function runtime. Compiling a regular expression to a pattern is expensive (compared to the cost of matching, given an existing pattern) and therefore caching the compiled form will yield performance benefits if the regular expression is constant or has a small number of values.
Consider the following query:
SELECT ename, job, RLIKE(ename, 'A.*'), RLIKE(ename, job || '.*') FROM emp
The first regular expression, 'A.*', is constant and can be compiled at prepare time or at the start of execution; the second regular expression, job || '.*', might vary from one row to the next. However if the job column has a small number of values it still might be beneficial to cache the compiled regular expression.
If SqlFunctions.rlike could use a cache (mapping from String to java.util.regex.Pattern) then it would achieve benefits in both the constant and non-constant cases.
The cache needs to:
- be thread-safe (in case queries are executing using multiple threads),
- return thread-safe objects (as is Pattern),
- have bounded space (so that a query doesn't blow memory with 1 million distinct regular expressions),
- disposed after the query has terminated,
- (ideally) share with regexes of the same language in the same query,
- not conflict with regexes of different languages in the same query.
One possible implementation is to add an interface FunctionState, with subclasses including class RegexpCache, and if argument 1 of a function is a subclass of FunctionState the compiler would initialize the state in the generated code. The function can rely on the state argument being initialized, and being the same object from one call to the next. Example:
interface FunctionState { } class RegexpCache implements FunctionState { final Cache<String, Pattern> cache = ...; }
This change should install the cache for all applicable functions, including LIKE, ILIKE, RLIKE, SIMILAR, posix regex, REGEXP_REPLACE (MySQL, Oracle), REGEXP_CONTAINS (BigQuery), other BigQuery REGEXP_ functions, PARSE_URL, JSON_REPLACE, PARSE_TIMESTAMP.
It can also be used for functions that have mutable state, e.g. RANDOM with and without a seed.
Attachments
Issue Links
- causes
-
CALCITE-5967 UnsupportedOperationException while implementing a call that requires a special collator
- Closed
- is related to
-
CALCITE-2884 Implement JSON_INSERT, JSON_REPLACE, JSON_SET
- Closed
-
CALCITE-3063 Babel parse should parse PostgreSQL posix regular expressions
- Closed
-
CALCITE-3574 Add RLIKE operator (similar to LIKE, but Hive- and Spark-specific)
- Closed
-
CALCITE-3280 Cannot parse query REGEXP_REPLACE in Redshift
- Closed
-
CALCITE-4443 Add ILIKE operator (similar to LIKE, but case-insensitive, and PostgreSQL-specific)
- Closed
-
CALCITE-5543 Implement BigQuery functions for parsing DATE, TIME, TIMESTAMP, DATETIME
- Closed
-
CALCITE-5873 Add REGEXP_CONTAINS function (enabled in BigQuery library)
- Closed
- relates to
-
CALCITE-6596 Enable function-level cache by default
- Resolved
-
CALCITE-6585 In the Postgres TO_CHAR function, improve caching
- Resolved
-
CALCITE-5995 Add cache to the dejsonize functions (JSON_VALUE, JSON_EXISTS, JSON_QUERY)
- Closed
- links to