Uploaded image for project: 'Jackrabbit Oak'
  1. Jackrabbit Oak
  2. OAK-3574

Query engine: support p=lowercase('x') and other function-based indexes

Details

    • New Feature
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 1.5.11, 1.6.0
    • core, lucene, query
    • None

    Description

      Currently, the query engine and indexes don't support function-based indexes for conditions of the form lowercase(property) = 'x'. This needs to be supported, specially for the Lucene property index.

      Also important is lowercase(name()).

      Attachments

        1. OAK-3574-lucene.patch
          41 kB
          Thomas Mueller
        2. OAK-3574-lucene-2.patch
          47 kB
          Thomas Mueller

        Issue Links

          Activity

            thomasm Thomas Mueller added a comment - - edited

            length(property) = 10 can be supported as well. Also, other comparison types, such as upper(property) > 'abc', length(property) < 10.

            In theory, combined functions can be supported (upper(lower(property)), but the grammar doesn't support that, and such combinations rarely make sense. The condition length(name()) = 10 is not supported by the grammar.

            thomasm Thomas Mueller added a comment - - edited length(property) = 10 can be supported as well. Also, other comparison types, such as upper(property) > 'abc', length(property) < 10. In theory, combined functions can be supported (upper(lower(property)), but the grammar doesn't support that, and such combinations rarely make sense. The condition length(name()) = 10 is not supported by the grammar.
            thomasm Thomas Mueller added a comment -

            http://svn.apache.org/r1747703 (trunk)

            Work in progress. The query engine now supports the relevant cases, and adds property restrictions as needed. What is missing is support on the Lucene index side. I don't think the property index needs to support this currently, even thought it could do that as well.

            thomasm Thomas Mueller added a comment - http://svn.apache.org/r1747703 (trunk) Work in progress. The query engine now supports the relevant cases, and adds property restrictions as needed. What is missing is support on the Lucene index side. I don't think the property index needs to support this currently, even thought it could do that as well.
            chetanm Chetan Mehrotra added a comment - - edited

            thomasm It would be better if parsing of the encoded function semantics is done in PropertyRestriction itself

            So instead of

            +                if (name.startsWith(QueryConstants.FUNCTION_RESTRICTION_PREFIX)) {
            +                    // TODO support function-based indexes
            +                    continue;
            +                }
            

            One can say pr.isFunction() and then have pr.getArgument(index) to get the proper property name (which can be relative). Also for lower/upper was thinking that at Lucene level we should only support case insensitive index i.e. property values say stored always in lower case and comparison done after converting the value from query to lower case. QE can then enforce proper check as the resulting result set would be a super set

            chetanm Chetan Mehrotra added a comment - - edited thomasm It would be better if parsing of the encoded function semantics is done in PropertyRestriction itself So instead of + if (name.startsWith(QueryConstants.FUNCTION_RESTRICTION_PREFIX)) { + // TODO support function-based indexes + continue; + } One can say pr.isFunction() and then have pr.getArgument(index) to get the proper property name (which can be relative). Also for lower/upper was thinking that at Lucene level we should only support case insensitive index i.e. property values say stored always in lower case and comparison done after converting the value from query to lower case. QE can then enforce proper check as the resulting result set would be a super set
            thomasm Thomas Mueller added a comment -

            I understand that "case insensitive" queries are the most important use case, and that's the one we need to support first.

            But function-based indexes are very flexible and can be used for other cases as well, for example indexing the length of a property (for example length of a binary), indexing formulas such as the area (x*y if you have properties x and y), and so on. So function-based indexes don't necessarily only have one property name, there can be multiple, so that "getArgument" doesn't always make sense. I know that currently, the query languages don't support complex expressions (not even length(name()), but that can easily be changed in the parser.

            It would be best if keep some flexibility, and write the API so that future extensions are not awkward and break backward compatibility.

            > for lower/upper was thinking that at Lucene level we should only support case insensitive index

            I'm not sure what Lucene supports. Java supports String.compareToIgnoreCase, but I'm not sure if we can use that. For example assuming the condition is "upper(firstName) = 'john'", then there are no matches. It would be incorrect to use String.compareToIgnoreCase. Sure, we could log a warning that says the condition doesn't make much sense. But there can only be matches for "lower(firstName) = 'john'".

            thomasm Thomas Mueller added a comment - I understand that "case insensitive" queries are the most important use case, and that's the one we need to support first. But function-based indexes are very flexible and can be used for other cases as well, for example indexing the length of a property (for example length of a binary), indexing formulas such as the area (x*y if you have properties x and y), and so on. So function-based indexes don't necessarily only have one property name, there can be multiple, so that "getArgument" doesn't always make sense. I know that currently, the query languages don't support complex expressions (not even length(name()), but that can easily be changed in the parser. It would be best if keep some flexibility, and write the API so that future extensions are not awkward and break backward compatibility. > for lower/upper was thinking that at Lucene level we should only support case insensitive index I'm not sure what Lucene supports. Java supports String.compareToIgnoreCase, but I'm not sure if we can use that. For example assuming the condition is "upper(firstName) = 'john'", then there are no matches. It would be incorrect to use String.compareToIgnoreCase. Sure, we could log a warning that says the condition doesn't make much sense. But there can only be matches for "lower(firstName) = 'john'".
            thomasm Thomas Mueller added a comment -

            http://svn.apache.org/r1748364 (trunk; fixes broken integration tests)

            thomasm Thomas Mueller added a comment - http://svn.apache.org/r1748364 (trunk; fixes broken integration tests)
            thomasm Thomas Mueller added a comment -

            http://svn.apache.org/r1748390 (trunk; fixes a test case that depends on the Java version used)

            thomasm Thomas Mueller added a comment - http://svn.apache.org/r1748390 (trunk; fixes a test case that depends on the Java version used)
            thomasm Thomas Mueller added a comment -

            http://svn.apache.org/r1759820 (oak-core, mainly refactoring)
            http://svn.apache.org/r1759821 (oak-lucene, don't fail tests if only function conditions are there)

            thomasm Thomas Mueller added a comment - http://svn.apache.org/r1759820 (oak-core, mainly refactoring) http://svn.apache.org/r1759821 (oak-lucene, don't fail tests if only function conditions are there)
            thomasm Thomas Mueller added a comment - - edited

            OAK-3574-lucene.patch: patch for the Lucene index to support function based indexes.

            Tests: there are new test for this feature, and changes to integration test. For existing integration tests, the relevant function-based indexes are created. If an integration test uses a function restriction that is not indexes, the test fails because the index runs in "test mode".

            Index configuration: similar to indexing a property. For the condition (XPath) "fn:upper-case(@lastName) = 'x'", create a new node under the relevant Lucene index (compatVersion 2), as follows. To use it for "order by fn:upper-case(@lastName)", use "ordered":

            /oak:index/assetType
              + indexRules
                + nt:base
                  + properties
                    + upperLastName
                      - function = "fn:upper-case(@lastName)"
                      - ordered = true
            

            The syntax of the "function" property is similar to the relevant condition in the query. Both XPath and SQL are supported (even thought I would probably use XPath in practise). Examples for property "function":

            • fn:upper-case(@data)
            • fn:lower-case(test/@data)
            • fn:lower-case(fn:name())
            • fn:lower-case(fn:local-name())
            • fn:string-length(test/@data)
            • upper([data])
            • lower([test/data])
            • lower(name())
            • lower(localname())
            • length([test/data])
            • length(name())

            Indexing multi-valued properties are supported. Relative properties are supported (except for ".." and "."). Range conditions are supported ('>', '>=', '<=', '<').

            chetanm could you please review the patch?

            thomasm Thomas Mueller added a comment - - edited OAK-3574 -lucene.patch: patch for the Lucene index to support function based indexes. Tests: there are new test for this feature, and changes to integration test. For existing integration tests, the relevant function-based indexes are created. If an integration test uses a function restriction that is not indexes, the test fails because the index runs in "test mode". Index configuration: similar to indexing a property. For the condition (XPath) "fn:upper-case(@lastName) = 'x'", create a new node under the relevant Lucene index (compatVersion 2), as follows. To use it for "order by fn:upper-case(@lastName)", use "ordered": /oak:index/assetType + indexRules + nt:base + properties + upperLastName - function = "fn:upper-case(@lastName)" - ordered = true The syntax of the "function" property is similar to the relevant condition in the query. Both XPath and SQL are supported (even thought I would probably use XPath in practise). Examples for property "function": fn:upper-case(@data) fn:lower-case(test/@data) fn:lower-case(fn:name()) fn:lower-case(fn:local-name()) fn:string-length(test/@data) upper([data]) lower([test/data]) lower(name()) lower(localname()) length([test/data]) length(name()) Indexing multi-valued properties are supported. Relative properties are supported (except for ".." and "."). Range conditions are supported ('>', '>=', '<=', '<'). chetanm could you please review the patch?

            This looks much more powerful now!

            String name = functionIndex.function;
            String f = FunctionIndexParser.convertToPolishNotation(name);
            

            Instead of computing the polish notation in IndexPlanner this can be computed once and stored in PropertyDefinition itself.

            Currently the function evaluation logic in LuceneIndexEditor is handling relative property itself. This would work fine when the node gets indexed for first time or if some non function property got changed. But it would not work if the relative property which is encoded in function gets changed as editor would not be aware of that.

            Instead of that it would be better to use a separate field in property definition to capture function and then have property name determined from that. Then

            • In IndexDefinition#collectPropConfigs have the loop evaluate pd.relative part as that would then register this relative property.
            • And then you can implement the function evaluation logic in LuceneIndexEditor#indexProperty
            chetanm Chetan Mehrotra added a comment - This looks much more powerful now! String name = functionIndex.function; String f = FunctionIndexParser.convertToPolishNotation(name); Instead of computing the polish notation in IndexPlanner this can be computed once and stored in PropertyDefinition itself. Currently the function evaluation logic in LuceneIndexEditor is handling relative property itself. This would work fine when the node gets indexed for first time or if some non function property got changed. But it would not work if the relative property which is encoded in function gets changed as editor would not be aware of that. Instead of that it would be better to use a separate field in property definition to capture function and then have property name determined from that. Then In IndexDefinition#collectPropConfigs have the loop evaluate pd.relative part as that would then register this relative property. And then you can implement the function evaluation logic in LuceneIndexEditor#indexProperty
            thomasm Thomas Mueller added a comment -

            > this can be computed once

            You are right! I will fix that.

            > not work if the relative property which is encoded in function gets changed

            Good catch! I will add a test case and fix this. I will add the relative property / properties in collectPropConfigs to the propAggregate list. That way, a function can potentially have multiple parameters in the future (for example, a function that calculates the duration based on start and end date).

            thomasm Thomas Mueller added a comment - > this can be computed once You are right! I will fix that. > not work if the relative property which is encoded in function gets changed Good catch! I will add a test case and fix this. I will add the relative property / properties in collectPropConfigs to the propAggregate list. That way, a function can potentially have multiple parameters in the future (for example, a function that calculates the duration based on start and end date).
            thomasm Thomas Mueller added a comment -

            New patch with fixed issues. The code processing functions should now be a bit more encapsulated and easier to re-use (class FunctionIndexProcessor). Processing now uses a simple stack machine, so it should be easier to extend.

            chetanm could you review again please?

            thomasm Thomas Mueller added a comment - New patch with fixed issues. The code processing functions should now be a bit more encapsulated and easier to re-use (class FunctionIndexProcessor). Processing now uses a simple stack machine, so it should be easier to extend. chetanm could you review again please?
            thomasm Thomas Mueller added a comment -

            http://svn.apache.org/r1761025 (trunk) - committing the current patch, please revert if there is a problem

            thomasm Thomas Mueller added a comment - http://svn.apache.org/r1761025 (trunk) - committing the current patch, please revert if there is a problem

            Bulk close for 1.5.11

            edivad Davide Giannella added a comment - Bulk close for 1.5.11

            People

              thomasm Thomas Mueller
              thomasm Thomas Mueller
              Votes:
              13 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: