  1. Apache Jena
  2. JENA-1770

Spilling bindings with OPTIONAL leads to wrong answers

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Jena 3.13.1
    • Fix Version/s: Jena 3.14.0
    • Component/s: ARQ
    • Labels: None

    Description

      A query like the following, in which some variables are only optionally bound, can return wrong answers when bindings are spilled to disk:

      PREFIX  foaf: <http://xmlns.com/foaf/0.1/>
      SELECT  ?name ?mbox
      WHERE
        { ?x  foaf:name  ?name
          OPTIONAL
            { ?x  foaf:mbox  ?mbox }
        }
      ORDER BY ASC(?mbox)
      

      This is only a problem when the ARQ.spillToDiskThreshold setting has been configured.

      The root cause is that BindingOutputStream emits a VARS row based on the first binding, but it doesn't emit a new VARS row when a subsequent binding contains additional variables.  

      The BindingOutputStream.needVars() method will cause a second VARS row to be emitted when a new binding is missing variables, but not when it has extras.  This logic may be inverted from what was intended.
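
      To make the suspected inversion concrete, here is a minimal sketch of the two checks. It is not the Jena source: variables are modelled as plain strings, and the class and method names (NeedVarsSketch, needVarsAsReported, needVarsCoveringExtras) are hypothetical. The second check is one possible repair, not necessarily the fix that was applied.

```java
import java.util.Set;

// Toy model of the header check described above; not the actual Jena code.
final class NeedVarsSketch {

    // Behaviour as described in this report: a new VARS row is requested only
    // when the binding is missing a variable that the current header lists.
    static boolean needVarsAsReported(Set<String> header, Set<String> bindingVars) {
        return !bindingVars.containsAll(header);
    }

    // One possible repair (an assumption): request a new VARS row when the
    // binding carries a variable that the current header does not list.
    static boolean needVarsCoveringExtras(Set<String> header, Set<String> bindingVars) {
        return !header.containsAll(bindingVars);
    }

    public static void main(String[] args) {
        Set<String> header = Set.of("1");        // header taken from the first binding
        Set<String> second = Set.of("1", "2");   // second binding adds variable ?2

        // The reported check sees nothing missing, so ?2 would be dropped silently.
        System.out.println("as reported: " + needVarsAsReported(header, second));
        // The alternative check notices the extra variable.
        System.out.println("covering extras: " + needVarsCoveringExtras(header, second));
    }
}
```

      Whether a binding with fewer variables than the header also needs a fresh VARS row depends on how the format represents unbound values, which this sketch does not model.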

      There's a TestDistinctDataBag test case below that reproduces the problem. It generates a spill file like this:

      VARS ?1 .
      "A" .
      "A" .
      

      when a correct spill file would be:

      VARS ?1 .
      "A" .
      VARS ?2 ?1 .
      "B" "A" .
      

      If you run it, you may notice that it fails with a spill threshold of 2 but passes with a higher threshold:

      @Test public void testOptionalVariables()
      {
          // Setup a situation where the second binding in a spill file binds more
          // variables than the first binding
          BindingMap binding1 = BindingFactory.create();
          binding1.add(Var.alloc("1"), NodeFactory.createLiteral("A"));
      
          BindingMap binding2 = BindingFactory.create();
          binding2.add(Var.alloc("1"), NodeFactory.createLiteral("A"));
          binding2.add(Var.alloc("2"), NodeFactory.createLiteral("B"));
      
          List<Binding> undistinct = Arrays.asList(binding1, binding2, binding1);
          List<Binding> control = Iter.toList(Iter.distinct(undistinct.iterator()));
          List<Binding> distinct = new ArrayList<>();
      
          DistinctDataBag<Binding> db = new DistinctDataBag<>(
                  new ThresholdPolicyCount<Binding>(2),
                  SerializationFactoryFinder.bindingSerializationFactory(),
                  new BindingComparator(new ArrayList<SortCondition>()));
          try
          {
              db.addAll(undistinct);
              Iterator<Binding> iter = db.iterator();
              while (iter.hasNext())
              {
                  distinct.add(iter.next());
              }
              Iter.close(iter);
          }
          finally
          {
              db.close();
          }
      
          assertEquals(control.size(), distinct.size());
          assertTrue(ResultSetCompare.equalsByTest(control, distinct, NodeUtils.sameTerm));
      }
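
      To see why the truncated spill file yields a wrong result when it is read back, the following sketch parses both spill files shown earlier using a simplified reader. It assumes only the row layout visible in this report (a "VARS ... ." row sets the columns, every other row lists values in column order); the class name SpillFileReadSketch is hypothetical, and this is not Jena's actual reader.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified reader for the row layout shown in this report; not Jena's parser.
final class SpillFileReadSketch {

    // Returns one map per data row, keyed by the variables of the most recent VARS row.
    static List<Map<String, String>> read(String text) {
        List<Map<String, String>> rows = new ArrayList<>();
        List<String> header = new ArrayList<>();
        for (String line : text.split("\n")) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens[0].isEmpty()) {
                continue; // skip blank lines
            }
            int n = tokens.length - 1; // ignore the trailing "." terminator
            if (tokens[0].equals("VARS")) {
                header = new ArrayList<>();
                for (int i = 1; i < n; i++) {
                    header.add(tokens[i]);
                }
            } else {
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < n && i < header.size(); i++) {
                    row.put(header.get(i), tokens[i]);
                }
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String buggy = "VARS ?1 .\n\"A\" .\n\"A\" .\n";
        String correct = "VARS ?1 .\n\"A\" .\nVARS ?2 ?1 .\n\"B\" \"A\" .\n";

        // In the buggy file both rows come back identical: the value of ?2 is gone.
        System.out.println(read(buggy));
        // In the correct file the second row still carries ?2.
        System.out.println(read(correct));
    }
}
```

      With the buggy file the two spilled bindings are indistinguishable after reading, which is consistent with the size mismatch the test asserts against.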
      

            People

              Assignee: Andy Seaborne
              Reporter: Shawn Smith

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m