[COLLECTIONS-547] Performance issue in SetUtils::isEqualSet - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 4.0
Fix Version/s: None
Component/s: Set
Labels:
- perfomance
Environment:

Ubuntu 14.04

Description

There seems to be a performance problem with the function isEqualSet of
SetUtils when the first parameter is of a collection type that has a slow containsAll/contains method.

The following is the code of the function:

    /**
     * Tests two sets for equality as per the <code>equals()</code> contract
     * in {@link java.util.Set#equals(java.lang.Object)}.
     * <p>
     * This method is useful for implementing <code>Set</code> when you cannot
     * extend AbstractSet. The method takes Collection instances to enable other
     * collection types to use the Set implementation algorithm.
     * <p>
     * The relevant text (slightly paraphrased as this is a static method) is:
     * <blockquote>
     * <p>Two sets are considered equal if they have
     * the same size, and every member of the first set is contained in
     * the second. This ensures that the <tt>equals</tt> method works
     * properly across different implementations of the <tt>Set</tt>
     * interface.</p>
     *
     * <p>
     * This implementation first checks if the two sets are the same object:
     * if so it returns <tt>true</tt>.  Then, it checks if the two sets are
     * identical in size; if not, it returns false. If so, it returns
     * <tt>a.containsAll((Collection) b)</tt>.</p>
     * </blockquote>
     *
     * @see java.util.Set
     * @param set1  the first set, may be null
     * @param set2  the second set, may be null
     * @return whether the sets are equal by value comparison
     */
    public static boolean isEqualSet(final Collection<?> set1, final Collection<?> set2) {
        if (set1 == set2) {
            return true;
        }
        if (set1 == null || set2 == null || set1.size() != set2.size()) {
            return false;
        }

        return set1.containsAll(set2);
    }

The problem is that in the last return statement, the function relies on the
containsAll method of the class of the set1, which can be any type of collection.

The following test harness shows the performance degradation between
using a list and using a set as a first parameter, across different collection sizes.

    public static void 	setUtilsisEqualSetTest(boolean original) {

	int N=500000;
	ArrayList<Integer> firstArrayList=new ArrayList<Integer>();
	ArrayList<Integer> secondArrayList=new ArrayList<Integer>();
	for(int i=0;i<N;++i) {

	    firstArrayList.add(new Integer(i));
	    secondArrayList.add(new Integer((N-1)-i));
	    
	}

	SetUtils.isEqualSet(original?firstArrayList:(new HashSet<Integer>(firstArrayList)),secondArrayList);

In the following table "Original" corresponds to the time taken by
the existing implementation of SetUtils::isEqualSet, "Repaired" to the time
taken by the function invoked with the first parameter converted into a
set, and "Speed-up" to the speed-up obtained after the repair.

N Original(s) Repaired(s) Speed-up(X)

10 1.01 1.02 0.99
100 1.02 1.02 1
1000 1.04 1.04 1
10000 1.15 1.09 1.05
100000 9.33 1.11 8.40
500000 > 300 1.26 > 238.09

One way to deal with this would be to call the CollectionUtils::containsAll method instead of Collection::containsAll, since it has a linear time implementation instead of quadratic, and can handle the types of isEqualSet.

Another solution would be to include a warning to the user in the documentation that the first parameter should have a fast containment method when possible.

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Oswaldo Olivo

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 21/Jan/15 21:10

Updated:: 27/Nov/15 21:47

Resolved:: 22/Jan/15 09:18