Details
- Bug
- Status: Resolved
- Minor
- Resolution: Fixed
- Lucene.Net 2.9.2
- None
- Irrelevant
Description
Right now, CharArraySet derives from System.Collections.Hashtable, but doesn't actually use this base type for storing elements.
However, StandardAnalyzer.STOP_WORDS_SET is exposed as a System.Collections.Hashtable. The obvious code for building your own stop word set, starting from StandardAnalyzer.STOP_WORDS_SET and adding your own stop words like this:
// Copy the standard stop words, then add the domain-specific ones.
CharArraySet myStopWords = new CharArraySet(StandardAnalyzer.STOP_WORDS_SET, ignoreCase: false);
foreach (string domainSpecificStopWord in DomainSpecificStopWords)
    myStopWords.Add(domainSpecificStopWord);
... will fail silently: the CharArraySet constructor accepts an ICollection, so STOP_WORDS_SET is enumerated through its Hashtable base type, which holds no elements. The resulting myStopWords will contain only the DomainSpecificStopWords, and not the words from STOP_WORDS_SET.
One workaround would be to replace the first line with this:
CharArraySet myStopWords = new CharArraySet(StandardAnalyzer.STOP_WORDS_SET.Count + DomainSpecificStopWords.Length, ignoreCase: false);
// The cast makes foreach bind to CharArraySet.GetEnumerator() instead of the Hashtable one.
foreach (string standardStopWord in (CharArraySet)StandardAnalyzer.STOP_WORDS_SET)
    myStopWords.Add(standardStopWord);
... but this relies on an implementation detail (STOP_WORDS_SET is really an UnmodifiableCharArraySet, which is itself a CharArraySet). It works because the cast forces the foreach to use the correct CharArraySet.GetEnumerator(), which is declared as a "new" method that hides the Hashtable one (this has a bad code smell to it).
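The behaviour described above can be reproduced without Lucene.Net. The sketch below is only an illustration: HidingSet, AddWord and the Demo helpers are hypothetical names, not Lucene.Net types. It assumes a class that derives from Hashtable, keeps its elements in a private list, and hides GetEnumerator() with "new", which is the shape the description attributes to CharArraySet.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical stand-in for CharArraySet: derives from Hashtable but stores
// its elements in a private list, so the Hashtable base stays empty.
class HidingSet : Hashtable
{
    private readonly List<string> items = new List<string>();

    public void AddWord(string word) { items.Add(word); }

    // "new" hides Hashtable.GetEnumerator() instead of overriding it, so
    // callers that only see ICollection/IEnumerable never reach this method.
    public new IEnumerator GetEnumerator() { return items.GetEnumerator(); }
}

static class Demo
{
    // Enumerates through the interface: resolves to Hashtable's enumerator.
    public static int CountViaInterface(ICollection c)
    {
        int n = 0;
        foreach (object o in c) n++;
        return n;
    }

    // Enumerates through the concrete type: foreach binds to the "new" method.
    public static int CountViaCast(Hashtable exposed)
    {
        int n = 0;
        foreach (string s in (HidingSet)exposed) n++;
        return n;
    }

    static void Main()
    {
        var set = new HidingSet();
        set.AddWord("the");
        set.AddWord("and");

        Hashtable exposed = set; // the set as seen through its Hashtable base type
        Console.WriteLine(CountViaInterface(exposed)); // 0
        Console.WriteLine(CountViaCast(exposed));      // 2
    }
}
```

Any code that receives the set as an ICollection, as a copy constructor would, sees the empty base collection; only code that knows the concrete type sees the elements.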
At least 2 possibilities exist to solve this problem:
- Make CharArraySet use the Hashtable instance and a custom comparator, instead of its own implementation.
- Make CharArraySet use HashSet<char[]>, available since .NET 3.5.
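As a rough sketch of what the second possibility might involve: arrays compare by reference in .NET, so a HashSet<char[]> would need a content-based comparer. CharArrayComparer and StopWordDemo below are hypothetical names for illustration only, and the case folding via ToLowerInvariant is an assumption, not what Lucene.Net does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: treats char[] keys as equal when their contents
// match, optionally ignoring case.
sealed class CharArrayComparer : IEqualityComparer<char[]>
{
    private readonly bool ignoreCase;

    public CharArrayComparer(bool ignoreCase) { this.ignoreCase = ignoreCase; }

    private char Fold(char c) { return ignoreCase ? char.ToLowerInvariant(c) : c; }

    public bool Equals(char[] x, char[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
            if (Fold(x[i]) != Fold(y[i])) return false;
        return true;
    }

    // Must agree with Equals: hash the folded characters, not the reference.
    public int GetHashCode(char[] obj)
    {
        if (obj == null) return 0;
        int hash = 17;
        unchecked
        {
            foreach (char c in obj) hash = hash * 31 + Fold(c);
        }
        return hash;
    }
}

static class StopWordDemo
{
    public static HashSet<char[]> Build(IEnumerable<string> words, bool ignoreCase)
    {
        var set = new HashSet<char[]>(new CharArrayComparer(ignoreCase));
        foreach (string w in words) set.Add(w.ToCharArray());
        return set;
    }

    static void Main()
    {
        HashSet<char[]> set = Build(new[] { "the", "and" }, true);
        Console.WriteLine(set.Contains("THE".ToCharArray())); // True
        Console.WriteLine(set.Count);                         // 2
    }
}
```

Because the elements live in the set itself, enumerating it through any of its interfaces yields the stored words, which avoids the hidden-enumerator problem.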