Details
- Type: Test
- Status: Closed
- Priority: Major
- Resolution: Not A Problem
- Fix Version: 0.4
Description
I ran a test:
Preference records: 680,194
Distinct users: 23,246
Distinct items: 437,569
SIMILARITY_CLASS_NAME=SIMILARITY_COOCCURRENCE
maybePruneItemUserMatrixPath: 16.50 MB
weights: 13.80 MB
pairwiseSimilarity: 18.81 GB
Job RowSimilarityJob-RowWeightMapper-WeightedOccurrencesPerColumnReducer: took 32 seconds
Job RowSimilarityJob-CooccurrencesMapper-SimilarityReducer: took 4.30 hours
I think the reasons may be the following:
1) We use SequenceFileOutputFormat, which causes the job to run with at most n mappers or reducers concurrently (where n is the number of Hadoop nodes).
2) We store redundant information.
For example, the output of CooccurrencesMapper is (ItemIndexA, similarity), (ItemIndexA, ItemIndexB, similarity).
3) Some frequently used code
https://issues.apache.org/jira/browse/MAHOUT-467
4) Many local variables are allocated inside loops (needs confirmation).
For example, in class DistributedUncenteredZeroAssumingCosineVectorSimilarity:
@Override
public double weight(Vector v) {
  double length = 0.0;
  Iterator<Element> elemIterator = v.iterateNonZero();
  while (elemIterator.hasNext()) {
    double value = elemIterator.next().get(); // this one
    length += value * value;
  }
  return Math.sqrt(length);
}
5) Maybe we need to control the size of the cooccurrences.
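A minimal, self-contained sketch of the change point 4 suggests (declaring the loop-local once, outside the loop). It uses a plain Iterator&lt;Double&gt; instead of Mahout's Vector/Element types so it compiles on its own; whether the hoist actually changes anything under the JIT still needs confirmation, as noted above.

```java
import java.util.Arrays;
import java.util.Iterator;

public class WeightSketch {

    // Same computation as weight(Vector v) above, but with the
    // per-iteration local hoisted out of the loop, as point 4 proposes.
    static double weight(Iterator<Double> nonZeroValues) {
        double length = 0.0;
        double value; // declared once, outside the loop
        while (nonZeroValues.hasNext()) {
            value = nonZeroValues.next();
            length += value * value;
        }
        // Euclidean norm of the non-zero entries
        return Math.sqrt(length);
    }

    public static void main(String[] args) {
        // sqrt(3^2 + 4^2) = 5.0
        System.out.println(weight(Arrays.asList(3.0, 4.0).iterator()));
    }
}
```

Note that the HotSpot JIT typically keeps such a primitive local in a register either way, so this is a micro-optimization to verify with profiling rather than a guaranteed win.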
Issue Links
- duplicates MAHOUT-460: Add "maxPreferencesPerItemConsidered" option to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob (Closed)