[MAHOUT-308] Improve Lanczos to handle extremely large feature sets (without hashing) - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.3
Fix Version/s: 0.5
Component/s: classic
Labels: None
Environment: all

Description

DistributedLanczosSolver currently keeps all Lanczos vectors in memory on the driver (client) computer while Hadoop is iterating. The memory requirement is (desiredRank) * (numColumnsOfInput) * 8 bytes, which, for desiredRank of a few hundred, caps usefulness at a small multiple of a million columns on most commodity hardware.
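To make the formula above concrete, here is a minimal sketch of the arithmetic. The class and method names are hypothetical and not part of Mahout; the only assumption is that each basis vector is dense and stored as 8-byte doubles, as the description states.

```java
/** Hypothetical helper, not part of Mahout: evaluates the formula from the description. */
public final class LanczosMemoryEstimate {
    private LanczosMemoryEstimate() {}

    /**
     * Bytes needed to hold desiredRank dense vectors of numColumns doubles each.
     * multiplyExact throws ArithmeticException instead of silently overflowing.
     */
    public static long basisBytes(long desiredRank, long numColumns) {
        long entries = Math.multiplyExact(desiredRank, numColumns);
        return Math.multiplyExact(entries, (long) Double.BYTES);
    }

    public static void main(String[] args) {
        // Example figures chosen for illustration: rank 300, ten million columns.
        long bytes = basisBytes(300, 10_000_000L);
        System.out.println(bytes + " bytes");
    }
}
```

With desiredRank = 300 and ten million columns, this comes to 24,000,000,000 bytes (24 GB) for the basis alone, which is beyond the heap of a typical driver machine.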

The solution (without doing stochastic decomposition) is to persist the Lanczos basis to disk, except for the two most recent vectors. Some care must be taken in the "orthogonalizeAgainstBasis()" method call, which uses the entire basis; with the basis on disk, this step would be slower because every stored vector has to be read back.
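The idea of a disk-resident basis could be sketched roughly as follows. This is not the attached patch and not Mahout's API: DiskBackedBasis, inTempFile, append and orthogonalize are hypothetical names, plain double[] arrays stand in for Mahout vectors, a local temp file stands in for whatever storage the driver would actually use, and stored vectors are assumed to be unit length. The caller would still hold the two most recent vectors in memory for the Lanczos recurrence itself.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch, not part of Mahout: a Lanczos basis kept on local disk. */
public final class DiskBackedBasis {
    private final Path file;
    private final int dimension;
    private int size;

    private DiskBackedBasis(Path file, int dimension) {
        this.file = file;
        this.dimension = dimension;
    }

    /** Creates an empty basis backed by a fresh temp file (an assumption of this sketch). */
    public static DiskBackedBasis inTempFile(int dimension) {
        try {
            Path file = Files.createTempFile("lanczos-basis", ".bin");
            file.toFile().deleteOnExit();
            return new DiskBackedBasis(file, dimension);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public int size() {
        return size;
    }

    /** Appends one basis vector (assumed unit length) to the end of the file. */
    public void append(double[] v) {
        if (v.length != dimension) {
            throw new IllegalArgumentException("expected length " + dimension);
        }
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(
                Files.newOutputStream(file, StandardOpenOption.APPEND)))) {
            for (double x : v) {
                out.writeDouble(x);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        size++;
    }

    /**
     * Removes from v, in place, its component along every stored vector.
     * Only one stored vector is held in memory at a time; the cost is a full
     * sequential read of the basis file per call.
     */
    public void orthogonalize(double[] v) {
        double[] basisVector = new double[dimension];
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(
                Files.newInputStream(file)))) {
            for (int k = 0; k < size; k++) {
                for (int i = 0; i < dimension; i++) {
                    basisVector[i] = in.readDouble();
                }
                double dot = 0.0;
                for (int i = 0; i < dimension; i++) {
                    dot += basisVector[i] * v[i];
                }
                for (int i = 0; i < dimension; i++) {
                    v[i] -= dot * basisVector[i];
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Memory use in this sketch is one scratch vector plus the vector being orthogonalized, regardless of rank. The trade-off named in the description shows up directly: each orthogonalization pass reads size * dimension * 8 bytes from disk.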

Attachments

MAHOUT-308.patch
17/Jun/10 15:38
14 kB
Danny Leshem

People

Assignee: Jake Mannix

Reporter: Jake Mannix

Votes: 2

Watchers: 4

Dates

Created: 24/Feb/10 23:01

Updated: 31/Jan/24 22:16

Resolved: 31/Mar/11 13:53