Description
Many popular machine learning algorithms can be expressed in what's known as the statistical query model (SQM): they rely on aggregate statistics, not random access to the data. In the most common case, those statistics are aggregates of a function applied to each record of the dataset. Such queries map trivially to the map-reduce programming paradigm.
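As an illustrative example (not part of this JIRA), consider the gradient of a linear least-squares model: it is a sum of a per-record function, hence a single statistical query, and the map-reduce structure falls out directly. A minimal C# sketch, with hypothetical names:
{code}
using System.Linq;

public static class StatisticalQueryExample
{
    // One statistical query for linear least squares: the gradient of the
    // squared loss is a sum of a per-record function. The Select is the
    // "map" over records; the Aggregate is the "reduce". Assumes a
    // non-empty dataset.
    public static double[] Gradient(double[] w, (double[] x, double y)[] data)
    {
        return data
            .Select(r => PerRecordGradient(w, r.x, r.y))                   // map
            .Aggregate((a, b) => a.Zip(b, (u, v) => u + v).ToArray());     // reduce
    }

    private static double[] PerRecordGradient(double[] w, double[] x, double y)
    {
        double residual = w.Zip(x, (wi, xi) => wi * xi).Sum() - y;
        return x.Select(xi => residual * xi).ToArray();
    }
}
{code}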
However, most ML algorithms perform many such queries, one per iteration. This leads to inefficiencies on traditional map-reduce systems: each query turns into a job that needs to be scheduled, whose input needs to be read, and whose output needs to be persisted.
We propose Iterative Map Reduce Update (IMRU), a simple extension of the map-reduce abstraction that captures such programs in three functions (a code sketch follows the list):
- TMapOutput Map(TMapInput input) is a map function with side information. It is assumed to have access to the training data through other means; the input it receives is the mutable state of the computation, as produced by the Update function.
- TMapOutput Reduce(params TMapOutput[] mapOutputs) is a (pure) reduce function.
- Tuple<TMapInput,TResult> Update(TMapOutput mapOutput) takes the (aggregated) outputs of the Map functions and produces a new set of inputs for them, a result of the computation, or both. The computation terminates once no further TMapInput is produced.
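For concreteness, the three functions could be shaped as the following C# interfaces. This is only a sketch: the interface names (IMapFunction, IReduceFunction, IUpdateFunction) and parameter names are illustrative assumptions, not the final API.
{code}
using System;

// Hypothetical C# shape of the three IMRU functions described above.
public interface IMapFunction<TMapInput, TMapOutput>
{
    // Reads its partition of the training data through other means; the
    // argument is only the mutable computation state from Update.
    TMapOutput Map(TMapInput input);
}

public interface IReduceFunction<TMapOutput>
{
    // Pure aggregation of the mappers' outputs.
    TMapOutput Reduce(params TMapOutput[] mapOutputs);
}

public interface IUpdateFunction<TMapInput, TMapOutput, TResult>
{
    // Returns the next map input, a result, or both; a null next input
    // terminates the computation.
    Tuple<TMapInput, TResult> Update(TMapOutput mapOutput);
}
{code}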
As part of this work, we will introduce the IMRU API, a local (threaded) test harness, and an implementation on top of REEF. Actually getting the data into the mappers is out of scope here and will be addressed in a separate JIRA.
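To illustrate the intended semantics, here is a minimal sequential sketch of the harness loop, assuming the hypothetical interfaces above. The actual local harness will run the mappers on threads; this version only shows the iteration and termination behavior.
{code}
using System.Linq;

public static class LocalImruHarness
{
    // Drives Map -> Reduce -> Update until Update produces no further input.
    public static TResult Run<TMapInput, TMapOutput, TResult>(
        IMapFunction<TMapInput, TMapOutput>[] mappers,
        IReduceFunction<TMapOutput> reducer,
        IUpdateFunction<TMapInput, TMapOutput, TResult> updater,
        TMapInput initialInput) where TMapInput : class
    {
        TMapInput input = initialInput;
        TResult result = default(TResult);
        while (input != null)  // no further TMapInput means: terminate
        {
            var mapOutputs = mappers.Select(m => m.Map(input)).ToArray(); // map
            var reduced = reducer.Reduce(mapOutputs);                     // reduce
            var step = updater.Update(reduced);                           // update
            input = step.Item1;
            if (step.Item2 != null)
            {
                result = step.Item2;  // keep the most recent result, if any
            }
        }
        return result;
    }
}
{code}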
This JIRA serves as an umbrella for work leading to an IMRU implementation on REEF.
Issue Links
1. Implement Batch Gradient Descent on IMRU (Open, Unassigned)
2. Create an Example of IMRU with IPartitionedDataSet (Open, Unassigned)