[SPARK-1486] Support multi-model training in MLlib - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Auto Closed
Affects Version/s: None
Fix Version/s: None
Component/s: MLlib
Labels:
- bulk-closed

Description

It is rare in practice to train just one model with a given set of parameters. Usually, multiple models are trained with different sets of parameters, and the best one is selected based on its performance on the validation set. MLlib should provide native support for multi-model training/scoring. This requires decoupling concepts such as problem, formulation, algorithm, parameter set, and model, which are currently missing in MLlib. MLI implements similar concepts, which we can borrow. There are different approaches for multi-model training:

0) Keep one copy of the data, and train models one after another (or maybe in parallel, depending on the scheduler).

1) Keep one copy of the data, and train multiple models at the same time (similar to `runs` in KMeans).

2) Make multiple copies of the data (still stored in a distributed fashion), and use more cores to distribute the work.

3) Collect the data, make the entire dataset available on workers, and train one or more models on each worker.
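As a rough illustration of mode 1), the sketch below trains several models during the same scan of the data. It is plain Python rather than MLlib code, and everything in it is an assumption made for illustration: the function name `train_ridge_models`, the 1-D ridge regression model, and the gradient-descent settings.

```python
# Hypothetical sketch of mode 1): one scan of the data updates every model.
# None of these names are MLlib API; they exist only for this example.

def train_ridge_models(data, reg_params, step_size=0.1, num_iterations=200):
    """Fit one 1-D ridge regression weight per regularization parameter.

    data: list of (x, y) pairs; each model predicts y as w * x.
    Returns a list of weights aligned with reg_params.
    """
    n = len(data)
    weights = [0.0] * len(reg_params)
    for _ in range(num_iterations):
        # A single pass over the data accumulates the gradient of all models.
        grads = [0.0] * len(reg_params)
        for x, y in data:
            for k, w in enumerate(weights):
                grads[k] += (w * x - y) * x
        # Each model then takes its own regularized gradient step.
        weights = [
            w - step_size * (g / n + lam * w)
            for w, g, lam in zip(weights, grads, reg_params)
        ]
    return weights


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
reg_params = [0.0, 0.1, 1.0]
weights = train_ridge_models(data, reg_params)
```

The data is read once per iteration regardless of how many parameter sets are being fitted, which is the property that makes this mode attractive when scanning the data dominates the cost.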

Users should be able to choose which execution mode they want to use. Note that 3) could cover many use cases in practice when the training data is not huge, e.g., <1GB.
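For small datasets, mode 3) combined with validation-based selection might look like the following sketch. It is not MLlib code: a thread pool stands in for the workers, every worker sees the full in-memory dataset, and the names `fit_ridge`, `validation_error`, and `select_best` as well as the closed-form 1-D ridge model are assumptions chosen to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of mode 3): the whole dataset is available to each
# worker, and each worker trains and scores one model per parameter set.
# A thread pool is used here only as a stand-in for cluster workers.

def fit_ridge(train, lam):
    """Closed-form 1-D ridge regression: minimizes mean squared error
    plus lam * w^2 for the model y = w * x."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam * len(train))


def validation_error(w, validation):
    """Mean squared error of the model y = w * x on the validation set."""
    return sum((w * x - y) ** 2 for x, y in validation) / len(validation)


def select_best(train, validation, reg_params, num_workers=2):
    """Train one model per parameter and return (error, lam, weight)
    for the model with the lowest validation error."""
    def fit_and_score(lam):
        w = fit_ridge(train, lam)
        return validation_error(w, validation), lam, w

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(fit_and_score, reg_params))
    return min(results)


train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
validation = [(4.0, 8.0), (5.0, 10.0)]
best_error, best_lam, best_w = select_best(train, validation, [0.0, 0.1, 1.0])
```

Because each task is independent, the same structure would apply if the pool were replaced by real workers holding a local copy of the data.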

This task will be divided into sub-tasks. This JIRA was created to discuss the design and track the overall progress.

Issue Links

relates to

- SPARK-5256 Improving MLlib optimization APIs (Resolved)
- SPARK-18303 CLONE - Improving MLlib optimization APIs (Resolved)

links to

[Github] Pull Request #2451 (brkyvz)

People

Assignee: Burak Yavuz

Reporter: Xiangrui Meng

Votes: 8

Watchers: 17

Dates

Created: 14/Apr/14 07:25

Updated: 06/Jun/19 13:57

Resolved: 06/Jun/19 13:57