[SPARK-16431] Add a unified method that accepts single instances to feature transformers and predictors - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: ML
Labels: None

Description

Current transformers in spark.ml can only operate on DataFrames and don't have a method that accepts single instances. A typical transformer has a User-Defined Function (udf) in its transform method which includes a set of operations on the features of a single instance:

val column_operation = udf {operations on single instance}

This per-instance logic could instead live in a new method that operates directly on a single instance (e.g. called transformInstance), with the udf calling that method:

def transformInstance(features: featureType): OutputType = {operations on single instance}
val column_operation = udf {transformInstance}
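
As a minimal sketch of this pattern, the example below defines a hypothetical SimpleTokenizer (not an existing spark.ml class, and not the code in the linked pull request) whose DataFrame-level transform wraps the same transformInstance function in a udf. The object name, the whitespace-splitting logic, and the inputCol/outputCol parameters are assumptions made for illustration only.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical transformer used only to illustrate the proposal.
object SimpleTokenizer {

  // All per-instance logic lives in one plain function:
  // lower-case the text and split it on runs of whitespace.
  def transformInstance(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // The DataFrame path reuses the same function through a udf,
  // so both paths produce identical results.
  def transform(df: DataFrame, inputCol: String, outputCol: String): DataFrame = {
    val columnOperation = udf(transformInstance _)
    df.withColumn(outputCol, columnOperation(col(inputCol)))
  }
}
```

With this layout, a caller that has a single document can call SimpleTokenizer.transformInstance directly, without building a DataFrame first.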

Predictors also don’t have a public method that makes predictions on single instances. transformInstance can easily be added to predictors as a wrapper around the internal predict method (which takes features as input).
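
The wrapper idea can be sketched without depending on Spark at all. The class below is a hypothetical stand-in for a prediction model (LinearModelSketch, its constructor, and the Array[Double] feature type are assumptions, not spark.ml API); it only shows a public transformInstance delegating to a non-public predict.

```scala
// Hypothetical stand-in for a prediction model; not a spark.ml class.
class LinearModelSketch(weights: Array[Double], intercept: Double) {

  // Internal prediction logic, hidden from callers as in the description above.
  protected def predict(features: Array[Double]): Double = {
    require(features.length == weights.length, "feature and weight sizes differ")
    intercept + features.zip(weights).map { case (x, w) => x * w }.sum
  }

  // Public single-instance entry point: a thin wrapper around predict.
  def transformInstance(features: Array[Double]): Double = predict(features)
}
```

Because transformInstance only forwards to predict, no prediction logic is duplicated.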

This simple change has (at least) three benefits.

1. Providing a low-latency transformation/prediction method to support machine learning applications that require real-time predictions. The current transform method has relatively high latency when transforming single instances or small batches because of the overhead introduced by DataFrame operations. I measured the latency of classifying a single instance from the 20 Newsgroups dataset using the current transform method and the proposed transformInstance. The ML pipeline contains a tokenizer, stop-word remover, TF hasher, IDF, scaler, and logistic regression. The table below shows the latency percentiles, in milliseconds, measured while classifying 700 documents.

Transformation Method   P50     P90     P99     Max
transform               31.44   39.43   67.75   126.97
transformInstance       0.16    0.38    1.16    3.2

transformInstance is about 200 times faster (0.16 ms versus 31.44 ms at P50) and can classify a document in less than a millisecond. Profiling transform shows that every transformer in the pipeline spends about 5 milliseconds on average in DataFrame-related operations when transforming a single instance. This implies that latency increases linearly with pipeline size, which can be problematic.

2. Increasing code readability and allowing easier debugging, since operations on rows are now combined into a function that can be tested independently of the higher-level transform method.

3. Adding flexibility to create new models: for example, see this comment on supporting new ensemble methods.
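
For readers who want to reproduce a latency table like the one in the first benefit, the following is one possible way to collect per-call timings and summarize them. It is not the benchmark code used for the numbers above; the LatencyPercentiles object, its method names, and the nearest-rank percentile definition are all assumptions.

```scala
// Hypothetical helper for summarizing per-call latencies.
object LatencyPercentiles {

  // Runs f once and returns the elapsed wall-clock time in milliseconds.
  def timeMillis[A](f: => A): Double = {
    val start = System.nanoTime()
    f
    (System.nanoTime() - start) / 1e6
  }

  // Nearest-rank percentile of an ascending, non-empty sample; p is in (0, 100].
  def percentile(sorted: IndexedSeq[Double], p: Double): Double = {
    require(sorted.nonEmpty && p > 0 && p <= 100, "need data and 0 < p <= 100")
    val rank = math.ceil(p * sorted.length / 100.0).toInt
    sorted(rank - 1)
  }

  // Produces the same columns as the table: P50, P90, P99 and Max.
  def summarize(latencies: Seq[Double]): Map[String, Double] = {
    val sorted = latencies.sorted.toIndexedSeq
    Map(
      "P50" -> percentile(sorted, 50),
      "P90" -> percentile(sorted, 90),
      "P99" -> percentile(sorted, 99),
      "Max" -> sorted.last
    )
  }
}
```

A benchmark could map each document to timeMillis { classify(doc) }, where classify is whichever single-instance or DataFrame path is being measured, and pass the resulting sequence to summarize.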

Issue Links

is related to

SPARK-10413 ML models should support prediction on single instances

Resolved

links to

[Github] Pull Request #14101 (husseinhazimeh)

People

Assignee: Unassigned

Reporter: Hussein Hazimeh

Votes: 1

Watchers: 4

Dates

Created: 07/Jul/16 23:53

Updated: 22/Jul/16 18:08

Resolved: 22/Jul/16 18:08
