[SPARK-20028] Implement NGrams aggregate function - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

This implements the `ngrams` aggregate expression, which Hive also provides. It makes use of the n-gram concept from natural language processing to analyze text.

Currently, Spark doesn't support the Hive UDAF GenericUDAFnGrams, so this is a missing feature.

An n-gram is a contiguous subsequence of n item(s) drawn from a given sequence. This expression finds the k most frequent n-grams from one or more sequences.

This expression has the signature `ngrams(children: Array[Array[String]] (or Array[String]), n: Int, k: Int, accuracy: Int)`. It can be used in conjunction with `sentences`, which splits a String column into arrays. The parameters are:
`children` is the 'given sequence' that n-grams are collected from;
`n` is the number of elements in each n-gram: size 1 is referred to as a "unigram", size 2 is a "bigram", size 3 is a "trigram", and so on;
`k` is the number of most frequent n-grams to return (the top k);
`accuracy` controls the memory used for frequency estimation: more memory gives more accurate frequency counts.
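To make the role of `n` concrete, here is a minimal Python sketch of sliding-window n-gram extraction from a single sequence. The function name `extract_ngrams` and the sample sentence are hypothetical and are not taken from the pull request; the sketch only illustrates the definition above, not Spark's or Hive's implementation.

```python
from typing import Iterator, Sequence, Tuple


def extract_ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous run of n tokens, left to right.

    Hypothetical helper for illustration only.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence of length m has m - n + 1 n-grams (none if m < n).
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


words = ["the", "quick", "brown", "fox"]
unigrams = list(extract_ngrams(words, 1))  # 4 one-element tuples
bigrams = list(extract_ngrams(words, 2))   # 3 adjacent pairs
trigrams = list(extract_ngrams(words, 3))  # 2 adjacent triples
```

A sequence shorter than `n` simply yields nothing, which is one reasonable way to treat short inputs; how the actual expression handles them is not stated in this issue.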

A simple example:
`SELECT ngrams(array("abc", "abc", "bcd", "abc", "bcd"), 2, 4);` returns
`[{["abc","bcd"]:2.0}, {["abc","abc"]:1.0}, {["bcd","abc"]:1.0}]`.

The input contains four 2-grams: `["abc", "abc"], ["abc", "bcd"], ["bcd", "abc"], ["abc", "bcd"]`. Of these, `["abc", "bcd"]` occurs twice and the other two 2-grams occur once each. Although k is 4, only three distinct 2-grams exist, so three entries are returned. Among the entries tied at one occurrence, `["abc","abc"]` comes alphabetically before `["bcd","abc"]`, so it is listed first.
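The counting and ordering in this example can be reproduced with a short Python sketch. It uses exact counting with `collections.Counter`, so it does not model the `accuracy` parameter or any approximate frequency estimation. The name `top_k_ngrams` is hypothetical, and the tie-break rule (alphabetical order among equal counts) is an assumption inferred from the example output.

```python
from collections import Counter


def top_k_ngrams(sequences, n, k):
    """Return up to k (ngram, count) pairs, most frequent first.

    Hypothetical helper: exact counts, no memory-bounded estimation.
    """
    counts = Counter()
    for tokens in sequences:
        # Slide a window of width n over each input sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    # Assumed ordering: higher count first, then alphabetical on ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


result = top_k_ngrams([["abc", "abc", "bcd", "abc", "bcd"]], 2, 4)
for ngram, count in result:
    print(list(ngram), float(count))
```

Because only three distinct 2-grams exist in the input, asking for the top 4 yields three pairs, in the same order as the output shown in the description.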

Issue Links

links to

[Github] Pull Request #17359 (gczsjdy)

People

Assignee: Unassigned

Reporter: Chenzhao Guo

Votes: 0

Watchers: 2

Dates

Created: 20/Mar/17 07:01

Updated: 21/May/19 04:15

Resolved: 21/May/19 04:15