Description
Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts latent topics from a text corpus. Unlike most existing machine learning algorithms in MLlib, which rely on optimization methods such as gradient descent, LDA is trained with approximate inference methods such as Gibbs sampling.
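For context, this is the standard collapsed Gibbs sampling update for LDA (textbook form, not taken from this PR): each token's topic assignment z_i is resampled from its full conditional

```latex
P(z_i = k \mid \mathbf{z}_{\neg i}, \mathbf{w})
  \propto \frac{n_{k,\neg i}^{(w_i)} + \beta}{n_{k,\neg i}^{(\cdot)} + V\beta}
          \left(n_{d,\neg i}^{(k)} + \alpha\right)
```

where n_k^{(w)} counts how often word w is assigned to topic k, n_d^{(k)} counts topic-k tokens in document d, the \neg i subscript excludes token i from the counts, V is the vocabulary size, and \alpha, \beta are the Dirichlet hyperparameters.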
In this PR, I provide an LDA implementation based on Gibbs sampling, together with a wholeTextFiles API (already resolved), a word segmenter (imported from Lucene), and the Gibbs sampling core; a sketch of that core follows below.
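The sketch below is a minimal, serial collapsed Gibbs sampler implementing the update above. The object name, method signature, and parameters are illustrative assumptions, not the API proposed in this PR:

```scala
import scala.util.Random

object GibbsLdaSketch {
  // docs: each document is an array of word ids in [0, vocabSize).
  // Returns the document-topic and topic-word count matrices,
  // from which theta and phi can be estimated.
  def run(docs: Array[Array[Int]], numTopics: Int, vocabSize: Int,
          alpha: Double, beta: Double, iterations: Int, seed: Long = 42L)
      : (Array[Array[Int]], Array[Array[Int]]) = {
    val rng = new Random(seed)
    val docTopic   = Array.ofDim[Int](docs.length, numTopics) // n_d^(k)
    val topicWord  = Array.ofDim[Int](numTopics, vocabSize)   // n_k^(w)
    val topicTotal = new Array[Int](numTopics)                // n_k^(.)
    // Random initial topic assignment z for every token.
    val z = docs.map(_.map(_ => rng.nextInt(numTopics)))
    for (d <- docs.indices; i <- docs(d).indices) {
      val w = docs(d)(i); val k = z(d)(i)
      docTopic(d)(k) += 1; topicWord(k)(w) += 1; topicTotal(k) += 1
    }
    for (_ <- 0 until iterations; d <- docs.indices; i <- docs(d).indices) {
      val w = docs(d)(i)
      // Remove the token's current assignment from the counts.
      val old = z(d)(i)
      docTopic(d)(old) -= 1; topicWord(old)(w) -= 1; topicTotal(old) -= 1
      // Full conditional P(z_i = k | rest), up to normalization.
      val p = Array.tabulate(numTopics) { k =>
        (topicWord(k)(w) + beta) / (topicTotal(k) + vocabSize * beta) *
          (docTopic(d)(k) + alpha)
      }
      // Sample the new topic proportionally to p.
      val u = rng.nextDouble() * p.sum
      var acc = 0.0; var k = -1
      do { k += 1; acc += p(k) } while (acc < u && k < numTopics - 1)
      z(d)(i) = k
      docTopic(d)(k) += 1; topicWord(k)(w) += 1; topicTotal(k) += 1
    }
    (docTopic, topicWord)
  }
}
```

On Spark, the corpus itself could be loaded with sc.wholeTextFiles(path), which yields an RDD of (filename, content) pairs to feed into the tokenizer; distributing the counts across partitions is the part this PR's implementation addresses.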
Algorithm survey from Pedro: https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing
API design doc from Joseph: https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing
Issue Links
- duplicates
  - SPARK-953 Latent Dirichlet Association (LDA model) (Resolved)
- is related to
  - SPARK-5556 Latent Dirichlet Allocation (LDA) using Gibbs sampler (Resolved)
- relates to
  - SPARK-2199 Distributed probabilistic latent semantic analysis in MLlib (Resolved)
- links to