Spark / SPARK-1405

Parallel Latent Dirichlet Allocation (LDA) on top of Spark in MLlib


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: MLlib

    Description

      Latent Dirichlet Allocation (LDA) is a topic model that extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which rely on optimization algorithms such as gradient descent, LDA uses inference algorithms such as Gibbs sampling.
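
      For illustration only, here is a minimal single-machine sketch of the per-token collapsed Gibbs sampling update such a sampler performs. All names and count tables below are illustrative assumptions, not part of the proposed implementation:

      import scala.util.Random

      // Collapsed Gibbs update for one token, assuming its current topic
      // assignment has already been subtracted from the count tables.
      // alpha and beta are symmetric Dirichlet priors on the document-topic
      // and topic-word distributions respectively.
      def sampleTopic(
          doc: Int,
          word: Int,
          numTopics: Int,
          vocabSize: Int,
          alpha: Double,
          beta: Double,
          nDocTopic: Array[Array[Int]],   // document x topic counts
          nTopicWord: Array[Array[Int]],  // topic x word counts
          nTopic: Array[Int],             // total tokens assigned to each topic
          rng: Random): Int = {
        // Unnormalized conditional probability p(z = k | everything else)
        val weights = Array.tabulate(numTopics) { k =>
          (nDocTopic(doc)(k) + alpha) *
            (nTopicWord(k)(word) + beta) / (nTopic(k) + vocabSize * beta)
        }
        // Draw a topic index proportionally to the weights
        var u = rng.nextDouble() * weights.sum
        var k = 0
        while (k < numTopics - 1 && u > weights(k)) {
          u -= weights(k)
          k += 1
        }
        k
      }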

      In this PR, I provide an LDA implementation based on Gibbs sampling, together with a wholeTextFiles API (already resolved separately), a word segmentation tool (imported from Lucene), and a Gibbs sampling core.

      Algorithm survey from Pedro: https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing
      API design doc from Joseph: https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing
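
      For reference, a hedged sketch of how the LDA API that eventually shipped in MLlib (fix version 1.3.0) can be driven from Scala. The corpus is assumed to already be an RDD of (document id, term-count vector) pairs, e.g. built from text loaded with wholeTextFiles; the topic count and iteration count are arbitrary example values:

      import org.apache.spark.mllib.clustering.LDA
      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.rdd.RDD

      // corpus: (document id, term-count vector) pairs, e.g. produced by
      // tokenizing text read via sc.wholeTextFiles(...) and counting terms.
      def extractTopics(corpus: RDD[(Long, Vector)]): Unit = {
        val ldaModel = new LDA()
          .setK(10)               // number of topics
          .setMaxIterations(20)
          .run(corpus)
        // For each topic, print the top terms (as vocabulary indices) and weights
        ldaModel.describeTopics(5).zipWithIndex.foreach {
          case ((termIndices, termWeights), topic) =>
            println(s"Topic $topic: " + termIndices.zip(termWeights).mkString(", "))
        }
      }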

      Attachments

        1. performance_comparison.png (48 kB, Guoqiang Li)

            People

              Assignee: Joseph K. Bradley (josephkb)
              Reporter: Xusen Yin (yinxusen)
              Shepherd: Xiangrui Meng
              Votes: 6
              Watchers: 32

                Time Tracking

                  Estimated: 336h
                  Remaining: 336h
                  Logged: Not Specified