Spark / SPARK-4320

JavaPairRDD should supply a saveAsNewHadoopDataset which takes a Job object

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Input/Output, Spark Core
    • Labels: None

    Description

      I am writing data to Accumulo using a custom OutputFormat. I have tried saveAsNewAPIHadoopFile() and it works, though passing an empty path is a bit odd. Since what I'm storing isn't really a file but a generic pair dataset, I'd be inclined to use saveAsHadoopDataset() instead, except that I'm not at all interested in using the legacy mapred API.

      Perhaps we could supply a saveAsNewHadoopDataset method. Personally, I think there should be two ways of calling it. Rather than forcing the user to always set up the Job object explicitly, I'd favor the following method signature:

      saveAsNewHadoopDataset(keyClass : Class[K], valueClass : Class[V], ofclass : Class[_ <: OutputFormat[K, V]], conf : Configuration)

      This way, if I'm writing Spark jobs that go from Hadoop back into Hadoop, I can construct my Configuration once.

      Perhaps an overloaded method signature could be:

      saveAsNewHadoopDataset(job : Job)
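
      The two signatures proposed above might fit together as in the following minimal Java sketch. It is an illustration only, not Spark or Hadoop code: Configuration, OutputFormat, Job, PairDataset and ListOutputFormat are all hypothetical stand-ins defined here so the example runs without any dependencies, and the "table" key is an invented setting. The point it tries to show is that the Job-based overload can simply unpack the Job and delegate to the Configuration-based form.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SaveAsNewHadoopDatasetSketch {

    /** Hypothetical stand-in for Hadoop's Configuration: a plain string map. */
    static class Configuration {
        private final Map<String, String> props = new LinkedHashMap<>();
        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }

    /** Hypothetical stand-in for the new-API (mapreduce) OutputFormat. */
    interface OutputFormat<K, V> {
        void write(K key, V value, Configuration conf);
    }

    /** Hypothetical stand-in for a Job: the output classes plus a Configuration. */
    static class Job<K, V> {
        final Class<K> keyClass;
        final Class<V> valueClass;
        final Class<? extends OutputFormat<K, V>> outputFormatClass;
        final Configuration conf;

        Job(Class<K> keyClass, Class<V> valueClass,
            Class<? extends OutputFormat<K, V>> outputFormatClass, Configuration conf) {
            this.keyClass = keyClass;
            this.valueClass = valueClass;
            this.outputFormatClass = outputFormatClass;
            this.conf = conf;
        }
    }

    /** Hypothetical stand-in for JavaPairRDD, backed by an in-memory list. */
    static class PairDataset<K, V> {
        private final List<Map.Entry<K, V>> records = new ArrayList<>();

        PairDataset<K, V> add(K key, V value) {
            records.add(new AbstractMap.SimpleEntry<>(key, value));
            return this;
        }

        /** Proposed primary form: the caller builds one Configuration and reuses it. */
        void saveAsNewHadoopDataset(Class<K> keyClass, Class<V> valueClass,
                                    Class<? extends OutputFormat<K, V>> ofClass,
                                    Configuration conf) {
            OutputFormat<K, V> format;
            try {
                // Instantiate the output format reflectively from its class.
                format = ofClass.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot instantiate " + ofClass, e);
            }
            for (Map.Entry<K, V> r : records) {
                format.write(keyClass.cast(r.getKey()), valueClass.cast(r.getValue()), conf);
            }
        }

        /** Proposed overload: unpack the Job and delegate to the form above. */
        void saveAsNewHadoopDataset(Job<K, V> job) {
            saveAsNewHadoopDataset(job.keyClass, job.valueClass, job.outputFormatClass, job.conf);
        }
    }

    /** Toy output format that appends to a static list instead of a real store. */
    static class ListOutputFormat implements OutputFormat<String, Integer> {
        static final List<String> SINK = new ArrayList<>();

        @Override
        public void write(String key, Integer value, Configuration conf) {
            SINK.add(conf.get("table") + ":" + key + "=" + value);
        }
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("table", "demo"); // invented key, for illustration only

        PairDataset<String, Integer> pairs =
            new PairDataset<String, Integer>().add("a", 1).add("b", 2);

        // Form 1: classes plus a shared Configuration, no Job needed.
        pairs.saveAsNewHadoopDataset(String.class, Integer.class, ListOutputFormat.class, conf);

        // Form 2: everything bundled in a Job.
        pairs.saveAsNewHadoopDataset(
            new Job<String, Integer>(String.class, Integer.class, ListOutputFormat.class, conf));

        System.out.println(ListOutputFormat.SINK);
    }
}
```

      In this sketch both calls reach the same code path, so a caller who already holds a Configuration never has to build a Job, while a caller who has a Job can pass it directly.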

          People

            Assignee: Unassigned
            Reporter: Corey J. Nolet (sonixbp)
            Votes: 0
            Watchers: 2
