HBASE-8369

MapReduce over snapshot files

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.98.0
    • Component/s: mapreduce, snapshots
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      Added TableSnapshotInputFormat and TableSnapshotScanner for scanning HBase table snapshots from the client side, bypassing the HBase servers. The former configures a MapReduce job, while the latter performs a single client-side scan over the snapshot files. Both can also be used when HBase is offline, with in-place or exported snapshot files.

      WARNING: This feature bypasses HBase-level security completely, because the files are read directly from HDFS. The user running the scan or job must have read permission on the data files and snapshot files.


    Description

      The idea is to add an InputFormat that runs a MapReduce job directly over snapshot files, bypassing the HBase server layer. The InputFormat is similar in usage to TableInputFormat and takes a Scan object from the user, but it reads from a table snapshot instead of an online table. We create one split per region in the snapshot and open an HRegion inside the RecordReader. A RegionScanner performs the scan internally, without any HRegionServer bits.
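      The split planning described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the Region and Split classes, the planSplits and readSplit methods, and the sample row keys are all hypothetical stand-ins written in plain Java. None of it uses the HBase API or reflects the code in the attached patches; an in-memory sorted map takes the place of the snapshot's HFiles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: Region, Split and the methods below are hypothetical
// stand-ins and do not use or mirror the HBase API.
final class SnapshotSplitSketch {

    /** Row-key range of one region: startKey inclusive, endKey exclusive ("" = unbounded). */
    static final class Region {
        final String name;
        final String startKey;
        final String endKey;

        Region(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** One unit of map-task work: a single region of a named snapshot. */
    static final class Split {
        final String snapshotName;
        final Region region;

        Split(String snapshotName, Region region) {
            this.snapshotName = snapshotName;
            this.region = region;
        }
    }

    /** Plans exactly one split per region recorded in the snapshot. */
    static List<Split> planSplits(String snapshotName, List<Region> regions) {
        List<Split> splits = new ArrayList<>();
        for (Region region : regions) {
            splits.add(new Split(snapshotName, region));
        }
        return splits;
    }

    /** Stands in for the RecordReader: returns the row keys inside the split's region. */
    static List<String> readSplit(Split split, NavigableMap<String, String> snapshotRows) {
        Region r = split.region;
        NavigableMap<String, String> inRegion = r.endKey.isEmpty()
                ? snapshotRows.tailMap(r.startKey, true)
                : snapshotRows.subMap(r.startKey, true, r.endKey, false);
        return new ArrayList<>(inRegion.keySet());
    }

    public static void main(String[] args) {
        List<Region> regions = Arrays.asList(
                new Region("r1", "", "g"),
                new Region("r2", "g", "p"),
                new Region("r3", "p", ""));
        NavigableMap<String, String> rows = new TreeMap<>();
        for (String key : Arrays.asList("apple", "grape", "kiwi", "pear", "zebra")) {
            rows.put(key, "value-of-" + key);
        }
        // Each split is read independently, as a separate map task would do.
        for (Split split : planSplits("daily_snapshot", regions)) {
            System.out.println(split.region.name + " -> " + readSplit(split, rows));
        }
    }
}
```

      Because region key ranges do not overlap, each row is read by exactly one split, which is what lets the map tasks work independently without coordinating through a region server. For real jobs, the TableSnapshotInputFormat Javadoc documents the actual entry points.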

      Users have been asking for ways to run MR jobs that read directly from HFiles, so this enables new use cases where reading stale data is acceptable:

      • Take snapshots periodically, and run MR jobs only on snapshots.
      • Export snapshots to a remote HDFS cluster and run the MR jobs there, without an HBase cluster.
      • (Future use case) Combine snapshot data with online HBase data: scan from yesterday's snapshot, but read today's data from the online HBase cluster.

      Attachments

        1. HBASE-8369-trunk_v3.patch
          24 kB
          Bryan Keller
        2. HBASE-8369-trunk_v2.patch
          24 kB
          Bryan Keller
        3. HBASE-8369-trunk_v1.patch
          24 kB
          Bryan Keller
        4. HBASE-8369-0.94.patch
          23 kB
          Bryan Keller
        5. HBASE-8369-0.94_v5.patch
          24 kB
          Bryan Keller
        6. HBASE-8369-0.94_v4.patch
          24 kB
          Bryan Keller
        7. HBASE-8369-0.94_v3.patch
          24 kB
          Bryan Keller
        8. HBASE-8369-0.94_v2.patch
          24 kB
          Bryan Keller
        9. hbase-8369_v9.patch
          150 kB
          Enis Soztutar
        10. hbase-8369_v8.patch
          151 kB
          Enis Soztutar
        11. hbase-8369_v7.patch
          151 kB
          Enis Soztutar
        12. hbase-8369_v6.patch
          148 kB
          Enis Soztutar
        13. hbase-8369_v5.patch
          160 kB
          Enis Soztutar
        14. hbase-8369_v11.patch
          152 kB
          Enis Soztutar
        15. hbase-8369_v0.patch
          73 kB
          Enis Soztutar

            People

              Assignee: Enis Soztutar
              Reporter: Enis Soztutar
              Votes: 2
              Watchers: 39
