HCatalog / HCATALOG-237

Switch from using StorageDrivers to SerDes to do data (de)serialization

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.4
    • Component/s: None
    • Labels: None

    Description

      HCatalog started by creating its own classes, InputStorageDriver and OutputStorageDriver, to do data conversion between the storage layer Input/OutputFormats and the HCatInput/OutputFormats. These provide very similar functionality to Hive's SerDe class, though with a much simpler interface.

      Using separate classes has caused several problems for HCatalog. First, HCatalog cannot make use of existing Hive SerDes. Second, it requires HCat-specific extensions of Hive interfaces (such as the StorageHandler) to provide the StorageDescriptors. Third, users who already have Hive installed cannot use HCatalog without first updating every partition in their metastore with storage driver information.

      I propose we switch to using SerDes for this. To offset the more complicated SerDe interface, we can provide adaptor classes that make writing new SerDes easy in simple cases.
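
      To illustrate the adaptor idea, here is a minimal, self-contained Java sketch. It is not HCatalog or Hive code: RecordSerDe is a hypothetical stand-in for a fuller SerDe-style contract, SimpleSerDeAdaptor is a hypothetical adaptor that handles setup and validation, and the "columns" property key is an assumption made for the example. The point is only that a simple format then needs two small methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class SerDeAdaptorSketch {

    /** Hypothetical stand-in for a fuller SerDe-style contract; NOT Hive's actual interface. */
    interface RecordSerDe {
        void initialize(Properties tableProperties);
        List<String> getColumnNames();
        List<Object> deserialize(String raw);
        String serialize(List<Object> row);
    }

    /** Hypothetical adaptor: owns setup and validation so subclasses only convert data. */
    abstract static class SimpleSerDeAdaptor implements RecordSerDe {
        private List<String> columns = new ArrayList<>();

        @Override
        public void initialize(Properties tableProperties) {
            // "columns" is an assumed property key holding a comma-separated column list.
            String names = tableProperties.getProperty("columns", "");
            columns = new ArrayList<>();
            if (!names.isEmpty()) {
                columns.addAll(Arrays.asList(names.split(",")));
            }
        }

        @Override
        public List<String> getColumnNames() {
            return columns;
        }

        @Override
        public final List<Object> deserialize(String raw) {
            List<Object> row = parse(raw);
            checkWidth(row);
            return row;
        }

        @Override
        public final String serialize(List<Object> row) {
            checkWidth(row);
            return format(row);
        }

        // Reject rows whose field count does not match the declared columns.
        private void checkWidth(List<Object> row) {
            if (row.size() != columns.size()) {
                throw new IllegalArgumentException(
                    "expected " + columns.size() + " fields, got " + row.size());
            }
        }

        /** The only two methods a simple SerDe author has to write. */
        protected abstract List<Object> parse(String raw);
        protected abstract String format(List<Object> row);
    }

    /** Example simple case: tab-separated text, every field kept as a String. */
    static class TabSeparatedSerDe extends SimpleSerDeAdaptor {
        @Override
        protected List<Object> parse(String raw) {
            // Limit -1 keeps trailing empty fields.
            return new ArrayList<>(Arrays.asList(raw.split("\t", -1)));
        }

        @Override
        protected String format(List<Object> row) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    sb.append('\t');
                }
                sb.append(row.get(i));
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("columns", "id,name");

        RecordSerDe serde = new TabSeparatedSerDe();
        serde.initialize(props);

        List<Object> row = serde.deserialize("1\talice");
        System.out.println(row);
        System.out.println(serde.serialize(row));
    }
}
```

      In this sketch the adaptor fixes the lifecycle (initialize, then convert) and the width check, which is the kind of boilerplate a real adaptor for Hive's SerDe interface would presumably absorb as well; the actual adaptor classes, if written, may look quite different.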

        Issue Links

          1. Changes to HCatInputFormat to make it use SerDes instead of StorageDrivers (Sub-task, Closed, Vikram Dixit K)
          2. Changes to HCatOutputFormat to make it use SerDes instead of StorageDriver (Sub-task, Closed, Francis Christopher Liu)
          3. Changes to HCatRecord to support switch from StorageDriver to SerDe (Sub-task, Closed, Sushanth Sowmyan)
          4. CLI changes to remove checks and support for StorageDrivers (Sub-task, Closed, Sushanth Sowmyan)
          5. HCat e2e tests need to change to not use StorageDrivers (Sub-task, Closed, Alan Gates)
          6. Rework JSON StorageDriver into a JSON SerDe (Sub-task, Closed, Sushanth Sowmyan)
          7. Rework HBase storage driver into HBase storage handler (Sub-task, Closed, Rohini Palaniswamy)
          8. LazyHCatTuple introduction to prevent paying full cost of deserialization of LazyHCatRecord (Sub-task, Open, Sushanth Sowmyan)
          9. Make readFields() and write() in LazyHCatRecord work (Sub-task, Closed, Alan Gates)
          10. remove deprecated HCatStorageHandler (Sub-task, Closed, Francis Christopher Liu)
          11. Remove remnants of storage drivers. (Sub-task, Closed, Rohini Palaniswamy)
          12. HCatInputFormat shouldn't expect storageHandler to be serializable (Sub-task, Closed, Sushanth Sowmyan)
          13. only serialize OutputJobInfo into tableDesc.getJobProperties() when calling configureOutputJobProperties() (Sub-task, Reopened, Unassigned)
          14. Remove remaining code mentioning isd/osd (Sub-task, Closed, Daniel Dai)
          15. move setInputPath to FosterStorageHandler.configureInputProperties() (Sub-task, Open, Unassigned)
          16. InputJobInfo still uses serverUri and serverKerberosPrincipal (Sub-task, Closed, Sushanth Sowmyan)
          17. Rename storage-drivers directory to storage-handlers (fix packaging, etc) (Sub-task, Closed, Alan Gates)
          18. TableDesc and jobProperties related changes to configureInputJobProperties and configureOutputJobProperties (Sub-task, Resolved, Sushanth Sowmyan)

            People

              Assignee: Unassigned
              Reporter: Alan Gates
              Votes: 3
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: