Description
Below is a sample code snippet used to fetch data from HBase. It worked fine with spark-3.1.1.
However, after upgrading to spark-3.2.0 it no longer works. The issue is that it does not throw any exception; it simply returns an empty RDD.
def getInfo(sc: SparkContext, startDate: String, cachingValue: Int, sparkLoggerParams: SparkLoggerParams, zkIP: String, zkPort: String): RDD[(String)] = {
  val scan = new Scan
  scan.addFamily(Bytes.toBytes("family"))
  scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("time"))
  val rdd = getHbaseConfiguredRDDFromScan(sc, zkIP, zkPort, "myTable", scan, cachingValue, sparkLoggerParams)
  val output: RDD[(String)] = rdd.map { row =>
    Bytes.toString(row._2.getRow)
  }
  output
}

def getHbaseConfiguredRDDFromScan(sc: SparkContext, zkIP: String, zkPort: String, tableName: String,
    scan: Scan, cachingValue: Int, sparkLoggerParams: SparkLoggerParams): NewHadoopRDD[ImmutableBytesWritable, Result] = {
  scan.setCaching(cachingValue)
  // Serialize the Scan so it can be passed to TableInputFormat via the Hadoop configuration
  val scanString = Base64.getEncoder.encodeToString(org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(scan).toByteArray)
  val hbaseContext = new SparkHBaseContext(zkIP, zkPort)
  val hbaseConfig = hbaseContext.getConfiguration()
  hbaseConfig.set(TableInputFormat.INPUT_TABLE, tableName)
  hbaseConfig.set(TableInputFormat.SCAN, scanString)
  sc.newAPIHadoopRDD(
    hbaseConfig,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result]
  ).asInstanceOf[NewHadoopRDD[ImmutableBytesWritable, Result]]
}
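Based on the linked HBASE-28219, the behaviour change appears to come from Spark 3.2.0 switching the default of spark.hadoopRDD.ignoreEmptySplits to true, which silently drops HBase TableInputFormat splits because they report a length of 0. Below is a minimal sketch of a possible workaround; only the configuration key comes from the linked issue, the application name and the rest of the setup are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: keep the pre-3.2.0 behaviour so that splits reporting a length of 0
// (as HBase TableInputFormat splits do) are not filtered out.
val sparkConf = new SparkConf()
  .setAppName("hbase-newAPIHadoopRDD-repro") // illustrative name
  .set("spark.hadoopRDD.ignoreEmptySplits", "false")

val sc = new SparkContext(sparkConf)
// getInfo(sc, startDate, cachingValue, sparkLoggerParams, zkIP, zkPort) should then return rows again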
If we fetch the data using the Scan directly, without going through newAPIHadoopRDD, it works.
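For clarity, here is a sketch of what that working "direct scan" path could look like using the plain HBase client API (this is an assumption of what is meant by fetching with the Scan directly; the table and column names are the same placeholders as above, and hbaseConfig is assumed to be the configuration produced by SparkHBaseContext):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.mutable.ArrayBuffer

// Sketch: read the same rows through the HBase client directly, bypassing newAPIHadoopRDD.
def scanDirectly(hbaseConfig: Configuration): Seq[String] = {
  val connection = ConnectionFactory.createConnection(hbaseConfig)
  try {
    val table = connection.getTable(TableName.valueOf("myTable"))
    val scan = new Scan
    scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("time"))
    val scanner = table.getScanner(scan)
    val rowKeys = ArrayBuffer[String]()
    var result = scanner.next()
    while (result != null) { // ResultScanner.next() returns null when exhausted
      rowKeys += Bytes.toString(result.getRow)
      result = scanner.next()
    }
    scanner.close()
    rowKeys.toSeq
  } finally {
    connection.close()
  }
}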
Issue Links
- relates to
  - HBASE-28219 Document spark.hadoopRDD.ignoreEmptySplits issue for Spark Connector (Open)
  - PHOENIX-7065 Spark3 connector tests fail with Spark 3.4.1 (Resolved)