Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Currently, the Spark HBase connector uses `String` values to specify regionStart and regionEnd, but row keys are often serialized binary data. I made a small patch at https://github.com/apache/hbase-connectors/pull/72/files that always treats these `String` values as ISO_8859_1, so raw bytes can be put into the String object and read back unchanged.
This has a drawback: if your row key really is a Unicode string containing characters outside the ISO_8859_1 charset, you must first convert it to UTF-8 encoded bytes and then wrap those bytes in an ISO_8859_1 string. This is a limitation of the Spark option interface, which only accepts a string-to-string map.
    import java.nio.charset.StandardCharsets;

    df.write()
      .format("org.apache.hadoop.hbase.spark")
      .option(HBaseTableCatalog.tableCatalog(), catalog)
      .option(HBaseTableCatalog.newTable(), 5)
      .option(HBaseTableCatalog.regionStart(),
              new String("你好".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
      .option(HBaseTableCatalog.regionEnd(),
              new String("世界".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
      .mode(SaveMode.Append)
      .save();
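To illustrate why ISO_8859_1 is a safe carrier for raw bytes, here is a minimal standalone sketch using only the JDK. The class name `RowKeyStrings` and the helpers `encode` and `decode` are hypothetical names made up for this illustration; they are not part of the connector or of the patch. It relies on the fact that ISO_8859_1 maps every byte value 0x00..0xFF to exactly one char, so the round trip is lossless, whereas decoding arbitrary bytes as UTF-8 is not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of hbase-connectors.
public class RowKeyStrings {

    // Wrap raw row-key bytes in a String, one char (U+0000..U+00FF) per byte.
    public static String encode(byte[] rowKey) {
        return new String(rowKey, StandardCharsets.ISO_8859_1);
    }

    // Recover the original bytes from a String produced by encode().
    public static byte[] decode(String optionValue) {
        return optionValue.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A binary key containing bytes that are not valid UTF-8.
        byte[] binaryKey = {0x00, (byte) 0x80, (byte) 0xC3, (byte) 0xFF};

        // ISO_8859_1 round trip preserves every byte.
        System.out.println(Arrays.equals(binaryKey, decode(encode(binaryKey)))); // true

        // A UTF-8 round trip replaces the invalid bytes, so the key is corrupted.
        byte[] viaUtf8 = new String(binaryKey, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(binaryKey, viaUtf8)); // false

        // A Unicode row key: take its UTF-8 bytes first, then wrap them.
        byte[] utf8Key = "\u4f60\u597d".getBytes(StandardCharsets.UTF_8);
        String wrapped = encode(utf8Key);
        System.out.println(wrapped.length());                        // 6
        System.out.println(Arrays.equals(utf8Key, decode(wrapped))); // true
    }
}
```

The wrapped string is not human-readable; it is only a transport for the bytes and should be decoded with the same charset on the receiving side.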