Details
- Type: Task
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
Hudi CLI gives an exception when trying to connect to an S3 path:

create --path s3://some-bucket/tmp/hudi/test_mor --tableName test_mor_s3 --tableType MERGE_ON_READ

Failed to get instance of org.apache.hadoop.fs.FileSystem
org.apache.hudi.exception.HoodieIOException: Failed to get instance of org.apache.hadoop.fs.FileSystem
    at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:98)

=========

create --path s3a://some-bucket/tmp/hudi/test_mor --tableName test_mor_s3 --tableType MERGE_ON_READ

Command failed java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
This could be because the target/lib folder does not contain the hadoop-aws or AWS SDK dependency.
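A quick way to confirm that theory is to look for the jars in the CLI's lib directory. The helper below is hypothetical (not part of Hudi), and the target/lib path is an assumption based on a local mvn build layout:

```shell
# Hypothetical helper (not part of Hudi): report whether a lib
# directory contains the jars the S3A filesystem needs.
check_s3_jars() {
  local libdir="$1"
  ls "$libdir" 2>/dev/null | grep -q 'hadoop-aws'   || echo "missing hadoop-aws jar"
  ls "$libdir" 2>/dev/null | grep -q 'aws-java-sdk' || echo "missing AWS SDK jar"
}

# Path is an assumption based on a local mvn build.
check_s3_jars hudi-cli/target/lib
```

If either message prints, the ClassNotFoundException above is expected, since S3AFileSystem lives in hadoop-aws and depends on the AWS SDK.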
Update from Sivabalan:
Here is something that works for me even without the patch linked, if you wish to use the latest master hudi-cli with an S3 dataset. Just in case someone wants to try it out:
1. Replace the local hudi-cli.sh contents with this.
2. Run mvn package.
3. tar the entire hudi-cli directory.
4. Copy it to the EMR master.
5. Untar hudi-cli.tar.
6. Ensure SPARK_HOME is set to /usr/lib/spark.
7. Download the AWS jars and copy them to some directory:
mkdir client_jars && cd client_jars
export HADOOP_VERSION=3.2.0
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -O hadoop-aws.jar
export AWS_SDK_VERSION=1.11.375
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar -O aws-java-sdk.jar
export CLIENT_JARS=/home/hadoop/client_jars/aws-java-sdk.jar:/home/hadoop/client_jars/hadoop-aws.jar
8. Then launch hudi-cli.sh.
I verified that CLI commands that launch Spark succeed with this for an S3 dataset.
With the patch from Vinay, I am running into EMRFS issues.
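The jar URLs in step 7 above follow the standard Maven Central repository layout (group path / artifact / version / artifact-version.jar). A small helper makes the pattern explicit; maven_url is a hypothetical function for illustration, not part of Hudi:

```shell
# Build a Maven Central download URL for a group path, artifact,
# and version, following the standard repository layout.
maven_url() {
  local group_path="$1" artifact="$2" version="$3"
  echo "https://repo1.maven.org/maven2/${group_path}/${artifact}/${version}/${artifact}-${version}.jar"
}

# Reproduces the hadoop-aws URL from step 7:
maven_url org/apache/hadoop hadoop-aws 3.2.0
```

The same pattern yields the aws-java-sdk-bundle URL with group path com/amazonaws, so other Hadoop/SDK versions can be fetched by changing only the version arguments.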
Ethan: running hudi-cli locally with an S3 Hudi table:
1. Build Hudi with the corresponding Spark version.
2. Set up the environment and launch the CLI:
export AWS_REGION=us-east-2
export AWS_ACCESS_KEY_ID=<key_id>
export AWS_SECRET_ACCESS_KEY=<secret_key>
export SPARK_HOME=<spark_home>
# Note: AWS jar versions below are specific to Spark 3.2.0
export CLIENT_JAR=/lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.12.48.jar:/lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.3.1.jar
./hudi-cli/hudi-cli.sh
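Since the exact jar versions differ per Spark distribution, the colon-joined CLIENT_JAR value can be derived from whatever AWS jars ship in the Spark jars directory rather than hard-coded. build_client_jar is a hypothetical helper sketched for illustration:

```shell
# Hypothetical helper: join the AWS jars found in a Spark jars
# directory into a colon-separated value suitable for CLIENT_JAR.
build_client_jar() {
  local jars_dir="$1"
  ls "$jars_dir"/aws-java-sdk-bundle-*.jar "$jars_dir"/hadoop-aws-*.jar 2>/dev/null \
    | paste -sd: -
}

# Example usage (directory layout assumed, mirroring the paths above):
export CLIENT_JAR="$(build_client_jar "$SPARK_HOME/jars")"
```

This way the same setup script works across Spark releases, as long as the distribution bundles aws-java-sdk-bundle and hadoop-aws.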