Details
- Type: Task
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
Hudi CLI gives an exception when trying to connect to an S3 path:

create --path s3://some-bucket/tmp/hudi/test_mor --tableName test_mor_s3 --tableType MERGE_ON_READ

Failed to get instance of org.apache.hadoop.fs.FileSystem
org.apache.hudi.exception.HoodieIOException: Failed to get instance of org.apache.hadoop.fs.FileSystem
    at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:98)

=========

create --path s3a://some-bucket/tmp/hudi/test_mor --tableName test_mor_s3 --tableType MERGE_ON_READ

Command failed java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
This could be because the target/lib folder does not contain the hadoop-aws or AWS SDK dependency.
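A quick way to confirm that theory is to look for the jars in the CLI's lib directory. The helper below is hypothetical (not part of Hudi), and the target/lib path is an assumption based on a local mvn build layout:

```shell
# Hypothetical helper (not part of Hudi): report whether a lib
# directory contains the jars the S3A filesystem needs.
check_s3_jars() {
  local libdir="$1"
  ls "$libdir" 2>/dev/null | grep -q 'hadoop-aws'   || echo "missing hadoop-aws jar"
  ls "$libdir" 2>/dev/null | grep -q 'aws-java-sdk' || echo "missing AWS SDK jar"
}

# Path is an assumption based on a local mvn build.
check_s3_jars hudi-cli/target/lib
```

If either message prints, the ClassNotFoundException above is expected, since S3AFileSystem lives in hadoop-aws and depends on the AWS SDK.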
Update from Sivabalan:
Here is something that works for me even without the patch linked, if you wish to use the latest master hudi-cli with an S3 dataset. Just in case someone wants to try it out:
1. Replace the local hudi-cli.sh contents with this.
2. Run mvn package.
3. tar the entire hudi-cli directory.
4. Copy it to the EMR master.
5. Untar hudi-cli.tar.
6. Ensure SPARK_HOME is set to /usr/lib/spark.
7. Download the AWS jars and copy them to some directory:
mkdir client_jars && cd client_jars
export HADOOP_VERSION=3.2.0
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -O hadoop-aws.jar
export AWS_SDK_VERSION=1.11.375
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar -O aws-java-sdk.jar
export CLIENT_JARS=/home/hadoop/client_jars/aws-java-sdk.jar:/home/hadoop/client_jars/hadoop-aws.jar
8. Then launch hudi-cli.sh.
I verified that CLI commands that launch Spark succeed with this for an S3 dataset.
With the patch from Vinay, I am running into EMRFS issues.
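The jar URLs in step 7 above follow the standard Maven Central repository layout (group path / artifact / version / artifact-version.jar). A small helper makes the pattern explicit; maven_url is a hypothetical function for illustration, not part of Hudi:

```shell
# Build a Maven Central download URL for a group path, artifact,
# and version, following the standard repository layout.
maven_url() {
  local group_path="$1" artifact="$2" version="$3"
  echo "https://repo1.maven.org/maven2/${group_path}/${artifact}/${version}/${artifact}-${version}.jar"
}

# Reproduces the hadoop-aws URL from step 7:
maven_url org/apache/hadoop hadoop-aws 3.2.0
```

The same pattern yields the aws-java-sdk-bundle URL with group path com/amazonaws, so other Hadoop/SDK versions can be fetched by changing only the version arguments.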
Ethan: running hudi-cli locally with an S3 Hudi table:
1. Build Hudi with the corresponding Spark version.
2. Set up the environment and launch the CLI:
export AWS_REGION=us-east-2
export AWS_ACCESS_KEY_ID=<key_id>
export AWS_SECRET_ACCESS_KEY=<secret_key>
export SPARK_HOME=<spark_home>
# Note: AWS jar versions below are specific to Spark 3.2.0
export CLIENT_JAR=/lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.12.48.jar:/lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.3.1.jar
./hudi-cli/hudi-cli.sh
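Since the exact jar versions differ per Spark distribution, the colon-joined CLIENT_JAR value can be derived from whatever AWS jars ship in the Spark jars directory rather than hard-coded. build_client_jar is a hypothetical helper sketched for illustration:

```shell
# Hypothetical helper: join the AWS jars found in a Spark jars
# directory into a colon-separated value suitable for CLIENT_JAR.
build_client_jar() {
  local jars_dir="$1"
  ls "$jars_dir"/aws-java-sdk-bundle-*.jar "$jars_dir"/hadoop-aws-*.jar 2>/dev/null \
    | paste -sd: -
}

# Example usage (directory layout assumed, mirroring the paths above):
export CLIENT_JAR="$(build_client_jar "$SPARK_HOME/jars")"
```

This way the same setup script works across Spark releases, as long as the distribution bundles aws-java-sdk-bundle and hadoop-aws.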