
HUDI-2083: Hudi CLI does not work with S3



    Description

      Hudi CLI throws an exception when trying to connect to an S3 path:

      create --path s3://some-bucket/tmp/hudi/test_mor --tableName test_mor_s3 --tableType MERGE_ON_READ
      
      Failed to get instance of org.apache.hadoop.fs.FileSystem
      org.apache.hudi.exception.HoodieIOException: Failed to get instance of org.apache.hadoop.fs.FileSystem
          at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:98)
      
      =========
      
      create --path s3a://some-bucket/tmp/hudi/test_mor --tableName test_mor_s3 --tableType MERGE_ON_READ
      
      Command failed java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
      java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
      java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
          at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
          at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
      
      

      This could be because the target/lib folder does not contain the hadoop-aws or aws-s3 dependency.
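      A quick way to confirm is to check whether the S3 filesystem jars made it onto the CLI classpath at all (a rough sketch; the hudi-cli/target/lib location assumes a local build of the hudi-cli module):

      # List any S3-related jars bundled with the CLI; an empty result points to the missing dependency.
      ls hudi-cli/target/lib | grep -iE 'hadoop-aws|aws-java-sdk' || echo "no S3 jars on the CLI classpath"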

       

      Update from Sivabalan:

      Something that works for me even without the patch linked, in case someone wants to try using the latest master hudi-cli with an S3 dataset:

      1. Replace the local hudi-cli.sh contents with this.
      2. Do mvn package.
      3. Tar the entire hudi-cli directory.
      4. Copy it to the EMR master.
      5. Untar hudi-cli.tar.

      6. Ensure SPARK_HOME is set to /usr/lib/spark.

      7. Download the AWS jars and copy them to some directory:

      mkdir client_jars && cd client_jars

      export HADOOP_VERSION=3.2.0
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar -O hadoop-aws.jar
      export AWS_SDK_VERSION=1.11.375
      wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar -O aws-java-sdk.jar 

      export CLIENT_JARS=/home/hadoop/client_jars/aws-java-sdk.jar:/home/hadoop/client_jars/hadoop-aws.jar

      8. Then launch hudi-cli.sh (see the consolidated sketch below).

      I verified that CLI commands that launch Spark succeed with this for an S3 dataset.
      With the patch from Vinay, I am running into EMRFS issues.
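      Putting the steps above together, the launch on the EMR master might look roughly like this (a sketch; the /home/hadoop paths and the CLIENT_JARS variable follow the steps above and are otherwise assumptions):

      # On the EMR master, after untarring the locally built hudi-cli directory.
      export SPARK_HOME=/usr/lib/spark
      export CLIENT_JARS=/home/hadoop/client_jars/aws-java-sdk.jar:/home/hadoop/client_jars/hadoop-aws.jar
      cd hudi-cli && ./hudi-cli.sh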

       

      Ethan: running hudi-cli locally against an S3 Hudi table:

      Build Hudi with the corresponding Spark version.
      
      export AWS_REGION=us-east-2
      export AWS_ACCESS_KEY_ID=<key_id>
      export AWS_SECRET_ACCESS_KEY=<secret_key>
      
      export SPARK_HOME=<spark_home>
      # Note: AWS jar versions below are specific to Spark 3.2.0
      export CLIENT_JAR=/lib/spark-3.2.0-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.12.48.jar:/lib/spark-3.2.0-bin-hadoop3.2/jars/hadoop-aws-3.3.1.jar
      ./hudi-cli/hudi-cli.sh
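      Once the shell comes up, connecting to the S3 table is a quick smoke test (hudi-cli commands; the bucket path is a placeholder and the available commands may vary by Hudi version):

      # Inside hudi-cli: point the shell at the S3 table, then inspect it.
      connect --path s3a://some-bucket/tmp/hudi/test_mor
      desc
      commits show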

      Attachments

        hudi-cli_trace.txt (31 kB, attached by Benoit COLAS)


      People

        Assignee/Reporter: vinaypatil18 Vinay