Details
- Type: Improvement
- Priority: Major
- Status: Closed
- Resolution: Won't Fix
Description
The existing Hudi docker image is based on an old Debian distribution and therefore ships with Python 3.5.3. When used with Spark 3.2.1, running the "pyspark" command in the container shell fails with the error below:
root@e9fb3f81bdc9:/opt# pyspark
Python 3.5.3 (default, Nov 4 2021, 15:29:10)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/opt/spark/python/pyspark/shell.py", line 29, in <module>
    from pyspark.context import SparkContext
  File "/opt/spark/python/pyspark/__init__.py", line 53, in <module>
    from pyspark.rdd import RDD, RDDBarrier
  File "/opt/spark/python/pyspark/rdd.py", line 48, in <module>
    from pyspark.traceback_utils import SCCallSiteSync
  File "/opt/spark/python/pyspark/traceback_utils.py", line 23, in <module>
    CallSite = namedtuple("CallSite", "function file linenum")
  File "/opt/spark/python/pyspark/serializers.py", line 390, in namedtuple
    for k, v in _old_namedtuple_kwdefaults.items():
AttributeError: 'NoneType' object has no attribute 'items'
The image I used was: https://hub.docker.com/r/apachehudi/hudi-hadoop_3.1.0-hive_3.1.2-sparkadhoc_3.2.1/tags
Spark 3.2.1 requires Python 3.6+, and Spark 3.3 requires Python 3.7+.
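The incompatibility above can be sketched as a simple version check. This is an illustrative helper only (the version table is taken from this issue's text, not from any Spark API):

```python
import sys

# Minimum Python version per Spark release, per the requirements stated above
# (hypothetical lookup table for illustration).
SPARK_MIN_PYTHON = {
    "3.2.1": (3, 6),
    "3.3.0": (3, 7),
}

def python_ok_for_spark(spark_version, python_version=None):
    """Return True if the given Python version meets Spark's stated minimum."""
    if python_version is None:
        python_version = sys.version_info[:2]
    return python_version >= SPARK_MIN_PYTHON[spark_version]

# Python 3.5 (shipped with the old image) fails the check for Spark 3.2.1:
print(python_ok_for_spark("3.2.1", (3, 5)))  # False
print(python_ok_for_spark("3.2.1", (3, 9)))  # True
```

Python 3.5 falls below the 3.6 floor for Spark 3.2.1, which is why the pyspark shell crashes on import in the old container.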
The base image for Java 8 is openjdk:8u212-jdk-slim-stretch. The goal is to upgrade it to openjdk:8u342-jdk-slim-bullseye; on Bullseye, "apt-get install python3" installs Python 3.9. This also brings the Java 8 image onto the same Debian release as the existing Java 11 base image.
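The proposed change could look roughly like this in the image's Dockerfile (a sketch only; the actual file layout and surrounding layers in the Hudi docker setup may differ):

```dockerfile
# Before: Debian Stretch, which ships Python 3.5.3 (too old for Spark 3.2+)
# FROM openjdk:8u212-jdk-slim-stretch

# After: Debian Bullseye, where apt's python3 package is Python 3.9
FROM openjdk:8u342-jdk-slim-bullseye

RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*
```

Pinning the bullseye-based tag keeps the JDK at 8 while pulling in a Debian release whose default python3 satisfies Spark's minimum version.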
Issue Links
- blocks
  - HUDI-5273 Notebook demo with pyspark - improve new users experience (Closed)