Description
Hi,
I have run into some very strange broadcast join behaviour.
Following https://issues.apache.org/jira/browse/SPARK-10577, I am using the hint for broadcast joins (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files).
I found that whether this feature works depends on the executor memory.
In my case the broadcast join returns correct results with executor memory up to 31G; at 32G the join silently returns nulls for the broadcast side instead of matching rows.
Example:
spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)
Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]

spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)
Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]
Please find example code attached.
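For readers without access to the attachment, here is a minimal sketch of what such a reproduction script might look like. The actual debug_broadcast_join.py is not reproduced in this issue, so the table names, sizes, and join type below are illustrative assumptions; the only Spark-specific piece is pyspark.sql.functions.broadcast, the hint added by SPARK-10577 / PR 8801 (not present in vanilla 1.5.1 without that patch).

# Hypothetical sketch of a reproduction in the spirit of the attached
# debug_broadcast_join.py; names, sizes, and join type are assumptions.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import broadcast  # hint from SPARK-10577 / PR 8801

sc = SparkContext(appName="debug_broadcast_join")
sqlContext = SQLContext(sc)

print("Creating test tables...")
big = sqlContext.createDataFrame([(i, i) for i in range(100000)], ["id", "val"])
small = sqlContext.createDataFrame([(i, i) for i in range(100)], ["id2", "val2"])

print("Joining tables...")
# Mark the small side for broadcast; with a left outer join, a broken
# broadcast shows up as None in the id2/val2 columns rather than an error.
joined = big.join(broadcast(small), big.id == small.id2, "left_outer")

print("Joined table schema:")
joined.printSchema()

print("Selecting data for id = 5...")
print(joined.where(joined.id == 5).collect())

sc.stop()

The 31G/32G boundary lines up with the HotSpot JVM disabling compressed oops for heaps of roughly 32GB and above, which is the mechanism identified in the duplicate issue SPARK-10914 below.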
Attachments
Issue Links
- duplicates
  - SPARK-10914 UnsafeRow serialization breaks when two machines have different Oops size (Resolved)