Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
There are some CPU hotspots when processing large data sets once IO is highly optimized:
1. Sort: memory access when comparing two values can be a bottleneck.
2. Aggregation: hash construction for UnorderedPartitionedKVWriter can be a bottleneck.
3. Filter: memory access when comparing key values against a given condition can be a bottleneck.
This issue is an umbrella JIRA for CPU optimizations on the Tez side.
Related work:
AlphaSort: http://dl.acm.org/citation.cfm?id=615237
Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited: https://cs.uwaterloo.ca/~tozsu/publications/other/p168-balkesen.pdf
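To make the sort hotspot concrete, here is a minimal sketch of the key-prefix idea described in the AlphaSort paper: keep a fixed-size prefix of each key next to its index so that most comparisons are decided without dereferencing the full key. This is only an illustration, not Tez code; the class `KeyPrefixSort` and its methods are hypothetical names, and a real implementation would keep prefixes and pointers in a primitive array rather than sorting boxed `Integer` indices.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; none of these names come from Tez.
final class KeyPrefixSort {

    /** Packs the first 8 bytes of a key into a long (big-endian, zero-padded). */
    static long prefixOf(byte[] key) {
        long prefix = 0;
        for (int i = 0; i < 8; i++) {
            prefix <<= 8;
            if (i < key.length) {
                prefix |= (key[i] & 0xFFL);
            }
        }
        return prefix;
    }

    /** Full unsigned lexicographic comparison: the slow path that touches key memory. */
    static int compareFull(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Returns the indices of keys in ascending unsigned lexicographic order. */
    static int[] sortedOrder(byte[][] keys) {
        final long[] prefixes = new long[keys.length];
        Integer[] order = new Integer[keys.length];
        for (int i = 0; i < keys.length; i++) {
            prefixes[i] = prefixOf(keys[i]);
            order[i] = i;
        }
        Arrays.sort(order, (x, y) -> {
            // Fast path: compare the cached prefixes as unsigned 64-bit values.
            int c = Long.compareUnsigned(prefixes[x], prefixes[y]);
            // Slow path: only on a prefix tie do we read the full keys.
            return c != 0 ? c : compareFull(keys[x], keys[y]);
        });
        int[] result = new int[order.length];
        for (int i = 0; i < order.length; i++) {
            result[i] = order[i];
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = {"pear", "apple", "applesauce12", "applesauce11", "a"};
        byte[][] keys = new byte[words.length][];
        for (int i = 0; i < words.length; i++) {
            keys[i] = words[i].getBytes(StandardCharsets.UTF_8);
        }
        for (int idx : sortedOrder(keys)) {
            System.out.println(words[idx]);
        }
    }
}
```

In this sketch only keys that share their first 8 bytes (here the two `applesauce` keys) fall through to the full comparison; every other comparison is resolved from the cached `long` values.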
Issue Links
is related to:
- HADOOP-11029 FileSystem#Statistics uses volatile variables that must be updated on write or read calls. (Open)
- HADOOP-10694 Remove synchronized input streams from Writable deserialization (Resolved)
- TEZ-3284 Synchronization for every write in UnorderdKVWriter (Closed)
- TEZ-1491 Tez reducer-side merge's counter update is slow (Closed)
- TEZ-1277 Tez Spill handler should truncate files to reserve space on disk (Open)
- TEZ-2582 Consider removing DataInputBuffer sync overheads in pipelinedsorter (Resolved)