Description
I'm not sure there is a way to solve this across multiple runtimes, but I'm documenting it as a JIRA regardless.
If you are running in a multi-cluster environment where you read data from one cluster and write the output to another (e.g. generating HFiles to be loaded into a separate HBase cluster), moving the files is noticeably slow. This appears to be because the files are moved in the launcher/driver process rather than as part of node execution.[1]
A more efficient option would be to kick off a DistCp instead, but that would tie the target directly to a runtime, which is not a great approach.
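To make the trade-off concrete, here is a minimal sketch of the decision such a target would have to make: whether the source and destination sit on the same filesystem (so a cheap rename is enough) or on different ones (so a distributed copy would be preferable). This is not Crunch code. The class name, method names, strategy labels, and the host names in the usage note are all hypothetical, and the comparison of URI scheme and authority is an assumed heuristic that uses only `java.net.URI`.

```java
import java.net.URI;

// Hypothetical helper, not part of Crunch or Hadoop.
final class CrossClusterCheck {

    private CrossClusterCheck() {}

    // Null-safe, case-insensitive comparison for URI components.
    private static boolean sameComponent(String a, String b) {
        return a == null ? b == null : a.equalsIgnoreCase(b);
    }

    // Assumption: two paths are on different filesystems when their
    // scheme (e.g. "hdfs") or authority (e.g. "nn1:8020") differ.
    static boolean differentFileSystems(URI src, URI dst) {
        return !sameComponent(src.getScheme(), dst.getScheme())
            || !sameComponent(src.getAuthority(), dst.getAuthority());
    }

    // Returns a label only; actually launching a copy job is
    // runtime-specific, which is the coupling this issue describes.
    static String chooseStrategy(URI src, URI dst) {
        return differentFileSystems(src, dst) ? "DISTRIBUTED_COPY" : "RENAME";
    }
}
```

For example, `CrossClusterCheck.chooseStrategy(URI.create("hdfs://nn1:8020/tmp/out"), URI.create("hdfs://nn2:8020/hbase/staging"))` selects the distributed copy, while two paths under the same `hdfs://nn1:8020` authority select the rename. The sketch deliberately stops at choosing a strategy, since how the copy is executed depends on the runtime.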
Issue Links
- is related to CRUNCH-675 HFileTarget should use DistCp when source and destination are in different filesystems (Resolved)