Details
Type: Bug
Status: Resolved
Priority: P2
Resolution: Won't Fix
Description
Apache Beam version: 2.10, Java SDK
Dataflow GCP Staged
Details:
We have a streaming Dataflow pipeline that ingests data into BigQuery using streaming inserts.
We deploy the job with the maximum number of workers set to 40 while
there is already a huge backlog (high watermark).
When the job starts, it scales from 0 to 3 workers
and begins ingesting at a rate of 12,000 messages/sec.
After 2 minutes it scales from 3 to 40 workers to keep up with the backlog.
After scaling up, the rate never exceeds what it was with 3 workers (12,000 messages/sec).
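To put the reported figures in perspective, here is a rough back-of-the-envelope sketch. It assumes throughput would scale linearly with the number of workers, which is an assumption for illustration and not something measured in this report. The class and method names are hypothetical.

```java
// Back-of-the-envelope check using the figures from this report.
// Assumption (not measured): throughput scales linearly with worker count.
final class ScalingCheck {
    /** Average messages/sec handled by each worker. */
    static double perWorkerRate(double totalRate, int workers) {
        return totalRate / workers;
    }

    /** Throughput expected if every worker sustains the given per-worker rate. */
    static double expectedRate(double perWorkerRate, int workers) {
        return perWorkerRate * workers;
    }

    public static void main(String[] args) {
        double observed = 12_000; // messages/sec, reported at both 3 and 40 workers
        double perWorker = perWorkerRate(observed, 3);
        double expectedAt40 = expectedRate(perWorker, 40);
        System.out.printf("per worker at 3 workers: %.0f msg/s%n", perWorker);
        System.out.printf("expected at 40 workers (linear): %.0f msg/s%n", expectedAt40);
        System.out.printf("observed / expected at 40 workers: %.1f%%%n",
                100 * observed / expectedAt40);
    }
}
```

Under that linear assumption, 3 workers at 12,000 messages/sec means about 4,000 messages/sec per worker, so 40 workers would be expected to reach roughly 160,000 messages/sec. The observed 12,000 messages/sec is 7.5% of that figure.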
We have memory consumption metrics in Stackdriver. They show
that the first 3 workers consume about 5 GB of RAM, while the remaining 37 workers
consume about 0.2 GB of RAM. It appears that these autoscaled nodes are idle. More importantly, they do not contribute to the BigQuery streaming inserts.
Autoscaling works fine in our other streaming pipelines,
so this problem appears to be related to BigQuery streaming inserts.