Details
Sub-task
Status: Open
Major
Resolution: Unresolved
0.17
None
Description
When a job is submitted and the cluster returns a job ID, GFAC executes the squeue command. When this command returns the queued job details, GFAC sends the gateway user details to the XSEDE machines and also adds the job ID to the monitoring map.
Intermittently, SSH session validation takes longer than usual after job submission. By the time the squeue command is executed, the job is no longer in the queue (it has already completed), so squeue returns an error and the submission is reported as failed [1].
[1]
2017-05-02 06:27:48,047 [pool-7-thread-15] ERROR o.a.a.g.i.t.DefaultJobSubmissionTask process_id=PROCESS_c7e404ed-0822-404a-8f04-6b09e9ba8ece, token_id=75918c63-30fd-4548-a8d3-7f3a41b185ae, experiment_id=US3-AIRA_740b0ad6-62c4-42dc-9eed-f12b92a6b98b, gateway_id=Ultrascan_Production - Error occurred while submitting the job
org.apache.airavata.gfac.core.GFacException: Error running command squeue -j 9119082 on remote cluster. StandardError: slurm_load_jobs error: Invalid job id specified
at org.apache.airavata.gfac.impl.HPCRemoteCluster.throwExceptionOnError(HPCRemoteCluster.java:298)
at org.apache.airavata.gfac.impl.HPCRemoteCluster.getJobStatus(HPCRemoteCluster.java:233)
at org.apache.airavata.gfac.impl.task.DefaultJobSubmissionTask.verifyJobSubmissionByJobId(DefaultJobSubmissionTask.java:302)
at org.apache.airavata.gfac.impl.task.DefaultJobSubmissionTask.execute(DefaultJobSubmissionTask.java:157)
at org.apache.airavata.gfac.impl.GFacEngineImpl.executeTask(GFacEngineImpl.java:814)
at org.apache.airavata.gfac.impl.GFacEngineImpl.executeJobSubmission(GFacEngineImpl.java:510)
at org.apache.airavata.gfac.impl.GFacEngineImpl.executeTaskListFrom(GFacEngineImpl.java:386)
at org.apache.airavata.gfac.impl.GFacEngineImpl.executeProcess(GFacEngineImpl.java:286)
at org.apache.airavata.gfac.impl.GFacWorker.executeProcess(GFacWorker.java:227)
at org.apache.airavata.gfac.impl.GFacWorker.run(GFacWorker.java:86)
at org.apache.airavata.common.logging.MDCUtil.lambda$wrapWithMDC$0(MDCUtil.java:40)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
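One possible direction, sketched below under stated assumptions, is to stop treating the "Invalid job id specified" message as a submission failure and instead classify it as "job no longer in the queue". The class `SqueueOutcome`, its `State` enum, and the `classify` method are hypothetical names for illustration only; they are not part of Airavata, and the sketch does not show how `HPCRemoteCluster` or `DefaultJobSubmissionTask` actually parse command output. The job-ID match is a plain token comparison, so array-job IDs would need extra handling.

```java
import java.util.Locale;

/** Hypothetical helper (not Airavata code): classifies the result of `squeue -j <jobId>`. */
final class SqueueOutcome {

    enum State { IN_QUEUE, NOT_IN_QUEUE, COMMAND_FAILED }

    // Message Slurm printed in the log above when the job ID was no longer known.
    private static final String INVALID_JOB_ID = "invalid job id specified";

    private SqueueOutcome() {
    }

    static State classify(String jobId, String stdout, String stderr) {
        String err = stderr == null ? "" : stderr.toLowerCase(Locale.ROOT);

        // The job was accepted earlier (a job ID exists), so an unknown ID most
        // likely means it already left the queue, not that submission failed.
        if (err.contains(INVALID_JOB_ID)) {
            return State.NOT_IN_QUEUE;
        }
        // Any other stderr output is treated as a real command failure.
        if (!err.trim().isEmpty()) {
            return State.COMMAND_FAILED;
        }
        // The job counts as queued if some stdout line has the job ID as a token.
        if (stdout != null) {
            for (String line : stdout.split("\\R")) {
                for (String token : line.trim().split("\\s+")) {
                    if (token.equals(jobId)) {
                        return State.IN_QUEUE;
                    }
                }
            }
        }
        return State.NOT_IN_QUEUE;
    }
}
```

With a classification like this, a caller could add the job to the monitoring map on `IN_QUEUE`, fall back to another source of final job state (for example Slurm accounting via `sacct`, if it is enabled on the cluster) on `NOT_IN_QUEUE`, and raise an exception only on `COMMAND_FAILED`.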