Details
Type: Bug
Status: Resolved
Priority: Not a Priority
Resolution: Fixed
Version: 1.19.0
Description
We have been using the following snippet in our Dockerfiles to run a Flink job in application mode:
FROM flink:1.18.1-scala_2.12-java17
COPY --from=build /app/target/my-job*.jar /opt/flink/usrlib/artifacts/my-job.jar
USER flink
This has worked since at least Flink 1.14, but the 1.19 update broke our Dockerfiles. The fix is to move the jar file one directory level up, so that the snippet becomes:
FROM flink:1.18.1-scala_2.12-java17
COPY --from=build /app/target/my-job*.jar /opt/flink/usrlib/my-job.jar
USER flink
We have not spent much time investigating the cause, but we get the following stack trace:
myjob-jobmanager-1 | org.apache.flink.util.FlinkException: Could not load the provided entrypoint class.
myjob-jobmanager-1 |     at org.apache.flink.client.program.DefaultPackagedProgramRetriever.getPackagedProgram(DefaultPackagedProgramRetriever.java:230) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.container.entrypoint.StandaloneApplicationClusterEntryPoint.getPackagedProgram(StandaloneApplicationClusterEntryPoint.java:149) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.container.entrypoint.StandaloneApplicationClusterEntryPoint.lambda$main$0(StandaloneApplicationClusterEntryPoint.java:90) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.container.entrypoint.StandaloneApplicationClusterEntryPoint.main(StandaloneApplicationClusterEntryPoint.java:89) [flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 | Caused by: org.apache.flink.client.program.ProgramInvocationException: The program's entry point class 'my.company.job.MyJob' was not found in the jar file.
myjob-jobmanager-1 |     at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:481) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.client.program.PackagedProgram.<init>(PackagedProgram.java:153) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.client.program.PackagedProgram.<init>(PackagedProgram.java:65) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.client.program.PackagedProgram$Builder.build(PackagedProgram.java:691) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.client.program.DefaultPackagedProgramRetriever.getPackagedProgram(DefaultPackagedProgramRetriever.java:228) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     ... 4 more
myjob-jobmanager-1 | Caused by: java.lang.ClassNotFoundException: my.company.job.MyJob
myjob-jobmanager-1 |     at java.net.URLClassLoader.findClass(Unknown Source) ~[?:?]
myjob-jobmanager-1 |     at java.lang.ClassLoader.loadClass(Unknown Source) ~[?:?]
myjob-jobmanager-1 |     at org.apache.flink.util.FlinkUserCodeClassLoader.loadClassWithoutExceptionHandling(FlinkUserCodeClassLoader.java:67) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:74) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:51) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at java.lang.ClassLoader.loadClass(Unknown Source) ~[?:?]
myjob-jobmanager-1 |     at org.apache.flink.util.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.loadClass(FlinkUserCodeClassLoaders.java:197) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at java.lang.Class.forName0(Native Method) ~[?:?]
myjob-jobmanager-1 |     at java.lang.Class.forName(Unknown Source) ~[?:?]
myjob-jobmanager-1 |     at org.apache.flink.client.program.PackagedProgram.loadMainClass(PackagedProgram.java:479) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.client.program.PackagedProgram.<init>(PackagedProgram.java:153) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.client.program.PackagedProgram.<init>(PackagedProgram.java:65) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.client.program.PackagedProgram$Builder.build(PackagedProgram.java:691) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     at org.apache.flink.client.program.DefaultPackagedProgramRetriever.getPackagedProgram(DefaultPackagedProgramRetriever.java:228) ~[flink-dist-1.19.0.jar:1.19.0]
myjob-jobmanager-1 |     ... 4 more
I have changed some text in the stack trace to keep it anonymous, so it may contain a typo, but that is not the issue.

The stack trace leads to PackagedProgram and DefaultPackagedProgramRetriever. The only commits to these classes after Flink 1.18 are one PackagedProgram commit and one DefaultPackagedProgramRetriever commit. We suspect the latter is the culprit, specifically one line in it that we think has made the artifact check non-recursive.

We assume the intended layout is to have artifacts directly in /opt/flink/usrlib, without the artifacts subdirectory, so we plan to change our Dockerfiles accordingly anyway. It is still a breaking change, however, so we wanted to file an issue for it first.
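To illustrate the suspected difference between a recursive and a non-recursive artifact check, here is a minimal standalone sketch. It is not Flink's actual code: the class UsrlibScanSketch and its two methods are hypothetical names, and it only assumes that jars are discovered by listing a directory. It shows that a jar placed in an artifacts subdirectory is found by a recursive walk but missed by a flat listing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; this does not reproduce Flink's implementation.
public class UsrlibScanSketch {

    // Non-recursive scan: only jars that sit directly inside root are returned.
    public static List<Path> listJarsFlat(Path root) throws IOException {
        try (Stream<Path> entries = Files.list(root)) {
            return entries.filter(UsrlibScanSketch::isJar).sorted().collect(Collectors.toList());
        }
    }

    // Recursive scan: jars in root and in any nested subdirectory are returned.
    public static List<Path> listJarsRecursive(Path root) throws IOException {
        try (Stream<Path> entries = Files.walk(root)) {
            return entries.filter(UsrlibScanSketch::isJar).sorted().collect(Collectors.toList());
        }
    }

    private static boolean isJar(Path p) {
        return Files.isRegularFile(p) && p.getFileName().toString().endsWith(".jar");
    }

    public static void main(String[] args) throws IOException {
        // Mimic the layout from the first Dockerfile: usrlib/artifacts/my-job.jar
        Path usrlib = Files.createTempDirectory("usrlib");
        Path artifacts = Files.createDirectory(usrlib.resolve("artifacts"));
        Files.createFile(artifacts.resolve("my-job.jar"));

        System.out.println("flat scan found:      " + listJarsFlat(usrlib));
        System.out.println("recursive scan found: " + listJarsRecursive(usrlib));
    }
}
```

With this layout the flat scan returns an empty list while the recursive scan returns the nested jar, which would be consistent with the ClassNotFoundException above if the lookup did become non-recursive.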