Description
Restarting a kernel through Jupyter kernel gateway fails with a timeout. The kernel itself is restarted, but kernel gateway times out waiting for the kernel_info_reply message that it expects in response to the kernel_info_request it sends after initiating the restart.
The problem is reproducible most of the time with something like this:
curl -v -X POST --data '{ "name":"apache_toree_scala" }' http://127.0.0.1:8888/api/kernels
curl -v -X POST --data '{}' http://127.0.0.1:8888/api/kernels/<kernelid-from-above>/restart
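The two curl calls above can also be scripted so that the kernel id is passed along automatically. This is only a sketch: the function names are made up for illustration, the JSON Content-Type header is an assumption (curl --data sends a form content type by default), and the timeout value is arbitrary.

```python
import json
import urllib.request


def build_start_request(base_url, kernel_name):
    """Build the POST /api/kernels request that starts a kernel."""
    body = json.dumps({"name": kernel_name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/kernels",
        data=body,
        headers={"Content-Type": "application/json"},  # assumption, see above
        method="POST",
    )


def build_restart_request(base_url, kernel_id):
    """Build the POST /api/kernels/<id>/restart request."""
    return urllib.request.Request(
        f"{base_url}/api/kernels/{kernel_id}/restart",
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reproduce(base_url="http://127.0.0.1:8888",
              kernel_name="apache_toree_scala",
              timeout=120):
    """Start a kernel, then restart it; returns the restart HTTP status.

    urlopen raises urllib.error.HTTPError on a non-2xx response, which is
    what the restart call is expected to hit when the gateway times out.
    """
    start = build_start_request(base_url, kernel_name)
    with urllib.request.urlopen(start, timeout=timeout) as resp:
        kernel_id = json.load(resp)["id"]  # kernel model returned by the API
    restart = build_restart_request(base_url, kernel_id)
    with urllib.request.urlopen(restart, timeout=timeout) as resp:
        return resp.status
```

Calling reproduce() against a running kernel gateway with the Toree kernel installed should then show the failure most of the time.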
From the IPython message protocol doc, this is the message format:
[
b'u-u-i-d', # zmq identity(ies)
b'<IDS|MSG>', # delimiter
b'baddad42', # HMAC signature
b'{header}', # serialized header dict
b'{parent_header}', # serialized parent header dict
b'{metadata}', # serialized metadata dict
b'{content}', # serialized content dict
b'blob', # extra raw data buffer(s)
...
]
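To make the layout concrete, here is a small sketch of how a receiver might split such a multipart message at the delimiter. The split_frames helper and the sample frames are hypothetical and are not Toree or kernel gateway code. The point is that everything before the delimiter is opaque routing data and only the frames after the signature are text.

```python
DELIMITER = b"<IDS|MSG>"


def split_frames(frames):
    """Split a multipart wire message into its logical parts.

    Returns (identities, signature, dicts, buffers), where dicts is the
    tuple (header, parent_header, metadata, content) of still-serialized
    frames. Everything is kept as bytes.
    """
    idx = frames.index(DELIMITER)  # raises ValueError if there is no delimiter
    identities = frames[:idx]      # zero or more zmq identity frames
    signature = frames[idx + 1]    # HMAC signature
    header, parent_header, metadata, content = frames[idx + 2:idx + 6]
    buffers = frames[idx + 6:]     # optional raw data buffers
    return identities, signature, (header, parent_header, metadata, content), buffers


# Sample message with a five-byte binary identity (made-up values).
sample = [
    b"\x00\x8f\x12\xab\x34",
    DELIMITER,
    b"baddad42",
    b"{}", b"{}", b"{}", b"{}",
]
identities, signature, dicts, buffers = split_frames(sample)
```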
The first frame of the message contains the zmq identities. In some cases, for a Router-type socket, the identity is generated by jeromq and then consists of five bytes: a 0 byte followed by a random 4-byte int.
In Toree, all frames are treated as Strings. Converting the identity bytes to a UTF-8 String corrupts the zmq id, because every byte sequence that is not valid UTF-8 is replaced by the replacement character U+FFFD (0xEFBFBD when encoded as UTF-8).
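The corruption can be illustrated outside of Toree. The sketch below uses Python rather than the JVM, and the identity bytes are made-up values in the "0 followed by a random int" shape. Python needs errors="replace" to get the substitution behaviour, whereas Java's String(byte[], Charset) constructor substitutes the replacement character without being asked.

```python
# Hypothetical auto-generated identity: a 0 byte followed by four random bytes.
raw_id = bytes([0x00, 0x8F, 0x12, 0xAB, 0x34])

# Treat the frame as text: invalid UTF-8 bytes become U+FFFD.
as_string = raw_id.decode("utf-8", errors="replace")

# Turn the text back into bytes, as happens when the reply is sent.
round_tripped = as_string.encode("utf-8")

print(raw_id.hex())         # original identity
print(round_tripped.hex())  # identity after the String round trip
```

Here 0x8F and 0xAB are not valid on their own in UTF-8, so each one comes back as the three bytes EF BF BD and the identity no longer matches the original.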
When the corrupted id is used in a message sent to the Router socket, the socket finds no peer with that identity and the message is dropped.
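A toy model of the routing step shows why the reply never arrives. ToyRouter below is a hypothetical stand-in for a Router socket's peer table, not the jeromq implementation. It only assumes that outgoing messages are matched to peers by exact identity bytes and that messages for unknown identities are dropped without an error.

```python
class ToyRouter:
    """Toy stand-in for a Router socket's peer table (illustration only)."""

    def __init__(self):
        self.peers = {}  # identity bytes -> list of delivered payloads

    def connect(self, identity):
        """Register a peer under its identity."""
        self.peers[identity] = []

    def send(self, frames):
        """Route by the first frame; return False if the message is dropped."""
        identity, payload = frames[0], frames[1:]
        inbox = self.peers.get(identity)
        if inbox is None:
            return False  # no peer with this identity: dropped
        inbox.append(payload)
        return True


router = ToyRouter()
raw_id = bytes([0x00, 0x8F, 0x12, 0xAB, 0x34])  # made-up identity bytes
router.connect(raw_id)

# Identity after being treated as a UTF-8 String and converted back.
corrupted_id = raw_id.decode("utf-8", errors="replace").encode("utf-8")

dropped = not router.send([corrupted_id, b"kernel_info_reply"])
delivered = router.send([raw_id, b"kernel_info_reply"])
```

With the corrupted identity the message is dropped; with the identity kept as the original bytes it is delivered, which suggests that the identity frames should not go through a String conversion at all.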
This affects other messages as well, not just kernel_info_reply.