IMPALA-13107

Invalid TExecPlanFragmentInfo received by executor with instance number as 0

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Impala 4.5.0, Impala 4.4.1
    • Component/s: Backend
    • Labels: None

    Description

      In a customer-reported case, executors received a TExecPlanFragmentInfo whose number of instances was 0, which caused the Impala daemon to crash. The following log messages were collected on the Impala executors:

      impalad.executor.net.impala.log.INFO.20240522-160138.197583:I0523 00:59:16.892853 199528 control-service.cc:148] 624c47e9264ebb62:5aa89af300000000] ExecQueryFInstances(): query_id=624c47e9264ebb62:5aa89af300000000 coord=coordinator.net:27000 #instances=0
      ......
      I0523 00:59:19.306522 199185 kMinidump in thread [1890723]query-state-624c47e9264ebb62:5aa89af300000000 running query 624c47e9264ebb62:5aa89af300000000, fragment instance 0000000000000000:0000000000000000
      Wrote minidump to /var/log/impala-minidumps/impalad/021b06ea-1627-4c69-9f27858a-f3cd9026.dmp
      #
      # A fatal error has been detected by the Java Runtime Environment:
      #
      #  SIGSEGV (0xb) at pc=0x00000000012ff9d9, pid=197583, tid=0x00007eefc98a0700
      #
      # JRE version: Java(TM) SE Runtime Environment (8.0_381) (build 1.8.0_381-b09)
      # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.381-b09 mixed mode linux-amd64 )
      # Problematic frame:
      # C  [impalad+0xeff9d9]  impala::FragmentState::FragmentState(impala::QueryState*, impala::TPlanFragment const&, impala::PlanFragmentCtxPB const&)+0xf9
      #
      # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
      #
      

      According to the collected profiles, no fragment in the corresponding query plan had zero instances, so the coordinator should not have sent fragments to an executor with the number of instances set to 0. The executor log files showed many KRPC errors around the time the invalid TExecPlanFragmentInfo was received. It seems that KRPC messages were truncated because of the KRPC failures, and truncation does not necessarily cause a Thrift deserialization error. The invalid TExecPlanFragmentInfo caused the Impala daemon to crash with the following stack trace when the query was started on the executor.

      #0  SubstituteArg (value=..., this=0x7f86cec79d30) at ../gutil/strings/substitute.h:79
      #1  impala::FragmentState::FragmentState (this=0x35c78f40, query_state=0x7972db00, fragment=..., 
          fragment_ctx=<error reading variable: Cannot access memory at address 0x35c78f88>) at fragment-state.cc:143
      #2  0x00000000013019aa in impala::FragmentState::CreateFragmentStateMap (fragment_info=..., exec_request=..., 
          state=state@entry=0x7972db00, fragment_map=...) at fragment-state.cc:47
      #3  0x0000000001292d71 in impala::QueryState::StartFInstances (this=this@entry=0x7972db00) at query-state.cc:820
      #4  0x0000000001284810 in impala::QueryExecMgr::ExecuteQueryHelper (this=0x11943b00, qs=0x7972db00)
          at query-exec-mgr.cc:162
      #5  0x0000000001752915 in operator() (this=0x7f86cec7ab40)
          at ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/function/function_template.hpp:770
      #6  impala::Thread::SuperviseThread(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*) (name=..., category=..., functor=..., 
          parent_thread_info=<optimized out>, thread_started=0x7f87b7b9acb0) at thread.cc:360
      #7  0x0000000001753c9b in operator()<void (*)(const std::__cxx11::basic_string<char>&, const std::__cxx11::basic_string<char>&, boost::function<void()>, const impala::ThreadDebugInfo*, impala::Promise<long int>*), boost::_bi::list0> (
          a=<synthetic pointer>, f=@0x1f66f3b8: <error reading variable>, this=0x1f66f3c0)
          at ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/bind/bind.hpp:531
      #8  operator() (this=0x1f66f3b8)
          at ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/bind/bind.hpp:1222
      #9  boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*), boost::_bi::list5<boost::_bi::value<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> > > >::run() (this=0x1f66f200)
          at ../../../toolchain/toolchain-packages-gcc7.5.0/boost-1.61.0-p2/include/boost/thread/detail/thread.hpp:116
      #10 0x0000000001fb4322 in thread_proxy ()
      #11 0x00007f98af288ea5 in start_thread () from /lib64/libpthread.so.0
      #12 0x00007f98ac2dfb0d in gnu_dev_makedev () from /lib64/libc.so.6
      #13 0x0000000000000000 in ?? ()
      

      Note that this issue occurred when extra load was added to the Impala cluster, which caused a large number of RPC failures.
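
      The crash above happens while fragment state is being built from a request that reports zero instances. One possible mitigation is to check the request for consistency on the executor before any fragment state is created, and to fail the query instead of crashing. The following is a minimal sketch of such a check, not the actual fix. ExecRequestCounts and ValidateExecRequest are hypothetical names introduced for illustration, and the four counts are an assumed simplification of what the Thrift and RPC payloads carry.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, simplified view of the counts carried by one exec request.
// These fields are assumptions for illustration, not Impala's real types.
struct ExecRequestCounts {
  std::size_t num_fragments;      // plan fragments in the Thrift payload
  std::size_t num_fragment_ctxs;  // per-fragment contexts in the RPC request
  std::size_t num_instances;      // fragment instances in the Thrift payload
  std::size_t num_instance_ctxs;  // per-instance contexts in the RPC request
};

// Returns an empty string when the counts are consistent. Otherwise returns
// the reason the request should be rejected before fragment state is built.
std::string ValidateExecRequest(const ExecRequestCounts& c) {
  if (c.num_instances == 0) {
    return "request carries zero fragment instances";
  }
  if (c.num_fragments == 0) {
    return "request carries zero plan fragments";
  }
  if (c.num_fragments != c.num_fragment_ctxs) {
    return "fragment count does not match fragment context count";
  }
  if (c.num_instances != c.num_instance_ctxs) {
    return "instance count does not match instance context count";
  }
  return "";
}
```

      With a check like this, a request matching the log line above (#instances=0) would be rejected with an error message rather than reaching the code that dereferences a missing fragment context.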

            People

              Assignee: Wenzhe Zhou (wzhou)
              Reporter: Wenzhe Zhou (wzhou)