  1. Cassandra
  2. CASSANDRA-5205

The first three Cassandra nodes are very busy and GC pauses stop the world (real production environment experience)

Details

    • Improvement
    • Status: Resolved
    • Low
    • Resolution: Invalid
    • 1.1.10
    • None
    • None
    • Cassandra 1.1.5 release
      CentOS 5.5
      JDK 1.7u9
      VMware ESXi-based VMs: 30 GB RAM, 4*4-core CPU
      Hardware: Dell R720, 2*6-core CPU, 128 GB RAM, hosting 3 nodes as above
      Data hosted by each node: about 8 GB

    Description

      Hi all,

      I previously had 10 nodes, all on CentOS VMs with 16 GB RAM and 8-core CPUs, running Cassandra 1.1.5 with a single user keyspace (RF=3). Heap: 8 GB old generation, 2 GB new generation.

      Problems:
      1. The first three nodes (starting from token 0) are very busy all the time, while the other 7 nodes seem to have nothing to do; both their CPU and RAM are mostly idle.

      2. JVM memory usage on all of the first three nodes grows rapidly, and CMS GC fires nearly every second.

      3. When GC happens, the whole node seems to stop. Running nodetool on one of the first three nodes hangs; running it on the other 7 nodes shows the first three nodes as down.

      4. When GC finishes, the node comes back, but it goes away again within minutes.

      5. If I kill the Java process and reboot the frozen node, it comes up within minutes, but JVM memory fills up again within minutes as well, and everything above repeats.

      6. Even if only one of the first three nodes is frozen, client requests fail, although my client requests use CL=QUORUM. I am using the Hector client library.

      7. Disabling the Thrift API on the three nodes changed nothing.
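To make points 1 and 6 easier to reason about, here is a minimal sketch of the ring arithmetic involved. It assumes RandomPartitioner (token space 0 to 2**127) and SimpleStrategy-style placement, where a range's replicas are its owner plus the next RF-1 nodes clockwise; neither assumption is confirmed by the report, and the function names are purely illustrative.

```python
# Sketch only. Assumptions (not confirmed by the report): RandomPartitioner
# with tokens in 0 .. 2**127, and SimpleStrategy-style replica placement.

def balanced_tokens(node_count):
    """Evenly spaced initial tokens for a ring of node_count nodes."""
    return [i * (2 ** 127) // node_count for i in range(node_count)]

def quorum(replication_factor):
    """Number of replicas that must respond at CL=QUORUM."""
    return replication_factor // 2 + 1

def replicas(owner_index, node_count, replication_factor):
    """Owner of a range plus the next RF-1 nodes clockwise on the ring."""
    return [(owner_index + k) % node_count for k in range(replication_factor)]

def ranges_unavailable(node_count, replication_factor, down_nodes):
    """Primary ranges (by owner index) that cannot reach QUORUM."""
    need = quorum(replication_factor)
    failed = []
    for owner in range(node_count):
        alive = [n for n in replicas(owner, node_count, replication_factor)
                 if n not in down_nodes]
        if len(alive) < need:
            failed.append(owner)
    return failed

if __name__ == "__main__":
    print(quorum(3))                           # replicas needed with RF=3
    print(ranges_unavailable(10, 3, {0}))      # one node frozen
    print(ranges_unavailable(10, 3, {0, 1}))   # two adjacent nodes frozen
```

Under these assumptions, a single frozen node leaves every range with two live replicas, which is enough for QUORUM at RF=3, while two adjacent frozen nodes make some ranges unreachable. If failures really occur with only one node frozen, the cause may lie elsewhere, for example in which node the client uses as coordinator; that is a guess, not something the report establishes.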

      ############ changes ############
      0. Stopped incoming user requests (stopped our user service to take the load off Cassandra).
      1. Decommissioned 4 nodes (one by one).
      2. Moved tokens to balance the remaining 6 nodes (one by one).
      3. Changed the resources of the remaining 6 nodes to 30 GB RAM and 16-core CPU, heap (16 GB old, 4 GB new).
      4. Enabled JNA.
      5. Ran a major compaction and a repair on the 6 nodes.
      6. Started the new cluster.
      7. Everything seemed fine at first, but after 5 hours all of the problems came back.
      8. Because we now have double the RAM, the freeze-and-recover cycle repeats hourly.
      9. JVM opts: -ea -javaagent:./../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms16G -Xmx16G -Xmn4G -Xss180k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true -Djava.rmi.server.hostname=10.0.0.22 -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true
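As a rough aid for reading the JVM options in step 9, the sketch below applies the usual HotSpot sizing arithmetic to the quoted flags. It assumes those flags are the ones actually in effect; the helper name and the returned keys are illustrative only. Note that -Xmx is the total heap, so with -Xmx16G and -Xmn4G the old generation would be about 12 GB rather than 16 GB.

```python
# Sketch only: generation sizes implied by -Xmx16G -Xmn4G -XX:SurvivorRatio=8
# -XX:CMSInitiatingOccupancyFraction=70, assuming these flags are in effect.
GIB = 1024 ** 3

def heap_layout(xmx_gib, xmn_gib, survivor_ratio, cms_fraction_pct):
    """Return approximate generation sizes in bytes."""
    young = xmn_gib * GIB
    old = xmx_gib * GIB - young            # -Xmx is the total heap
    # SurvivorRatio = eden / one survivor space; young = eden + 2 survivors
    survivor = young // (survivor_ratio + 2)
    eden = young - 2 * survivor
    # With UseCMSInitiatingOccupancyOnly, a CMS cycle starts once the old
    # generation reaches this occupancy.
    cms_trigger = old * cms_fraction_pct // 100
    return {"young": young, "eden": eden, "survivor": survivor,
            "old": old, "cms_trigger": cms_trigger}

if __name__ == "__main__":
    layout = heap_layout(xmx_gib=16, xmn_gib=4, survivor_ratio=8,
                         cms_fraction_pct=70)
    for name, size in layout.items():
        print(f"{name:12s} {size / GIB:6.2f} GiB")
```

With these inputs the CMS cycle would start at roughly 8.4 GiB of old-generation occupancy, and each survivor space is roughly 0.4 GiB. Combined with -XX:MaxTenuringThreshold=1, objects are promoted to the old generation after surviving very few young collections, which may help explain how quickly the old generation fills; that link is a hypothesis rather than a diagnosis.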

      Some screenshots are attached.

      Attachments

        1. the-trouble-maker-node.jpg
          82 kB
          sunjian
        2. the-normal-free-node-no-presure.jpg
          81 kB
          sunjian

          People

            Assignee: Unassigned
            Reporter: sunjian (sun74533)
            Votes: 0
            Watchers: 1
