  1. Cassandra
  2. CASSANDRA-12756

Duplicate (CQL) rows for the same primary key


Details

    • Bug
    • Status: Resolved
    • Low
    • Resolution: Duplicate
    • None
    • None
    • Linux, Cassandra 3.7 (upgraded at one point from 2.?).

    • Low

    Description

      I observe what looks like duplicates when I run CQL queries against a table. They only show up for rows written during a couple of hours on a specific date, but they appear in several partitions and for several clustering keys in each partition during that time range.

      We've loaded data in two ways.
      1) through a normal insert
      2) through sstableloader with sstables created using update-statements (to append to the map) and an older version of SSTableWriter. During this process several months of data were re-loaded.

      The table DDL is

      create statement
      CREATE TABLE climate.climate_1510 (
          installation_id bigint,
          node_id bigint,
          time_bucket int,
          gateway_time timestamp,
          humidity map<int, float>,
          temperature map<int, float>,
          PRIMARY KEY ((installation_id, node_id, time_bucket), gateway_time)
      ) WITH CLUSTERING ORDER BY (gateway_time DESC)
          AND bloom_filter_fp_chance = 0.01
          AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
          AND comment = ''
          AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
          AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
          AND crc_check_chance = 1.0
          AND dclocal_read_repair_chance = 0.1
          AND default_time_to_live = 0
          AND gc_grace_seconds = 864000
          AND max_index_interval = 2048
          AND memtable_flush_period_in_ms = 0
          AND min_index_interval = 128
          AND read_repair_chance = 0.0
          AND speculative_retry = '99PERCENTILE';
      

      and the result from the SELECT is

      cql output
      > select * from climate.climate_1510 where installation_id = 133235 and node_id = 35453983 and time_bucket = 189 and gateway_time > '2016-08-10 20:00:00' and gateway_time < '2016-08-10 21:00:00' ;
      
       installation_id | node_id  | time_bucket | gateway_time             | humidity | temperature
      -----------------+----------+-------------+--------------------------+----------+---------------
                133235 | 35453983 |         189 | 20160810 20:23:28.000000 |  {0: 51} | {0: 24.37891}
                133235 | 35453983 |         189 | 20160810 20:23:28.000000 |  {0: 51} | {0: 24.37891}
                133235 | 35453983 |         189 | 20160810 20:23:28.000000 |  {0: 51} | {0: 24.37891}
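
A minimal sketch of how the duplication above could be checked for mechanically, assuming the result rows have already been fetched and reduced to their primary-key columns. The helper name `duplicated_keys` is hypothetical, and the key values are copied from the output above.

```python
from collections import Counter

# Primary-key columns of the three rows returned by the SELECT above:
# (installation_id, node_id, time_bucket, gateway_time).
rows = [
    (133235, 35453983, 189, "20160810 20:23:28.000000"),
    (133235, 35453983, 189, "20160810 20:23:28.000000"),
    (133235, 35453983, 189, "20160810 20:23:28.000000"),
]


def duplicated_keys(rows):
    """Return {primary_key: count} for every key that occurs more than once."""
    counts = Counter(rows)
    return {key: n for key, n in counts.items() if n > 1}


dupes = duplicated_keys(rows)
for key, n in dupes.items():
    print(f"{key} returned {n} times")
```

A primary key identifies exactly one row, so a healthy result set would leave `dupes` empty.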
      

      I used Andrew Tolbert's sstable-tools to dump the JSON for this specific time, and this is what I found.

      json dump
      [133235:35453983:189] Row[info=[ts=1470878906618000] ]: gateway_time=2016-08-10 22:23+0200 | del(humidity)=deletedAt=1470878906617999, localDeletion=1470878906, [humidity[0]=51.0 ts=1470878906618000], del(temperature)=deletedAt=1470878906617999, localDeletion=1470878906, [temperature[0]=24.378906 ts=1470878906618000]
      [133235:35453983:189] Row[info=[ts=-9223372036854775808] del=deletedAt=1470864506441999, localDeletion=1470864506 ]: gateway_time=2016-08-10 22:23+0200 | , [humidity[0]=51.0 ts=1470878906618000], , [temperature[0]=24.378906 ts=1470878906618000]
      [133235:35453983:189] Row[info=[ts=-9223372036854775808] del=deletedAt=1470868106489000, localDeletion=1470868106 ]: gateway_time=2016-08-10 22:23+0200 | 
      [133235:35453983:189] Row[info=[ts=-9223372036854775808] del=deletedAt=1470871706530999, localDeletion=1470871706 ]: gateway_time=2016-08-10 22:23+0200 | 
      [133235:35453983:189] Row[info=[ts=-9223372036854775808] del=deletedAt=1470878906617999, localDeletion=1470878906 ]: gateway_time=2016-08-10 22:23+0200 | , [humidity[0]=51.0 ts=1470878906618000], , [temperature[0]=24.378906 ts=1470878906618000]
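
To make the timestamps in the dump easier to read, here is a small sketch that converts them to UTC. It assumes the `ts` and `deletedAt` values are microseconds since the Unix epoch, which is consistent with the `localDeletion` values (in seconds) printed next to them. The helper `to_utc` and the variable names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc(micros):
    """Convert microseconds since the Unix epoch to an aware UTC datetime."""
    return EPOCH + timedelta(microseconds=micros)


# Row-level deletedAt values from the dump, in the order they appear.
row_deletions = [
    1470864506441999,
    1470868106489000,
    1470871706530999,
    1470878906617999,
]
# Write timestamp of the humidity/temperature cells.
cell_write = 1470878906618000

deletion_times = [to_utc(t) for t in row_deletions]
# Seconds elapsed between consecutive row deletions.
gaps_seconds = [(b - a) / 1_000_000 for a, b in zip(row_deletions, row_deletions[1:])]

for t in deletion_times:
    print("row deleted at", t.isoformat())
print("gaps between deletions (s):", gaps_seconds)
print("cells written", cell_write - row_deletions[-1], "microsecond(s) after the last deletion")
```

Under that assumption, the row deletions are spaced roughly one or two hours apart, and the surviving cells carry a write timestamp one microsecond later than the newest deletion.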
      

      From my understanding this should be impossible. Even if we have duplicates in the sstables (which is normal), they should be merged into a single row per primary key before being returned to the client.
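
The expectation in the paragraph above can be sketched as a simplified last-write-wins merge. This is an illustration of the expected behaviour only, not Cassandra's actual read path: the `Fragment` class, the `merge` function and the `NO_DELETION` sentinel are hypothetical, and the per-collection deletions (`del(humidity)`, `del(temperature)`) from the dump are left out for brevity.

```python
from dataclasses import dataclass, field

NO_DELETION = -2**63  # hypothetical sentinel meaning "no row tombstone"


@dataclass
class Fragment:
    """One version of a row as found in a single sstable (simplified)."""
    clustering: str
    row_deleted_at: int = NO_DELETION
    cells: dict = field(default_factory=dict)  # cell name -> (value, write_ts)


def merge(fragments):
    """Collapse fragments to one row per clustering key, newest write wins."""
    merged = {}
    for f in fragments:
        row = merged.setdefault(f.clustering, Fragment(f.clustering))
        # Keep the newest row tombstone seen so far.
        row.row_deleted_at = max(row.row_deleted_at, f.row_deleted_at)
        # Keep the newest version of each cell.
        for name, (value, ts) in f.cells.items():
            if name not in row.cells or ts > row.cells[name][1]:
                row.cells[name] = (value, ts)
    for row in merged.values():
        # A cell survives only if it was written after the row tombstone.
        row.cells = {n: c for n, c in row.cells.items() if c[1] > row.row_deleted_at}
    return list(merged.values())


cells = {"humidity[0]": (51.0, 1470878906618000),
         "temperature[0]": (24.378906, 1470878906618000)}
gateway_time = "2016-08-10 22:23+0200"

# The five fragments from the dump above.
fragments = [
    Fragment(gateway_time, cells=dict(cells)),
    Fragment(gateway_time, 1470864506441999, dict(cells)),
    Fragment(gateway_time, 1470868106489000),
    Fragment(gateway_time, 1470871706530999),
    Fragment(gateway_time, 1470878906617999, dict(cells)),
]

result = merge(fragments)
print(len(result), "row(s) after merge:", result[0].cells)
```

Under this simplified model the five fragments collapse into a single live row, which is what I would expect the SELECT to return.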

      I'm happy to add details to this bug if anything is missing.

            People

              Assignee: Unassigned
              Reporter: Andreas Wederbrand (wederbrand)