Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
- Fix Version: 0.94.0
- Component: None
- Hadoop Flags: Reviewed
Description
This is a serious problem we found in 0.94, and you can reproduce it on trunk using the test case I uploaded.
When we do a compaction, we use region.getSmallestReadPoint() to keep MVCC for open scanners.
However, this corrupts data after a major compaction, because we skip the delete-type KV but keep the put-type KV in the compacted storefile.
The reason, from the code:
In StoreFileScanner, enforceMVCC is false during compaction, so we can read the delete-type KV.
However, we skip this delete-type KV in ScanQueryMatcher because of the following code:

if (kv.isDelete()) {
  ...
  if (includeDeleteMarker && kv.getMemstoreTS() <= maxReadPointToTrackVersions) {
    System.out.println("add deletes,maxReadPointToTrackVersions=" + maxReadPointToTrackVersions);
    this.deletes.add(bytes, offset, qualLength, timestamp, type);
  }
  ...
}

Here maxReadPointToTrackVersions = region.getSmallestReadPoint(), and kv.getMemstoreTS() > maxReadPointToTrackVersions, so we won't add this KV to the DeleteTracker.
Why does the test case pass if we remove the line MultiVersionConsistencyControl.setThreadReadPoint(smallestReadPoint)?
Because of StoreFileScanner#skipKVsNewerThanReadpoint:

if (cur.getMemstoreTS() <= readPoint) {
  cur.setMemstoreTS(0);
}

So if we remove that line, readPoint is Long.MAX_VALUE, every memstore TS is reset to 0, and the KV is then added to the DeleteTracker in ScanQueryMatcher.
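The interaction above can be sketched with a tiny, self-contained toy model (hypothetical class and method names, not actual HBase code): resetting the memstore TS to 0 for any KV at or below the thread read point means that with a read point of Long.MAX_VALUE every KV trivially passes the later `<=` check.

```java
// Toy model (not HBase code) of the skipKVsNewerThanReadpoint effect:
// KVs at or below the read point get their memstoreTS reset to 0.
public class ReadPointModel {
    static long adjustMemstoreTS(long memstoreTS, long readPoint) {
        return (memstoreTS <= readPoint) ? 0L : memstoreTS;
    }

    // With the thread read point left at Long.MAX_VALUE (i.e. the
    // setThreadReadPoint line removed), every memstoreTS becomes 0, so the
    // kv.getMemstoreTS() <= maxReadPointToTrackVersions check always passes
    // and the delete marker is tracked. With the smallest read point set, a
    // newer marker keeps its TS and fails the check.
    static boolean trackedInDeleteTracker(long memstoreTS, long readPoint,
                                          long maxReadPointToTrackVersions) {
        return adjustMemstoreTS(memstoreTS, readPoint) <= maxReadPointToTrackVersions;
    }
}
```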
Solution:
We use the region's smallestReadPoint during compaction to keep MVCC for OPEN scanners, so we should retain delete-type KVs in the output in this case (an already-deleted KV is retained in the output so that an older open scanner can still read it), even during a major compaction.
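The proposed retention rule can be sketched as a small illustrative model (invented names, not actual HBase code): during a major compaction any KV, put or delete marker, whose memstoreTS is above the region's smallest read point must survive for the benefit of already-open scanners.

```java
// Illustrative retention rule for major compaction (hypothetical model,
// not actual HBase code).
public class CompactionRetention {
    enum Type { PUT, DELETE }

    /** Whether a KV must survive a major compaction. */
    static boolean retain(Type type, long memstoreTS, long smallestReadPoint,
                          boolean shadowedByVisibleDelete) {
        // KVs newer than the smallest read point are kept unconditionally,
        // including delete markers (the fix proposed in this issue).
        if (memstoreTS > smallestReadPoint) {
            return true;
        }
        // Older delete markers, and the puts they shadow, can be collected.
        if (type == Type.DELETE) {
            return false;
        }
        return !shadowedByVisibleDelete;
    }
}
```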
Attachments
- HBASE-6311-test.patch (2 kB, Chunhui Shen)
- HBASE-6311v1.patch (3 kB, Chunhui Shen)
- HBASE-6311v2.patch (3 kB, Chunhui Shen)
Activity
I think I have found the reason:
In StoreFileScanner, enforceMVCC is false during compaction, so we can read the delete-type KV.
However, we skip this delete-type KV in ScanQueryMatcher because of the following code:

if (kv.isDelete()) {
  ...
  if (includeDeleteMarker && kv.getMemstoreTS() <= maxReadPointToTrackVersions) {
    System.out.println("add deletes,maxReadPointToTrackVersions=" + maxReadPointToTrackVersions);
    this.deletes.add(bytes, offset, qualLength, timestamp, type);
  }
  ...
}

Here maxReadPointToTrackVersions = region.getSmallestReadPoint(), and kv.getMemstoreTS() > maxReadPointToTrackVersions, so we won't add this KV to the DeleteTracker.
Why does the test case pass if we remove the line MultiVersionConsistencyControl.setThreadReadPoint(smallestReadPoint)?
Because of StoreFileScanner#skipKVsNewerThanReadpoint:

if (cur.getMemstoreTS() <= readPoint) {
  cur.setMemstoreTS(0);
}

So if we remove that line, readPoint is Long.MAX_VALUE, every memstore TS is reset to 0, and the KV is then added to the DeleteTracker in ScanQueryMatcher.
Calling MultiVersionConsistencyControl.setThreadReadPoint(smallestReadPoint) in the compaction is meant to ensure scanners keep MVCC after compaction.
But I don't know why we added the condition (kv.getMemstoreTS() <= maxReadPointToTrackVersions) in ScanQueryMatcher:
if (includeDeleteMarker && kv.getMemstoreTS() <= maxReadPointToTrackVersions) {
  this.deletes.add(bytes, offset, qualLength, timestamp, type);
}

/** readPoint over which the KVs are unconditionally included */
protected long maxReadPointToTrackVersions;

I think it is not consistent with the javadoc of this variable.
But I don't know why we added the condition (kv.getMemstoreTS() <= maxReadPointToTrackVersions) in ScanQueryMatcher
I'm clear about this now: we should let an old scanner see the KV even if it has already been deleted.
So I think we could fix this issue as follows:

if (kv.isDelete()) {
  if (kv.getMemstoreTS() > maxReadPointToTrackVersions) {
    return MatchCode.INCLUDE;
  }
}
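To make the failure scenario concrete, here is a toy MVCC visibility model (invented names, not HBase code): a scanner with read point R sees the row iff a put with memstoreTS <= R exists and no delete marker with memstoreTS <= R masks it. Dropping only the delete marker during compaction, as the bug does, makes a later scanner wrongly see the deleted row.

```java
import java.util.List;

// Toy MVCC visibility model (hypothetical, not HBase code).
public class MvccModel {
    static final class KV {
        final boolean isDelete;
        final long memstoreTS;
        KV(boolean isDelete, long memstoreTS) {
            this.isDelete = isDelete;
            this.memstoreTS = memstoreTS;
        }
    }

    /** A scanner with the given read point sees the row iff it sees a put
     *  and does not see a delete marker masking it. */
    static boolean rowVisible(List<KV> storeFile, long readPoint) {
        boolean sawPut = false, sawDelete = false;
        for (KV kv : storeFile) {
            if (kv.memstoreTS > readPoint) continue; // invisible per MVCC
            if (kv.isDelete) sawDelete = true; else sawPut = true;
        }
        return sawPut && !sawDelete;
    }
}
```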
@Chunhui
Very good catch and good analysis. But this patch will now write the delete marker even during a major compaction? Ideally the expected behaviour is that neither the put nor the delete should be written during a major compaction.
So we need to add this to this.deletes. What do you feel, Chunhui?
bq. So we need to add this to this.deletes
What I meant is that we need to add it to this.deletes so that we can track any put that may come and decide whether we should skip it or not.
@ram
But this patch will now write the delete marker even during a major compaction?
Yes, we write the delete marker only if kv.getMemstoreTS() > maxReadPointToTrackVersions, because in that case the already-deleted put KV is also retained in the output.
Ideally the expected behaviour is that the put and the delete should both not be written during a major compaction.
Assume a scanner reads the put-type KV first, and then we delete that KV; per MVCC this scanner should still be able to read the KV. If we don't write the put and delete KVs during a major compaction, this scanner can't read the put-type KV, which breaks MVCC. Is that right?
The code in ScanQueryMatcher under discussion was introduced in HBASE-5569.
Assume a scanner reads the put-type KV first, and then we delete that KV; per MVCC this scanner should still be able to read the KV. If we don't write the put and delete KVs during a major compaction, this scanner can't read the put-type KV, which breaks MVCC. Is that right?
This raises a question about MVCC. What you are saying is: suppose a read started and at that time row r1 was there. Later this row is deleted by some other client. Still, per MVCC, this reader will be able to read this row.
But if, between the reads (after the delete), a flush and a major compaction happen, and only then does the reader reach row r1, it will not see r1. Correct, Chunhui? Is this what you are saying?
[Not sure whether these actions can really happen, but just saying theoretically.]
That makes your point valid, Chunhui, IMO.
In Chunhui's test case, if the first scanner [scanner1] calls next() for the first time after the major compaction, it should still be able to see the row.
If we move scanner1.next() to the end, we get the result only because of the bug.
Ideally the major compaction again picks up the PUT that was deleted and writes it into the compacted file, so the first scanner can read the data.
If the first scanner [scanner1] calls next() for the first time after the major compaction, it should still be able to see the row
I consider this correct per MVCC, because the scanner could read the row when it was opened.
Is that so per MVCC?
I am also in line with Chunhui on this. Per MVCC the scanner should be able to see this row.
Per MVCC, the behaviour that the first scanner should still read the row is correct. But when the compaction is done and, in this case, the put and delete are removed, we have a new set of store files. Now, when starting a scan, we already have this logic:

public synchronized boolean next(List<KeyValue> outResult, int limit) throws IOException {
  if (checkReseek()) {
    return true;
  }
  ...

So now I have doubts about this behaviour. Please correct me if I am wrong.
But when the compaction is done and, in this case, the put and delete are removed
In this case, after the compaction, the put is retained and the delete is dropped without the patch, which corrupts the data. With the patch, both the put and the delete are retained after compaction, so the first scanner can also read this row per MVCC.
@Chunhui
I did not mean your patch does it. I am just asking what happens if the put/delete gets removed and we end up with an empty file. I was just trying to make some changes to your patch and the test case.
@ram
In the current logic, in order to keep MVCC for the first scanner, we retain the already-deleted KV in the compacted file.
What my patch does is also retain the delete-type KV in the compacted file when that happens:

if (kv.getMemstoreTS() > maxReadPointToTrackVersions) {
  return MatchCode.INCLUDE;
}

The delete-type KV is included by the code above.
I am just asking what happens if the put/delete gets removed and we end up with an empty file.
We shouldn't end up with an empty file, because the put-type KV must remain readable by earlier-opened scanners per MVCC.
@All,
Could someone take a look at this? It seems important with respect to MVCC.
@ram
What doubt do you have about my patch v2?
I updated the test case to verify MVCC for scanners after a major compaction.
@Chunhui
I am clear with your patch. Your patch tries to keep the MVCC concepts intact, and that is what is needed. No problem. Anoop has also reviewed it.
I just wanted others to review this because now, even on a major compaction, we create a file with a delete marker when kv.getMemstoreTS() > maxReadPointToTrackVersions.
But in the normal case we will not write delete markers on a major compaction. Is this OK?
But in a normal case we will not write delete markers on a major compaction
Yes, that's so.
Conclusion and patch look good to me. I guess in HBASE-5569 I only fixed half the problem (or created it?).
I'll take a closer look tomorrow.
I should note that this all is needed for HBASE-3584 (multi op transactions).
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12535148/HBASE-6311v2.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
+1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).
-1 findbugs. The patch appears to introduce 7 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.regionserver.TestAtomicOperation
org.apache.hadoop.hbase.replication.TestReplication
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2325//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2325//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2325//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2325//console
This message is automatically generated.
if (includeDeleteMarker && kv.getMemstoreTS() <= maxReadPointToTrackVersions) {
  this.deletes.add(bytes, offset, qualLength, timestamp, type);
}
The check here is needed so that deleted rows are not pulled out from under a scanner that started before the rows were deleted (needed for HBASE-3584).
So if kv.getMemstoreTS() > maxReadPointToTrackVersions, the effect of the delete marker is ignored, and in the case of a major compaction that delete marker should indeed be included.
The only concern I'd have: When would the delete marker ever be collected?
@Lars
So in case of the major compaction that delete marker should indeed be included
For this case, we retain both the put and the delete marker in the compacted file, so the scanner you mentioned above can read the row, and the data is also correct.
Yes. I was just restating the problem.
How about my last question: When will the delete markers now be collected?
When will the delete markers now be collected?
Do you mean collected into the compacted file, or into the DeleteTracker in ScanQueryMatcher?
We won't collect delete markers into the compacted file during a major compaction unless the deleted KV is also collected into the compacted file.
For example, going through the code: if a delete marker is not collected into the DeleteTracker because kv.getMemstoreTS() > maxReadPointToTrackVersions, it will be collected into the compacted file.
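The dichotomy described above can be stated as a tiny sketch (hypothetical names, not HBase code): during a major compaction a delete marker is either tracked in the DeleteTracker and dropped along with the puts it shadows, or written to the compacted file; it is never silently discarded.

```java
// Hypothetical sketch of the two possible fates of a delete marker during
// a major compaction (not actual HBase code).
public class DeleteMarkerFate {
    enum Fate { TRACKED_AND_DROPPED, WRITTEN_TO_OUTPUT }

    static Fate fate(long memstoreTS, long maxReadPointToTrackVersions) {
        // Tracked (and collectible) only when every open scanner already
        // sees its effect; otherwise retained in the output file.
        return memstoreTS <= maxReadPointToTrackVersions
            ? Fate.TRACKED_AND_DROPPED
            : Fate.WRITTEN_TO_OUTPUT;
    }
}
```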
Never mind, re-reading my question, it does not make sense.
Older delete markers will eventually be collected during a major compaction, so all is good.
+1 on patch. If there are no objections I'll commit this tomorrow.
I think this warrants a quicker new release.
Lars, if you are OK with retaining the deletes on major compaction, I am +1 on this patch too.
We're only retaining them as long as an OPEN scanner could potentially be affected by them (kv.getMemstoreTS() > maxReadPointToTrackVersions), so should be OK. In fact we're doing the same for all other KVs as well. In HBASE-5569 I attempted to extend that logic to deleted rows, but I forgot the delete markers.
I also ran TestReplication locally. Passes. Going to commit.
Thanks for the patch Chunhui and the review/discussion Ram!
Integrated in HBase-0.94 #302 (See https://builds.apache.org/job/HBase-0.94/302/)
HBASE-6311 Data error after majorCompaction caused by keeping MVCC for opened scanners (chunhui shen) (Revision 1358333)
Result = FAILURE
larsh :
Files :
- /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java
- /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
Integrated in HBase-0.94-security #39 (See https://builds.apache.org/job/HBase-0.94-security/39/)
HBASE-6311 Data error after majorCompaction caused by keeping MVCC for opened scanners (chunhui shen) (Revision 1358333)
Result = SUCCESS
larsh :
Files :
- /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java
- /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
Integrated in HBase-0.94-security-on-Hadoop-23 #6 (See https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/6/)
HBASE-6311 Data error after majorCompaction caused by keeping MVCC for opened scanners (chunhui shen) (Revision 1358333)
Result = FAILURE
larsh :
Files :
- /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/ScanQueryMatcher.java
- /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java
Store#compactStore(final Collection<StoreFile> filesToCompact, final boolean majorCompaction, final long maxId)
Is setting the read point here the issue? With this line removed I am able to pass your test case. smallestReadPoint should ideally be used only for resetting the memstoreTS. [Any other usages?]
Obviously it should not be used for deciding what to scan. A compaction needs to scan everything from the files.