Details

    Description

      On big repositories compaction can take quite a while to run, as it needs to create a full deep copy of the current root node state. For such cases it could be beneficial to compact the repository partially, splitting a full compaction over multiple cycles.
      Partial compaction would run compaction on a sub-tree just like we now run it on the full tree. Afterwards it would create a new root node state by referencing the previous root node state, replacing said sub-tree with the compacted one.

      Todo: Assess feasibility and impact; implement a prototype.
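A minimal sketch of the idea above, using a hypothetical dict-based tree (not the actual Oak NodeState API): compacting a sub-tree deep-copies only that sub-tree, while the new root keeps referencing the previous root's other children unchanged.

```python
# Hypothetical dict-based node model; not the real Oak NodeState API.

def compact(node):
    """Deep copy of a (sub-)tree, standing in for full compaction."""
    return {name: compact(child) for name, child in node.items()}

def compact_subtree(root, name):
    """Partial compaction: create a new root that references the previous
    root's children, replacing only the named sub-tree with its compacted
    copy."""
    new_root = dict(root)                # shallow copy: shares all children
    new_root[name] = compact(root[name]) # except the compacted sub-tree
    return new_root
```

Only the compacted sub-tree is rewritten; every other child of the new root is shared with the previous root by reference, which is what makes splitting full compaction over multiple cycles attractive.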

      Attachments

        1. compaction-time.png
          37 kB
          Michael Dürig
        2. cycle-count.png
          36 kB
          Michael Dürig
        3. post-gc-size.png
          33 kB
          Michael Dürig

        Issue Links

          Activity

            mduerig Michael Dürig added a comment - See my comment on OAK-4122 on how such an approach could look.
            mduerig Michael Dürig added a comment - edited

            An alternative approach (let's call it tail compaction) would be to rebase some revisions on a previous, already compacted revision instead of rebasing all revisions on the empty revision (as we currently do): Let r0 be an initial, compact revision and r1, r2, r3 be subsequent revisions in that order created by normal repository operation. The current compaction approach rewrites r3, which also rewrites everything in r0 (which is already compact). Instead we could rebase the difference between r0 and r3 onto r0, effectively only rewriting the changes that came in with r1, r2 and r3. The latter approach should be much lighter as it only rewrites recent changes, leaving alone everything that was already compacted previously.

            stillalex Alex Deparvu added a comment -

            I think a prerequisite is to add the revision to the gc.log, so that the 'last compacted state' is easily available.


            Initial POC for tail compaction at https://github.com/mduerig/jackrabbit-oak/commits/OAK-3349-POC.

            The following graphs show the effect of this approach in SegmentCompactionIT through 20-something hourly gc runs.

            Post compaction size is comparable up until the 12th run. At this point tail compaction probably starts to accumulate too much "previous compacted state". A full compaction might be beneficial here.

            Looking at compaction time, tail compaction is actually much faster! Only with the 20th run does it seem to start to degenerate. (Later on it seems to recover... I still have the test running and will post further graphs once I have the data.)

            The graph of the number of cycles per gc run confirms the above picture: tail compaction only went into a force compact (6th cycle) once before the 20th run, whereas the baseline did so much earlier.

            mduerig Michael Dürig added a comment

            Turns out that the patch has a bug causing still referenced external binaries to be collected by the DSGC. Working on a test/fix.

            mduerig Michael Dürig added a comment

            Fixed at: https://github.com/mduerig/jackrabbit-oak/commit/aad097d4c241651abefda501455970438625999b

            The problem was that the collection of blob references did not take the new even/odd scheme of the segments under tail compaction into account. The fix passes an explicit predicate encoding this to the actual collector instead of having an ad-hoc implementation there.

            mduerig Michael Dürig added a comment
            mduerig Michael Dürig added a comment - edited
            Implementation note on tail compaction

            In contrast to the existing compaction approach (full compaction), tail compaction rebases all changes since the last compaction on top of the result of that last compaction. Cleanup subsequently cleans up the uncompacted changes. Each tail compaction cycle creates a new generation, incrementing the generation number. Cleanup removes all non-compacted segments whose generation is no greater than the current generation minus a certain number of retained generations (2 by default).

            To make this work we need to be able to determine the age of a segment (in number of generations) and whether a segment has been written by the compactor or by a regular writer (and is thus uncompacted). The POC implemented this by assigning even generation numbers to regular segments and odd ones to segments written by tail compaction, while at the same time completely removing support for full compaction.

            To combine tail compaction with full compaction I suggest introducing a young generation field in the segment header, which is used by tail compaction as described. The existing generation field will thus keep being used for full compaction without changing its semantics.

            The proposed approach has the advantage that tail and full compaction are completely orthogonal. You can run either of them, or both, without one affecting or influencing the other.
            Both compaction and cleanup solely rely on the information in the segment headers. A predicate for determining which segments to retain can be inferred from the segment containing the head revision. There is no need to rely on auxiliary information, with the small exception of tail compaction using the gc.log file to determine the base revision to compact onto. This is not problematic with regard to resilience though, as we can always fall back to full compaction should the base revision be invalid. (A base revision can be invalid in two ways: either it is not found or it was not written by the compactor. Both cases can only occur after manual tampering with the journal.log.)
            Finally the approach plays well with upgrading: while the additional young generation field requires us to bump the segment version, we can easily maintain backwards compatibility and do a rolling upgrade segment by segment. Segments of the previous version will just not be eligible for cleanup under tail compaction.

            mduerig Michael Dürig added a comment - edited

            adulceanu, frm do the above notes make sense to you? Is there anything fundamental I forgot? Especially re. tooling, upgrade, standby etc.


            Following is a short description of tail / full compaction and the proposed data structure in pseudo code:

            For each segment S we encode in its header: a young generation number y, a full generation number f and a flag c indicating whether the segment was written by tail compaction.

            Full compaction:

            • determine the segment S of the current head state h
            • clone h and increase the full generation number for all written segments T to T.f = S.f + 1

            Tail compaction:

            • determine the segment S of the current head state h
            • determine the head state created by the last compaction (full or tail) h'
            • rebase h on top of h' and increase the young generation number for all written segments T to T.y = S.y + 1

            Cleanup:

            • let r be the number of retained generations
            • determine the segment S of the current head state
            • For each segment T
              • if T.f <= S.f - r reclaim T.
              • if T.y <= S.y - r && T.c == false reclaim T.
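The cleanup rules above can be sketched as a reclamation predicate (a hedged sketch in Python; the Segment fields y, f, c mirror the pseudo code, not actual Oak classes):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    y: int   # young generation (incremented by tail compaction)
    f: int   # full generation (incremented by full compaction)
    c: bool  # written by tail compaction?

def reclaimable(t, head, r=2):
    """True if segment t may be reclaimed, given the segment `head`
    containing the current head state and r retained generations."""
    if t.f <= head.f - r:
        return True                      # too many full generations old
    if t.y <= head.y - r and not t.c:
        return True                      # old, and not tail-compacted
    return False
```

Note that segments written by tail compaction are only ever reclaimed via the full generation rule, which is what keeps a tail-compacted base revision alive across subsequent tail compactions.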

            Below I tried to visualise the evolution of a repository through a couple of tail / full compaction cycles. Segment retention is 2 generations. A sequence of segments is symbolised by a triple in square brackets. The first component indicates the operations that created the segments (w: a regular write operation, tc: a tail compaction, fc: a full compaction). The second component is the young generation pertaining to tail compaction and the third component is the full generation pertaining to full compaction.

            Write content:
            [w,0,0] 
            
            Tail compact:
            [w,0,0] [tc,1,0]
            
            Clean up (no effect since segment retention is 2 young generations):
            [w,0,0] [tc,1,0]
            
            Write content:
            [w,0,0] [tc,1,0] [w,1,0] 
            
            Tail compact:
            [w,0,0] [tc,1,0] [w,1,0] [tc,2,0]
            
            Clean up (remove w segments that are at least 2 young generations old):
            [tc,1,0] [w,1,0] [tc,2,0]
            
            Write content:
            [tc,1,0] [w,1,0] [tc,2,0] [w,2,0]
            
            Tail compact:
            [tc,1,0] [w,1,0] [tc,2,0] [w,2,0] [tc,3,0]
            
            Clean up (remove w segments that are at least 2 young generations old):
            [tc,1,0] [tc,2,0] [w,2,0] [tc,3,0]
            
            ...
            
            After a couple of more write / tail compaction cycles:
            [tc,1,0] [tc,2,0] [tc,3,0] [w,3,0] [tc,4,0] [w,4,0] 
            
            Full compact:
            [tc,1,0] [tc,2,0] [tc,3,0] [w,3,0] [tc,4,0] [w,4,0] [fc,4,1]
            
            Clean up (no effect since segment retention is 2 full generations):
            [tc,1,0] [tc,2,0] [tc,3,0] [w,3,0] [tc,4,0] [w,4,0] [fc,4,1]
            
            Write content:
            [tc,1,0] [tc,2,0] [tc,3,0] [w,3,0] [tc,4,0] [w,4,0] [fc,4,1] [w,4,1]
            
            Tail compact:
            [tc,1,0] [tc,2,0] [tc,3,0] [w,3,0] [tc,4,0] [w,4,0] [fc,4,1] [w,4,1] [tc,5,1]
            
            Clean up (remove w segments that are at least 2 young generations old):
            [tc,1,0] [tc,2,0] [tc,3,0] [tc,4,0] [w,4,0] [fc,4,1] [w,4,1] [tc,5,1]
            
            ...
            
            After a couple of more write / tail compaction cycles:
            [tc,1,0] [tc,2,0] [tc,3,0] [tc,4,0] [fc,4,1] [tc,5,1] [tc,6,1] [w,6,1] [tc,7,1] [w,7,1]
            
            Full compact:
            [tc,1,0] [tc,2,0] [tc,3,0] [tc,4,0] [fc,4,1] [tc,5,1] [tc,6,1] [w,6,1] [tc,7,1] [w,7,1] [fc,7,2]
            
            Clean up (remove all segments that are at least 2 full generations old):
            [fc,4,1] [tc,5,1] [tc,6,1] [w,6,1] [tc,7,1] [w,7,1] [fc,7,2]
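The first few cycles of the evolution above can be replayed with a toy simulation (an illustrative sketch, not Oak code; segments are (op, young, full) triples and retention is hard-coded to 2 generations):

```python
R = 2  # retained generations

def head(segs):
    """Young and full generation of the segment holding the head state."""
    _, y, f = segs[-1]
    return y, f

def write(segs):
    y, f = head(segs)
    return segs + [("w", y, f)]          # regular write: same generations

def tail_compact(segs):
    y, f = head(segs)
    return segs + [("tc", y + 1, f)]     # bump young generation

def full_compact(segs):
    y, f = head(segs)
    return segs + [("fc", y, f + 1)]     # bump full generation

def cleanup(segs):
    y, f = head(segs)
    keep = []
    for op, ty, tf in segs:
        if tf <= f - R:
            continue                     # too many full generations old
        if ty <= y - R and op == "w":    # only uncompacted (w) segments
            continue                     # are reclaimed by the young rule
        keep.append((op, ty, tf))
    return keep
```

Running write / tail compact / clean up cycles on an initial [w,0,0] segment reproduces the triples listed above.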
            
            mduerig Michael Dürig added a comment
            frm Francesco Mari added a comment -

            mduerig, thanks for the example, it helped me a lot to clarify how cleanup interacts with partial compaction. While we discussed offline about the young and full generation, how are we going to encode T.c in the segments? Are we going to use some of the unused space in the header?


            how are we going to encode T.c in the segments?

            I would not use an extra slot in the segment header just for this flag but rather encode it within the generation number. The POC uses an even/odd scheme to encode this within the young generation number (i.e. an odd number indicates a segment written by tail compaction). We could either stick with this or, preferably, use a sign scheme, with a negative sign indicating segments written by tail compaction.
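Both encodings could be sketched as follows (illustrative only; neither function is taken from the Oak code base). One caveat of the sign scheme: generation 0 cannot carry the flag, since -0 == 0.

```python
# Even/odd scheme (as in the POC): odd values mark tail-compacted segments.
def encode_even_odd(generation, tail_compacted):
    return 2 * generation + (1 if tail_compacted else 0)

def decode_even_odd(value):
    return value // 2, value % 2 == 1

# Sign scheme: a negative value marks tail-compacted segments.
# Caveat: generation 0 cannot carry the flag (-0 == 0).
def encode_sign(generation, tail_compacted):
    return -generation if tail_compacted else generation

def decode_sign(value):
    return abs(value), value < 0
```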

            mduerig Michael Dürig added a comment
            frm Francesco Mari added a comment -

            Both the solutions sound reasonable to me. Another question: the algorithm uses the same r to compare both full and young generation numbers. Would it make sense to have two different r for the two different generation numbers?


            Would it make sense to have two different r for the two different generation numbers?

            So far I don't think so, as the intended semantics is that revisions no older than r generations are retained at any point in time. This is currently the case no matter how we interleave tail and full compaction runs.

            mduerig Michael Dürig added a comment

            A further optimisation we could do (possibly as an afterthought once the above is implemented) is to clean up tail compacted generations once no w segments reference them any more. This is the case after a full compaction and at least n further tail compactions, where n is the number of retained generations. One question here is how to best detect that the above criterion is fulfilled.

            mduerig Michael Dürig added a comment

            I rebased my earlier OAK-3349-POC branch on top of the current Oak trunk. I used an overnight run of this SCIT configuration to confirm cleanup is as efficient as before the rebase.

            On the OAK-3349 branch I started integrating tail compaction and full compaction. In a first step I replaced the integer representing the gc generation with an ADT GCGeneration capturing the same semantics (passes ITs).
            In a next step I started on the integration proper. There are still many loose ends:

            • The tail generation is not going into the tar index, which makes cleanup awkward and inefficient, and precise blob reference collection impossible.
            • Deduplication cache generations are not properly selected as they only take the full generation into account.
            • There is no way to schedule/trigger full vs. tail compaction.
            • Graceful and correct degradation to full compaction in the case where a base version cannot be determined is not implemented.
            • The gc.log does not reflect tail compactions.
            • No UT and IT coverage for tail compaction.

            All respective locations in the code are marked with FIXME OAK-3349.

            mduerig Michael Dürig added a comment

            Added a missing bit in the predicate for cleaning up old segments: segments generated by tail compaction must not be cleaned up unless the full generation is far enough in the past. See https://github.com/mreutegg/jackrabbit-oak/commit/9a4301e8ba450bf1a41e721f1c773e803f3d9ca8

            mduerig Michael Dürig added a comment
            frm Francesco Mari added a comment -

            Fixed at r1803146.

            frm Francesco Mari added a comment -

            Minor cleanup made at r1803152.


            Bulk close for 1.7.5

            edivad Davide Giannella added a comment

            People

              frm Francesco Mari
              mduerig Michael Dürig
              Votes: 0
              Watchers: 6
