  1. Giraph (Retired)
  2. GIRAPH-1139

Resuming from checkpoint doesn't work

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: bsp
    • Labels: None

    Description

      I ran into two issues when trying to get Giraph to resume from checkpoints (relying on mapreduce.max.attempts to retry failed tasks rather than on GiraphJobRetryChecker).

      • If we just wrote a checkpoint, the master expects the workers to checkpoint again, while the workers (correctly) clear the checkpointing flag.
      • When workers restart, they take their task ID from the partition number, which stays the same across attempts. This ID becomes the Netty clientId, so the server ignores messages from restarted workers because it believes it has already processed them.

      I believe I've fixed these issues. I'll send a GitHub PR shortly.
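
      To illustrate the second issue, here is a minimal sketch. It is not Giraph's code and not necessarily the fix in the PR: RequestDedupSketch, DedupServer, and both ID helpers are hypothetical names, and the sketch assumes that a restarted worker numbers its requests from 0 again and that the server drops any (clientId, requestId) pair it has already seen.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; none of these names are Giraph classes. */
final class RequestDedupSketch {

  /** Server-side filter that ignores any (clientId, requestId) pair seen before. */
  static final class DedupServer {
    private final Set<String> seen = new HashSet<>();

    /** Returns true if the request is processed, false if it is ignored as a duplicate. */
    boolean receive(int clientId, long requestId) {
      return seen.add(clientId + ":" + requestId);
    }
  }

  /** Client ID derived from the partition number only: identical across attempts. */
  static int idFromPartition(int partition) {
    return partition;
  }

  /**
   * Assumed fix: fold the attempt number into the ID so every attempt is distinct.
   * IDs are unique only while 0 <= attempt < maxAttempts.
   */
  static int idFromPartitionAndAttempt(int partition, int attempt, int maxAttempts) {
    return partition * maxAttempts + attempt;
  }

  public static void main(String[] args) {
    DedupServer server = new DedupServer();
    // Attempt 0 of partition 3 sends request 0: processed.
    System.out.println(server.receive(idFromPartition(3), 0L));
    // Attempt 1 restarts with the same ID and request counter: ignored.
    System.out.println(server.receive(idFromPartition(3), 0L));
    // With an attempt-aware ID, the restarted worker is processed.
    System.out.println(server.receive(idFromPartitionAndAttempt(3, 1, 4), 0L));
  }
}
```

      Any scheme that makes the ID differ between attempts would avoid the collision; multiplying by the maximum attempt count is just one simple choice.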

            People

              Assignee: Unassigned
              Reporter: Nic Eggert (nseggert)
              Votes: 0
              Watchers: 4
