[HDFS-2632] existing in_use.lock file is removed after failing to lock this file - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 0.21.0
Fix Version/s: None
Component/s: namenode
Labels: None
Environment: Scientific Linux 5.3

Description

If an attempt is made to start the namenode while it is already running, the new instance throws an exception because it cannot lock in_use.lock. However, there is a bug: the failed instance deletes in_use.lock, even though the lock belongs to the running namenode. If another attempt is then made to start the namenode, there is no in_use.lock file, so the new instance goes ahead and starts modifying the namenode state files. It eventually fails to bind to the TCP port, but by that time it has already done damage. Specifically, the 'edits' file that the running instance is writing to is moved to 'previous.checkpoint', so all further transactions are lost when the HDFS service is next restarted. We observed a case of data loss because of this.

This issue relates to HDFS-1690, but the problem in HDFS-1690 was stated in a way that is specific to -format.
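
The failure mode described above comes down to who is allowed to delete in_use.lock. The following is a minimal sketch of the intended behaviour, not the actual Hadoop code: the class name StorageLockSketch and its methods are hypothetical, and it assumes the lock is taken with java.nio's FileChannel.tryLock(). The key point is that the lock file is removed only by the instance that actually acquired the lock.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;

// Hypothetical illustration; names are not taken from the Hadoop source.
class StorageLockSketch {
    private final File lockFile;
    private RandomAccessFile raf;
    private FileLock lock; // non-null only while this instance holds the lock

    StorageLockSketch(File storageDir) {
        this.lockFile = new File(storageDir, "in_use.lock");
    }

    /** Tries to lock in_use.lock. On failure the file is left untouched. */
    boolean tryAcquire() throws IOException {
        RandomAccessFile f = new RandomAccessFile(lockFile, "rws"); // creates the file if missing
        FileLock l = null;
        try {
            l = f.getChannel().tryLock(); // null if another process holds the lock
        } catch (OverlappingFileLockException e) {
            // Already locked from within this JVM; treat it like a foreign holder.
        } catch (IOException e) {
            f.close();
            throw e;
        }
        if (l == null) {
            f.close();    // close our handle only
            return false; // do NOT delete lockFile: it belongs to the lock holder
        }
        raf = f;
        lock = l;
        return true;
    }

    /** Releases the lock and removes the file, but only if this instance owns the lock. */
    void release() throws IOException {
        if (lock == null) {
            return; // never acquired, so there is nothing of ours to clean up
        }
        lock.release();
        raf.close();
        lock = null;
        raf = null;
        lockFile.delete();
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("storage-dir").toFile();
        StorageLockSketch first = new StorageLockSketch(dir);
        StorageLockSketch second = new StorageLockSketch(dir);
        System.out.println("first acquired: " + first.tryAcquire());
        System.out.println("second acquired: " + second.tryAcquire());
        second.release(); // no-op, because second never held the lock
        System.out.println("lock file still present: " + new File(dir, "in_use.lock").exists());
        first.release();
    }
}
```

In the scenario reported here the two instances are separate processes, so the second tryLock() would return null rather than throw; the sketch handles both cases the same way. The buggy behaviour corresponds to deleting the file in the failure path of tryAcquire(), or in release() without checking ownership.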

Issue Links

duplicates: HDFS-2877 If locking of a storage dir fails, it will remove the other NN's lock file on exit (Closed)

People

Assignee: Unassigned

Reporter: Dan Bradley

Votes: 0

Watchers: 4

Dates

Created: 05/Dec/11 19:26

Updated: 07/Feb/12 19:30

Resolved: 07/Feb/12 19:30