[COUCHDB-1505] Error on cancelling replication - possibly related to hanging replications - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 1.2
Fix Version/s: None
Component/s: Replication
Labels:
- cancel
- hang
- replication
Environment:

CentOS 5.6 x64. WAN replication (between datacentres). A cron job issues the replication curl requests every 5 minutes. Pull replication with a filter.

Description

We run a cron job every 5 minutes that cancels replication and then starts it again. Occasionally, when the replication jobs are cancelled, a stack trace appears in the CouchDB log (attached).

Other observations: perhaps unrelated, but over time we slowly accumulate "zombie" couchjs processes. After a month or so (it differs for each server) the count approaches our os_process_limit of 200 and we restart CouchDB. "Zombie" is speculation here, but there seems to be no need for more than a hundred couchjs processes when we are only replicating 10 databases and indexing occasionally. After a restart the count drops right back down. The start times of those processes are also weeks old. This may be normal; we are not sure.
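
To put a number on the long-lived couchjs processes described above, one option is to list them by elapsed time. The following is a minimal sketch, not something from the original report: it assumes `ps -eo pid,etime,comm` is available and reports the helper as `couchjs`, and the function names and the age threshold are hypothetical.

```python
import subprocess


def parse_etime(etime):
    """Convert a ps 'etime' value ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours with zero
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def old_couchjs(ps_output, min_age_seconds):
    """Return (pid, age_seconds) for couchjs processes at least min_age_seconds old."""
    found = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Expect exactly: pid, etime, comm. This also skips the header line.
        if len(fields) != 3 or fields[2] != "couchjs":
            continue
        age = parse_etime(fields[1])
        if age >= min_age_seconds:
            found.append((int(fields[0]), age))
    return found


def current_ps_output():
    """Assumption: this ps invocation works on the host in question."""
    return subprocess.check_output(["ps", "-eo", "pid,etime,comm"], text=True)
```

Calling `old_couchjs(current_ps_output(), 7 * 86400)` would list couchjs processes older than a week, which could then be compared against the configured os_process_limit.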

Why do we cancel replication and restart it? We found that without this, WAN replications can hang: curling /_replicate reports that the continuous replication is already running, yet the replications stop updating and the document counts in the databases diverge. As soon as we re-enabled the "cancel":true request to /_replicate before each start, these stack traces reappeared and the replication caught up.
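
The cancel-then-restart cycle described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the reporter's actual cron script: the host URL, database names, and filter name are hypothetical, and the request bodies simply follow the documented /_replicate convention of repeating the original request with "cancel": true added.

```python
import json
import urllib.error
import urllib.request


def replication_body(source, target, filter_name, cancel=False):
    """Build the JSON body for a filtered, continuous pull replication."""
    body = {
        "source": source,
        "target": target,
        "continuous": True,
        "filter": filter_name,
    }
    if cancel:
        # Cancelling repeats the original request with "cancel": true added.
        body["cancel"] = True
    return body


def post_replicate(couch_url, body, timeout=30):
    """POST a replication request to /_replicate and return the parsed reply."""
    req = urllib.request.Request(
        couch_url.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def restart_replication(couch_url, source, target, filter_name, post=post_replicate):
    """Cancel the running replication (if any), then start it again."""
    results = []
    try:
        results.append(
            post(couch_url, replication_body(source, target, filter_name, cancel=True))
        )
    except urllib.error.URLError as exc:
        # The cancel can fail, for example when no such replication is running.
        results.append({"cancel_error": str(exc)})
    results.append(post(couch_url, replication_body(source, target, filter_name)))
    return results
```

A cron entry would then call something like `restart_replication("http://localhost:5984", "http://remote.example:5984/db", "db", "app/by_site")` once per database, where every argument shown is a placeholder.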

Attachments

- couchjs.txt (27/Jun/12 09:31, 16 kB, Alex Markham)
- replicationcancelerror1.log (27/Jun/12 09:31, 44 kB, Alex Markham)
- couchcrash171012redact.log (18/Oct/12 09:19, 42 kB, Alex Markham)

People

Assignee: Unassigned
Reporter: Alex Markham
Votes: 1
Watchers: 3

Dates

Created: 27/Jun/12 09:30
Updated: 02/Sep/18 06:39
Resolved: 02/Sep/18 06:39