

Helix replica lags for hours after restart


2 replies to this topic

#1 Miles O'Neal


Posted 19 June 2019 - 04:35 PM

We are still running a master/replica configuration. All servers are high-end with plenty of fast CPUs and RAM, SSDs for metadata, SSDs or high-speed disks for logs, etc. One of our R/O replicas sits in the same rack as the master, connected via the same 10 Gb switches. Until this week, like the other replicas, this one has always been very quick to catch up on journal data via "p4 pull".

The problem replica is our backup system. Once a day we stop the Helix service (p4d) and back up the metadata and versioned files. When backups complete, we restart Helix services on that host.

As of quite recently, this replica has started taking several hours to catch up after Helix services are started. A few thousand bytes will trickle in, then it will sit for 3-10 seconds and a few thousand more will trickle in, while far more are queuing up on the master. After somewhere between 5 and 8 hours, it will catch up and more or less stay caught up.
Our other R/O replica (2.5 msec away), a FR (forwarding replica) in the same rack, and even the two FRs in the UK (130msec) all behave normally.
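
One way to put a number on "a few thousand bytes trickle in" is to compare the replica and master journal positions that `p4 pull -lj` reports on the replica. The following is a minimal Python sketch, assuming the output contains lines of the form `Current replica journal state is: Journal N, Sequence M.` and that the sequence is a byte offset into the journal; check both against your own p4d version. The `journal_lag` helper and the sample text are hypothetical, for illustration only.

```python
import re

# Assumed line format from `p4 pull -lj`; verify against your own server's output.
_STATE = re.compile(
    r"Current (replica|master) journal state is:\s*Journal (\d+),\s*Sequence (\d+)"
)

def journal_lag(pull_lj_output):
    """Return (journals_behind, bytes_behind) parsed from `p4 pull -lj` text.

    bytes_behind is None when the replica is on a different journal number,
    because the two offsets then refer to different journal files.
    """
    state = {}
    for who, journal, seq in _STATE.findall(pull_lj_output):
        state[who] = (int(journal), int(seq))
    if "replica" not in state or "master" not in state:
        raise ValueError("could not find both journal state lines")
    (r_jnl, r_seq), (m_jnl, m_seq) = state["replica"], state["master"]
    journals_behind = m_jnl - r_jnl
    bytes_behind = m_seq - r_seq if journals_behind == 0 else None
    return journals_behind, bytes_behind

# Hypothetical sample, not taken from the servers discussed in this thread.
sample = (
    "Current replica journal state is:\tJournal 812,\tSequence 1048576.\n"
    "Current master journal state is:\tJournal 812,\tSequence 9437184.\n"
)
print(journal_lag(sample))
```

Logging this once a minute on the slow replica and on a healthy one would show whether the gap shrinks steadily or in bursts.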

There do not appear to be any network anomalies. Ping times in each direction are normal, and there are no network errors.
On the problem host, I notice that the journal pull thread reports as
   pull -i 1 [server.locks/replica/69,d/pull(W)]
while on all other replicas (read-only or forwarding), it reports as
   pull rotating journal [server.locks/replica/69,d/pull(W)]
All replicas are running the same OS (RHEL 6.9), kernel (2.6.32-696.18.7.el6.x86_64), and p4d version (P4D/LINUX26X86_64/2017.1/1534792 (2017/07/26)). Yes, those are all ancient. Downtime is hard to come by. The machines are all running reasonably identical hardware (the R/O replicas in particular are identical). The p4d configs appear identical to me.
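
Rather than comparing configs by eye, the settings for two replicas can be diffed mechanically. This is a rough Python sketch, assuming the text comes from `p4 configure show allservers` and that each line looks like `servername: variable = value`; the server names and values in the sample are invented, and `parse_configure_show` and `config_diff` are hypothetical helpers.

```python
def parse_configure_show(text):
    """Parse `p4 configure show allservers` text into {server: {var: value}}.

    Assumes each line looks like `servername: variable = value`.
    """
    configs = {}
    for line in text.splitlines():
        server, sep, rest = line.partition(": ")
        name, eq, value = rest.partition(" = ")
        if not sep or not eq:
            continue  # skip blank or unexpected lines
        configs.setdefault(server.strip(), {})[name.strip()] = value.strip()
    return configs

def config_diff(configs, server_a, server_b):
    """Return {var: (value_on_a, value_on_b)} for settings that differ."""
    a, b = configs.get(server_a, {}), configs.get(server_b, {})
    return {
        var: (a.get(var), b.get(var))
        for var in sorted(set(a) | set(b))
        if a.get(var) != b.get(var)
    }

# Hypothetical server names and values, for illustration only.
sample = """\
replica-ok: P4TARGET = 10.0.0.5:1666
replica-ok: startup.1 = pull -i 1
replica-bad: P4TARGET = p4master.example.com:1666
replica-bad: startup.1 = pull -i 1
"""
print(config_diff(parse_configure_show(sample), "replica-ok", "replica-bad"))
```

Some differences are expected (anything tied to the server's own name or paths), so the useful part is whatever remains once those are ignored. Settings listed under `any` apply to every server and would need to be merged in separately.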

While the replica is lagging, the p4d on the problem host consistently runs at 96%-98% CPU. Over the same period, the other R/O replica consistently runs in the 70%-80% range, with drops to below 40% that the problem host never shows. Once the bad replica catches up, each host's p4 pull thread CPU usage drops to below 1% on average.

Our pull interval is 1 second. That's never been a problem.

If there are any clues in the system or Helix logs, we're missing them.

Any ideas?

#2 Miles O'Neal


Posted 19 June 2019 - 11:23 PM

Edited to add pull interval.

#3 Matt Janulewicz


Posted 03 July 2019 - 06:27 PM

We have a similar setup and have run into similar mysteries before. In our case it came down to one of two things, or both:

1. Filesystem. Any changes recently? The journal volume is not full? Do you discard/trim your SSDs regularly? (I would recommend not mounting with 'discard' and enabling the fstrim.timer service instead.) The journal is written to the journal file and the live DB simultaneously, so high I/O (wait) on either will slow down the journal update.

2. Are your P4TARGET settings IP addresses or DNS names? We originally had hostnames in there then saw things tank when DNS was acting wonky. Once we switched all our P4TARGET settings to IP addresses we haven't had that type of failure.
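
A quick way to check the second point is to see whether a given P4TARGET is a hostname at all and, if so, how long it takes to resolve. Here is a small Python sketch under those assumptions; `target_host`, `is_ip_literal` and `time_lookup` are hypothetical helper names, the example target is made up, and the prefix list covers the documented `tcp`/`ssl` protocol prefixes.

```python
import ipaddress
import socket
import time

_PREFIXES = {"tcp", "tcp4", "tcp6", "tcp46", "tcp64",
             "ssl", "ssl4", "ssl6", "ssl46", "ssl64"}

def target_host(p4target):
    """Extract the host part from a P4TARGET such as `ssl:host:1666`."""
    parts = p4target.split(":")
    if parts[0] in _PREFIXES:
        parts = parts[1:]  # drop the optional protocol prefix
    # Everything before the trailing port is the host (IPv6 may be bracketed).
    return ":".join(parts[:-1]).strip("[]")

def is_ip_literal(host):
    """True if host is an IPv4/IPv6 address rather than a DNS name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def time_lookup(host):
    """Return the seconds taken by one name resolution of host."""
    start = time.monotonic()
    socket.getaddrinfo(host, None)
    return time.monotonic() - start

# Hypothetical target; substitute the value your replica actually uses.
host = target_host("ssl:p4master.example.com:1666")
print(host, "is an IP literal" if is_ip_literal(host) else "needs DNS")
```

If the host needs DNS, calling `time_lookup(host)` repeatedly from the slow replica during a lag episode would show whether resolution is stalling.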

One other thing off the top of my head that I know can cause random problems, especially if you have millions of library files, is 'mlocate', a file indexer that RHEL installs by default (at least RHEL 7.x does.) Run this on all your servers if you haven't already:

# yum erase mlocate

'perf top' and 'iotop' are also useful to find out if something is messing with your journal file or otherwise taking up an unusual amount of resources.
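
If those tools are not installed, a coarse system-wide I/O wait figure can be derived from two samples of the aggregate `cpu` line in /proc/stat. A minimal Python sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names and sample values are made up for illustration.

```python
import time

def cpu_fields(stat_line):
    """Parse the aggregate `cpu` line of /proc/stat into a list of ints."""
    name, *values = stat_line.split()
    if name != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in values]

def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two /proc/stat samples."""
    deltas = [b - a for a, b in zip(cpu_fields(before), cpu_fields(after))]
    # Guest time is already counted in user/nice, so sum only the first
    # eight fields: user, nice, system, idle, iowait, irq, softirq, steal.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0

def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat (Linux only)."""
    with open(path) as f:
        return f.readline()

def sample_iowait(interval=5.0):
    """Measure iowait percent over `interval` seconds on the local host."""
    before = read_cpu_line()
    time.sleep(interval)
    return iowait_percent(before, read_cpu_line())

# Hypothetical samples, not measurements from any server in this thread.
before = "cpu  100 0 50 800 50 0 0 0 0 0"
after = "cpu  150 0 70 1000 180 0 0 0 0 0"
print(iowait_percent(before, after))
```

This only says that something is waiting on disk, not what; 'iotop' is still the tool for pinning it on a process.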
-Matt Janulewicz
Staff SCM Engineer, Perforce Administrator
Dolby Laboratories, Inc.
1275 Market St.
San Francisco, CA 94103, USA
majanu@dolby.com



