Replication failure early warning system

replica monitor p4 pull failure detection


#1 Miles O'Neal

    Advanced Member

  • Members
  • 208 posts
  • Location: Austin. Texas. Y'all.

Posted 24 August 2015 - 11:40 PM

We need to monitor replication to detect problems; we recently had a convergence of a trans-Atlantic link failure (the backup link is rather slow), a checkin storm, and the replica's superuser ticket getting invalidated. By the time someone notified us, the replica was too far behind to catch up in less than several hours due to the slow backup link (which is being addressed by the network group).

We plan to run a script regularly in our monitoring software (PRTG, for the record). We will look at "p4 pull -lj" output over time to determine when there may be a problem. With spot monitoring every few seconds over the course of a minute, I see numbers varying from 0 to over 800,000.

I initially thought we should look for a journal state difference over a certain threshold, then see if that number keeps growing. But it dawned on me to just look at the two states: if the master state number grows and the replica's doesn't, we know we have a problem. But how long do we wait before declaring a likely problem? We saw a condition last week where the replica sat for somewhere between 20 and 60 seconds because something, somewhere, was locked. I really don't want to be getting false alarms during the night.
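
Something like this (untested) Python sketch is what I have in mind; the replica address and the exact "p4 pull -lj" output wording ("Current replica/master journal state is: Journal N, Sequence M.") are placeholders, so the regex may need tweaking for your server version:

    # Poll "p4 pull -lj" twice and flag the case where the master's journal
    # position advances while the replica's does not. Untested sketch; the
    # replica address and output wording are placeholders.
    import re
    import subprocess
    import time

    PULL_CMD = ["p4", "-p", "replica.example.com:1666", "pull", "-lj"]  # hypothetical address
    STATE_RE = re.compile(r"Current (replica|master) journal state is:\s+"
                          r"Journal (\d+),\s+Sequence (\d+)")

    def journal_state():
        """Return {'replica': (journal, sequence), 'master': (journal, sequence)}."""
        out = subprocess.check_output(PULL_CMD).decode("utf-8", "replace")
        return {who: (int(j), int(s)) for who, j, s in STATE_RE.findall(out)}

    first = journal_state()
    time.sleep(60)                      # polling interval still to be decided
    second = journal_state()

    # Comparing (journal, sequence) tuples also handles journal rotation.
    if second["master"] > first["master"] and second["replica"] <= first["replica"]:
        print("WARNING: master journal advancing but the replica's is not")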

If anyone has any experience with this, I would love to hear it.

Thanks,
Miles

#2 Domenic

    Advanced Member

  • Members
  • 105 posts

Posted 25 August 2015 - 08:30 PM

Every 2 minutes we check both the sequence delta (threshold of 20 MB) and the state file time delta (threshold of 200 seconds), and send a message if either one is exceeded. This has generally worked well for us and we don't get many false positives, except for when we're monitoring our backup replicas and their checkpoints complete :) As a point of reference, our main server journals are ~8 GB per day, and we don't have journals enabled on the forwarding replicas.
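
If it helps, the shape of the check is roughly the following. This isn't our actual script; the state file path and the output parsing are placeholders, and it assumes master and replica are on the same journal number:

    # Two-threshold lag check, meant to be run from cron every couple of minutes.
    import os
    import re
    import subprocess
    import time

    SEQ_DELTA_LIMIT = 20 * 1024 * 1024   # 20 MB of journal the replica may lag
    STATE_AGE_LIMIT = 200                # seconds since the state file last changed
    STATE_FILE = "/p4/root/state"        # hypothetical P4ROOT location

    out = subprocess.check_output(["p4", "pull", "-lj"]).decode()
    seqs = dict(re.findall(
        r"Current (replica|master) journal state is:.*Sequence (\d+)", out))
    seq_delta = int(seqs["master"]) - int(seqs["replica"])   # only meaningful on the same journal
    state_age = time.time() - os.path.getmtime(STATE_FILE)

    if seq_delta > SEQ_DELTA_LIMIT or state_age > STATE_AGE_LIMIT:
        print("ALERT: replica lagging (sequence delta %d bytes, state file age %ds)"
              % (seq_delta, state_age))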

Are you planning to run 'p4 pull -lj' every few seconds, or was that just your initial testing to get an idea of how big the deltas are? If the former: when replication gets into a bad state, the monitoring itself can pile on, since the state file is locked (albeit very briefly) just to figure out where replication is at.

roadkills_r_us, on 24 August 2015 - 11:40 PM, said:

But it dawned on me to just look at the two states: if the master state number grows and the replica's doesn't, we know we have a problem. But how long do we wait before declaring a likely problem?

By "state" do you mean the journal sequence numbers or "the statefile was last modified at XYZ" info? It is possible for replication to be in a bad state (no pun intended :)) even if the master / replica numbers are both growing.. For example, if your trans-Atlantic link gets its bandwidth cut in half for whatever reason, just checking that the numbers are growing may not show you that the replica is falling behind.

#3 Miles O'Neal

    Advanced Member

  • Members
  • 208 posts
  • Location: Austin. Texas. Y'all.

Posted 25 August 2015 - 11:28 PM

"Every few seconds" was just for the test to get a feel for the general range we might see during a busy period. I'm not yet sure what the real polling frequency will be, but it will likely be somewhere in the one to five minute range.

By state I meant the sequence numbers from the "journal state" lines that "p4 pull -lj" returns. I'm not sure why I just typed "state"...

#4 Harsha

    Member

  • Members
  • 16 posts
  • Location: Cambridge, UK

Posted 08 September 2015 - 01:18 PM

We have a custom script that runs every 10 minutes to check the replica status (a rough sketch of these checks follows the list). It checks:
- The current master/replica journal number and offset via "p4 pull -lj". If there's a huge difference in journal offset, the script sleeps for 30 seconds and checks again, so we don't alarm on a transient glitch.
- The archive (RCS) file transfers via "p4 pull -ls", to make sure there are no pending or failed file transfers.
- The latest change on both master and replica via "p4 changes -m1". If replication is in sync, we get the same changelist number from both.
- The timestamp on the "state" file under /P4ROOT/ on the replica. Perforce replication tracks the journal/offset position in this file, and it stops being updated if replication stalls, so we check its age with 'stat' to make sure it is being modified frequently.
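
Boiled down, it looks something like the sketch below. This is not the actual script: server addresses, paths, thresholds and the output parsing are placeholders, and the real one has more error handling.

    # Condensed sketch of the four replica checks listed above.
    import os
    import re
    import subprocess
    import time

    MASTER = "master.example.com:1666"    # placeholder addresses
    REPLICA = "replica.example.com:1666"
    STATE_FILE = "/P4ROOT/state"
    OFFSET_LIMIT = 50 * 1024 * 1024       # placeholder "huge difference" threshold

    def p4(port, *args):
        return subprocess.check_output(["p4", "-p", port] + list(args)).decode()

    def journal_offset_delta():
        out = p4(REPLICA, "pull", "-lj")
        seqs = dict(re.findall(
            r"Current (replica|master) journal state is:.*Sequence (\d+)", out))
        return int(seqs["master"]) - int(seqs["replica"])

    problems = []

    # 1. Journal offset delta, re-checked after 30 seconds so a transient
    #    glitch doesn't raise an alarm.
    if journal_offset_delta() > OFFSET_LIMIT:
        time.sleep(30)
        if journal_offset_delta() > OFFSET_LIMIT:
            problems.append("journal offset delta still large after 30s")

    # 2. Archive (RCS) transfers: crude check that "p4 pull -ls" reports
    #    no nonzero counts (the summary wording varies by server version).
    if any(int(n) for n in re.findall(r"\d+", p4(REPLICA, "pull", "-ls"))):
        problems.append("pending or failed archive transfers reported by p4 pull -ls")

    # 3. Latest changelist should be identical on master and replica when in sync.
    if p4(MASTER, "changes", "-m1") != p4(REPLICA, "changes", "-m1"):
        problems.append("latest changelist differs between master and replica")

    # 4. The state file is updated constantly while replication is running.
    if time.time() - os.path.getmtime(STATE_FILE) > 600:
        problems.append("state file not modified in the last 10 minutes")

    if problems:
        print("REPLICATION ALERT: " + "; ".join(problems))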

You can also use "p4 journaldbchecksums" to check integrity across servers. This command writes records to the journal containing checksums of the specified (or all) tables. Replica servers, upon receiving these records, compare the checksums against those computed from their own database tables, and the results of the comparisons are written to the replica's log.

#5 johntconklin

    Newbie

  • Members
  • 6 posts

Posted 14 September 2015 - 06:23 PM

I recently wrote a nagios plugin that checks "p4 pull -lj" output after discovering that our backup replica had become desynchronized with master a few weeks prior.  It was dutifully backing up old data, so our existing check to see if new backups were being written was not being triggered.

Domenic, on 25 August 2015 - 08:30 PM, said:

Every 2 minutes we check both the sequence delta (threshold of 20 MB) and the state file time delta (threshold of 200 seconds), and send a message if either one is exceeded. This has generally worked well for us and we don't get many false positives, except for when we're monitoring our backup replicas and their checkpoints complete :) As a point of reference, our main server journals are ~8 GB per day, and we don't have journals enabled on the forwarding replicas.

How did you come up with the 200 second / 20 MB thresholds? The current version of my script sleeps for 2 seconds and retries up to 5 times if the journal sequence doesn't match exactly, and I get very few false positives. But changing it to use thresholds like yours seems compelling, as that makes the execution time of each check deterministic.
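
For reference, the retry logic in my plugin amounts to roughly this. It's a simplified sketch rather than the plugin itself, with the usual assumption about the "p4 pull -lj" wording, and Nagios-style exit codes:

    # Retry-style check: only go critical if the journal sequences still differ
    # after a handful of short sleeps.
    import re
    import subprocess
    import sys
    import time

    def sequences():
        out = subprocess.check_output(["p4", "pull", "-lj"]).decode()
        found = dict(re.findall(
            r"Current (replica|master) journal state is:.*Sequence (\d+)", out))
        return int(found["master"]), int(found["replica"])

    for attempt in range(5):
        master, replica = sequences()
        if master == replica:
            print("OK: replica sequence %d matches master" % replica)
            sys.exit(0)                 # Nagios OK
        time.sleep(2)

    print("CRITICAL: replica sequence %d vs master %d after 5 checks"
          % (replica, master))
    sys.exit(2)                         # Nagios CRITICAL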

#6 Domenic

    Advanced Member

  • Members
  • 105 posts

Posted 16 September 2015 - 05:05 PM

Harsha, on 08 September 2015 - 01:18 PM, said:

You can also use "p4 journaldbchecksums" to check integrity across servers. This command writes records to the journal containing checksums of the specified (or all) tables. Replica servers, upon receiving these records, compare the checksums against those computed from their own database tables, and the results of the comparisons are written to the replica's log.

Do you use build farm replicas in your infrastructure? If so, have you experienced any false positives or other issues with the results of 'p4 journaldbchecksums' on the farm replica?

#7 Domenic

    Advanced Member

  • Members
  • 105 posts

Posted 16 September 2015 - 05:31 PM

johntconklin, on 14 September 2015 - 06:23 PM, said:

How did you come up with the 200 second / 20 MB thresholds? The current version of my script sleeps for 2 seconds and retries up to 5 times if the journal sequence doesn't match exactly, and I get very few false positives. But changing it to use thresholds like yours seems compelling, as that makes the execution time of each check deterministic.

The details are a bit hazy because it was implemented a while back, but I believe we set up a script to log the values every X (30?) seconds. After a couple of days of running, we looked at the logs across the replicas to determine averages and outliers, added a little more buffer, and called it good :) The values are almost certainly site-specific. We've had cases of our tools hammering the state file on the replica, and when that's combined with a large integration (hundreds of thousands of files) being submitted, the replicas have fallen behind by 1-2 minutes.
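
The baseline gathering was just a loop along these lines (a sketch from memory; the file name, interval and output parsing are placeholders):

    # Log master/replica journal sequences every 30 seconds for a few days,
    # then look at averages and outliers to pick alert thresholds.
    import csv
    import re
    import subprocess
    import time

    with open("pull_lj_baseline.csv", "a") as fh:    # placeholder log file
        writer = csv.writer(fh)
        while True:                                  # run for a few days, then stop it
            out = subprocess.check_output(["p4", "pull", "-lj"]).decode()
            seqs = dict(re.findall(
                r"Current (replica|master) journal state is:.*Sequence (\d+)", out))
            writer.writerow([int(time.time()), seqs.get("master"), seqs.get("replica")])
            fh.flush()
            time.sleep(30)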

#8 Harsha

    Member

  • Members
  • 16 posts
  • Location: Cambridge, UK

Posted 25 September 2015 - 02:14 PM

Hi Domenic,

No, we don't use build farm replicas. We use warm standby replicas with "db.replication = readonly" and "lbr.replication = shared" and "Services: replica". I haven't seen any false positives so far.

I think it's expected that the checksums won't be the same on build farm replicas, since some db tables (mainly db.have and a few others?) hold different data there; build farm replicas are meant to allow commands like "p4 client" and "p4 sync", which update those tables.

#9 Matt Janulewicz

    Advanced Member

  • Members
  • 226 posts
  • Location: San Diego, CA

Posted 28 September 2015 - 05:53 PM

I'll throw in my two cents, because why not?

We use Zabbix and I run a script to parse the important parts of 'p4 pull -lj' every minute on all our replicas. I pull out the journal numbers on master and replica (we're in a commit->edge environment), the sequence numbers, plus I parse out the state timestamp and replica server time and convert them to epoch time.

The main thing we alert on is when the state timestamp and master time diverge too much (five minutes). There are times when a large transaction (perhaps importing a large GitFusion repo) will generate a big chunk of journal, and it might take 2-3 minutes to ingest it.
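
The core of that check is something like the sketch below. It is not the actual Zabbix item; the "statefile was last modified at" / "replica server time is currently" wording and the date format are from memory, so the regexes may need adjusting for your server version:

    # Convert the state file timestamp and the replica's clock from
    # "p4 pull -lj" output into epoch seconds and alert if they diverge
    # by more than five minutes.
    import re
    import subprocess
    import time

    MAX_DRIFT = 5 * 60   # seconds

    out = subprocess.check_output(["p4", "pull", "-lj"]).decode()
    state_m = re.search(r"statefile was last modified at:\s+(\d+/\d+/\d+ \d+:\d+:\d+)", out)
    now_m = re.search(r"replica server time is currently:\s+(\d+/\d+/\d+ \d+:\d+:\d+)", out)

    def to_epoch(stamp):
        return time.mktime(time.strptime(stamp, "%Y/%m/%d %H:%M:%S"))

    if state_m and now_m:
        drift = to_epoch(now_m.group(1)) - to_epoch(state_m.group(1))
        if drift > MAX_DRIFT:
            print("ALERT: state file timestamp is %d seconds behind the replica clock" % drift)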

We also alert if the journal numbers stay different for more than a few minutes, because if one of those large journal transactions is in flight, our nightly journal rotation can take more than a few seconds to catch up on the replica.

But primarily it's the time divergence. Trying to make heads or tails of the sequence number, which is always different and may update quickly or slowly, was an exercise in futility for us. There was no consistent way to alert based only on the sequence/journal combination because its behavior varied so wildly. When something goes wrong with replication, the times will pretty much always diverge, and it was easy to look at the history of those time divergences in Zabbix and come up with a limit that ignored the usual day-to-day differences but alerted when we knew something was amiss.
-Matt Janulewicz
Currently unemployed, looking for work in Boise, ID!




