

Best practices for backup with a mandatory failover

backup failover

5 replies to this topic

#1 davidair

    Newbie

  • Members
  • 3 posts

Posted 10 February 2020 - 04:40 AM

We have a setup where the main Helix commit server is replicated to a mandatory failover server. What are the best practices for backing up the system if we want to have no downtime? According to the Offline Checkpoints article, a no-downtime checkpoint system involves using a second, read-only replica server.

However, it looks like a mandatory standby cannot be used for this purpose - it does not appear to be possible to run a "p4d -jc" command on it (instead, checkpoint operations need to be run against the main commit server, not the mandatory failover replica).

In the case where we want to provide a highly-available environment, do we need to have another replica for the purposes of doing backups?

#2 Matt Janulewicz

    Advanced Member

  • Members
  • 210 posts
  • Location: San Francisco, CA

Posted 16 February 2020 - 11:42 PM

Short answer: Nope! You have enough servers.

Long answer:

You might want to familiarize yourself with the SDP (https://swarm.worksh...ain/README.html). At my last job (I'm currently unemployed) we used it, but then automated most of the functionality through other means.

The SDP creates a second 'offline' copy of the db on all servers. When a live journal is truncated, it is replayed into the offline version of the database. Then that's where you make checkpoints from. And that's what you need to back up.
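
If it helps to picture that cycle, it's roughly this under the hood. The paths, instance layout and numbering below are made up for illustration; the SDP's own scripts (daily_checkpoint.sh and friends, if I remember the names right) do the real bookkeeping:

    # Rough sketch only; example paths, not the actual SDP layout.
    # 1. Rotate (truncate) the live journal on the running server.
    p4d -r /p4/1/root -jj /p4/1/checkpoints/p4_1

    # 2. Replay the rotated journal into the offline copy of the database
    #    (NNN is whatever number the rotated journal got).
    p4d -r /p4/1/offline_db -jr /p4/1/checkpoints/p4_1.jnl.NNN

    # 3. Dump a compressed checkpoint from the offline db; the live server
    #    is never locked for this.
    p4d -r /p4/1/offline_db -z -jd /p4/1/checkpoints/p4_1.ckp.NNN.gz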

If the standby is a read-only replica of the commit server, then you can do backups there. It doesn't make sense to back up a live database; it will pretty much always be inconsistent (the db.* files out of sync relative to each other) because of the time it takes to do the backup. But you have the solid, not-constantly-being-written-to offline database, which is as up to date as the last journal rotation.

Are you running edge servers, or do users commit directly to the commit server? If you have edge servers, those databases have unique characteristics (namely db.have, as well as db.client and others) and need to be backed up, too. Same method, though, using the SDP.

One other note off the top of my head is that if you are using edge servers and do _not_ use global shelves, then the archive/library files on the edge are unique, too, and need to be backed up separately. I personally would recommend defaulting/requiring global shelves, which will auto-replicate to your standby and can be backed up there.
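
If I remember right, the configurable behind global (promoted) shelves is dm.shelve.promote; double-check the current docs before relying on it, but it's something like:

    # On the commit server: automatically promote shelves created on edge
    # servers so their archive files live on the commit and replicate normally.
    p4 configure set dm.shelve.promote=1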

Even longer musings:

I can't take credit for this idea; it was implemented by a consultant years ago (Rusty). In our case we did have a separate server just for backups.

In the end we grew to six edge servers plus the commit server that needed backing up. It's sort of a hassle when you have to back up seven databases from seven different servers and then collectively call them all 'the backup'.

So we created a new replica of each server that needed backing up and targeted it at a server we called the 'edge replicator'. The SDP makes it easy to create multiple instances of Perforce on the same server; in our case, seven of them.
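
To give you a rough idea, each of those replicas mostly boils down to a handful of 'p4 configure' settings plus seeding it from a checkpoint of its target. The serverid, host and port here are invented for illustration:

    # One block like this per server being backed up; 'edgerep-1' and the
    # target address are made-up examples. P4TARGET points at whichever
    # server this instance replicates (the commit or one of the edges).
    p4 configure set edgerep-1#P4TARGET=commit.example.com:1666
    p4 configure set edgerep-1#db.replication=readonly
    p4 configure set edgerep-1#lbr.replication=readonly
    p4 configure set edgerep-1#startup.1="pull -i 1"
    p4 configure set edgerep-1#startup.2="pull -u -i 1"
    # Then restore a checkpoint from the target into this instance's root,
    # set its serverid, and start it up as usual.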

All seven instances then do the same SDP offline DB thing, and we dump checkpoints from them (seven of them each time!). This way you have everything you need for 'a backup' on one host and can do what you will with it. Back up to tape if you want. Or ...

In our case we used ZFS and took daily snapshots, then sent those snapshots to a 'dummy' ZFS-enabled instance in AWS. We also had a second one of these replicators offsite, plus two read-only replicas for the commit server.
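
The snapshot-and-ship part was nothing exotic, basically standard ZFS commands along these lines (pool, dataset and host names are placeholders, not our real setup):

    # Placeholder names; run from cron on the replicator host.
    TODAY=$(date +%Y%m%d)
    YESTERDAY=$(date -d yesterday +%Y%m%d)   # GNU date

    # Snapshot the filesystem holding the db, checkpoints and journals.
    zfs snapshot tank/p4@daily-$TODAY

    # Ship it offsite as an incremental against yesterday's snapshot.
    zfs send -i tank/p4@daily-$YESTERDAY tank/p4@daily-$TODAY \
        | ssh zfs-dummy.example.com zfs receive -F tank/p4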

So in all, at any given time, we had three separate copies of all our databases and checkpoints/journals (two replicators plus one in the cloud), each of them snapshotted daily with 30-day retention, plus one year of monthlies. On top of that we had the two commit server replicas that could substitute for any local servers in the same data center; they were also a nice source for recovering corrupt or missing library files, etc.

For our IT requirements, this was enough replication to satisfy everyone, so we didn't back up to tape or any other backup device. That approach never worked properly with the 90 million or so library files we needed to save.

Probably overkill if you don't yet have an edge server, but as soon as you need to back up two Perforce servers I'd consider getting another server to use exclusively for replication and backup. As you grow you'll be happy you planned ahead. :)
-Matt Janulewicz
Staff SCM Engineer, Perforce Administrator
Dolby Laboratories, Inc.
1275 Market St.
San Francisco, CA 94103, USA
majanu@dolby.com

#3 Miles O'Neal

    Advanced Member

  • Members
  • 170 posts

Posted 18 February 2020 - 08:50 PM

We have a r/o replica for backups, with its own storage (in addition to a r/o replica that exists purely to fail over as a master if necessary). A cron job stops the p4d and backs up *everything* - /depotdata, /logs, /metadata - so we know it's all in sync. We run test restores regularly.

We have used backup products that get used for other data, but in our case we expect to either want it all, or a specific version of a file or set of files, so we've gone with basic *nix utilities to handle it all. The backups happen much quicker this way. We have quite a lot of data, and the master and forwarding replicas stay busy 24/7, so running the backup r/o replica on one of those systems is not a good option.
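
Stripped way down, that cron job is not much more than this; the paths match ours, but the service name and destination are just illustrative:

    #!/bin/sh
    # Stop the backup replica's p4d so db, archives and logs stay consistent
    # with each other for the duration of the copy.
    systemctl stop p4d-backup    # made-up unit name

    # Plain rsync of everything to the backup volume.
    rsync -a --delete /depotdata/ /backup/depotdata/
    rsync -a --delete /metadata/  /backup/metadata/
    rsync -a --delete /logs/      /backup/logs/

    systemctl start p4d-backup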

We went with XFS some time ago; we tried doing an LV-level snapshot several years ago and it caused all sorts of problems. We haven't used ZFS on Linux. Hence the need for the server to halt; when we were using Solaris, we had used the snapshot-and-backup-the-metadata trick Matt discussed.

We typically spin up new servers from these backups.

FWIW, we have ~71 million versioned files (17+TB), and about 0.75TB of metadata.

#4 Matt Janulewicz

    Advanced Member

  • Members
  • 210 posts
  • Location: San Francisco, CA

Posted 21 February 2020 - 07:56 AM

Some additional clarification... We ran (I use the past tense because I don't work there, or anywhere, for now) ZFS only on the backup replication server. ZFS on Linux, I found, was very solid as far as reliability was concerned, but they're not yet at the point where they're improving performance. It is sloooow in a lot of cases, even if you have tons of memory. (I'd normally drop a URL to the presentation a co-worker and I gave at MERGE 2016 on this very subject, but I have no idea if that stuff is online any more. Suffice it to say that if you're running Linux servers, and you really should be, it's XFS yes in production, ZFS no.)

I had tested LVM with thin provisioning (maybe what Miles was alluding to) and found it unreliable and 'wonky', for lack of a better term.

We had a similar dataset to Miles but with close to 30 TB of versioned files. Lots of binaries in there. :)
-Matt Janulewicz
Staff SCM Engineer, Perforce Administrator
Dolby Laboratories, Inc.
1275 Market St.
San Francisco, CA 94103, USA
majanu@dolby.com

#5 Miles O'Neal

    Advanced Member

  • Members
  • 170 posts

Posted 21 February 2020 - 10:18 PM

Do you mean this one? 8^)

https://www.slidesha...-a-decade-makes

#6 Matt Janulewicz

    Advanced Member

  • Members
  • 210 posts
  • Location: San Francisco, CA

Posted Yesterday, 08:36 AM

Miles O'Neal, on 21 February 2020 - 10:18 PM, said:

Do you mean this one? 8^)

https://www.slidesha...-a-decade-makes

Yuuuuuup! ;)
-Matt Janulewicz
Staff SCM Engineer, Perforce Administrator
Dolby Laboratories, Inc.
1275 Market St.
San Francisco, CA 94103, USA
majanu@dolby.com



