
Trouble with testing restore - crash in maintenance mode


11 replies to this topic

#1 davidair
Member · 12 posts

Posted 26 February 2020 - 12:09 AM

We are trying to validate that our backups are working by restoring a checkpoint on a new server.

We do this by (rough commands below):

1. Setting up a vanilla standalone commit server
2. Copying all the archive files
3. Using p4d -r <root> -jr to restore a checkpoint
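In rough terms, steps 2 and 3 look like this (paths and file names here are examples, not our real ones):

# copy the versioned-file archives from the backup
rsync -a /mnt/backup/depots/ /opt/p4root/depots/
# replay the checkpoint into the new root
p4d -r /opt/p4root -jr /opt/journal/checkpoints/our_server.ckp.6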

After restoring the backup, p4d fails to start, and the following warning is written to the logs:

Warning! You have exceeded the usage limits of Perforce Helix. Version 2019.2 allows up to five users without commercial licenses. You may continue your current usage with previous versions of our software.
Try deleting old users with 'user -d'.

This is expected. We then start p4d in maintenance mode using the -n flag.
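Concretely, the maintenance-mode start is something like this (same example paths as above):

sudo -u p4admin p4d -r /opt/p4root -n -d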

However, when we do this, some Perforce commands no longer work and the server crashes.

For example:

  • p4 users works
  • p4 user -d Username results in a "Partner exited unexpectedly." error, with the following entry in the logs:

Perforce server error:
Process 28855 exited on a signal 11!

This is unexpected. Is there another way to test a backup restore without running into this issue?
Is this behavior (a crash with signal 11) a known bug?

We are running P4D Rev. P4D/LINUX26X86_64/2019.2/1918134 (2020/02/12).

#2 Sambwise
Advanced Member · 1037 posts

Posted 26 February 2020 - 12:29 AM

Quote

We then start the p4 in maintenance mode using the -n flag.

However, when we do this, some Perforce commands no longer work and the server crashes.

Yikes.  Is there a coredump you can send to Perforce support?  I'm pretty sure it's not supposed to be doing that.

#3 Matt Janulewicz
Advanced Member · 217 posts · San Francisco, CA

Posted 26 February 2020 - 11:29 AM

Stab in the dark, but 'p4 undoc' says:

"Maintenance mode does not police the user and file count restrictions listed in the license file."

Maybe it's freaking out in the absence of a license file? The result is still bug-ish, but maybe 'p4d -n' wasn't intended/tested for unlicensed servers.

In any case, I think the easiest way to test a restore is to request a duplicate license for the IP address of your test server.

If you're only testing data integrity locally, you could try something shifty like physically disconnecting the test server from the network, giving it the same IP address as your master server, and then using the same license. So instead of a duplicate license, a duplicate server. :)

But really, duplicate license is the way to go.
-Matt Janulewicz
Currently unemployed, looking for work in Boise, ID!

#4 Miles O'Neal
Advanced Member · 182 posts

Posted 26 February 2020 - 05:57 PM

+1 to Matt's answer. The safest bet is always a license for any new Helix service, even this one.

#5 davidair
Member · 12 posts

Posted 26 February 2020 - 10:34 PM

Thank you all for your responses! We have received a duplicate license from Perforce but the server is still crashing after a restore with the same "signal 11" error (SIGSEGV FTW).
We have filed a support case with Perforce and will hopefully get to the bottom of it.

A few things worth mentioning:

1. The original server is set up with replication (we're backing up a master commit server, but a mandatory replica also exists - we are not backing it up). The test server is configured as a standalone server.
2. If we restore the entire checkpoint, p4d fails to start with the following error (now that we got past the license issue):

Perforce server error:
        Listen ORIGINAL_SERVER_NAME:1666 failed.
        P4SSLDIR not defined or does not reference a valid directory.

3. We've tried a partial restore, using -K to exclude the configuration and identity tables from the replay:

p4d -r /opt/p4root -jrF -K db.config,db.configh,db.domain,db.server,db.user,db.user.rp /opt/journal/checkpoints/our_server.ckp.6

This works until we restart the server, after which we're back to "signal 11".

#6 Miles O'Neal
Advanced Member · 182 posts

Posted 28 February 2020 - 05:42 PM

Aha! You need to convince the test server's Helix instance to use the test server name rather than the name of the host it was restored from. (A sketch of the commands follows the list.)
  • Change the serverID in /opt/p4root/server.id (it should not have a newline; I wrote a quick Perl script to change the name).
  • Change the master host name and IP in any startup scripts. (With the SDP, ours is in /p4/common/config/p4_1.vars.)
  • Before trying to start it, use "p4d -cshow" to look for instances of the original host (e.g., P4TARGET). Use "p4d -cset" to change these if they are not being set in startup scripts. You will need to run these commands as the user that owns the db.* files. You may need to set the -J option to keep the journal up to date, but hopefully not.
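Something like this, roughly (the server name and target port are placeholders; p4d -xD should display or set the serverID for you, which avoids hand-editing server.id):

# display the current serverID, then overwrite it
sudo -u p4admin p4d -r /opt/p4root -xD
sudo -u p4admin p4d -r /opt/p4root -xD testserver
# inspect the restored db.config, then repoint anything stale at the new host
sudo -u p4admin p4d -r /opt/p4root -cshow
sudo -u p4admin p4d -r /opt/p4root "-cset testserver#P4TARGET=testhost:1666"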


#7 davidair
Member · 12 posts

Posted 28 February 2020 - 11:09 PM

Thanks Miles, the approach makes sense. However, if I go step by step:

1. /opt/p4root/server.id is the same on both servers (we've configured the test server using the same name, let's say FOO)
2. "Our startup script" is a one-liner: "sudo -u p4admin p4d -r /opt/p4root/FOO -d"
3. p4d -cshow doesn't show anything (neither on the production server nor on the backup one). p4d -cset looks correct, with one exception: the original server is configured with SSL (P4PORT has the "ssl" prefix), whereas the new one is not. (See the note below on what that difference looks like from the client side.)
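For context, that SSL difference is visible from the client side too: talking to the SSL-enabled original requires the ssl: prefix and a one-time trust step, while the restored non-SSL server does not (FOO:1666 is a placeholder here):

p4 -p ssl:FOO:1666 trust -y   # accept the server's SSL fingerprint
p4 -p ssl:FOO:1666 info
p4 -p FOO:1666 info           # the restored, non-SSL server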

#8 davidair
Member · 12 posts

Posted 29 February 2020 - 05:19 PM

Note that I'm also working with Perforce support on this in parallel. The last thing I tried was a full checkpoint restore and starting the server like this:

sudo -u p4admin p4d -r /opt/p4root/FOO -p BAR:1666 -d

Where FOO and BAR are our server and machine names, respectively. This worked and I was able to run some commands like "p4 users", but I'm back to getting "signal 11" when running other commands like "p4 clients". Waiting to hear back from support.

#9 Miles O'Neal
Advanced Member · 182 posts

Posted 02 March 2020 - 03:52 PM

For server commands (e.g., "p4d -cshow"), you should always include the "-r" option with your root directory (where the db.* files live).

The same server name? Is the restore test server off the network, then?

#10 davidair
Member · 12 posts

Posted 03 March 2020 - 05:57 AM

Thanks! Sorry, I guess I'm confusing server ID and server name (in my case I was referring to them as FOO and BAR).

The restore server is different, but both the original server and the new server use the same server ID (what I refer to as FOO), which is stored under /opt/p4root/FOO/server.id (the file's contents are "FOO").

Thank you for the tip about running -cshow. When I run "sudo -u p4admin p4d -cshow -r /opt/p4root/FOO", I see that I have a bunch of servers configured, including the failovers and the edges with their failovers. This is obviously wrong, because my backup-restore instance is a standalone server. I wonder if this is causing the trouble.
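For the record, this is roughly how I'm spotting the stale entries (the pattern is just my guess at the interesting configurables):

sudo -u p4admin p4d -r /opt/p4root/FOO -cshow | grep -E 'P4TARGET|serviceUser|startup'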

Taking a step back, here is what's going on:

1. Our production environment consists of a main commit server with its mandatory failover, as well as two edges and their failovers
2. For the moment, we're testing backing up the main commit server (we will eventually also back up the edges) - backup works, of course, but we are having trouble testing restore
3. We don't want to test restore on prod (that's only for actual DR), but we do want to be able to verify that our backups worked, hence this thread
4. We don't want to set up a full mirror of our prod environment just to test backups

#11 Miles O'Neal
Advanced Member · 182 posts

Posted 03 March 2020 - 04:42 PM

We only back up (and thus test restore) the master (we aren't yet running edge/commit because it requires some serious re-architecting of things in our environment).
As noted previously, we change the server.id on the restore test host. I don't know if that's necessary, but it all works for us.
I don't think the extra servers should be causing an issue. We have plenty of replicas that are still in our db.config on the restore test server. OTOH, to avoid taking chances, we have a strict firewall policy on that box regarding what hosts can talk to it (and vice versa) on our Helix service port.

Now that you have a license, you might try running without maintenance mode and see if that helps. Because we have licenses for all related servers, we have never used it.

#12 davidair
Member · 12 posts

Posted 11 March 2020 - 04:51 PM

Hey folks, closing the loop on this one. Thanks to Perforce Support, we were able to figure out that this was most likely due to our SSO configuration. We were configured to use the Helix Authentication Service, and restoring the checkpoint onto a standalone server not configured to use it caused p4d to crash. Additionally, our prod server was configured to use SSL and was logging to a non-existent destination.

The ultimate fix was to run the following after restoring:

# Clearing the server ID fixes the P4SSLDIR issue
sudo -u p4admin rm /opt/p4root/<SERVER-NAME>/server.id

# Clearing P4LOG removed the non-existent path
sudo -u p4admin p4d -r /opt/p4root/<SERVER-NAME> "-cunset P4LOG"

# Removing the trigger DB and unsetting auth.sso.allow.passwd turned off SSO
sudo -u p4admin rm /opt/p4root/<SERVER-NAME>/db.trigger
sudo -u p4admin p4d -r /opt/p4root/<SERVER-NAME> "-cunset auth.sso.allow.passwd"

After these steps, we were able to run all commands and test the backup restore.
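For anyone following along later, our smoke test afterwards was along these lines (the port is a placeholder):

sudo -u p4admin p4d -r /opt/p4root/<SERVER-NAME> -d
p4 -p localhost:1666 users
p4 -p localhost:1666 clients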



