DB Failure - Checkpoint recovery fails

checkpoint recovery

16 replies to this topic

#1 DanKolle

DanKolle

    Member

  • Members
  • PipPip
  • 12 posts

Posted 24 June 2015 - 12:40 PM

Hello everybody,

All of a sudden my database failed yesterday. I had worked with it five minutes earlier, and then I got a db error.

I thought "no problem", as I do backups every day along with checkpoints. The problem is that recovering from them simply does not work. I always end up with an empty Perforce server.

Please allow me to go into details.

I create my checkpoints using "p4d -jc". I also back up all data from "C:\Program Files\Perforce\Server".

This is basically what I did. Please keep in mind that I tried several times with different backups, so I am giving you the basic steps that I followed.
  • Stop Perforce server
  • cmd.exe
  • cd C:\Program Files\Perforce\Server
  • p4d -jv c:/restore/snapshot/checkpoint.1  << No output, so I guess this is fine?
  • del db.*
  • del journal
  • p4d -jr c:/restore/snapshot/checkpoint.1 c:/restore/snapshot/journal
The generated output is:

Perforce db files in '.' will be created if missing...
Recovering from c:/restore/snapshot/checkpoint.1...
Recovering from c:/restore/snapshot/journal...

So everything looks fine at this point. The db files and the journal are where they belong. The actual depot data is also still there, as it was not affected.

Then I start the Perforce server again and try to connect with p4admin, but it says "No such user". When I create a new user, which is then promoted to super user, I can see that my depots do not exist; instead I see an empty "depot" depot.

The server was a case-insensitive unicode server.

Any suggestions on what I can still try?

I even tried to recover Perforce on another computer but with the same result.

Thanks
Dan

#2 P4Sam

P4Sam

    Advanced Member

  • Members
  • PipPipPip
  • 484 posts
  • LocationSan Francisco, CA

Posted 24 June 2015 - 02:47 PM

It sounds like you're connected to a completely empty server, which doesn't agree with your observation that the db files were created in the right place (I assume they "look right" in terms of being different sizes etc -- if they're all the same size they're empty).  Double-check "p4 info" against this empty server to see what the root is, and see whether it matches the directory where those db files went.

#3 DanKolle

Posted 24 June 2015 - 02:58 PM

Here you go:

C:\Program Files\Perforce\Server>p4 info
User name: #######
Client name: #######
Client host: #######
Client unknown.
Current directory: c:\Program Files\Perforce\Server
Peer address: 127.0.0.1:58424
Client address: 127.0.0.1
Server address: #######:#######
Server root: C:\Program Files\Perforce\Server
Server date: 2015/06/24 16:51:13 +0200
Server uptime: 00:00:06
Server version: P4D/NTX64/2015.1/1054991 (2015/05/05)
Server license: none
Case Handling: insensitive

I noticed something odd. In p4admin it says that the size of the journal is 0 bytes, whereas the actual file is around 2 GB.

C:\Program Files\Perforce\Server>dir journal
Volume in drive C has no label.
Volume Serial Number is D009-FAC5

Directory of C:\Program Files\Perforce\Server

24.06.2015  16:53     2.478.356.360 journal
               1 File(s),  2.478.356.360 bytes
               0 Dir(s), 260.975.603.712 bytes free

I checked the access rights and they should be fine as p4s runs under the system user.

Any suggestion?

Thanks
Dan

#4 P4Sam

Posted 24 June 2015 - 03:39 PM

Are the db files on disk as small as indicated in P4Admin?  As I mentioned earlier, if they're all small and the same size (8k or 16k or some round number like that, I think it depends on platform), they're empty, which indicates that the restore didn't work.
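A minimal sketch of that size check in Python, for anyone who would rather script it than eyeball a directory listing. The function name and the 16 KiB threshold are assumptions for illustration (the threshold simply follows the "8k or 16k" figure above and may differ by platform); this is not a Perforce tool.

```python
from pathlib import Path


def db_files_look_empty(root, max_empty_size=16 * 1024):
    """Heuristic: True if every db.* file in root has the same small size.

    max_empty_size is an assumed threshold, not a documented Perforce value.
    """
    sizes = [p.stat().st_size for p in Path(root).glob("db.*") if p.is_file()]
    if not sizes:
        # No db.* files at all: treat this as "nothing was restored".
        return True
    all_same_size = len(set(sizes)) == 1
    return all_same_size and sizes[0] <= max_empty_size
```

Calling db_files_look_empty(r"C:\Program Files\Perforce\Server") after a restore would return True if the restore produced only empty tables. A populated server normally has db files of clearly different sizes, so False is the expected result there.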

That giant journal could be valuable; don't lose it!  If your checkpoint process has been failing, that journal contains every transaction ever made in your server, and might be your only path to recovery.

How large is the checkpoint file that you recovered from?

#5 DanKolle

Posted 24 June 2015 - 03:52 PM

Okay, this is a bit weird. All files in my snapshot are very small, as you said, around 16 KB each.

The big journal is not from the snapshot but from the actual backup of "C:/Program Files/Perforce/Server".

#6 P4Sam

Posted 24 June 2015 - 05:37 PM

So those db files are completely empty -- the checkpoint restore did nothing.  I'm wondering if the checkpoint itself is empty (i.e. your checkpoint process has been checkpointing an empty directory this entire time) and that's why it produced an empty db without any errors.  That would also explain why the journal is giant and does not match what's in your backups -- you've never actually checkpointed this database.

You didn't mention how big the checkpoint file is (if it's only a few bytes it's essentially empty), but proceeding on the assumption that it is in fact empty, my next step would be to try recovering from the 2G journal file.  If it's been running continuously for the entire life of your server you should be able to recover everything from it.

#7 DanKolle

Posted 24 June 2015 - 05:47 PM

Yes the snapshot seems completely empty.

https://www.dropbox...._files.txt?dl=0

How would I proceed to recover from the 2G journal?

#8 P4Sam

Posted 24 June 2015 - 06:09 PM

In your "server" directory (that currently has the db.* files):

del db.*
p4d -r. -jr journal

If this works without any errors, you should be good to go.  I'd say there's a decent chance there will be errors due to the journal being interrupted at some point (due to disk failure, configuration change, etc -- between the fact that it's 2GB worth of data and the fact that you had mysterious db errors that might indicate an unreliable disk, I'd be surprised if all 2GB worth of it is perfect).  If there are errors, try:

p4d -r. -f -jr journal

The "-f" tells the restore to skip past failed operations and hope for the best.
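The "try it plainly first, then fall back to -f" sequence can be sketched in Python as below. Only the p4d flags come from the commands above; the function names are hypothetical, and the sketch assumes that p4d exits with a non-zero status when the replay hits an error. It does not delete the db.* files; that step stays manual, both before the first attempt and (if you want a clean start) before the forced one.

```python
import subprocess


def journal_replay_command(root, journal, force=False):
    """Build the p4d argument list for replaying a journal into root."""
    cmd = ["p4d", "-r", root]
    if force:
        cmd.append("-f")  # skip past failed operations, as described above
    cmd += ["-jr", journal]
    return cmd


def replay_journal(root, journal, run=subprocess.run):
    """Attempt a plain replay; use -f only if the plain attempt fails.

    Assumes a non-zero return code signals a failed replay.
    """
    first = run(journal_replay_command(root, journal))
    if first.returncode == 0:
        return "clean"
    forced = run(journal_replay_command(root, journal, force=True))
    return "forced" if forced.returncode == 0 else "failed"
```

A result of "forced" means some journal records were skipped, so the restored server deserves a closer look before users are let back in.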

#9 DanKolle

Posted 24 June 2015 - 06:45 PM

Good news. The restore ran through with the latest journal file. I checked with "Reconcile Offline Work" and it found no differences.

I will create a full backup now just to be sure, with the server stopped.

Could you have a quick look at my backup script? I guess the lack of -r was the problem.

The script lies in the root of my depot and creates snapshots in ../Snapshot.

@echo off
cd ..
rmdir /s /q Snapshot
md Snapshot
cd Snapshot
p4d -jc


#10 Robert Cowham

Robert Cowham

    Advanced Member

  • PCP
  • 270 posts
  • LocationLondon, UK

Posted 24 June 2015 - 07:07 PM

If you don't specify -r, then whatever is set as P4ROOT in the environment is used. This can be dangerous.

I would always recommend specifying the -r flag if running p4d directly.
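As a rough illustration of that advice, here is a small Python sketch that always passes -r and refuses to run against a directory with no db.* files, which is the situation that produced the empty checkpoints earlier in this thread. The function name and the guard are assumptions for illustration, not part of p4d.

```python
from pathlib import Path


def checkpoint_command(root):
    """Build a 'p4d -r <root> -jc' argument list with an explicit root.

    Raises ValueError if root holds no db.* files, since checkpointing
    such a directory would only capture an empty database.
    """
    root = Path(root)
    if not any(root.glob("db.*")):
        raise ValueError(f"no db.* files in {root}; refusing to checkpoint")
    return ["p4d", "-r", str(root), "-jc"]
```

The returned list can be handed to subprocess.run from a scheduled job; the point of the guard is that a wrong working directory or a wrong P4ROOT fails loudly instead of quietly writing a useless checkpoint.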

The alternative is to run "p4 admin checkpoint" as a Perforce superuser from any client machine. While this works fine and uses the default root for the server, there is a danger: once your checkpoints take tens of minutes, you are locking all users out of Perforce. Therefore savvy admins tend to block such commands in a broker...
Co-Author of "Learning Perforce SCM", PACKT Publishing, 25 September 2013, ISBN 9781849687645

"It's wonderful to see a new book about Perforce, especially one written by Robert Cowham and Neal Firth. No one can teach Perforce better than these seasoned subject matter experts"
  • Laura Wingerd, author of Practical Perforce, former VP of Product Technology at Perforce

#11 DanKolle

Posted 24 June 2015 - 07:16 PM

Is it required or recommended to stop the Perforce server for the checkpoint creation?

#12 Robert Cowham

Posted 24 June 2015 - 07:28 PM

When you run a checkpoint command, the whole database (all the db.* files) is locked so that the checkpoint is consistent. This means you can do it while the server is running, but effectively all user commands are locked out for the duration of the checkpoint operation.

If you stop the server, the users are locked out anyway so the database is consistent that way too. But while the server is stopped user commands will give an error instead of hanging. This can be more disruptive. It's a choice...

#13 DanKolle

Posted 24 June 2015 - 07:40 PM

Okay, I see the error in my understanding now. I thought checkpoints were saved into a different folder from the p4 root.

Would this script be fine? It should only leave the current journal file in it and the latest checkpoint.

del "C:\Program Files\Perforce\Server\checkpoint.*"

p4d -r "C:\Program Files\Perforce\Server" -jc

del "C:\Program Files\Perforce\Server\journal.*"

Versioning then happens through the backups I create so no .0 .1 files should be needed.

#14 Robert Cowham

Posted 24 June 2015 - 08:22 PM

That is basically correct.

Thought experiment:
  • create a client workspace, but don't sync it (no "Get Latest" in P4V)
  • create a checkpoint, e.g. 23
  • sync the workspace which writes records to db.have
  • and then unsync it (i.e. sync #0) to remove all files - thus deleting records in db.have
  • create another checkpoint, e.g. 24
  • compare the 2 checkpoints - what do you expect to see?
In reality, you will see some differences to do with access times and journal notes, but basically they are the same.

However, the record of the sync/unsync is in the journal - this can have value, but that value decreases over time.

Does that help?

#15 DanKolle

Posted 24 June 2015 - 08:33 PM

Hey, sorry I edited my previous post which kind of overlapped with your reply.

So the journal is not required for a backup using checkpoints, but checkpoints can be created out of a journal?

So if I only keep the checkpoint.x file it should be fine? Or should I also keep the journal file in case changes came in between taking the checkpoint and creating the backup?

Conversely, can I safely delete checkpoint.* and journal.* (excluding journal itself) from a running server?

So this script should work fine to get rid of old checkpoints?

del "C:\Program Files\Perforce\Server\checkpoint.*"

p4d -r "C:\Program Files\Perforce\Server" -jc

del "C:\Program Files\Perforce\Server\journal.*"


#16 Domenic

Domenic

    Advanced Member

  • Members
  • PipPipPip
  • 102 posts

Posted 24 June 2015 - 10:09 PM

Out of curiosity, what is the rush to delete old checkpoint files vs. just moving them to another folder? If your latest checkpoint can't be restored are you able to go back to (checkpoint - X) to get back to a good state?
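A sketch of the "move instead of delete" idea in Python, in case it is useful. The function name and the archive location are hypothetical; the patterns are the same checkpoint.* and journal.* used in the script above, so the live "journal" file (no suffix) is left alone.

```python
import shutil
from pathlib import Path


def archive_old_checkpoints(root, archive_dir):
    """Move checkpoint.* and journal.* files from root into archive_dir.

    The active 'journal' file has no dot suffix, so neither pattern matches it.
    Returns the names of the files that were moved.
    """
    root, archive = Path(root), Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for pattern in ("checkpoint.*", "journal.*"):
        for f in sorted(root.glob(pattern)):
            if f.is_file():
                shutil.move(str(f), str(archive / f.name))
                moved.append(f.name)
    return moved
```

Run before taking the new checkpoint, this keeps the server root tidy while still leaving older checkpoints on disk to fall back on. Pruning the archive folder is a separate decision.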

DanKolle, on 24 June 2015 - 08:33 PM, said:

So the journal is not required for a backup using checkpoints, but checkpoints can be created out of a journal?
I'm not quite sure what you mean by the journal not being required for a backup using checkpoints. From http://www.perforce....ter.backup.html:
  • A checkpoint is a snapshot or copy of the database at a particular moment in time.
  • A journal is a log of updates to the database since the last snapshot was taken.
Given that, and assuming you do daily checkpoints, if you did a checkpoint RIGHT NOW it should be equivalent to restoring yesterday's checkpoint and replaying today's journal into it.

DanKolle, on 24 June 2015 - 08:33 PM, said:

So if I only keep the checkpoint.x file it should be fine? Or should I also keep the journal file in case changes came in between taking the checkpoint and creating the backup?
As Robert mentioned, taking a checkpoint locks the db.* files so there won't be any activity in the journal between when you started taking the checkpoint and when it finished.

For restoring, the checkpoint will get you back to the state when it was taken. If you take your checkpoints at 2am and the server dies at 9am, restoring that checkpoint will only get you back to 2am. The journal should have everything that happened from 2am -> 9am so you'd want to restore the checkpoint then replay the journal to get back to the state of things as they were at 9am.
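To make the relationship concrete, here is a toy model in Python. It is not how p4d stores anything: the dictionary, the record format, and the file names are invented for illustration. It only shows the idea that a checkpoint is a snapshot, a journal is an ordered list of changes, and "restore, then replay" reproduces the later state.

```python
def replay(state, journal):
    """Apply journal records (op, key, value) in order to a copy of state."""
    db = dict(state)
    for op, key, value in journal:
        if op == "put":
            db[key] = value
        elif op == "del":
            db.pop(key, None)
    return db


# Snapshot taken at 2am (toy data: depot path -> head revision).
checkpoint_2am = {"//depot/a.txt": 3, "//depot/b.txt": 1}

# Everything that happened between 2am and the crash at 9am.
journal_since_2am = [
    ("put", "//depot/a.txt", 4),
    ("put", "//depot/c.txt", 1),
    ("del", "//depot/b.txt", None),
]

# Restoring the checkpoint alone gives the 2am state;
# replaying the journal on top of it gives the 9am state.
state_9am = replay(checkpoint_2am, journal_since_2am)
```

The same model explains the recovery earlier in the thread: if the journal has never been truncated, replaying it into an empty database (replay({}, full_journal) in this toy) rebuilds everything without any checkpoint.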

Separately from all of this, if you're just setting up your checkpoint process I'd highly recommend setting up offline checkpoints. More info is at http://answers.perfo...rticles/KB/2419. There are multiple benefits to offline checkpointing, one of which is minimal impact on users: the db.* files don't get locked on the main server because the backup server does the checkpoint. That may not be an issue for you now, but your db files will only grow over time, which means checkpoints will take longer, so you may as well take the leap now :)

#17 DanKolle

Posted 24 June 2015 - 10:30 PM

Thanks for taking the time to answer my questions.

The idea of deleting old checkpoints is that when I want an older checkpoint, I will simply open an older server backup. So the checkpoint backups exist within the normal backup cycle as all the other software backups. Perforce is only one of many programs that is backed up during the process.

Ah okay thanks for clarifying the journal checkpoint relationship. I thought the journal would also need to be restored but I guess this is not needed unless there is a serious issue with it.

I will take a look at offline checkpoints. Thanks for the advice.

Thanks again to you all. The server is back up and running, and the checkpoints are now being created correctly.
Dan




