

reverts hanging


#1 Miles O'Neal

    Advanced Member

  • Members
  • 84 posts

Posted 03 July 2018 - 10:13 PM

A user asked for help with a hung revert. They had accidentally deleted 32K+ files; the revert reported a hundred or so files reverted and then froze. Two hours later I got a look at it and killed the process on the server. I became that user via sudo and experimented; the revert would consistently hang a short way in. I eventually ran "p4 opened //foo/... | sed -e 's/#.*//' > /tmp/revert.dat" to get the paths of all the open files, then ran a loop to revert each file individually. That took about 1 minute per 1,000 files (32 minutes or so), but none of the individual reverts hung. There were no streams or labels involved.
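(A minimal sketch of that loop, assuming P4USER and P4CLIENT are already set to the affected workspace and that no depot paths contain spaces:)

    p4 opened //foo/... | sed -e 's/#.*//' > /tmp/revert.dat
    while read -r f; do
        p4 revert "$f"    # one file per revert; each call returned promptly
    done < /tmp/revert.dat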
This occurred from both P4V (the user) and p4 (me).
Any ideas?
Thanks!

RHEL6 (required by third party software we use)
Rev. P4V/LINUX26X86_64/2017.2/1532340
Rev. P4/LINUX26X86_64/2017.1/1534792 (2017/07/26).
Rev. P4D/LINUX26X86_64/2017.1/1534792 (2017/07/26).

#2 Sambwise

    Advanced Member

  • Members
  • 637 posts

Posted 04 July 2018 - 12:46 AM

That's an interesting one. Did the hundred or so files actually get reverted? And would the revert always revert a hundred files and then hang, or after the first time did it hang without reverting anything? My guess would be that the hang was specific to the 101st file (in which case it would thereafter always hang immediately on that same file) rather than to the number of files reverted (in which case each attempt would hang on a different file, since it would process another hundred and then get stuck in a new place). That would be a good thing to validate, though, and if you know which particular file triggers the bug it's a lot easier to scrutinize it for possible root causes.

As to why a particular file would hang, but only as part of a larger revert, my guess would be that there was some cleanup that needed to happen on that file (something involving a move pair, or a shelf, or even a have record) that, because the file was in an unusual state, required a db probe with the reverted path as a filter (which would go a lot quicker when reverting that file by itself rather than as part of a larger batch).

The way I'd try to diagnose exactly what was going on would be to run the server with the -vdmc=N flag (say, N=5) to see if it dumps any helpful logging; -vdb=N might also give something useful. At the end of the day, though, the exact diagnosis isn't that helpful if you don't have access to the source to fix it. If you're trying to get a fix from the development team, sending them a checkpoint that reproduces the problem is probably the most expeditious route.
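For example, restarting p4d with tracing turned up might look like this (a sketch only; the root, port, and log paths are placeholders, and how you restart p4d depends on your installation):

    p4d -r /p4/root -p 1666 -vdmc=5 -vdb=3 -L /p4/logs/debug.log

Then reproduce the hang and look through the log for where the offending revert stops making progress.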

#3 p4rfong

    Advanced Member

  • Staff Moderators
  • 285 posts

Posted 06 July 2018 - 05:52 PM

You can also try running
    p4 monitor show -ael
    p4 lockstat
    p4 lockstat -C
as seen in "Fixing a hung server": https://community.pe.../s/article/3785
Also make sure you have set
    p4 configure set db.monitor.interval=30
This won't help this time, but it will allow
    p4 monitor terminate <pid>
to work in the future. The results of the lockstat commands above may provide a clue.
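For example, the next time a revert wedges, something like this should clear it (a sketch; 12345 stands in for whatever pid the monitor output shows for the hung command):

    p4 monitor show -ael          # find the hung revert and note its pid
    p4 monitor terminate 12345    # ask the server to end that process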

#4 Miles O'Neal

    Advanced Member

  • Members
  • 84 posts

Posted 16 July 2018 - 03:40 PM

Thanks, y'all.
Yes, everything reverted. When I would kill the hung revert and start it again, it would pick up where it left off and then hang farther along, at a different location, so the hang wasn't tied to any one unrevertable file. When I dumped the list of opened files in the depot path and reverted them individually, they all reverted with no problems.
I had tried some of these things; there were no obvious lock problems.
If it happens again, I will check them all.
Thanks again.

#5 Sambwise

    Advanced Member

  • Members
  • 637 posts

Posted 16 July 2018 - 08:38 PM

I dimly remember there being a bug at some point where TCP windows being too big (or too small?) would cause a hang with sync after a certain number of files (some kind of problem with the duplex transfer where it'd get wedged waiting for one of the buffers to get filled and sit there forever rather than flushing it).  The same thing would probably have happened on revert.  Is this with a direct connection or is there a proxy (or edge, or broker, or...) in between?

#6 Miles O'Neal

    Advanced Member

  • Members
  • 84 posts

Posted 20 July 2018 - 07:23 PM

There is a broker in between. I don't recall whether this user was going through it or not (we have some rogue users going direct, and we have not yet locked out direct access to the server). I'll check the TCP sizes on all the systems. Thanks.
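Something like this on each host, presumably (a sketch; the sysctl names are the standard Linux ones, and net.tcpsize is the Perforce-side buffer configurable):

    sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem    # per-connection TCP buffer ranges
    sysctl net.core.rmem_max net.core.wmem_max    # kernel caps on socket buffers
    p4 configure show net.tcpsize                 # Perforce's own send/receive buffer size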

#7 p4rfong

    Advanced Member

  • Staff Moderators
  • 285 posts

Posted 14 August 2018 - 02:32 AM

Also run "top" and press the number 1 and check whether any CPU is at 100%.  You may be out of CPU.

#8 Miles O'Neal

    Advanced Member

  • Members
  • 84 posts

Posted 16 August 2018 - 07:39 PM

We routinely see 100% usage on several of the cores. It's been that way forever, but it's not normally one p4d hogging a single core for any length of time. Here's a not atypical sample.

    PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
4011049 perforce  20   0  218m 192m 1688 R 100.0  0.0   0:04.44 p4d_1   
4011050 perforce  20   0  234m 209m 1688 R 100.0  0.0   0:04.44 p4d_1   
4011052 perforce  20   0  218m 193m 1688 R 100.0  0.0   0:04.44 p4d_1

#9 Miles O'Neal

    Advanced Member

  • Members
  • 84 posts

Posted 16 August 2018 - 07:41 PM

Or if you want some "1" output:

Cpu7  : 95.7%us,  4.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  4.0%us,  0.7%sy,  0.0%ni, 95.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 : 40.5%us,  2.0%sy,  0.0%ni, 57.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

#10 p4rfong

    Advanced Member

  • Staff Moderators
  • 285 posts

Posted 20 August 2018 - 05:54 PM

It does look like you are out of CPU, so if you revert smaller groups of files at a time, the revert should go faster. You have a number of CPUs, but faster CPUs would help.
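One way to batch the reverts (a sketch; it reuses the opened-files pipeline from earlier in the thread, assumes no spaces in depot paths, and must run in the client that has the files open):

    p4 opened //foo/... | sed -e 's/#.*//' | xargs -n 500 p4 revert    # revert up to 500 files per call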

#11 Thandesha

    Advanced Member

  • Members
  • 164 posts
  • Location: Sunnyvale, CA, USA

Posted 04 September 2018 - 10:11 PM

I generally delete the client spec from the admin side to mitigate this situation, since things can get worse if any of those files are binary and have exclusive locking enabled.
That doesn't address why it hangs, though; it just puts out the fire.

#12 Miles O'Neal

    Advanced Member

  • Members
  • 84 posts

Posted 12 September 2018 - 09:06 PM

The issue here isn't speed. It's that the revert simply hangs. It reverts some number of files (around 100 on average) and then just freezes. It will sit for hours doing exactly nothing.

#13 p4rfong

    Advanced Member

  • Staff Moderators
  • 285 posts

Posted 13 September 2018 - 06:24 PM

Did you run "p4 lockstat", "p4 lockstat -C"?  Did you try reverting a smaller directory so that CPU is not near 100%?  Hopefully this will provide some clues.

#14 Sambwise

    Advanced Member

  • Members
  • 637 posts

Posted 13 September 2018 - 06:30 PM

It doesn't make sense to me that "p4 revert" would be thrashing the CPU in the first place.  Long shot -- you haven't changed your map.joinmax tunables, have you?
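A quick way to check the current values (a sketch; it assumes a server new enough to show a single configurable by name):

    p4 configure show map.joinmax1
    p4 configure show map.joinmax2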

#15 Miles O'Neal

    Advanced Member

  • Members
  • 84 posts

Posted 17 September 2018 - 03:52 PM

any: map.joinmax1 = 1M
any: map.joinmax2 = 5M

Not sure if these have changed. If they have, it was almost certainly either on the recommendation of Perforce, or as a result of something we found in the forums.
But it's rare this happens, and I haven't seen anything odd about the situations I've dealt with. (Which doesn't mean it wasn't there.)

#16 Sambwise

    Advanced Member

  • Members
  • 637 posts

Posted 17 September 2018 - 06:12 PM

Quote

map.joinmax1 = 1M
map.joinmax2 = 5M


Not sure if these have changed. If they have, it was almost certainly either on the recommendation of Perforce, or as a result of something we found in the forums.

Yeah, those have definitely been increased.  The default values are:

    map.joinmax1    10K    Produce at most map1+map2+joinmax1
    map.joinmax2     1M    Produce at most joinmax2

Typically if someone has increased these values it means they encountered some really bad mapping explosion that triggered an "excessive wildcards" error.  This is basically like a carbon monoxide alarm going off in your basement, and there are two schools of thought on how to deal with it:
1. Figure out what's setting it off (which means taking measurements and inspecting your furnace and whatnot)
2. Yank the batteries to make the beeping stop

Option 2 results in better call resolution time metrics, but is not always the healthiest long-term solution.

The hanging revert you describe is the equivalent of feeling mysteriously light-headed.  It doesn't necessarily have anything to do with that annoying beeping noise you heard a while back and hit the snooze alarm on but...

#17 Miles O'Neal

    Advanced Member

  • Members
  • 84 posts

Posted 17 September 2018 - 07:57 PM

I tracked it down. It came from this article:
http://answers.perforce.com/articles/KB/3124/?q=upgrade+to+2013.3&l=en_US&fs=Search&pn=1
We merged with a company that still treats everything the way they did in SVN, which means they might have a couple hundred streams in a client, and those streams might have streams, and... Basically, each of their clients is the equivalent of the Mississippi River waterways system. That is (unfortunately) very unlikely to change. If we could prove that this were the cause here, we might have the leverage to change it (though it would take quite a while), and in the meantime we would still have to run this way.
Thanks! At least we have a possible culprit.

#18 Sambwise

    Advanced Member

  • Members
  • 637 posts

Posted 17 September 2018 - 08:11 PM

Miles O'Neal, on 17 September 2018 - 07:57 PM, said:

I tracked it down. It came from this article:
http://answers.perfo...&fs=Search&pn=1
We merged with a company that still treats everything the way they did in SVN, which means they might have a couple hundred streams in a client, and those streams might have streams, and... Basically, each of their clients is the equivalent of the Mississippi River waterways system. That is (unfortunately) very unlikely to change. If we could prove that this were the cause here, we might have the leverage to change it (though it would take quite a while), and in the meantime we would still have to run this way.
Thanks! At least we have a possible culprit.

It's not really about the number of lines in your client mapping so much as it is the complexity of the interactions between the mappings in the system -- in the case of revert, it's the join of the protection table with the client view.  In fact it's generally far more efficient to have a large number of simple mapping lines than it is to have a small number of complex ones (which runs counter to most people's intuition, but it makes sense once you look at the combinatorics of complex mapping joins).  Some of the constraints on stream views (i.e. for Perforce streams, not the streams you're talking about) are there specifically to help keep users from shooting themselves in the foot by introducing unnecessarily complex mappings.

When I was doing tech support (way back in the oughts) I spent a good amount of time digging into these and would find in many cases that there were very simple changes that could be made to the protection table to massively reduce its complexity without actually changing the permission semantics.  Usually a mapping blowup is due to a combination of protections and client views rather than one on its own, and since as the admin you have control over the protection table you can prevent most problems (even if your users are doing inadvisable things with client views) just by making sure things are tuned well on that end -- and if you leave the joinmax limits at their default, a user who does inadvisable things with client views will "fail fast" rather than wedge the server.
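To make that concrete, here's the kind of contrast I mean (hypothetical view lines; the embedded-wildcard mapping is the sort that joins expensively against the protection table, while the plain one-to-one lines stay cheap even in large numbers):

    # Expensive: embedded wildcards multiply when joined against protections
    //depot/.../generated/...    //my_client/.../generated/...
    # Cheap: simple trailing wildcards, even hundreds of lines like these
    //depot/projA/src/...        //my_client/projA/src/...
    //depot/projB/src/...        //my_client/projB/src/...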

#19 Miles O'Neal

    Advanced Member

  • Members
  • 84 posts

Posted 18 September 2018 - 03:44 PM

They translated whatever they were doing into Perforce streams, so Perforce streams is what I am talking about, unless I am misreading something. Thanks.

#20 Sambwise

    Advanced Member

  • Members
  • 637 posts

Posted 18 September 2018 - 08:17 PM

Ah -- a single client can only map one Perforce stream, so the "couple hundred streams in a client" confused me.  I'm guessing they mapped each Perforce stream to an arbitrarily large set of SVN streams?  That should actually work fine as long as they aren't also putting a lot of junk into the "Ignored" field (which is the one spot in a Perforce stream definition where you can have multiple wildcards in a view).  You're not going to blow past a 10k joinmax setting with a couple of hundred lines in the Path field of a stream (since those are nice normal one-to-one wildcards).
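For example, a stream spec along these lines should stay cheap (a sketch with hypothetical project names; it's the Ignored entries, which effectively expand into multi-wildcard exclusions, that you'd want to keep short):

    Paths:
        share ...
        import legacy_proj_001/... //depot/legacy_proj_001/...
        import legacy_proj_002/... //depot/legacy_proj_002/...
    Ignored:
        .o
        /derived/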




