
Miles O'Neal

Member Since 28 Oct 2014
Offline Last Active Yesterday, 04:26 PM

Topics I've Started

Protects table usage statistics

15 August 2019 - 09:25 PM

Is there any visibility into protects table usage? We have a fairly complex protects table, which I want to optimize for performance (without breaking authorization). It would be nice to have real data as to which rules are applied the most, or which paths are tested the most times. I might be able to pull a very rough approximation out of the logs by testing client names as well as depot and disk paths, but that's really kludgey and, I suspect, not very accurate.

Am I missing some built-in functionality? Is there another tool to help with this? I toyed with the p4 log analyzer but that is slow (we have large logs) and takes a lot of space if we want to track data over time. It's useful for some things, and I would consider it for this, but it doesn't excite me for this purpose.
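
For what it's worth, the rough log-based approximation could look something like the sketch below. It only counts how often each depot path prefix shows up in logged commands; it says nothing about which protections lines actually matched. The log line shape in the comment is an assumption based on typical P4LOG output, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed log line shape (typical P4LOG style; adjust to your server's format):
#   \t2019/08/15 09:25:01 pid 1234 fred@fred-ws 10.0.0.5 [p4/2017.1/...] 'user-sync //depot/proj/...'
CMD_RE = re.compile(
    r"pid \d+ (?P<user>[^@\s]+)@(?P<client>\S+) \S+ \[[^\]]*\] '(?P<cmd>[^']*)'"
)
# Anything that looks like //something inside the command text.
# Note: this also picks up client-syntax paths such as //fred-ws/...
DEPOT_PATH_RE = re.compile(r"//[^\s']+")


def path_prefix(path, depth=2):
    """Return the first `depth` components of a path, e.g. //depot/proj."""
    parts = path[2:].split("/")
    return "//" + "/".join(parts[:depth])


def tally_paths(lines, depth=2):
    """Count how often each path prefix appears in logged commands."""
    counts = Counter()
    for line in lines:
        match = CMD_RE.search(line)
        if not match:
            continue  # not a command line (or not in the assumed format)
        for path in DEPOT_PATH_RE.findall(match.group("cmd")):
            counts[path_prefix(path, depth)] += 1
    return counts
```

Feeding it an open log file (`tally_paths(open("log"))`) and calling `most_common()` on the result gives a ranked list of prefixes, which is a starting point for deciding which protections lines deserve attention.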


Helix in the cloud

13 August 2019 - 05:58 PM

Has anyone moved a large-ish Perforce installation (let's say, millions of files and/or terabytes of files, and/or at least 500GB of metadata) to the cloud?

If so, which cloud platform? How long ago? What has your experience been?

This is purely curiosity; sooner or later, it's bound to come up. When it does, I'd like to know if and how others are coping.

If you'd rather not say anything publicly, please feel free to message me directly (though I doubt I am the only person interested in the answers).


autoreload, especially vs replicas: odd behaviors

16 July 2019 - 09:29 PM

I've been looking for a while at disk size discrepancies, and have narrowed a big part down to autoreload labels.

For historical reasons, part of our CI process[1] generates several autoreload labels and tags many files (potentially hundreds of thousands) with each label. Some of these labels are removed very soon after the tag (or labelsync, I forget which). I've run across multiple issues that look related to this. The level of resultant discrepancies (outlined below) varies between master, local replicas, and remote replicas (>120 ms pings).
  • After running this way for several years, the number of dirs/files living in the unload depot varies widely across the systems (currently ranging from 70,000 to > 90,000 unload files).
  • Some of the dirs (ex: /p4/1/depots/unloaded/label/42,d/label:thing1-fred.ckp,d/ ) contain the expected file named 1.0 . Others are empty. Others contain a temp file (I believe it was tmp.nn.nnnn or something similar). A few contain both a 1.0 and a tmp file!
  • Some labels no longer had a directory, much less a file, in the unload depot.
  • Some dirs and files in the unload depot no longer had a label associated.
Some of the tmp files are transitory; I'm not speaking of those. I assume the aberrations with just a tmp file are from interrupted transfers[2]. I'm not sure why some dirs would have both. No file within the directory? No clue.

I also don't understand why the directory would be missing when the label is there.
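
A minimal sketch of how the directory states above can be counted on each server for comparison. It assumes a healthy leaf directory holds a single file named 1.0 and that leftover temp files start with "tmp." (I only half remember the temp naming, so adjust the prefix). The function names are just for illustration.

```python
import os
from collections import Counter


def classify_unload_dirs(root):
    """Classify each leaf directory under `root` by what it contains.

    Assumptions: a healthy directory holds a file named "1.0";
    leftover temp files have names starting with "tmp.".
    """
    results = {}
    for dirpath, dirnames, filenames in os.walk(root):
        if dirnames:
            continue  # only look at leaf directories
        has_data = "1.0" in filenames
        has_tmp = any(name.startswith("tmp.") for name in filenames)
        if has_data and has_tmp:
            state = "both"
        elif has_data:
            state = "ok"
        elif has_tmp:
            state = "tmp-only"
        elif not filenames:
            state = "empty"
        else:
            state = "other"
        results[dirpath] = state
    return results


def summarize(results):
    """Return a count of directories per state."""
    return Counter(results.values())
```

Running `summarize(classify_unload_dirs("/p4/1/depots/unloaded"))` on the master and on each replica gives per-state counts that can be compared side by side. Transitory tmp files will show up too, so a second run a few minutes later helps separate those from the stuck ones.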

Has anyone else noticed this?

For the record, we're still at 2017.1 . We have a workaround for the 2018 showstopper we hit, but are hopefully going to 2019 soon.

[1] We're reworking that process.
[2] Some of the temp files are deleted within a few seconds of creation. On busy servers, I can see the journal entry to delete a label hitting the replica while a large file was still being transferred. This could leave a directory or file behind without a label.

Helix replica lags for hours after restart

19 June 2019 - 04:35 PM

We are still running a master/replica configuration. All servers are high-end with plenty of fast CPUs and RAM, SSDs for metadata, SSDs or high-speed disks for logs, etc. One of our R/O replicas sits in the same rack as the master, connected via the same 10GB switches. Until this week, like the other replicas, this one has always been very quick to catch up on journal data via "p4 pull".

The problem replica is our backup system. Once a day we drop the Helix service (p4d) and backup metadata and versioned files. When backups complete, we restart Helix services on that host.

As of quite recently, this replica has started taking several hours to catch up after Helix services are started. A few thousand bytes will trickle in, then it will sit for 3-10 seconds and a few thousand more will trickle in, while far more are queuing up on the master. After somewhere between 5 and 8 hours, it will catch up and more or less stay caught up.
Our other R/O replica (2.5 msec away), a FR (forwarding replica) in the same rack, and even the two FRs in the UK (130msec) all behave normally.

There do not appear to be any network anomalies. Ping times each direction are normal. No network errors.
On the problem host, I notice that the journal pull thread reports as
   pull -i 1 [server.locks/replica/69,d/pull(W)]
while on all other replicas (read-only or forwarding), it reports as
   pull rotating journal [server.locks/replica/69,d/pull(W)]
All replicas are running the same OS (RHEL 6.9), kernel (2.6.32-696.18.7.el6.x86_64), and p4d version (P4D/LINUX26X86_64/2017.1/1534792 (2017/07/26)). Yes, those are all ancient. Downtime is hard to come by. The machines are all running reasonably identical hardware (the R/O replicas in particular are identical). The p4d configs appear identical to me.

While things are wonky, the p4d on the problem host consistently runs at 96%-98% CPU. During that time frame, the other R/O replica consistently runs in the 70%-80% range, with drops to below 40% that the problem host never shows. Once the bad replica catches up, each host's p4 pull thread CPU usage drops to below 1% average.

Our pull interval is 1 second. That's never been a problem.

If there are any clues in the system or Helix logs, we're missing them.
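
To put numbers on the lag, something like the sketch below can be run against saved "p4 pull -lj" output from each replica. The line format it parses ("Current replica journal state is: Journal N, Sequence M.") is my assumption about what that command prints, so check it against your own output first. The function names are hypothetical.

```python
import re

# Assumed shape of the two relevant lines of "p4 pull -lj" output:
#   Current replica journal state is:  Journal 69,  Sequence 1000.
#   Current master journal state is:   Journal 69,  Sequence 5000.
STATE_RE = re.compile(
    r"Current (replica|master) journal state is:\s*Journal (\d+),\s*Sequence (\d+)"
)


def parse_pull_state(text):
    """Return {"replica": (journal, sequence), "master": (journal, sequence)}."""
    state = {}
    for who, journal, sequence in STATE_RE.findall(text):
        state[who] = (int(journal), int(sequence))
    return state


def lag_bytes(state):
    """Sequence difference between master and replica, or None if the
    journal numbers differ or either side is missing."""
    replica = state.get("replica")
    master = state.get("master")
    if replica is None or master is None or replica[0] != master[0]:
        return None
    return master[1] - replica[1]
```

Sampling this every few seconds on the problem host and on the healthy R/O replica would show whether the gap grows steadily or in bursts, which might help narrow down where the time goes.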

Any ideas?

Notes on ldapsync behavior, especially the SearchFilter field

29 May 2019 - 03:58 PM

The forum editor has serious issues. It just erased part of this article when I edited it. Hopefully I have everything back.

The original Helix docs were ambiguous as to what the search filter in a Helix ldap spec should contain.[3] Until we started using the user sync feature of ldapsync, we were fine with the simplest value:

SearchFilter:   (sAMAccountName=%user%)

When we tried to use the user feature, we hit a snag. The ldapsync command does not allow the '@' or '#' characters in the AttributeUid field (set to sAMAccountName). When it hits an LDAP (in our case, AD) account name with one of those characters, it complains and bails out.[1] The LDAP records causing the issue were actually groups, not users! The "-u" (user) option of ldapsync refers only to what gets updated on the Helix side. What happens against AD is purely based on the Search* variables in the ldap spec. This makes sense, but isn't necessarily intuitive.

Restricting the search to an object class of user both eliminates this problem[2] and speeds up the search. The command used in testing was:

% time p4 ldapsync -u -U -n myLDAPspec

The following times were against a test server with only 10 users.

SearchFilter                                   Run time (min:sec)

(sAMAccountName=%user%)                        1:28
(&(objectClass=user)(sAMAccountName=%user%))   1:06

That last time was cut in half by adding a clause to ignore disabled accounts, (userAccountControl:1.2.840.113556.1.4.803:=2), but (at least in 2019.1) it turned out that at least some valid LDAP users were no longer recognized by Helix, so I removed it.
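
A minimal sketch of how the filters above compose, in case it helps anyone building their own. The helper name and its parameters are made up for illustration. One thing to be aware of: in AD, the userAccountControl clause as quoted above matches accounts that have the disabled bit (2) set, so a filter meant to exclude disabled accounts normally wraps it in (!...). The sketch uses the negated form.

```python
# Bitwise-AND matching rule against userAccountControl; bit 2 = account disabled.
DISABLED_CLAUSE = "(userAccountControl:1.2.840.113556.1.4.803:=2)"


def build_search_filter(attr="sAMAccountName", object_class=None,
                        exclude_disabled=False):
    """Compose an LDAP filter string for the SearchFilter field.

    %user% is left in place as the placeholder used in the ldap spec.
    """
    clauses = []
    if object_class:
        clauses.append(f"(objectClass={object_class})")
    if exclude_disabled:
        # Negate the clause so disabled accounts are skipped, not selected.
        clauses.append(f"(!{DISABLED_CLAUSE})")
    clauses.append(f"({attr}=%user%)")
    if len(clauses) == 1:
        return clauses[0]
    return "(&" + "".join(clauses) + ")"
```

With no arguments this returns the simplest filter from the table, and with object_class="user" it returns the second one. Whether the negated clause avoids the problem with unrecognized users on 2019.1 is something I have not verified.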

P4D configurables

The following configurables needed to be set or modified for ldapsync to work against the user database.

Configurable        Value       Notes

auth.ldap.timeout   As needed   Default is 30 seconds, not enough for large LDAP databases.
auth.ldap.pagesize  As needed   Our Windows pagesize is large.
dm.user.numeric     1           Only available in 2018.2 and later.

[1] I filed a request that the behavior be changed to simply warn of this and skip that record.
[2] Realizing the problem had the happy side effect of getting us to look at the AD groups and clean up some group names.
[3] Randall said he would update the KB based on this.