

Calculating the digest of a file locally

Tags: digest, md5, p4api.net


#1 John Scott

Posted 16 March 2015 - 05:15 PM

I'm trying to write my own digest creation algorithm, and am having some issues with text files.

As far as I can tell, the process is something like this:

* Convert all strings (including UTF-16) to UTF-8 without a BOM.
* Convert Windows line endings ("\r\n") to Unix line endings ("\n").
* Leave Mac line endings ("\r") alone.
* Run the MD5 checksum on the resulting data.

This works for the vast majority of files, but I must be missing something, as it fails for some: the digest comes out different, yet a diff of the local file against the depot version reports no difference, and the files are identical when viewed in hex too.

The attached file has a depot digest of 378BE809DF7D15AAC75A175693E25FBB, but I can only seem to get a digest of 3DCF20D7995C25F69EB9AB55019B0757 locally. It has Windows line endings.

Here's a LinqPad query to calculate the digest:

// "378BE809DF7D15AAC75A175693E25FBB"
// "3DCF20D7995C25F69EB9AB55019B0757"

FileInfo Info = new FileInfo( "D:\\Root\\ThirdParty\\Mono\\Mac\\etc\\mono\\browscap.ini" );
FileStream InputFile = Info.OpenRead();
byte[] EntireTextFile = new byte[Info.Length];
InputFile.Read( EntireTextFile, 0, ( int )Info.Length );
string InputString = Encoding.UTF8.GetString( EntireTextFile );

// Convert Windows line endings to Unix line endings
InputString = InputString.Replace( "\r\n", "\n" );
// Convert Mac line endings to Unix line endings
//InputString = InputString.Replace( '\r', '\n' );

EntireTextFile = Encoding.UTF8.GetBytes( InputString );

//EntireTextFile.Count( x => x == '\r' ).Dump();
//EntireTextFile.Count( x => x == '\n' ).Dump();

MD5 Checksummer = new MD5CryptoServiceProvider();
byte[] Checksum = Checksummer.ComputeHash( EntireTextFile );
string Digest = "";
foreach( byte Check in Checksum )
{
Digest += Check.ToString( "X2", CultureInfo.InvariantCulture );
}

Digest.Dump();

Cheers
John


#2 P4Matt

Posted 16 March 2015 - 08:43 PM

Your process generally seems correct. I flipped through our code and nothing jumped out at me as to what might be different. It may be worth trying our FileSys and MD5 classes; we have an example of using them in the merge3 command code:

https://swarm.worksh...entmain.cc#L564

I suspect there is some interesting processing done by our FileSys object when reading the file that is unique to us.

#3 John Scott

Posted 16 March 2015 - 09:11 PM

Thanks!

I don't suppose the source to StrBuf::Append( char* ) is available anywhere? That looks to do the interesting stuff.

I'll try the C++ approach when I have some time.

Cheers
John

#4 P4Matt

Posted 16 March 2015 - 09:39 PM

You can browse here:

    https://swarm.worksh...ware/p4/support

Or download it here:

    https://swarm.worksh...rce_software-p4

The currently published source is the 14.1 p4 source. I need to sit down and get the 14.2 source out. However, StrBuf hasn't changed in a meaningful way in a long time.

#5 John Scott

Posted 17 March 2015 - 06:49 PM

Got it. There was an ANSI character in the file (â), and I was reading the text as UTF-8. Reading the file with code page 1252 makes everything fine and dandy =)
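
In the LinqPad query above, the fix is just the decode/encode pair, e.g.:

string InputString = Encoding.GetEncoding( 1252 ).GetString( EntireTextFile );
// ... line-ending conversion as before ...
EntireTextFile = Encoding.GetEncoding( 1252 ).GetBytes( InputString );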

To clarify:

text -> read with the local code page -> checksum the resulting ANSI

utf16 -> read as UTF-16 -> convert to UTF-8 -> remove the BOM -> checksum the resulting UTF-8

unicode -> read as UTF-8 -> I don't have access to a unicode server, but I'd speculate you skip the BOM and checksum the resulting UTF-8

Would the above assumptions be correct?
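
To put that in code, here's a rough C# sketch of all three cases. Digest() and DigestTextFile() are my own hypothetical helpers, and the unicode branch in particular is the speculation above, not verified against a server:

static string Digest( byte[] Data )
{
    using( MD5 Hasher = MD5.Create() )
    {
        return BitConverter.ToString( Hasher.ComputeHash( Data ) ).Replace( "-", "" );
    }
}

static string DigestTextFile( string Path, string P4FileType )
{
    string Text;
    Encoding OutputEncoding;
    switch( P4FileType )
    {
        case "utf16":
            // Decode UTF-16 (ReadAllText eats the BOM), re-encode as BOM-less UTF-8.
            Text = File.ReadAllText( Path, Encoding.Unicode );
            OutputEncoding = new UTF8Encoding( false );
            break;
        case "unicode":
            // Speculation: decode UTF-8, skip any BOM, checksum the UTF-8 bytes.
            Text = File.ReadAllText( Path, Encoding.UTF8 );
            OutputEncoding = new UTF8Encoding( false );
            break;
        default:
            // "text": read and re-encode with the local ANSI code page.
            Text = File.ReadAllText( Path, Encoding.Default );
            OutputEncoding = Encoding.Default;
            break;
    }
    // Normalize Windows line endings as before; leave bare "\r" alone.
    Text = Text.Replace( "\r\n", "\n" );
    return Digest( OutputEncoding.GetBytes( Text ) );
}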

Cheers
John

#6 ThatGuy

Posted 17 March 2015 - 10:11 PM

I don't mean to go off topic here, but what are the filetypes of the files where the digests differ? Are they symlinks?

I've also seen these lines in your code, which might not be necessary:
// Convert Windows line endings to Unix line endings
InputString = InputString.Replace( "\r\n", "\n" );
// Convert Mac line endings to Unix line endings
//InputString = InputString.Replace( '\r', '\n' );

I think you can remove those lines and change the value of the client spec field "LineEnd:" to "share" instead of "local".
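
For example, in the spec form that "p4 client" opens:

LineEnd:	share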

Also refer to the Unicode filetype section of this document, as it most likely has some useful info on handling unicode files: http://www.perforce....r/i18nnotes.txt
This link also has useful information about BOMs: http://stackoverflow...e-unicode-files

Thanks,

Tunga.
Certified P4.

#7 John Scott

Posted 17 March 2015 - 10:22 PM

> but what are the filetypes of the files where the digests differ

Currently, just 'text' files, but it's early days yet =) I think I may also be trying to redelete files that have already been deleted.

> change the value of the client spec field "LineEnd:" to "share" instead of "local".

I'm writing a utility, so it needs to work on as many flavours of clientspecs as possible.

The above procedure is what I've found to work; whether it is correct or not is another matter altogether! =)


Cheers
John

#8 P4Matt

Posted 19 March 2015 - 05:13 AM

Your process above sounds about right, John. In general, Perforce always operates on UTF-8. With non-Unicode servers, we just assume everything is UTF-8. With Unicode servers, we translate everything to UTF-8 and then store it/diff it/mangle it as need be.

#9 bsg_jjoyce

Posted 20 January 2017 - 08:48 PM

So I'm trying to speed up cleans by MD5 hashing in C#. I have the majority of file encodings working, but UTF-16LE (unicode) seems to be doing something drastically different.

Here is the code I have currently (so people in the future trying to figure this out have a strong starting point). I highly recommend the UDE.Signed package on nuget.org, which was amazing for determining the file encodings.

So if anyone has any helpful tips on what UTF-16LE is doing, that would be amazing.
Currently, using the Unicode encoding, replacing \r with nothing, and including the UTF-8 BOM gets me within a byte of the expected file size, but obviously that was just me shooting in the dark.

private bool IsFileDirty(char[] buffer, byte[] md5InputBuffer, FileMetaData entry, File file)
{
    using (var fileReader = file.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        // Grab the first few bytes so we can check for a BOM later.
        var header = new byte[4];
        fileReader.Read(header, 0, 4);
        fileReader.Seek(0, SeekOrigin.Begin);

        // Detect the encoding with UDE's CharsetDetector.
        var cdet = new CharsetDetector();
        cdet.Feed(fileReader);
        cdet.DataEnd();
        fileReader.Seek(0, SeekOrigin.Begin);

        Encoding encoding;
        switch (cdet.Charset)
        {
            case "EUC-JP":
                encoding = Encoding.GetEncoding(20932);
                break;
            case "x-mac-cyrillic":
                encoding = Encoding.GetEncoding(10007);
                break;
            case "windows-1251":
                encoding = Encoding.GetEncoding(1251);
                break;
            case "windows-1252":
            case "ASCII":
                encoding = Encoding.Default;
                break;
            case "UTF-8":
                encoding = Encoding.UTF8;
                break;
            case "UTF-16LE":
                // Haven't worked out what the server does for utf16 yet,
                // so just treat these files as dirty for now.
                return true;
            default:
                encoding = Encoding.GetEncoding(1252);
                break;
        }

        var md5 = new MD5CryptoServiceProvider();
        var totalWrite = 0;

        // If the file starts with a UTF-8 BOM, include it in the hash.
        // (StartsWith on a byte[] and ToHex below are extension helpers not shown here.)
        if (encoding is UTF8Encoding)
        {
            var preamble = encoding.GetPreamble();
            if (header.StartsWith(preamble))
            {
                md5.TransformBlock(preamble, 0, preamble.Length, null, 0);
                totalWrite += preamble.Length;
            }
        }

        using (var reader = new StreamReader(fileReader, encoding))
        {
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Re-encode the decoded chars, dropping the '\r' of each "\r\n"
                // pair. (A "\r\n" split across two reads is not handled here.)
                var startIndex = 0;
                var write = 0;
                for (var i = 0; i < read; ++i)
                {
                    if (i + 1 >= read || buffer[i] != '\r' || buffer[i + 1] != '\n')
                        continue;
                    ConvertCharacters(encoding, buffer, startIndex, i - startIndex, md5InputBuffer, write, ref write, ref totalWrite);
                    startIndex = i + 1; // skip the '\r'; the '\n' starts the next segment
                }
                ConvertCharacters(encoding, buffer, startIndex, read - startIndex, md5InputBuffer, write, ref write, ref totalWrite);
                md5.TransformBlock(md5InputBuffer, 0, write, null, 0);
            }
            md5.TransformFinalBlock(new byte[0], 0, 0);
        }

        var hash = md5.Hash.ToHex();
        // If the text digest doesn't match, fall back to a binary comparison.
        return hash != entry.Digest && GetBinaryFileMD5(entry, file);
    }
}

private void ConvertCharacters(Encoding encoding, char[] buffer, int index, int count, byte[] destBuffer, int destIndex, ref int write, ref int totalWrite)
{
    // Encode 'count' chars from 'buffer' into 'destBuffer' and advance the counters.
    var bytes = encoding.GetBytes(buffer, index, count, destBuffer, destIndex);
    write += bytes;
    totalWrite += bytes;
}

private bool GetBinaryFileMD5(FileMetaData entry, File file)
{
    // Cheap size check first, then compare raw-byte MD5s.
    if (entry.FileSize != file.Size.SizeInBytes)
        return true;
    return file.GetMD5() != entry.Digest;
}


#10 Sambwise

Posted 02 February 2017 - 07:09 PM

bsg_jjoyce, on 20 January 2017 - 08:48 PM, said:

So if anyone has any helpful tips on what UTF-16LE is doing, that would be amazing.

https://swarm.worksh.../charcvt.cc#220

leads us to:

https://swarm.worksh.../basecvt.cc#448

I have no idea what any of that code does, but I think it's what you're looking for. Good luck! :)




