URL: https://encode.su/threads/3633-Is-wyhash-a-good-hash
Submission Tags: falconsandbox
Submission: On September 15 via api from US — Scanned from DE


THREAD: IS WYHASH A GOOD HASH?

 1.  28th May 2021, 14:55 #1
     fcorbelli
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Dec 2013 Location Italy Posts 1,358 Thanks 64 Thanked 182 Times
     in 155 Posts
     
     
     
     IS WYHASH A GOOD HASH?
     
     > https://github.com/wangyi-fudan/wyhash
     > 
     > Has anyone used it, and how would you rate it?
     > 
     > Thanks
     > 
     > Here is a comparison against XXH3 (128-bit):
     > 
     > Code:
     > 
     > C:\zpaqfranz>zpaqfranz sha1 c:\dropbox\dropbox -xxhash -summary 1 -all
     > zpaqfranz v51.29-experimental snapshot archiver, compiled May 28 2021
     > franz: summary 1
     > franz:use xxhash
     > Getting XXH3 ignoring .zfs and :$DATA
     > Computing filesize for 1 files/directory...125.383)
     > Found (7.99 GB) => 8.577.791.284 bytes (7.99 GB) / 21.164 files in 0.172000
     > 
     > Creating 32 hashing thread(s)
     > 001% 00:00:00 (  81.82 MB) of (   7.99 GB)           85.793.075 /sec
     > (...)
     > Algo XXH3 by 32 threads
     > Scanning filesystem time  0.172000 s
     > Data transfer+CPU   time  0.454000 s
     > Data output         time  0.016000 s
     > Total size                      8.577.791.284 (   7.99 GB)
     > Tested size                     8.577.791.284 (   7.99 GB)
     > Duplicated size                   304.398.860 ( 290.30 MB)
     > Duplicated files                        4.120
     > Worked on 8.577.791.284 bytes avg speed (hashtime) 18.893.813.400 B/s
     > GLOBAL SHA256: A9DA355009CD36FB639BC00A5177760E10E6A5C030402DD032407D6D2C5D2F50
     > 
     > 0.656 seconds (all OK)
     > 
     > 
     > Code:
     > 
     > C:\zpaqfranz>zpaqfranz sha1 c:\dropbox\dropbox -wyhash -summary 1 -all
     > zpaqfranz v51.29-experimental snapshot archiver, compiled May 28 2021
     > franz: summary 1
     > franz:use wyhash
     > Getting WYHASH ignoring .zfs and :$DATA
     > Computing filesize for 1 files/directory...125.383)
     > Found (7.99 GB) => 8.577.791.284 bytes (7.99 GB) / 21.164 files in 0.172000
     > 
     > Creating 32 hashing thread(s)
     > 001% 00:00:01 (  82.12 MB) of (   7.99 GB)           86.110.333 /sec
     > (...)
     > Algo WYHASH by 32 threads
     > Scanning filesystem time  0.172000 s
     > Data transfer+CPU   time  0.486000 s
     > Data output         time  0.000000 s
     > Total size                      8.577.791.284 (   7.99 GB)
     > Tested size                     8.577.791.284 (   7.99 GB)
     > Duplicated size                   304.398.860 ( 290.30 MB)
     > Duplicated files                        4.120
     > Worked on 8.577.791.284 bytes avg speed (hashtime) 17.649.776.304 B/s
     > GLOBAL SHA256: 3D251A28EA12AF07D846354AD30F89D4E527359253915A4BD8CCEC0E83737104
     > 
     > 0.704 seconds (all OK)
     
     > Last edited by fcorbelli; 28th May 2021 at 16:44.
     
     
     --------------------------------------------------------------------------------

 3.  30th May 2021, 13:04 #2
     birdie
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Jan 2020 Location Artem S. Tashkinov Posts 84 Thanks 18 Thanked
     17 Times in 15 Posts
     
     
     > I'm personally a fan of BLAKE3.
     
     
     --------------------------------------------------------------------------------

 5.  30th May 2021, 13:26 #3
     Gotty
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Oct 2017 Location Switzerland Posts 1,757 Thanks 826 Thanked 959
     Times in 544 Posts
     
     
     > wyhash is listed among the "fastest hash functions on x86_64 without
     > quality problems" on https://github.com/rurban/smhasher
     > BLAKE3, being a cryptographic hash, is slower.
     > 
     > I believe fcorbelli's use case is to have a fast, good-quality hash.
     > For that use case BLAKE3 is overkill: it gives no additional benefit
     > and is slower.
     
     
     --------------------------------------------------------------------------------

 7.  6th June 2021, 03:45 #4
     tansy
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Apr 2019 Location Europa Posts 480 Thanks 58 Thanked 45 Times in
     38 Posts
     
     
     > It is fast, that's true. I used it to hash strings, and that was much
     > faster than comparing them, but that was a long time ago.
     > There is one thing, though. As far as I know it's not a rolling hash;
     > you'd have to check, as it might have changed. A rolling hash (I think
     > that's what it's called) works like CRC: you hash some data and then
     > use that hash as the seed for the next part of the data. Say you have
     > a long data set to hash, like "asdfg". A normal hash has to hash it
     > all at once (0 is the initial seed): Y = H("asdfg", 0). A rolling hash
     > can split it up, so Y = H("g", H("f", H("d", H("s", H("a", 0))))). It
     > works with any split of the data: Y = H("fg", H("d", H("as", 0))).
     > If it's not rolling, you have to hash the whole data set at once.
     > 
     > It all comes down to your application: what do you intend to use it for?
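
The chaining property described in this post can be sketched in a few lines of C++17. This is only an illustration under stated assumptions: `H` here is a toy FNV-1a variant whose return value is its entire internal state, not wyhash, and the seed is the FNV offset basis rather than 0.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Toy seeded hash (FNV-1a style). The returned value IS the whole internal
// state, which is what allows the output to seed the next call.
// This is an illustrative stand-in, not wyhash.
std::uint64_t H(std::string_view data, std::uint64_t seed) {
    std::uint64_t h = seed;
    for (unsigned char c : data) {
        h ^= c;
        h *= 0x100000001b3ULL;  // FNV-1a 64-bit prime
    }
    return h;
}

void chained_example() {
    const std::uint64_t seed = 0xcbf29ce484222325ULL;  // FNV offset basis
    const std::uint64_t whole = H("asdfg", seed);
    // Feeding one byte at a time gives the same value...
    assert(H("g", H("f", H("d", H("s", H("a", seed))))) == whole);
    // ...and so does any other split of the same bytes in the same order.
    assert(H("fg", H("d", H("as", seed))) == whole);
}
```

A hash that finalizes its state or mixes in the total length at the end would not satisfy these equalities without exposing that extra state.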
     
     
     --------------------------------------------------------------------------------

 9.  6th June 2021, 13:48 #5
     Gotty
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Oct 2017 Location Switzerland Posts 1,757 Thanks 826 Thanked 959
     Times in 544 Posts
     
     
     > What you described is a "resumable hash".
     > 
     > A "rolling hash" uses a fixed-size window. That is: when you add a new
     > value (char), you also remove the oldest value (char). This is mostly
     > used to find a fixed-size substring in a big string.
     > What you described (the possibility to "pause" and resume hashing) is
     > another feature: it is satisfied when the hash value is exactly the
     > internal state (that is you don't have internal state other than the hash
     > value itself - your hash function is a pure function), so the internal
     > state (the hash value) is always preserved and passed from one run to
     > another run. You can pick up where you left off.
     > 
     > These are two different hash features - for different use cases.
     > 
     > As I understood, fcorbelli needs a hash function that is fast and is good
     > quality (with as few collisions as possible). Hashing in a window
     > (rolling hash) or pausing+resuming (resumable hash) are most probably not
     > among the desired features. Or are they?
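
To make the fixed-window idea concrete, here is a minimal sketch of a polynomial rolling hash used for substring search (the Rabin-Karp approach), in C++17. The base `kBase`, the function names, and the choice of wrapping 64-bit arithmetic are assumptions for illustration; nothing here is taken from wyhash or zpaq.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

constexpr std::uint64_t kBase = 257;  // arbitrary odd base (an assumption)

// Hash of a whole window computed from scratch; arithmetic wraps mod 2^64.
std::uint64_t window_hash(std::string_view w) {
    std::uint64_t h = 0;
    for (unsigned char c : w) h = h * kBase + c;
    return h;
}

// Slide the window by one byte: remove `out` (oldest), append `in` (newest).
// `top` must equal kBase^(window_size - 1).
std::uint64_t roll(std::uint64_t h, unsigned char out, unsigned char in,
                   std::uint64_t top) {
    return (h - out * top) * kBase + in;
}

// Index of the first occurrence of `needle` in `text`, or npos.
std::size_t find_rolling(std::string_view text, std::string_view needle) {
    const std::size_t n = needle.size();
    if (n == 0 || n > text.size()) return std::string_view::npos;
    std::uint64_t top = 1;
    for (std::size_t i = 1; i < n; ++i) top *= kBase;
    const std::uint64_t target = window_hash(needle);
    std::uint64_t h = window_hash(text.substr(0, n));
    for (std::size_t i = 0;; ++i) {
        // Equal hashes are only a hint; confirm with a real comparison.
        if (h == target && text.substr(i, n) == needle) return i;
        if (i + n >= text.size()) return std::string_view::npos;
        h = roll(h, text[i], text[i + n], top);
    }
}

void rolling_example() {
    assert(find_rolling("asdfghjkl", "fgh") == 3);
    assert(find_rolling("asdfghjkl", "xyz") == std::string_view::npos);
}
```

Each slide costs a constant number of operations regardless of the window size, which is the point of a rolling hash.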
     
     
     --------------------------------------------------------------------------------


 10. THANKS:
     
     > tansy (24th April 2022)
     
     --------------------------------------------------------------------------------

 11. 6th June 2021, 20:18 #6
     Jyrki Alakuijala
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Jun 2015 Location Switzerland Posts 1,057 Thanks 277 Thanked 394
     Times in 244 Posts
     
     
     > Yes. It is good.
     > 
     > It is in the same category as the xxHash, murmur, CityHash, and
     > FarmHash hashes: no attempt at guarantees, just good mixing. The basic
     > primitive of 64x64->128 multiplication is very well suited to mixing
     > bits, and was the best-performing mixing for CityHash. I left it out
     > because of cross-platform/compiler support, and we went with the easier
     > but worse 64x64->64.
     > 
     > (((Sorry for the lack of impressive professionalism here: I got my
     > original inspiration to try out 64x64->128 and XORing the upper and
     > lower halves from something that Sheldon said in Big Bang Theory )))
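
The primitive mentioned above (a 64x64->128 multiplication followed by XORing the two halves of the product) might look like the sketch below. It assumes a compiler that provides `unsigned __int128` (a GCC/Clang extension), and the name `mul_fold` is hypothetical; this is not code from wyhash or CityHash.

```cpp
#include <cassert>
#include <cstdint>

// Full 64x64->128 multiply, then fold the high and low 64-bit halves
// together with XOR. Requires unsigned __int128 (GCC/Clang extension).
std::uint64_t mul_fold(std::uint64_t a, std::uint64_t b) {
    const unsigned __int128 p = static_cast<unsigned __int128>(a) * b;
    return static_cast<std::uint64_t>(p) ^ static_cast<std::uint64_t>(p >> 64);
}

void mul_fold_example() {
    // Small inputs: the high half is zero, so the result is the plain product.
    assert(mul_fold(2, 3) == 6);
    // 0xFFFFFFFFFFFFFFFF * 2 = 0x1_FFFFFFFFFFFFFFFE: high = 1, low = ...FE.
    assert(mul_fold(0xFFFFFFFFFFFFFFFFULL, 2) == 0xFFFFFFFFFFFFFFFFULL);
}
```

With a plain 64x64->64 multiply the high half is simply discarded, so the upper input bits never influence the lower output bits; the fold brings that information back into the result.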
     
     
     --------------------------------------------------------------------------------

 13. 7th June 2021, 16:02 #7
     fcorbelli
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Dec 2013 Location Italy Posts 1,358 Thanks 64 Thanked 182 Times
     in 155 Posts
     
     
     > I am trying to replace SHA-1 within ZPAQ as a method of deduplicating the
     > data, with something faster.
     > 
     > After profiling an execution it is clear that the hash function (in this
     > case SHA1) plays an important role in the overall execution time.
     > 
     > So I started looking for various non-cryptographic functions, but still
     > suitable for this use.
     > 
     > wyhash, from the quick check I did above, has roughly the performance
     > of XXH3 (at least in the implementations I have); differences of a few
     > percentage points are irrelevant.
     > And yes, the XXH3 result is 160 bits long (a 128-bit hash plus a 32-bit
     > CRC-32 for a commutative check; 128+32). There are reasons to keep a
     > total length of 160 bits (20 bytes) in the file format, which I won't
     > go into here.
     > 
     > However, after a crude attempt to replace SHA1 with XXH3 inside ZPAQ's
     > source, I realized, to some surprise, that the execution time is
     > reduced by (about) 15%, and that's it.
     > 
     > This improvement does not justify, in my opinion, breaking backwards
     > compatibility, also because non-trivial work would be needed to
     > complete it.
     > 
     > Why is SHA1 not that much slower (in TOTAL execution time, i.e.
     > including I/O, "thinking time", compression, etc.) than a much faster
     > XXH3?
     > 
     > Mainly because Matt Mahoney's implementation uses a sort of "cache":
     > the hash function is called one byte at a time, but the "real" hash
     > is actually computed in blocks of 64 bytes.
     > 
     > 
     > Code:
     > 
     >   void put(int c) {  // hash 1 byte
     >     U32& r=w[U32(len)>>5&15];
     >     r=(r<<8)|(c&255);
     >     len+=8;
     >     if ((U32(len)&511)==0) process();
     >   }
     > 
     > Incidentally, it is no great surprise that Mahoney's code, while often
     > practically incomprehensible, is in fact very, very good.
     > I certainly don't have to explain who he is.
     > 
     > So, for now, I am suspending the replacement work, because it requires
     > a very thorough analysis of the deduplicator's operating logic, which
     > is not at all trivial (I will try it for the compression stage, another
     > really, REALLY complex creation of Mahoney's. Did I say complex?).
     > 
     > Matt no longer answers questions, so it becomes a difficult job, too
     > difficult, at least for me, just to save 15 or 20% of the total time.
     > 
     > After all, the third lesson taught at my university is: if it works
     > well, do not touch it.
     
     
     --------------------------------------------------------------------------------

 15. 7th June 2021, 16:14 #8
     tansy
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Apr 2019 Location Europa Posts 480 Thanks 58 Thanked 45 Times in
     38 Posts
     
     > Originally Posted by Gotty
     > What you described is a "resumable hash".
     > You're probably right; I don't know the exact terminology, although I
     > couldn't find anything about "resumable hash" on either DuckDuckGo or
     > Google. It's more of a common-sense term.
     > 
     > 
     > Originally Posted by Gotty
     > What you described (the possibility to "pause" and resume hashing) is
     > another feature: it is satisfied when the hash value is exactly the
     > internal state (that is you don't have internal state other than the hash
     > value itself...
     > That's not exactly true. It's almost true for the Merkle–Damgård
     > construction (SHA-x, MD5), but xxhash, mentioned here, is a
     > counterexample: its internal state is 5x32 bits while the hash is
     > 32 bits, and it's still resumable (as long as you keep that state).
     
     
     --------------------------------------------------------------------------------

 17. 7th June 2021, 17:17 #9
     tansy
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Apr 2019 Location Europa Posts 480 Thanks 58 Thanked 45 Times in
     38 Posts
     
     > Originally Posted by fcorbelli
     > After profiling an execution it is clear that the hash function (in this
     > case SHA1) plays an important role in the overall execution time.
     > (...)
     > However, after a crude attempt to replace SHA1 with XXH3 inside ZPAQ's
     > source, to some surprise I realized that the execution time is reduced by
     > (about) 15% and that's it.
     > How important, exactly? If it's 30% (that's my guess), then halving
     > its time gains you 15%. Even if you reduced it to zero, the gain
     > wouldn't exceed 30%.
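
The arithmetic in this reply is Amdahl's law. A small sketch follows; the 30% share is the poster's guess rather than a measurement, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Fraction of total run time saved when a stage that accounts for `share`
// of the time (0..1) becomes `speedup` times faster (Amdahl's law).
double time_saved(double share, double speedup) {
    return share * (1.0 - 1.0 / speedup);
}

void amdahl_example() {
    // Hashing assumed to be 30% of the run (a guess), made twice as fast.
    assert(std::fabs(time_saved(0.30, 2.0) - 0.15) < 1e-12);
    // Even a near-infinite speedup cannot save more than the 30% share.
    assert(time_saved(0.30, 1e12) < 0.30);
}
```

Read the other way round, an observed 15% total gain from a much faster hash suggests the hash stage was only a modest share of the run time to begin with.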
     > 
     > 
     > Originally Posted by fcorbelli
     > And yes, XXH3 is 160 bit long (with a 32bit CRC-32 for commutative check;
     > 128+32),
     > 
     > Well, I know CRC-32 is fast these days, but it still isn't that fast.
     > If you used XXHash128+XXHash32 instead of XXH3+CRC32, that could be
     > better.
     > 
     > It also depends on your data, namely how long these "strings" are.
     > XXH3 (and XXHash too) is designed not for short runs of bytes but for
     > long ones.
     > WyHash is definitely better with short strings.
     > 
     > So for streaming XXHash is great; for short strings WyHash is better
     > (Wyhash v4 is not optimal for stream processing, I think it is not easy
     > to implement a stream version of wyhash, particularly when wyhash is
     > still in evolution).
     
     > Last edited by tansy; 8th June 2021 at 03:49.
     
     
     --------------------------------------------------------------------------------

 19. 7th June 2021, 17:38 #10
     fcorbelli
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Dec 2013 Location Italy Posts 1,358 Thanks 64 Thanked 182 Times
     in 155 Posts
     
     > Originally Posted by tansy
     > How "important role" exactly?
     > Heavily depends on data.
     > 
     > 
     > Well, I know CRC-32 is fast these days but it still isn't that fast. If,
     > instead of XXH3+CRC32 you use XXHash128+XXHash32 that could be better.
     > CRC-32 is commutative, therefore perfect for my needs.
     > 
     > 
     > 
     > And it also depends on your data, namely how long are these "strings".
     > Who knows.
     > 
     > 
     > 
     > XXH3 (and XXHash too) is designed not for short runs of bytes but for
     > long ones.
     > WyHash is definitely better with short strings.
     > 
     > So for streaming XXHash is great, for short strings WyHash is better
     > (Wyhash v4 is not optimal for stream processing, I think it is not easy
     > to implement a stream version of wyhash, particularly when wyhash is
     > still in evolution).
     > The first thing is to make wyhash stateful:
     > inside the "hash class" I need a hash state.
     > Here is how it looks today.
     > First comes put(), which writes 1 byte at a time and calls process()
     > once the 512-bit buffer is full.
     > 
     > 
     > Code:
     > 
     >  
     > class SHA1 {
     > public:
     >   void put(int c) {  // hash 1 byte
     >     U32& r=w[U32(len)>>5&15];
     >     r=(r<<8)|(c&255);
     >     len+=8;
     >     if ((U32(len)&511)==0) process();
     >   }
     >   void write(const char* buf, int64_t n); // hash buf[0..n-1]
     >   double size() const {return len/8;}     // size in bytes
     >   uint64_t usize() const {return len/8;}  // size in bytes
     >   const char* result();  // get hash and reset
     >   SHA1() {init();}
     > private:
     >   void init();      // reset, but don't clear hbuf
     >   U64 len;          // length in bits
     >   U32 h[5];         // hash state
     >   U32 w[16];        // input buffer
     >   char hbuf[20];    // result
     >   void process();   // hash 1 block
     > };
     > 
     > Please note the h[] state vector and the w[] input buffer of
     > 16*4*8 = 512 bits.
     > 
     > 
     > Code:
     > 
     > // Start a new hash
     > void SHA1::init() {
     >   len=0;
     >   h[0]=0x67452301;
     >   h[1]=0xEFCDAB89;
     >   h[2]=0x98BADCFE;
     >   h[3]=0x10325476;
     >   h[4]=0xC3D2E1F0;
     >   memset(w, 0, sizeof(w));
     >   
     > }
     > 
     > See the state h[]
     > 
     > 
     > Code:
     > 
     > // Return old result and start a new hash
     > const char* SHA1::result() {
     > 
     >   // pad and append length
     >   const U64 s=len;
     >   put(0x80);
     >   while ((len&511)!=448)
     >     put(0);
     >   put(s>>56);
     >   put(s>>48);
     >   put(s>>40);
     >   put(s>>32);
     >   put(s>>24);
     >   put(s>>16);
     >   put(s>>8);
     >   put(s);
     > 
     >   // copy h to hbuf
     >   for (unsigned int i=0; i<5; ++i) {
     >     hbuf[4*i]=h[i]>>24;
     >     hbuf[4*i+1]=h[i]>>16;
     >     hbuf[4*i+2]=h[i]>>8;
     >     hbuf[4*i+3]=h[i];
     >   }
     > 
     >   // return hash prior to clearing state
     >   init();
     >   return hbuf;
     > }
     > 
     > Code:
     > 
     > // Hash buf[0..n-1]
     > void SHA1::write(const char* buf, int64_t n) {
     >   const unsigned char* p=(const unsigned char*) buf;
     >   for (; n>0 && (U32(len)&511)!=0; --n) put(*p++);
     >   for (; n>=64; n-=64) {
     >     for (unsigned int i=0; i<16; ++i)
     >       w[i]=p[0]<<24|p[1]<<16|p[2]<<8|p[3], p+=4;
     >     len+=512;
     >     process();
     >   }
     >   for (; n>0; --n) put(*p++);
     > }
     > 
     > And the "cached" processor
     > 
     > Code:
     > 
     > // Hash 1 block of 64 bytes
     > void SHA1::process() {
     >     
     > 
     >   U32 a=h[0], b=h[1], c=h[2], d=h[3], e=h[4];
     >   static const U32 k[4]={0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6};
     >   #define f(a,b,c,d,e,i) \
     >     if (i>=16) \
     >       w[(i)&15]^=w[(i-3)&15]^w[(i-8)&15]^w[(i-14)&15], \
     >       w[(i)&15]=w[(i)&15]<<1|w[(i)&15]>>31; \
     >     e+=(a<<5|a>>27)+k[(i)/20]+w[(i)&15] \
     >       +((i)%40>=20 ? b^c^d : i>=40 ? (b&c)|(d&(b|c)) : d^(b&(c^d))); \
     >     b=b<<30|b>>2;
     >   #define r(i) f(a,b,c,d,e,i) f(e,a,b,c,d,i+1) f(d,e,a,b,c,i+2) \
     >                f(c,d,e,a,b,i+3) f(b,c,d,e,a,i+4)
     >   r(0)  r(5)  r(10) r(15) r(20) r(25) r(30) r(35)
     >   r(40) r(45) r(50) r(55) r(60) r(65) r(70) r(75)
     >   #undef f
     >   #undef r
     >   h[0]+=a; h[1]+=b; h[2]+=c; h[3]+=d; h[4]+=e;
     > }
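
The put()/write()/process() pattern shown in the SHA1 class can be sketched independently of any particular hash. The class below is a toy under stated assumptions: `BlockHasher` is a hypothetical name, the mixing step is a placeholder in FNV-1a style (neither SHA-1 nor wyhash), and the 64-byte block size simply mirrors the buffer above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy streaming hasher: bytes are buffered and the state is updated only
// once per full 64-byte block. The mixing is a placeholder, not SHA-1.
class BlockHasher {
public:
    void put(unsigned char c) {  // hash 1 byte
        buf_[fill_++] = c;
        ++len_;
        if (fill_ == sizeof(buf_)) process();
    }
    void write(const void* data, std::size_t n) {  // hash data[0..n-1]
        const unsigned char* p = static_cast<const unsigned char*>(data);
        while (n > 0 && fill_ != 0) { put(*p++); --n; }  // finish partial block
        for (; n >= sizeof(buf_); n -= sizeof(buf_), p += sizeof(buf_)) {
            std::memcpy(buf_, p, sizeof(buf_));  // whole blocks, no per-byte call
            fill_ = sizeof(buf_);
            len_ += sizeof(buf_);
            process();
        }
        while (n > 0) { put(*p++); --n; }  // leftover tail
    }
    std::uint64_t result() {  // get hash and reset
        if (fill_ != 0) process();  // flush the partial last block
        const std::uint64_t h = (state_ ^ len_) * kPrime;  // fold in the length
        *this = BlockHasher();
        return h;
    }

private:
    static constexpr std::uint64_t kPrime = 0x100000001b3ULL;
    void process() {  // mix the buffered bytes into the state
        for (std::size_t i = 0; i < fill_; ++i)
            state_ = (state_ ^ buf_[i]) * kPrime;
        fill_ = 0;
    }
    std::uint64_t state_ = 0xcbf29ce484222325ULL;  // hash state
    std::uint64_t len_ = 0;                        // total bytes seen
    std::size_t fill_ = 0;                         // bytes currently buffered
    unsigned char buf_[64] = {};                   // input buffer
};

void block_hasher_example() {
    const char msg[] = "the same bytes, fed two different ways";
    BlockHasher a, b;
    a.write(msg, sizeof(msg) - 1);
    for (std::size_t i = 0; i + 1 < sizeof(msg); ++i)
        b.put(static_cast<unsigned char>(msg[i]));
    assert(a.result() == b.result());
}
```

The caller can feed data in any granularity and still get one result, which is the kind of state a one-shot hash function would need to be wrapped in.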
     > 
     > 
     > 
     > Hashing in a window (rolling hash) or pausing+resuming (resumable hash)
     > Well, I need both (long explanation omitted).
     > The first is for finding deduplication boundaries, the second for
     > "checksumming" purposes.
     > 
     > Hey, that's Mahoney after all :)
     
     
     --------------------------------------------------------------------------------

 21. 25th July 2021, 19:26 #11
     fcorbelli
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Dec 2013 Location Italy Posts 1,358 Thanks 64 Thanked 182 Times
     in 155 Posts
     
     
     > OK, after some work, here is a (just for fun) implementation of wyhash
     > for file hashing.
     > 
     > For lazy Windows users, here are the binaries:
     > https://sourceforge.net/projects/zpaqfranz/files/
     > 
     > How to use it?
     > Some examples:
     > 
     > Just speed check
     > 
     > Code:
     > 
     > zpaqfranz sha1 z:\knb -wyhash -summary
     > 
     > Multithread (for SSDs)
     > 
     > Code:
     > 
     > zpaqfranz sha1 z:\knb -wyhash -all -summary
     > 
     > Multithread (for SSDs), limited to 5 threads
     > 
     > Code:
     > 
     > zpaqfranz sha1 z:\knb -wyhash -all -t5 -summary
     > 
     > Instead of -wyhash, just for fun
     > 
     > -sha1
     > -sha256
     > -xxhash
     > -xxh3
     > -crc32
     > -crc32c
     > 
     > WARNING: the other hashes use the "normal" chunked read of files (with
     > different buffer sizes).
     > wyhash, instead, uses memory-mapped files, so heavy OS overhead takes
     > place.
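
For readers unfamiliar with the "chunked read" approach mentioned in the warning, here is a rough sketch of hashing a file through a fixed-size buffer. It is an assumption-laden illustration: the FNV-1a update function stands in for whatever incremental hash is used, and `hash_file_chunked` is a hypothetical name, not a zpaqfranz function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

constexpr std::uint64_t kFnvOffset = 0xcbf29ce484222325ULL;
constexpr std::uint64_t kFnvPrime = 0x100000001b3ULL;

// Incremental FNV-1a: continue hashing from state `h` (placeholder hash).
std::uint64_t fnv1a_update(std::uint64_t h, const char* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        h ^= static_cast<unsigned char>(p[i]);
        h *= kFnvPrime;
    }
    return h;
}

// Hash a file by reading it in chunks of `chunk_size` bytes.
// Returns false if the file cannot be opened or chunk_size is 0.
bool hash_file_chunked(const std::string& path, std::size_t chunk_size,
                       std::uint64_t& out) {
    if (chunk_size == 0) return false;
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::vector<char> buf(chunk_size);
    std::uint64_t h = kFnvOffset;
    // read() fails on the final short chunk, but gcount() still reports
    // how many bytes were delivered, so that tail is hashed as well.
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) ||
           in.gcount() > 0) {
        h = fnv1a_update(h, buf.data(), static_cast<std::size_t>(in.gcount()));
    }
    out = h;
    return true;
}

// The buffer size must not change the result, only the speed.
void chunked_example(const std::string& path) {
    std::uint64_t small = 0, large = 0;
    if (hash_file_chunked(path, 7, small) && hash_file_chunked(path, 65536, large))
        assert(small == large);
}
```

This only works with a hash that can be updated incrementally; a one-shot function needs the whole input in memory, which is where memory mapping comes in.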
     > 
     > Other examples
     > 
     > Code:
     > 
     > zpaqfranz sha1 k:\*.mp4 -xxh3 -all
     > 
     > or
     > 
     > Code:
     > 
     > zpaqfranz ? sha1
     
     Attached Files
      * zpaqfranz.cpp (995.7 KB, 35 views)
     
     
     --------------------------------------------------------------------------------

 23. 25th July 2021, 19:30 #12
     fcorbelli
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Dec 2013 Location Italy Posts 1,358 Thanks 64 Thanked 182 Times
     in 155 Posts
     
     
     > To check whether wyhash is "good" (for this use), run it against a
     > folder with duplicated files.
     > 
     > Code:
     > 
     > Scanning filesystem time            0.187 s
     > Data transfer+CPU   time            1.110 s
     > Data output         time            0.016 s
     > Total size                      8.807.229.660 (   8.20 GB)
     > Tested size                     8.807.229.660 (   8.20 GB)
     > Duplicated size                   264.663.198 ( 252.40 MB)
     > Duplicated files                        4.094
     > 
     > For every algorithm, the duplicated size and count must be equal.
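
One way to picture this check: group files by their content hash and count every file whose hash was already seen. The sketch below assumes that is roughly how duplicates are counted (the actual zpaqfranz logic is not shown in the thread), and all names in it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct FileEntry {
    std::uint64_t hash;  // content hash reported by the algorithm under test
    std::uint64_t size;  // file size in bytes
};

struct DupStats {
    std::size_t files = 0;    // number of files flagged as duplicates
    std::uint64_t bytes = 0;  // total size of those files
};

// A file counts as a duplicate when an earlier file had the same hash.
DupStats count_duplicates(const std::vector<FileEntry>& entries) {
    std::unordered_set<std::uint64_t> seen;
    DupStats stats;
    for (const FileEntry& e : entries) {
        if (!seen.insert(e.hash).second) {  // insert failed: hash already seen
            ++stats.files;
            stats.bytes += e.size;
        }
    }
    return stats;
}

void duplicates_example() {
    // Three copies of one 100-byte file plus two unique files.
    const std::vector<FileEntry> files = {
        {1, 100}, {2, 50}, {1, 100}, {1, 100}, {3, 7}};
    const DupStats s = count_duplicates(files);
    assert(s.files == 2 && s.bytes == 200);
}
```

A hash that produced a spurious collision on the dataset would report more duplicates than the others, so matching totals across algorithms are evidence (not proof) that no collision occurred on that data.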
     
     
     --------------------------------------------------------------------------------

 25. 25th July 2021, 23:10 #13
     Bulat Ziganshin
     
     Programmer
     
     --------------------------------------------------------------------------------
     
     Join Date Mar 2007 Location Uzbekistan Posts 4,736 Thanks 864 Thanked 784
     Times in 423 Posts
     
     
     > Hash quality should be checked with SMHasher, treating each 32 bits of
     > the hash output as a distinct hash.
     
     
     --------------------------------------------------------------------------------

 27. 26th July 2021, 14:34 #14
     fcorbelli
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Dec 2013 Location Italy Posts 1,358 Thanks 64 Thanked 182 Times
     in 155 Posts
     
     > Originally Posted by Bulat Ziganshin
     > hash quality should be checked with SmHasher, each 32 bit of hash output
     > as distinct hash
     > There is no real way to confirm the absence of collisions; even
     > extensive empirical tests give no guarantees (see meow).
     > 
     > I use my own dataset, which obviously doesn't guarantee anything.
     > It does, however, weed out the hashes that should NOT be used.
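
As a rough yardstick for why a single dataset cannot show much, the birthday approximation gives the expected number of accidental collisions for an idealized hash. The assumptions are that the outputs are uniformly random over b bits and that n distinct files are hashed; the numbers plug in n = 21,164 (the file count from the first post's run) and b = 64 purely as an example.

```latex
\[
  \mathbb{E}[\text{collisions}] \approx \binom{n}{2}\, 2^{-b}
  = \frac{n(n-1)}{2^{\,b+1}}
\]
\[
  n = 21164,\; b = 64:\qquad
  \frac{21164 \cdot 21163}{2^{65}} \approx 1.2 \times 10^{-11}
\]
```

With an expectation this small, seeing zero collisions is what any reasonable 64-bit hash would produce, so such a test can only expose badly broken functions.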
     > 
     > Leaving aside the question of cryptographic strength (which, by the
     > way, is not claimed), wyhash surprised me positively with its speed and
     > with source code that is not particularly complex, unlike XXH.
     > 
     > However, it lacks a block ("chunk") mode, which was removed in one of
     > the previous versions, so it can only be used for hashing from memory
     > maps, which is much, much slower, or for small strings.
     > 
     > Obviously, the greater the abstraction, the greater the overhead.
     > 
     > Still, I was curious to see how it compares with others in both
     > similar and "higher" categories.
     
     
     --------------------------------------------------------------------------------

 29. 22nd April 2022, 17:21 #15
     SpyFX
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Jun 2015 Location Moscow Posts 60 Thanks 4 Thanked 8 Times in 7
     Posts
     
     
     > Hi, I missed this thread for some unknown reason.
     > 
     > I tested the implementations of SHA-1 and SHA-256 in zpaq; they are
     > very slow in principle. If you use a different implementation, I think
     > you can get the same +15%.
     
     
     --------------------------------------------------------------------------------

 31. 22nd April 2022, 18:33 #16
     fcorbelli
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Dec 2013 Location Italy Posts 1,358 Thanks 64 Thanked 182 Times
     in 155 Posts
     
     > Originally Posted by SpyFX
     > Hi, I missed this thread for some unknown reason.
     > 
     > I tested the implementations of SHA-1 and SHA-256 in zpaq; they are very
     > slow in principle. If you use a different implementation, I think you can
     > get the same result about 15% faster.
     > 
     > Mmhhh...
     > Even with hardware SHA-1 the gain is very small (already checked):
     > https://github.com/fcorbelli/ugo
     > 
     > PS: SHA-1 in zpaq is one of the fastest I have ever seen, when used inside zpaq.
     
     
     
     --------------------------------------------------------------------------------

 32. 
 33. 10th December 2022, 03:33 #17
     tansy
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Apr 2019 Location Europa Posts 480 Thanks 58 Thanked 45 Times in
     38 Posts
     
     > Originally Posted by fcorbelli
     > There is no real way to confirm the non-existence of collisions; even
     > extensive empirical tests give no guarantees (see meow).
     > 
     > SMHasher tests hashes well, and tests other than collision tests can also
     > show whether a hash is good or not. There is also a way to test collisions:
     > collisionsTest from the xxHash test suite.
     > You need a lot of RAM, though. You can use `--filter`, which helps
     > reduce memory consumption, though it is slower.
     
     
     
     --------------------------------------------------------------------------------

 34. 
 35. 11th December 2022, 15:30 #18
     fcorbelli
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Dec 2013 Location Italy Posts 1,358 Thanks 64 Thanked 182 Times
     in 155 Posts
     
     > Originally Posted by tansy
     > SMHasher tests hashes well, and tests other than collision tests can also
     > show whether a hash is good or not. There is also a way to test collisions:
     > collisionsTest from the xxHash test suite.
     > You need a lot of RAM, though. You can use `--filter`, which helps
     > reduce memory consumption, though it is slower.
     > 
     > Only empirical tests.
     > I made my own with about 10 TB of data.
     > But wyhash does not seem trustworthy at all, so I just dropped it.
     > 
     > Quick-and-dirty speed test:
     > 
     > Code:
     > 
     >       BLAKE3:     3.29 GB/s (done    16.42 GB)
     >     XXHASH64:     5.47 GB/s (done    27.29 GB)
     >         XXH3:     6.24 GB/s (done    31.11 GB)
     >      CRC-32C:     6.41 GB/s (done    32.05 GB)
     >       CRC-32:     7.89 GB/s (done    39.35 GB)
     >     NILSIMSA:     7.92 GB/s (done    39.47 GB)
     >       WYHASH:     7.93 GB/s (done    39.51 GB)
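
A quick-and-dirty throughput measurement of this kind can be sketched as below. This is not the tool that produced the figures above; it is a minimal illustration that assumes an in-memory buffer and uses only hashes available in the Python standard library (CRC-32, SHA-1, BLAKE2b), so its numbers are not comparable with the table. The helper name `throughput_gb_s` is hypothetical.

```python
import hashlib
import time
import zlib

def throughput_gb_s(fn, buf, repeats=10):
    """Hash `buf` `repeats` times and return the rate in GB/s (1 GB = 1e9 bytes)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(buf)
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(buf) * repeats / elapsed / 1e9

# Assumption: an all-zero 8 MiB buffer is good enough here, because the speed
# of these functions does not depend on the content of the input.
buf = bytes(8 * 1024 * 1024)

candidates = {
    "CRC-32": zlib.crc32,
    "SHA-1": lambda b: hashlib.sha1(b).digest(),
    "BLAKE2b": lambda b: hashlib.blake2b(b).digest(),
}

for name, fn in candidates.items():
    print(f"{name:>12}: {throughput_gb_s(fn, buf):8.2f} GB/s")
```

As noted in the thread, a measurement like this says nothing about hash quality; it only ranks implementations by speed on one machine.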
     
     
     
     --------------------------------------------------------------------------------

 36. 
 37. 11th December 2022, 21:02 #19
     tansy
     
     Member
     
     --------------------------------------------------------------------------------
     
     Join Date Apr 2019 Location Europa Posts 480 Thanks 58 Thanked 45 Times in
     38 Posts
     
     > Originally Posted by fcorbelli
     > Only empirical tests.
     > 
     > It is an empirical test.
     > 
     > 
     > Originally Posted by fcorbelli
     > I made my own with about 10 TB of data.
     > But wyhash does not seem trustworthy at all, so I just dropped it.
     > 
     > Speed tests are just speed tests.
     > 
     > If you have a lot of RAM you can really test it with collisionsTest. You
     > may need to feed it the right number of hashes to get some collisions. Here is
     > how to calculate it.
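
The calculation referred to here is not reproduced in the thread. One common way to estimate it, assuming an ideal hash with uniformly distributed outputs, is the birthday approximation: for n hashes of b bits, the expected number of colliding pairs is about n(n-1)/2^(b+1). The sketch below is an illustration of that approximation, not necessarily the method tansy had in mind; both function names are hypothetical.

```python
import math

def expected_collisions(n_hashes, bits):
    """Expected number of colliding pairs among n ideal `bits`-bit hashes."""
    return n_hashes * (n_hashes - 1) / 2 ** (bits + 1)

def hashes_needed(target_collisions, bits):
    """Approximate number of hashes needed to expect `target_collisions` pairs."""
    return math.ceil(math.sqrt(target_collisions * 2 ** (bits + 1)))

# For a 64-bit hash, about 2**32 inputs give roughly half an expected collision.
print(expected_collisions(2 ** 32, 64))

# Number of 64-bit hashes to generate in order to expect about 10 collisions.
print(hashes_needed(10, 64))
```

A hash that produces far more collisions than this estimate on a large run is suspect; a hash that matches it has passed this one test only, which is consistent with the point made earlier that empirical tests give no guarantees.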
     
     Attached Files
      * xxHash-dev-tests-collisions-wyhash.diff.gz (880 Bytes, 19 views)
     
     > Last edited by tansy; 7th September 2023 at 14:18.
     
     
     
     --------------------------------------------------------------------------------

All times are GMT +3. The time now is 12:14.
Powered by vBulletin; Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.