URL:
https://dfir.science/2017/07/How-To-Fuzzy-Hashing-with-SSDEEP-%28similarity-matching%29.html
[HOW TO] FUZZY HASHING WITH SSDEEP (SIMILARITY MATCHING)

JOSHUA I. JAMES
Digital Forensic Scientist

SSDEEP is a fuzzy hashing tool written by Jesse Kornblum. There is quite a bit of published work on similarity hashing and how it compares with other methods. The mainstream tools for digital forensics, however, appear to be ssdeep and sdhash. For example, NIST created hash sets using both tools. I wrote a post about sdhash in 2012 if you want to know a little more about how it works. Let’s get to it!

SSDEEP

SSDEEP creates a hash value that attempts to detect the level of similarity between two files at the binary level. This is different from a cryptographic hash (like SHA1) because a cryptographic hash can only check for exact matches (or non-matches).

A cryptographic hash is useful if we want to ask “Is file 1 exactly like file 2?” A fuzzy hash / similarity hash is useful if we want to ask “Does any part of file 1 exist in file 2?”

AN EXAMPLE

Imagine a file with the following text:

> One morning, when Gregor Samsa woke from troubled dreams, he found himself
> transformed in his bed into a horrible vermin.
> He lay on his armour-like back, and if he lifted his head a little he could
> see his brown belly, slightly domed and divided by arches into stiff sections.
> The bedding was hardly able to cover it and seemed ready to slide off any
> moment.
> His many legs, pitifully thin compared with the size of the rest of him, waved
> about helplessly as he looked. “What’s happened to me?” he thought. It wasn’t
> a dream. His room, a proper human room although a little too small, lay
> peacefully between its four familiar walls.
> A collection of textile samples lay spread out on the table - Samsa was a
> traveling salesman - and above it there hung a picture that he had recently
> cut out of an illustrated magazine and housed in a nice, gilded frame. It
> showed a lady fitted out with a fur hat and fur boa who sat upright, raising a
> heavy fur muff that covered the whole of her lower arm towards the viewer.
> Gregor then turned to look out the window at the dull weather.

If we use a cryptographic hash, we may get the following hash value (SHA1):

    joshua@Zeus ~/ $ sha1sum test.txt
    2222825996bb74f3824e75e2dd44b0095d3b300a test.txt

With ssdeep we get the following fuzzy hash value:

    joshua@Zeus ~/ $ ssdeep -s test.txt
    ssdeep,1.1--blocksize:hash:hash,filename
    24:Ol9rFBzwjx5ZKvBF+bi8RuM4Pp6rG5Yg+q8wIXhMC:qrFBzKx5s8sM4grq8wIXht,"~/test.txt"

So far we can see that ssdeep hashes can be much larger than cryptographic hashes such as MD5 or SHA1. That means storing a large number of fuzzy hashes will take a lot more space, so we need to consider when fuzzy hashing is most useful for our investigations.

I’m going to output our fuzzy hash into a “database” that I can use for matching later. You can name the database anything you want. I’m going to use “fuzzy.db” for now.

    joshua@Zeus ~/ $ ssdeep -s test.txt > fuzzy.db

Now the file fuzzy.db contains the fuzzy hash created from test.txt.

Now imagine we remove the words "pitifully thin compared with the size of the rest of him" from the original file. What happens to our hashes?

    joshua@Zeus ~/ $ sha1sum test.txt
    25ce1f22b6391d552591f1c4bec70047998ab344 test.txt
    joshua@Zeus ~/ $ ssdeep -s test.txt
    ssdeep,1.1--blocksize:hash:hash,filename
    24:Ol9rFBzwjx5ZKvBBi8RuM4Pp6rG5Yg+q8wIXhMC:qrFBzKx5L8sM4grq8wIXht,"~/test.txt"

If we look at the SHA1 hash, it is completely different. This is exactly what it should do. If a single bit changes, the resulting cryptographic hash should change. But what about the fuzzy hash? In the main string, most of the hash is unchanged, with a change at BBi8RuM4Pp6rG5Yg (where F+b in the original hash became B).
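To locate that change without comparing the strings by eye, here is a minimal Python sketch. It is not part of ssdeep: the helper names `parse_signature` and `split_common` are hypothetical, and the two signature lines are copied from the ssdeep output in this post. It only shows which characters of the hash strings stayed the same; ssdeep's own 0 to 100 score is computed differently.

```python
# Signature lines copied from the ssdeep output in this post.
original = '24:Ol9rFBzwjx5ZKvBF+bi8RuM4Pp6rG5Yg+q8wIXhMC:qrFBzKx5s8sM4grq8wIXht,"~/test.txt"'
modified = '24:Ol9rFBzwjx5ZKvBBi8RuM4Pp6rG5Yg+q8wIXhMC:qrFBzKx5L8sM4grq8wIXht,"~/test.txt"'


def parse_signature(line):
    """Split 'blocksize:hash:hash,filename' into its four fields (hypothetical helper)."""
    # The hash strings never contain ',' or ':', so the first comma ends the signature.
    signature, _, filename = line.partition(",")
    blocksize, hash1, hash2 = signature.split(":")
    return int(blocksize), hash1, hash2, filename.strip('"')


def split_common(a, b):
    """Return (shared prefix, middle of a, middle of b, shared suffix)."""
    shortest = min(len(a), len(b))
    p = 0
    while p < shortest and a[p] == b[p]:
        p += 1
    s = 0
    # The suffix may not overlap the prefix that was already matched.
    while s < shortest - p and a[-1 - s] == b[-1 - s]:
        s += 1
    return a[:p], a[p:len(a) - s], b[p:len(b) - s], a[len(a) - s:]


_, orig_h1, orig_h2, _ = parse_signature(original)
_, mod_h1, mod_h2, _ = parse_signature(modified)

for label, old, new in (("hash 1", orig_h1, mod_h1), ("hash 2", orig_h2, mod_h2)):
    prefix, old_mid, new_mid, suffix = split_common(old, new)
    print(f"{label}: {prefix}[{old_mid} -> {new_mid}]{suffix}")
```

For the first hash string this reports that `F+b` became `B` while everything before and after it is unchanged, which is the locality that makes fuzzy matching possible.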
OK, so both hashes are different, so what? When we compare the original SHA1 hash value to the new value, we won’t see both files as the “same”, even though test.txt is now just a modified version of the original. For ssdeep, let’s use the saved hash value from before, and compare it to the current version of the file.

    joshua@Zeus ~/ $ ssdeep -s -m fuzzy.db test.txt
    ~/test.txt matches fuzzy.db:~/test.txt (97)

Here we see a score of 97, which indicates how similar the two files are (on a scale of 0 to 100). A score of 97 means they are almost the same file. If I remove all of the last paragraph in the text file, I get a score of 72. If I remove the first AND last paragraphs, I get a score of 63.

FILE FORMATS MATTER

When working with fuzzy hashes, file formats matter a lot. Compressed file types are not going to work as well as non-compressed. Let’s take a look at MS Word document types: docx and doc. We have two files, and both contain “This is a test.”

    joshua@Zeus ~/ $ ssdeep -s test*
    ssdeep,1.1--blocksize:hash:hash,filename
    48:9RVyHU/bLrzKkAvcvnU6zjzzNszIpbyzrd:9TyU/bvzK0nUWjzzNszIpm,"~/test.doc"
    96:XVgub8YVvnQXcK+Tqq66aKx7vlqH5Zm03s8BL83ZsVlRJ+:Xuub83HKR6OxIjm03s8m32l/+,"~/test.docx"

We can already tell the two files are probably not similar, which is correct because the underlying file format data structure is completely different. The only similarities are some of the application meta-data and the text itself. Just for fun, let’s see if the files are similar to each other.

    joshua@Zeus ~/ $ ssdeep -s test* > fuzzy.db
    joshua@Zeus ~/ $ ssdeep -s -a -m fuzzy.db test.*
    ~/test.doc matches fuzzy.db:~/test.doc (100)
    ~/test.doc matches fuzzy.db:~/test.docx (0)
    ~/test.docx matches fuzzy.db:~/test.doc (0)
    ~/test.docx matches fuzzy.db:~/test.docx (100)

That would be a nope. The files are similar to themselves, but not to the other format.

Next, let’s change the contents of each file and see if it is still similar to its previous version.
We add “Hello.” before “This is a test.”

    joshua@Zeus ~/ $ ssdeep -s -a -m fuzzy.db test.*
    ~/test.doc matches fuzzy.db:~/test.doc (83)
    ~/test.doc matches fuzzy.db:~/test.docx (0)
    ~/test.docx matches fuzzy.db:~/test.doc (0)
    ~/test.docx matches fuzzy.db:~/test.docx (52)

What’s going on here? Doc and docx are still not similar to each other. But both the new version of the doc and the new version of the docx are similar to their prior versions. Notice that the doc is “more similar” than the docx. The reason is that docx is a type of compressed file format. Think of docx like a zip container. This means that a small modification has a larger impact on the final binary output when compressed.

BITS!

The original docx was 4,080 bytes, and the modified docx was 4,085 bytes. A difference of only 5 bytes reduced the similarity score by 48 (from 100 to 52).

The original doc was 9,216 bytes, and the modified doc was 9,216 bytes. I actually wasn’t expecting that, and will look into why it’s the same size. The structure did change, however. That’s why the similarity score is 83.

MORE DATA

Let’s go back to our original text, which is much longer, and remove the same text as last time. With more text, the application meta-data (timestamps) that change should have less of an effect on our matching.

    joshua@Zeus ~/ $ ssdeep -s test* > fuzzy.db
    joshua@Zeus ~/ $ ssdeep -s -a -m fuzzy.db test.*
    ~/test.doc matches fuzzy.db:~/test.doc (83)
    ~/test.doc matches fuzzy.db:~/test.docx (0)
    ~/test.docx matches fuzzy.db:~/test.doc (0)
    ~/test.docx matches fuzzy.db:~/test.docx (0)

Here we can see that for the compressed file type, more data is worse for similarity matching. This is likely due to the way that the compression algorithm works. Our change is about mid-way in the document, but the last paragraph is the longest (most data). After our modification, the compression algorithm will compress the data with a different pattern than before.

For the doc file, we see that more data is better.
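The effect of compression on byte-level similarity can be reproduced in a few lines of Python. This is only a rough sketch under stated assumptions: it uses `zlib` (Deflate) as a stand-in for the compression inside a docx container, and `difflib.SequenceMatcher` as a stand-in similarity measure, so the numbers are not ssdeep scores. The text is an excerpt of the sample used earlier, and the names `similarity`, `TEXT` and `REMOVED` are chosen for illustration.

```python
import difflib
import zlib


def similarity(a: bytes, b: bytes) -> float:
    """Byte-level similarity in [0, 1]; a stand-in, NOT the ssdeep score."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


# Excerpt of the sample text from the example earlier in this post.
TEXT = (
    "One morning, when Gregor Samsa woke from troubled dreams, he found "
    "himself transformed in his bed into a horrible vermin. "
    "He lay on his armour-like back, and if he lifted his head a little he "
    "could see his brown belly, slightly domed and divided by arches into "
    "stiff sections. The bedding was hardly able to cover it and seemed "
    "ready to slide off any moment. "
    "His many legs, pitifully thin compared with the size of the rest of him, "
    "waved about helplessly as he looked. "
    "\"What's happened to me?\" he thought. It wasn't a dream. "
    "His room, a proper human room although a little too small, lay "
    "peacefully between its four familiar walls."
).encode("utf-8")

# The same edit as in the post: remove one phrase from the middle of the text.
REMOVED = b"pitifully thin compared with the size of the rest of him, "
edited = TEXT.replace(REMOVED, b"")

# Compare the two versions as raw bytes, then again after compression.
raw_score = similarity(TEXT, edited)
packed_score = similarity(zlib.compress(TEXT), zlib.compress(edited))

print(f"uncompressed similarity: {raw_score:.2f}")
print(f"compressed similarity:   {packed_score:.2f}")
```

The uncompressed versions stay nearly identical, while the compressed streams should share far fewer bytes, because a small change alters how the data that follows it is encoded.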
With the doc file, we were able to remove more data from the original but still managed a similarity score of 83.

TESTING WITH IMAGES

I made a video about ssdeep and testing different image formats; it is embedded in the original post.

CONCLUSIONS

Hopefully this gave you a better idea of fuzzy hashing, and what it can be used for. In certain situations it is extremely useful, but you definitely need to know what data types you are working with. Uncompressed data will likely give better results.

Tags: dfir, fuzzy hashing, infosec, ssdeep
Updated: 2017-07-18

© 2022 Joshua I. James.