URL: https://dfir.science/2017/07/How-To-Fuzzy-Hashing-with-SSDEEP-%28similarity-matching%29.html



JOSHUA I. JAMES

Digital Forensic Scientist


[HOW TO] FUZZY HASHING WITH SSDEEP (SIMILARITY MATCHING)


SSDEEP is a fuzzy hashing tool written by Jesse Kornblum. There is quite a bit
of work about similarity hashing and comparisons with other methods. The
mainstream tools for digital forensics, however, appear to be ssdeep and sdhash.
For example, NIST created hash sets using both tools. I wrote a post about
sdhash in 2012 if you want to know a little more about how it works.

Let’s get to it!


SSDEEP

SSDEEP creates a hash value that attempts to detect the level of similarity
between two files at the binary level. This is different from a cryptographic
hash (like SHA1), which can only tell you whether two files match exactly or
not.

A cryptographic hash is useful if we want to ask “Is file 1 exactly like file
2?” A fuzzy hash / similarity hash is useful if we want to ask “Does any part of
file 1 exist in file 2?”
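To see the difference in practice, here is a minimal Python sketch. This is not ssdeep's actual algorithm (ssdeep uses context-triggered piecewise hashing); the `chunk_overlap` helper is a deliberately naive stand-in to show the idea: a cryptographic hash changes completely after a one-byte edit, while a chunk-based comparison still reports high overlap.

```python
import hashlib

a = b"One morning, when Gregor Samsa woke from troubled dreams."
b = b"One morning, when Gregor Samsa woke from troubled dreams!"  # one byte changed

# Cryptographic hash: a single changed byte gives a completely different digest.
print(hashlib.sha1(a).hexdigest())
print(hashlib.sha1(b).hexdigest())

# Naive similarity: split each input into fixed-size chunks and measure
# how many chunks the two inputs share (Jaccard overlap of chunk sets).
def chunk_overlap(x: bytes, y: bytes, size: int = 8) -> float:
    cx = {x[i:i + size] for i in range(0, len(x), size)}
    cy = {y[i:i + size] for i in range(0, len(y), size)}
    return len(cx & cy) / max(len(cx | cy), 1)

print(chunk_overlap(a, b))  # most chunks still match
```

Only the chunk containing the changed byte differs, so the overlap stays high even though the SHA1 digests share nothing.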


AN EXAMPLE

Imagine a file with the following text:

> One morning, when Gregor Samsa woke from troubled dreams, he found himself
> transformed in his bed into a horrible vermin.

> He lay on his armour-like back, and if he lifted his head a little he could
> see his brown belly, slightly domed and divided by arches into stiff sections.
> The bedding was hardly able to cover it and seemed ready to slide off any
> moment.

> His many legs, pitifully thin compared with the size of the rest of him, waved
> about helplessly as he looked. “What’s happened to me? “ he thought. It wasn’t
> a dream. His room, a proper human room although a little too small, lay
> peacefully between its four familiar walls.

> A collection of textile samples lay spread out on the table - Samsa was a
> traveling salesman - and above it there hung a picture that he had recently
> cut out of an illustrated magazine and housed in a nice, gilded frame. It
> showed a lady fitted out with a fur hat and fur boa who sat upright, raising a
> heavy fur muff that covered the whole of her lower arm towards the viewer.
> Gregor then turned to look out the window at the dull weather.

If we use a cryptographic hash, we may get the following hash value (SHA1):

joshua@Zeus ~/ $ sha1sum test.txt
2222825996bb74f3824e75e2dd44b0095d3b300a  test.txt


With ssdeep we get the following fuzzy hash value:

joshua@Zeus ~/ $ ssdeep -s test.txt
ssdeep,1.1--blocksize:hash:hash,filename
24:Ol9rFBzwjx5ZKvBF+bi8RuM4Pp6rG5Yg+q8wIXhMC:qrFBzKx5s8sM4grq8wIXht,"~/test.txt"
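As the header line shows, the signature has the form blocksize:hash:hash. The two hash fields are computed at the block size and at double the block size, which is how ssdeep can still compare files of somewhat different sizes. A minimal Python parser sketch for a bare signature (the trailing ,"filename" field from -s output is not handled here):

```python
def parse_ssdeep(sig: str):
    """Split a bare ssdeep signature of the form 'blocksize:hash:hash'."""
    blocksize, h1, h2 = sig.split(":", 2)
    return int(blocksize), h1, h2

# The signature produced from test.txt above.
sig = "24:Ol9rFBzwjx5ZKvBF+bi8RuM4Pp6rG5Yg+q8wIXhMC:qrFBzKx5s8sM4grq8wIXht"
bs, h1, h2 = parse_ssdeep(sig)
print(bs, h1, h2)
```

The block size (24 here) is chosen based on the file's length, which is why two signatures are only directly comparable when their block sizes are the same or differ by a factor of two.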


So far we can see that ssdeep hashes are much larger than SHA1 hashes. That
means storing a large number of fuzzy hashes will take a lot more space, so we
need to consider when fuzzy hashing is most useful for our investigations.

I’m going to output our fuzzy hash into a “database” that I can use to match
later. You can name the database anything you want. I’m going to use “fuzzy.db”
for now.

joshua@Zeus ~/ $ ssdeep -s test.txt > fuzzy.db


Now the file fuzzy.db contains the fuzzy hash created from test.txt. Imagine we
remove the words "pitifully thin compared with the size of the rest of him"
from the original file. What happens to our hashes?

joshua@Zeus ~/ $ sha1sum test.txt
25ce1f22b6391d552591f1c4bec70047998ab344  test.txt
joshua@Zeus ~/ $ ssdeep -s test.txt
ssdeep,1.1--blocksize:hash:hash,filename
24:Ol9rFBzwjx5ZKvBBi8RuM4Pp6rG5Yg+q8wIXhMC:qrFBzKx5L8sM4grq8wIXht,"~/test.txt"


If we look at the SHA1 hash, it is completely different. This is exactly what it
should do. If a single bit changes, the resulting cryptographic hash should
change. But what about the fuzzy hash? In the main string, we see some
similarities, with a change at BBi8RuM4Pp6rG5Yg. OK, so both hashes are
different, so what?

When we compare the original SHA1 hash value to the new value, we won't see the
two files as the "same", even though test.txt is now just a modified version of
the original.

For ssdeep, let's use the saved hash value from before and compare it to the
current version of the file.

joshua@Zeus ~/ $ ssdeep -s -m fuzzy.db test.txt
~/test.txt matches fuzzy.db:~/test.txt (97)


Here we see a score of 97, which indicates how similar the two files are. A
score of 97 means they are almost the same file. If I remove all of the last
paragraph in the text file, I get a score of 72. If I remove the first AND last
paragraphs, I get a score of 63.


FILE FORMATS MATTER

When working with fuzzy hashes, file formats matter a lot. Compressed file types
are not going to work as well as non-compressed ones. Let's take a look at two
MS Word document types: docx and doc. Both files contain "This is a test."

joshua@Zeus ~/ $ ssdeep -s test*
ssdeep,1.1--blocksize:hash:hash,filename
48:9RVyHU/bLrzKkAvcvnU6zjzzNszIpbyzrd:9TyU/bvzK0nUWjzzNszIpm,"~/test.doc"
96:XVgub8YVvnQXcK+Tqq66aKx7vlqH5Zm03s8BL83ZsVlRJ+:Xuub83HKR6OxIjm03s8m32l/+,"~/test.docx"


We can already tell the two files are probably not similar, which is correct
because the underlying file format data structures are completely different.
The only similarities are some of the application metadata and the text itself.
Just for fun, let's see if the files are similar to each other.

joshua@Zeus ~/ $ ssdeep -s test* > fuzzy.db
joshua@Zeus ~/ $ ssdeep -s -a -m fuzzy.db test.*
~/test.doc matches fuzzy.db:~/test.doc (100)
~/test.doc matches fuzzy.db:~/test.docx (0)
~/test.docx matches fuzzy.db:~/test.doc (0)
~/test.docx matches fuzzy.db:~/test.docx (100)


That would be a nope. The files are similar to themselves, but not to the other
format.

Next, let’s change the contents of each file and see if it is similar to itself.
We add “Hello.” before “This is a test.”

joshua@Zeus ~/ $ ssdeep -s -a -m fuzzy.db test.*
~/test.doc matches fuzzy.db:~/test.doc (83)
~/test.doc matches fuzzy.db:~/test.docx (0)
~/test.docx matches fuzzy.db:~/test.doc (0)
~/test.docx matches fuzzy.db:~/test.docx (52)


What's going on here? Doc and docx are still not similar to each other, but both
the new versions of the doc and docx files are similar to their prior versions.
Notice that the doc is "more similar" than the docx. The reason is that docx
is a type of compressed file format. Think of docx like a zip container. This
means that a small modification has a larger impact on the final binary output
when compressed.
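We can illustrate this ripple effect with a short Python sketch using zlib (a stand-in for the DEFLATE compression inside a docx/zip container, not the docx format itself): remove a phrase from a paragraph and the raw texts stay highly similar, while the compressed streams diverge much more.

```python
import zlib
from difflib import SequenceMatcher

t1 = (b"He lay on his armour-like back, and if he lifted his head a little "
      b"he could see his brown belly, slightly domed and divided by arches "
      b"into stiff sections. His many legs, pitifully thin compared with "
      b"the size of the rest of him, waved about helplessly as he looked.")
t2 = t1.replace(b"pitifully thin compared with the size of the rest of him, ", b"")

c1, c2 = zlib.compress(t1), zlib.compress(t2)

def ratio(x: bytes, y: bytes) -> float:
    """Byte-level similarity (0.0-1.0) via difflib's matching blocks."""
    return SequenceMatcher(None, x, y, autojunk=False).ratio()

raw_ratio, comp_ratio = ratio(t1, t2), ratio(c1, c2)
# The raw texts remain highly similar; the compressed streams differ far
# more, because the edit changes the compressor's output well beyond the
# edited region.
print(round(raw_ratio, 2), round(comp_ratio, 2))
```

This is the same effect ssdeep runs into with docx: a small edit to the plaintext inside the container reshuffles much of the compressed binary that ssdeep actually hashes.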


BITS!

The original docx was 4,080 bytes, and the modified docx was 4,085 bytes. A
difference of only 5 bytes dropped the similarity score by 48 points.

The original doc was 9,216 bytes, and the modified doc was 9,216 bytes. I
actually wasn’t expecting that, and will look into why it’s the same size. The
structure did change, however. That’s why the similarity score is 83.


MORE DATA

Let’s go back to our original text, which is much longer, and remove the same
text as last time. With more text, the application meta-data (timestamps) that
change should have less of an effect on our matching.

joshua@Zeus ~/ $ ssdeep -s test* > fuzzy.db
joshua@Zeus ~/ $ ssdeep -s -a -m fuzzy.db test.*
~/test.doc matches fuzzy.db:~/test.doc (83)
~/test.doc matches fuzzy.db:~/test.docx (0)
~/test.docx matches fuzzy.db:~/test.doc (0)
~/test.docx matches fuzzy.db:~/test.docx (0)


Here we can see that for the compressed file type, more data is worse for
similarity matching. This is likely due to the way the compression algorithm
works. Our change is about mid-way through the document, but the last paragraph
is the longest (most data). After our modification, the compression algorithm
compresses the data with a different pattern than before.

For the doc file, we see that more data is better. We were able to remove more
data from the original, but still managed a similarity score of 83.


TESTING WITH IMAGES

I made a video about ssdeep and testing different image formats. Have a look
below:




CONCLUSIONS

Hopefully this gave you a better idea of fuzzy hashing and what it can be used
for. For certain situations it is extremely useful, but you definitely need to
know what data types you are working with. Uncompressed data will likely give
better results.


Tags: dfir, fuzzy hashing, infosec, ssdeep

Updated: 2017-07-18




© 2022 Joshua I. James. Powered by Jekyll & Minimal Mistakes.