nitratine.net Open in urlscan Pro
185.199.108.153  Public Scan

URL: https://nitratine.net/blog/post/how-to-hash-files-in-python/
Submission: On December 19 via manual from US — Scanned from DE

Form analysis 0 forms found in the DOM

Text Content

 * Home (current)
 * Blog
 * YouTube
 * GitHub
 * About


HOW TO HASH FILES IN PYTHON

7 May 2019 Tutorials python hashing cyber-security 4 min read

Hashing files allows us to generate a string/byte sequence that can help
identify a file. This can then be used by comparing the hashes of two or more
files to see if these files are the same as well as other applications.

--------------------------------------------------------------------------------

 * What Does it Mean to Hash a File?
   * What Can I Do With A File Hash?
     * What Can I Not Do With A File Hash?
 * Supported Hashing Algorithms in Python
   * hashlib.algorithms_guaranteed vs hashlib.algorithms_available
 * Hashing a File
   * Why do I Need to Worry About the Buffer Size?

> Note: This post assumes you already know what a hash is, if you don't, read up
> on hashes before reading this post. Sites like this or this can help you out.


WHAT DOES IT MEAN TO HASH A FILE?

Hashing a file is when a file of arbitrary size is read and used in a function
to compute a fixed-length value from it. This fixed-length value can help us get
a sort of 'id' of a file which can then help us do particular tasks (examples
follow).

Strong hashes with large amounts of bytes can help distinguish between many
different files, but we do always need to remember the birthday problem. For
example, the birthday problem reminds us that if we have a 160bit hash, there
can only be 2^160 different hashes, meaning that as soon as we generate 2^160 +
1 hashes of different files, it is guaranteed that we find a duplicate hash.
This duplicate hash means we get the same hash for different files which can
cause issues in some applications.


WHAT CAN I DO WITH A FILE HASH?

File hashes are quite useful as they can represent (not substitute) a file,
meaning that you do not have to store the whole file when trying to identify a
file. When hashing files (or anything in general), you will get the same result
hash result every time you hash a particular file. This makes hashing files
useful for things like:

 * Comparing files: Instead of comparing whole files, take hashes and compare
   hashes together. This is particularly efficient when comparing more than one
   file to another file.
 * Easily identifying files without storing the whole file: If you are looking
   for a file, simply get the hash of your current file and hash other files as
   you look at them while looking for a match. This technique is used by some
   anti-virus software.
 * Stopping filename clashes: Rename files where many files are located in one
   directory to their hash. This will mean two different files will not have the
   same name and will save on space when two files have the same name as they
   will overwrite (does not work if you need two of the same file).
 * Object keys: Just like in the point "Stopping filename clashes", using hashes
   as object keys can help identify what objects match to which file.
 * Detecting changes in a file: If you had a file hash before it was modified,
   you can re-hash the file and compare it against the original hash to see if
   it has been modified.

WHAT CAN I NOT DO WITH A FILE HASH?

Even though you can compute a hash using a file, this does not mean you cannot
get the original file back using this hash. Hashing is a one-way function
(lossy) and is not an encryption scheme.


SUPPORTED HASHING ALGORITHMS IN PYTHON

The hashlib Python module "implements a common interface to many different
secure hash and message digest algorithms". To look at the hashing algorithms
Python offers, execute:

import hashlib
print(hashlib.algorithms_guaranteed)


This will print a set of strings that are hash algorithms guaranteed to be
supported by this module on all platforms. To use these, select a hashing
algorithm from the set and then use it as shown below (this example uses
sha256):

import hashlib
h = hashlib.sha256() # Construct a hash object using our selected hashing algorithm
h.update('My content'.encode('utf-8')) # Update the hash using a bytes object
print(h.hexdigest()) # Print the hash value as a hex string
print(h.digest()) # Print the hash value as a bytes object



HASHLIB.ALGORITHMS_GUARANTEED VS HASHLIB.ALGORITHMS_AVAILABLE

To get all the algorithms available in your interpreter, you can execute
hashlib.algorithms_available. Unlike the output from
hashlib.algorithms_guaranteed, these hashes aren't guaranteed to exist in
interpreters on other machines. You can visit the docs to get more information
on this.


HASHING A FILE

To hash a file, read it in bit-by-bit and update the current hashing functions
instance. When all bytes have been given to the hashing function in order, we
can then get the hex digest.

import hashlib

file = ".\myfile.txt" # Location of the file (can be set a different way)
BLOCK_SIZE = 65536 # The size of each read from the file

file_hash = hashlib.sha256() # Create the hash object, can use something other than `.sha256()` if you wish
with open(file, 'rb') as f: # Open the file to read it's bytes
    fb = f.read(BLOCK_SIZE) # Read from the file. Take in the amount declared above
    while len(fb) > 0: # While there is still data being read from the file
        file_hash.update(fb) # Update the hash
        fb = f.read(BLOCK_SIZE) # Read the next block from the file

print (file_hash.hexdigest()) # Get the hexadecimal digest of the hash




This snippet will print the hash value of the file specified in the file
generated using the SHA256 algorithm. The call .hexdigest() returns a string
object containing only hexadecimal digits; you can use .digest() as shown before
to get the bytes representation of the hash.


WHY DO I NEED TO WORRY ABOUT THE BUFFER SIZE?

You would have noticed in the script above, the variable BLOCK_SIZE is set to
65536. This is the number of bytes that is read into memory in a single read
operation. This is used so larger files are not completely loaded into memory
before computing the hash.

For example, if we did not do this and were hashing a video file that was 2Gb
large, the whole 2Gb file would be loaded into memory (or at least tried to) and
then hashed. This approach reads the file block-by-block so we don't load the
whole file into memory.



The buffer I have used is 64Kb but you can use any value you wish. Making this
larger reads files faster, but in turn, loads more of the file into memory at
once.

← Google Publisher Toolbar: P... JavaScript Date Methods Ret... →


Please enable JavaScript to view the comments powered by Disqus.
About

Owner of PyTutorials and creator of auto-py-to-exe. I enjoy making quick
tutorials for people new to particular topics in Python and tools that help fix
small things.

Search
Categories
 1. 📱 Apps 2
 2. 📰 General 8
 3. 🔍 Investigations 2
 4. 💾 Projects 15
 5. ✂ Snippets 1
 6. 🛠 Tools 5
 7. 📖 Tutorials 29
 8. 🎥 YouTube 15

PyTutorials on YouTube


Recent Videos

Featured Sites




Navigate

Home Blog Categories Tags Archive Portfolio About

Popular Projects

auto-py-to-exe hit-counter whos-on-my-network monopoly-money

Follow

GitHub YouTube Nitratine RSS Feed

🍺 Buy me a beer

--------------------------------------------------------------------------------

© 2022 Brent Vollebregt