HAS YOUR PASSWORD BEEN PWNED? OR, HOW I ALMOST FAILED TO SEARCH A 37 GB TEXT FILE IN UNDER 1 MILLISECOND (IN PYTHON)

December 2022 ∙ 19 minute read

So, there's this website, Have I Been Pwned, where you can check if your email address has appeared in a data breach. There's also a Pwned Passwords section for passwords ... but, typing your password on a random website probably isn't such a great idea, right?

Of course, you could read about how HIBP protects the privacy of searched passwords, and understand how k-Anonymity works, and check that the website uses the k-Anonymity API, but that's waaay too much work. Of course, you could use the API yourself, but where's the fun in that?

Instead, we'll do it the hard way – we'll check passwords offline. And we're not stopping until it's fast.

* The Pwned Passwords list
* A minimal plausible solution
* Problem: it's slow
* Skipping
* Problem: it needs tuning, it's still slow
* Binary skipping
* Better timing
* Failing to get to under 1 millisecond
* Profile before optimizing
* Position heuristic
* Index file
* Binary file
* Getting to under 1 millisecond
* Generating the index
* Using the index
* I heard you like indexes (the end)
* Bonus: better data structures

THE PWNED PASSWORDS LIST #

OK, first we need the password list. Go to Pwned Passwords, scroll to Downloading the Pwned Passwords list, and download the SHA-1 ordered-by-hash file – be nice and use the torrent, if you can.

Note: You can also use the downloader to get an updated version of the file, but that didn't exist when I started writing this.

The archive extracts to a 37 GB text file:

```
$ pushd ~/Downloads
$ stat -f %z pwned-passwords-sha1-ordered-by-hash-v8.7z
16257755606
$ 7zz e pwned-passwords-sha1-ordered-by-hash-v8.7z
$ stat -f %z pwned-passwords-sha1-ordered-by-hash-v8.txt
37342268646
$ popd
```

... that looks like this:

```
$ head -n5 ~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt
000000005AD76BD555C1D6D771DE417A4B87E4B4:10
00000000A8DAE4228F821FB418F59826079BF368:4
00000000DD7F2A1C68A35673713783CA390C9E93:873
00000001E225B908BAC31C56DB04D892E47536E0:6
00000006BAB7FC3113AA73DE3589630FC08218E7:3
```

Each line has the format:

```
<SHA-1 of the password>:<times it appeared in various breaches>
```

Note: For more details: why hash the passwords, and why include a count.

To make commands shorter, we link the file in the current directory:

```
$ ln -s ~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt pwned.txt
```

A MINIMAL PLAUSIBLE SOLUTION #

We'll take an iterative, problem-solution approach to this. But since right now we don't have any solution, we start with the simplest thing that could possibly work.

First, we get the imports out of the way:

```python
import os
import sys
import time
import getpass
import hashlib
```

Then, we open the file:

```python
path = sys.argv[1]
file = open(path, 'rb')
```

Opening the file in binary mode makes searching a bit faster, as it skips needless decoding. By opening it early, we get free error checking – no point in reading the password if the file isn't there.

Next, the password:

```python
try:
    password = sys.argv[2]
except IndexError:
    password = getpass.getpass()

hexdigest = hashlib.sha1(password.encode()).hexdigest()
del password
print("looking for", hexdigest)
```

We either take the password as the second argument to the script, or read it with getpass(), so real passwords don't remain in the shell history.
After computing its hash, we delete it; we don't go as far as to zero[1] it, but at least we avoid accidental printing.

Let's see if it works:

```
$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
$ python pwned.py pwned.txt
Password:
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
```

sha1sum seems to agree:

```
$ echo -n password | sha1sum
5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8  -
```

To find the hash, we go through the file line by line – remember, simplest thing that could possibly work.

```python
def find_line(lines, prefix):
    for line in lines:
        if line.startswith(prefix):
            return line
        if line > prefix:
            break
    return None
```

If a line was found, we print the count:

```python
line = find_line(file, hexdigest.upper().encode())

if line:
    times = int(line.decode().rstrip().partition(':')[2])
    print(f"pwned! seen {times:,} times before")
else:
    print("not found")
```

Finally, let's add some timing code:

```python
start = time.monotonic()
line = find_line(file, hexdigest.upper().encode())
end = time.monotonic()
...
print(f"in {end-start:.6f} seconds")
```

And, it works:

```
$ python pwned.py pwned.txt blocking
looking for 000085013a02852372159cb94101b99ccaec59e1
pwned! seen 587 times before
in 0.002070 seconds
```

The code so far.

PROBLEM: IT'S SLOW #

You may have noticed I switched from password to blocking. That's because I deliberately chose a password whose hash is at the beginning of the file. On my 2013 laptop, password actually takes 86 seconds!

To put a lower bound on the time it takes to go through the file:

```
$ time python -c 'for line in open("pwned.txt", "rb"): pass'

real    1m28.325s
user    1m10.049s
sys     0m15.577s

$ time wc -l pwned.txt
 847223402 pwned.txt

real    0m51.729s
user    0m27.908s
sys     0m12.119s
```

There must be a better way.

SKIPPING #

There's a hint in the file name: the hashes are ordered. That means we don't have to check all the lines. We can skip ahead repeatedly until we're past the hash, go back one step, and only check each line from there.

Lines are different lengths, so we can't skip exactly X lines without reading them. But we don't need to – any line that's a reasonable amount ahead will do.

```python
def skip_to_before_line(file, prefix, offset):
    old_position = file.tell()
    while True:
        file.seek(offset, os.SEEK_CUR)
        file.readline()
        line = file.readline()
        # print("jumped to", (line or b'<eof>').decode().rstrip())

        if not line or line >= prefix:
            file.seek(old_position)
            break

        old_position = file.tell()
```

So we just seek() ahead a set number of bytes. Since that might not leave us at the start of a line, we discard the incomplete line, and use the next one.

Finally, we wrap the original find_line() to do the skipping:

```python
def find_line(file, prefix):
    skip_to_before_line(file, prefix, 16 * 2**20)
    return find_line_linear(file, prefix)


def find_line_linear(lines, prefix):
    for line in lines:
        ...
```

It works:

```
$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.027203 seconds
```

To find the magic 16M offset, I had to try a bunch of values:

| offset (MiB) | time (s) |
|---|---|
| 1 | 0.05 |
| 4 | 0.035 |
| 8 | 0.030 |
| 16 | 0.027 <-- sweet spot |
| 32 | 0.14 |

The code so far.
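The sweep itself isn't part of the article's code; here's a minimal sketch of how one could time the candidate offsets, assuming the skip_to_before_line() and find_line_linear() defined above are in scope. Keep in mind that repeated runs are skewed by the OS cache, so ideally each offset gets timed from a cold cache, or averaged over many different passwords.

```python
import hashlib
import time

def time_lookup(path, password, offset):
    """Time one lookup of `password` using a given skip offset (in bytes)."""
    prefix = hashlib.sha1(password.encode()).hexdigest().upper().encode()
    with open(path, 'rb') as file:
        start = time.monotonic()
        skip_to_before_line(file, prefix, offset)
        find_line_linear(file, prefix)
        return time.monotonic() - start

for mib in (1, 4, 8, 16, 32):
    print(f"{mib:>3} MiB: {time_lookup('pwned.txt', 'password', mib * 2**20):.3f}s")
```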
PROBLEM: IT NEEDS TUNING, IT'S STILL SLOW #

While three orders of magnitude faster, we still have a bunch of issues:

1. The ideal offset depends on the computer you're running this on.
2. The run time still increases linearly with file size – we haven't really solved the problem, as much as made it smaller by a (large, but) constant factor.
3. The run time still increases linearly with where the hash is in the file.
4. It's still kinda slow. ¯\_(ツ)_/¯

There must be a better way.

BINARY SKIPPING #

To make the "linear" part painfully obvious, uncomment the jumped to line.

```
$ python pwned.py pwned.txt password | grep -o 'jumped to .' | uniq -c
 139 jumped to 0
 139 jumped to 1
 139 jumped to 2
 139 jumped to 3
 139 jumped to 4
 103 jumped to 5
```

Surely, after we've jumped to 0 once, we don't need to do it 138 more times, right?

We could jump directly to a line in the middle of the file; if it's there at all, the hash will be in one of the two halves. We could then jump to the middle of that half, and to the middle of that half, and so on, until we either find the hash or there's nowhere left to jump.

Note: If that sounds a lot like binary search, that's because it is – it's just not wearing its regular array clothing.

And most of the work is already done: we can jump to a line at most X bytes from where the hash should be, we only need to do it repeatedly, in smaller and smaller fractions of the file size:

```python
def skip_to_before_line(file, prefix, offset):
    while offset > 2**8:
        offset //= 2
        skip_to_before_line_linear(file, prefix, offset)


def skip_to_before_line_linear(file, prefix, offset):
    old_position = file.tell()
```

The only thing left is to get the file size:

```python
def find_line(file, prefix):
    file.seek(0, os.SEEK_END)
    size = file.tell()
    file.seek(0)
    skip_to_before_line(file, prefix, size)
    return find_line_linear(file, prefix)
```

It works:

```
$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.009559 seconds
$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.000268 seconds
```

The huge time difference is due to operating system and/or disk caches – on the second run, the same parts of the file are likely already in memory.

Anyway, look again at the jumped to output: instead of jumping blindly through the whole file, now we're jumping around the hash, getting closer and closer to it.

```
$ python pwned.py pwned.txt password | grep -o 'jumped to .' | uniq -c
   1 jumped to 7
   1 jumped to 3
   1 jumped to 7
   1 jumped to 5
   1 jumped to 4
  39 jumped to 5
$ python pwned.py pwned.txt password | grep -o 'jumped to ..' | uniq -c
   1 jumped to 7F
   1 jumped to 3F
   1 jumped to 7F
   1 jumped to 5F
   1 jumped to 4F
   1 jumped to 5F
   1 jumped to 57
   1 jumped to 5F
   1 jumped to 5B
   1 jumped to 59
   1 jumped to 5B
   1 jumped to 5A
  32 jumped to 5B
```

Notice we land on the same 7F... prefix twice; this makes sense – we skip ahead by half the file size, then back, then ahead two times by a quarter. Caching allows us to not care about that.

The code so far.
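As a back-of-the-envelope check (my numbers, not the article's): the halving loop keeps going until the offset drops to 256 bytes or less, so for the 37 GB file it does about 28 rounds of skipping, each reading only a couple of lines – the same number shows up later as the call count of skip_to_before_line_linear() in the profiler output.

```python
>>> size = 37_342_268_646   # the size of pwned.txt
>>> offset, rounds = size, 0
>>> while offset > 2**8:
...     offset //= 2
...     rounds += 1
...
>>> rounds
28
```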
BETTER TIMING #

Given the way caching muddies the waters, how fast is it really? This function averages a hundred runs, each with a different password:

```
function average-many {
    for _ in {1..100}; do
        python $@ $( python -c 'import time; print(time.time())' )
    done \
    | grep seconds \
    | cut -d' ' -f2 \
    | paste -sd+ - \
    | sed 's#^#scale=6;(#' | sed 's#$#)/100#' \
    | bc
}
```

After a few repeats, the average time settles around 3 ms.

```
$ for _ in {1..20}; do average-many pwned.py pwned.txt; done
.004802
.003904
.004088
.003486
.003451
.003476
.003414
.003442
.003169
.003297
.002931
.003077
.003092
.003011
.002980
.003147
.003112
.002942
.002984
.002934
```

Again, this is due to caching: the more we run the script, the more likely it is that the pages at {half, quarters, eighths, sixteenths, ...} of the file size are already in memory – and the path to any line starts with a subset of those.

---

And, we're done.

I wave my hands, get a 2020 laptop, and a miracle happens. It's far enough into the totally unpredictable future, now, that you can search any password in under 1 millisecond. You can do anything you want.

So, there we go. Wasn't that an interesting story? That's the end of the article. Don't look at the scroll bar. Don't worry about it. If you came here to check your password, you can go. Subscribe on the way out if you'd like, but take care :) Go solve Advent of Code or something. I'm just chilling out. See ya.

FAILING TO GET TO UNDER 1 MILLISECOND #

I swear this was supposed to be the end. This really was supposed to be a short one.

Here's a quote from a friend of mine, that chronologically should be way later into the article, but makes a great summary for what follows:

> And now that you have arrived at this point, spend a moment to ponder the arbitrary nature of 1 millisecond given its dependency on the current year and the choice of your particular hardware.
>
> After that moment, continue celebrating.

Nah, fuck it, it has to take less than 1 millisecond on the old laptop.

... so yeah, here's a bunch of stuff that didn't work.

Liking this so far? Here's another article you might like: Learn by reading code: Python standard library design decisions explained

PROFILE BEFORE OPTIMIZING #

With the obvious improvements out of the way, it's probably a good time to stop and find out where time is being spent.

```
$ python -m cProfile -s cumulative pwned.py pwned.txt "$( date )"
looking for 3960626a8c59fe927d3cf2e991d67f4c505ae198
not found
in 0.004902 seconds
         1631 function calls (1614 primitive calls) in 0.010 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   ...
        1    0.000    0.000    0.005    0.005 02-binary-search.py:8(find_line)
        1    0.000    0.000    0.005    0.005 02-binary-search.py:22(skip_to_before_line)
       28    0.000    0.000    0.005    0.000 02-binary-search.py:28(skip_to_before_line_linear)
       86    0.004    0.000    0.004    0.000 {method 'readline' of '_io.BufferedReader' objects}
   ...
       71    0.000    0.000    0.000    0.000 {method 'seek' of '_io.BufferedReader' objects}
   ...
       44    0.000    0.000    0.000    0.000 {method 'tell' of '_io.BufferedReader' objects}
   ...
```

From the output above, we learn that:

* Most of the time is spent in readline() calls.
* Both seek() and tell() calls are basically free.

readline() is implemented in C, so there's not much we can change there. What we can change, however, is how often we call it.

---

Another thing of interest is how much individual readline() calls take.
In skip_to_before_line_linear():

```python
start = time.monotonic()
file.readline()
line = file.readline()
end = time.monotonic()
print("jumped to", (line[:5] or b'<eof>').decode().rstrip(),
      f"in {(end-start)*1000000:4.0f} us",
      f"at offset {file.tell():16,}")
```

The output is pretty enlightening:

```
$ python pwned.py pwned.txt asdf
looking for 3da541559918a808c2402bba5012f6c60b27661c
jumped to 7FF9E in   10 us at offset   18,671,134,394   <-- 1/2 file size
jumped to 3FF89 in    4 us at offset    9,335,567,234   <-- 1/4 file size
jumped to 1FFBA in    3 us at offset    4,667,783,663   <-- 1/8 file size
jumped to 3FF89 in    3 us at offset    9,335,567,322   <-- 1/4 file size
jumped to 2FFA4 in    5 us at offset    7,001,675,508
jumped to 3FF89 in    4 us at offset    9,335,567,366   <-- 1/4 file size
jumped to 37F98 in    4 us at offset    8,168,621,453
jumped to 3FF89 in    3 us at offset    9,335,567,410   <-- 1/4 file size
jumped to 3BF94 in    3 us at offset    8,752,094,477
jumped to 3FF89 in    2 us at offset    9,335,567,498   <-- 1/4 file size
jumped to 3DF8E in    3 us at offset    9,043,831,007
jumped to 3CF90 in    3 us at offset    8,897,962,782
jumped to 3DF8E in    2 us at offset    9,043,831,095
jumped to 3D790 in    3 us at offset    8,970,896,964
jumped to 3DF8E in    2 us at offset    9,043,831,139
jumped to 3DB90 in  253 us at offset    9,007,364,072
jumped to 3D990 in  206 us at offset    8,989,130,552
jumped to 3DB90 in    6 us at offset    9,007,364,160
jumped to 3DA8F in  270 us at offset    8,998,247,402   <-- page 2,196,837
jumped to 3DA0F in  189 us at offset    8,993,689,007
jumped to 3DA8F in    5 us at offset    8,998,247,446   <-- page 2,196,837
jumped to 3DA4F in  212 us at offset    8,995,968,274
jumped to 3DA8F in    5 us at offset    8,998,247,534   <-- page 2,196,837
jumped to 3DA6F in  266 us at offset    8,997,107,921
jumped to 3DA5F in  203 us at offset    8,996,538,139
jumped to 3DA57 in  195 us at offset    8,996,253,241
jumped to 3DA53 in  197 us at offset    8,996,110,772
jumped to 3DA57 in    6 us at offset    8,996,253,285
jumped to 3DA55 in  193 us at offset    8,996,182,045
jumped to 3DA54 in  178 us at offset    8,996,146,471
jumped to 3DA54 in  189 us at offset    8,996,128,666
jumped to 3DA54 in  191 us at offset    8,996,119,760   <-- page 2,196,318
jumped to 3DA54 in   32 us at offset    8,996,128,710
jumped to 3DA54 in    5 us at offset    8,996,124,259
jumped to 3DA54 in   10 us at offset    8,996,122,057   <-- page 2,196,318
jumped to 3DA54 in    4 us at offset    8,996,120,955   <-- page 2,196,318
jumped to 3DA54 in    4 us at offset    8,996,120,382   <-- page 2,196,318
jumped to 3DA54 in    9 us at offset    8,996,120,112   <-- page 2,196,318
jumped to 3DA54 in    1 us at offset    8,996,120,470   <-- page 2,196,318
jumped to 3DA54 in    1 us at offset    8,996,120,338   <-- page 2,196,318
pwned! seen 324,774 times before
in 0.003654 seconds
```

Half the reads are pretty fast:

* In the beginning, because searches start with the same few pages.
* At the end, because searches end on the same page.
* All reads of any page, after the first.

So, it's those reads in the middle that we need to get rid of.

POSITION HEURISTIC #

In theory, the output of a good hash function should be uniformly distributed. This means that with a bit of math, we can estimate where a hash would be – a hash that's ~1/5 in the range of all possible hashes should be at ~1/5 of the file.
Here's a tiny example:

```python
>>> digest = '5b'                  # 1-byte hash (2 hex digits)
>>> size = 1000                    # 1000-byte long file
>>> int_digest = int(digest, 16)   # == 91
>>> int_end = 16 ** len(digest)    # == 0xff + 1 == 256
>>> int(size * int_digest / int_end)
355
```

We can do this once, and then binary search a safety interval around that position. Alas, this only gets rid of the fast jumps at the beginning of the binary search, and for some reason, it ends up being slightly slower than binary search alone. (code)

We can also narrow down around the estimated position iteratively, making the interval smaller by a constant factor each time. This seems to work: a factor of 1000 yields 1.7 ms, and a factor of 8000 yields 1.2 ms, both in 2 steps. (code)

However, it has deeper issues:

* Having arbitrary start/end offsets complicates the code quite a bit.
* I don't know how to reliably determine the factor.[2]
* I don't know how to prove it's correct,[3] especially for smaller intervals, where the hashes are less uniform. To be honest, I don't think it can be 100% correct, and I don't know how to estimate how correct it is.

Anyway, if the implementation is hard to explain, it's a bad idea.

INDEX FILE #

An arbitrary self-imposed restriction I had was that any solution should mostly use the original passwords list, with little to no preparation.

By relaxing this a bit, and going through the file once, we can build an index like:

```
<SHA-1 of the password>:<offset in file>
```

... that we can then search with skip_to_before_line(). Of course, we won't include all the hashes – by including lines a few kilobytes apart from each other, we can seek directly to within a few kilobytes in the big file.

The only thing left to figure out is how much "a few kilobytes" is. After my endless harping about caching and pages, the answer should be obvious: one page size (4K). And this actually gets us 0.8 ms! But back when I wrote the code, that hadn't really sunk in, so after getting 1.2 ms with a 32K distance, I moved on.

Code: pwned.py, generate-index.py.

BINARY FILE #

Already on the additional file slippery slope, I converted the list to binary, mostly to make it smaller – smaller file, fewer reads. I packed each line into 24 bytes:

```
| binary hash (20 bytes) | count (4 bytes) |
```

This halved the file, but only lowered the runtime to a modest 2.6 ms. More importantly, it made the code much, much simpler: because items are fixed size, you can know where the Nth item is, so I was able to use bisect for the binary search.

Code: pwned.py, convert-to-binary.py.
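The conversion script is only linked above, not shown; a minimal sketch of the 24-byte packing could look like this (my code, with the count stored as a big-endian unsigned int – an assumption, since the layout above doesn't specify byte order):

```python
import struct

# | binary hash (20 bytes) | count (4 bytes, big-endian unsigned) |
RECORD = struct.Struct('>20sI')

with open('pwned.txt', 'rb') as src, open('pwned.bin', 'wb') as dst:
    for line in src:
        digest, _, count = line.rstrip().partition(b':')
        dst.write(RECORD.pack(bytes.fromhex(digest.decode()), int(count)))
```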
GETTING TO UNDER 1 MILLISECOND #

OK, what now? This is what we have so far:

* The position heuristic kinda (maybe?) works, but is hard to reason about.
* The index file gets us there, but barely, and the index is pretty big.
* The binary file isn't much faster, and it creates a huge file. But, less code!

I don't know what to do with the first one, but I think we can combine the last two.

GENERATING THE INDEX #

Let's start with the script I made for the text index:

```python
import os, sys

file = sys.stdin.buffer
outf = sys.stdout.buffer

while True:
    pos = file.tell()
    line = file.readline()
    if not line:
        break

    outf.write(line.partition(b':')[0] + b':' + str(pos).encode() + b'\n')

    file.seek(2**12, os.SEEK_CUR)
    file.readline()
```

The output looks like this:

```
$ python generate-index.py < pwned.txt 2>/dev/null | head -n5
000000005AD76BD555C1D6D771DE417A4B87E4B4:0
00000099A4D3034E14DF60EF50799F695C27C0EC:4157
00000172E8E1D1BD54AC23B3F9AB4383F291CA17:8312
000002C8F808A7DB504BBC3C711BE8A8D508C0F9:12453
0000047139578F13D70DD96BADD425C372DB64A9:16637
```

We need to pack that into bytes. A hash takes 20 bytes. But, we only need slightly more than 3 bytes (6 hex digits) to distinguish between the index lines:

```
$ python generate-index.py < pwned.txt 2>/dev/null | cut -c-6 | uniq -c | head
   2 000000
   1 000001
   1 000002
   1 000004
   1 000005
   1 000007
   1 000008
   1 00000A
   1 00000B
   1 00000C
```

To represent all the offsets in the file, we need log2(35G) / 8 = 4.39... bytes, which results in a total of 9 bytes (maybe even 8, if we mess with individual bits). Let's make it future-proof: 6 bytes for the hash buys at least 55 trillion lines (2.4 petabyte files), and 6 bytes for the offset buys 0.28 petabyte files.

```python
    digest, _, _ = line.partition(b':')
    outf.write(bytes.fromhex(digest.decode())[:6])
    outf.write(pos.to_bytes(6, 'big'))
```

If you look at the text index, you'll notice the offsets are 4K + ~50 bytes apart; this results in sometimes having to read 2 pages, because not all pages have an index entry. Let's fix that by reading the first whole line after a 4K boundary instead:

```python
    file.seek((pos // 4096 + 1) * 4096)
    file.readline()
```

OK, we're done:

```
$ time python generate-index.py < pwned.txt > index.bin

real    1m2.729s
user    0m34.292s
sys     0m21.392s
```

USING THE INDEX #

We start with a skeleton that's functionally identical to the naive script. The only difference is that I've added stubs for passing and using the index:

```python
def find_line(file, prefix, index):
    skip_to_before_line_index(file, prefix, index)
    return find_line_linear(file, prefix)
```

```python
def skip_to_before_line_index(file, prefix, index):
    ...
```

```python
index_path = sys.argv[2]
index = ...
```

```python
line = find_line(file, hexdigest.upper().encode(), index)
```

The whole skeleton, if you want to see it:

```python
import os
import sys
import time
import bisect
import getpass
import hashlib


def find_line(file, prefix, index):
    skip_to_before_line_index(file, prefix, index)
    return find_line_linear(file, prefix)


def find_line_linear(lines, prefix):
    for line in lines:
        if line.startswith(prefix):
            return line
        if line > prefix:
            break
    return None


def skip_to_before_line_index(file, prefix, index):
    ...


path = sys.argv[1]
file = open(path, 'rb')

index_path = sys.argv[2]
index = ...

try:
    password = sys.argv[3]
except IndexError:
    password = getpass.getpass()

hexdigest = hashlib.sha1(password.encode()).hexdigest()
del password
print("looking for", hexdigest)

start = time.monotonic()
line = find_line(file, hexdigest.upper().encode(), index)
end = time.monotonic()

if line:
    times = int(line.decode().rstrip().partition(':')[2])
    print(f"pwned! seen {times:,} times before")
else:
    print("not found")

print(f"in {end-start:.6f} seconds")
```

As mentioned before, we'll use the standard library bisect module to search the index.
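A quick refresher (my example, not the article's): bisect_left() returns the leftmost position where a value could be inserted while keeping the sequence sorted, which is also the index of the first element greater than or equal to it.

```python
>>> import bisect
>>> items = [b'aa', b'cc', b'ee']
>>> bisect.bisect_left(items, b'cc')   # b'cc' is already there, at index 1
1
>>> bisect.bisect_left(items, b'dd')   # b'dd' would go between b'cc' and b'ee'
2
```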
We could read the entire index in memory, as a list of 12-byte bytes. But that would still be slow, even if outside the current timing code, and memory usage would increase with the size of the file.

Fortunately, bisect doesn't only work with lists, it works with any sequence – that is, any object that can pretend to be a list. So we'll build our own, by implementing the two required[4] special methods.

```python
class BytesArray:

    item_size = 12

    def __init__(self, file):
        self.file = file
```

We can go ahead and plug it in:

```python
index_path = sys.argv[2]
index = BytesArray(open(index_path, 'rb'))
```

The first special method is __getitem__(), for a[i]:

```python
    def __getitem__(self, index):
        self.file.seek(index * self.item_size)
        buffer = self.file.read(self.item_size)
        if len(buffer) != self.item_size:
            raise IndexError(index)  # out of bounds
        return buffer
```

The second special method is __len__(), for len(a):

```python
    def __len__(self):
        self.file.seek(0, os.SEEK_END)
        size = self.file.tell()
        return size // self.item_size
```

Using the index becomes straightforward:

```python
def skip_to_before_line_index(file, prefix, index):
    item_prefix = bytes.fromhex(prefix.decode())[:6]
    item = find_lt(index, item_prefix)
    offset = int.from_bytes(item[6:], 'big') if item else 0
    file.seek(offset)


def find_lt(a, x):
    i = bisect.bisect_left(a, x)
    if i:
        return a[i-1]
    return None
```

We get the first 6 bytes of the hash, find the rightmost value less than that, extract the offset from it, and seek to there. find_lt() comes from bisect's recipes for searching sorted lists.

And we're done:

```
$ average-many pwned.py pwned.txt index.bin
.002546
```

Huh? ... that's unexpected...

Oh. I said we won't read the index in memory. But we can force it into the cache by reading it a bunch of times:

```
$ for _ in {1..10}; do cat index.bin > /dev/null; done
```

Finally:

```
$ average-many pwned.py pwned.txt index.bin
.000421
```

Code: pwned.py, generate-index.py.

I HEARD YOU LIKE INDEXES (THE END) #

Hmmm... isn't that cold start bugging you? If we make an index for the big index, we get 1.2 ms from a cold start. Maybe another smaller index can take us to below 1 ms?

... Just kidding, this is it, this really is the end.

And now, let's take that moment to ponder:

| method | statements | time (ms, order of magnitude) |
|---|---|---|
| linear | 29 | 100,000 |
| linear+skip | 42 | 100 |
| binary search | 49 | 10 |
| binary index | 59 (72) | 1 |

For twice the code, it's 5 orders of magnitude faster! I'm deliberately not counting bisect or the OS cache here, because that's the point, they're basically free.

Turns out, you can get pretty far with just a few tricks.

---

That's it for now.

Learned something new today? Share this with others, it really helps!

If you've made it this far, you might like: Write an SQL query builder in 150 lines of Python!

BONUS: BETTER DATA STRUCTURES #

As always, a specialized data structure can solve a problem better.

In Sketchy Pwned Passwords, Scott Helme manages to "pack" the entire passwords list into a 1.5G in-memory count-min sketch, which can then be queried in 1 microsecond. And, if you don't care about the counts, a plain Bloom filter can even take you to 0.1 µs! (There is a trade-off, and it's that it takes 11 hours to create either.)
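For a feel of how a Bloom filter answers "have I seen this?" without storing the items themselves, here's a toy sketch (mine, and nothing like the real setup in that article): it derives a few bit positions from slices of the SHA-1 we already compute, sets them on add, and checks them on lookup – false positives are possible, false negatives are not.

```python
import hashlib

class ToyBloomFilter:

    def __init__(self, size_bits, num_hashes=4):
        assert num_hashes <= 5  # a 20-byte SHA-1 only yields five 4-byte slices
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, password):
        # derive `num_hashes` bit positions from disjoint slices of the digest
        digest = hashlib.sha1(password.encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i*4:(i+1)*4], 'big') % self.size

    def add(self, password):
        for pos in self._positions(password):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, password):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(password)
        )
```

Sizing the bit array and the number of hashes against the 847 million passwords is what sets the false positive rate; filling it still takes a full pass over the list.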
[1] It's kinda difficult to do in Python anyway. [return]

[2] Something like (size / 4096) ** (1 / int(log(size, 4096))), maybe? [return]

[3] I mean, I did cross-check it with other solutions for a few thousand values, and it seems correct, but that's not proof. [return]

[4] We're only implementing the parts of the protocol that bisect uses. For the full protocol, __getitem__() would need to implement negative indexes and slicing. For more sequence methods for free, inherit from collections.abc.Sequence. Interestingly, the class will work in a for loop without an __iter__() method. That's because there are actually two iteration protocols: an older one, using __getitem__(), and a newer one, added in Python 2.1, using __iter__() and __next__(). For details on the logic, see Unravelling for statements. [return]