

HAS YOUR PASSWORD BEEN PWNED? OR, HOW I ALMOST FAILED TO SEARCH A 37 GB TEXT
FILE IN UNDER 1 MILLISECOND (IN PYTHON)

December 2022 ∙ 19 minute read

So, there's this website, Have I Been Pwned, where you can check if your email
address has appeared in a data breach.

There's also a Pwned Passwords section for passwords ... but, typing your
password on a random website probably isn't such a great idea, right?

Of course, you could read about how HIBP protects the privacy of searched
passwords, and understand how k-Anonymity works, and check that the website uses
the k-Anonymity API, but that's waaay too much work.

Of course, you could use the API yourself, but where's the fun in that?

Instead, we'll do it the hard way – we'll check passwords offline.

And we're not stopping until it's fast.

 * The Pwned Passwords list
 * A minimal plausible solution
 * Problem: it's slow
   * Skipping
 * Problem: it needs tuning, it's still slow
   * Binary skipping
   * Better timing
 * Failing to get to under 1 millisecond
   * Profile before optimizing
   * Position heuristic
   * Index file
   * Binary file
 * Getting to under 1 millisecond
   * Generating the index
   * Using the index
   * I heard you like indexes (the end)
 * Bonus: better data structures


THE PWNED PASSWORDS LIST #

OK, first we need the password list.

Go to Pwned Passwords, scroll to Downloading the Pwned Passwords list, and
download the SHA-1 ordered-by-hash file – be nice and use the torrent, if you
can.

Note

You can also use the downloader to get an updated version of the file, but that
didn't exist when I started writing this.

The archive extracts to a 37 GB text file:

$ pushd ~/Downloads
$ stat -f %z pwned-passwords-sha1-ordered-by-hash-v8.7z
16257755606
$ 7zz e pwned-passwords-sha1-ordered-by-hash-v8.7z
$ stat -f %z pwned-passwords-sha1-ordered-by-hash-v8.txt
37342268646
$ popd


... that looks like this:

$ head -n5 ~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt
000000005AD76BD555C1D6D771DE417A4B87E4B4:10
00000000A8DAE4228F821FB418F59826079BF368:4
00000000DD7F2A1C68A35673713783CA390C9E93:873
00000001E225B908BAC31C56DB04D892E47536E0:6
00000006BAB7FC3113AA73DE3589630FC08218E7:3


Each line has the format:

<SHA-1 of the password>:<times it appeared in various breaches>


Note

For more details: why hash the passwords, and why include a count.

To make commands shorter, we link the file in the current directory:

$ ln -s ~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt pwned.txt



A MINIMAL PLAUSIBLE SOLUTION #

We'll take an iterative, problem-solution approach to this. But since right now
we don't have any solution, we start with the simplest thing that could possibly
work.

First, we get the imports out of the way:

import os
import sys
import time
import getpass
import hashlib


Then, we open the file:

path = sys.argv[1]
file = open(path, 'rb')


Opening the file in binary mode makes searching a bit faster, as it skips
needless decoding. By opening it early, we get free error checking – no point in
reading the password if the file isn't there.

Next, the password:

try:
    password = sys.argv[2]
except IndexError:
    password = getpass.getpass()
hexdigest = hashlib.sha1(password.encode()).hexdigest()
del password

print("looking for", hexdigest)


We either take the password as the second argument to the script, or read it
with getpass(), so real passwords don't remain in the shell history. After
computing its hash, we delete it; we don't go as far as to zero1 it, but at
least we avoid accidental printing.

Let's see if it works:

$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
$ python pwned.py pwned.txt
Password:
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8


sha1sum seems to agree:

$ echo -n password | sha1sum
5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8  -


To find the hash, we go through the file line by line – remember, simplest thing
that could possibly work.

def find_line(lines, prefix):
    for line in lines:
        if line.startswith(prefix):
            return line
        if line > prefix:
            break
    return None


If a line was found, we print the count:

line = find_line(file, hexdigest.upper().encode())

if line:
    times = int(line.decode().rstrip().partition(':')[2])
    print(f"pwned! seen {times:,} times before")
else:
    print("not found")


Finally, let's add some timing code:

start = time.monotonic()
line = find_line(file, hexdigest.upper().encode())
end = time.monotonic()


print(f"in {end-start:.6f} seconds")


And, it works:

$ python pwned.py pwned.txt blocking
looking for 000085013a02852372159cb94101b99ccaec59e1
pwned! seen 587 times before
in 0.002070 seconds


The code so far.


PROBLEM: IT'S SLOW #

You may have noticed I switched from password to blocking. That's because I
deliberately chose a password whose hash is at the beginning of the file.

On my 2013 laptop, password actually takes 86 seconds!

To put a lower bound on the time it takes to go through the file:

$ time python -c 'for line in open("pwned.txt", "rb"): pass'

real	1m28.325s
user	1m10.049s
sys	0m15.577s
$ time wc -l pwned.txt
 847223402 pwned.txt

real	0m51.729s
user	0m27.908s
sys	0m12.119s


There must be a better way.


SKIPPING #

There's a hint in the file name: the hashes are ordered.

That means we don't have to check all the lines. We can skip ahead repeatedly
until we're past the hash, go back one step, and only check each line from
there.

Lines are different lengths, so we can't skip exactly X lines without reading
them. But we don't need to; any line that's a reasonable amount ahead will do.

def skip_to_before_line(file, prefix, offset):
    old_position = file.tell()

    while True:
        file.seek(offset, os.SEEK_CUR)

        file.readline()
        line = file.readline()
        # print("jumped to", (line or b'<eof>').decode().rstrip())

        if not line or line >= prefix:
            file.seek(old_position)
            break

        old_position = file.tell()


So we just seek() ahead a set number of bytes. Since that might not leave us at
the start of a line, we discard the incomplete line, and use the next one.

Finally, we wrap the original find_line() to do the skipping:

def find_line(file, prefix):
    skip_to_before_line(file, prefix, 16 * 2**20)
    return find_line_linear(file, prefix)


def find_line_linear(lines, prefix):
    for line in lines:


It works:

$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.027203 seconds


To find the magic 16M offset, I had to try a bunch of values:

offset (MiB)   time (s)
           1      0.05
           4      0.035
           8      0.030
          16      0.027  <-- sweet spot
          32      0.14


The code so far.


PROBLEM: IT NEEDS TUNING, IT'S STILL SLOW #

While three orders of magnitude faster, we still have a bunch of issues:

 1. The ideal offset depends on the computer you're running this on.
 2. The run time still increases linearly with file size – we haven't really
    solved the problem, as much as made it smaller by a (large, but) constant
    factor.
 3. The run time still increases linearly with where the hash is in the file.
 4. It's still kinda slow. ¯\_(ツ)_/¯

There must be a better way.


BINARY SKIPPING #

To make the "linear" part painfully obvious, uncomment the jumped to line.

$ python pwned.py pwned.txt password | grep -o 'jumped to .' | uniq -c
 139 jumped to 0
 139 jumped to 1
 139 jumped to 2
 139 jumped to 3
 139 jumped to 4
 103 jumped to 5


Surely, after we've jumped to 0 once, we don't need to do it 138 more times,
right?

We could jump directly to a line in the middle of the file; if there at all, the
hash will be in either of the halves. We could then jump to the middle of that
half, and to the middle of that half, and so on, until we either find the hash
or there's nowhere left to jump.

Note

If that sounds a lot like binary search, that's because it is – it's just not
wearing its regular array clothing.

And most of the work is already done: we can jump to a line at most X bytes from
where the hash should be; we only need to do it repeatedly, in smaller and
smaller fractions of the file size:

def skip_to_before_line(file, prefix, offset):
    while offset > 2**8:
        offset //= 2
        skip_to_before_line_linear(file, prefix, offset)


def skip_to_before_line_linear(file, prefix, offset):
    old_position = file.tell()


The only thing left is to get the file size:

def find_line(file, prefix):
    file.seek(0, os.SEEK_END)
    size = file.tell()
    file.seek(0)
    skip_to_before_line(file, prefix, size)
    return find_line_linear(file, prefix)


It works:

$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.009559 seconds
$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.000268 seconds


The huge time difference is due to operating system and/or disk caches – on the
second run, the same parts of the file are likely already in memory.

Anyway, look again at the jumped to output: instead of jumping blindly through
the whole file, now we're jumping around the hash, getting closer and closer to
it.

$ python pwned.py pwned.txt password | grep -o 'jumped to .' | uniq -c
   1 jumped to 7
   1 jumped to 3
   1 jumped to 7
   1 jumped to 5
   1 jumped to 4
  39 jumped to 5


$ python pwned.py pwned.txt password | grep -o 'jumped to ..' | uniq -c
   1 jumped to 7F
   1 jumped to 3F
   1 jumped to 7F
   1 jumped to 5F
   1 jumped to 4F
   1 jumped to 5F
   1 jumped to 57
   1 jumped to 5F
   1 jumped to 5B
   1 jumped to 59
   1 jumped to 5B
   1 jumped to 5A
  32 jumped to 5B


Notice we land on the same 7F... prefix twice; this makes sense – we skip ahead
by half the file size, then back, then ahead two times by a quarter. Caching
allows us to not care about that.

The code so far.


BETTER TIMING #

Given the way caching muddies the waters, how fast is it really?

This function averages a hundred runs, each with a different password:

function average-many {
    for _ in {1..100}; do
        python $@ $( python -c 'import time; print(time.time())' )
    done \
    | grep seconds \
    | cut -d' ' -f2 \
    | paste -sd+ - \
    | sed 's#^#scale=6;(#' | sed 's#$#)/100#' \
    | bc
}


After a few repeats, the average time settles around 3 ms.

$ for _ in {1..20}; do average-many pwned.py pwned.txt; done
.004802
.003904
.004088
.003486
.003451
.003476
.003414
.003442
.003169
.003297
.002931
.003077
.003092
.003011
.002980
.003147
.003112
.002942
.002984
.002934


Again, this is due to caching: the more we run the script, the more likely it is
that the pages at {half, quarters, eighths, sixteenths, ...} of the file size are
already in memory – and the path to any line starts with a subset of those.

--------------------------------------------------------------------------------

And, we're done.

I wave my hands, get a 2020 laptop, and a miracle happens. It's far enough into
the totally unpredictable future, now, that you can search any password in under
1 millisecond. You can do anything you want.

So, there we go. Wasn't that an interesting story? That's the end of the
article.

Don't look at the scroll bar. Don't worry about it.

If you came here to check your password, you can go.

Subscribe on the way out if you'd like, but take care :)

Go solve Advent of Code or something.

I'm just chilling out.

See ya.


FAILING TO GET TO UNDER 1 MILLISECOND #

I swear this was supposed to be the end. This really was supposed to be a short
one.

Here's a quote from a friend of mine that chronologically belongs much later in
the article, but makes a great summary of what follows:

> And now that you have arrived at this point, spend a moment to ponder the
> arbitrary nature of 1 millisecond given its dependency on the current year and
> the choice of your particular hardware.
> 
> After that moment, continue celebrating.

Nah, fuck it, it has to take less than 1 millisecond on the old laptop.

... so yeah, here's a bunch of stuff that didn't work.



PROFILE BEFORE OPTIMIZING #

With the obvious improvements out of the way, it's probably a good time to stop
and find out where time is being spent.

$ python -m cProfile -s cumulative pwned.py pwned.txt "$( date )"
looking for 3960626a8c59fe927d3cf2e991d67f4c505ae198
not found
in 0.004902 seconds
         1631 function calls (1614 primitive calls) in 0.010 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      ...
        1    0.000    0.000    0.005    0.005 02-binary-search.py:8(find_line)
        1    0.000    0.000    0.005    0.005 02-binary-search.py:22(skip_to_before_line)
       28    0.000    0.000    0.005    0.000 02-binary-search.py:28(skip_to_before_line_linear)
       86    0.004    0.000    0.004    0.000 {method 'readline' of '_io.BufferedReader' objects}
      ...
       71    0.000    0.000    0.000    0.000 {method 'seek' of '_io.BufferedReader' objects}
      ...
       44    0.000    0.000    0.000    0.000 {method 'tell' of '_io.BufferedReader' objects}
      ...


From the output above, we learn that:

 * Most of the time is spent in readline() calls.
 * Both seek() and tell() calls are basically free.

readline() is implemented in C, so there's not much we can change there.

What we can change, however, is how often we call it.

--------------------------------------------------------------------------------

Another thing of interest is how much individual readline() calls take.

In skip_to_before_line_linear():

        start = time.monotonic()
        file.readline()
        line = file.readline()
        end = time.monotonic()
        print("jumped to", (line[:5] or b'<eof>').decode().rstrip(),
              f"in {(end-start)*1000000:4.0f} us",
              f"at offset {file.tell():16,}")


The output is pretty enlightening:

$ python pwned.py pwned.txt asdf
looking for 3da541559918a808c2402bba5012f6c60b27661c
jumped to 7FF9E in   10 us at offset   18,671,134,394  <-- 1/2 file size
jumped to 3FF89 in    4 us at offset    9,335,567,234  <-- 1/4 file size
jumped to 1FFBA in    3 us at offset    4,667,783,663  <-- 1/8 file size
jumped to 3FF89 in    3 us at offset    9,335,567,322  <-- 1/4 file size
jumped to 2FFA4 in    5 us at offset    7,001,675,508
jumped to 3FF89 in    4 us at offset    9,335,567,366  <-- 1/4 file size
jumped to 37F98 in    4 us at offset    8,168,621,453
jumped to 3FF89 in    3 us at offset    9,335,567,410  <-- 1/4 file size
jumped to 3BF94 in    3 us at offset    8,752,094,477
jumped to 3FF89 in    2 us at offset    9,335,567,498  <-- 1/4 file size
jumped to 3DF8E in    3 us at offset    9,043,831,007
jumped to 3CF90 in    3 us at offset    8,897,962,782
jumped to 3DF8E in    2 us at offset    9,043,831,095
jumped to 3D790 in    3 us at offset    8,970,896,964
jumped to 3DF8E in    2 us at offset    9,043,831,139
jumped to 3DB90 in  253 us at offset    9,007,364,072
jumped to 3D990 in  206 us at offset    8,989,130,552
jumped to 3DB90 in    6 us at offset    9,007,364,160
jumped to 3DA8F in  270 us at offset    8,998,247,402  <-- page 2,196,837
jumped to 3DA0F in  189 us at offset    8,993,689,007
jumped to 3DA8F in    5 us at offset    8,998,247,446  <-- page 2,196,837
jumped to 3DA4F in  212 us at offset    8,995,968,274
jumped to 3DA8F in    5 us at offset    8,998,247,534  <-- page 2,196,837
jumped to 3DA6F in  266 us at offset    8,997,107,921
jumped to 3DA5F in  203 us at offset    8,996,538,139
jumped to 3DA57 in  195 us at offset    8,996,253,241
jumped to 3DA53 in  197 us at offset    8,996,110,772
jumped to 3DA57 in    6 us at offset    8,996,253,285
jumped to 3DA55 in  193 us at offset    8,996,182,045
jumped to 3DA54 in  178 us at offset    8,996,146,471
jumped to 3DA54 in  189 us at offset    8,996,128,666
jumped to 3DA54 in  191 us at offset    8,996,119,760  <-- page 2,196,318
jumped to 3DA54 in   32 us at offset    8,996,128,710
jumped to 3DA54 in    5 us at offset    8,996,124,259
jumped to 3DA54 in   10 us at offset    8,996,122,057  <-- page 2,196,318
jumped to 3DA54 in    4 us at offset    8,996,120,955  <-- page 2,196,318
jumped to 3DA54 in    4 us at offset    8,996,120,382  <-- page 2,196,318
jumped to 3DA54 in    9 us at offset    8,996,120,112  <-- page 2,196,318
jumped to 3DA54 in    1 us at offset    8,996,120,470  <-- page 2,196,318
jumped to 3DA54 in    1 us at offset    8,996,120,338  <-- page 2,196,318
pwned! seen 324,774 times before
in 0.003654 seconds


Half the reads are pretty fast:

 * In the beginning, because searches start with the same few pages.
 * At the end, because searches end on the same page.
 * All reads of any page, after the first.

So, it's those reads in the middle that we need to get rid of.


POSITION HEURISTIC #

In theory, the output of a good hash function should be uniformly distributed.

This means that with a bit of math, we can estimate where a hash would be – a
hash that's ~1/5 in the range of all possible hashes should be at ~1/5 of the
file.

Here's a tiny example:

>>> digest = '5b'  # 1-byte hash (2 hex digits)
>>> size = 1000    # 1000-byte long file
>>> int_digest = int(digest, 16)  # == 91
>>> int_end = 16 ** len(digest)   # == 0xff + 1 == 256
>>> int(size * int_digest / int_end)
355
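
The same arithmetic works on the real file; as a rough back-of-the-envelope
check (not exact), using the hash of asdf from the jumped to output above and
the file size we got from stat:

digest = '3da541559918a808c2402bba5012f6c60b27661c'  # SHA-1 of "asdf"
size = 37_342_268_646                                # pwned.txt size in bytes
estimate = int(size * int(digest, 16) / 16 ** len(digest))
print(f"{estimate:,}")  # ~8,992,000,000 – a few MB short of the ~8,996,120,000
                        # offsets the jumped to output actually converged on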


We can do this once, and then binary search a safety interval around that
position. Alas, this only gets rid of the fast jumps at the beginning of the
binary search, and for some reason, it ends up being slightly slower than binary
search alone. (code)
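
Roughly, that first variant could look something like the sketch below – this
is illustrative only, not the code linked above; it reuses the sorted-lines
comparison trick from earlier, and if the margin is too small, it can miss
hashes that fall before the window (which is exactly the correctness worry
mentioned below):

def find_line_heuristic(file, prefix, size, margin=32 * 2**20):
    # guess where the hash should be, then only search a +/- margin window
    estimate = size * int(prefix, 16) // 16 ** len(prefix)
    lo = max(estimate - margin, 0)
    hi = min(estimate + margin, size)

    # binary search on byte offsets within the window; after each seek,
    # discard the partial line and compare the next full one
    while hi - lo > 2**8:
        mid = (lo + hi) // 2
        file.seek(mid)
        file.readline()
        line = file.readline()
        if not line or line >= prefix:
            hi = mid
        else:
            lo = mid

    # linear scan over what's left of the window
    file.seek(lo)
    if lo:
        file.readline()
    for line in file:
        if line.startswith(prefix):
            return line
        if line > prefix:
            break
    return None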

We can also narrow down around the estimated position iteratively, making the
interval smaller by a constant factor each time. This seems to work: a factor of
1000 yields 1.7 ms, and a factor of 8000 yields 1.2 ms, both in 2 steps. (code)

However, it has deeper issues:

 * Having arbitrary start/end offsets complicates the code quite a bit.
 * I don't know how to reliably determine the factor.2
 * I don't know how to prove it's correct,3 especially for smaller intervals,
   where the hashes are less uniform. To be honest, I don't think it can be 100%
   correct, and I don't know how to estimate how correct it is.

Anyway, if the implementation is hard to explain, it's a bad idea.


INDEX FILE #

An arbitrary self-imposed restriction I had was that any solution should mostly
use the original passwords list, with little to no preparation.

By relaxing this a bit, and going through the file once, we can build an index
like:

<SHA-1 of the password>:<offset in file>


... that we can then search with skip_to_before_line().

Of course, we won't include all the hashes – by including lines a few kilobytes
apart from each other, we can seek directly to within a few kilobytes in the big
file.

The only thing left to figure out is how much "a few kilobytes" is.

After my endless harping about caching and pages, the answer should be obvious:
one page size (4K). And this actually gets us 0.8 ms! But back when I wrote the
code, that hadn't really sunk in, so after getting 1.2 ms with a 32K distance, I
moved on.
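
(If you're curious, Python can tell you the page size; on an x86 laptop like
the one used here it's 4096 bytes, though other platforms differ:)

>>> import mmap
>>> mmap.PAGESIZE
4096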

Code: pwned.py, generate-index.py.


BINARY FILE #

Already on the additional file slippery slope, I converted the list to binary,
mostly to make it smaller – smaller file, fewer reads.

I packed each line into 24 bytes:

| binary hash (20 bytes) | count (4 bytes) |


This halved the file, but only lowered the runtime to a modest 2.6 ms.

More importantly, it made the code much, much simpler: because items are fixed
size, you can know where the Nth item is, so I was able to use bisect for the
binary search.

Code: pwned.py, convert-to-binary.py.
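
To make the layout concrete, here's a toy sketch of the idea – not the linked
convert-to-binary.py or pwned.py; pack_line and lookup are made-up names – each
record is the raw 20-byte digest followed by a 4-byte big-endian count, and
fixed-size records mean the Nth record starts at byte N × 24:

import bisect, hashlib

RECORD_SIZE = 24  # 20-byte binary SHA-1 + 4-byte big-endian count

def pack_line(line):
    # b'5BAA...8FD8:9545824\n' -> 24 bytes
    digest, _, count = line.rstrip().partition(b':')
    return bytes.fromhex(digest.decode()) + int(count).to_bytes(4, 'big')

def lookup(records, password):
    # records: a sorted sequence of 24-byte entries; returns the count, or None
    digest = hashlib.sha1(password.encode()).digest()
    i = bisect.bisect_left(records, digest)
    if i < len(records) and records[i][:20] == digest:
        return int.from_bytes(records[i][20:], 'big')
    return None

records = sorted(map(pack_line, [
    b'000000005AD76BD555C1D6D771DE417A4B87E4B4:10\n',
    b'5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8:9545824\n',
]))
print(lookup(records, 'password'))  # 9545824

In the real script, records would presumably be a file-backed sequence over the
binary file on disk (like the BytesArray below), not an in-memory list.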


GETTING TO UNDER 1 MILLISECOND #

OK, what now? This is what we have so far:

 * The position heuristic kinda (maybe?) works, but is hard to reason about.
 * The index file gets us there, but barely, and the index is pretty big.
 * The binary file isn't much faster, and it creates a huge file. But, less
   code!

I don't know what to do with the first one, but I think we can combine the last
two.


GENERATING THE INDEX #

Let's start with the script I made for the text index:

import os, sys

file = sys.stdin.buffer
outf = sys.stdout.buffer

while True:
    pos = file.tell()
    line = file.readline()
    if not line:
        break

    outf.write(line.partition(b':')[0] + b':' + str(pos).encode() + b'\n')

    file.seek(2**12, os.SEEK_CUR)
    file.readline()


The output looks like this:

$ python generate-index.py < pwned.txt 2>/dev/null | head -n5
000000005AD76BD555C1D6D771DE417A4B87E4B4:0
00000099A4D3034E14DF60EF50799F695C27C0EC:4157
00000172E8E1D1BD54AC23B3F9AB4383F291CA17:8312
000002C8F808A7DB504BBC3C711BE8A8D508C0F9:12453
0000047139578F13D70DD96BADD425C372DB64A9:16637


We need to pack that into bytes.

A hash takes 20 bytes. But, we only need slightly more than 3 bytes (6 hex
digits) to distinguish between the index lines:

$ python generate-index.py < pwned.txt 2>/dev/null | cut -c-6 | uniq -c | head
   2 000000
   1 000001
   1 000002
   1 000004
   1 000005
   1 000007
   1 000008
   1 00000A
   1 00000B
   1 00000C


To represent all the offsets in the file, we need log2(35G) / 8 = 4.39... bytes,
which results in a total of 9 bytes (maybe even 8, if we mess with individual
bits).
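
A quick way to double-check that arithmetic:

>>> import math
>>> round(math.log2(37_342_268_646) / 8, 2)
4.39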

Let's make it future-proof: 6 bytes for the hash buys at least 55 trillion lines
(2.4 petabyte files), and 6 bytes for the offset buys 0.28 petabyte files.

    digest, _, _ = line.partition(b':')
    outf.write(bytes.fromhex(digest.decode())[:6])
    outf.write(pos.to_bytes(6, 'big'))


If you look at the text index, you'll notice the offsets are 4K + ~50 bytes
apart; this results in sometimes having to read 2 pages, because not all pages
have an index entry. Let's fix that by reading the first whole line after a 4K
boundary instead:

    file.seek((pos // 4096 + 1) * 4096)
    file.readline()


OK, we're done:

$ time python generate-index.py < pwned.txt > index.bin

real	1m2.729s
user	0m34.292s
sys	0m21.392s



USING THE INDEX #

We start with a skeleton that's functionally identical to the naive script. The
only difference is that I've added stubs for passing and using the index:

def find_line(file, prefix, index):
    skip_to_before_line_index(file, prefix, index)
    return find_line_linear(file, prefix)


def skip_to_before_line_index(file, prefix, index):
    ...


index_path = sys.argv[2]
index = ...


line = find_line(file, hexdigest.upper().encode(), index)


The whole skeleton, if you want to see it:

import os
import sys
import time
import bisect
import getpass
import hashlib


def find_line(file, prefix, index):
    skip_to_before_line_index(file, prefix, index)
    return find_line_linear(file, prefix)


def find_line_linear(lines, prefix):
    for line in lines:
        if line.startswith(prefix):
            return line
        if line > prefix:
            break
    return None


def skip_to_before_line_index(file, prefix, index):
    ...


path = sys.argv[1]
file = open(path, 'rb')

index_path = sys.argv[2]
index = ...

try:
    password = sys.argv[3]
except IndexError:
    password = getpass.getpass()
hexdigest = hashlib.sha1(password.encode()).hexdigest()
del password

print("looking for", hexdigest)

start = time.monotonic()
line = find_line(file, hexdigest.upper().encode(), index)
end = time.monotonic()

if line:
    times = int(line.decode().rstrip().partition(':')[2])
    print(f"pwned! seen {times:,} times before")
else:
    print("not found")

print(f"in {end-start:.6f} seconds")


As mentioned before, we'll use the standard library bisect module to search the
index.

We could read the entire index in memory, as a list of 12-byte bytes. But that
would still be slow, even if outside the current timing code, and memory usage
would increase with the size of the file.

Fortunately, bisect doesn't only work with lists, it works with any sequence –
that is, any object that can pretend to be a list. So we'll build our own, by
implementing the two required4 special methods.

class BytesArray:

    item_size = 12

    def __init__(self, file):
        self.file = file


We can go ahead and plug it in:

index_path = sys.argv[2]
index = BytesArray(open(index_path, 'rb'))


The first special method is __getitem__(), for a[i]:

    def __getitem__(self, index):
        self.file.seek(index * self.item_size)
        buffer = self.file.read(self.item_size)
        if len(buffer) != self.item_size:
            raise IndexError(index)  # out of bounds
        return buffer


The second special method is __len__(), for len(a):

    def __len__(self):
        self.file.seek(0, os.SEEK_END)
        size = self.file.tell()
        return size // self.item_size


Using the index becomes straightforward:

def skip_to_before_line_index(file, prefix, index):
    item_prefix = bytes.fromhex(prefix.decode())[:6]
    item = find_lt(index, item_prefix)
    offset = int.from_bytes(item[6:], 'big') if item else 0
    file.seek(offset)


def find_lt(a, x):
    i = bisect.bisect_left(a, x)
    if i:
        return a[i-1]
    return None


We get the first 6 bytes of the hash, find the rightmost value less than that,
extract the offset from it, and seek to there. find_lt() comes from bisect's
recipes for searching sorted lists.
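
As a toy illustration of how that plays out – made-up items and offsets, with
find_lt() being the function defined above:

# three fake 12-byte index items: 6-byte hash prefix + 6-byte offset
items = [
    bytes.fromhex('000000005ad7') + (0).to_bytes(6, 'big'),
    bytes.fromhex('3da541559918') + (8_996_120_000).to_bytes(6, 'big'),
    bytes.fromhex('7ff9e0000000') + (18_671_134_000).to_bytes(6, 'big'),
]

# looking for a hash starting with 3DA6...: the rightmost item below it is
# the 3DA5... one, and its offset is where we start reading the big file
item = find_lt(items, bytes.fromhex('3da6ff000000'))
print(int.from_bytes(item[6:], 'big'))  # 8996120000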

And we're done:

$ average-many pwned.py pwned.txt index.bin
.002546


Huh? ... that's unexpected...

Oh.

I said we won't read the index in memory. But we can force it into the cache by
reading it a bunch of times:

$ for _ in {1..10}; do cat index.bin > /dev/null; done


Finally:

$ average-many pwned.py pwned.txt index.bin
.000421


Code: pwned.py, generate-index.py.


I HEARD YOU LIKE INDEXES (THE END) #

Hmmm... isn't that cold start bugging you? If we make an index for the big
index, we get 1.2 ms from a cold start. Maybe another smaller index can take us
to below 1 ms?

...

Just kidding, this is it, this really is the end.

And now, let's take that moment to ponder:

method          statements   time (ms, order of magnitude)
linear              29           100,000
linear+skip         42               100
binary search       49                10
binary index     59 (72)                1

For twice the code, it's 5 orders of magnitude faster! I'm deliberately not
counting bisect or the OS cache here, because that's the point – they're
basically free.

Turns out, you can get pretty far with just a few tricks.

--------------------------------------------------------------------------------

That's it for now.



BONUS: BETTER DATA STRUCTURES #

As always, a specialized data structure can solve a problem better.

In Sketchy Pwned Passwords, Scott Helme manages to "pack" the entire passwords
list into a 1.5G in-memory count-min sketch, which can then be queried in
1 microsecond. And, if you don't care about the counts, a plain Bloom filter can
even take you to 0.1 µs! (There is a trade-off, and it's that it takes 11 hours
to create either.)
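
For a taste of the Bloom filter half of that, here's a toy sketch – illustrative
only, not Scott Helme's code; the class, the salted-SHA-1 hashing scheme, and
the sizes are made up, and the real thing would size the bit array from the
~850M hashes and a target false-positive rate:

import hashlib

class BloomFilter:
    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # derive k bit positions by salting SHA-1 with the hash index
        for i in range(self.num_hashes):
            digest = hashlib.sha1(i.to_bytes(4, 'big') + item).digest()
            yield int.from_bytes(digest[:8], 'big') % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # no false negatives; false positives at a rate set by size/num_hashes
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# toy usage: add one pwned hash, then query
bf = BloomFilter(size_bits=2**24, num_hashes=7)
bf.add(b'5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8')
print(b'5BAA61E4C9B93F3F0682250B6CF8331B7EE68FD8' in bf)  # True
print(b'0000000000000000000000000000000000000000' in bf)  # almost surely False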


 1. It's kinda difficult to do in Python anyway. [return]

 2. Something like (size / 4096) ** (1 / int(log(size, 4096))), maybe? [return]

 3. I mean, I did cross-check it with other solutions for a few thousand values,
    and it seems correct, but that's not proof. [return]

 4. We're only implementing the parts of the protocol that bisect uses.
    
    For the full protocol, __getitem__() would need to implement negative
    indexes and slicing. To get more sequence methods for free, inherit from
    collections.abc.Sequence.
    
    Interestingly, the class will work in a for loop without an __iter__()
    method. That's because there are actually two iteration protocols: an older
    one, using __getitem__(), and a newer one, added in Python 2.1, using
    __iter__() and __next__(). For details on the logic, see Unravelling for
    statements. [return]
