World Password Day cracking challenge

jpgoldberg · April 2018

This is for discussion of How strong should your Master Password be? For World Password Day we’d like to know, our password cracking challenge announced today (April 26) and will begin on May 3.

jpgoldberg · April 2018

The link for bugcrowd enrollment is now live: https://bugcrowd.com/onepasswordgame

jpgoldberg · May 2018

And the race has started. The actual challenge hashes (actually derived keys) have now been published at https://github.com/agilebits/crackme/blob/master/password-day-2018.json

Yesterday we published GnuPG/PGP signature for the challenge data and signature for the solution file

Signature file `password-day-2018.json.asc`

Signature file `password-day-2018-answers.json.asc`

jpgoldberg · May 2018

Why do we even need to keep an "answers" file?

We do not need to keep the answers file in order to check candidate answers. Indeed, anyone can verify whether a candidate answer is correct or not, so why run the risk¹ of keeping such a file around? We could have simply destroyed the solution data once the challenge hashes were created (which really would have saved me the trouble of figuring out how to encrypt it and store it safely).

We are keeping it around and publicly posting its signature in case we are later accused of not creating the challenges by our stated rules. Suppose that nobody ends up claiming a prize and suspect that we may have, say, used words that aren't on the published wordlist. While that isn't something that we would ever do, and I don't actually think that someone is likely to accuse us of doing so, we are keeping the passwords used around as proof in case there ever is any need to demonstrate that our challenges do have solutions of the form described.

Why did we PGP sign instead of putting the hash in "The Blockchain"?

Using a public ledger, such as some coin blockchain would provide a proof of the time before which the signature was created. Signing with a code signing certificate would also provide a trusted time stamp. So why use PGP, which doesn't have a trusted time stamp?

The answer is simple: Laziness and cost. Given the "frothy" nature of cryptocurrency valuation, it has become expensive to do anything with them other than simple speculative trading. As for not using a code-signing certificate, I do not have access to our code signing keys, and it wouldn't really have been worth the effort of asking someone who does has access to our code signing keys to sign the files. So you're just going to have to trust that we won't swap things out later and backdate. You can make a local copy of the signatures now to see that we don't do anything like that. (And seriously, why would we? But it is nice to have a system where the trust is in the math.)

It is very well encrypted and is in a bank safe deposit box, by the way. It really is protected by multiple factors. ↩︎

jpgoldberg · May 2018

One week in. What we've learned

Our cracking challenged launched one week ago and so far no one has found a password. This could mean one of two things.

Nobody is trying. Or
It is (at least) as difficult as our original estimate.

People are trying

Several people expressed interest in cracking. Some people, such as Chick3nman even tweeted screenshots of their hashcat reports.

hashcat running on 8 cores

Now at 250 million 240 thousand hashes per second spread over seven challenges, if he keeps up this pace then the average time to hit on a solution should be about three weeks. Others, such as NETMUX, have also posted. So we know that some people are dedicating serious resources toward cracking, but we don't know how many.

We didn't make it too easy

Recall that although we have some very very rough guesses about the actual time, electricity, and effort that needs to go into this sort of cracking, it is because we want to get better estimates that we are running the contest. So while we were hoping for "a few weeks", we wouldn't have been terribly surprised if answers came in within days

This saves us money. Had we it been too easy, we would have needed to run a harder contest that would last at least a few weeks to get more useful data.

It still might be too hard

If we've made it too hard (that is if we don't get winners within, say, a month), then we still won't get the data that we need. We need winners to tell us what they did and the effort they put in. So we need winners. If we don't get winners, then we may have to consider increasing the prizes. Obviously we can't say "on such and such date we will increase prizes by such and such amount" as that might incentive people to delay reporting solutions. But if it comes to having to increase prizes, we will figure something out.

Lessons so far

Three word Master Passwords passwords from our generator are not terrible. This is very good news. I'm still strongly inclined to recommend four words. Again, the strength of your Master Password is your line of defense against someone who captures your 1Password data from your machine. (Your Secret Key works additional protection if data is stolen from our machines, but it doesn't protect you from data stolen from yours.)

As I said in the original blog post if someone captures 1Password data from your machine the strength of your Master Password is what determines how much time you to change the passwords you store in 1Password. So while it's nice to know that a three word generated password might buy you a week or two (or three or four), go with a four word Master Password so that you will have years or decades.

Curcumin · May 2018

I found benchmarks for the p3.16xlarge here . I tested locally and the iterations do scale linearly. So you would do 100x less hashes than what's specified in the benchmark (since HC benchmarks with 1,000). That yields 212kH/s (slightly slower than the screenshot above). Over 330 days to exhaust the list and at $600/day it adds to almost $200k. This is contingent on my math which I may have slipped up somewhere.

EnerJi · May 2018

@Curcumin:

Out of curiosity, did you check whether the newer p3.16xlarge instances were more cost effective than the p2.16xlarge instances?

I suspect that even if they are, they still won't be cheap enough to interest someone in using them for this contest. Someone who already owns and has amortized the cost of a mining rig and only needs to pay for electricity (plus the opportunity cost of not mining) is much more likely to break-even.

@jpgoldberg Thanks for the running updates. Following this contest with much interest!

Curcumin · May 2018

@EnerJi updated my post to P3 - I was looking at the wrong benchmark.

AGAlumB · May 2018

Someone who already owns and has amortized the cost of a mining rig and only needs to pay for electricity (plus the opportunity cost of not mining) is much more likely to break-even.

@EnerJi: That's a really salient point. :)

Over 330 days to exhaust the list and at $600/day it adds to almost $200k.

@Curcumin: Thank you for sharing! It's definitely interesting to see what people come up, even theoretically. Looking forward to more. :)

Curcumin · May 2018

If a mod could delete my first comment that would be awesome. I got stuck in moderation queue then it posted wayyyy after the fact.

I think there may be a mistake in @jpgoldberg's post.

Now at 250 million hashes per second

I believe he's running at 240,000 Hashes/s. 30,000 on each GPU * 8 = 240,000.

Words in list: 18329
Three word keyspace: 18329^3 = 6157668625289

Days = [Keyspace/Hashes per second/86400 Seconds in a day]

6157668625289/240000/86400 = ~297 days

Also this isn't computed over each of the 7 results because they have unique salts and need to be calculated independently. Unsubstantiated thought, but I think it might be smarter to spend all of the hashing power on just one hash since it will yield a guaranteed result in at most 297 days. When spreading it across 7 hashes, it's technically possible for those 297 days to pass without a single result. Although I'm not sure how HashCat distributes the work; if it does it in a linear fashion by iterating through the entire list for each hash, or takes one password and tests it through each hash. I'm sure they've already implemented an optimized way to handle multiple hashes.

Ah I wish I could join the fun, but my puny 970 GTX can't hang with the big guys. I would love to build a dedicated rig, but I'm not sure if I would have any other use for it to be honest (not crypto mining).

Curcumin · May 2018

--deleted--

jpgoldberg · May 2018

I think there may be a mistake in @jpgoldberg's post.

Indeed, by a factor of 1000. I had started to write "a quarter of a million" and then decided to go with "250 thousand", but somehow had "million" still stuck in my head.

jpgoldberg · May 2018

My calculation based on 250 thousand (I had looked at the thing, thought "quarter of a million" as an approximation for "240 thousand" and thereafter that approximation was stuck in my head) was calculated with something like

TotalSpace := 18350^3  //  6.178857875e12
AverageNeededGuesses := TotalSpace/2 //  3.0894289375e12
SecondPerDay := 60 * 60 * 24   // 86400
HashesPerDay := 250000 * SecondPerDay // 2.16e10
Days := AverageNeededGuesses / HashesPerDay  // 143

// The step I'm not sure is justified:
Targets := 7 // there are seven challenges to crack
DaysUntilHit := Days/Targets // 20

That last step, dividing by seven for the number of challenges offered, is based on the assumption that each reported "hash" is actually trying against all seven targets. I believe that that is possible because the salt in PBKDF2 is only used in the first round. So I would need someone who knows more about hashcat and PBKDF2 cracking to tell me whether I am right or wrong on that last part.

Curcumin · May 2018

@jpgoldberg I don't think average needed guesses is a correct metric, since that would just calculate a 50% chance of hitting it within that time frame.

is based on the assumption that each reported "hash" is actually trying against all seven targets.

That is incorrect because each hash has a unique salt, meaning they need to be calculated independently. That means if you run the function with the salt of hash 1 and use the correct password for hash 2, that wouldn't trigger a success for hash 2 since it was created with a different salt. What you're saying would only be true if they all used the same salt and just different passwords.

This goes back to my point wondering if it's smarter to throw all of the hashing power towards one hash, so you are guaranteed a result in the shortest timeframe.

Also would you be able to delete the comment #429488? It got stuck in the moderation queue and now it's duplicated on my previous one.

jpgoldberg · May 2018

This isn't computed over each of the seven results because they have unique salts and need to be calculated independently.

That is true. But the average time to crack one of them is not the same as the average time to crack any of them. It will take somewhere between 1 and 6157668625289 to to crack one of these. The average for hitting any particular one will be half of that range.

But we don't need to go halfway through before we find the first one (of seven). Some of these will take fewer than 6157668625289/2 guesses, and some will take more. By going after all seven, you are going to hit the ones that take fewer first.

Dividing by seven was the wrong thing to do. but there is some real gain in going after all seven simultaneously. (I need to review an on-line statistics course I took a few years back that talked about stuff like this.)

jpgoldberg · June 2018

Doubling the prizes

As of June 11, we are doubling the prizes:

For the first person or team to crack a three word password, we offer 4096 8192 USD.
For the second person or team to crack a different three word password, we offer 2048 4096 USD.
For the third person or team to crack yet a different three word password, we offer 1024 2048 USD.
And for the fourth person or team to crack yet another one, we we offer 1024 USD.

Why double

My pre-contest guess of a few weeks or of what it would take to incentivize more hardware being thrown into the effort appear to have been too low. Again, the whole point of us doing this is so that we have more information to go on the next time we "guess".

As we said in the original announcement

If no correct submission has been submitted within one month, we may increasing the prizes. However, such an increase and the timing of it (if it occurs) will be unpredictable. Do not delay a submission in the hope of an increased prize.

We do want to pay people fairly for the resources that they are putting into teaching us how hard or easy these are to crack. For those who have been working on this for more than a month, we want to make that worthwhile. And perhaps there are those who think they might be able to catch up, hopefully the new prizes will bring those people into the effort.

So get cracking!

Resources

All submissions must be through the Bugcrowd 1Password Game program
Our original announcement
The source used for generating the challenges
The challenge data

analogist · June 2018

By going after all seven, you are going to hit the ones that take fewer first... there is some real gain in going after all seven simultaneously.

@jpgoldberg This bothered me for a while and I spent a while to get my head around it, since I'm not very good at math.

Assume the success of cracking any particular hash is a draw from a uniform distribution with a PDF of and CDF of .
The CDF of the minimum of n independent draws from a uniform distribution would be , so its PDF is the derivative, or

Since drawing from is n times faster than , we want to know the percentage of time where

or

For n = 7 and F(x) = x, this is
where x = [0, 1]

This is ~52% for 7 hashes, or a factor of speedup (to get the first one) of ~2.1x

This is a plot of factor of speedup to getting any 1 hash, when attacking n hashes simultaneously. I think. This might be wrong, though.

jpgoldberg · June 2018

Thank you @analogist!

That all seems plausible, but I'd need to check up on the CDF for n draws from a uniform distribution. I think you are on the right track even if you are wrong. (I have no specific reason to believe that you are wrong, but I, along with others, have a history of being wrong about this.)

jpgoldberg · July 2018

Is it time to offer hints?

As we approach two months with no results, we have learned from people who've tried that we did under-estimate the time and effort it would take to crack these. We've already doubled the reward once, and we are open to increasing it again, but we would need to know that there is an end in sight to that approach.

An alternative is to offer hints. Other than the logistics of doing so (the answers are stored, encrypted of course, on removable media in a "secure location"), there is the concern about whether providing hints would disrupt the work that people have been doing already for nearly two months.

I am not an expert in the use of cracking tools, and so I don't know whether offering hints would mean that existing effort would be thrown away.

The sorts of hints, that I'm considering would be things like

The total number of characters, including the spaces, in challenge ID is N
The length of the first word in challenge ID is N
When the password for challenge ID is simply hashed once, the first N bits of the hash are B

Or similar. The first two would make some of the challenges easier than others, which is unfortunate. I would like to find hints that apply equally to all of the challenges. So I would be more inclined to go with something like the third hint.

So what I would like to ask is

Would giving hints be unfair to those who have already been working at this?
If not, then what sorts of hints would participants recommend.

CyberSpaceCowboy · July 2018

@jpgoldberg I am glad that you are considering hints!

Hashcat allows for resumeing and as long as on restart you can add additional flag argument to restrict the length by an upper and lower bound so a hint on the lines of size of pwd should be a good approach. If resume doesn't allow for flags and people are using crunch for feeding, then at least I'm sure you can alter crunches flags on resume. Basically the cracking program (hashcat) will just throw out before encryption of bad passwords so anything before is checked in full but anything after is tested then checked. (A second confirmation on this would be nice though by anyone with a mining rig cluster)

No matter what though hashcat gives you a way to see current placement in the list used, so the work required to refactor a full list and subtract used pairs would be reasonable given that the permutation is reduced competitively. Some methods would just require more work than others.

CyberSpaceCowboy · July 2018

Something I want to put out for everyone to help understand why the PBKDF2-HMAC-SHA256 method is being so slow even on dedicated hardware.

To conduct this process over a GPU as an advantage to get more values/second, compared to CPUs may not be as beneficial as expected given the parameters of the challenge. A GPU has the advantage of doing repetitive 32-bit integer operations in parallel very, very fast, like rendering polygons for example (you know, what they were designed for?). In this challenge we are using PBKDF2 with the HMAC-SHA256 Digest method consisting of 32-bit operations and the derived key size for this exercise of a 32 bit output (all checks out right?). Normally a GPU can still handle this operation fairly well, however the problem that the GPU is faced with for this case is the hash stretching or salting of the password. For this exercise 100000 rounds of stretching is implemented (very safe and reasonable) and a completely random 16 bit salt is used (although given in the challenge). Because of the nature of salting, using the output of a previous round as the input for the next, GPUs cannot put these actions in parallel. This would mean a cracking program (hashcat) would need to put these actions in block, rather than the more prefered parallel and 100000 is a lot to handle. I do not know the inner workings of hashcat's distribution when a big salt is used but this problem has been cited as an issue before in cracking slow down not being linearly proportional to the stretching iterations, so if I had to take a guess it doesn't handle it perfectly. (sources can be furnished upon request)

Sc00bz · July 2018

@analogist sorry but your analysis is wrong. @jpgoldberg I told you on Twitter one salt at a time is faster.

Probability of any event happening with 7 independent events. Each event has a probability of happening of 9.4276...%:
50% = 1 - (1 - 0.094276...) ^ 7

How I got "0.094276..." (You can change the 50% to any percent from 0% to less than 100%):
0.094276... = 1 - e ^ (ln(1 - 50%) / 7)

So when checking 7 salts at the same time you need to check 0.65993 * keyspace (7 * 0.094276) number of keys to get a 50% chance of finding one. But if you used just one salt you only need to check 0.5 * keyspace number of keys to get a 50% chance of finding it. Trying to crack all 7 salts at the same time will take 31.987% longer to get to a 50% chance. Note at 1 * keyspace number of keys with one salt you are guaranteed to crack it, but with all 7 it's 66.008% (1 - (1 - 1 / 7) ^ 7) chance of finding one (1 * keyspace = 1 / 7 * keyspace * 7 salts).

Sc00bz · July 2018

@jpgoldberg as for hints it's more realistic to get a length of a word or password by shoulder surfing or listening. Giving a length can help some more than others. Say someone tried the shorter passwords first to get a like 0.001% advantage (however pointless that is) and then find out none of the passwords they checked could of been correct.

A short hash is most consistent for speed up (each bit gives a 2x speed up) vs lengths which are dependent on the specific length:

Speed increase given password length

Speed increase given word length

An added benefit to giving a short hash is it forces people to attack one hash at a time as there is no easy way to tell HashCat to try that password only on a specific salt. If you do give a short hash start at 1 bit and increase by 1 bit each month. Note some people might not of built in a resume feature. I think most are doing something like "./customScript | hashcat ..."

CyberSpaceCowboy · July 2018

@Sc00bz I'm not sure I understand what you mean by a short hash? Are you saying they should redistribute the entire challenge with a new method? (because that would kinda defeat the ability to acquire meaningful information of how this was done and how it relates to 1Password's actual security) It seems a little unreasonable to assume of the few people doing this, word size played a factor in the custom script used to pipe in guess pairs. Chances are an alphabetical or entropic pattern is being implemented for simplicity, so length only would add a "throw out" condition to people's custom scripts.

Sc00bz · July 2018

Short hash: "When the password for challenge ID is simply hashed once, the first N bits of the hash are B"

The main problem for giving a length is that you can't control the speed up. Also if they look at the passwords and determine that it gives too much of a speed up then that gives us extra info. I would say that they should not give full password length as there is a 52.8% chance that at least one will get a speed up of at least 27.35x and 70.2% chance for at least 17.56x. For length of the first word there is an 18.1% chance for at least one getting a 35.59x speed up or 81.9% chance for all to have at most a 8.21x speed up.

With the 8.21x increase in speed it will take on average 7.62 GPU-months (GPU being GTX 1080 Ti). Doing a short hash you can initially give 1 bit then give another bit then another and so on. I think 2 bits (4x speed up) should be enough for this to be cracked in a month. Assuming everyone checks passwords in random enough order* to limit the testing overlap between people. Also assuming everyone doesn't restart from scratch. Unless they are not properly generating passwords.

* Don't choose random passwords from the keyspace but randomize the keyspace and check those without rechecking passwords. Taking the word list and doing a shuffle is fine. There are better ways but at least do that. If you are doing random passwords stop and shuffle the word list and check sequentially with your shuffled word list.

jpgoldberg · July 2018

Hey @sc00bz, thanks for joining the conversation here. You did indeed tell me on Twitter what gains we should expect by going after multiple challenges. But you had previously given an estimate that was dead wrong, and which I embarrassingly repeated above without checking. So now I want to be sure that I fully understand the math before repeating it. I am more than ready to defer to your expertise, but I want to make sure that I understand things before I repeat them.

I agree that the short hash is the only (proposed) hint that will improve all passwords uniformly (assuming that they do vary in length, I'm not going to check unless I have to.) And so if we do hints it probably will be the short hash. I'm not sure whether 1 bit or 2 would be appropriate to start with.

And thank you @CyberSpaceCowboy for reassuring me that if we do go with hints around things like length, it isn't going to mess things up for hashcat users. Your explanation about the use of distinct salts is what I expected, but I didn't know if there was some trick that hashcat could use that I hadn't anticipated. Distinct salts are part of what we are trying to model.

As you both can see, I am not an expert in the lower level aspects of cracking, which is why we've been trying to incentivize you all to do the work for us! We just need to make sure that the incentives are actually worth your while. So whether that will be through hints or through prize increases remains to be seen. But we do want to be fair to those teaching us what we needed to know.

Sc00bz · July 2018

But you had previously given an estimate that was dead wrong, and which I embarrassingly repeated above without checking.

Are you talking about this?

From my tweet "18328^3/2/(240000*7)/3600/24 days":
singleHashTriesFor50Percent = 18328^3/2
singleHashSpeed = 1,680,000 = 240,000*7 ******
secondsToDays = 1/3600/24

The only thing I got wrong was assuming there's little to no difference between attacking one hash or all seven.

"18328^3/2/(240000*7)/3600/24 days [to find one]" is wrong *
"18328^3*0.094276/240000/3600/24 days to find one" is correct

* for seven hashes

Anyway you should only be cracking one hash and thus "18328^3/2/(240000*7)/3600/24" is correct.

****** Just found out Hashcat switched how they report salted speed. They report "single salt speed" (... or what is known as hashes per second) 240,000. So real speed with 7 hashes is 34,000 (passwords per second). This was switched about 2 years ago and I just never noticed. I think this was changed because of the AM dump of millions of bcrypt hashes. I remember people asking why the cracking speed is 0.

I should of just took these benchmarks and divided PBKDF2-HMAC-SHA256 by 100: 151,871 H/s. Oh that's what I did here.

I'll re-explain this:

You have two, six sided dice. What is the probability of rolling and at least one die is a one?
Not rolling a one is 5/6 and doing that twice is (5/6)^2. So rolling at least one, one is 1-(5/6)^2. In general it is 1-(1-p)^n. p is probability and n is number of times.

Let's say p = 0.0942763 and n = 7
0.500000 = 1 - (1 - 0.0942763) ^ 7

For cracking a uniform random key from a key space, the probability of guessing correctly is guesses/keySpace
p = 0.0942763
guessesPerSalt = 0.0942763 * keySpace
totalGuesses = n * guessesPerSalt = 7 * 0.0942763 * keySpace = 0.6599341 * keySpace = 4,062,990,000,000

Compare that to p = 0.5 and n = 1
0.5 = 1 - (1 - 0.5) ^ 1
guessesPerSalt = 0.5 * keySpace
totalGuesses = n * guessesPerSalt = 1 * 0.5 * keySpace = 0.5 * keySpace = 3,078,330,411,776

All seven vs one to get a 50% chance of cracking:
4,062,990,000,000 vs 3,078,330,411,776

Basically would you rather roll six, six sided dice and if any are a one, you win OR roll one, six sided die and you win.

(assuming that they do vary in length, I'm not going to check unless I have to.)

That's a 1 in 176,905 chance for password length to be all the same and 1 in 9,669 chance for first word length to be all the same. You can safely assume they vary.

analogist · July 2018

@Sc00bz You're absolutely right. I don't think I can edit my comment now unfortunately. In retrospect, it makes no sense to do intercept crossings on PDFs. Only CDFs map to outcomes.

Just to convince myself numerically,

the avg of 100k uint32s:
$ ./hcatsim | head -n 100000 |computestat
Average: 2148127802.649780.

the avg of 100k of the smallest of 7 uint32s:
./hcatsim | tail -n 100000 |computestat
Average: 534712113.780030.

That's a about a 4x speedup, but requires 7 times more effort.
So all-7 as compared to one-at-a-time is about 7/4 = 1.75x harder.

jpgoldberg · July 2018

@Sc00bz, the part that I am unsure about is where you put together

totalGuesses = n * guessesPerSalt = 7 * 0.0942763 * keySpace = 0.6599341 * keySpace = 4,062,990,000,000

I'm not challenging your arithmetic, but I'm not sure that that is the relevant computation.

I think that a better approach to this problem is to first ignore the big numbers. The seven passwords are chosen uniformly from a set of 18328^3. But we can ignore that number and talk about seven values chosen uniformly on [0, 1). Afterwards we will multiply out.

We know that if just one thing is chosen that way its expected value is 0.5. And the expected value of each item chosen is also 0.5. But what we want to know is if seven are chosen what is the expected value of the lowest one. That is how far you need to go through the whole set before you can expect to get one of the seven.

That is what I don't know how to do the math on¹ (but really should, because I think I've been taught that at least once), but is also what makes me think that @analogist is on the right track by talking about probability distributions.

And when don't know the math, I ask the question on Quora. ↩︎

jpgoldberg · July 2018

And Quora has come through for me (and told me that @Sc00bz is right.)

So if we are searching for one of seven uniformly selected passwords, then we should expect to find one after going through 1/(1 + 7) of the space. So we should expect to hit the first one after going through 1/8 of the space.

Note that this is not an eight times speedup, as when just searching for one, we only needed to go through half the space on average. So this is a four times speed up. That is the same value that @Sc00bz calculated. I didn't follow us argument, but he was right.

So we get a four times speed up by going after the the "first of seven, whichever that turns out to be", while dividing our effort over the seven. So Sc00bz's 7/4 is on target.

Conclusions

Going after all seven simultaneously does more harm than good. Better to just pick one of the seven (which is what I'd originally expected people to do) and shoot for that one.
@Sc00bz is usually right about stuff like this.

Sc00bz · July 2018

So Sc00bz's 7/4 is on target.

That was @analogist's target. I ran some tests after seeing his comment because I kind of forgot I could just do that:
"minSeven" is the minimum value of 7 random 32 bit integers (keySpace = 2 ^ 32). Average amount of work was 0.874480 * keySpace "average(7 * minSeven)" which is close to expected the 7/8 * keySpace. I was talking about the amount of work for a probability of cracking (vs average amount of work): minSeven < 404913779 ≈ (1 - e ^ (ln(1 - 50%) / 7)) * keySpace ≈ 0.0942763 * keySpace happened 50.0373% of the time which is close to the expected 50%. For this I generated a million "minSeven" with a CSPRNG. I feel like this thread would of been shorter if I thought to simulate this several comments ago.