The Ashley Madison breach has gotten a great deal of attention because of the circumstances and nature of the data breached. For a number of users, the database also contains the chosen security question and the associated answer, from which we might get some insight into user behavior.
At the outset, we must note that the Ashley Madison dataset is far from a scientific sample: there were a number of test accounts, and other accounts were created by someone other than the purported owner. The results reported below should be viewed as uncorroborated, at best.
Here is the distribution of security questions in the approximately 32 million record database:
| # | Security question | Accounts | Share of all records |
|---|---|---|---|
| 1 | Mother’s maiden name | 1,614,785 | 5.05% |
| 2 | High school name | 1,054,227 | 3.30% |
| 4 | Last 4 digits SSN | 313,597 | 0.98% |
As pointed out previously, the questions varied in some other countries. For example, in Norway the school prompt was for “middle school”, and the prompt for social security number was instead for national ID number.
Unlike at many sites employing security questions, they were apparently not mandatory at Ashley Madison, and they were not very popular: only about 13% of accounts used them. There were only 4 different prompts, far fewer than the 10 or more provided at most other sites.
One might ask how often users enter false data, either out of frustration or to deliberately obscure their answers. The easiest category to examine is the Social Security Number (SSN) category. In the US the last 4 digits of the SSN are uniformly distributed or nearly so; this isn’t the case everywhere, but we’ll ignore that effect for now.
The SSN category breaks down as follows:
| Answer type | Number of values | Number of answers |
|---|---|---|
| Other than 4 digits | 10,850 | 24,496 |
Obviously those whose answer wasn’t 4 digits weren’t answering truthfully. In some cases this was clearly intentional, but in others the user had evidently picked the wrong category (answering with a team name, for example). But what about the 4-digit answers? If the answers were uniformly distributed, one would expect an average of about 29 instances of each 4-digit answer. Not surprisingly, though, the distribution of answers resembled that of user-chosen PINs: the most frequent answer was “1234” (7667 instances), followed by “1111” (1801), “0000” (1461), and “6969” (1157). Years in the latter half of the 20th century were also substantially over-represented, as many members probably used their year of birth or another significant year. The most popular answer in this range was 1969 (134 instances); 58 of those accounts also listed a birth date in 1969.
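The figure of about 29 can be reproduced from the counts quoted above. The sketch below assumes that the 313,597 accounts choosing the SSN question are the same population as the breakdown by answer type, so that everything other than the 24,496 non-4-digit answers is a 4-digit answer; the variable names are illustrative only.

```python
# Counts taken from the tables and text above.
total_ssn_answers = 313_597       # accounts that chose the SSN question
non_four_digit_answers = 24_496   # answers that were not exactly 4 digits

# Assumption: every remaining answer is exactly 4 digits.
four_digit_answers = total_ssn_answers - non_four_digit_answers
possible_values = 10_000          # "0000" through "9999"

# Expected count per value if answers were uniformly distributed.
expected_per_value = four_digit_answers / possible_values

# Observed counts for the answers called out in the text.
observed = {"1234": 7667, "1111": 1801, "0000": 1461, "6969": 1157, "1969": 134}
overrepresentation = {
    answer: count / expected_per_value for answer, count in observed.items()
}

print(f"expected per value: {expected_per_value:.1f}")
for answer, ratio in overrepresentation.items():
    print(f"{answer}: {ratio:.1f}x the uniform expectation")
```

Under that assumption, “1234” alone turns up a few hundred times more often than a uniform distribution would predict, and even 1969 is several times over-represented.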
The other three questions are harder to analyze, but scanning through the answers gives some insight into user behavior. The distribution of mothers’ maiden names indicated that many users did answer truthfully (the most common being Smith, Jones, Brown, and Johnson), but some users seemed to answer with women’s names that are not common last names, such as Mary, Maria, and Mom(!). There were also variations in capitalization that would require normalization (converting to a consistent capitalization) if the answers were to be salted and hashed, as passwords should be.
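To make the normalization point concrete, here is a minimal sketch of how an answer might be folded to a consistent form before being salted and hashed. This is not how Ashley Madison stored answers; the function names, the use of PBKDF2 from Python’s standard library, and the iteration count are all assumptions chosen for illustration.

```python
import hashlib
import hmac
import os
import unicodedata
from typing import Optional, Tuple

ITERATIONS = 200_000  # illustrative work factor, not a recommendation


def normalize_answer(answer: str) -> str:
    """Fold case and collapse whitespace so 'SMITH ' and 'smith' compare equal."""
    text = unicodedata.normalize("NFKC", answer)
    return " ".join(text.casefold().split())


def hash_answer(answer: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) for the normalized answer."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per stored answer
    digest = hashlib.pbkdf2_hmac(
        "sha256", normalize_answer(answer).encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_answer(candidate: str, salt: bytes, digest: bytes) -> bool:
    """Re-hash the candidate with the stored salt and compare in constant time."""
    _, candidate_digest = hash_answer(candidate, salt)
    return hmac.compare_digest(candidate_digest, digest)


salt, stored = hash_answer("Smith")
print(verify_answer("  SMITH ", salt, stored))  # capitalization and spacing differ
print(verify_answer("Smyth", salt, stored))     # a different spelling
```

Normalization of this kind absorbs differences in capitalization and spacing, but a hashed answer can only be matched exactly, so a misspelling or an alternative form of the same name still fails.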
School names presented additional challenges. Common school names such as Central (7024 instances), Lincoln (2581), East (2445), West (2381), North (2336), and South (1980) were predominant, but so were initials such as BHS (2012 instances) and CHS (1597). It’s not clear how useful this question/answer would be, given that a user who forgot their password might think they answered “Central” while they actually answered CHS, Central High (432), Central Tech (417), or Central High School (185).
Favorite teams were similar in many respects to schools. There was considerably less diversity in the team name responses, making them easier to guess, particularly considering regional favorites. Yankees (28605 responses), Cowboys (26672), Steelers (20320), and Leafs (20206) were the most popular. Variations on team names were again prevalent, such as Maple Leafs (5558), Toronto Maple Leafs (4354), Mapleleafs (692), and The Leafs (301). There were also many spelling errors.
It’s hard to imagine how the “security” answers could be used for anything. Some of the questions, such as favorite team, were easily guessable, but there were enough possible variations that it was far from assured that the account owner could enter the correct answer. The last 4 digits of the Social Security Number or national ID number were easier to get right, but they are widely used for account verification at other sites: effectively a shared password. While Ashley Madison used a robust password hashing algorithm, it is easy to see why hashing wouldn’t work for security answers: there are many variations to be considered, and matching them probably requires a human customer service agent. Unfortunately, that customer service agent is likely to be vulnerable to social engineering attacks as well.
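The effect of these variations can be illustrated with the Leafs counts quoted above. The sketch below applies a simple, hypothetical normalizer (lowercase, drop a leading “the”, remove spaces) and groups the answers by the resulting key; the normalizer is an assumption for illustration, not something the site is known to have done.

```python
from collections import defaultdict

# Counts quoted in the text above.
leafs_variants = {
    "Leafs": 20206,
    "Maple Leafs": 5558,
    "Toronto Maple Leafs": 4354,
    "Mapleleafs": 692,
    "The Leafs": 301,
}


def canonical_key(answer: str) -> str:
    """Hypothetical normalizer: lowercase, drop a leading 'the', remove spaces."""
    words = answer.casefold().split()
    if words and words[0] == "the":
        words = words[1:]
    return "".join(words)


# Sum the counts of all variants that collapse to the same key.
groups = defaultdict(int)
for answer, count in leafs_variants.items():
    groups[canonical_key(answer)] += count

for key, count in sorted(groups.items(), key=lambda item: -item[1]):
    print(key, count)
```

Even after this mechanical clean-up, the same team is still spread across three distinct keys, so a user who remembers the team but not the exact form they typed could still fail an exact-match check.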