in December 2009 via SQL injection and 32 M pass-
words were leaked. The site appears to have always
used UTF-8. At the time of the leak its home page was
available in English, Spanish, Portuguese, Chinese, and
Thai. No password requirements were in place at the
time of the compromise.
• porn.com: Porn.com is a pornographic website which
offers premium accounts. It was compromised in June
2011 as part of the “50 Days of Lulz” hacking incident,
with about 25,000 usernames, emails, passwords, and
names leaked. The site is available in English only and
uses UTF-8. Passwords were required to be at least
three characters long.
• www.nato.int/cps/en/natolive/e-bookshop.htm: The
NATO e-Bookshop is a website for ordering and
downloading official publications of NATO, the North
Atlantic Treaty Organisation. It was compromised
in June 2011 as part of the “50 Days of Lulz”
hacking incident with about 10,000 usernames, emails,
passwords, and names leaked. The site is available
in English, French, Russian and Ukrainian and uses
UTF-8. No password requirements were observed.
• wonder-tree.com: Wonder-Tree is a small religious
meditation website. In May 2011 the site’s adminis-
trators accidentally made a text file of about 1,000
usernames, emails, passwords, and postal addresses
publicly accessible. The site is primarily in Hebrew
with some English translation. Interestingly, the site’s
home page uses the Windows-1255 (Hebrew) encoding
while the leaked data was entirely encoded in UTF-8.
• 70yx.com: 70yx is an online gaming website. A list of
10 million usernames and passwords was leaked as
part of a major Chinese hacking incident in December
2011. 70yx is available in Chinese only and uses
GB2312. It allows passwords of 6–16 characters from
the ASCII subset only.
• csdn.net: CSDN is a Chinese-language forum site for
software developers. A database of 6 million passwords,
claimed to be a backup of the account database from 2009,
was leaked in the same 2011 incident as 70yx.
CSDN is available in Chinese only. It uses UTF-8 and
imposes a 5-character minimum on passwords.
B. Types of characters chosen by users
In Table I, we provide an overview of the frequencies of
different character classes within our available data. Note
that all of our data sets were in effect encoded in UTF-
8, with the one site not using UTF-8 (70yx) restricting
users to ASCII characters for which the GB2312 encoding
is identical.
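The claim that GB2312 and UTF-8 coincide on ASCII input can be checked with a minimal sketch such as the following. It assumes Python's built-in `gb2312` codec is an acceptable stand-in for the encoding used by the site; the function name and the sample strings are illustrative and not taken from the data.

```python
def same_bytes_in_gb2312_and_utf8(password: str) -> bool:
    """Return True if the password encodes to identical bytes in both encodings."""
    try:
        return password.encode("gb2312") == password.encode("utf-8")
    except UnicodeEncodeError:
        # The password contains a character that GB2312 cannot represent.
        return False


# All printable ASCII characters (hypothetical test input).
ascii_printable = "".join(chr(c) for c in range(0x20, 0x7F))

print(same_bytes_in_gb2312_and_utf8(ascii_printable))  # True
print(same_bytes_in_gb2312_and_utf8("密码"))            # False
```

Because 70yx restricts passwords to ASCII, every stored password falls into the first case, which is why the data set can be treated as UTF-8.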
1) Malformed passwords: The RockYou data set was
the only one in which we observed passwords which were
not well-formed UTF-8 strings. There were only 256 such
passwords out of over 32M. It is not possible to determine
the original encoding conclusively, particularly for short strings,
since some encodings accept every byte string as valid. A manual
inspection suggested that most of the non-UTF-8 passwords
were ISO 8859-n variants, particularly ISO 8859-1. This
might be due to non-compliant browsers submitting this
encoding despite the character set specified by the page.
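A well-formedness check of this kind can be sketched as follows. This is an illustration under our own assumptions, not the procedure used to produce the counts above; the function names and the sample bytes are hypothetical. It also shows why the encoding cannot be determined conclusively: ISO 8859-1 assigns a character to every byte, so decoding with it never fails.

```python
def is_well_formed_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_as_latin1(raw: bytes) -> str:
    """Decode as ISO 8859-1; this never fails, since every byte maps to a character."""
    return raw.decode("iso-8859-1")


suspect = b"caf\xe9"  # hypothetical password bytes, not taken from the data
print(is_well_formed_utf8(suspect))  # False
print(decode_as_latin1(suspect))     # café
```

A successful ISO 8859-1 decoding is therefore only a plausible reading of the bytes, which is why manual inspection is still needed.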
2) Non-ASCII passwords: The vast majority of the pass-
words observed in all data sets consisted only of characters
in the traditional ASCII subset of UTF-8 (technically called
the Basic Latin block). The Wonder-Tree data set contained
about 2.5% passwords using characters outside of this range,
in all but one case using the Hebrew block of Unicode.⁸
Interestingly, 87.5% of the Wonder-Tree users did use Hebrew
characters in their username, indicating that the majority of
users actively decided not to use Hebrew characters in their
passwords despite Hebrew being their preferred language.
All other data sets had fewer than 0.1% of passwords
containing non-ASCII characters. Still, the RockYou data set
had 18,031 such passwords. Of these, 58.1% only included
characters from the ISO-8859-1 (Latin-1) character set,
mostly consisting of accented Latin letters. The remainder
included a wide mixture, including Cyrillic, Greek, Hebrew,
Arabic, Chinese, Japanese, and Korean, indicating that some
tiny percentage of users do choose to use passwords in their
native writing system.
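The breakdown above can be approximated with a short sketch like the one below. It is a rough heuristic under stated assumptions rather than the method used for the reported figures: it treats the first word of each character's Unicode name (for example `CYRILLIC` or `HEBREW`) as a proxy for its script, and all function names are our own.

```python
import unicodedata


def is_ascii(password: str) -> bool:
    """True if every character lies in the Basic Latin block (U+0000 to U+007F)."""
    return all(ord(ch) < 0x80 for ch in password)


def is_latin1_only(password: str) -> bool:
    """True if every character is representable in ISO 8859-1 (U+0000 to U+00FF)."""
    return all(ord(ch) < 0x100 for ch in password)


def non_ascii_scripts(password: str) -> set:
    """Rough script labels for the non-ASCII characters, from their Unicode names."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in password
        if ord(ch) >= 0x80
    }


print(is_latin1_only("contraseña"))  # True
print(non_ascii_scripts("пароль1"))  # {'CYRILLIC'}
```

Passwords for which `is_ascii` is false but `is_latin1_only` is true correspond to the Latin-1 group described above; the remainder can be grouped by the labels that `non_ascii_scripts` returns.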
Interestingly, in the CSDN data set only a handful of the
non-ASCII passwords actually contained Chinese characters,
with roughly ten times more passwords consisting exclusively
of the black circle character ●, which may be an
artifact of a buggy password manager or due to misguided
copying of discreetly rendered passwords.
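Such passwords are easy to flag automatically. The sketch below assumes the character in question is U+25CF BLACK CIRCLE; since some interfaces mask input with U+2022 BULLET or an asterisk instead, the set of masking characters is a hypothetical parameter rather than something established by the data.

```python
BLACK_CIRCLE = "\u25cf"  # U+25CF BLACK CIRCLE (assumed to be the character observed)


def is_masking_artifact(password: str, mask_chars=frozenset({BLACK_CIRCLE})) -> bool:
    """True if the password is non-empty and made up only of masking characters."""
    return len(password) > 0 and all(ch in mask_chars for ch in password)


print(is_masking_artifact(BLACK_CIRCLE * 8))  # True
print(is_masking_artifact("密码123"))          # False
```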
3) ASCII character preferences: Within the large major-
ity of exclusively-ASCII passwords, what is most striking
is a preference towards numeric-only passwords in the non-
English language data sets, as seen in Table I. Fewer than
16% of users at RockYou used only digits in their passwords
while 45–48% of users in the Chinese data sets did so.
The Hebrew users at Wonder-Tree were in the middle with
38% of passwords being numeric-only. Similarly, 53–60%
of users in the predominantly English data sets included at
least one number in their passwords, compared to 65% of
the Hebrew speakers and 87–90% of Chinese speakers.
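The two statistics compared here, the share of numeric-only passwords and the share containing at least one digit, can be computed as in the following sketch. It is illustrative only: the function names and the four-password sample are hypothetical, and only the ASCII digits 0 to 9 are counted, since `str.isdigit()` would also accept digits from other scripts.

```python
import string

ASCII_DIGITS = frozenset(string.digits)  # "0" through "9" only


def is_numeric_only(password: str) -> bool:
    """True if the password is non-empty and consists solely of ASCII digits."""
    return len(password) > 0 and all(ch in ASCII_DIGITS for ch in password)


def contains_digit(password: str) -> bool:
    """True if the password contains at least one ASCII digit."""
    return any(ch in ASCII_DIGITS for ch in password)


def digit_usage(passwords: list) -> dict:
    """Fraction of passwords that are numeric-only and that contain a digit."""
    total = len(passwords)
    if total == 0:
        return {"numeric_only": 0.0, "contains_digit": 0.0}
    return {
        "numeric_only": sum(map(is_numeric_only, passwords)) / total,
        "contains_digit": sum(map(contains_digit, passwords)) / total,
    }


sample = ["123456", "password", "abc123", "iloveyou"]  # hypothetical sample
print(digit_usage(sample))  # {'numeric_only': 0.25, 'contains_digit': 0.5}
```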
A simple hypothesis to explain this phenomenon is that
the Hindu-Arabic numeral system (the written digits 0, 1, . . . , 9)
is commonly used in both written Hebrew and Chinese,⁹
making the digits more familiar for users with
low fluency in English or other languages using the Latin
alphabet. For Hebrew speakers, there is a second advantage
⁸ The lone exception was a password which was entirely ASCII except
for the special UTF-8 non-printing character which marks a change from
left-to-right to right-to-left text rendering, which may have been included
due to a copy-and-paste error.
⁹ Both Hebrew and Chinese have separate numeral systems which are
used for ceremonial or historical purposes, but Hindu-Arabic numerals are
almost always used for practical applications.