Unicode Normalisation Vulnerability
Normalization ensures two strings that may use a different binary representation for their characters have the same binary value after normalization.
- Normalisation algorithms must be idempotent
- Convert strings to canonical form so that it is standardised
There are two overall types of equivalence between characters:
Canonical Equivalence: characters are assumed to have the same appearance and meaning when printed or displayed.
Compatibility Equivalence: is a weaker equivalence, in that two values may represent the same abstract character but can be displayed differently.
There are 4 Normalization algorithms defined by the Unicode standard:
NFD: Normalization Form Canonical Decomposition: Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
NFC: Normalization Form Canonical Composition: Characters are decomposed and then recomposed by canonical equivalence.
NFKD: Normalization Form Compatibility Decomposition: Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
NFKC: Normalization Form Compatibility Composition: Characters are decomposed by compatibility, then recomposed by canonical equivalence.
Case Study
Spotify account hijacking:
- Target user account is bigbird
- Create new account "ᴮᴵᴳᴮᴵᴿᴰ", Python string u'\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30'
- Send a request for a password reset for your new account.
- A password reset link is sent to the email you registered for your new account. Use it to change the password.
- Now, instead of logging in to account with username "ᴮᴵᴳᴮᴵᴿᴰ", try logging in to account with username bigbird with the new password.
- Success! Mission accomplished.
Spotify followed simple rules for usernames:
- not allow white space in usernames,
- treat "BigBird" and "bigbird" as the same username.
Creative usernames and Spotify account hijacking
JavaScript
- ECMAScript-5 has problems with characters present in astral planes
'mañana' == 'mañana'
// false
'ma\xF1ana' == 'man\u0303ana'
// false
'ma\xF1ana'.length
// 6
'man\u0303ana'.length
// 7
// match foo + one character + bar
/foo.bar/.test('foo💩bar')
// false
/foo.bar/.test('foogbar')
// true
// doesn't match line breaks, either
/^.$/.test('💩')
// false
// matches line breaks, but
// still doesn't match whole astral symbols
/^[\s\S]$/.test('💩')
// false
/^[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]$/.test('💩')
// true
MySQL vs. UTF-8
CREATE TABLE `table_name` (
`id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`column_name` VARCHAR(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
- Try adding unicode character
UPDATE table_name SET column_name = 'foo💩bar' WHERE id = 9001;
-- Query OK, 1 row affected, 1 warning
SHOW WARNINGS;
--
-- | Level | Code | Message |
-- |---------|------|-------------------------------------------------------------------------------|
-- | Warning | 1366 | Incorrect string value : '\xF0\x9F\x92\xA9' for column 'column_name' at row 1 |
SELECT column_name FROM table_name WHERE id = 9001;
-- foo
Mitigation
- Use the new UTF-8 implementation in MySQL which is
utf8mb4
mb4
: Multi-Byte-4
ALTER TABLE `table_name`
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
C-Sharp
Case mapping:
// Case mapping use these
Regex.IsMatch(input, regex, RegexOptions.IgnoreCase);
Regex.IsMatch("hac\u212A", "HACK", RegexOptions.IgnoreCase) == true
string.Equals(input1, input2, StringComparison.CurrentCultureIgnoreCase);
new CultureInfo(..).CompareInfo.Compare(input1, input2, CompareOptions.IgnoreNonSpace)
// instead of
strings.ToLower(input);
strings.ToUpper(input);
Normalization:
input.Normalize(NormalizationForm.FormC);
input.Normalize(NormalizationForm.FormKC);
Others:
IdnMapping().GetAscii(input)
// Use these of URL
Uri(input).Host /
Uri(input).IdnHost /
Uri(input).SafeDnsHost
// Example
new Uri("http://faceboo\u212A.com").Host == "facebook.com"
Python
Case mapping:
input.lower()
input.upper()
re.compile(regex, re.IGNORECASE).match(input)
input.casefold()
Normalization:
unicodedata.normalize('NFC' input)
unicodedata.normalize('NFKC', input)
Others:
urllib.parse.urlparse(input).hostname
// Example
import urllib.parse
urllib.parse.urlparse("http://i\u006bea.com").hostname ==
urllib.parse.urlparse("http://ikea.com").hostname
Domain Names
- IDN Domains (Internationalized Domain Name)
- 2009
- Stored as ASCII strings using Punycode transcription
- No changes to the DNS system needed
- Checkout notes about URL here: URL Notes Link
RIGHT-TO-LEFT OVERRIDE
Unicode Character 'RIGHT-TO-LEFT OVERRIDE' (U+202E)
ruby -e 'File.rename("backdoor_ppt.exe", "resume\xe2\x80\xaetpp.exe")'
ASCII Encoding
American Standard Code for Information Interchange (ASCII), is a character encoding standard for electronic communication.
1963
7-Bit character set
- only 128 characters
- 0000000 - 1111111
A: 65 = 41 (hex) = 1000001 (binary)
A: 97 = 61 (hex) = 1100001 (binary)
ISO-8856-X
ASCII compatible
8-Bit character set
- 256 characters
- 00000000 - 11111111
8859-2: (Central Europe)
Unicode Encoding
It is a standard for consistent encoding
Each character or symbol is mapped to a numerical value which is referred to as a code point.
It's like a database which gives the relationship between a code point to a character
1991
MultiByte character set
Fully ASCII and ISO-8859 compatible
Different encodings (UTF-8, UTF-16, UTF-32, EBCDIC, ...)
Current version: 12 (~137,000 characters)
U+0000
-U+10FFFF
U+0000
-U+007F
: ASCIIU+0080
-U+00FF
: ISOU+0000
-U+FFFF
(BMP: Basic Multilingual Plane) = 65536 charactersU+010000
-U+10FFFF
(Astral Planes) = Over a million
Encoding
The term encoding means the method in which characters are represented as a series of bytes.
UTF-8 can represent ASCII characters using 1 byte or up to4 bytes for other characters
UTF-16 uses a minimum of 2 bytes, up to 4 bytes
UTF-32 uses 4 bytes for all characters
Western languages are more efficient with UTF-8
Asian languages are efficient with UTF-16
OS must support these standards for applying right encoding
File written with UTF-8 encoding should only be decoded with UTF-8