Unicode Normalisation Vulnerability

Normalization ensures two strings that may use a different binary representation for their characters have the same binary value after normalization.

Normalisation algorithms must be idempotent
Convert strings to canonical form so that it is standardised

There are two overall types of equivalence between characters:

Canonical Equivalence: characters are assumed to have the same appearance and meaning when printed or displayed.
Compatibility Equivalence: is a weaker equivalence, in that two values may represent the same abstract character but can be displayed differently.

There are 4 Normalization algorithms defined by the Unicode standard:

NFD: Normalization Form Canonical Decomposition: Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
NFC: Normalization Form Canonical Composition: Characters are decomposed and then recomposed by canonical equivalence.
NFKD: Normalization Form Compatibility Decomposition: Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
NFKC: Normalization Form Compatibility Composition: Characters are decomposed by compatibility, then recomposed by canonical equivalence.

Case Study

Spotify account hijacking:

Target user account is bigbird
Create new account "ᴮᴵᴳᴮᴵᴿᴰ", Python string u'\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30'
Send a request for a password reset for your new account.
A password reset link is sent to the email you registered for your new account. Use it to change the password.
Now, instead of logging in to account with username "ᴮᴵᴳᴮᴵᴿᴰ", try logging in to account with username bigbird with the new password.
Success! Mission accomplished.

Spotify followed simple rules for usernames:

not allow white space in usernames,
treat "BigBird" and "bigbird" as the same username.

Creative usernames and Spotify account hijacking

JavaScript

ECMAScript-5 has problems with characters present in astral planes

javascript

'mañana' == 'mañana'
// false

'ma\xF1ana' == 'man\u0303ana'
// false

'ma\xF1ana'.length
// 6

'man\u0303ana'.length
// 7


// match foo + one character + bar
/foo.bar/.test('foo💩bar')
// false

/foo.bar/.test('foogbar')
// true


// doesn't match line breaks, either
/^.$/.test('💩')
// false

// matches line breaks, but
// still doesn't match whole astral symbols
/^[\s\S]$/.test('💩')
// false


/^[\0-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]$/.test('💩')
// true

JavaScript Unicode

MySQL vs. UTF-8

sql

CREATE TABLE `table_name` (
    `id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
    `column_name` VARCHAR(255) NOT NULL DEFAULT '',
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Try adding unicode character

sql

UPDATE table_name SET column_name = 'foo💩bar' WHERE id = 9001;
-- Query OK, 1 row affected, 1 warning

SHOW WARNINGS;
--
-- | Level   | Code | Message                                                                       |
-- |---------|------|-------------------------------------------------------------------------------|
-- | Warning | 1366 | Incorrect string value : '\xF0\x9F\x92\xA9' for column 'column_name' at row 1 |

SELECT column_name FROM table_name WHERE id = 9001;
-- foo

Mitigation

Use the new UTF-8 implementation in MySQL which is utf8mb4
mb4: Multi-Byte-4

sql

ALTER TABLE `table_name`
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;

C-Sharp

Case mapping:

// Case mapping use these
Regex.IsMatch(input, regex, RegexOptions.IgnoreCase);
Regex.IsMatch("hac\u212A", "HACK", RegexOptions.IgnoreCase) == true

string.Equals(input1, input2, StringComparison.CurrentCultureIgnoreCase);

new CultureInfo(..).CompareInfo.Compare(input1, input2, CompareOptions.IgnoreNonSpace)

// instead of
strings.ToLower(input);
strings.ToUpper(input);

Normalization:

input.Normalize(NormalizationForm.FormC);

input.Normalize(NormalizationForm.FormKC);

Others:

IdnMapping().GetAscii(input)

// Use these of URL
Uri(input).Host /
Uri(input).IdnHost /
Uri(input).SafeDnsHost

// Example
new Uri("http://faceboo\u212A.com").Host == "facebook.com"

Python

Case mapping:

python

input.lower()
input.upper()
re.compile(regex, re.IGNORECASE).match(input)
input.casefold()

Normalization:

python

unicodedata.normalize('NFC' input)

unicodedata.normalize('NFKC', input)

Others:

python

urllib.parse.urlparse(input).hostname

// Example
import urllib.parse

urllib.parse.urlparse("http://i\u006bea.com").hostname ==
urllib.parse.urlparse("http://ikea.com").hostname

Domain Names

IDN Domains (Internationalized Domain Name)
2009
Stored as ASCII strings using Punycode transcription
No changes to the DNS system needed
Checkout notes about URL here: URL Notes Link

RIGHT-TO-LEFT OVERRIDE

Unicode Character 'RIGHT-TO-LEFT OVERRIDE' (U+202E)

ruby

ruby -e 'File.rename("backdoor_ppt.exe", "resume\xe2\x80\xaetpp.exe")'

ASCII Encoding

American Standard Code for Information Interchange (ASCII), is a character encoding standard for electronic communication.

1963
7-Bit character set
- only 128 characters
- 0000000 - 1111111
A: 65 = 41 (hex) = 1000001 (binary)
A: 97 = 61 (hex) = 1100001 (binary)

ISO-8856-X

ASCII compatible
8-Bit character set
- 256 characters
- 00000000 - 11111111
8859-2: (Central Europe)

Unicode Encoding

It is a standard for consistent encoding
Each character or symbol is mapped to a numerical value which is referred to as a code point.
It's like a database which gives the relationship between a code point to a character
1991
MultiByte character set
Fully ASCII and ISO-8859 compatible
Different encodings (UTF-8, UTF-16, UTF-32, EBCDIC, ...)
Current version: 12 (~137,000 characters)
U+0000 - U+10FFFF
U+0000 - U+007F: ASCII
U+0080 - U+00FF: ISO
U+0000 - U+FFFF (BMP: Basic Multilingual Plane) = 65536 characters
U+010000 - U+10FFFF (Astral Planes) = Over a million

Encoding

The term encoding means the method in which characters are represented as a series of bytes.

UTF-8 can represent ASCII characters using 1 byte or up to4 bytes for other characters
UTF-16 uses a minimum of 2 bytes, up to 4 bytes
UTF-32 uses 4 bytes for all characters
Western languages are more efficient with UTF-8
Asian languages are efficient with UTF-16
OS must support these standards for applying right encoding
File written with UTF-8 encoding should only be decoded with UTF-8

Unicode Normalisation Vulnerability ​

Case Study ​

JavaScript ​

MySQL vs. UTF-8 ​

Mitigation ​

C-Sharp ​

Python ​

Domain Names ​

RIGHT-TO-LEFT OVERRIDE ​

ASCII Encoding ​

ISO-8856-X ​

Unicode Encoding ​

Encoding ​

Reference ​

Unicode Normalisation Vulnerability

Case Study

JavaScript

MySQL vs. UTF-8

Mitigation

C-Sharp

Python

Domain Names

RIGHT-TO-LEFT OVERRIDE

ASCII Encoding

ISO-8856-X

Unicode Encoding

Encoding

Reference