char

202407191538
Status: #idea
Tags: Java

char

Each code point is a Unicode scalar value. Supplementary characters are those which cannot be encoded in 16-bits (U+10000 to U+1FFFF)

Attention

Unicode code points range from U+0_0000 to U+1_FFFF as of now. However, their encodings can support till U+FFFF_FFFF.

Java uses UTF-16 to encode its characters.
Supplementary characters are encoded using 2 characters

Unicode encodings

A character is referred to as a unicode code point
Unicode has 3 encodings
- UTF-32 → Fixed length encoding of 32 bits
- UTF-16 → Variable length encoding in 16/32 bits
- UTF-8 → Variable length encoding in 8/16/24/32 bits

UTF-8

Pasted image 20240719143810.png

Notice that 1 byte code points start with $0$
2, 3, 4 byte code points start with 2, 3, 4 $1^{'} s$ respectively. The trailing bytes all start with $10$

Warning

1 byte has 1 fixed bit → 7 effective bits
2 bytes has 5 fixed bits → 11 effective bits
3 bytes has 8 fixed bits → 16 effective bits
4 bytes has 11 fixed bits → 21 effective bits

char

Unicode encodings

UTF-8

References