char
202407191538
Status: #idea
Tags: Java
char
- Each code point is a Unicode scalar value. Supplementary characters are those which cannot be encoded in 16-bits (
U+10000toU+1FFFF)
Attention
Unicode code points range from U+0_0000 to U+1_FFFF as of now. However, their encodings can support till U+FFFF_FFFF.
- Java uses UTF-16 to encode its characters.
- Supplementary characters are encoded using 2 characters
Unicode encodings
- A character is referred to as a unicode code point
- Unicode has 3 encodings
- UTF-32 → Fixed length encoding of 32 bits
- UTF-16 → Variable length encoding in 16/32 bits
- UTF-8 → Variable length encoding in 8/16/24/32 bits
UTF-8

- Notice that 1 byte code points start with
- 2, 3, 4 byte code points start with 2, 3, 4
respectively. The trailing bytes all start with
Warning
1 byte has 1 fixed bit → 7 effective bits
2 bytes has 5 fixed bits → 11 effective bits
3 bytes has 8 fixed bits → 16 effective bits
4 bytes has 11 fixed bits → 21 effective bits