Representing Text

A character represents a single symbol on the screen. The basic idea is that we assign each symbol (letter, digit, etc.) we want to represent a number, and then encode that number as a suitably sized integer. Historically, due largely to US influence, the encoding which was used is ASCII. ASCII is capable of representing the English alphabet (UPPER and lower case), some punctuation symbols, some control characters (NUL, BEL, etc…), and numbers. There are 128 (2^7) symbols which can be represented, shown in the table below:

Dec | Char | Dec | Char | Dec | Char | Dec | Char | Dec | Char | Dec | Char | Dec | Char | Dec | Char |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NUL | 16 | DLE | 32 | SP | 48 | 0 | 64 | @ | 80 | P | 96 | ` | 112 | p |
1 | SOH | 17 | DC1 | 33 | ! | 49 | 1 | 65 | A | 81 | Q | 97 | a | 113 | q |
2 | STX | 18 | DC2 | 34 | " | 50 | 2 | 66 | B | 82 | R | 98 | b | 114 | r |
3 | ETX | 19 | DC3 | 35 | # | 51 | 3 | 67 | C | 83 | S | 99 | c | 115 | s |
4 | EOT | 20 | DC4 | 36 | $ | 52 | 4 | 68 | D | 84 | T | 100 | d | 116 | t |
5 | ENQ | 21 | NAK | 37 | % | 53 | 5 | 69 | E | 85 | U | 101 | e | 117 | u |
6 | ACK | 22 | SYN | 38 | & | 54 | 6 | 70 | F | 86 | V | 102 | f | 118 | v |
7 | BEL | 23 | ETB | 39 | ' | 55 | 7 | 71 | G | 87 | W | 103 | g | 119 | w |
8 | BS | 24 | CAN | 40 | ( | 56 | 8 | 72 | H | 88 | X | 104 | h | 120 | x |
9 | HT | 25 | EM | 41 | ) | 57 | 9 | 73 | I | 89 | Y | 105 | i | 121 | y |
10 | LF | 26 | SUB | 42 | * | 58 | : | 74 | J | 90 | Z | 106 | j | 122 | z |
11 | VT | 27 | ESC | 43 | + | 59 | ; | 75 | K | 91 | [ | 107 | k | 123 | { |
12 | FF | 28 | FS | 44 | , | 60 | < | 76 | L | 92 | \ | 108 | l | 124 | \| |
13 | CR | 29 | GS | 45 | - | 61 | = | 77 | M | 93 | ] | 109 | m | 125 | } |
14 | SO | 30 | RS | 46 | . | 62 | > | 78 | N | 94 | ^ | 110 | n | 126 | ~ |
15 | SI | 31 | US | 47 | / | 63 | ? | 79 | O | 95 | _ | 111 | o | 127 | DEL |
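To make "a character is just a number" concrete, here is a minimal C sketch (assuming an ASCII-compatible system); the specific values are only illustrative:

```c
#include <stdio.h>

int main(void) {
    char c = 'A';               /* on an ASCII system, 'A' is just the number 65 */
    printf("%c = %d\n", c, c);  /* prints: A = 65 */
    printf("%c\n", c + 1);      /* 66 is the next code, so this prints: B */
    return 0;
}
```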
ASCII does have one major drawback: it cannot represent lots of symbols that are required when writing in other languages (e.g. French, Arabic) and does not have all the symbols which people need to use in everyday life outside of the US (you can’t represent £ or € in ASCII).
There are two ways of dealing with this:
- Map different symbols to the same number (e.g. code pages)
- Try to come up with one set of numbers which maps to every possible character in use (Unicode).
(Footnote: describing a character as a 'single symbol on the screen' is actually simplifying a bit, but we’ll not worry about this here.)
Some encodings make use of the remaining 128 values of the byte (ASCII only needs 7 of its 8 bits) to represent characters which are otherwise not possible to display, for example extended ASCII. There were also different encodings developed which allow these values to be remapped depending on the needs of a particular region, assuming the encoding is known.
Strings
A string is a sequence of characters (bytes) which represents a word or other piece of text. You can conceptually think of these as arrays of characters. To detect the end of the string, many programming languages include a special NUL (0) byte at the end of the string. This is called the null terminator.
For example, to represent 'Hello' as an (ASCII) string:
Index | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
Letter | H | e | l | l | o | NUL |
Decimal | 72 | 101 | 108 | 108 | 111 | 0 |
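To see this layout in action, here is a small C sketch (C strings are null-terminated, matching the table above) which walks the bytes of "Hello" until it hits the NUL terminator:

```c
#include <stdio.h>

int main(void) {
    const char *s = "Hello";    /* the compiler appends the NUL terminator for us */

    /* Walk the string one byte at a time until we hit the 0 byte. */
    for (int i = 0; s[i] != '\0'; i++) {
        printf("index %d: '%c' = %d\n", i, s[i], s[i]);
    }
    return 0;
}
```

Running this prints each index, letter and decimal value, matching the table.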
Unicode
Unicode is an attempt to assign every symbol its own unique number (a code point). There are plenty of other values (e.g. emoji, control codes, zero-width joiners) which are also assigned numbers. However, this presents a few problems:
- If every single character must be represented (and there are possibly millions of characters), how do we do this without making our files MASSIVE?
- There are a lot of pre-Unicode documents out there; how do we ensure we can still read these documents easily?
- How do we deal with symbols which look the same but might have different meanings (without causing offence)?
- How do we know when one symbol begins and another ends (is this document an ASCII document where each byte is a letter, or a multi-byte file where every 4 bytes is a letter)?
There are many Unicode encoding standards which are in use today. My favourite is UTF-8, although UTF-16 is also a popular choice.
UTF-8
There are many different Unicode encodings. UTF-8 is variable-width, which means the number of bytes required to represent a letter can vary based on the letter being represented. The prefix of the first byte (the number of 1 bits before the first 0) tells you how many bytes make up the letter, and each continuation byte starts with 10:
bytes | 0 | 1 | 2 | 3 |
---|---|---|---|---|
1 byte | 0xxxxxxx | N/A | N/A | N/A |
2 bytes | 110xxxxx | 10xxxxxx | N/A | N/A |
3 bytes | 1110xxxx | 10xxxxxx | 10xxxxxx | N/A |
4 bytes | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
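As a rough C sketch of the table above (not a production implementation: it skips checks such as rejecting surrogate code points, and the function name utf8_encode is just an illustrative choice), a single code point can be packed into UTF-8 bytes like this:

```c
#include <stdio.h>
#include <stdint.h>

/* Encode one code point into buf, returning the number of bytes written
   (1-4), or 0 if the code point is outside the Unicode range. */
int utf8_encode(uint32_t cp, uint8_t *buf) {
    if (cp <= 0x7F) {                 /* 1 byte:  0xxxxxxx */
        buf[0] = (uint8_t)cp;
        return 1;
    } else if (cp <= 0x7FF) {         /* 2 bytes: 110xxxxx 10xxxxxx */
        buf[0] = (uint8_t)(0xC0 | (cp >> 6));
        buf[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = (uint8_t)(0xE0 | (cp >> 12));
        buf[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {      /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        buf[0] = (uint8_t)(0xF0 | (cp >> 18));
        buf[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void) {
    uint8_t buf[4];
    int n = utf8_encode(0x20AC, buf);   /* U+20AC is the euro sign (€) */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);        /* prints: E2 82 AC */
    printf("\n");
    return 0;
}
```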
UTF-16
UTF-16 is another commonly used method of encoding characters. In this scheme, codes are formed of 2-byte (16-bit) values. All code points below the limit of a 16-bit value are encoded as-is (i.e. their code point is directly represented in binary). Code points that are too large to fit in a 16-bit number must be encoded using two 16-bit values. Fortunately, there is a block of Unicode code points which are not assigned (and will never be assigned), the surrogates, which are used for this: the code point has 0x10000 subtracted from it, and the resulting 20-bit value is split across the two surrogate values (the x and y bits below):
code points | 0 | 1 | 2 | 3 |
---|---|---|---|---|
16-bit representable | xxxxxxxx | xxxxxxxx | N/A | N/A |
not 16-bit representable | 110110xx | xxxxxxxx | 110111yy | yyyyyyyy |
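Here is a similar C sketch for UTF-16 (again only illustrative, and the name utf16_encode is an assumption of this example): code points that fit in 16 bits are stored as-is, and larger ones are split into a surrogate pair.

```c
#include <stdio.h>
#include <stdint.h>

/* Encode one code point into out, returning the number of 16-bit units
   written (1 or 2). Assumes cp is a valid Unicode code point. */
int utf16_encode(uint32_t cp, uint16_t *out) {
    if (cp <= 0xFFFF) {                         /* fits in 16 bits: stored as-is */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: 110110xx xxxxxxxx */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate:  110111yy yyyyyyyy */
    return 2;
}

int main(void) {
    uint16_t out[2];
    int n = utf16_encode(0x1F600, out);     /* U+1F600, an emoji above 0xFFFF */
    for (int i = 0; i < n; i++)
        printf("%04X ", out[i]);            /* prints: D83D DE00 */
    printf("\n");
    return 0;
}
```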