Representing Text

Characters represent a single symbol on the screen. The basic idea is that we assign each symbol (letter, number, etc) we want to represent a number, and then use that number encoded as a suitably sized integer. Historically, due largely to US influence, the encoding which was used is ASCII. ASCII is capable of representing the English alphabet (UPPER and lower case), some punctuation symbols, some control characters (NUL, BEL, ESC, etc…), and numbers. There are 128 (2^7) symbols which can be represented.

ASCII Table (Decimal Values)
    0 NUL    16 DLE    32      48 0    64 @    80 P    96 `   112 p
    1 SOH    17 DC1    33 !    49 1    65 A    81 Q    97 a   113 q
    2 STX    18 DC2    34 "    50 2    66 B    82 R    98 b   114 r
    3 ETX    19 DC3    35 #    51 3    67 C    83 S    99 c   115 s
    4 EOT    20 DC4    36 $    52 4    68 D    84 T   100 d   116 t
    5 ENQ    21 NAK    37 %    53 5    69 E    85 U   101 e   117 u
    6 ACK    22 SYN    38 &    54 6    70 F    86 V   102 f   118 v
    7 BEL    23 ETB    39 '    55 7    71 G    87 W   103 g   119 w
    8 BS     24 CAN    40 (    56 8    72 H    88 X   104 h   120 x
    9 HT     25 EM     41 )    57 9    73 I    89 Y   105 i   121 y
   10 LF     26 SUB    42 *    58 :    74 J    90 Z   106 j   122 z
   11 VT     27 ESC    43 +    59 ;    75 K    91 [   107 k   123 {
   12 FF     28 FS     44 ,    60 <    76 L    92 \   108 l   124 |
   13 CR     29 GS     45 -    61 =    77 M    93 ]   109 m   125 }
   14 SO     30 RS     46 .    62 >    78 N    94 ^   110 n   126 ~
   15 SI     31 US     47 /    63 ?    79 O    95 _   111 o   127 DEL
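
Since each character is just a small integer, you can inspect the codes and even do arithmetic on them directly. A minimal C sketch:

    #include <stdio.h>

    int main(void) {
        char letter = 'A';                    /* stored as the integer 65 */
        printf("%c = %d\n", letter, letter);  /* prints: A = 65 */
        printf("%c\n", letter + 1);           /* prints: B (code 66) */
        return 0;
    }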

ASCII does have one major drawback: it cannot represent lots of symbols that are required when writing in other languages (eg, French, Arabic) and does not have all the symbols which people need in everyday life outside of the US (you can't represent £ or € in ASCII).

There are two ways of dealing with this:

  1. Map different symbols to the same number (eg, code pages)

  2. Try to come up with one set of numbers which maps to every possible character in use (Unicode).

Footnote: describing a character as a 'single symbol on the screen' is actually simplifying a bit, but we'll not worry about this here.

Some encodings make use of the remaining values of the byte (128 to 255) to represent characters which are otherwise not possible to display: for example, extended ASCII. Different encodings were also developed which allow those values to be remapped depending on the needs of a particular region, assuming the encoding is known.

Strings

A string is a sequence of characters (bytes) which represents a piece of text. You can conceptually think of these as arrays of characters. To detect the end of the string, many programming languages include a special NUL (0) byte at the end of the string. This is called the null terminator.

For example, to represent 'Hello' as an (ASCII) string:

    Index      0     1     2     3     4     5
    Letter     H     e     l     l     o     NUL
    Decimal    72    101   108   108   111   0
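
In C, the null terminator is added to string literals automatically, and functions like strlen walk the array until they hit that 0 byte. A minimal sketch:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* The table above, written out byte by byte */
        char hello[] = {72, 101, 108, 108, 111, 0};  /* H e l l o NUL */
        char same[]  = "Hello";                      /* the compiler appends the NUL */

        printf("%s\n", hello);                   /* prints: Hello */
        printf("length = %zu\n", strlen(same));  /* prints: length = 5 (NUL not counted) */
        return 0;
    }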

Unicode

Unicode is an attempt to assign every symbol its own unique number. Plenty of other values (eg, emojis, control codes, the zero-width joiner) are also assigned numbers. However, this presents a few problems:

  1. If every single character must be represented (and there are possibly millions of characters), how do we do this without making our files MASSIVE?

  2. There are a lot of pre-Unicode documents out there; how do we ensure we can still read these documents easily?

  3. How do we deal with symbols which look the same, but might have different meanings (without causing offence)?

  4. How do we know when one symbol begins and another ends (is this document an ASCII document where each byte is a letter, or a multi-byte file where every 4 bytes is a letter)?

There are many Unicode encoding standards which are in use today. My favourite is UTF-8, although UTF-16 is also a popular choice.

UTF-8

There are many different Unicode encodings. UTF-8 is variable width, which means the number of bytes required to represent a letter can vary based on the letter being represented.

The prefix of the first byte (the number of ones before the first 0) tells you how many bytes are present in the letter, and each continuation byte starts with 10:

    Length     Byte 0      Byte 1      Byte 2      Byte 3
    1 byte     0xxxxxxx    N/A         N/A         N/A
    2 bytes    110xxxxx    10xxxxxx    N/A         N/A
    3 bytes    1110xxxx    10xxxxxx    10xxxxxx    N/A
    4 bytes    11110xxx    10xxxxxx    10xxxxxx    10xxxxxx
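
As a sketch of how those bit patterns get filled in, here is a small C encoder (utf8_encode is an illustrative name, not a library function; it writes up to 4 bytes into a caller-supplied buffer):

    #include <stdint.h>
    #include <stdio.h>

    /* Encode one code point as UTF-8; returns the number of bytes written (0 if invalid). */
    static int utf8_encode(uint32_t cp, uint8_t out[4]) {
        if (cp < 0x80) {                 /* 1 byte:  0xxxxxxx */
            out[0] = (uint8_t)cp;
            return 1;
        } else if (cp < 0x800) {         /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = (uint8_t)(0xC0 | (cp >> 6));
            out[1] = (uint8_t)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {       /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = (uint8_t)(0xE0 | (cp >> 12));
            out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (uint8_t)(0x80 | (cp & 0x3F));
            return 3;
        } else if (cp < 0x110000) {      /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            out[0] = (uint8_t)(0xF0 | (cp >> 18));
            out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (uint8_t)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;                        /* beyond U+10FFFF: not valid Unicode */
    }

    int main(void) {
        uint8_t buf[4];
        int n = utf8_encode(0x00A3, buf);  /* U+00A3, the pound sign from earlier */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);       /* prints: C2 A3 */
        printf("\n");
        return 0;
    }

Note that a string like 'Hello' encodes to exactly the same bytes in UTF-8 as in ASCII, which is how UTF-8 deals with the pre-Unicode documents mentioned above.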

UTF-16

UTF-16 is another commonly used encoding. In this scheme, codes are formed of 16-bit (2-byte) units. All code points which fit in a 16-bit value are encoded as-is (ie, their code point is directly represented in binary). Code points larger than a 16-bit number must be encoded using two 16-bit units, known as a surrogate pair. Fortunately, there is a block of Unicode code points which are not assigned to characters (and will never be assigned), so the prefixes below cannot be mistaken for ordinary 16-bit characters.

    Code point                  Byte 0      Byte 1      Byte 2      Byte 3
    16-bit representable        xxxxxxxx    xxxxxxxx    N/A         N/A
    not 16-bit representable    110110xx    xxxxxxxx    110111yy    yyyyyyyy
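
The x bits carry the high 10 bits, and the y bits the low 10 bits, of the code point after 0x10000 has been subtracted from it. A minimal C sketch of that arithmetic (utf16_encode is an illustrative name, not a library function):

    #include <stdint.h>
    #include <stdio.h>

    /* Encode one code point as UTF-16; returns the number of 16-bit units written. */
    static int utf16_encode(uint32_t cp, uint16_t out[2]) {
        if (cp < 0x10000) {               /* fits in one unit: stored as-is */
            out[0] = (uint16_t)cp;
            return 1;
        }
        cp -= 0x10000;                    /* leaves a 20-bit value */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* 110110xx xxxxxxxx */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* 110111yy yyyyyyyy */
        return 2;
    }

    int main(void) {
        uint16_t units[2];
        int n = utf16_encode(0x1F600, units);  /* an emoji, outside the 16-bit range */
        for (int i = 0; i < n; i++)
            printf("%04X ", units[i]);         /* prints: D83D DE00 */
        printf("\n");
        return 0;
    }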
