Character encodings

In Lua, as in software in general, there is no such thing as plain text: everything is stored as binary data, generally organized in bytes (groups of 8 bits), and text is one more construct built on top of bytes through what are known as character encodings.

A character encoding defines how text, seen as a sequence of characters, is represented as bytes. This applies to Lua too: Lua strings are nothing more than sequences of bytes, with no validation or transformation; interpreting them as text is left to programs and native APIs.

For simplification, we’ll suppose that encodings are simply a way to encode a sequence of codepoints, where each codepoint is a numerical value representing a character within bounds, over a sequence of bytes, and that for a sequence of codepoints, there is exactly one way to represent it over a sequence of bytes. Note that this doesn’t mean a text has only one representation; see Unicode normalization on that subject.
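As a small illustration of the simplification above, the following sketch (in Python, chosen here only for convenience; the variable names are made up for the example) encodes one codepoint under three encodings, then shows how normalization gives the same visible text two different codepoint sequences.

```python
import unicodedata

text = "\u00e9"  # "é" as a single codepoint, U+00E9

# One codepoint sequence, three encodings, three different byte sequences.
utf8 = text.encode("utf-8")          # two bytes: C3 A9
latin1 = text.encode("iso-8859-1")   # one byte: E9
utf16 = text.encode("utf-16-be")     # one 16-bit unit: 00 E9

# Unicode normalization: the same visible text can be written as
# different codepoint sequences (composed vs. decomposed form).
composed = unicodedata.normalize("NFC", "e\u0301")
decomposed = unicodedata.normalize("NFD", "\u00e9")
codepoints_nfc = [hex(ord(c)) for c in composed]
codepoints_nfd = [hex(ord(c)) for c in decomposed]
```

Within a single encoding each codepoint sequence has one byte representation, but the composed and decomposed forms are distinct sequences and therefore encode to different bytes.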

In the real world, Unicode encodings dominate. From the point of view of ComputerCraft:

  • The host system running Java might use UTF-16 (the Windows API does, as Microsoft was an early adopter of Unicode), UTF-8 (most other common systems), an earlier encoding such as ASCII, an ISO-8859 variant, Windows-1252 or Shift JIS (generally on systems dating from before 2000), or even a vendor-specific encoding.

  • Java uses UTF-16 for its native character and string types: a char is a single 16-bit code unit, so characters outside the BMP (emojis, for example) do not fit in one char and take a surrogate pair of two.

  • Most of the web accessible through HTTP uses UTF-8, but many older websites still use ISO-8859 variants.
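To make the 16-bit limit concrete: a single 16-bit unit can only hold codepoints up to U+FFFF (the BMP), and UTF-16 represents anything above that as a pair of surrogate units. The sketch below, in Python for illustration, uses a hypothetical helper named utf16_code_units to list the 16-bit units of a string.

```python
def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of text encoded as big-endian UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# U+266B (musical note) lies inside the BMP: one code unit.
bmp_units = utf16_code_units("\u266b")

# U+1F600 (an emoji) lies outside the BMP: two surrogate code units.
emoji_units = utf16_code_units("\U0001F600")
```

Code that treats each 16-bit unit as one character will therefore see such an emoji as two separate units.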

In this document, I’ll describe common character encodings encountered while programming for or contributing to thox.

ComputerCraft encoding

Monitor characters in ComputerCraft.

Monitors and terminals in ComputerCraft use a custom 8-bit encoding derived from ISO 8859-1, with the 00-1F range taken from code page 437 (with some exceptions) and the 80-9F range replaced with custom graphical characters; the table above corresponds to the 8x11 glyphs defined in the mod.

This comes from the fact that Cobalt, the LuaJ fork specific to ComputerCraft, i.e. the Lua interpreter, decodes and encodes a Lua string by simply mapping byte values to Unicode codepoints; see LuaString.decode.
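The byte-to-codepoint mapping described above can be sketched as follows. This is a Python illustration, not Cobalt's code: the function names are made up, and replacing codepoints above 0xFF with “?” is an assumption based on the paste behaviour this document describes.

```python
def decode_bytewise(data: bytes) -> str:
    """Map each byte value to the Unicode codepoint with the same number."""
    return "".join(chr(b) for b in data)

def encode_bytewise(text: str) -> bytes:
    """Inverse mapping; codepoints above 0xFF are assumed to become '?'."""
    return bytes(ord(c) if ord(c) <= 0xFF else ord("?") for c in text)

# UTF-8 bytes pushed through the byte-wise mapping are not decoded as UTF-8:
raw = "\u00e9".encode("utf-8")   # the two bytes C3 A9
seen = decode_bytewise(raw)      # two characters, U+00C3 and U+00A9
```

The decoding direction is the same operation as decoding ISO 8859-1, which is why UTF-8 text read this way shows up as two or more unrelated characters per non-ASCII character.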

This encoding is used for every string passed to Lua code, and is expected of every string Lua code passes to the native APIs. When using native ComputerCraft or an emulator (e.g. CCEmuX), no proper mapping is applied when pasting into a terminal: for example, the character “♫” (U+266B) is not converted into byte 0x0F (15), but into 0x3F (63, “?”).
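A proper mapping for pasted text could look like the sketch below. It is written in Python for illustration and is not something ComputerCraft provides; the names CC_SPECIAL, to_cc_byte and to_cc_bytes are hypothetical, and the lookup table is deliberately partial (only the first row of the code page, using the values listed in this document).

```python
# Partial lookup: Unicode codepoint -> ComputerCraft byte, for bytes 1 to 15.
CC_SPECIAL = {
    0x263A: 1, 0x263B: 2, 0x2665: 3, 0x2666: 4, 0x2663: 5, 0x2660: 6,
    0x2022: 7, 0x25D8: 8, 0x2642: 11, 0x2640: 12, 0x266A: 14, 0x266B: 15,
}

def to_cc_byte(char: str) -> int:
    """Return the ComputerCraft byte for one character, or '?' if unmapped."""
    cp = ord(char)
    if cp in CC_SPECIAL:
        return CC_SPECIAL[cp]
    # Printable ASCII and the A0-FF range keep their ISO 8859-1 values.
    if 0x20 <= cp <= 0x7E or 0xA0 <= cp <= 0xFF:
        return cp
    return ord("?")

def to_cc_bytes(text: str) -> bytes:
    """Convert a Unicode string to ComputerCraft bytes, one byte per character."""
    return bytes(to_cc_byte(c) for c in text)

note = to_cc_bytes("\u266b ok")   # the note maps to byte 15 instead of '?'
```

A complete version would extend the table with the 10-1F, 7F and 80-9F entries of the code page.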

The code page layout is the following; each cell gives the glyph, its Unicode codepoint (hexadecimal) and the byte value (decimal):

   │ _0 │ _1 │ _2 │ _3 │ _4 │ _5 │ _6 │ _7 │ _8 │ _9 │ _A │ _B │ _C │ _D │ _E │ _F
0_ │ NUL 0000 0 │ ☺ 263A 1 │ ☻ 263B 2 │ ♥ 2665 3 │ ♦ 2666 4 │ ♣ 2663 5 │ ♠ 2660 6 │ • 2022 7 │ ◘ 25D8 8 │  │  │ ♂ 2642 11 │ ♀ 2640 12 │  │ ♪ 266A 14 │ ♫ 266B 15
1_ │ ► 25BA 16 │ ◄ 25C4 17 │ ↕ 2195 18 │ ‼ 203C 19 │ ¶ 00B6 20 │ § 00A7 21 │ ▬ 25AC 22 │ ↨ 21A8 23 │ ↑ 2191 24 │ ↓ 2193 25 │ → 2192 26 │ ← 2190 27 │ ∟ 221F 28 │ ↔ 2194 29 │ ▲ 25B2 30 │ ▼ 25BC 31
2_ │ SP 0020 32 │ ! 0021 33 │ " 0022 34 │ # 0023 35 │ $ 0024 36 │ % 0025 37 │ & 0026 38 │ ' 0027 39 │ ( 0028 40 │ ) 0029 41 │ * 002A 42 │ + 002B 43 │ , 002C 44 │ - 002D 45 │ . 002E 46 │ / 002F 47
3_ │ 0 0030 48 │ 1 0031 49 │ 2 0032 50 │ 3 0033 51 │ 4 0034 52 │ 5 0035 53 │ 6 0036 54 │ 7 0037 55 │ 8 0038 56 │ 9 0039 57 │ : 003A 58 │ ; 003B 59 │ < 003C 60 │ = 003D 61 │ > 003E 62 │ ? 003F 63
4_ │ @ 0040 64 │ A 0041 65 │ B 0042 66 │ C 0043 67 │ D 0044 68 │ E 0045 69 │ F 0046 70 │ G 0047 71 │ H 0048 72 │ I 0049 73 │ J 004A 74 │ K 004B 75 │ L 004C 76 │ M 004D 77 │ N 004E 78 │ O 004F 79
5_ │ P 0050 80 │ Q 0051 81 │ R 0052 82 │ S 0053 83 │ T 0054 84 │ U 0055 85 │ V 0056 86 │ W 0057 87 │ X 0058 88 │ Y 0059 89 │ Z 005A 90 │ [ 005B 91 │ \ 005C 92 │ ] 005D 93 │ ^ 005E 94 │ _ 005F 95
6_ │ ` 0060 96 │ a 0061 97 │ b 0062 98 │ c 0063 99 │ d 0064 100 │ e 0065 101 │ f 0066 102 │ g 0067 103 │ h 0068 104 │ i 0069 105 │ j 006A 106 │ k 006B 107 │ l 006C 108 │ m 006D 109 │ n 006E 110 │ o 006F 111
7_ │ p 0070 112 │ q 0071 113 │ r 0072 114 │ s 0073 115 │ t 0074 116 │ u 0075 117 │ v 0076 118 │ w 0077 119 │ x 0078 120 │ y 0079 121 │ z 007A 122 │ { 007B 123 │ | 007C 124 │ } 007D 125 │ ~ 007E 126 │ 🮙 1FB99 127
8_ │ EMQ 2001 128 │ 🬀 1FB00 129 │ 🬁 1FB01 130 │ 🬂 1FB02 131 │ 🬃 1FB03 132 │ 🬄 1FB04 133 │ 🬅 1FB05 134 │ 🬆 1FB06 135 │ 🬇 1FB07 136 │ 🬈 1FB08 137 │ 🬉 1FB09 138 │ 🬊 1FB0A 139 │ 🬋 1FB0B 140 │ 🬌 1FB0C 141 │ 🬍 1FB0D 142 │ 🬎 1FB0E 143
9_ │ 🬏 1FB0F 144 │ 🬐 1FB10 145 │ 🬑 1FB11 146 │ 🬒 1FB12 147 │ 🬓 1FB13 148 │ ▌ 258C 149 │ 🬔 1FB14 150 │ 🬕 1FB15 151 │ 🬖 1FB16 152 │ 🬗 1FB17 153 │ 🬘 1FB18 154 │ 🬙 1FB19 155 │ 🬚 1FB1A 156 │ 🬛 1FB1B 157 │ 🬜 1FB1C 158 │ 🬝 1FB1D 159
A_ │ NBSP 00A0 160 │ ¡ 00A1 161 │ ¢ 00A2 162 │ £ 00A3 163 │ ¤ 00A4 164 │ ¥ 00A5 165 │ ¦ 00A6 166 │ § 00A7 167 │ ¨ 00A8 168 │ © 00A9 169 │ ª 00AA 170 │ « 00AB 171 │ ¬ 00AC 172 │ SHY 00AD 173 │ ® 00AE 174 │ ¯ 00AF 175
B_ │ ° 00B0 176 │ ± 00B1 177 │ ² 00B2 178 │ ³ 00B3 179 │ ´ 00B4 180 │ µ 00B5 181 │ ¶ 00B6 182 │ · 00B7 183 │ ¸ 00B8 184 │ ¹ 00B9 185 │ º 00BA 186 │ » 00BB 187 │ ¼ 00BC 188 │ ½ 00BD 189 │ ¾ 00BE 190 │ ¿ 00BF 191
C_ │ À 00C0 192 │ Á 00C1 193 │ Â 00C2 194 │ Ã 00C3 195 │ Ä 00C4 196 │ Å 00C5 197 │ Æ 00C6 198 │ Ç 00C7 199 │ È 00C8 200 │ É 00C9 201 │ Ê 00CA 202 │ Ë 00CB 203 │ Ì 00CC 204 │ Í 00CD 205 │ Î 00CE 206 │ Ï 00CF 207
D_ │ Ð 00D0 208 │ Ñ 00D1 209 │ Ò 00D2 210 │ Ó 00D3 211 │ Ô 00D4 212 │ Õ 00D5 213 │ Ö 00D6 214 │ × 00D7 215 │ Ø 00D8 216 │ Ù 00D9 217 │ Ú 00DA 218 │ Û 00DB 219 │ Ü 00DC 220 │ Ý 00DD 221 │ Þ 00DE 222 │ ß 00DF 223
E_ │ à 00E0 224 │ á 00E1 225 │ â 00E2 226 │ ã 00E3 227 │ ä 00E4 228 │ å 00E5 229 │ æ 00E6 230 │ ç 00E7 231 │ è 00E8 232 │ é 00E9 233 │ ê 00EA 234 │ ë 00EB 235 │ ì 00EC 236 │ í 00ED 237 │ î 00EE 238 │ ï 00EF 239
F_ │ ð 00F0 240 │ ñ 00F1 241 │ ò 00F2 242 │ ó 00F3 243 │ ô 00F4 244 │ õ 00F5 245 │ ö 00F6 246 │ ÷ 00F7 247 │ ø 00F8 248 │ ù 00F9 249 │ ú 00FA 250 │ û 00FB 251 │ ü 00FC 252 │ ý 00FD 253 │ þ 00FE 254 │ ÿ 00FF 255
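Going the other way, from ComputerCraft bytes to Unicode, the 80-9F range of the layout above is regular enough to compute rather than list. The following Python sketch uses hypothetical function names and only covers bytes 0x20 to 0xFF; the 00-1F range would need an explicit lookup table and is left out here.

```python
def cc_byte_to_codepoint(b: int) -> int:
    """Return the Unicode codepoint listed for a ComputerCraft byte (0x20-0xFF)."""
    if not 0x20 <= b <= 0xFF:
        raise ValueError("this sketch only covers bytes 0x20 to 0xFF")
    if b == 0x7F:
        return 0x1FB99
    if b == 0x80:
        return 0x2001
    if b == 0x95:
        return 0x258C                   # the one gap in the 1FB00 run
    if 0x81 <= b <= 0x94:
        return 0x1FB00 + (b - 0x81)     # 1FB00 to 1FB13
    if 0x96 <= b <= 0x9F:
        return 0x1FB14 + (b - 0x96)     # 1FB14 to 1FB1D
    return b                            # 20-7E and A0-FF match ISO 8859-1

def cc_to_unicode(data: bytes) -> str:
    """Convert ComputerCraft bytes to a Unicode string."""
    return "".join(chr(cc_byte_to_codepoint(b)) for b in data)
```

For example, cc_to_unicode(b"A\x95") would give "A" followed by U+258C.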

OpenComputers encoding

OpenComputers uses UTF-8, as defined in RFC 2279, as its main encoding. However, only a subset of characters can be displayed; all others are rendered as “?” (U+003F). The list of displayable characters is in encodings-oc-characters.txt.
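The display rule can be sketched as below, in Python for illustration. The SUPPORTED set is a small hypothetical stand-in for the real character list, and the function name displayable is made up for the example; this is not OpenComputers code.

```python
# Hypothetical stand-in for the real list of displayable characters.
SUPPORTED = set(map(chr, range(0x20, 0x7F))) | {"\u00e9", "\u266b"}

def displayable(text: str, supported: set) -> str:
    """Replace every character missing from the supported set with '?'."""
    return "".join(c if c in supported else "?" for c in text)

raw = "caf\u00e9 \U0001F600".encode("utf-8")   # 10 bytes for 6 characters
shown = displayable(raw.decode("utf-8"), SUPPORTED)
```

The string is decoded from UTF-8 first, so a multi-byte character that cannot be displayed becomes a single “?” rather than one per byte.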

See LuaString.valueOf and LuaString.tojstring (OC-LuaJ version), StaticFontRenderer.drawChar and chars.txt for reference.