Character encodings#
In Lua, as in software in general, there is no such thing as plain text: everything is stored as binary data, generally organized in bytes (groups of 8 bits), and text is a construct built on top of bytes through what are known as character encodings.
A character encoding defines how text, seen as a sequence of characters, is represented as a sequence of bytes. This applies to Lua as well: Lua strings are nothing more than sequences of bytes, with no validation or transformation; interpreting them as text is up to the programs and native APIs.
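As a small illustration (written in Python rather than Lua, purely for convenience), the sketch below encodes the same one-character text with three encodings, then reads one of the results back with the wrong encoding. The variable names are only for this example.

```python
# The same text, encoded with three different character encodings.
text = "\u00e9"  # "é", LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = text.encode("utf-8")          # two bytes: C3 A9
latin1_bytes = text.encode("iso-8859-1")   # one byte: E9
utf16_bytes = text.encode("utf-16-be")     # two bytes: 00 E9

print(utf8_bytes.hex(), latin1_bytes.hex(), utf16_bytes.hex())

# Bytes carry no information about their encoding: decoding them with
# the wrong one silently yields different text instead of an error.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)  # two characters: U+00C3 followed by U+00A9
```

The bytes themselves never change; only the interpretation does, which is why the encoding has to be known out of band.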
For the sake of simplicity, we’ll consider an encoding to be a way of representing a sequence of codepoints as a sequence of bytes, where each codepoint is a numerical value, within given bounds, that represents a character. We’ll also assume that, in a given encoding, a sequence of codepoints has exactly one representation as a sequence of bytes. Note that this doesn’t mean a text has only one representation; see Unicode normalization on that subject.
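To make the remark about normalization concrete, here is a minimal Python sketch using the standard `unicodedata` module. It shows one visible text with two codepoint sequences, and therefore two byte sequences in the same encoding.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single precomposed codepoint
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Both render as the same text, but the codepoint sequences differ...
print(composed == decomposed)  # False

# ...so they encode to different bytes, even in the same encoding.
print(composed.encode("utf-8").hex())    # c3a9
print(decomposed.encode("utf-8").hex())  # 65cc81

# Normalization converts one form into the other.
as_nfc = unicodedata.normalize("NFC", decomposed)
as_nfd = unicodedata.normalize("NFD", composed)
print(as_nfc == composed, as_nfd == decomposed)  # True True
```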
Real-world software mostly uses Unicode encodings. From the point of view of ComputerCraft:
The host system running Java might be using UTF-16 (the Windows API uses it, as Microsoft was an early adopter of Unicode), UTF-8 (on most other common systems), an earlier encoding such as ASCII, an ISO-8859 variant, Windows-1252 or Shift JIS (generally on systems dating from before 2000), or even a vendor-specific encoding.
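Java uses UTF-16 for its native string type, and its char type is a single 16-bit code unit: one char can only hold a codepoint from the BMP (sorry emojis, you won’t fit in one), while other codepoints need a surrogate pair.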
Most of the web accessible through HTTP uses UTF-8, but many older websites still use ISO-8859 variants.
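The limit of a 16-bit unit can be checked with a short Python sketch. The helper `utf16_code_units` is a name made up for this example; it simply splits the UTF-16 encoding of a text into 16-bit values.

```python
def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units used to encode `text`."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

bmp_units = utf16_code_units("\u00e9")         # "é", inside the BMP
emoji_units = utf16_code_units("\U0001F600")   # grinning face, outside the BMP

print([hex(u) for u in bmp_units])    # ['0xe9']
print([hex(u) for u in emoji_units])  # ['0xd83d', '0xde00']
```

A codepoint inside the BMP takes one unit, while the emoji takes two (a surrogate pair), so it cannot be stored in a single 16-bit character.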
In this document, I’ll describe common character encodings encountered while programming for or contributing to thox.
ComputerCraft encoding#
Monitors and terminals in ComputerCraft use a custom 8-bit encoding derived from ISO 8859-1, with the 00-1F range borrowed from Code page 437 (with some exceptions) and the 80-9F range replaced with custom graphical characters; the table below corresponds to the 8x11 glyphs defined in the mod.
This comes from the fact that Cobalt, the LuaJ fork that ComputerCraft uses as its Lua interpreter, decodes and encodes Lua strings by mapping byte values to Unicode codepoints; see LuaString.decode.
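The following Python sketch imitates that kind of mapping. It is not Cobalt's code: both function names are hypothetical, it assumes each byte maps to the codepoint of the same value, and the "?" fallback for codepoints above 0xFF is an assumption as well.

```python
def bytes_to_codepoints(raw):
    """Map each byte to the Unicode codepoint of the same value (assumed behaviour)."""
    return "".join(chr(b) for b in raw)

def codepoints_to_bytes(text):
    """Inverse mapping; codepoints above 0xFF become "?" (0x3F), an assumption."""
    return bytes(ord(c) if ord(c) <= 0xFF else 0x3F for c in text)

# UTF-8 data is not decoded as UTF-8: each byte becomes its own character.
raw = "\u00e9".encode("utf-8")      # b"\xc3\xa9"
print(bytes_to_codepoints(raw))     # two characters: U+00C3 followed by U+00A9

# For bytes, this mapping gives the same result as an ISO 8859-1 decode.
print(bytes_to_codepoints(raw) == raw.decode("iso-8859-1"))  # True

# Any byte string survives a round trip unchanged.
every_byte = bytes(range(256))
print(codepoints_to_bytes(bytes_to_codepoints(every_byte)) == every_byte)  # True
```

Because the mapping is one-to-one on bytes, no data is lost, but text coming from a UTF-8 source will look garbled unless the program decodes it itself.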
This encoding is used in every string passed on to the Lua code, and expected from every string passed on by the Lua code to the native APIs. When using native ComputerCraft or an emulator (e.g. CCEmuX), there is no proper mapping operation when pasting to a terminal: for example, the character “♫” (U+266B) is not converted into the character 0x0F (char. 15) as one would expect, but into 0x3F (char. 63, “?”).
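For comparison, a proper mapping from Unicode to this code page could look like the Python sketch below. It is only an illustration: `CC_FROM_UNICODE`, `to_cc_byte` and `to_cc_bytes` are hypothetical names, and the reverse table is deliberately partial, containing only a few entries copied from the first row of the code page layout.

```python
# Partial reverse table (Unicode codepoint -> byte), copied from a few
# cells of the first row of the code page. A real table would be complete.
CC_FROM_UNICODE = {
    0x263A: 1, 0x263B: 2, 0x2665: 3, 0x2666: 4,
    0x2663: 5, 0x2660: 6, 0x266A: 14, 0x266B: 15,
}

def to_cc_byte(char):
    """Return the code page byte for one character, or 0x3F ("?") if unknown."""
    codepoint = ord(char)
    if codepoint in CC_FROM_UNICODE:
        return CC_FROM_UNICODE[codepoint]
    # Printable ASCII and the ISO 8859-1 upper half keep their own value.
    if 0x20 <= codepoint <= 0x7E or 0xA0 <= codepoint <= 0xFF:
        return codepoint
    return 0x3F

def to_cc_bytes(text):
    """Convert a whole text, character by character."""
    return bytes(to_cc_byte(c) for c in text)

print(to_cc_bytes("\u266b ok"))    # b'\x0f ok'
print(to_cc_bytes("\U0001F600"))   # b'?'
```

With such a table, “♫” maps to byte 15, and “?” is only used for characters that have no glyph in the code page.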
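The code page layout is the following; each cell gives the glyph, the corresponding Unicode codepoint (in hexadecimal) and the byte value (in decimal):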
| | _0 | _1 | _2 | _3 | _4 | _5 | _6 | _7 | _8 | _9 | _A | _B | _C | _D | _E | _F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0_ | NUL 0000 0 | ☺ 263A 1 | ☻ 263B 2 | ♥ 2665 3 | ♦ 2666 4 | ♣ 2663 5 | ♠ 2660 6 | • 2022 7 | ◘ 25D8 8 | | | ♂ 2642 11 | ♀ 2640 12 | | ♪ 266A 14 | ♫ 266B 15 |
| 1_ | ► 25BA 16 | ◄ 25C4 17 | ↕ 2195 18 | ‼ 203C 19 | ¶ 00B6 20 | § 00A7 21 | ▬ 25AC 22 | ↨ 21A8 23 | ↑ 2191 24 | ↓ 2193 25 | → 2192 26 | ← 2190 27 | ∟ 221F 28 | ↔ 2194 29 | ▲ 25B2 30 | ▼ 25BC 31 |
| 2_ | SP 0020 32 | ! 0021 33 | " 0022 34 | # 0023 35 | $ 0024 36 | % 0025 37 | & 0026 38 | ' 0027 39 | ( 0028 40 | ) 0029 41 | * 002A 42 | + 002B 43 | , 002C 44 | - 002D 45 | . 002E 46 | / 002F 47 |
| 3_ | 0 0030 48 | 1 0031 49 | 2 0032 50 | 3 0033 51 | 4 0034 52 | 5 0035 53 | 6 0036 54 | 7 0037 55 | 8 0038 56 | 9 0039 57 | : 003A 58 | ; 003B 59 | < 003C 60 | = 003D 61 | > 003E 62 | ? 003F 63 |
| 4_ | @ 0040 64 | A 0041 65 | B 0042 66 | C 0043 67 | D 0044 68 | E 0045 69 | F 0046 70 | G 0047 71 | H 0048 72 | I 0049 73 | J 004A 74 | K 004B 75 | L 004C 76 | M 004D 77 | N 004E 78 | O 004F 79 |
| 5_ | P 0050 80 | Q 0051 81 | R 0052 82 | S 0053 83 | T 0054 84 | U 0055 85 | V 0056 86 | W 0057 87 | X 0058 88 | Y 0059 89 | Z 005A 90 | [ 005B 91 | \ 005C 92 | ] 005D 93 | ^ 005E 94 | _ 005F 95 |
| 6_ | ` 0060 96 | a 0061 97 | b 0062 98 | c 0063 99 | d 0064 100 | e 0065 101 | f 0066 102 | g 0067 103 | h 0068 104 | i 0069 105 | j 006A 106 | k 006B 107 | l 006C 108 | m 006D 109 | n 006E 110 | o 006F 111 |
| 7_ | p 0070 112 | q 0071 113 | r 0072 114 | s 0073 115 | t 0074 116 | u 0075 117 | v 0076 118 | w 0077 119 | x 0078 120 | y 0079 121 | z 007A 122 | { 007B 123 | \| 007C 124 | } 007D 125 | ~ 007E 126 | 🮙 1FB99 127 |
| 8_ | EMQ 2001 128 | 🬀 1FB00 129 | 🬁 1FB01 130 | 🬂 1FB02 131 | 🬃 1FB03 132 | 🬄 1FB04 133 | 🬅 1FB05 134 | 🬆 1FB06 135 | 🬇 1FB07 136 | 🬈 1FB08 137 | 🬉 1FB09 138 | 🬊 1FB0A 139 | 🬋 1FB0B 140 | 🬌 1FB0C 141 | 🬍 1FB0D 142 | 🬎 1FB0E 143 |
| 9_ | 🬏 1FB0F 144 | 🬐 1FB10 145 | 🬑 1FB11 146 | 🬒 1FB12 147 | 🬓 1FB13 148 | ▌ 258C 149 | 🬔 1FB14 150 | 🬕 1FB15 151 | 🬖 1FB16 152 | 🬗 1FB17 153 | 🬘 1FB18 154 | 🬙 1FB19 155 | 🬚 1FB1A 156 | 🬛 1FB1B 157 | 🬜 1FB1C 158 | 🬝 1FB1D 159 |
| A_ | NBSP 00A0 160 | ¡ 00A1 161 | ¢ 00A2 162 | £ 00A3 163 | ¤ 00A4 164 | ¥ 00A5 165 | ¦ 00A6 166 | § 00A7 167 | ¨ 00A8 168 | © 00A9 169 | ª 00AA 170 | « 00AB 171 | ¬ 00AC 172 | SHY 00AD 173 | ® 00AE 174 | ¯ 00AF 175 |
| B_ | ° 00B0 176 | ± 00B1 177 | ² 00B2 178 | ³ 00B3 179 | ´ 00B4 180 | µ 00B5 181 | ¶ 00B6 182 | · 00B7 183 | ¸ 00B8 184 | ¹ 00B9 185 | º 00BA 186 | » 00BB 187 | ¼ 00BC 188 | ½ 00BD 189 | ¾ 00BE 190 | ¿ 00BF 191 |
| C_ | À 00C0 192 | Á 00C1 193 | Â 00C2 194 | Ã 00C3 195 | Ä 00C4 196 | Å 00C5 197 | Æ 00C6 198 | Ç 00C7 199 | È 00C8 200 | É 00C9 201 | Ê 00CA 202 | Ë 00CB 203 | Ì 00CC 204 | Í 00CD 205 | Î 00CE 206 | Ï 00CF 207 |
| D_ | Ð 00D0 208 | Ñ 00D1 209 | Ò 00D2 210 | Ó 00D3 211 | Ô 00D4 212 | Õ 00D5 213 | Ö 00D6 214 | × 00D7 215 | Ø 00D8 216 | Ù 00D9 217 | Ú 00DA 218 | Û 00DB 219 | Ü 00DC 220 | Ý 00DD 221 | Þ 00DE 222 | ß 00DF 223 |
| E_ | à 00E0 224 | á 00E1 225 | â 00E2 226 | ã 00E3 227 | ä 00E4 228 | å 00E5 229 | æ 00E6 230 | ç 00E7 231 | è 00E8 232 | é 00E9 233 | ê 00EA 234 | ë 00EB 235 | ì 00EC 236 | í 00ED 237 | î 00EE 238 | ï 00EF 239 |
| F_ | ð 00F0 240 | ñ 00F1 241 | ò 00F2 242 | ó 00F3 243 | ô 00F4 244 | õ 00F5 245 | ö 00F6 246 | ÷ 00F7 247 | ø 00F8 248 | ù 00F9 249 | ú 00FA 250 | û 00FB 251 | ü 00FC 252 | ý 00FD 253 | þ 00FE 254 | ÿ 00FF 255 |
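The upper half of the table is regular enough to be computed rather than listed. The Python sketch below derives the Unicode equivalent of bytes 0x80 to 0xFF from the ranges visible in the table; `cc_upper_to_unicode` is a hypothetical helper, not part of ComputerCraft.

```python
def cc_upper_to_unicode(byte):
    """Return the Unicode character listed in the table for a byte in 0x80..0xFF."""
    if byte == 0x80:
        return "\u2001"                        # EM QUAD (shown as EMQ)
    if 0x81 <= byte <= 0x94:
        return chr(0x1FB00 + (byte - 0x81))    # U+1FB00..U+1FB13
    if byte == 0x95:
        return "\u258c"                        # LEFT HALF BLOCK
    if 0x96 <= byte <= 0x9F:
        return chr(0x1FB14 + (byte - 0x96))    # U+1FB14..U+1FB1D
    if 0xA0 <= byte <= 0xFF:
        return chr(byte)                       # same as ISO 8859-1
    raise ValueError("byte outside the 0x80..0xFF range")

print(hex(ord(cc_upper_to_unicode(148))))  # 0x1fb13
print(hex(ord(cc_upper_to_unicode(149))))  # 0x258c
print(hex(ord(cc_upper_to_unicode(150))))  # 0x1fb14
```

Byte 149 is the one exception inside the graphical range, which is why the range is split in two around it.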
OpenComputers encoding#
OpenComputers uses UTF-8 as defined in RFC 2279 as its main encoding. However, only a subset of characters can be displayed, the other characters being displayed as “?” (U+003F). You can find the list of characters in encodings-oc-characters.txt.
See LuaString.valueOf and LuaString.tojstring (OC-LuaJ version), StaticFontRenderer.drawChar and chars.txt for reference.
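The display rule described above can be sketched in Python as follows. This is not the OpenComputers implementation: `displayed_text` and the tiny `displayable` set are made up for the example (the real list is much longer), and the handling of invalid UTF-8 input is an assumption.

```python
def displayed_text(raw, displayable):
    """Decode UTF-8 bytes and replace every character missing from `displayable` by "?"."""
    # Assumption: invalid UTF-8 sequences decode to U+FFFD, which then
    # falls back to "?" like any other character that is not displayable.
    text = raw.decode("utf-8", errors="replace")
    return "".join(c if c in displayable else "?" for c in text)

# Hypothetical, tiny stand-in for the real list of displayable characters.
displayable = set("helo \u00e9")

shown = displayed_text("h\u00e9llo \U0001F600".encode("utf-8"), displayable)
print(shown)  # héllo ?
```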