Character encodings

In Lua, as in software in general, there is no such thing as plain text: everything is stored as binary, generally organized in bytes (groups of 8 bits), and text is a construct built on top of bytes through what are known as character encodings.

Character encodings are the art of representing text as a sequence of characters over bytes. This is something that also applies in Lua, as Lua strings are nothing more than sequences of bytes with no validation or transformation; interpreting them as text is up to the programs and native APIs.
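As a small illustration of Lua strings being plain byte sequences, the sketch below stores the word "café" with its last character written as the two bytes UTF-8 uses for U+00E9. The variable name is only illustrative, and decimal escapes are used so the result does not depend on how the source file itself is saved.

```lua
-- "é" (U+00E9) written as its two UTF-8 bytes, 195 and 169.
local s = "caf\195\169"

-- The length operator counts bytes, not characters: this prints 5, not 4.
print(#s)

-- string.byte exposes the raw byte values: 99, 97, 102, 195, 169.
print(s:byte(1, -1))
```

Whether those five bytes are shown as "café" or as something else depends entirely on the encoding assumed by whatever displays them.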

For the sake of simplicity, we’ll consider that encodings are a way to encode a sequence of codepoints, where each codepoint is a numerical value representing a character within bounds, over a sequence of bytes, and that for a sequence of codepoints, there is exactly one way to represent it over a sequence of bytes. Note that this doesn’t mean a text has only one representation; see Unicode normalization on that subject.
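To make the codepoint-to-bytes idea concrete, here is a minimal sketch of one such encoding, UTF-8, for a single codepoint. The function name `utf8_encode` is hypothetical and not part of any ComputerCraft or thox API; it uses only arithmetic so that it also runs on Lua 5.1, which has no bitwise operators, and it does not validate its input.

```lua
-- Encode one Unicode codepoint (0 .. 0x10FFFF) as a UTF-8 byte string.
-- Hypothetical helper; no range or surrogate validation is performed.
local function utf8_encode(cp)
  if cp < 0x80 then
    -- 1 byte: 0xxxxxxx
    return string.char(cp)
  elseif cp < 0x800 then
    -- 2 bytes: 110xxxxx 10xxxxxx
    return string.char(
      0xC0 + math.floor(cp / 0x40),
      0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return string.char(
      0xE0 + math.floor(cp / 0x1000),
      0x80 + math.floor(cp / 0x40) % 0x40,
      0x80 + cp % 0x40)
  else
    -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return string.char(
      0xF0 + math.floor(cp / 0x40000),
      0x80 + math.floor(cp / 0x1000) % 0x40,
      0x80 + math.floor(cp / 0x40) % 0x40,
      0x80 + cp % 0x40)
  end
end
```

For a given codepoint the function always yields the same bytes, which is the "exactly one representation" property assumed in this document.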

Real-world software mostly uses Unicode encodings. From the point of view of ComputerCraft:

The host system running Java might be using UTF-16 (the Windows API uses it, as Microsoft was an early adopter of Unicode), UTF-8 (on most other common systems), an earlier encoding such as ASCII, an ISO-8859 variant, Windows-1252 or Shift JIS (generally on systems dating from before 2000), or even a vendor-specific encoding.
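Java uses UTF-16 for its native character and string types (UCS-2 in its earliest versions): a single char is one 16-bit code unit and is therefore limited to the BMP (sorry emojis, you won't fit in one char; you need a surrogate pair).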
Most of the web accessible through HTTP uses UTF-8, but many older websites still use ISO-8859 variants.

In this document, I’ll describe common character encodings encountered while programming for or contributing to thox.

ComputerCraft encoding

Monitors and terminals in ComputerCraft use a custom 8-bit encoding derived from ISO 8859-1, with the 00-1F range borrowed from Code page 437 (with some exceptions) and the 80-9F range replaced with custom graphical characters; the table below corresponds to the 8x11 glyphs defined in the mod.

This comes from the fact that Cobalt, the LuaJ fork that ComputerCraft uses as its Lua interpreter, decodes and encodes a Lua string by mapping each byte value to the Unicode codepoint with the same value; see LuaString.decode.

This encoding is used in every string passed on to the Lua code, and expected from every string passed on by the Lua code to the native APIs. When using native ComputerCraft or an emulator (e.g. CCEmuX), there is no proper mapping operation when pasting to a terminal: for example, the character “♫” (U+266B) is not converted into the character 0x0F (15) as one would expect, and instead becomes 0x3F (char. 63, “?”).
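The byte-to-codepoint mapping attributed to Cobalt above can be sketched from the Lua side as follows. This is an illustration under the assumption that each byte maps to the codepoint with the same number; `bytes_to_utf8` is a hypothetical helper, not an existing API, and it re-encodes the result as UTF-8 so it can be handed to software that expects Unicode text.

```lua
-- Hypothetical helper: treat each byte as the Unicode codepoint with the
-- same value (0x00 .. 0xFF) and re-encode the string as UTF-8.
local function bytes_to_utf8(s)
  -- Bytes below 0x80 are identical in UTF-8; only 0x80 .. 0xFF need the
  -- two-byte form 110xxxxx 10xxxxxx.
  return (s:gsub("[\128-\255]", function(c)
    local b = c:byte()
    return string.char(0xC0 + math.floor(b / 0x40), 0x80 + b % 0x40)
  end))
end
```

Note that this identity mapping matches the displayed glyphs only where the code page agrees with ISO 8859-1; in the 00-1F and 80-9F ranges the terminal shows the custom glyphs instead.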

The code page layout is the following:

	`_0`	`_1`	`_2`	`_3`	`_4`	`_5`	`_6`	`_7`	`_8`	`_9`	`_A`	`_B`	`_C`	`_D`	`_E`	`_F`
`0_`	`NUL` `0000` 0	☺ `263A` 1	☻ `263B` 2	♥ `2665` 3	♦ `2666` 4	♣ `2663` 5	♠ `2660` 6	• `2022` 7	◘ `25D8` 8			♂ `2642` 11	♀ `2640` 12		♪ `266A` 14	♫ `266B` 15
`1_`	► `25BA` 16	◄ `25C4` 17	↕ `2195` 18	‼ `203C` 19	¶ `00B6` 20	§ `00A7` 21	▬ `25AC` 22	↨ `21A8` 23	↑ `2191` 24	↓ `2193` 25	→ `2192` 26	← `2190` 27	∟ `221F` 28	↔ `2194` 29	▲ `25B2` 30	▼ `25BC` 31
`2_`	`SP` `0020` 32	! `0021` 33	" `0022` 34	# `0023` 35	$ `0024` 36	% `0025` 37	& `0026` 38	' `0027` 39	( `0028` 40	) `0029` 41	* `002A` 42	+ `002B` 43	, `002C` 44	- `002D` 45	. `002E` 46	/ `002F` 47
`3_`	0 `0030` 48	1 `0031` 49	2 `0032` 50	3 `0033` 51	4 `0034` 52	5 `0035` 53	6 `0036` 54	7 `0037` 55	8 `0038` 56	9 `0039` 57	: `003A` 58	; `003B` 59	< `003C` 60	= `003D` 61	> `003E` 62	? `003F` 63
`4_`	@ `0040` 64	A `0041` 65	B `0042` 66	C `0043` 67	D `0044` 68	E `0045` 69	F `0046` 70	G `0047` 71	H `0048` 72	I `0049` 73	J `004A` 74	K `004B` 75	L `004C` 76	M `004D` 77	N `004E` 78	O `004F` 79
`5_`	P `0050` 80	Q `0051` 81	R `0052` 82	S `0053` 83	T `0054` 84	U `0055` 85	V `0056` 86	W `0057` 87	X `0058` 88	Y `0059` 89	Z `005A` 90	[ `005B` 91	\ `005C` 92	] `005D` 93	^ `005E` 94	_ `005F` 95
`6_`	` `0060` 96	a `0061` 97	b `0062` 98	c `0063` 99	d `0064` 100	e `0065` 101	f `0066` 102	g `0067` 103	h `0068` 104	i `0069` 105	j `006A` 106	k `006B` 107	l `006C` 108	m `006D` 109	n `006E` 110	o `006F` 111
`7_`	p `0070` 112	q `0071` 113	r `0072` 114	s `0073` 115	t `0074` 116	u `0075` 117	v `0076` 118	w `0077` 119	x `0078` 120	y `0079` 121	z `007A` 122	{ `007B` 123	\| `007C` 124	} `007D` 125	~ `007E` 126	🮙 `1FB99` 127
`8_`	`EMQ` `2001` 128	🬀 `1FB00` 129	🬁 `1FB01` 130	🬂 `1FB02` 131	🬃 `1FB03` 132	🬄 `1FB04` 133	🬅 `1FB05` 134	🬆 `1FB06` 135	🬇 `1FB07` 136	🬈 `1FB08` 137	🬉 `1FB09` 138	🬊 `1FB0A` 139	🬋 `1FB0B` 140	🬌 `1FB0C` 141	🬍 `1FB0D` 142	🬎 `1FB0E` 143
`9_`	🬏 `1FB0F` 144	🬐 `1FB10` 145	🬑 `1FB11` 146	🬒 `1FB12` 147	🬓 `1FB13` 148	▌ `258C` 149	🬔 `1FB14` 150	🬕 `1FB15` 151	🬖 `1FB16` 152	🬗 `1FB17` 153	🬘 `1FB18` 154	🬙 `1FB19` 155	🬚 `1FB1A` 156	🬛 `1FB1B` 157	🬜 `1FB1C` 158	🬝 `1FB1D` 159
`A_`	`NBSP` `00A0` 160	¡ `00A1` 161	¢ `00A2` 162	£ `00A3` 163	¤ `00A4` 164	¥ `00A5` 165	¦ `00A6` 166	§ `00A7` 167	¨ `00A8` 168	© `00A9` 169	ª `00AA` 170	« `00AB` 171	¬ `00AC` 172	`SHY` `00AD` 173	® `00AE` 174	¯ `00AF` 175
`B_`	° `00B0` 176	± `00B1` 177	² `00B2` 178	³ `00B3` 179	´ `00B4` 180	µ `00B5` 181	¶ `00B6` 182	· `00B7` 183	¸ `00B8` 184	¹ `00B9` 185	º `00BA` 186	» `00BB` 187	¼ `00BC` 188	½ `00BD` 189	¾ `00BE` 190	¿ `00BF` 191
`C_`	À `00C0` 192	Á `00C1` 193	Â `00C2` 194	Ã `00C3` 195	Ä `00C4` 196	Å `00C5` 197	Æ `00C6` 198	Ç `00C7` 199	È `00C8` 200	É `00C9` 201	Ê `00CA` 202	Ë `00CB` 203	Ì `00CC` 204	Í `00CD` 205	Î `00CE` 206	Ï `00CF` 207
`D_`	Ð `00D0` 208	Ñ `00D1` 209	Ò `00D2` 210	Ó `00D3` 211	Ô `00D4` 212	Õ `00D5` 213	Ö `00D6` 214	× `00D7` 215	Ø `00D8` 216	Ù `00D9` 217	Ú `00DA` 218	Û `00DB` 219	Ü `00DC` 220	Ý `00DD` 221	Þ `00DE` 222	ß `00DF` 223
`E_`	à `00E0` 224	á `00E1` 225	â `00E2` 226	ã `00E3` 227	ä `00E4` 228	å `00E5` 229	æ `00E6` 230	ç `00E7` 231	è `00E8` 232	é `00E9` 233	ê `00EA` 234	ë `00EB` 235	ì `00EC` 236	í `00ED` 237	î `00EE` 238	ï `00EF` 239
`F_`	ð `00F0` 240	ñ `00F1` 241	ò `00F2` 242	ó `00F3` 243	ô `00F4` 244	õ `00F5` 245	ö `00F6` 246	÷ `00F7` 247	ø `00F8` 248	ù `00F9` 249	ú `00FA` 250	û `00FB` 251	ü `00FC` 252	ý `00FD` 253	þ `00FE` 254	ÿ `00FF` 255
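A program that wants to handle characters such as “♫” itself could use a reverse lookup built from the table above. The sketch below is one possible approach, not something ComputerCraft provides: the names `SPECIAL` and `codepoint_to_cc_byte` are hypothetical, and only a handful of entries from the `0_` and `1_` rows are included for brevity.

```lua
-- Partial reverse lookup taken from the table above:
-- Unicode codepoint -> ComputerCraft byte value.
local SPECIAL = {
  [0x263A] = 1,  [0x263B] = 2,  [0x2665] = 3,  [0x2666] = 4,
  [0x2663] = 5,  [0x2660] = 6,  [0x266A] = 14, [0x266B] = 15,
  [0x2191] = 24, [0x2193] = 25, [0x2192] = 26, [0x2190] = 27,
}

-- Return the byte to use for a codepoint, falling back to "?" (63)
-- when this partial mapping has no entry for it.
local function codepoint_to_cc_byte(cp)
  if SPECIAL[cp] then
    return SPECIAL[cp]
  end
  -- Printable ASCII and the A0-FF range share their values with Unicode.
  if (cp >= 0x20 and cp <= 0x7E) or (cp >= 0xA0 and cp <= 0xFF) then
    return cp
  end
  return 63
end
```

With this mapping, U+266B yields 15 rather than 63; a complete version would need one entry per cell of the `0_`, `1_`, `8_` and `9_` rows, plus 127.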

OpenComputers encoding

OpenComputers uses UTF-8 as defined in RFC 2279 as its main encoding. However, only a subset of characters can be displayed; all other characters are displayed as “?” (U+003F). You can find the list of characters in encodings-oc-characters.txt.

See LuaString.valueOf and LuaString.tojstring (OC-LuaJ version), StaticFontRenderer.drawChar and chars.txt for reference.
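The fallback to “?” described above can be approximated in plain Lua, for instance to preview how a string would render. This is only a sketch under stated assumptions: `decode`, `displayable` and the `supported` set are hypothetical names, the input is assumed to be well-formed UTF-8 of at most four bytes per character, and the set of supported codepoints would in practice be loaded from the character list rather than written by hand.

```lua
-- Decode one well-formed UTF-8 sequence (1 to 4 bytes) into a codepoint.
local function decode(seq)
  local b1 = seq:byte(1)
  if b1 < 0x80 then
    return b1
  end
  local cp
  if b1 >= 0xF0 then
    cp = b1 % 0x08      -- 11110xxx: keep the low 3 bits
  elseif b1 >= 0xE0 then
    cp = b1 % 0x10      -- 1110xxxx: keep the low 4 bits
  else
    cp = b1 % 0x20      -- 110xxxxx: keep the low 5 bits
  end
  -- Each continuation byte (10xxxxxx) contributes 6 more bits.
  for i = 2, #seq do
    cp = cp * 0x40 + seq:byte(i) % 0x40
  end
  return cp
end

-- Replace every character whose codepoint is not in `supported` with "?".
local function displayable(s, supported)
  -- One lead byte followed by any number of continuation bytes.
  return (s:gsub("[^\128-\191][\128-\191]*", function(seq)
    if supported[decode(seq)] then
      return seq
    end
    return "?"
  end))
end
```

For example, with a set containing only U+0061 and U+00E9, the string "aé♫" would come out as "aé?".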