Contexts and interactions

Todo

Talk about processes themselves.

Inter-process communication (IPC) lets a process communicate with the hardware and with other processes, so that they can cooperate to accomplish a task. In thox, it is accomplished using two mechanisms:

  • Remote procedure calls (RPC), for one-to-one communication between processes.

  • Messages, which can be delivered to multiple processes through message queues.

Both mechanisms take place within contexts. Contexts are security objects that allow processes to manage, among other things:

  • Sandboxing of given processes (by whitelisting or blacklisting RPC endpoints they can call).

  • Logging (by running the process from a process “watcher”).

  • Compatibility with older thox processes (by converting older RPC endpoints into new ones); see seccomp for a real-world example of such a concept.

Contexts can also represent objects such as file handles or network connections: since a process’s references to contexts are closed automatically when it exits, the daemon managing such an object is guaranteed that the corresponding handle will be closed.

Regarding contexts, a process:

  • Has a default context, given the number 0.

  • Can get access to other contexts by receiving an answer to an RPC call (using os.pull()) or by creating one (using os.context()).

  • Can share access to a context by answering a call (using os.answer()) or by creating a new process (using os.run()).
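The bookkeeping described above can be pictured with a toy model. Everything in it (Context, Process, context(), share(), exit_process(), the lowest-free-number allocation, and the rule that a context counts as released once only its owner still holds it) is a hypothetical illustration, not the thox process manager: each process numbers its own contexts starting from its default context 0, and exiting drops all of the process's references so that the owning daemon can learn that, for example, a file handle is no longer in use.

```python
# Toy model of per-process context tables; names and policies are hypothetical.
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison: two contexts are never "equal"
class Context:
    owner: str                                 # process that created and routes it
    holders: set = field(default_factory=set)  # processes holding a reference


class Process:
    def __init__(self, name):
        self.name = name
        self.handles = {}  # process-local context number -> Context

    def attach(self, ctx):
        """Give this process a reference to `ctx` under the lowest free number."""
        number = 0  # the first context attached is the default context, 0
        while number in self.handles:
            number += 1
        self.handles[number] = ctx
        ctx.holders.add(self.name)
        return number


def context(proc):
    """Like os.context(): create a context owned by `proc`, return its local number."""
    return proc.attach(Context(owner=proc.name))


def share(number, sender, receiver):
    """Like answering a call or os.run(): give `receiver` one of `sender`'s contexts."""
    return receiver.attach(sender.handles[number])


def exit_process(proc):
    """Drop all references held by `proc`; return contexts now held only by their owner."""
    released = []
    for ctx in proc.handles.values():
        ctx.holders.discard(proc.name)
        if ctx.holders <= {ctx.owner}:
            released.append(ctx)
    proc.handles.clear()
    return released


# Usage: a file daemon shares a per-file context with an application.
manager_ctx = Context(owner="procmgr")   # hypothetical process manager name
fsd, app = Process("fsd"), Process("app")
fsd.attach(manager_ctx)                  # fsd's default context, number 0
app.attach(manager_ctx)                  # app's default context, number 0
file_number = context(fsd)               # fsd knows the file context as 1
app_number = share(file_number, fsd, app)
released = exit_process(app)             # app exits without closing the file
```

Here the daemon would be notified that the file context is released, while the process manager's context stays alive because fsd still holds it.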

A context is owned by the process that created it using os.context(). This process manages the context: it routes the RPC calls and messages emitted within it, and is therefore responsible for its security when it shares the context with multiple processes.

The initial process gets a context managed by the process manager; it can then create contexts for its children, depending on the security it wants to set up. See initd: process manager with thox IPC for more information.

Todo

Message queues, in a separate namespace but also within contexts, probably with the following functions:

  • os.push(ctx, name, info)

  • os.listen(ctx, name)

This requires utilities similar to those used for RPC calls, as we also need routing (although a call must arrive at exactly one destination, a message can arrive at multiple).

Todo

What happens on OS shutdown, for example? In which order are processes closed, if some of them need to close other processes? Can processes have a “kill” event so that they can perform their last actions (e.g. sending disconnect messages on network connections)?

Events

thox processes are event-driven; the process manager prepares the events for the process to read, with an optional filter depending on what it expects.

Events can be the following:

  • RPC events, such as calls received by the process and answers received from previous calls.

  • Messages from message queues the process has subscribed to. These messages can represent “real world” events, among other things.

Events are represented using os.Event objects, and are pulled using os.pull(). It is possible to pull specific events, such as the answer to a specific call, by passing additional parameters to os.pull(); by default, it returns the oldest event not yet pulled by the process.
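The pulling behaviour can be sketched as a queue that returns the oldest event matching an optional filter. The Event fields (kind, cid, args) and the kind/cid filter parameters are assumptions made for illustration; they are not the actual os.Event or os.pull() signatures.

```python
# Toy model of a process's event queue behind os.pull(); the Event fields and the
# filter parameters are hypothetical, not the real os.Event / os.pull() signature.
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    kind: str                   # e.g. "call", "answer" or "message"
    cid: Optional[int] = None   # call identifier, for RPC events
    args: tuple = ()


class EventQueue:
    def __init__(self):
        self._events = deque()

    def post(self, event):
        """Called by the process manager when an event is ready for the process."""
        self._events.append(event)

    def pull(self, kind=None, cid=None):
        """Remove and return the oldest event matching the filter (None if none)."""
        for index, event in enumerate(self._events):
            if (kind is None or event.kind == kind) and (cid is None or event.cid == cid):
                del self._events[index]
                return event
        return None


q = EventQueue()
q.post(Event("call", cid=1, args=("hello",)))
q.post(Event("answer", cid=7, args=("a",)))
q.post(Event("answer", cid=9, args=("b",)))
specific = q.pull(kind="answer", cid=9)   # the answer to call 9, skipping older events
oldest = q.pull()                         # no filter: the oldest remaining event
```

Filtering leaves non-matching events in place, so a process waiting for one specific answer does not lose the calls it has not handled yet.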

Todo

What if there are too many events? Where is the limit? What happens?

Remote Procedure Call (RPC)

thox processes communicate mainly in a one-to-one fashion, using a remote procedure call protocol.

Todo

Since RPC in thox is asynchronous, what about cancelling calls? Some calls might never end and clog up the process’s call space. There should be a timeout mechanism, or we could let the process manage its own timeouts using alarms and let it cancel the call. But what happens to the process at the other end of the RPC call? This would be a hang-up; wouldn’t it be simpler to leave the RPC call open for it and simply not transmit the answer? But then, should we alert it of the hang-up?

Also, what if cancelling is used to DoS another process that performs complex operations or lots of I/O to serve each call? e.g. instead of being limited by the maximum number of calls I can emit, I make a lot of calls, cancel all of them, then do it again in a loop. The process scheduler might see this as time-sharing friendly and give my process more time to make system calls, while the cost is borne by the process at the other end.

Calling procedure

Picture three processes P1, P2 and P3, where P3 manages the default context of both P1 and P2. P1 wants to execute an action using this protocol, and P2 provides this function and wants any other process using its default context to be able to run it.

In order to represent the action, thox uses RPC names such as my.super.function. When started up, P3 first decides, either for each action or globally, how it will handle calls. Some common possibilities are:

  • It provides a fixed set of functions, and does not provide any mechanism to “bind” functions.

  • It transmits all calls from a given context to another, e.g. calls from a context it created to its default context.

  • It allows RPC name binding.

The basic context most processes on thox will encounter is the context provided by initd (see initd: process manager with thox IPC for more information). This context allows binding through its os.rpc.bind() and os.rpc.unbind() endpoints.

With binding, daemons such as P2 can route specific RPC calls made on their default context to themselves: once P2 has bound my.super.function, any call to that name on the context provided by P3 results in P2 receiving a call from the calling process. Therefore, when P1 calls my.super.function on its default context, it receives an answer from P2.
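A context owner that allows binding essentially keeps a table from RPC names to processes. The sketch below models that table; the class name and the policies it enforces (the first binder wins, only the binder may unbind) are assumptions, not the behaviour of initd’s os.rpc.bind() and os.rpc.unbind() endpoints.

```python
# Hypothetical routing table of a context owner such as P3 (or initd), mirroring
# bind/unbind; the conflict and ownership rules below are assumptions.
from typing import Optional


class BindingRouter:
    def __init__(self):
        self._bindings = {}  # canonical (lower-cased) RPC name -> process name

    def bind(self, name: str, process: str) -> bool:
        name = name.lower()                  # RPC names are case-insensitive
        if name in self._bindings:
            return False                     # assumption: first binder wins
        self._bindings[name] = process
        return True

    def unbind(self, name: str, process: str) -> bool:
        name = name.lower()
        if self._bindings.get(name) != process:
            return False                     # assumption: only the binder may unbind
        del self._bindings[name]
        return True

    def route(self, name: str) -> Optional[str]:
        """Process to transmit the call to, or None (the owner would answer UNBOUND)."""
        return self._bindings.get(name.lower())


router = BindingRouter()
router.bind("my.super.function", "P2")      # P2 binds the name on P3's context
target = router.route("My.Super.Function")  # P1's call is routed to P2
```

Refusing to let another process rebind or unbind a name is what prevents a third process from hijacking calls meant for P2.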

What happens in order during the call is the following:

  • P1 calls the procedure; under the hood, this emits a call to the system using os.call(), which returns a token in the form of a numerical Call IDentifier (CID).

  • P3 gets a call event, bundled with the CID with which to answer, the arguments given by P1, and some additional request information. It finds out that P2 is bound to the given name, and transmits the call to it using os.transmit().

  • P2 gets a call event, bundled with the CID with which to answer, the arguments given by P1, and some additional request information.

  • P2 treats the request accordingly.

  • P2 emits an answer using os.answer(), passing the CID to it, optionally followed by some return values.

  • P1 gets an answer event, with the CID (to tell which call the answer is for, in case P1 has sent multiple calls).
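The sequence above can be replayed with a toy process manager. The methods mirror os.call(), os.transmit(), os.answer() and os.pull(), but their signatures, the event tuples and the single global CID counter (so that every process sees the same CID) are simplifying assumptions.

```python
# Toy process manager replaying the call sequence above. Method names mirror
# os.call(), os.transmit(), os.answer() and os.pull(), but the signatures, the
# event tuples and the single global CID counter are assumptions for illustration.
from collections import defaultdict, deque
from itertools import count


class Kernel:
    def __init__(self):
        self._next_cid = count(1)
        self._pending = {}                 # CID -> (caller, name, args)
        self._queues = defaultdict(deque)  # process -> pending events

    def call(self, caller, owner, name, *args):
        """os.call(): queue a call event for the context owner, return the CID."""
        cid = next(self._next_cid)
        self._pending[cid] = (caller, name, args)
        self._queues[owner].append(("call", cid, name, args))
        return cid

    def transmit(self, cid, target):
        """os.transmit(): hand a received call over to another process."""
        _, name, args = self._pending[cid]
        self._queues[target].append(("call", cid, name, args))

    def answer(self, cid, *values):
        """os.answer(): deliver the return values to the original caller."""
        caller, _, _ = self._pending.pop(cid)
        self._queues[caller].append(("answer", cid, values))

    def pull(self, process):
        """os.pull() without filter: oldest event for `process`."""
        return self._queues[process].popleft()


kernel = Kernel()
bindings = {"my.super.function": "P2"}                       # P3's routing table

cid = kernel.call("P1", "P3", "my.super.function", 20, 22)   # P1 emits the call
_, p3_cid, name, _ = kernel.pull("P3")                       # P3 receives it...
kernel.transmit(p3_cid, bindings[name])                      # ...and forwards it to P2
_, p2_cid, _, args = kernel.pull("P2")                       # P2 receives the call
kernel.answer(p2_cid, sum(args))                             # P2 answers with its result
kind, answered_cid, values = kernel.pull("P1")               # P1 gets the answer event
```

Because the answer is keyed on the CID rather than on P2’s identity, P1 never needs to know that P3 forwarded the call.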

For binding a name beforehand, P2 uses os.rpc.bind(); it can also unbind a name using os.rpc.unbind().

Remote Procedure Names (RPN)

Remote Procedure Names (RPN) are the names to which remote procedure calls are emitted. They are represented as a dot-joined sequence of one or more name components, which are non-empty strings of letters and digits that do not begin with a digit and are not one of the following reserved words:

and, break, do, else, elseif, end, false, for, function, goto, if, in, local, nil, not, or, repeat, return, then, true, until, while

A Remote Procedure Name:

  • Can end with an underscore. Note that only the last name component can end with an underscore; for example, fs_.open is not allowed.

  • Must not end with an empty component (i.e. with a dot), and its last component must not consist solely of an underscore; for example, fs_ and fs.open_ are allowed, whereas _ and fs._ are not.

  • Is of non-zero arbitrary length [1].

  • Is case-insensitive: the callee receives a lower-cased version of it; e.g. FS.GetSpaceLeft, FS.GETSPACELEFT and fs.getspaceleft will all be received by the callee as fs.getspaceleft.

Note that the underscore has special significance; see Sharing contexts.

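The following regular expression (which requires support for negative lookaheads) validates names as used for RPC:

/^(?!(?:and|break|do|else|elseif|end|false|for|function|goto|if|in|local|nil|not|or|repeat|return|then|true|until|while)(?![a-z0-9]))[a-z][a-z0-9]*(?:\.(?!(?:and|break|do|else|elseif|end|false|for|function|goto|if|in|local|nil|not|or|repeat|return|then|true|until|while)(?![a-z0-9]))[a-z][a-z0-9]*)*_?$/i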

A Lua function to validate and return the canonicalized RPC name can be found in torpcname.lua.
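For readers without the Lua file at hand, the rules above can be sketched in Python as a component-by-component check. This is an independent sketch of the listed rules, not a transcription of torpcname.lua; treating a reserved word followed by the sharing underscore (e.g. for_) as invalid is an assumption.

```python
# Python analogue of a name validator/canonicalizer in the spirit of torpcname.lua;
# this is a sketch of the rules above, not a transcription of that file.
RESERVED = frozenset(
    "and break do else elseif end false for function goto if in local nil "
    "not or repeat return then true until while".split()
)


def to_rpc_name(name: str):
    """Return the lower-cased RPC name, or None if `name` breaks a rule above."""
    name = name.lower()                                # names are case-insensitive
    base = name[:-1] if name.endswith("_") else name   # one trailing underscore allowed
    if not base:
        return None                                    # empty name, or a lone "_"
    for component in base.split("."):                  # "a..b" or "a." yield empty parts
        if not (component.isascii() and component[:1].isalpha() and component.isalnum()):
            return None                                # letters/digits only, no leading digit
        if component in RESERVED:
            return None                                # reserved words cannot be components
    return name
```

Checking whole components against the reserved-word set, rather than prefixes, keeps names such as format or info valid.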

Some valid and invalid identifiers are the following:

Valid identifiers:

  • sleep

  • os.module

  • how.deep.does.this.go

  • my.function2

  • my.function2_

  • my_

Invalid identifiers:

  • for

  • 123hello

  • hello.2theworld

  • my.gawd$

  • my_.function2

  • my._

  • _

The rationale behind this definition is to be able to integrate these identifiers into native Lua code using the os.rpc prefix, for example os.rpc.sleep(5) to emit a synchronous call to the sleep procedure. Case insensitivity avoids the confusion that a system-wide distinction between fs.GetSpaceLeft and fs.getspaceleft could cause, which could lead to security problems; see typosquatting for a real-world problem similar to the one this mitigation addresses.

Note that while this API is asynchronous, processes will usually want to call RPC functions synchronously; to make this more convenient, one can use the os.rpc object.
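The idea behind os.rpc, where each attribute access appends a name component and calling the result performs a blocking call, can be sketched with a Python proxy object. RpcProxy and call_sync are hypothetical names; a real call_sync would emit the call with os.call() and wait for the matching answer with os.pull(), which is replaced here by a stub that records calls.

```python
# Sketch of an os.rpc-style proxy: attribute access accumulates name components and
# calling the result performs a blocking call. `call_sync` is a hypothetical helper
# that would wrap os.call() and wait for the matching answer with os.pull().
class RpcProxy:
    def __init__(self, call_sync, path=()):
        self._call_sync = call_sync
        self._path = path

    def __getattr__(self, component):
        # Only reached for attributes not set in __init__, i.e. name components.
        return RpcProxy(self._call_sync, self._path + (component,))

    def __call__(self, *args):
        name = ".".join(self._path).lower()   # canonical, lower-cased RPC name
        return self._call_sync(name, *args)


calls = []


def fake_call_sync(name, *args):
    """Stand-in for a real blocking call: record it and pretend it returned nothing."""
    calls.append((name, args))
    return None


rpc = RpcProxy(fake_call_sync)
rpc.sleep(5)                  # like os.rpc.sleep(5)
rpc.FS.GetSpaceLeft()         # received by the callee as fs.getspaceleft
rpc.fs.open_("/etc/motd")     # trailing underscore is still a valid identifier
```

The same reasoning that shapes RPC names applies here: a name such as for could not be written as an attribute, just as os.rpc.for would not parse in Lua.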

Status codes

RPC calls always return a status code, as a number, as their first value. This status code should always lie between 0 and 255; when the status code passed to os.answer() is outside these bounds, it is replaced with UNKNOWN (255).

Special status codes are the following:

  • SUCCESS (0): returned when the call has succeeded.

  • UNBOUND (253): returned when the name should be considered as unbound.

  • UNANSWERED (254): returned when a call has not been answered. This can happen when the call is never picked up from a full event queue, or when bad routing leads a process owning a context to forward the call to itself.

  • UNKNOWN (255): returned when an invalid status code has been provided to os.answer().
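The special codes and the clamping rule for os.answer() can be summarised as follows. The Status enum and normalize_status() are illustrative names only, and treating booleans and non-integral numbers as out of range is an assumption.

```python
from enum import IntEnum


class Status(IntEnum):
    """Special status codes listed above."""
    SUCCESS = 0
    UNBOUND = 253
    UNANSWERED = 254
    UNKNOWN = 255


def normalize_status(code):
    """Status code the caller receives for `code` passed to os.answer() (a sketch):
    integers in 0..255 pass through, anything else becomes UNKNOWN (255)."""
    if isinstance(code, int) and not isinstance(code, bool) and 0 <= code <= 255:
        return code
    return Status.UNKNOWN
```

Codes 1 to 252 are left free for procedure-specific meanings.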

Sharing contexts

While some RPC calls return scalar values, such as nil, strings and numbers, others share contexts. Sharing a context means giving the caller access to a given context.

thox follows the “underscore means sharing” principle, which is stated as follows:

If the name of the RPC procedure ends with an underscore, then the first argument in the answer is the context being shared.

So, for example, math.add will not share a context, whereas fs.open_ will. Consider the latter call, where the callee creates a new context it owns using os.context() and wants to share it:

  • From the callee’s point of view, the first argument to be passed to os.answer() will be the number of the context to share, here the value returned by os.context().

  • From the caller’s point of view, the first returned value, which can be found in the args attribute of an os.AnswerEvent, will be either nil or the number of the context that has been shared.

Note that both numbers won’t necessarily be the same; for example, the callee can know the new context as context number 7, whereas the caller might know it as context number 3. It is up to the process manager to do the conversion while sharing the context.
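The translation step can be sketched as follows, reusing the numbers from the example above (7 on the callee side, 3 on the caller side). deliver_answer() and its dictionary arguments are hypothetical; reusing the caller’s existing number when it already holds the context is only one possible answer to the open question in the Todo, assumed here.

```python
# Hypothetical process-manager step run when a callee answers: apply the
# "underscore means sharing" rule and translate the shared context number from the
# callee's numbering into the caller's. Names and data layout are assumptions.
def deliver_answer(name, callee_handles, caller_handles, values):
    """Return the values as the caller should see them.

    `callee_handles` and `caller_handles` map process-local numbers to contexts.
    """
    values = list(values)
    if not name.endswith("_") or not values or values[0] is None:
        return values                         # no sharing: everything is a plain value
    ctx = callee_handles[values[0]]           # context in the callee's numbering
    for number, held in caller_handles.items():
        if held is ctx:                       # assumption: reuse an existing handle
            values[0] = number
            return values
    number = 0
    while number in caller_handles:           # lowest free number in the caller
        number += 1
    caller_handles[number] = ctx
    values[0] = number
    return values


default_a, default_b, other_a, other_b, file_ctx = (object() for _ in range(5))
callee = {0: default_b, 7: file_ctx}                 # callee knows it as 7
caller = {0: default_a, 1: other_a, 2: other_b}      # caller has 0..2 in use
shared = deliver_answer("fs.open_", callee, caller, [7, "rw"])
plain = deliver_answer("math.add", callee, caller, [7])
```

Since math.add has no trailing underscore, its result 7 reaches the caller as a plain number and never grants access to the callee’s context 7.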

Todo

What happens if the caller already has access to the given context?

Note

The “underscore means sharing” design is the simplest solution to the following design problem: how does the process manager know if a number the callee has transmitted as an answer to a call was intended to be a scalar or a context number, which needs to be shared and translated between processes?

The following other possibilities were considered:

  • Making every argument to os.answer() (except the call identifier) a table, composed of one element for scalar arguments (e.g. {5} or {"hello"}) and two elements for contexts to be shared (e.g. {5, "ctx"}).

    Besides making the notation heavier, this might also have produced cases where callers that do not expect a context get flooded with unclosed contexts, preventing new contexts from being created or shared with them, and eventually crashing them.

  • Adding an argument to os.call() describing which argument(s) in the answer were expected to be context numbers, e.g. {1} for the first argument to be a context, {1, 4} for arguments one and four to be contexts, or {} for no contexts at all.

    While this would be visible to callees, badly programmed callees might share contexts accidentally. For example, imagine a math.sum call taking two arguments and returning their sum. Calling it with 0 and 0 as arguments and {} as expected contexts would produce the expected result; but suppose we call it with {1} as expected contexts and the callee doesn’t check this. By returning 0 as intended, it isn’t actually returning the number 0, but sharing its default context, which the caller can then use to emit calls in a privileged context, leading to a privilege escalation!

While the last idea could have been somewhat “patched” by a solid daemon library that takes wrapped functions with their expected contexts, matches them against the call data at each call, and returns an error with no context in case of mismatch, I considered this a weakness of the design.

Yet the idea of having both the caller and the callee know what is being shared stuck, and most context sharing involves sharing only one context, as the first answer argument. By imposing this constraint on the calling convention, I only needed a way to tell the callee that a context was expected, and instead of an additional argument, I decided to make this information part of the name, by appending a special character.

This special character was initially going to be a dollar or a bang; however, that would have broken the “RPC names can be expressed natively in Lua” rationale, so I decided to forbid underscores in the base names and to add one at the end of the name when a context is being shared. I considered this an elegant way for both the caller and the callee to be aware that a context is being shared!

Note that this logic could be extended to multiple underscores at the end of names, for sharing multiple contexts at once. However, I decided to keep it simple and forbid more than one trailing underscore; this way, later versions of thox can still add them.