Description

The bus1 Kernel Message Bus defines and implements a distributed object model. It allows local processes to send messages to objects owned by remote processes, as well as share their own objects with others. Object ownership is static and cannot be transferred. Access to remote objects is prohibited, unless it was explicitly granted. Processes can transmit messages to a remote object via the message bus, transferring a data payload, object access rights, file descriptors, or other auxiliary data.

To participate on the message bus, a peer context must be created. Peer contexts are kernel objects, identified by a file descriptor. They are not bound to any process, but can be shared freely. The peer context provides a message queue to store all incoming messages, a registry for all locally owned objects, and tracks access rights to remote objects. A peer context never serves as a routing entity, but merely as an anchor for peer-owned resources. Any message on the bus is always destined for an object, and the bus takes care to transfer a message into the message queue of the peer context that owns this object.

The message bus manages object access using capabilities. That is, by default only the owner of an object is granted access rights. No other peer can access the object, nor are they aware of the existence of the object. However, access rights can be transmitted as auxiliary data with any message, effectively granting them to the receiver of the message. This even works transitively; that is, any peer that was granted access to an object can pass on those rights, even if they do not own the object. Note, however, that access rights can never be revoked, other than by the owner destroying the object.

Nodes and Handles

Each peer context comes with a registry of owned objects, which in bus1 parlance are called nodes. A peer is always the exclusive owner of all nodes it has created; ownership cannot be transferred. The message bus manages access rights to nodes as a set of handles held by each peer. For each node a peer has access to, whether local or remote, the message bus keeps a handle on that peer. Initially, when a node is created, the node owner is the only peer with a handle to it. Handles are local to each peer, but can be transmitted as auxiliary data with any message, effectively allocating a new handle to the same node in the destination peer. This works transitively, and each peer that holds a handle can pass it on further, or deliberately drop it. As long as a peer holds a handle to a node, it can send messages to it.

However, a node owner can, at any time, decide to destroy a node. This causes all further message transactions to this node to fail, although messages that have already been queued for the node are still delivered. When a node is destroyed, all peers that hold handles to the node are notified of the destruction. Moreover, if the owner of a destroyed node releases all its handles to the node, no further messages or notifications destined for the node are delivered.

Handles are the only way to refer to both local and remote nodes. For each handle allocated on a peer, a 64-bit ID is assigned to identify that particular handle on that particular peer. The ID is only valid locally on that peer; it cannot be used by remote peers to address the handle (in other words, the ID namespace is tied to each peer and does not define global entities). When creating a new node, userspace freely selects the ID, except that the BUS1_HANDLE_FLAG_MANAGED bit must be cleared; when receiving a handle from a remote peer, the kernel assigns the ID, which always has BUS1_HANDLE_FLAG_MANAGED set. Additionally, the BUS1_HANDLE_FLAG_REMOTE flag tells whether a specific ID refers to a remote handle (if set), or to an owner handle (if unset). An ID assigned by the kernel is never reused, even after the handle has been dropped.

The kernel keeps a user-reference count for each handle. Every time a handle is exposed to a peer, the user-reference count of that handle is incremented by one. This is never done asynchronously, but only synchronously when an ioctl is called by the holding peer. Therefore, a peer can reliably deduce the current user-reference count of all its handles, regardless of any ongoing message transactions. References can be explicitly dropped by a peer. Once the counter of a handle hits zero, the handle is destroyed and its ID becomes invalid; if the ID was assigned by the kernel, it will not be reused. Note that a peer can never have multiple different handles to the same node; rather, the kernel coalesces them into a single handle, using the user-reference counter to track it. However, if a handle is fully released, but the peer later acquires a handle to the same remote node again, its ID will be different, as IDs are never reused.

New nodes are allocated on demand by passing the desired ID to the kernel in any ioctl that accepts a handle ID. When allocating a new node, the node owner implicitly also gets a handle to that node. As long as the node is valid, the kernel pins a single user-reference to the owner's handle. This guarantees that a node owner always retains access to their node, until they explicitly destroy it (which makes it possible for userspace to release the handle like any other). Once all handles to a local node have been released, no more messages destined for the node are received. Otherwise, a handle to a local node behaves just like any other handle; that is, user-references are acquired and released according to its use.

However, whenever the overall sum of all user-references on all handles to a node drops to one (which implies that only the pinned reference of the owner is left), a release-notification is queued on the node owner. If the counter is incremented again, any such notification is dropped, if not already dequeued.

Message Transactions

A message transaction atomically transfers a message to any number of destinations. Unless requested otherwise, the message transaction fully succeeds or fully fails. To receive message payloads, each peer has an associated shmem-backed pool which may be mapped read-only by the receiving peer. The kernel copies the message payload directly from the sending peer into each receiver's pool, without an intermediary kernel buffer. The pool is divided into slices to hold each message. When a message is received, its offset into the pool in bytes is returned to userspace, and userspace has to explicitly release the slice once it has finished with it. The kernel amends all data messages with the uid, gid, pid, tid, and optionally the security context of the sending peer. The information is collected from the sending peer when the message is sent and translated into the namespaces of the receiving peer's file-descriptor.

Seed Message

Every peer may pin a special seed message. Only the peer itself may set and retrieve the seed, and at most one seed message may be pinned at any given time. The seed typically describes the peer itself and pins any nodes and handles necessary to bootstrap the peer.

Resource Quotas

Each user has a fixed amount of available resources. The limits are static, but may be overridden by module parameters. Limits are placed on the amount of memory a user's pools may consume, the number of handles a user may hold, the number of inflight messages that may be destined for a user, and the number of file descriptors that may be inflight to a user. All inflight resources are accounted on the receiving peer. Because resources are accounted on the receiver, a quota mechanism is in place to avoid intentional or unintentional resource exhaustion by a malicious or broken sending user.

At the time of a message transaction, the sending user may consume in total (including what was consumed by previous transactions) at most half of the receiving user's resources that have not been consumed by another user. When a message is dequeued, its resource consumption is deaccounted from the sending user's quota.

If a receiving peer never dequeued any of its incoming messages, it would be possible for a user's quota to be fully consumed by one peer, making it impossible to communicate with other, functioning peers owned by the same user. A second quota is therefore enforced per peer: at the time of a message transaction, the receiving peer may consume in total (including what was consumed by previous transactions) at most half of the resources available to the sending user that have not been consumed by another peer.

Global Ordering

Despite there being no global synchronization, all events on the bus, such as the sending or receiving of messages, the release of handles, or the destruction of nodes, behave as if they were globally ordered. That is, for any two events it is always possible to consider one to have happened before the other, in a way that is consistent with all the effects observed on the bus.

For instance, if two events occur on one peer (say, the sending of a message and the destruction of a node), and they are observed on another peer (by receiving the message and receiving a destruction notification for the node), the order in which the events occurred and the order in which they are observed is guaranteed to be the same. As a further example involving three peers: if a message is sent from one peer to two others, and after receiving the message the first recipient sends a second message to the second recipient, it is guaranteed that the original message is received before the subsequent one.

This principle of causality is also respected in the presence of side-channel communication. That is, if one event may have triggered another, even on different, disconnected peers, the events are guaranteed to be ordered accordingly. To be precise, if one event (such as receiving a message) completed before another (such as sending a message) was started, then they are ordered accordingly.

Even where there can be no causal relationship, a global order is guaranteed: if two events happened concurrently, there can never be any inconsistency in which occurred before the other. By way of example, consider two peers each sending one message to the same two destination peers; both recipients are guaranteed to receive the two messages in the same order, even though that order may be arbitrary.