ref: a3f2d6d23ed1be69fee607bdedadc0438fd01305
dir: /sys/doc/nssec.ms/
.HTML "Namespaces as Security Domains" .TL Namespaces as Security Domains .AU Jacob Moody .AB We aim to explore the use of Plan 9 namespaces as ways of building isolated processes. We present here code for increasing the ability and granularity for which a process may isolate itself from others on the system. .AE .SH Introduction .PP .FS First presented in a slightly different form at the 9th International Workshop on Plan 9. .FE .LP In a Plan 9 system the kernel exposes hardware and system interfaces through a myriad of filesystem trees. These trees, or sharp devices, replace the functionality of many would be system calls through use of standard file system operations. A standard Plan 9 environment is comprised of a composition of these individual devices together, the collection of such being the processes namespace. .LP With these principles it is quite easy for a process to build a slim namespace using only what it may need for operation. This could be done in service to reduce the "blast radius" of awry or malicious code to some effect. But to be fully effective a process must also be able to remove the ability to bootstrap these capabilities back. We will explore different ways of building isolated namespaces, their pitfalls, ways to address those issues, along with new solutions. .SH Outside World .LP There have been many solutions for sandboxing within the UNIX™ world. There are more classical approaches such as .CW zones , [Price04] and .CW jails , that all provide an abstraction of building some number of smaller full unix boxes out of a single physical host. However these interfaces are presented more as a systems management tool, the mechanisms for which an administrator creates and manages these resources is unergonomic to use on a per-process basis. Instead it seems more the fashion now to isolate specific pieces of the system, and expect it possible that each process on the system may choose to manage its environment. The most successful execution of this idea in the wild is the OpenBSD project's .CW unveil and .CW pledge [Beck18] system calls, allowing a processes to cut off specific parts of the filesystem or system call interfaces. Linux namespaces [Biederman06] implement this idea by allowing a process to fork off private versions of specific global resources. In both these cases the sandboxing of a process is through gradual steps, removing potentially dangerous tools one by one. .SH Existing Work .LP Let us first define the resources we are restricting access to. The aforementioned gradual solutions provide ways in which a process can remove itself from specific kernel interfaces. In plan9 the kernel exposes almost all of its functionality through individual filesystems. These devices are accessed globally by prefixing a path with a sharp('#'), and have conventional places they are bound within the namespace. .LP A processes namespace in plan9 is typically constructed using a namespace file. These files are a collection of namespace operations formatted as one would expect to see them in a shell script. They typically begin by binding in some number of sharp devices in to their expected location. .P1 bind #d /fd bind -c #e /env bind #p /proc bind -c #s /srv .P2 Then using the globals provided, in particular /srv, to bring in the rest of the root filesystem. A process can at any point choose to construct itself a new namespace, but it must do so when changing users. This is done in part to ensure that each filesystem that the program would like to use has their chance to authenticate and be notified. Because this information is only exchanged on attach, the new user must construct a namespace from scratch. .LP Many programs, like network services, wish to drop their current user and become the special user .CW none user on startup, and in doing so must rebuild their namespace. The conventional default namespace files used is /lib/namespace, but most programs allow the user to specify an alternative with a flag. It is here that we already can approximate a chroot style environment by changing the root filesystem used in a namespace file. .P1 bind #s /srv mount /srv/myboot /root bind -a /root / .P2 By having another filesystem exposed in /srv/myboot and modifying the provided namespace file, we've allowed this process to work within an entirely separate root filesystem. .SH RFNOMNT .LP The issue in using these namespaces as security barriers is that there is nothing preventing a process from bootstrapping a resource back. While our example code places a different root filesystem in the namespace, nothing is preventing that process or its children from potentially rebootstrapping the real root filesystem back. For this issue there is a special rfork flag .CW RFNOMNT the prevents a process from accessing any almost any sharp device of consequence. This is done by preventing a process from walking to a device by its location within '#'. This allows existing binds of resources to continue working within the namespace but restricts a process from binding in new resources from the kernel. .LP While effective we found this to be too large a hammer in practice. Doing as its name implies .CW RFNOMNT also prevents a process from performing any mounts or binds. This in practice creates a single point in time in which a process gives up all of its control, instead of the idealized gradual process. This makes it quite hard to make use of in practice, only a single program in a chain may be the one to invoke .CW RFNOMNT or must hope that no other program further in the chain may want to make use of its namespace. The interface itself feels very clunky, there is a nice gradual addition of these kernel devices to the namespace why must the removal be all at once? .SH Chdev .LP We propose a new write interface through /dev/drivers that functionally replaces .CW RFNOMNT . /dev/drivers now accepts writes in the form of .P1 chdev op devmask .P2 Devmask is a string of sharp device characters. Op specifies how devmask is interpreted. Op is one of .TS lw(1i) lw(4.5i). \f(CW&\fP T{ Permit access to just the devices specified in devmask. T} \f(CW&~\fP T{ Permit access to all but the devices specified in devmask. T} \f(CW~\fP T{ Remove access to all devices. Devmask is ignored. T} .TE .LP This allows a process to selectively remove access to sections of sharp devices with quite a bit of control. In order to mimic all of .CW RFNOMNT 's features, removing access to .CW devmnt , which is not normally accessible directly, disables the processes ability to perform mount and bind operations. .LP For the implementation, we extended the existing .CW RFNOMNT flag attached to the process namespace group into a bit vector. Each bit representing an index into .CW devtab . The following function illustrates how this vector is set. .P1 void devmask(Pgrp *pgrp, int invert, char *devs) { int i, t, w; char *p; Rune r; u64int mask[nelem(pgrp->notallowed)]; if(invert) memset(mask, 0xFF, sizeof mask); else memset(mask, 0, sizeof mask); w = sizeof mask[0] * 8; for(p = devs; *p != 0;){ p += chartorune(&r, p); t = devno(r, 1); if(t == -1) continue; if(invert) mask[t/w] &= ~(1<<t%w); else mask[t/w] |= 1<<t%w; } wlock(&pgrp->ns); for(i=0; i < nelem(pgrp->notallowed); i++) pgrp->notallowed[i] |= mask[i]; wunlock(&pgrp->ns); } .P2 Devmask is called from the write handler for /dev/drivers. This bitmask is then consulted any time a name is resolved that begins with '#'. This is done from within the .CW namec () function using the following function to check if a particular device .CW r is permitted. .P1 int devallowed(Pgrp *pgrp, int r) { int t, w, b; t = devno(r, 1); if(t == -1) return 0; w = sizeof(u64int) * 8; rlock(&pgrp->ns); b = !(pgrp->notallowed[t/w] & 1<<t%w); runlock(&pgrp->ns); return b; } .P2 .LP We found that once removal is made to a core verb of these sharp devices it becomes easy to start to view access to them as capabilities. This is aided by system functionally already neatly organized into the various devices themselves. For example, one could say a process is capable of accessing the broader internet if it has access to the .CW devip device. This access can either be direct via it's path under '#' or through a location in the namespace where this device had already been bound. With these changes, the entire capability list of a process is on display through just its /proc/$pid/ns file. This .CW ns file would indicate if a particular device is bound and now also includes the list of devices a process has access to. .LP In practice, this results in a pattern of binding in a sharp device, making use of them and removing them when no longer needed. A namespace file for a web server could now look like .P1 bind #s /srv # /srv/www created by srvfs www /lib/www mount /srv/www /lib/ unmount /srv chdev -r s # chdev &~ s .P2 In this example we have created a new root for the process by using exportfs to expose a little piece of the boot namespace. We unmount .CW devsrv and remove access to it with .CW chdev ensuring there is no way for our process to talk to the real .CW /srv/boot . This provides a nice succinct lifetime of access to .CW devsrv and makes the removal of these sharp devices as easy as it is to use them in the first place. .LP Like .CW RFNOMNT , .CW chdev does not restrict access to sharp devices that had already been mounted. This allows a process to use a subsection or only one piece of sharp devices as well. One example of this may be to restrict a process to just a single network stack .P1 bind '#I1' /net chdev -r I .P2 .SH /srv/clone .LP With this .CW chdev mechanism, the ability for a device to provide isolation of its own became more powerful. Partially illustrated in the previous .CW devip example. .CW Devsrv , the sharp device providing named pipes, was an ideal target for adding isolation. Devsrv provides a bulletin board of all posted 9p services for a given host. We wanted to provide a mechanism for a process, or family tree of process to share a private .CW devsrv between themselves. .LP The design for this was borrowed from devip, one in which a process opens a .CW clone file to read its newly allocated slot number. This new 'board' appears as a sibling directory to the .CW clone it was spawned from. This new board is itself a fully functioning .CW devsrv with its own clone file, making nesting to full trees of .CW srvs quite easy, and completely transparent. The following illustrates how one could replace their global .CW /srv with a freshly allocated one. .P1 </srv/clone { s='/srv/'^`{read} bind -c $s /srv exec p } .P2 Also like devip, once the last reference to the file descriptor returned by opening .CW clone is closed the board is closed and posters to that board receive an EOF. It is important to bake this kind of ownership into the design, as self referential users of .CW /srv are quite common in current code. .LP This along with chdev can be used to create a sandbox for /srv quite easily, the process allocates itself a new /srv then removes access to the global root srv. This allows potentially untrusted process to still make use of the interface without needing to worry about their access to the global state. The practice of having new boards appear as subdirectories allows the entire state to easily be seen by inspecting the root of devsrv itself. .SH Restricting Within a Mount .LP As shown earlier with the use of .CW srvfs , an intermediate file server can be used to only service a small subsection of a larger namespace. In that example we used this to expose only /lib/www from the host to processes running a web server. This can be limited as the invocation of .CW exportfs can become more complicated if the user wishes to use multiple pieces from completely separate places within the file tree. To address this a utility program .CW protofs was written to easily create convincing mimics of the filesystem it was run from. .CW protofs accepts a .CW proto file, a text file containing a description of file tree, and uses it to provide dummy files mimicking the structure. These dummies can then be used by a process as targets for bind mounts of its current namespace, providing the illusion of trimming all but select pieces. This new root cannot be simply bound over the real one, that still allows an unmount to escape back to the real system, but rexporting the namespace still works. To illustrate a more involved setup than before: .P1 # We want to provide our web server # with /bin, /lib/www and /lib/git ; cat >>/tmp/proto <<. bin d775 lib d775 www d775 git d775 . ; protofs -m /mnt/proto /tmp/prot ; bind /bin /mnt/proto/bin ; bind /lib/www /mnt/proto/lib/www ; bind /lib/git /mnt/proto/lib/git # A private srv could be used, omitted for brevity ; srvfs webbox /mnt/proto # Namespace file for using our new mini-root ; cat >>/tmp/ns <<. mount #s/webbox /root bind -b /root / chdev -r s . ; auth/newns -n /tmp/ns ls / bin lib ; .P2 .SH Future Work .LP While we think these bring us closer to namespaces as security boundaries, there is still plenty of work and understanding to be done. One particular item of interest is attempting some kind of isolation of .CW devproc , possibly in a similar fashion to the .CW /srv/clone implementation, but attempts have yet to be made. The exact nature of .CW namespace files and how they relate to sandboxing as a whole has yet to be fully worked out. There is clear potential, but it is likely additional abilities may be required. It is somewhat difficult to synthesize a namespace entirely from nothing, which is something we found ourselves reaching for when building alternative roots to run processes within. There is potential for some merger of .CW proto and .CW namespace files to provide a template of the current namespace to graft on to the next one. .LP Both .CW chdev and .CW /srv/clone are merged into 9front and their implementations are freely available as part of the base system. .SH References .LP [Beck18] Bob Beck, ``Pledge, and Unveil, in OpenBSD'', .I "BSDCan Slides" Ottawa, July, 2018. .LP [Price04] Daniel Price, Andrew Tucker, ``Solaris Zones: Operating System Support for Consolidating Commercial Workloads'', .I "Proceedings of the 18th Large Installation System Administration Conference" pp. 241-254, Atlanta, November, 2004. .LP [Biederman06] Eric W. Biederman ``Multiple Instances of the Global Linux Namespaces'', .I "Proceedings of the 2006 Linux Symposium Volume One" pp. 102-112, Ottawa, Ontario July, 2006.