.HTML "neoventi
.TL
Neoventi
.AU
Noam Preil <noam@pixelhero.dev>
.AB
.CW neoventi
is a forward-looking, backwards-compatible reimplementation of the venti disk storage system, intended to address shortcomings in the prototype system while retaining the useful properties it revealed. A further benefit of creating a second implementation is a better understanding of the original prototype, enabling further refinements to the original project.
.AE
.SH
Background
.PP
The venti disk system is a networked archival data storage system, which can be most effectively understood as a disk-resident write-once hash table, in which keys must always be the hash of their corresponding value [1]. Venti is intended to be one piece of a larger system, useful as the backbone for backup systems and snapshotting file systems [1]. Fossil is an archival file system built on top of venti, originally intended to supplant the Plan 9 file server [5].
.PP
The venti disk system has interesting properties that enable unusual features, largely born out of simple but niche design choices, such as hash-based addressing and an append-only primary data log. Unfortunately, venti is also explicitly a prototype [1] and has both design and implementation limitations that cannot be easily addressed [2].
.PP
Further, many of the issues that plague the venti prototype's code base are complicated and entangled. The process of writing an implementation intended, from the beginning, to
.I not
be merely a prototype inherently lends itself to a better understanding of the design of the original prototype and, thus, of how to fix the issues. Multiple bugs in venti have been fixed as a result of the neoventi work; rewriting the code does not need to mean abandoning the original but can instead complement it. Having divergent implementations provides a practical method of iterative improvement: testing them against each other.
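.PP
As a rough illustration of the write-once hash table model described at the start of this section, the following C sketch stores blocks in memory, keyed by a hash of their contents. It is a toy under stated assumptions, not venti's or neoventi's code: the names blockwrite, blockread, and score_of are hypothetical, the table is a fixed-size in-memory array rather than a disk format, and FNV-1a stands in for SHA1 only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy model; not venti's code or disk format. */
enum { NSLOT = 64, MAXBLOCK = 256 };

typedef struct Block Block;
struct Block {
	int used;
	uint64_t score;			/* key: hash of data */
	size_t len;
	unsigned char data[MAXBLOCK];
};

static Block store[NSLOT];

/* FNV-1a, an assumption standing in for SHA1 to keep the sketch short. */
static uint64_t
score_of(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t h = 14695981039346656037ULL;	/* FNV offset basis */
	size_t i;

	for(i = 0; i < len; i++){
		h ^= p[i];
		h *= 1099511628211ULL;		/* FNV prime */
	}
	return h;
}

/* Store a block. Returns 0 and sets *score, or -1 if the table is full.
 * Writing the same contents twice finds the existing slot and changes nothing. */
static int
blockwrite(const void *buf, size_t len, uint64_t *score)
{
	uint64_t s;
	size_t i, n;

	assert(len <= MAXBLOCK);
	s = score_of(buf, len);
	for(n = 0; n < NSLOT; n++){
		i = (s + n) % NSLOT;		/* linear probing */
		if(store[i].used && store[i].score == s){
			*score = s;
			return 0;		/* already present */
		}
		if(!store[i].used){
			store[i].used = 1;
			store[i].score = s;
			store[i].len = len;
			memcpy(store[i].data, buf, len);
			*score = s;
			return 0;
		}
	}
	return -1;
}

/* Read a block by score. Returns its length, or -1 if missing,
 * corrupt, or larger than the caller's buffer. */
static long
blockread(uint64_t score, void *buf, size_t cap)
{
	size_t i, n;

	for(n = 0; n < NSLOT; n++){
		i = (score + n) % NSLOT;
		if(!store[i].used)
			return -1;
		if(store[i].score != score)
			continue;
		/* recomputing the hash on read detects corruption */
		if(score_of(store[i].data, store[i].len) != score)
			return -1;
		if(store[i].len > cap)
			return -1;
		memcpy(buf, store[i].data, store[i].len);
		return (long)store[i].len;
	}
	return -1;
}
```

Writing the same bytes twice returns the same score and occupies a single slot, and every read recomputes the hash before returning data, so a damaged block is reported as an error rather than returned.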
.PP
From a client's perspective, the venti system is straightforward. A trivial binary protocol is implemented over TCP. Writing a variable-sized block using the protocol yields an address, which is the hash of the block's contents; the block may then be read back using its hash. This simple design yields a system with interesting properties, including idempotent writes, immutable blocks, and implicit corruption detection [1].
.PP
That client's model is inadequate for understanding a venti server, however, as implied by venti's man pages.
.P1
BUGS
	Setting up a venti server is too complicated.
.P2
.PP
[3]. Setting up a venti server is complicated not merely because the tooling for doing so is inadequate - although this is certainly true - but because the venti server itself is complicated, and the primary inadequacy of the tooling is in failing to protect the operator from this complexity.
.PP
In many places, the venti documentation refers to "convention" with regard to how data is stored [4]. This includes the way individual files are stored in venti and the process by which large files and directories are stored. The documentation is misleading; many of the conventions are hardcoded into various parts of the venti tooling as assumptions, and violating the assumptions can cause problems up to and including data loss.
.PP
For instance, venti blocks are not merely data. They are associated as well with
.I "type tags" ;
the documentation claims that this tagging is done by venti [4]. This is contradicted by the documentation of the network protocol, which notes that each write operation must include a type tag, which venti stores. This mismatch of assumptions results in missed opportunities for useful tooling and, if the client supplies a tag that the data verification tooling considers "wrong," the tool may mark the block as invalid, rendering it unreadable, as can be seen in the
.CW venti/syncarena
code:
.P1
if(vttypevalid(cl.info.type) < 0){
	...
	broken = 1;
}
if(broken && fix){
	cl.info.type = VtCorruptType;
	if(writeclumphead(arena, aa, &cl) < 0){
		...
	}
}
.P2
[6].
.PP
Additionally, each block is tagged with the ID of the user who first wrote it, which has privacy and security implications; security in general is a major shortcoming of venti. The protocol was designed to support authentication and encryption, but neither the server nor any client implements this support in any way, and the documentation outright acknowledges that the system was designed so that it
.I could
be secure [4].
.PP
The implementation has also been plagued with a long string of bugs. Most known bugs have been fixed, but not all. Attempts have been made to fix some of the remaining issues, but the code base is complicated and entangled.
.PP
Fossil has historically been even less reliable than venti, and there are multiple known serious bugs that have never been fixed.
.PP
In the time since fossil was originally conceived, the Plan 9 file server has indeed been supplanted, but not by fossil. The predominant file systems currently in use in the Plan 9 world are cwfs and hjfs [7], both of which run in user space. In fact, the file server / auth server / cpu server / terminal distinction has largely been erased, as there is now a unified kernel that can act in any desired configuration [8]. There have been other file systems proposed, as well, all of which run in user space [9] [10].
.PP
However, largely for reliability reasons, fossil has since been supplanted as well. While there are a few remaining users of the system [11], most have abandoned it, and for good reason: for a long time, there were serious data loss / corruption bugs!
.PP
Despite the flaws, the venti system possesses many useful and interesting properties that are worth preserving, which motivates the redesign and implementation of a successor system.
Some of these features are shared with other file systems (most notably Plan 9 file systems, due to their shared histories), and others are unique to the venti system, or to the venti/fossil combination.
.PP
Many of the flaws have also been slowly addressed over the decades. Fossil supposedly had one major data loss/corruption issue that bothered many of its users, which was reportedly fixed around the early 2010s, though confirmation of this is tricky [13]. There are few users today, but we do exist, and these issues seem largely gone.
.PP
On the other hand, there continue to be many known bugs, from outstanding locking issues to bad behavior when fossil runs out of disk space to tooling claiming there are multiple PiB of free space on a file system sized under 10GiB [14].
.PP
The idea of writing neoventi arose slowly during my work on the venti system. Some of the bugs were simple enough to fix. After months of work, though, and after issues ranging from the usage of
.I "signed integers"
for memory sizes, to an alignment bug that would cause a crash if the cache was set to
.I exactly
128MiB, to the discovery that over 1KLoC in libventi intended as a performance optimization actually made things
.I slower ,
to the realization that a code cleanup would not work because one of four layers of entangled abstractions depended on undocumented behavior in another in subtle ways, I eventually got fed up enough that I no longer wished to work on the venti code base. I still wished to
.I use
a venti system, though, and there were features I wanted to experiment with adding to venti that I did not think could be reasonably added to the existing code base, so after months of wavering, I gave in and decided to rewrite venti from scratch.
.SH
neoventi
.PP
neoventi currently implements a read-only venti server in under 900 lines of code.
It is fully compatible with the venti disk format, and has been tested on multiple existing ventis, including the one that backed the fossil it was running from. neoventi is, right now, intended primarily as a proof-of-concept, and a demonstration that venti can be simplified greatly from its current incarnation.
.PP
neoventi is largely written as a reaction against perceived shortcomings in venti. For instance, to process a read request, the venti server goes through more than five layers of abstraction: fcalls, RPCs, the Packet layer, Fragments, ZBlocks, and packing and conversion code to go between the layers. This complexity necessitates a lot of code to manage and copy data around. This is a burden for maintenance, and introduces a lot of subtle behavior.
.PP
By contrast, to implement the same functionality, neoventi deliberately goes as far in the opposite direction as possible, and does not use even a single layer of abstraction. This difference in approach can be seen easily in the code for establishing a connection with the client. Here is the code for reading the hello packet from the client in the original server:
.P1
if((p = vtrecv(z)) == nil)
	return -1;
if(vtfcallunpack(&tx, p) < 0){
	packetfree(p);
	return -1;
}
packetfree(p);
.P2
.PP
This code receives a packet as a Packet abstraction, and unpacks the request into a VtFcall, before freeing the raw packet. It also has error handling code mixed throughout, as errors are, in C convention, passed up the stack. For contrast, here is the equivalent code in neoventi:
.P1
vtrecv(conn, buf);
.P2
.PP
There are a few major differences. First, the rest of the routine for handling the hello request in neoventi operates on the incoming buffer directly. Where venti requires multiple memory copies and packet preprocessing, neoventi shuffles bytes around and slams them back over the network.
Where venti constructs a Fcall response, and sets all of its fields, neoventi just reuses the same buffer into which the client sent the request, and preserves the packet tag.
.PP
Secondly, neoventi uses stack allocation for buffers, and avoids heap allocations. This means that there is no need for constant memory allocations and frees, and is enabled by the first change.
.PP
Thirdly, neoventi uses a setjmp buffer for error handling, instead of passing errors up the stack. This is enabled in large part by the usage of stack allocation: the stack does not need to unwind, because there is nothing to clean up!
.PP
In practice, the venti protocol is not terribly complicated, and it is not difficult to get right - and bugs in packet processing are trivial to investigate and fix, where leaky abstractions and manual memory management and error handling are not. It is noteworthy that the entirety of neoventi needed for read-only operations occupies less than 900 lines of code, whereas venti's packet layer alone is over 1,000 lines of code, and it uses another 400 lines for the RPC/Fcall code. Venti needs significantly more code just to
.I "read packets"
than neoventi needs for an entire functional server!
.PP
neoventi's performance leaves much to be desired, in large part due to the total lack of caching. Nonetheless, it tests within 10% of venti's performance for cold reads, and that's despite the fact that manual investigation of the block structure being read suggests that blocks were being read from disk on average more than three times each.
.PP
neoventi's read pathway has been tested thoroughly. Testing was conducted using a fossil system root mounted via vacfs as a read-only root, a ramfs for a write buffer, and wbfs to deeply unionize them [16].
Tests included playback of multimedia, a full system compile of my Plan 9 branch, full scans of the git history and every version of every file ever stored in the distro, and streaming of game ROMs using Plan 9's built-in emulators [17]. No bugs were found, and performance was more than acceptable: notable results include 720p 30FPS video streaming through neoventi in real time with no stuttering [18].
.SH
Use Cases and Future Work
.PP
Venti/fossil is already a useful system. I have accomplished much with it that I would not have been able to with any other system I am aware of. During the last 9front hackathon, I was able to sync my laptop's venti with my file server 50 miles away in under fifteen seconds. This works only because syncing two ventis can be reduced to a copy of fully linear data. Data in venti exists independently of its physical address, and so can be copied to an arbitrary location on the destination machine. After syncing from one venti to another, one can note down the latest address in the source venti. The next sync is effectively just a copy of all data between the prior highest address and the current latest block - there is no need for any indirection or cleverness whatsoever.
.PP
Fossil's snapshotting mechanism also allows for a file system to be forked across machines and merged as with branches in a git repository. At present, this is entirely a manual process, and can be painful if the trees have diverged. In principle, however, git can be implemented directly on top of a venti+fossil system.
.PP
Further, it is possible to fully automate the synchronization of an arbitrary number of ventis using a peer-to-peer torrenting system. This would require engineering effort, but should not be terribly complicated; Plan 9 already has a native torrenter, and there's no need to reinvent the mechanisms here [19].
venti itself could be modified to do this in the background; multiple file systems could then be automatically and seamlessly synced across arbitrary sets of machines.
.PP
Venti currently uses SHA1 for its hash, but SHA1 is no longer sufficient; there are real-world examples of SHA1 collisions. Ideally, it should be made trivial to swap the hash function for future-proofing, so that when the hash function which replaces SHA1 is itself broken, swapping the function again is not painful.
.PP
There are many security and privacy holes in venti's design that should be fixed. Anyone with access to a venti system has, in essence, full permissions to all data in every file system stored in the system. The only current practical mitigation is to simply prevent access to the venti in the first place, and only expose a file system on top of it, such as fossil, but this can greatly reduce the benefits of venti's deduplication.
.PP
neoventi is intended to fully replace venti for practical use, and to do so by providing a superset of venti's functionality, with none of its bugs, and better performance and security. Currently, the only one of these goals which is met is the lack of venti's bugs, but that's a ridiculous claim to make when neoventi does not currently even support accepting new data, which makes avoiding data loss or corruption a trivial matter!
.PP
The most pressing immediate work is to finish venti compatibility by adding in write support. After that, more directly useful projects, such as improvements to the disk formats, upgrading the hash function, and replacing the venti protocol with a 9p implementation, will be possible. Performance is a tertiary goal, though it will likely get some focus before new features are added since optimization is often a pleasurable endeavour.
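.PP
The incremental sync described earlier in this section can be modelled in a few lines of C. This is a sketch under assumptions, not venti's arena format or any existing tool: Log, logappend, and logsync are hypothetical names, and the log is a small in-memory byte array standing in for the append-only data log.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { LOGCAP = 1024 };

/* Hypothetical model of an append-only data log; not venti's arena format. */
typedef struct Log Log;
struct Log {
	unsigned char data[LOGCAP];
	size_t end;		/* next free offset; only ever grows */
};

/* Append len bytes to the log. Returns 0, or -1 if they do not fit. */
static int
logappend(Log *l, const void *buf, size_t len)
{
	if(len > LOGCAP - l->end)
		return -1;
	memcpy(l->data + l->end, buf, len);
	l->end += len;
	return 0;
}

/* Copy everything the source gained since mark onto the end of dst.
 * Returns the new mark to remember for the next sync,
 * or (size_t)-1 if the destination is out of space. */
static size_t
logsync(Log *dst, const Log *src, size_t mark)
{
	assert(mark <= src->end);
	if(logappend(dst, src->data + mark, src->end - mark) < 0)
		return (size_t)-1;
	return src->end;
}
```

The caller keeps the returned mark and passes it to the next logsync call. The destination appends at its own end, so source and destination offsets need not match, which mirrors the point that data in venti is independent of its physical address.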
.SH
Conclusion
.PP
neoventi provides a read-only drop-in replacement for the venti server, and meets its goal as a proof-of-concept: it demonstrates irrefutably that it is possible to implement a venti system with a fraction of the complexity used by the original implementation. neoventi needs more work to be able to advance beyond being simply a competing prototype, but a fresh start makes it easier to do so than to clean up venti. The code is available [20], and feedback is welcomed.
.SH
References and Footnotes
.LP
[1] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
.I "Usenix Conference on File and Storage Technologies" ,
2002.
.LP
[2] My only source here is personal experience.
.LP
[3] venti(8)
.LP
[4] venti(6)
.LP
[5] Sean Quinlan, Jim McKie, and Russ Cox, ``Fossil, an Archival File Server,'' /sys/doc/fossil.ms
.LP
[6] Venti source code, /sys/src/cmd/venti/srv/syncarena.c
.LP
[7] This is statistical: 9front is the most active branch of Plan 9 at the moment, and most 9front users use either cwfs or hjfs, both of which run as user-space programs.
.LP
[8] 9front's
.I "Frequently Questioned Answers" ,
https://fqa.9front.org
.LP
[9] ``Good Enough File System'', http://shithub.us/ori/gefs/HEAD/info.html
.LP
[10] ``Another file system'', http://git.9front.org/plan9front/mafs/HEAD/info.html
.LP
[11] Source: I'm typing this paper on a machine that uses venti+fossil :)
.LP
[12] Sean Quinlan, ``A Cached WORM File System'',
.I "Software - Practice and Experience" ,
December, 1991
.LP
[13] Evidence leans towards http://9legacy.org/9legacy/patch/fossil-deadlocks.diff being the fix, though the history here is muddled and it's really hard to find a meaningful history of 9legacy's patches; since fossil appears to have been removed from 9front before the fix was made (in large part as a
.I result
of the bug!), and the patches in both plan9port and 9legacy are unclear, it's hard to tell, and nobody seems interested in discussion.
My one attempt at submitting a fix to 9legacy got no response; while my patch was accepted into 9legacy, I was not made aware of this and did not find out until I was looking through 9legacy's public information to try to piece together the history.
.LP
[14] These are all issues I have personally run into. I fixed a deadlock last year, but I've since run into another; I did not grab a trace at the time, and have not run into it again since.
.LP
[15] Patches can be found in the 9front tree, available via hjgit://git.9front.org/plan9front/plan9front.
.LP
[16] wbfs is a write-buffer file system, forked from kvik's unionfs. It takes two file systems, one read-only and the other read-write, and presents a unified image. Files in the read-only file system can be "modified" by writing an updated file to the write buffer file system, and further accesses will read back the instance from the write buffer. It is available at https://git.sr.ht/~pixelherodev/wbfs - it is not complete, but was sufficient for testing neoventi.
.LP
[17] See nintendo(1) for details on the emulators in question.
.LP
[18] In the interest of transparency, there was stuttering observed, but it was tested against a ramfs and determined to be a result of the CPU being unable to keep up with decoding and falling out of sync, and not related to neoventi at all.
.LP
[19] torrent(1).
.LP
[20] https://git.sr.ht/~pixelherodev/neoventi