shithub: gefs

Download patch

ref: cde5a56483d0abfe298adb67759daf1ce476e585
parent: cd68e32e158b03c74fd847dbc9995e0bdd3e4e99
author: Ori Bernstein <ori@eigenstate.org>
date: Sat Apr 15 11:23:35 EDT 2023

doc: add paper to repo

--- /dev/null
+++ b/gefs.ms
@@ -1,0 +1,745 @@
+.am DS
+.ft I
+..
+.ta 1i 2.3i 4.5i  (optional to set tabs)
+.TL
+GEFS, A Good Enough File System
+.AU
+Ori Bernstein
+ori@eigenstate.org
+.AB
+GEFS is a new file system built for Plan 9.
+It aims to be a crash-safe, corruption-detecting, simple, and fast snapshotting file system, in that order.
+GEFS achieves these goals by building a traditional 9p file system interface on top of a forest of copy-on-write Bε trees.
+It doesn't try to be optimal on all axes, but good enough for daily use.
+.AE
+.NH 1
+The Current Situation
+.PP
+Currently, Plan 9 has several general purpose disk file systems available.
+All of them share a common set of problems.
+On power loss, the file systems tend to get corrupted, requiring manual recovery steps.
+Silent disk corruption is not caught by the file system, and reads will silently return incorrect data.
+They tend to require a large, unshrinkable disk for archival dumps, and behave poorly when the disk fills.
+Additionally, all of them do O(n) scans to look up files in directories when walking to a file.
+This causes poor performance in large directories.
+.PP
+CWFS, the default file system on 9front, has proven to be performant and reliable, but is not crash safe.
+While the root file system can be recovered from the dump, this is inconvenient and can lead to a large amount of lost data.
+It has no way to reclaim space from the dump.
+In addition, due to its age, it has a lot of historical baggage and complexity.
+.PP
+HJFS, a new experimental system in 9front, is extremely simple, with fewer lines of code than any of the other on-disk storage options.
+It has dumps, but does not separate dump storage from cache storage, allowing full use of small disks.
+However, it is extremely slow, not crash safe, and lacks consistency check and recovery mechanisms.
+.PP
+Finally, fossil, the default file system on 9legacy, is large and complicated.
+It uses soft-updates for crash safety[7], an approach that has worked poorly in practice for the BSD filesystems[8].
+Fossil also has a history (and present) of deadlocks and crashes due to its inherent complexity.
+There are regular reports of deadlocks and crashes when using tools such as clone[9].
+While the bugs can be fixed as they're found, simplicity requires a rethink of the on disk data structures.
+And even after adding all this complexity, the fossil+venti system provides no way to recover space when the disk fills.
+.NH 1
+How GEFS Solves it
+.PP
+GEFS aims to solve the problems with these file systems.
+The data and metadata is copied on write, with atomic commits.
+If the file server crashes before the superblocks are updated,
+then the next mount will see the last commit that was synced to disk.
+Some data may be lost, by default up to 5 seconds' worth, but no corruption will occur.
+Furthermore, because of the use of an indexed data structure, directories do not suffer from O(n) lookups,
+solving a long-standing performance issue with large directories.
+.PP
+While snapshots are useful to keep data consistent, disks often fail over time.
+In order to detect corruption and allow space reclamation, block pointers are triples of 64-bit values:
+the block address, the block hash, and the generation in which the block was born.
+If corrupted data is returned by the underlying storage medium, this is detected via the block hashes.
+The corruption is reported, and the damaged data may then be recovered from backups, RAID restoration, or some other means.
+Eventually, the goal is to make GEFS self-healing.
+.PP
+Archival dumps are replaced with snapshots.
+Snapshots may be deleted at any time, allowing data within a snapshot to be reclaimed for reuse.
+To enable this, in addition to the address and hash, each block pointer contains a birth generation.
+Blocks are reclaimed using a deadlist algorithm inspired by ZFS.
+.PP
+Finally, the entire file system is based around a relatively novel data structure.
+This data structure is known as a Bε tree [1].
+It's a write optimized variant of a B+ tree, which plays particularly nicely with copy on write semantics.
+This allows GEFS to greatly reduce write amplification seen with traditional copy on write B-trees.
+.PP
+And as a bonus, it solves these problems with less complexity.
+By selecting a suitable data structure, a large amount of complexity elsewhere in the file system falls away.
+The complexity of the core data structure pays dividends.
+Being able to atomically update multiple attributes in the Bε tree,
+making the core data structure safely traversable without locks,
+and having a simple, unified set of operations makes everything else simpler.
+As a result, the total source size of GEFS is currently 8737 lines of code, as compared to CWFS at 15634, and the
+fossil/venti system at 27762 lines of code (20429 for fossil, with an additional 7333 lines for Venti).
+.NH 1
+Bε Trees: A Short Summary
+.PP
+The core data structure used in GEFS is a Bε tree.
+A Bε tree is a modification of a B+ tree, which optimizes writes
+by adding a write buffer to the pivot nodes.
+Like B-trees, Bε trees consist of leaf nodes, which contain keys and values, and pivot nodes.
+Like B-trees, the pivot nodes contain pointers to their children, which are either pivot nodes or leaf nodes.
+Unlike B-trees, the pivot nodes also contain a write buffer.
+.PP
+The Bε tree implements a simple key-value API, with point queries and range scans.
+It diverges from a traditional B-tree key-value store by the addition of an upsert operation.
+Upsert operations are operations that insert a modification message into the tree.
+These modifications are addressed to a key.
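+.PP
+Concretely, a message can be thought of as a small tagged record.
+The sketch below is illustrative only; the type and operation names are
+hypothetical and do not mirror the GEFS source.
+.P1
+typedef struct Msg Msg;
+struct Msg {
+	int	op;	/* Oinsert, Odelete, Owstat, ...: how to combine with the old value */
+	Key	k;	/* the key this message is addressed to */
+	Val	v;	/* the operand: a full value, a delta, or nothing */
+};
+.P2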
+.PP
+To insert into the tree, the root node is copied, and the new message is
+inserted into its write buffer.
+When the write buffer is full, it is inspected, and the number of messages directed
+to each child is counted up.
+The child with the largest number of pending writes is picked as the victim, and
+the root's write buffer is flushed towards it.
+This proceeds recursively down the tree, until either an intermediate node has
+sufficient space in its write buffer, or the messages reach a leaf node, at which
+point the value in the leaf is updated.
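+.PP
+As a rough sketch, the victim selection described above might look like the
+following; the types and helpers (Node, childidx, flushinto, Maxfanout) are
+hypothetical illustrations, not the GEFS source.
+.P1
+/* flush the fullest child's share of the buffered messages down one level */
+static void
+flush(Node *n)
+{
+	int i, c, best;
+	int count[Maxfanout];
+
+	memset(count, 0, sizeof count);
+	for(i = 0; i < n->nmsg; i++){
+		c = childidx(n, n->msg[i].k);	/* which child's key range covers this message */
+		count[c]++;
+	}
+	best = 0;
+	for(c = 1; c < n->nchild; c++)
+		if(count[c] > count[best])
+			best = c;
+	flushinto(n, best);	/* copy the child, move its messages down, recurse if it overflows */
+}
+.P2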
+.PP
+In order to query a value, the tree is walked as normal, however the path to the
+leaf node is recorded.
+When a value is found, the write buffers along the path to the root are inspected,
+and any messages that have not yet reached the leaves are applied to the final
+value read back.
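+.PP
+A point query can then be sketched as follows; lookupleaf, keycmp, apply, and
+the recorded path are again hypothetical names, standing in for the walk that
+records pivots and the replay of buffered messages over the stored value.
+.P1
+Val
+get(Tree *t, Key k)
+{
+	int i, j, npath;
+	Node *path[Maxdepth];
+	Val v;
+
+	npath = 0;
+	v = lookupleaf(t, k, path, &npath);	/* records the pivots from root to leaf */
+	for(i = npath - 1; i >= 0; i--)		/* older messages sit lower in the tree */
+		for(j = 0; j < path[i]->nmsg; j++)
+			if(keycmp(path[i]->msg[j].k, k) == 0)
+				v = apply(v, path[i]->msg[j]);
+	return v;
+}
+.P2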
+.PP
+Because mutations to the leaf nodes are messages that describe a mutation, updates to
+data may be performed without inspecting the data at all.
+For example, when writing to a file, the modification time and QID version of the file
+may be incremented without inspecting the current QID; a 'new version' message may
+be upserted instead.
+This allows skipping read-modify-write cycles that access distant regions of the tree,
+in favor of a simple insertion into the root node's write buffer.
+Additionally, because all upserts go into the root node, a number of operations may
+be upserted in a single update. As long as we ensure that there is sufficient space
+in the root node's write buffer, the batch insert is atomic.
+Inserts and deletions are upserts, but so are mutations to existing data.
+.PS
+.ps 6
+.vs 4
+boxht=0.2
+down
+
+R: [
+	right
+R0:	box "k0" wid 0.2
+	box "k16" wid 0.2
+	box "k32" wid 0.2
+R1:	box "k48" wid 0.2
+	box "m0" wid 0.2 fill
+	box "m1" wid 0.2 fill
+	box  wid 0.6 fill
+]
+move down 0.5
+P: [
+	right
+P0:	box "k0" wid 0.2
+P1:	box "k4" wid 0.2
+	box "k8" wid 0.2
+	box "k12" wid 0.2
+	box "m0" wid 0.2 fill
+	box "m1" wid 0.2 fill
+	box  wid 0.6 fill
+	
+	box invis wid 1 "..."
+	
+P2:	box "k48" wid 0.2
+	box "k56" wid 0.2
+	box "k60" wid 0.2
+	box "k64" wid 0.2
+	box "m0" wid 0.2 fill
+	box "m1" wid 0.2 fill
+	box  wid 0.6 fill
+]
+move down 0.5
+
+L: [
+	right
+L0:	box "k0" wid 0.2
+	box "v0" wid 0.2
+	box "..." wid 0.2
+	box "k3" wid 0.2
+	box "v3" wid 0.2
+
+	box invis wid 1
+
+L1:	box "k4" wid 0.2
+	box "v4" wid 0.2
+	box "..." wid 0.2
+	box "k7" wid 0.2
+	box "v7" wid 0.2
+
+B0:	box invis wid 1 "..."
+
+L2:	box "k48" wid 0.2
+	box "v49" wid 0.2
+	box "..." wid 0.2
+	box "k54" wid 0.2
+	box "v55" wid 0.2
+]
+
+arrow from R.R0.s to P.P0.n
+arrow from R.R1.s to P.P2.n
+
+arrow from P.P0.s to L.L0.n
+arrow from P.P1.s to L.L1.n
+arrow from P.P2.s to L.L2.n
+.PE
+.PP
+In GEFS, for the sake of simplicity all blocks are the same size.
+Unfortunately, this implies that the Bε tree blocks are smaller than optimal,
+and the disk blocks are larger than optimal.
+.PP
+Within a single block, the pivot keys are stored as offsets to variable width data.
+The data itself is unsorted, but the offsets pointing to it are sorted.
+This allows O(1) access to the keys and values, while allowing variable sizes.
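+.PP
+A minimal sketch of this lookup follows; the layout and byte order here are
+illustrative, but the idea matches the figure below: a sorted offset table at
+the front of the block indexing unsorted, variable-width entries at the back.
+.P1
+/* block layout: [nent][off[0] ... off[nent-1]][unsorted key/value bytes] */
+/* O(1) access to the i'th smallest key: jump through the sorted offset table */
+uchar*
+blkent(uchar *blk, int i)
+{
+	ushort off;
+
+	off = blk[2 + 2*i] | blk[2 + 2*i + 1] << 8;	/* offset table starts after a 2-byte count */
+	return blk + off;
+}
+.P2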
+.PS
+.ps 6
+.vs 4
+boxht=0.3
+box "o0" wid 0.2
+box "o1" wid 0.2
+box "o2" wid 0.2
+box  "unused" wid 0.8 fill
+box "k2" wid 0.2
+box "v2" wid 0.7
+box "k0" wid 0.2
+box "v0" wid 0.3
+box "k1" wid 0.4
+box "v1" wid 0.2
+.PE
+.PP
+In order to allow for efficient copy on write operation, the Bε tree in GEFS relaxes several
+of the balance properties of B-trees [5].
+It allows for a smaller amount of fill than would normally be required, and merges nodes with
+their siblings opportunistically.
+In order to prevent sideways pointers between sibling nodes that would need copy on write updates,
+the fill levels are stored in the parent blocks, and updated when updating the child pointers.
+.NH 1
+Mapping Files to Bε Operations
+.PP
+With a description of the core data structure completed, we now need
+to describe how a file system is mapped onto Bε trees.
+.PP
+A GEFS file system consists of a snapshot tree, which points to a number of file system trees.
+The snapshot tree exists to track snapshots, and will be covered later.
+Each snapshot points to a single GEFS metadata tree, which contains all file system state for
+a single version of the file system.
+GEFS is unusual in that all file system data is recorded within a single flat key-value
+store.
+There are no directory structures, no indirect blocks, and no other traditional structures.
+Instead, GEFS has the following key-value pairs:
+.LP
+.CW "Kdat(qid, offset) → (ptr)"
+.IP
+Data keys store pointers to data blocks.
+The key is the file qid, concatenated to the block-aligned file offset.
+The value is the pointer to the data block that is being looked up.
+.LP
+.CW "Kent(pqid, name) → (stat)"
+.IP
+Entry keys contain file metadata.
+The key is the qid of the containing directory, concatenated to the name of the file within the directory.
+The value is a stat struct, containing the file metadata, including the qid of the directory entry.
+.LP
+.CW "Kup(qid) → Kent(pqid, name)"
+.IP
+Up keys are maintained so that '..' walks can find their parent directory.
+The key is the qid of the directory.
+The value is the key for the parent directory.
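+.PP
+A sketch of how such a key might be packed is shown below; the exact encoding
+is illustrative rather than the on-disk format.
+What matters is that the type tag and qid come first, so all entries of one
+directory (and all data blocks of one file) sort adjacently.
+.P1
+enum { Kdat, Kent, Kup };	/* key type tags (illustrative values) */
+
+/* Kent(pqid, name): type byte, big-endian pqid, then the name */
+int
+packent(uchar *buf, int nbuf, uvlong pqid, char *name)
+{
+	int i, n;
+
+	n = strlen(name);
+	if(nbuf < 1 + 8 + n)
+		return -1;
+	buf[0] = Kent;
+	for(i = 0; i < 8; i++)
+		buf[1+i] = (pqid >> 8*(7-i)) & 0xff;	/* big-endian so memcmp() sorts numerically */
+	memcpy(buf+9, name, n);
+	return 1 + 8 + n;
+}
+.P2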
+.PP
+Walking a path is done by starting at the root, which has a parent qid of ~0 and an empty name.
+The QID of the root is looked up, and the key for the next step on the walk is constructed
+by concatenating the walk element with the root qid.
+This produces the key for the next walk element, which is then looked up, and the next key
+for the walk path is constructed. This continues until the full walk has completed.
+If one of the path elements is '..' instead of a name, then the up key is inspected
+instead to find the parent link of the directory.
+.PP
+If we had a file hierarchy containing the paths 'foo/bar', 'foo/baz/meh', 'quux', 'blorp',
+with 'blorp' containing the text 'hello world', this file system may be represented
+with the following set of keys and values:
+.P1
+Kdat(qid=3, off=0) → Bptr(off=0x712000, hash=04a73, gen=712)
+Kent(pqid=1, name='blorp') → Dir(qid=3, mode=0644, ...)
+Kent(pqid=1, name='foo') → Dir(qid=2, mode=DMDIR|0755, ...)
+Kent(pqid=1, name='quux') → Dir(qid=4, mode=0644, ...)
+Kent(pqid=2, name='bar') → Dir(qid=6, mode=DMDIR|0755, ...)
+Kent(pqid=2, name='baz') → Dir(qid=5, mode=DMDIR|0755, ...)
+Kent(pqid=5, name='meh') → Dir(qid=7, mode=0600, ...)
+Kent(pqid=-1, name='') → Dir(qid=1, mode=DMDIR|0755, ...)
+Kup(qid=2) → Kent(pqid=-1, name='')
+Kup(qid=5) → Kent(pqid=1, name='foo')
+.P2
+Note that all of the keys for a single directory are grouped because they sort together,
+and that if we were to read a file sequentially, all of the data keys for the file would
+be similarly grouped.
+.PP
+If we were to walk 
+.CW "foo/bar"
+then we would begin by constructing the key
+.CW "Kent(-1, '')"
+to get the root directory entry.
+The directory entry contains the qid.
+For this example, let's assume that the root qid is 123.
+The key for
+.CW foo
+is then constructed by concatenating the root qid to the first walk name, giving the key
+.CW "Kent(123, foo)"
+This is then looked up, giving the directory entry for 
+.CW foo .
+If the directory entry contains the qid 234, then the key
+.CW "Kent(234, bar)"
+is then constructed and looked up.
+The walk is then done.
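+.PP
+The walk described above can be sketched as a loop; treeget is a hypothetical
+point query into the metadata tree, Dent a decoded directory entry, and
+packent the key-packing sketch from the previous section.
+.P1
+/* walk one name at a time, rebuilding the entry key from the parent qid */
+int
+walk(Tree *t, char **names, int nname, Dent *d)
+{
+	int i, nk;
+	uvlong pqid;
+	uchar k[Keymax];
+
+	pqid = (uvlong)-1;			/* the root's parent qid is ~0... */
+	nk = packent(k, sizeof k, pqid, "");	/* ...and its name is empty */
+	for(i = 0;; i++){
+		if(treeget(t, k, nk, d) < 0)	/* look up the directory entry */
+			return -1;
+		if(i == nname)
+			break;
+		nk = packent(k, sizeof k, d->qid, names[i]);
+	}
+	return 0;
+}
+.P2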
+.PP
+Because a Bε tree is a sorted data structure, range scans are efficient.
+As a result, listing a directory is done by doing a range scan of all keys
+that start with the qid of the directory entry.
+.PP
+Reading from a file proceeds in a similar way, though with less iteration: the
+qid of the file is already known, so the block key is created by
+concatenating the file qid with the read offset.
+This is then looked up, and the address of the block containing the data is found.
+The block is then read, and the data is returned.
+.PP
+Writing proceeds in a similar manner to reading, and in the general case begins by
+looking up the existing block containing the data so that it can be modified and
+updated.
+If a write happens to fully cover a data block, then a blind upsert of the data
+is done instead.
+Atomically along with the upsert of the new data, a blind write of the version number increment,
+mtime, and muid is performed.
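+.PP
+The aligned, full-block case can be sketched as one batched upsert; the
+message constructors here (msgput, msgbump, msgwstat) are hypothetical, but
+the point is that all of the messages land in the root's write buffer in a
+single atomic batch, with no read of the old block or the old directory entry.
+.P1
+/* full block write: no read-modify-write, just blind messages */
+Msg m[3];
+
+m[0] = msgput(Kdat, qid, off, newptr);		/* replace the data block pointer */
+m[1] = msgbump(Kent, pqid, name, Qidvers);	/* increment the qid version */
+m[2] = msgwstat(Kent, pqid, name, now, muid);	/* set mtime and muid */
+btupsert(t, m, 3);				/* one atomic batch into the root's buffer */
+.P2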
+.PP
+Stats and wstat operations both construct and look up the keys for the directory entries,
+either upserting modifications or reading the data back directly.
+.NH 1
+Snapshots
+.PP
+GEFS supports snapshots.
+Each snapshot is referred to by a unique integer id, and is fully immutable once it is taken.
+For human use, a snapshot may be labeled.
+The labels may move to new snapshots, either automatically if they point to a tip of a snapshot, or via user intervention.
+For the sake of space reclamation, snapshots are reference counted.
+Each snapshot takes a reference to the snapshot it descends from.
+Each label also takes a reference to the snapshot that it points at.
+When a snapshot's only reference is its descendant, then it is deleted, and any space it uses exclusively is reclaimed.
+The only structures GEFS keeps on disk are snapshots, which are taken every 5 seconds.
+This means that in the case of sudden termination, GEFS may lose up to 5 seconds of data, but the data on disk will always be
+consistent.
+.PP
+If there was no space reclamation in gefs, then snapshots would be trivial.
+The tree is copy on write.
+Therefore, as long as blocks are never reclaimed, it would be sufficient to save the current root of the tree
+once all blocks in it were synced to disk.
+However, because snapshots are taken every 5 seconds, disk space would get used uncomfortably quickly.
+As a result, space needs to be reclaimed.
+An advantage of Bε trees here is that often, only the root block will be copied, and updates will be inserted
+into its write buffer.
+.PS
+.ps 6
+.vs 4
+boxht=0.2
+down
+
+R: [
+	right
+R0:	box  "piv" wid 0.4
+	box  "buf" wid 0.2 fill
+	box  wid 0.2 fill 0.75
+	move right 0.5
+R1:	box  "piv" wid 0.4
+	box  "buf" wid 0.3 fill
+	box  wid 0.1 fill 0.75
+]
+move down 0.5
+P: [
+	right
+P0:	box "piv" wid 0.4
+	box "buf" wid 0.4 fill
+	
+	box invis wid 1 "..."
+	
+P1:	box "piv" wid 0.4
+	box "buf" wid 0.4 fill
+]
+move down 0.5
+L: [
+	right
+L0:	box "vals" wid 1
+	box invis wid 1
+L1:	box "vals" wid 1
+	box invis wid 1 "..."
+L2:	box "vals" wid 1
+]
+
+arrow from R.R0.sw to P.P0.n
+arrow from R.R0.se to P.P1.n
+arrow from R.R1.sw to P.P0.n
+arrow from R.R1.se to P.P1.n
+arrow from P.P0.sw to L.L0.n
+arrow from P.P0.se to L.L1.n
+arrow from P.P1.s to L.L2.n
+.PE
+.PP
+There are a number of options for space reclamation.
+Some that were considered when implementing GEFS included garbage collection, in the style of HAMMER [3],
+or optimized reference counting in the style of BTRFS [4], but both of these options have significant downsides.
+Garbage collection requires that the entire disk get scanned to find unreferenced blocks.
+This means that there are scheduled performance degradations, and in the limit of throughput, the bandwidth spent scanning
+must approach the bandwidth spent on metadata updates, as each block must be scanned and then reclaimed.
+Reference counting implies a large number of scattered writes to maintain the reference counts of blocks.
+.PP
+As a result, the algorithm for space reclamation is lifted directly from ZFS [6].
+It is based on the idea of using deadlists to track blocks that became free within a snapshot.
+If snapshots are immutable, then a block may not be freed as long as any snapshot still references it.
+This implies that block lifetimes are contiguous:
+a block may not both exist in a snapshot and be available for reallocation.
+Thus, when freeing a block, there are 2 cases: Either a block was born within the pending snapshot, and died within it,
+or it was born in a previous snapshot and was killed by the pending snapshot.
+.PP
+To build intuition, let's start by imagining the crudest possible implementation of snapshot space reclamation.
+Assuming that block pointers contain their birth generation, we can walk the entire tree.
+When a block's birth time is <= the previous snapshot, it is referred to by an older snapshot.
+We may not reclaim it.
+If the subsequent snapshot refers to this block, then it was born in this snapshot but is still in use.
+We may not reclaim it.
+Otherwise, the block is free, and we can reclaim it.
+.PP
+Obviously, this is slow: It involves full tree walks of multiple snapshots.
+It may walk large numbers of blocks that are not freed.
+.PP
+So, in order to do better, we can keep track of blocks that we want to delete from this snapshot as we delete them,
+instead of trying to reconstruct the list when we delete the snapshot.
+When we attempt to delete a block, there are two cases:
+First, the block's birth time may be newer than the previous snapshot, in which case it may be freed immediately.
+And second, the block may have been born in the previous snapshot or earlier, in which case we need to put it on the current
+snapshot's deadlist.
+When the current snapshot is deleted, the current snapshot's deadlist is merged with the next snapshot's deadlist.
+All blocks on the deadlist that were born after the previous snapshot are freed.
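+.PP
+The decision made when freeing a block is small enough to sketch directly;
+reclaim and deadlistappend are hypothetical helpers, and prevgen stands for
+the generation of the most recently taken snapshot.
+.P1
+/* free a block from the pending snapshot */
+void
+killblk(Tree *t, Bptr b)
+{
+	if(b.gen > t->prevgen)
+		reclaim(t, b);		/* born and died here: free immediately */
+	else
+		deadlistappend(t, b);	/* still visible in older snapshots: defer */
+}
+.P2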
+.PS
+.ps 6
+.vs 4
+down
+H: [
+	P:[
+		move right 0
+		line <-
+		box invis "prev" wid 0.35
+	]
+	D: [
+		move right 0.5
+		line <-
+	D:	box invis "del" wid 0.35
+	] with .w at P.w - (0, P.ht)
+	N: [
+		move right 1
+		line <-
+	N:	box invis "next" wid 0.35
+	] with .w at D.w - (0, D.ht)
+S:	spline -> from D.D.e right 0.2 then to N.N.n
+	"merge" at S.nw + (0.1, 0.1)
+]
+S:[
+	right
+	line with .nw at H.sw + (0, 0.2)
+P:	[circle fill wid 0.1]
+	line
+D:	[circle below wid 0.1]
+	line
+N:	[circle fill wid 0.1]
+	"prev" at P.s + (0, - 0.1)
+	"del" at D.s + (0, -0.1)
+	"next" at N.s + (0, -0.1)
+]
+.PE
+.PP
+There's one further optimization we can do on top of this to make deletions extremely fast.
+The deadlists may be sharded by birth generation.
+When a snapshot is deleted, all deadlists within the snapshot are appended to the descendant
+snapshot, and any deadlists in the descendant with a birth generation after the deleted
+snapshot's predecessor may be reclaimed.
+With this approach, the only lists that need to be scanned are the ones consisting wholly of blocks that must be freed.
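+.PP
+Deleting a snapshot then reduces to moving its deadlists to the successor and
+freeing the lists that no remaining snapshot can reference, as sketched below
+with hypothetical Snap and Dlist types and helpers.
+.P1
+/* delete snapshot d, with predecessor p and successor n */
+void
+delsnap(Snap *d, Snap *p, Snap *n)
+{
+	Dlist *l, *next;
+
+	for(l = d->dead; l != nil; l = next){
+		next = l->next;
+		dlappend(n, l);		/* merge into n's list for the same birth generation */
+	}
+	for(l = n->dead; l != nil; l = next){
+		next = l->next;
+		if(l->bgen > p->gen)
+			dlfree(n, l);	/* every block on this list is now unreferenced */
+	}
+}
+.P2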
+.PS
+.ps 6
+.vs 4
+down
+H: [
+	P:[
+		move right 0
+	L0:	line <-
+	T:	box invis "b0" wid 0.35
+	L1:	line <- with .w at L0.w - (0, 0.15)
+		box invis "b1" wid 0.35
+	L2:	line <- with .w at L1.w - (0, 0.15)
+		box invis "b2" wid 0.35
+	]
+	box invis "prev (gen = 2)" with .w at P.e
+	D: [
+		move right 0.5
+	L0:	line <-
+		box invis "b0" wid 0.35
+	L1:	line <- at L0.w - (0, 0.15)
+	T:	box invis "b1" wid 0.35
+	L1:	line <- with .w at L1.w - (0, 0.15)
+		box invis "b2" wid 0.35
+	] with .w at P.w - (0, P.ht) fill
+	box invis "del (gen = 7)" with .w at D.e + (0.5, 0)
+	N: [
+		move right 1
+	L0:	line <-
+	T:	box invis "b0" wid 0.35
+	L1:	line <- with .w at L0.w - (0, 0.15)
+		box invis "b1" wid 0.35
+	L2:	line <- with .w at L1.w - (0, 0.15)
+		box invis "b7" wid 0.35
+		"(free)"
+	] with .w at D.w - (0, D.ht)
+	box invis "next (gen = 9)" with .w at N.e
+S:	spline -> from D.T.e right 0.2 then to N.T.n
+	"merge" at S.sw + (0.15, 0.15)
+	
+]
+S:[
+	right
+	line with .nw at H.sw + (0, 0.2)
+P:	[circle fill wid 0.1]
+	line
+D:	[circle below wid 0.1]
+	line
+N:	[circle fill wid 0.1]
+	"prev" at P.s + (0, - 0.1)
+	"del" at D.s + (0, -0.1)
+	"next" at N.s + (0, -0.1)
+]
+.PE
+.PP
+The disadvantage of this approach is that appending to the deadlists may need more random writes.
+This is because in the worst case, blocks deleted may be scattered across a large number of generations.
+It seems likely that in practice, most bulk deletions will touch files that were written in a small number of generations,
+and not scattered across the whole history of the disk.
+.PP
+The information about the snapshots, deadlists, and labels is stored in a separate
+snapshot tree. The snapshot tree, of course, can never be snapshotted itself.
+However, it's also a copy on write Bε tree where blocks are reclaimed immediately.
+It's kept consistent by syncing both the root of the snapshot tree and the freelists at the same time.
+If any blocks in the snapshot tree are freed, this freeing is only reflected after the snapshot tree is synced to disk fully.
+.PP
+The key-value pairs in the snapshot tree are stored as follows
+.LP
+.CW "Ksnap(id) → (tree)"
+.IP
+Snapshot keys take a unique numeric snapshot id.
+The value contains the tree root.
+This includes the block pointer for the tree, the snapshot generation of the tree, the previous snapshot of the tree,
+its reference count, and its height.
+.LP
+.CW "Klabel(name) → (snapid)"
+.IP
+Label keys contain a human-readable string identifying a snapshot.
+The value is a snapshot id.
+Labels regularly move between snapshots.
+When mounting a mutable snapshot, the label is updated to point at the latest snapshot every time the tree is synced to disk.
+.LP
+.CW "Kslink(snap, next) → ()"
+.IP
+A snap link key contains a snapshot id, and the id of one of its successors.
+Ideally, the successor would be a value, but our Bε tree requires unique keys, so we hack around it by putting both values
+into the key.
+When we have exactly one next link, and no labels that point at this snapshot, we merge with our successor.
+.LP
+.CW "Kdead(snap, gen) → (headptr, tailptr)"
+.IP
+A dead key contains a pair of snapshot id and deadlist generation.
+The value contains a head and tail pointer for a deadlist.
+These are used to quickly look up and merge deadlists, as described earlier in this paper.
+.NH 1
+Block Allocation
+.PP
+In GEFS, blocks are allocated from arenas.
+Within an arena, allocations are stored in a linked list of blocks, which is read at file system initialization.
+The blocks contain a journal of free or allocate operations, which free or allocate regions of disk.
+When the file system starts, it replays this log of allocations and frees, storing the available regions of blocks in an in-memory AVL tree.
+As the file system runs, it appends to the free space log, and occasionally compresses this log,
+collapsing adjacent free or used blocks into larger regions.
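+.PP
+The startup replay can be sketched as a loop over the journal; the record
+format and helper names here are illustrative, not the on-disk layout.
+.P1
+/* rebuild the in-memory free region tree from the allocation journal */
+void
+loadarena(Arena *a)
+{
+	Logent e;
+
+	while(readlogent(a, &e) == 0){
+		switch(e.op){
+		case LogAlloc:
+			avldelete(a->free, e.off, e.len);	/* region is in use */
+			break;
+		case LogFree:
+			avlinsert(a->free, e.off, e.len);	/* region is available */
+			break;
+		}
+	}
+}
+.P2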
+.PP
+Because of the copy on write structure, it's fairly common for metadata blocks to get allocated and deallocated rapidly.
+Drives (even solid state drives) care a lot about sequential access, so it's beneficial to make a best effort attempt at keeping
+data sequential.
+As a result, GEFS selects the arena to allocate from via round robin, offsetting by the type of block.
+If the round robin counter is 10, and we have 7 arenas, then data blocks (type 0) are allocated from arena 3 ((10+0)%7),
+pivot blocks (type 1) are allocated from arena 4 ((10+1)%7), and leaf blocks (type 2) are allocated from arena 5 ((10+2)%7).
+The round robin counter is incremented after every few thousand block writes, in order to balance writes across arenas.
+Since all arenas are the same, if an arena is full, we simply advance to the next arena.
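+.PP
+The arena selection is little more than the arithmetic above; a sketch, with
+hypothetical names, follows.
+.P1
+/* pick an arena for a block of the given type (0=data, 1=pivot, 2=leaf) */
+Arena*
+pickarena(Gefs *fs, int type)
+{
+	int i, n;
+	Arena *a;
+
+	for(i = 0; i < fs->narena; i++){
+		n = (fs->roundrobin + type + i) % fs->narena;
+		a = &fs->arenas[n];
+		if(!arenafull(a))
+			return a;	/* skip past full arenas */
+	}
+	return nil;			/* the disk is full */
+}
+.P2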
+.NH 1
+Process Structure
+.PP
+GEFS is implemented in a multiprocess manner.
+There is one protocol proc per posted service or listener, which dispatches 9p messages to the appropriate worker.
+Read-only messages get dispatched to one of many reader procs.
+Write messages get dispatched to the mutator proc, which modifies the in-memory representation of the file system.
+The mutator proc sends dirty blocks to the syncer procs.
+There is also a task proc which messages the mutator proc to do periodic maintenance such as syncing.
+The console proc also sends messages to the mutator proc to update snapshots and do other file system maintenance tasks.
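+.PP
+Dispatch in the protocol proc is by message type; roughly, and with
+hypothetical channel names and routing policy:
+.P1
+/* route each 9p message to a worker proc over a channel */
+switch(m->type){
+case Tread:
+case Tstat:
+	sendp(readerchan[m->fid % nreader], m);	/* read-only: any reader proc */
+	break;
+default:
+	sendp(mutatorchan, m);			/* mutating requests: the single mutator proc */
+	break;
+}
+.P2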
+.PS
+.ps 6
+.vs 4
+R: [
+	down
+C:	box "cons"	wid 0.7
+	move 0.5
+T:	box "task"	wid 0.7
+	move 0.5
+P0:	box "srv"	wid 0.7
+]
+move 0.5
+S: [
+	down
+M0:	box "mutator"	wid 0.7
+	move 0.5
+R0:	box "reader0"	wid 0.7
+	move 0.5
+R1:	box "reader1"	wid 0.7
+]
+move 0.5
+F: [
+	down
+S0:	box "syncer0"	wid 0.7
+	move 0.5
+S1:	box "syncer1"	wid 0.7
+	move 0.5
+S2:	box "syncer2"	wid 0.7
+]
+arrow from R.C.e to S.M0.w
+arrow from R.T.e to S.M0.w
+arrow from R.P0.e to S.M0.w
+arrow from R.P0.e to S.R0.w
+arrow from R.P0.e to S.R1.w
+arrow from S.M0.e to F.S0.w
+arrow from S.M0.e to F.S1.w
+arrow from S.M0.e to F.S2.w
+.PE
+.PP
+Because the file system is copy on write, as long as the blocks aren't reclaimed while a reader is accessing the tree,
+writes need not block reads.
+However, if a block is freed within the same snapshot, it's possible that a reader might observe a block with an inconsistent state.
+This is handled by using epoch based reclamation to free blocks.
+.PP
+When a proc starts to operate on the tree, it enters an epoch. This is done by atomically taking the current global epoch,
+and setting the proc's local epoch to that, with an additional bit set to indicate that the proc is active:
+.P1
+	epoch[pid] = atomic_load(globalepoch) | Active
+.P2
+As the mutator frees blocks, instead of immediately making them reusable, it puts the blocks on the limbo list for its generation:
+.P1
+	limbo[gen] = append(limbo[gen], b)
+.P2
+When the proc finishes operating on the tree, it leaves the epoch by clearing the active bit.
+When the mutator leaves the current epoch, it also attempts to advance the global epoch.
+This is done by looping over all worker epochs, and checking if any of them are active in an old epoch.
+If the old epoch is empty, then it's safe to advance the current epoch and clear the old epoch's limbo list.
+.P1
+ge = atomic_load(globalepoch);
+for(w in workers){
+	e = atomic_load(epoch[w]);
+	if((e & Active) && e != (ge | Active))
+		return;
+}
+globalepoch = globalepoch+1;
+freeblks(limbo[globalepoch - 2]);
+.P2
+.PP
+This epoch based approach allows GEFS to avoid contention between writes and reads.
+A writer may freely mutate the tree as multiple readers traverse it, with no locking between the processes,
+beyond what is required for the 9p implementation.
+There is still contention on the FID table, the block cache, and a number of other in-memory data structures.
+.PP
+The block cache is currently a simple LRU cache with a set of preallocated blocks. 
+.NH 1
+Future Work
+.PP
+Currently, GEFS is buggy, and the disk format is still subject to change.
+In its current state, it would be a bad idea to trust your data to it.
+Testing and debugging are underway, including simulating disk failures for every block written.
+In addition, disk inspection and repair tools would need to be written.
+Write performance is also acceptable, but not as good as I would like it to be.
+We top out at several hundred megabytes per second.
+A large part of this is that for simplicity, every upsert to the tree copies the root block.
+A small sorted write buffer in an AVL tree that gets flushed when the root block would reach disk would greatly improve
+write performance.
+.PP
+On top of this, I have been convinced that fs(3) is not the right choice for RAID.
+Therefore, I would like to implement RAID directly in GEFS.
+When implementing RAID5 naïvely, there is a window of inconsistency where a block has been replicated to only some devices, but not all of them.
+This is known as the "write hole".
+Implementing mirroring natively would allow the file system to mirror the blocks fully before they appear in the data structures.
+Having native RAID would also allow automatic recovery based on block pointer hashes.
+This is not possible with fs(3), because if one copy of a block is corrupt, fs(3) would not know which one had the incorrect data.
+.PP
+Furthermore, growing the file system would be extremely useful for taking flashed disk images and growing them to the full size of the underlying storage.
+The way that arenas work allows for this to happen fairly easily, but right now there's no way to change the list of arenas.
+Additionally, the naive round robin allocation across arenas doesn't allow for balancing.
+.PP
+There are a number of disk-format changing optimizations that should be done.
+For example, small writes should be done via blind writes to the tree. This would allow small writes to be blindingly fast,
+and give us the ability to keep small files directly in the values of the tree, without allocating a disk block for them.
+.PP
+Deletion of files could also be done via a range deletion.
+Currently, each block deleted requires an upsert to the tree, which makes file deletion O(n) with a relatively high cost.
+Moving to a range deletion makes deletion blindingly fast, at the cost of a large amount of complexity: The deletion message
+would need to split as it gets flushed down the tree.
+It's an open question whether the benefit is worth the complexity.
+.PP
+Finally, it would be good to find a method of supporting narrow snapshots, where only a range of the file hierarchy is retained.
+.NH 1
+References
+.LP
+[1] Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson,
+Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, and Yang Zhan,
+``An Introduction to Bε Trees and Write-Optimization,''
+.I ";login:" ,
+October 2015, Vol. 40, No. 5.
+.LP
+[2] William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao,
+Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, Michael Bender,
+Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter,
+``BetrFS: A Right-Optimized Write-Optimized File System,''
+.I "Proceedings of the 13th USENIX Conference on File and Storage Technologies,"
+2015
+.LP
+[3] Matthew Dillon, ``The HAMMER Filesystem,''
+June 2008.
+.LP
+[4] Ohad Rodeh, Josef Bacik, Chris Mason, ``BTRFS: The Linux B-Tree Filesystem,''
+.I "ACM Transactions on Storage, Volume 9, Issue 3, Article No 9, pp 1-32,"
+August 2013.
+.LP
+[5] Ohad Rodeh, ``B-trees, Shadowing, and Clones,''
+.I "H-0245 (H0611-006)" ,
+November 12, 2006.
+.LP
+[6] Matt Ahrens, ``How ZFS Snapshots Really Work,''
+.I BSDCan,
+2019
+.LP
+[7] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
+and Yale N. Patt.
+``Soft Updates: A Solution to the Metadata Update Problem
+in File Systems,''
+.I "ACM Transactions on Computer Systems" ,
+Vol. 18, No. 2, May 2000, pp. 127\-153.
+.LP
+[8] Valerie Aurora,
+``Soft updates, hard problems,''
+.I "Linux Weekly News" ,
+July 1, 2009,
+https://lwn.net/Articles/339337/
+.LP
+[9] kvik,
+.I "Clone",
+https://shithub.us/kvik/clone/HEAD/info.html
--- a/mkfile
+++ b/mkfile
@@ -28,3 +28,9 @@
 	atomic.h
 
 </sys/src/cmd/mkone
+</sys/doc/fonts
+
+%.ps: %.ms
+	{ echo $FONTS; cat $stem.ms } | pic | tbl | eqn | troff -ms | lp -dstdout > $target
+%.pdf: %.ps
+	ps2pdf $stem.ps $stem.pdf