ref: 17c81c4b2f1b5c4e217fc50b58636db84f9a65a0
parent: 32de1f7a58b7b3a8476d1ceb3532e3317004a95c
author: Ori Bernstein <ori@eigenstate.org>
date: Tue Jan 2 13:17:31 EST 2024
gefs.ms: update to match current reality
--- a/gefs.ms
+++ b/gefs.ms
@@ -16,10 +16,10 @@
.NH 1
The Current Situation
.PP
-Currently, Plan 9 has several general purpose disk file systems available.
-All of them share a common set of problems.
-On power loss, the file systems tend to get corrupted, requiring manual recovery steps.
-Silent disk corruption is not caught by the file system, and reads will silently return incorrect data.
+Plan 9 has several general purpose disk file systems available.
+While they have served us well, all of them leave much to be desired.
+On power loss, the file systems may get corrupted.
+Partial disk failure is not caught by the file system, and reads may silently return incorrect data.
They tend to require a large, unshrinkable disk for archival dumps, and behave poorly when the disk fills.
Additionally, all of them do O(n) scans to look up files in directories when walking to a file.
This causes poor performance in large directories.
@@ -35,46 +35,43 @@
.PP
Finally, fossil, the default file system on 9legacy, is large and complicated.
It uses soft-updates for crash safety[7], an approach that has worked poorly in practice for the BSD filesystems[8].
-Fossil also has a history (and present) of deadlocks and crashes due its inherent complexity.
-There are regular reports of deadlocks and crashes when using tools such as clone[9].
While the bugs can be fixed as they're found, simplicity requires a rethink of the on disk data structures.
And even after adding all this complexity, the fossil+venti system provides no way to recover space when the disk fills.
.NH 1
-How GEFS Improves Things
+Why GEFS Is Good Enough
.PP
-GEFS aims to solve the problems with these file systems.
+GEFS aims to solve the problems seen in the file systems described above.
The data and metadata is copied on write, with atomic commits.
+This happens by construction, with fewer subtle ordering requirements than soft updates.
If the file server crashes before the superblocks are updated,
then the next mount will see the last commit that was synced to disk.
-Some data may be lost, by default up to 5 seconds worth, but no corruption will occur.
+Some data may be lost, but no corruption will occur.
Furthermore, because of the use of an indexed data structure, directories do not suffer from O(n) lookups,
solving a long standing performance issue with large directories.
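+.PP
+The ordering that provides the crash guarantee above can be sketched as follows;
+the helper names here are illustrative, and not the real gefs functions:
+.P1
+syncdirty();		/* write all newly copied blocks to unused space */
+flushdisk();		/* wait for the writes to reach stable storage */
+writesuper(newroot);	/* only now point the superblocks at the new root */
+flushdisk();		/* a crash before this step still sees the old root */
+.P2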
.PP
-While snapshots are useful to keep data consistent, disks often fail over time.
-In order to detect corruption and allow space reclamation, block pointers are triples of 64-bit values:
-The block address, the block hash, and the generation that they were born in.
-If corrupted data is returned by the underlying storage medium, this is detected via the block hashes.
-The corruption is reported, and the damaged data may then be recovered from backups, RAID restoration, or some other means.
-Eventually, the goal is to make GEFS self-healing.
+The file system is based around a relatively novel data structure: the Bε tree [1].
+The Bε tree is a write optimized variant of a B+ tree.
+In addition to good overall performance, it plays particularly nicely with copy on write semantics.
+This allows GEFS to greatly reduce the write amplification seen with traditional copy on write B-trees.
+The reduced write amplification allows GEFS to get away with a nearly trivial implementation of snapshotting.
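+.PP
+As a rough sketch, an internal Bε tree node can be pictured as an ordinary
+B-tree node with a buffer of pending messages attached.
+The declarations below are only illustrative, not the structures gefs uses:
+.P1
+typedef struct Msg Msg;
+typedef struct Node Node;
+struct Msg {
+	int	op;	/* insert, delete, ... */
+	char	*key;
+	char	*val;
+};
+struct Node {
+	int	nkey;
+	char	*pivot[16];	/* sorted pivot keys (illustrative size) */
+	uvlong	child[17];	/* disk addresses of the children */
+	int	nmsg;
+	Msg	buf[64];	/* messages not yet pushed to the children */
+};
+.P2
+Messages are appended to the buffer of the root,
+and are only pushed towards the leaves when a buffer fills,
+so a single copied root block can absorb a large number of small updates.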
.PP
-Archival dumps are replaced with snapshots.
+As a result of the choice of data structure, archival dumps are replaced with snapshots.
Snapshots may be deleted at any time, allowing data within a snapshot to be reclaimed for reuse.
-To enable this, in addition to the address and hash, each block pointer contains a birth generation.
+To enable this, each block pointer contains a birth generation.
Blocks are reclaimed using a deadlist algorithm inspired by ZFS.
+This algorithm is described later in the paper.
.PP
-Finally, the entire file system is based around a relatively novel data structure.
-This data structure is known as a Bε tree [1].
-It's a write optimized variant of a B+ tree, which plays particularly nicely with copy on write semantics.
-This allows GEFS to greatly reduce write amplification seen with traditional copy on write B-trees.
+While snapshots keep the data on disk consistent, disks often fail over time.
+In order to detect corruption, block pointers contain a hash of the data that they point at.
+If corrupted data is returned by the underlying storage medium, this is detected via the block hashes.
+And if a programmer error causes the file system to write garbage to disk, this can often be caught early.
+The corruption is reported, and the damaged data may then be recovered from backups, RAID restoration, or some other means.
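+.PP
+Concretely, a block pointer can be pictured as a triple of the address,
+the hash, and the birth generation, with every read checked against the stored hash.
+The declarations and helper names below are illustrative, not the real gefs code:
+.P1
+typedef struct Blk Blk;
+typedef struct Bptr Bptr;
+struct Bptr {
+	uvlong	addr;	/* disk address of the block */
+	uvlong	hash;	/* hash of the block contents */
+	uvlong	gen;	/* generation the block was born in */
+};
+
+Blk*
+readblk(Bptr bp)
+{
+	Blk *b;
+
+	b = diskread(bp.addr);
+	if(hashblk(b) != bp.hash)
+		error("corrupt block at %llux", bp.addr);
+	return b;
+}
+.P2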
.PP
-And as a bonus, it solves these problems with less complexity.
By selecting a suitable data structure, a large amount of complexity elsewhere in the file system falls away.
The complexity of the core data structure pays dividends.
Being able to atomically update multiple attributes in the Bε tree,
making the core data structure safely traversable without locks,
and having a simple, unified set of operations makes everything else simpler.
-As a result, the total source size of GEFS is currently 8737 lines of code, as compared to CWFS at 15634, and the
-fossil/venti system at 27762 lines of code (20429 for fossil, with an additional 7333 lines for Venti).
.NH 1
Bε Trees: A Short Summary
.PP
@@ -189,13 +186,15 @@
arrow from P.P2.s to L.L2.n
.PE
.PP
-In GEFS, for the sake of simplicity all blocks are the same size.
-Unfortunately, this implies that the Bε tree blocks are smaller than optimal,
+For the sake of simplicity, GEFS makes all blocks the same size.
+This implies that the Bε tree blocks are smaller than optimal,
and the disk blocks are larger than optimal.
+The simplifications this allows in the block layer appear to be worth it.
.PP
Within a single block, the pivot keys are stored as offsets to variable width data.
The data itself is unsorted, but the offsets pointing to it are sorted.
-This allows O(1) access to the keys and values, while allowing variable sizes.
+This allows O(1) access to the keys and values given an index, and O(log(n))
+access when searching, while still allowing variable size keys and values.
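+Access by index is thus a single offset lookup,
+and search is an ordinary binary search over the offset table;
+a sketch, with an illustrative block layout:
+.P1
+typedef struct Blk Blk;
+struct Blk {
+	int	nkey;
+	ushort	*off;	/* sorted offsets into data */
+	char	*data;	/* unsorted, variable width keys and values */
+};
+
+char*
+getkey(Blk *b, int i)
+{
+	return b->data + b->off[i];	/* O(1) given the index */
+}
+
+int
+search(Blk *b, char *key)
+{
+	int lo, hi, mid, c;
+
+	lo = 0;
+	hi = b->nkey - 1;
+	while(lo <= hi){
+		mid = (lo + hi)/2;
+		c = strcmp(key, getkey(b, mid));
+		if(c == 0)
+			return mid;
+		if(c < 0)
+			hi = mid - 1;
+		else
+			lo = mid + 1;
+	}
+	return -1;	/* not present */
+}
+.P2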
.PS
.ps 6
.vs 4
@@ -319,17 +318,11 @@
.NH 1
Snapshots
.PP
-GEFS supports snapshots.
-Each snapshot is referred to by a unique integer id, and is fully immutable once it is taken.
-For human use, a snapshot may be labeled.
-The labels may move to new snapshots, either automatically if they point to a tip of a snapshot, or via user intervention.
-For the sake of space reclamation, snapshots are reference counted.
-Each snapshot takes a reference to the snapshot it descends from.
-Each label also takes a reference to the snapshot that it descends from.
-When a snapshot's only reference is its descendant, then it is deleted, and any space it uses exclusively is reclaimed.
-The only structure GEFS keeps on disk are snapshots, which are taken every 5 seconds.
-This means that in the case of sudden termination, GEFS may lose up to 5 seconds of data, but the data on disk will always be
-consistent.
+Snapshots are an important feature of GEFS.
+Each GEFS snapshot is referred to by a unique integer id, and is fully immutable once it is taken.
+Snapshots are labelled with a human readable string.
+When marked mutable, labels move forward to new snapshots as the file system is written to and synced.
+A snapshot may be referred to by at most one mutable label, along with as many immutable labels as desired.
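+A label can be pictured as a small record pairing a name with a snapshot id;
+the declaration below is illustrative only:
+.P1
+typedef struct Label Label;
+struct Label {
+	char	*name;	/* human readable name */
+	uvlong	snap;	/* id of the snapshot it refers to */
+	int	mutable;	/* does it follow new snapshots? */
+};
+.P2
+In effect, an immutable label is a permanent name for a point in time,
+while a mutable label names a writable head of the tree.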
.PP
If there was no space reclamation in gefs, then snapshots would be trivial.
The tree is copy on write.
@@ -336,9 +329,6 @@
Therefore, as long as blocks are never reclaimed, it would be sufficient to save the current root of the tree
once all blocks in it were synced to disk.
However, because snapshots are taken every 5 seconds, disk space would get used uncomfortably quickly.
-As a result, space needs to be reclaimed.
-An advantage of Bε trees here is that often, only the root block will be copied, and updates will be inserted
-into its write buffer.
.PS
.ps 6
.vs 4
@@ -393,7 +383,7 @@
must approach the bandwidth spent on metadata updates, as each block must be scanned and then reclaimed.
Reference counting implies a large number of scattered writes to maintain the reference counts of blocks.
.PP
-As a result, the algorithm for space reclamation is lifted directly from ZFS [6].
+As a result, the algorithm for space reclamation is lifted from ZFS [6].
It is based on the idea of using deadlists to track blocks that became free within a snapshot.
If snapshots are immutable, then a block may not be freed as long as a snapshot exists.
This implies that block lifetimes are contiguous.
@@ -463,6 +453,18 @@
snapshot, and any deadlists with a birth time after the deleted snapshot in the descendant
may be reclaimed.
With this approach, the only lists that need to be scanned are the ones consisting wholly of blocks that must be freed.
+.PP
+All of this assumes that there is a single, linear history of snapshots.
+However, GEFS allows users to take mutable snapshots off of any label, which breaks the assumption.
+If the assumption is broken, two different mutable labels may kill the same block,
+which would lead to double frees.
+GEFS handles this by adding the concept of a
+.I base
+to each snapshot.
+The base is the id of the first snapshot in a snapshot timeline.
+Any blocks born before the base are not considered owned by the snapshot,
+and no record of their demise will be made in that snapshot.
+The cleanup is left to the snapshot that was used as the base.
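+The ownership rule amounts to a comparison of generations when a block is killed.
+In the sketch below, s->base is the base described above, s->prev is the most
+recent snapshot that the mutable snapshot descends from, and all of the names
+are illustrative:
+.P1
+void
+killblk(Snap *s, Bptr bp)
+{
+	if(bp.gen < s->base)
+		return;		/* born before our base: its cleanup is not our job */
+	if(bp.gen > s->prev)
+		freeblk(bp);	/* never visible outside this snapshot: free it now */
+	else
+		adddead(s, bp);	/* an older snapshot still sees it: record the death */
+}
+.P2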
.PS
.ps 6
.vs 4
@@ -578,12 +580,48 @@
Process Structure
.PP
GEFS is implemented in a multiprocess manner.
-There is one protocol proc per posted service or listener, which dispatches 9p messages to the appropriate worker.
-Read-only messages get dispatched to one of many reader procs.
+There are six types of proc that GEFS uses for its operation:
+The
+.IR console ,
+.IR dispatch ,
+.IR mutator ,
+.IR sweeper ,
+.IR reader ,
+and
+.IR syncer .
+Most of these processes can be replicated;
+however, there may only be one
+.IR mutator ,
+.IR sweeper ,
+or
+.I console
+at a time.
+Protocol parsing is handled by one of several dispatch procs.
+There is one of these per posted service or listener.
+Each dispatches 9p messages to the appropriate worker, depending on the 9p message type.
+Read-only messages get dispatched to one of multiple reader procs.
Write messages get dispatched to the mutator proc, which modifies the in-memory representation of the file system.
-The mutator proc sends dirty blocks to the syncer procs.
-There is also a task proc which messages the mutator proc to do periodic maintenance such as syncing.
-The console proc also sends messages to the mutator proc to update snapshots and do other file system maintenance tasks.
+The mutator proc generates dirty blocks purely in memory, and sends them to the syncer procs.
+The job of the syncer proc is simply to write blocks back to disk asynchronously.
+There are also some tasks that may take a long time, and can be done in the background.
+These are sent to the sweeper proc.
+Because the tree is a shared data structure, the sweeper and mutator do not work in parallel.
+Instead, they must hold the mutator lock to accomplish anything.
+Finally, the task proc schedules periodic maintenance operations.
+These include syncing the file system and taking automatic snapshots.
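+.PP
+To make the routing concrete, the body of a dispatch proc can be sketched
+as below; the queue names and helpers are made up, but the split is the one
+described above:
+.P1
+void
+dispatchproc(Conn *c)
+{
+	Fcall *m;
+
+	for(;;){
+		m = get9pmsg(c);	/* parse the next 9p message */
+		switch(m->type){
+		case Tread:
+		case Twalk:
+		case Tstat:
+			send(readerq[pick()], m);	/* any reader proc will do */
+			break;
+		default:
+			send(mutatorq, m);	/* write messages and the rest */
+			break;
+		}
+	}
+}
+.P2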
+.PP
+The work of the sweeper could be done by the mutator,
+and in early versions of the file system, it was.
+However, some operations such as removing very large files
+can involve a lot of messages being inserted into the tree,
+which may block other writers.
+As a result, the long running operations are better deferred to a
+background process, which splits them into small chunks, allowing
+the mutator to make progress between them.
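+In sketch form, with illustrative names, the sweeper loop simply does a
+bounded chunk of work per acquisition of the mutator lock:
+.P1
+void
+sweeperproc(void)
+{
+	Task *t;
+
+	for(;;){
+		t = recvtask(sweepq);	/* e.g. freeing the blocks of a large file */
+		while(!done(t)){
+			qlock(&mutlk);	/* the same lock the mutator holds */
+			dochunk(t);	/* a small, bounded piece of the work */
+			qunlock(&mutlk);
+		}
+	}
+}
+.P2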
+.PP
+Data flow through these processes is unidirectional,
+and any block that has made it out of the mutating processes is immutable.
+This makes it reasonably easy to reason about consistency.
.PS
.ps 6
.vs 4
@@ -598,6 +636,8 @@
move 0.5
S: [
down
+S0: box "sweeper" wid 0.7
+ move 0.5
M0: box "mutator" wid 0.7
move 0.5
R0: box "reader0" wid 0.7
@@ -621,19 +661,29 @@
arrow from S.M0.e to F.S0.w
arrow from S.M0.e to F.S1.w
arrow from S.M0.e to F.S2.w
+arrow from S.S0.e to F.S0.w
+arrow from S.S0.e to F.S1.w
+arrow from S.S0.e to F.S2.w
+arrow from S.M0.s to S.S0.n
.PE
.PP
-Because the file system is copy on write, as long as the blocks aren't reclaimed while a reader is accessing the tree,
-writes need not block reads.
-However, if a block is freed within the same snapshot, it's possible that a reader might observe a block with an inconsistent state.
-This is handled by using epoch based reclamation to free blocks.
+Because the file system is copy on write,
+as long as the blocks aren't reclaimed while a reader is accessing the tree, writes need not block reads.
+However, if a block is freed within the same snapshot,
+a naive implementation would allow the reader to observe a corrupt block.
+As a result, some additional cleverness is needed:
+block reclamation needs to be deferred until all readers are done reading a block.
+The algorithm selected for this is epoch based reclamation.
.PP
-When a proc starts to operate on the tree, it enters an epoch. This is done by atomically taking the current global epoch,
-and setting the proc's local epoch to that, with an additional bit set to indicate that the proc is active:
+When a proc starts to operate on the tree, it enters an epoch.
+This is done by atomically taking the current global epoch,
+and setting the proc's local epoch to that,
+with an additional bit set to indicate that the proc is active:
.P1
epoch[pid] = atomic_load(globalepoch) | Active
.P2
-As the mutator frees blocks, instead of immediately making them reusable, it puts the blocks on the limbo list for its generation:
+As the mutator frees blocks, instead of immediately making them reusable,
+it puts the blocks on the limbo list for its current generation:
.P1
limbo[gen] = append(limbo[gen], b)
.P2
@@ -652,44 +702,15 @@
freeblks(limbo[globalepoch - 2])
.P2
.PP
+If some proc is still in the old epoch, then the blocks are not freed, and the cleanup is deferred.
+If a reader stalls out for a very long time, this can lead to a large accumulation of garbage,
+and as a result, GEFS starts to apply backpressure to the writers if the limbo list begins
+to get too large.
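+The check that gates both the advance of the epoch and the freeing of the
+limbo list can be sketched as follows, with an illustrative loop over the procs:
+.P1
+ge = atomic_load(globalepoch);
+for(i = 0; i < nproc; i++)
+	if((epoch[i] & Active) && (epoch[i] & ~Active) != ge)
+		return;		/* a proc is still in the old epoch: defer the cleanup */
+atomic_store(globalepoch, ge + 1);
+freeblks(limbo[ge - 1]);	/* the same list as limbo[globalepoch - 2] */
+.P2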
+.PP
This epoch based approach allows GEFS to avoid contention between writes and reads.
A writer may freely mutate the tree as multiple readers traverse it, with no locking between the processes,
beyond what is required for the 9p implementation.
There is still contention on the FID table, the block cache, and a number of other in-memory data structures.
-.PP
-The block cache is currently a simple LRU cache with a set of preallocated blocks.
-.NH 1
-Future Work
-.PP
-Currently, GEFS is almost certainly buggy, and the disk format is still subject to change.
-In its current state, it would be a bad idea to trust your data to it.
-Testing and debugging is underway, including simulating disk failures for every block written.
-In addition, disk inspection and repair tools would need to be written.
-Write performance is also acceptable, but not as good as I would like it to be.
-We top out at several hundred megabytes per second.
-A large part of this is that for simplicity, every upsert to the tree copies the root block.
-A small sorted write buffer in an AVL tree that gets flushed when the root block would reach disk would greatly improve
-write performance.
-.PP
-On top of this, I have been convinced that fs(3) is not the right choice for RAID.
-Therefore, I would like to implement RAID directly in GEFS.
-When implementing RAID5 naïvely, there is a window of inconsistency where a block has been replicated to only some devices, but not all of them.
-This is known as the "write hole".
-Implementing mirroring natively would allow the file system to mirror the blocks fully before they appear in the data structures.
-Having native RAID would also allow automatic recovery based on block pointer hashes.
-This is not possible with fs(3), because if one copy of a block is corrupt, fs(3) would not know which one had the incorrect data.
-.PP
-There are a number of disk-format changing optimizations that should be done.
-For example, small writes should be done via blind writes to the tree. This would allow small writes to be blindingly fast,
-and give us the ability to keep small files directly in the values of the tree, without allocating a disk block for them.
-.PP
-Deletion of files could also be done via a range deletion.
-Currently, each block deleted requires an upsert to the tree, which makes file deletion O(n) with a relatively high cost.
-Moving to a range deletion makes deletion blindingly fast, at the cost of a large amount of complexity: The deletion message
-would need to split as it gets flushed down the tree.
-It's an open question whether the benefit is worth the complexity.
-.PP
-Finally, it would be good to find a method of supporting narrow snapshots, where only a range of the file hierarchy is retained.
.NH 1
References
.LP