ref: 6413e86cc41bc6d0a0b5e8949e69765fd28dcb21
parent: 23d91e5e4cae62b6d855b6b4721861f620621118
author: Noam Preil <noam@pixelhero.dev>
date: Mon Feb 19 14:17:46 EST 2024
research
--- a/notebook
+++ b/notebook
@@ -554,3 +554,157 @@
...My approach here has been completely wrong. I should not be blindly reimplementing venti/checkarenas on top of neoventi's code base at all. I should be understanding the disk format, and then writing a from-scratch corruption checker.
+For now, I should probably take another glance at the venti paper, and at my actual venti, and get a concrete spec for what the current disk format is. ...the paper isn't nearly complete enough, though. To /sys/src, of course.
+
+The config file - which may be an actual file, if not inside of venti, so let's assume that for simplicity - specifies the roots.
+
+arenas /dev/kaladin/arenas
+isect /dev/kaladin/isect
+
+arenas entries are loaded as ArenaParts in venti, which is done via initarenapart on a Part, which is just an abstraction over an opened file, effectively (there's a minor caveat that partitions in venti can actually be _parts of a file_, and not an entire file - i.e. a single disk can be partitioned _by venti_, and each part tracks the offset and size into the underlying fd).
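+
+A minimal sketch of that caveat, not venti's actual Part. The struct and part_translate are hypothetical names; the only assumption is that a partition is an (fd, offset, size) window, so every partition-relative address needs a bounds check and a shift before it touches the fd.
+
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for venti's Part: a window onto an open file. */
typedef struct {
	int fd;          /* underlying file descriptor */
	uint64_t offset; /* where this partition starts within fd */
	uint64_t size;   /* length of the partition in bytes */
} Part;

/*
 * Translate a partition-relative range into an absolute file offset.
 * Returns 0 and stores the offset in *out on success, or -1 if the
 * range [addr, addr+n) does not fit inside the partition.
 */
int
part_translate(const Part *p, uint64_t addr, uint64_t n, uint64_t *out)
{
	assert(p != 0 && out != 0);
	/* written this way so that addr + n cannot overflow */
	if(addr > p->size || n > p->size - addr)
		return -1;
	*out = p->offset + addr;
	return 0;
}
```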
+
+Similarly, the index section is loaded as an ISect by initisect.
+
+Each partition should have 256KiB of unused space at the beginning. Let's verify that in practice...
+
+The space is not wiped by formatting, and is not truly _blank_. The config file exists at the very end of this space, in fact!
+
+After that, arena partitions have a 16-byte header, consisting of four four-byte big-endian fields.
+
+Fields are:
+
+u32 magic
+u32 version
+u32 blocksize
+u32 arenabase
+
+The magic is 0xa9e4a5e7U.
+
+% dd -if /dev/kaladin/arenas -bs 1024 -skip 256 -count 1 | xd
+1+0 records in
+1+0 records out
+0000000 a9e4a5e7 00000003 00002000 000c2000
+0000010 00000000 00000000 00000000 00000000
+
+As can be seen from the venti I'm typing this on, the current arenaparts version is 3. My block size is set to 8KiB, and the first arenabase is 0xc2000. I'm not certain what the arena base is; likely the offset of the first arena, but relative to the partition? The header?
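+
+A rough sketch of decoding those 16 bytes. ArenaPartHeader, be32 and parse_arenapart_header are names made up here, not venti's; the field order and the magic are the ones listed above.
+
```c
#include <assert.h>
#include <stdint.h>

#define ARENAPART_MAGIC 0xa9e4a5e7U

/* The four header fields, in on-disk order. */
typedef struct {
	uint32_t magic;
	uint32_t version;
	uint32_t blocksize;
	uint32_t arenabase;
} ArenaPartHeader;

/* Read one big-endian 32-bit value. */
static uint32_t
be32(const unsigned char *p)
{
	return (uint32_t)p[0]<<24 | (uint32_t)p[1]<<16 | (uint32_t)p[2]<<8 | (uint32_t)p[3];
}

/* Decode the 16 header bytes; returns -1 if the magic does not match. */
int
parse_arenapart_header(const unsigned char buf[16], ArenaPartHeader *h)
{
	assert(buf != 0 && h != 0);
	h->magic = be32(buf);
	h->version = be32(buf + 4);
	h->blocksize = be32(buf + 8);
	h->arenabase = be32(buf + 12);
	return h->magic == ARENAPART_MAGIC ? 0 : -1;
}
```
+
+Fed the first 16 bytes of the xd output above, this yields version 3, blocksize 0x2000 and arenabase 0xc2000.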
+
+There's also a table which lists all arenas in the partition, which appears to be aligned to the first block after the header.
+
+ap->tabbase = (PartBlank + HeadSize + ap->blocksize - 1) & ~(ap->blocksize - 1);
+
+The table is required to be _before_ the arenabase. As tabbase is relative to the partition, the arenabase appears to be as well.
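+
+To make the rounding concrete, a small sketch. PartBlank and HeadSize are the names from the line above; the values 256KiB and 512 bytes are my reading of the blank space and the header size, and tabbase/table_fits are hypothetical helpers, not venti functions.
+
```c
#include <assert.h>
#include <stdint.h>

enum {
	PartBlank = 256*1024, /* unused space at the start of a partition (assumed) */
	HeadSize = 512,       /* space reserved for the partition header (assumed) */
};

/*
 * Round the end of the header up to the next block boundary.
 * The mask trick only works when blocksize is a power of two.
 */
uint64_t
tabbase(uint32_t blocksize)
{
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	uint64_t bs = blocksize;
	return (PartBlank + HeadSize + bs - 1) & ~(bs - 1);
}

/* The table has to start before arenabase; both are partition offsets. */
int
table_fits(uint32_t blocksize, uint64_t arenabase)
{
	return tabbase(blocksize) < arenabase;
}
```
+
+With an 8KiB block size this gives 270336 bytes, i.e. 264KiB, which is below the arenabase of 0xc2000.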
+
+0xc2000
+
+This puts the arena base 776KiB (794624 bytes) into the file; let's test that. The arena header magic is 0xd15c4eadU.
+% dd -if /dev/kaladin/arenas -bs 1024 -skip 776 -count 1 | xd
+1+0 records in
+1+0 records out
+0000000 d15c4ead 00000005 6172656e 6173302e
+...
+
+Okay, so there's definitely an arena there. What's the use of arenabase, though, if all arenas and their offsets are listed in the table? For recovery in case the table is corrupt? Ahh, no, it's really meant to mark the _end of the table_.
+
+The arena partition is just a set of arenas with a table to allow quick lookups by name, apparently?
+
+The arena table is... textual, of course. Cool. It should be easy to find: 256KiB into the file, plus 512 bytes, then round up to the next 8KiB block; that's just 264KiB in, and - since the arena base is 776KiB in - the table region should be exactly 512KiB long? Suspiciously exact, but let's check.
+
+% dd -if /dev/kaladin/arenas -bs 1024 -skip 264 -count 512 >/tmp/table
+
+While /tmp/table is 512KiB, most of it is NUL bytes. The partitioner appears to round up to 512KiB no matter what.
+
+The table itself is formatted simply enough. It starts with the number of entries (a u32, written in decimal) on its own line, followed by that number of entries, each consisting of a textual name, a tab character, a u64 start in decimal, a tab, a u64 stop in decimal, and a newline delimiter.
+
+% cat /tmp/table
+1416
+arenas0.0 794624 537665536
+
+The first entry is listed as starting at 794624, which happens to be exactly 776KiB into the file, exactly equal to the arenabase - this confirms that addresses in the map are, in fact, direct offsets into the partition.
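+
+A minimal sketch of reading that table, assuming it has already been pulled into a NUL-terminated string. MapEntry and parse_arena_table are hypothetical names; the parser accepts any whitespace between fields rather than insisting on tabs, which is looser than the format itself.
+
```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

enum { ANameSize = 64 };

typedef struct {
	char name[ANameSize];
	uint64_t start; /* partition offset of the arena's first byte */
	uint64_t stop;  /* partition offset just past the arena */
} MapEntry;

/*
 * Parse the textual arena table in s into out[0..max).
 * Returns the number of entries, or -1 if the text is malformed,
 * an entry has start >= stop, or the count exceeds max.
 */
int
parse_arena_table(const char *s, MapEntry *out, int max)
{
	unsigned n;
	int used;

	assert(s != 0 && out != 0 && max >= 0);
	if(sscanf(s, "%u%n", &n, &used) != 1 || n > (unsigned)max)
		return -1;
	s += used;
	for(unsigned i = 0; i < n; i++){
		MapEntry *e = &out[i];
		/* %63s leaves room for the terminating NUL in name */
		if(sscanf(s, " %63s %" SCNu64 " %" SCNu64 "%n",
				e->name, &e->start, &e->stop, &used) != 3)
			return -1;
		if(e->start >= e->stop)
			return -1;
		s += used;
	}
	return (int)n;
}
```
+
+For the entry shown above, stop - start comes to 536870912 bytes, which is exactly 512MiB.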
+
+I'm going to define an ArenaParts version=4 that uses a binary table instead, but that's a later project. TODO.
+
+Arenas contain both a header and a trailer. Both occupy one block. NOTE: the usage of blocks as a unit for fundamental structures imposes a limit on the block size.
+
+The header and trailer _both_ contain some of the information, for redundancy.
+
+The trailer structure is as follows:
+
+u32 magic = 0xf2a14eadU
+u32 version = (4|5)
+[64]u8 arena name
+Disk stats:
+ u32 clumps
+ u32 cclumps
+u32 ctime
+u32 wtime
+?version == 5: u32 clumpmagic
+ u64 used
+ u64 uncompressed size
+ bool sealed
+bool has_memstats
+?has_memstats: Memory stats
+ u32 clumps
+ u32 cclumps
+ u64 used
+ u64 uncsize
+ bool sealed
+
+Arenas are indexed live. The disk stats represent what is committed fully to disk: i.e., what the index sees. The memory stats reveal clumps that are in the arena, but not in the index.
+
+The arena header structure is as follows:
+
+u32 magic = 0xd15c4eadU
+u32 version
+[64]u8 arena name
+u32 blocksize
+u64 size
+?version==5: u32 clumpmagic
+
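+The first arena begins at 794624, which is 97 blocks in. The table region starts at block 33 (264KiB) and spans 512KiB, or 64 blocks, so it ends at the end of block 96.
+% dd -if /dev/kaladin/arenas -bs 8192 -skip 96 -count 1 > /tmp/foo
+
+Block 96 is totally empty, but by the numbers above that is just the unused tail of the table region rather than a deliberate gap.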
+
+Block 97, on the other hand, contains the arena header, as expected - followed by a lot of empty space.
+
+0000000 d15c4ead 00000005 6172656e 6173302e
+0000010 30000000 00000000 00000000 00000000
+0000020 00000000 00000000 00000000 00000000
+0000030 00000000 00000000 00000000 00000000
+0000040 00000000 00000000 00002000 00000000
+0000050 20000000 05228fa3 00000000 00000000
+
+All the space after that is empty, and reading it is pointless.
+
+dat.h:114: ArenaSize4 = 2 * U64Size + 6 * U32Size + ANameSize + U8Size,
+dat.h:115: ArenaSize5 = ArenaSize4 + U32Size,
+dat.h:116: ArenaSize5a = ArenaSize5 + 2 * U8Size + 2 * U32Size + 2 * U64Size,
+
+Could just be reading one 512-byte sector instead of 16 of them >_> operating at the block level could make some sense, but the readpart call specifies the block size explicitly. Specifying just the arena header size would likely be better >_>
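+
+As a sanity check on those dat.h constants, a sketch that sums the trailer fields from the list earlier and compares the totals. U8Size, U32Size, U64Size and ANameSize are assumed to be 1, 4, 8 and 64; trailer_fields_size is a hypothetical helper. The has_memstats flag byte is only counted when memory stats follow, which is what makes the total line up with ArenaSize5a; whether a zero flag byte is written otherwise is not something this checks.
+
```c
#include <assert.h>

enum {
	U8Size = 1, U32Size = 4, U64Size = 8, /* assumed primitive sizes */
	ANameSize = 64,                       /* from the [64]u8 name field */
	ArenaSize4 = 2*U64Size + 6*U32Size + ANameSize + U8Size,
	ArenaSize5 = ArenaSize4 + U32Size,
	ArenaSize5a = ArenaSize5 + 2*U8Size + 2*U32Size + 2*U64Size,
};

/* Sum the trailer fields, in the order they were listed. */
int
trailer_fields_size(int version, int has_memstats)
{
	assert(version == 4 || version == 5);
	int n = 0;
	n += 2*U32Size; /* magic, version */
	n += ANameSize; /* arena name */
	n += 2*U32Size; /* disk clumps, cclumps */
	n += 2*U32Size; /* ctime, wtime */
	if(version == 5)
		n += U32Size; /* clumpmagic */
	n += 2*U64Size; /* disk used, uncompressed size */
	n += U8Size;    /* disk sealed */
	if(has_memstats){
		n += U8Size;                          /* has_memstats flag */
		n += 2*U32Size + 2*U64Size + U8Size;  /* memory stats */
	}
	return n;
}
```
+
+This gives 105, 109 and 135 bytes for version 4, version 5, and version 5 with memory stats, all well under a single 512-byte sector.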
+
+Anywho, it's pretty clear that this arena is version 5, named `arenas0.0', has block size 0x2000 (8KiB), size is 1/8th of 4GiB => 512MiB, and the clump magic is 05228fa3. After this block should just be clumps?
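+
+The same reading of the dump, as a sketch. ArenaHead and unpack_arena_head are hypothetical names; the layout is the header field list from above, and accepting only versions 4 and 5 is an assumption carried over from the trailer.
+
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ARENA_HEAD_MAGIC 0xd15c4eadU
enum { ANameSize = 64 };

typedef struct {
	uint32_t version;
	char name[ANameSize];
	uint32_t blocksize;
	uint64_t size;
	uint32_t clumpmagic; /* left as 0 unless version == 5 */
} ArenaHead;

static uint32_t
be32(const unsigned char *p)
{
	return (uint32_t)p[0]<<24 | (uint32_t)p[1]<<16 | (uint32_t)p[2]<<8 | (uint32_t)p[3];
}

static uint64_t
be64(const unsigned char *p)
{
	return (uint64_t)be32(p)<<32 | be32(p + 4);
}

/*
 * Decode an arena header. buf must hold at least 88 bytes
 * (4+4+64+4+8+4). Returns -1 on a bad magic or unknown version.
 */
int
unpack_arena_head(const unsigned char *buf, ArenaHead *h)
{
	const unsigned char *p = buf;

	assert(buf != 0 && h != 0);
	if(be32(p) != ARENA_HEAD_MAGIC)
		return -1;
	p += 4;
	h->version = be32(p);
	p += 4;
	if(h->version != 4 && h->version != 5)
		return -1;
	memcpy(h->name, p, ANameSize);
	h->name[ANameSize - 1] = '\0'; /* force termination of the name */
	p += ANameSize;
	h->blocksize = be32(p);
	p += 4;
	h->size = be64(p);
	p += 8;
	h->clumpmagic = h->version == 5 ? be32(p) : 0;
	return 0;
}
```
+
+On the block 97 bytes shown above, this comes out as version 5, name arenas0.0, blocksize 0x2000, size 0x20000000 and clumpmagic 0x05228fa3.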
+
+% dd -if /dev/kaladin/arenas -bs 8192 -skip 98 -count 1 > /tmp/foo
+
+0000000 05228fa3 02002800 507b589d 11008702
+
+Next block certainly begins with the clump magic!
+
+% xd /tmp/foo | grep 05228fa3
+0000000 05228fa3 02002800 507b589d 11008702
+00000b0 841026d0 05228fa3 02004100 50b52af7
+0000180 05228fa3 02006100 f0561e10 bb976037
+0000400 05228fa3 01004601 2cf747e8 0c4f54aa
+0000460 abf77a54 c018482b db007048 05228fa3
+0000500 83e22783 e02fc928 09fce8a0 05228fa3
+
+But it also contains that signature multiple times, and I'm pretty sure that those are all legitimately different clumps. A single block can contain multiple clumps. Can a clump cross blocks, though?
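+
+The grep only catches hits that land on xd's 4-byte word boundaries. A sketch of a byte-granular scan instead; find_magic is a hypothetical helper, and a hit is only a candidate clump start, since the same four bytes could also turn up inside clump data.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Record up to max byte offsets at which the big-endian 32-bit value
 * magic occurs in buf[0..len). Returns the total number of hits,
 * which may exceed max.
 */
size_t
find_magic(const unsigned char *buf, size_t len, uint32_t magic, size_t *offs, size_t max)
{
	size_t found = 0;

	assert(buf != NULL && (offs != NULL || max == 0));
	for(size_t i = 0; i + 4 <= len; i++){
		uint32_t v = (uint32_t)buf[i]<<24 | (uint32_t)buf[i+1]<<16
			| (uint32_t)buf[i+2]<<8 | (uint32_t)buf[i+3];
		if(v != magic)
			continue;
		if(found < max)
			offs[found] = i;
		found++;
	}
	return found;
}
```
+
+The six hits in the grep output above sit at offsets 0x0, 0xb4, 0x180, 0x400, 0x46c and 0x50c, so clump starts are clearly not block-aligned.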
+
+% dd -if /dev/kaladin/arenas -bs 8192 -skip 99 -count 1 > /tmp/foo
+% xd /tmp/foo
+0000000 9bcaea2e c32578e8 de947f10 d9dcc43c
+
+Yep. Lovely. TODO: we should _try_ to pack clumps efficiently, and buffer them to do so, if needed. Having a read-write buffer area in venti would be useful for other purposes as well (e.g. score management).
+
+As far as reading goes, though, I see why venti requires a block cache, and operates at the block level. Going to add one to neoventi now.
+