Let's get venti lookups working. At present, we're not finding the entry in the bucket, even though the score was taken directly from fossil, which I'm typing this into, so it definitely is there.

Walk backwards through the logic in oldventi, compare the results, find the problem, document this here.

lookupscore in icache.c checks the index cache, then falls back to loadientry if it's not found there. That takes in the index, score, and type, and generates an index entry; let's check that out in /sys/src/cmd/venti/srv/index.c:652...

First, it checks the bloom filter; that's used to detect absence early, so we can skip it.

Then, it loads the index bucket - double check that we're grabbing the right bucket, then check that we're looking up the entry in the bucket correctly...


loadibucket goes through loadibucket1, but nothing interesting happens in loadibucket itself; loadibucket1 processes the args a bit for loadibucket0, which does the real work.

loadibucket0 is given the index, the bucket number, an out pointer to which the index section is written if found, an out pointer to which the section-relative bucket number is written if found, an out pointer into which the index bucket is unpacked from the raw bytes, and the mode with which to retrieve the disk block.

The absolute bucket index is the hashbits(score, 32) divided by the index divisor. Aside from that, all parameters are passed through from loadibucket exactly. The score is a u8*, and I think I used char* for neoventi. This is quite probably a bug even if it's not the one I'm looking for. Nothing else is of interest from loadientry, parameter-wise.

Let's go in order, though. Double check the hash...

Given a u8* score and the number of result bits, the hash is calculated from the first four bytes of the score in big-endian order. The maximum length of the hash is 32 bits; if fewer bits are requested, the value is simply shifted right.
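
For reference, the calculation amounts to something like this (a sketch from memory of venti/srv's hashbits, not a verified copy):

	/* hash = first four bytes of the score, big-endian */
	u32int
	hashbits(u8int *score, int bits)
	{
		u32int v;

		v = (score[0]<<24) | (score[1]<<16) | (score[2]<<8) | score[3];
		if(bits < 32)
			v >>= (32 - bits);
		return v;
	}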

...we're using U32GET(score) to retrieve the hash, but score is char*. U32GET only casts the _result_ to u32, not the parameter. Fix that, compare the resulting bucket index...

That was definitely _one_ of the issues; this moves us from bucket 4880611 to bucket 4226740. There's likely other stupid type errors in here still >_>

Going to preemptively switch to using u8int* for the score everywhere. That might be enough to solve this, and even if not, it will prevent other stupid bugs in the future.

Didn't fix this, but I wasn't really expecting it, so it's fine ☺ 

Idea: maybe write a frontend to venti/srv code that dumps intermediate info for comparison? That should be relatively easy: there are other venti/srv frontends (wrarena, for instance) I can use as a template, and it would make it easy to verify.

Yeah, that was a good idea:

% ./6.tester
found in bucket 1913977

It COMPLETELY disagrees? Need to double-check units; venti/tester is reporting the section-relative bucket, and I don't think I am in neoventi.

Nope, not the problem.

venti/tester modified reports: found in bucket 1913977, [0, 4880645], 1684300339/880 = 1913977

The index divisor is 880, the 'hash' is 1684300339.

neoventi reports 'Looking in bucket 3719531835 / 880 = 4226740'.

The hash is wrong??

THE SCORE IS WRONG! >_>

We have it in string form, we want it in raw bytes!

...no, actually, we have it in raw bytes. I was embedding it in string form in the tester >_> Oops.

With that fixed:

% ./6.tester
found in bucket 4226740, [0, 4880645], 3719531835/880 = 4226740

We're looking in the right bucket! Yay!

That means we're misreading the bucket? Not scanning it correctly?

Bucket data is after the u16 entry count and u32 magic; IBucketSize in venti/srv is U32Size+U16Size. unpackibucket sets n to the u16 at the front, and data to buf+IBucketSize. The magic is checked as the u32 after the first u16.

We're doing all that right.
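
For the record, the unpacking described above amounts to roughly this (a sketch from the layout, not copied from either codebase):

	/* bucket layout: u16 entry count, u32 magic, then the packed entries */
	b->n = U16GET(buf);
	b->data = buf + IBucketSize;	/* IBucketSize = U16Size + U32Size = 6 */
	if(U32GET(buf + U16Size) != magic)
		b->n = 0;	/* bad magic; venti treats the bucket as empty here, IIRC */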

We're reading the bucket, and it has one entry in it. Probably it's our bucket lookup that's wrong, then?

...uh. Or... maybe not?

Extended the tester to unpack and report the bucket contents:

	IBucket b;
	unpackibucket(&b, no->data, sect->bucketmagic);
	print("bucket entries: %d\n", b.n);
	int h = bucklook(score, -1, b.data, b.n);
	print("bucklook: %d\n", h);
	if(h&1){
		h ^= 1;
		print("Found at offset %d\n", h);
	} else {
		print("not found in bucket\n");
	}

Result:

bucket entries: 1
bucklook: 38
not found in bucket

venti/srv doesn't see that score, either??? What the hell?
vacfs: vacfsopen: no block with score ddb38d3b21f95644b181052dc096ebdf2862dd83/16 exists

Yeah, confirmed. Huh??? I copied that score from fossil/l... fossil/last. Are its output scores invalid? ...yep. Oops.

Let's try with a new score! 4091f8b8aafc7b8a815f38534b70a171e4ae3e44 taken from fscons.

Establish a baseline first, which I should ALWAYS do: vacfs is able to mount that one. okay. venti/tester does not see it, though. Are vac scores not direct venti scores?

I'm fairly sure they should be, and a quick skim through vacfs code supports that. Okay, let's try a different approach: venti/tester expansion to loadientry directly, none of the manual bucket futzing:

	IEntry e;
	if(loadientry(index, score, -1, &e) < 0){
		print("failed to load ientry\n");
	} else {
		print("ientry loaded\n");
	}

failed to load ientry

...hmmm. Is it possible for data in the latest arena to be unindexed? Perhaps the issue is the assumption that the data should be in the index to begin with.

Walking through the source; readlump=>lookupscore, which, if the entry is not in the cache, will loadientry; if loadientry fails, it passes an error up, and readlump fails.

...data in the latest arena has to be loaded into the index on startup, doesn't it? is it injecting it directly into the index cache and leaving it dirty??

Yep, syncindex(). Okay, let's add a syncindex call to venti/tester...

That's not working, either. let's go a step higher: try readlump()?

failed to read lump: no block with score %V%/4542084 exists

what. 
seterr(EOk, "no block with score %V/%d exists", score, type);

Ahhhhhh, the %V format handler isn't installed, so the score isn't printed, and the score pointer is getting printed as the type.
ventifmtinstall();

failed to read lump: no block with score 4091f8b8aafc7b8a815f38534b70a171e4ae3e44/-1 exists

much better! Now, copy-paste that score and vacfs it again...

 % vacfs -h tcp!127.1!13031 vac:4091f8b8aafc7b8a815f38534b70a171e4ae3e44
% ls /n/vac
/n/vac/active
/n/vac/archive
/n/vac/snapshot

and lo! This is bullshit! Ah wait, we're reading the lump before syncindex, let's fix that...
	if(syncindex(index) < 0)
		sysfatal("sync failed: %r");
	Packet *data = readlump(score, -1, -1, nil);
	if(data == nil)
		print("failed to read lump: %r\n");
	else
		print("successfully read lump!\n");
	IEntry e;
	if(loadientry(index, score, -1, &e) < 0){
	...


and huh!

successfully read lump!
failed to load ientry

Progress!

...is it being inserted into the lump cache on startup, but not indexed?

	int cached;
	Packet *data = readlump(score, -1, -1, &cached);
	if(data == nil)
		print("failed to read lump: %r\n");
	else
		print("successfully read lump, cached: %d\n", cached);
successfully read lump, cached: 0

The plot thickens! It's NOT in the lump cache! readlump then uses lookupscore to find the index address... which checks the index cache, then loads the ientry. Let's check the icache?
	IAddr addr;
	if(lookupscore(score, -1, &addr) == 0)
		print("found score in index cache at %llud, size %lud, type %d, blocks %d\n", addr.addr, addr.size, addr.type, addr.blocks);
	else
		print("not found in index cache\n");

found score in index cache at 75373226395, size 300, type 16, blocks 1
failed to load ientry

Yeeeep, there it is. it's inserted into the index cache on startup. That's separate from syncindex, though?

Nope, removing syncindex results in it not being in the index cache. Whelp, pretty sure that's a bug in venti/venti: I don't think it will allow you to read the latest data when in read-only mode, where it skips syncindex.

Okay, I don't intend to have this behavior, necessarily. Let's just do a full flush. 
% hget http://127.1:13032/flushicache

Note: the index flusher is disgustingly slow. Disk I/O and CPU were both at effectively zero for minutes while it ran. I didn't realize it was _working_, it's that slow O_o.

venti/read -h tcp!127.1!14011 vac:4091f8b8aafc7b8a815f38534b70a171e4ae3e44
kill 108118
Initializing neoventi build 1...
reading 1416 map entries
found 1416 arenas
Parsing index v1...
reading 1 map entries
reading 1416 map entries
received a connection at /net/tcp/18, fd 7
handling read of score 4091f8b8aafc7b8a815f38534b70a171e4ae3e44
Looking in bucket 1083308216 / 880 = 1231032
relative bucket: 1231032, section [0, 4880645]
entries: 3
arena arenas0.140, start 212543899, size 17
WE HAVE THE DATA!!!!!

Okay! We're able to read the data now. All RPC calls start with the RPC size, then message type, then tag. For Rread, the rest of the message is the data.

	dbuf = malloc(4 + size);
	dbuf[0] = (size+2)>>8;
	dbuf[1] = (size+2) & 0xFF;
	dbuf[2] = VtRread;
	dbuf[3] = buf[3];
	vtreadarena(arena, addr, dbuf+4, &size);
	print("WE HAVE THE DATA!!!!! size: %ud\n", size);
	// just slam the dbuf packet over the wire!
	if(write(conn.fd, dbuf, size + 4) != size+4){
		print("failed to write data\n");

...


venti/read: could not read block: read asked for 4091f8b8aafc7b8a815f38534b70a171e4ae3e44 got db1a96f91d93fca56f68c352a428a5fbdb71f0d5

Oh, lovely. Bucket's good, but actual data read's busted. Back to venti/tester...
	Packet *data = readlump(score, -1, -1, &cached);
	if(data == nil)
		print("failed to read lump: %r\n");
	else
		print("successfully read lump, cached: %d, size: %lud\n", cached, packetsize(data));
successfully read lump, cached: 0, size: 300

okay. back to readlump...

readlump looks up the score as an IAddr, which is... an address into the index? It then uses the index to map that to an address inside an arena, before actually loading the clump from that arena.

Our vtreadlookup combines these steps: given a score, it looks it up in the index, finds the arena+address, and yields that info all to the caller. So, let's compare the intermediate results.

First off, lookupscore() already gives us the right size. venti/tester already exposes it:

	if(lookupscore(score, -1, &addr) == 0)
		print("found score in index cache at %llud, size %lud, type %d, blocks %d\n", addr.addr, addr.size, addr.type, addr.blocks);

 found score in index cache at 75374255700, size 300, type 0, blocks 1

lookupscore() gets the address from loadientry(), which is just loading the bucket and finding the entry in the bucket.

Found the issue while reading the code, no test needed :P

	*addr = U64GET((buf + 6 + (entry * IEntrySize) + 26));
	*size = U16GET((buf + 6 + (entry * IEntrySize) + 28));

The address is eight bytes, not two; the size offset should be 34, not 28.
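The corrected read, for the record:

	*size = U16GET((buf + 6 + (entry * IEntrySize) + 34));

With that fixed,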

 arena arenas0.140, start 212543899, size 300
WE HAVE THE DATA!!!!! size: 300

could not read block: read asked for 4091f8b8aafc7b8a815f38534b70a171e4ae3e44 got 746311957a8fafc724bc122884778f97d4a5c769

That's a different wrong hash, at least? and now the size is correct! Either the address is wrong, the conversion is wrong, or the way it's being read/sent is wrong, I think.

tester update:
	u64int aa;
	Arena *arena = amapitoa(index, addr.addr, &aa);
	if(arena)
		print("mapped to arena addr %llud\n", aa);

mapped to arena addr 213573204

On our side, we've got:
	arena arenas0.140, start 212543899, size 300

That address looks wrong?

Let's expand the tester a bit:
	Clump cl;
	u8int readscore[20];
	ZBlock *zb = loadclump(arena, aa, addr.blocks, &cl, readscore, 1);
	if(zb)
		print("read clump from disk: size %lud\n", cl.info.uncsize);

read clump from disk: size 300

okay. let's double check that it is the offset that's the problem by hardcoding it and seeing what happens.

venti/read: could not read block: read asked for 4091f8b8aafc7b8a815f38534b70a171e4ae3e44 got f8e0d5ed942a9fe3c234569b14b18de0536cdd61

We just get a different wrong hash >_>

Okay, that was maybe a silly approach. let's figure out _why_ the addresses differ...

First, let's double check our pre-mapping addresses.

tester gives > found score in index cache at 75374255700, size 300, type 0, blocks 1

75374255700.

we've got > found base addr 75373226395

whoops. Our index entry is wrong, then? Let's double check the bucket, and where in it we are...

tester gives offset 38 for the bucket.

we've got... found in bucket at offset 0

...huh. Oops? One entry is 38 bytes; we're finding it as the first entry in the bucket, while venti/srv finds it as the second?

Ahhhh, the type is probably wrong. The hash for both entries is the same? Looks like the same data was written with multiple types??

...ohhhh, the clump on disk isn't just the raw bytes! Oops!

It'll probably work without checking the type if I fix the disk read?

Yeaaaaaaah, loadclump:
• Reads blocks, not bytes
• calls unpackclump to actually get the data out
• uncompresses if necessary

We can't ignore any of that. The clump encoding may need more than the listed size.

...we can try something stupid, though. The clump on disk consists of:
• 4 bytes of magic
• 1 byte of type
• 2 bytes for size
• 2 bytes for uncompressed size
• 20 bytes for the score
• 1 byte for the encoding (compression info)
• 4 bytes for the creator
• 4 bytes for the time
• and then the data
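
As byte offsets into the clump (my tally of the list above; the names are just mine):

	enum {
		ClumpOffMagic		= 0,	/* u32 */
		ClumpOffType		= 4,	/* u8 */
		ClumpOffSize		= 5,	/* u16, compressed size */
		ClumpOffUncSize		= 7,	/* u16, uncompressed size */
		ClumpOffScore		= 9,	/* 20 bytes */
		ClumpOffEncoding	= 29,	/* u8 */
		ClumpOffCreator		= 30,	/* u32 */
		ClumpOffTime		= 34,	/* u32 */
		ClumpOffData		= 38,	/* data starts here */
	};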

For an uncompressed block, we shooooould be able to just read 38 bytes plus the data, and skip the first 38 bytes?

Didn't work; still had the wrong hash. Hopefully that means it's compressed?
	char clump[38];
	vtreadarena(arena, addr, clump, &size);
	print("clump encoding: %d\n", clump[29]);

Didn't even need the tester for this; the encoding is 2, which is - according to venti/srv/dat.h - ClumpECompress. It's compressed and needs unwhacking.

Let's just copy unwhack in wholesale. That's ~200 lines of code, and there's no point reimplementing _that_ or, bluntly, understanding it.

it fucking wooooorks!

static int
readclump(uchar *dst, VtArena arena, u64int addr, u8int blocks)
{
	uchar *buf = malloc(blocks << ABlockLog);
	u32int magic;
	u16int size;
	size = blocks<<ABlockLog;
	vtreadarena(arena, addr, buf, &size);
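	// header offsets, per the layout above: buf+5 = compressed size, buf+7 = uncompressed size,
	// buf[29] = encoding (1 = none, 2 = compressed, per venti/srv/dat.h), buf+38 = start of data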
	size = U16GET(buf+7);
	if(buf[29] == 2){
		// Needs decompression
		Unwhack uw;
		unwhackinit(&uw);
		if(unwhack(&uw, dst, size, buf+38, U16GET(buf+5)) != size)
			sysfatal("decompression failed: %s", uw.err);
	} else if(buf[29] == 1)
		memcpy(dst, buf+38, size);
	free(buf);
	return 1;
}

# Sun Dec 24 19:09:00 EST 2023

Okay! Everything works and has been partly cleaned up and refactored. Needs more of that, but getting the write path working is a higher priority right now.

Probably going to want to buffer writes, and actually handle Tsync as a flush call. If I'm going to want a server-side cache as well, I should probably have them unified, and have a separate set of dirty entries? The question is, what layer do I put the cache at?

I could do a disk-level cache, which tracks data by address on disk, with a dirty bit on each entry - this would be a singular cache which handles both index and arena data. IIRC this would be akin to venti's dcache, and I'm not sure I want that.

The index cache is the most useful, though if it does not get flushed, we need to do a scan on startup to populate it the same way venti does, and this has been one of the behaviors that has irked me. Blocking until all data is available is reasonable in this situation, but getting *into* that situation is kinda crazy.

One of the major annoyances with original venti is how poorly it utilizes the storage disk. While writing data, it will take a minute before it starts flushing _any_ of it out. We should be trying to write data more immediately than that.

On the other hand, always writing immediately may not be ideal? but eh, I'd rather try anyways and then see how it works out. Venti is WORM, so normal patterns fall into two general categories in my experience:

• File system running on top
	- This flushes periodically
	- Most of the time, it will be totally idle
	- When it writes, it writes all data at once
• One-time upload of data
	- Archives, vacballs
	- This should also just fire data at us full-speed

In both situations, we should expect to have a constant stream of data when we're writing. We have two streams of data to write:

• The log
	• This is written linearly, chunk after chunk after chunk
	• Each chunk has a checksum, but calculating those is trivial
• The index
	• This writes to totally random parts of the disk
		• _scatter storage addressing_, after all!

Buffering the log might be reasonable, since it allows for writing larger blocks at a time.

but... why speculate when I can measure? Start by just writing data the instant it comes in. If that's not fast enough, look into using multiple procs for flushing data. If that's not fast enough, look into other options. Overengineering is Bad™.

Just double checked venti and it looks like it has one writer per partition, so it shouldn't be terribly hard to beat its performance without even resorting to multithreading - just by having a better implementation of the single-threaded writing.

Caching / prefetching / some such system will definitely be needed to win on the read side - or, well, *probably* be needed. Parallelized readers might be sufficient, though that may require changes to libventi / vacfs / etc to enable multiple reads in flight at a time. But again, measure before designing.

Worst case, neoventi's prototype has demonstrated that I _can_ do this, even if I have to change the design a bunch more. And, I kinda want to do the final implementation in Hare anyways. I'll need to set up a new neoventi for write testing, because testing it on the root drive would be moronic - I'd have to boot the "real" venti read-only for that, and I've no idea how fossil would handle that. That makes this as good a time as any to actually implement the config handling...

# Mon Dec 25 10:08:18 EST 2023

Actually, ended up focusing on the refactor / cleanup, still. That's sufficient for now, though. And, took a break to fix album art loading in zuke :P which broke during a botched merge.

I should probably also take a break to implement a neoventi formatter, and disk structure checkers, to replace the venti tools. I can test those against existing venti, and it'll be useful for setting up the test system instead of relying on venti's tools.

I also need to take a pause, and reread the original venti paper, in full, before doing so. That will be important for improving neoventi design. Having a better understanding of what I'm replacing is important, and will be critical for writing the paper.

TASK(fossil): neofossil should use spare blocks as cache. After a daily sync, my used-block count is 0, which is ridiculous - I should not need to reach out to venti to access e.g. /amd64/bin/rc!
	Mon Dec 25 15:24:36 EST 2023:
	Turns out that this is intentional, per the venti paper. "Venti is as fast as disk anyways." Given that their own numbers show it as, at BEST, half as fast, this is ridiculous. I want to make it true, but it's *not* currently true, and it's silly to waste network bandwidth when cache space is so cheap anyways.


TASK(paper): look at more modern competitors than WAFL/AFS/etc. Look into gefs and ZFS, in particular.

TASK(neoventi): investigate the possibility of storing large blocks, and e.g. streaming media files directly out of venti. (Also useful for e.g. torrents?)

TASK(neoventi, research): multi-level blocks? e.g. reserve space for a torrent as a giant block, and _also_ for its individual blocks, pointing to the same storage on disk?

TASK(neofossil, research): look into gear chunking more thoroughly. Wanna understand the math on this one :)

TASK(neofossil, research): need to look into improving security, as well. Actually implementing authentication is a good starting point.

Need to shut off as the train is arriving now, but I'm on page 4.

Finished reading the paper; got a few interesting things from it I missed previously, but it's mostly unhelpful. DESIGN(neoventi): need to figure out what to replace the type with / how better to serve the "look up root scores" use case in a secure manner. Per-user lists in read-write storage, with a 'public' bit on each entry, might be acceptable?

Going to implement the checkers _before_ the formatter. That'll ensure I have a total awareness of the disk format when I go to write the formatter.

Checkarenas first. venti/checkarenas operates on a _singular arena partition_ as follows:

• initarenapart
• for arena ∈ arenapart:
	• if verbose, print arena info
	• if scanning the entire arena:
		• For clump ∈ arena:
			• Verify the clump's magic
			• If the clump's magic is wrong, and we're set to fix, assume that there was a failed write, and mark the clump as free?
				• TODO(venti/check, correctness): what effect does this have if there's a corrupt clump in the middle of correct data in the active arena? Would we end up appending to it?
				• TODO(venti/check, correctness): what happens if we run this on a sealed arena that has been corrupted in the middle?
			• Load the clump's data from the log
			• Verify the clump's score
			• If the score is wrong _and the encoding is none_, assume that the block was only partially written, and ignore it.
				• TODO(venti/check, correctness): check if this assumption holds - and, also, what to expect if we crash in various places.
			• detect invalid types
				• Mark them as corrupt, if set to fix
	• Verify that the header is correct
	• If the memstats and diskstats are out of sync, replace the diskstats.
		• TODO: check why this is needed. Seems likely that this is to recover from when we think we wrote data but it didn't get synced properly? That shouldn't be possible, though, since we should not update stats before writing out the blocks they refer to, presumably? venti is crazy enough that it wouldn't surprise me if it violates that requirement though.

So what I actually want for my replacement, then:

• For arena ∈ arenapart:
	• For clump ∈ arena:
		• Read clump from log, decompressing if necessary
		• Read clump info from arena directory trailer
		• Cross-reference clump
		• Validate score
		• If compressed:
			• Recompress, double check result
	• Verify the header
	• Verify the trailer
	• If sealed, verify the arena's checksum

TASK(neoventi, optimization, unimportant): optimize arena partition loading? Takes like 600ms right now, which is hardly terrible, but it can probably be better.

neoventi/checkarenas is reading the wrong address. While investigating venti/checkarenas behavior, I noticed it was reading the same address repeatedly (and hitting the disk cache on most of them), and assumed it was doing something wrong.

clumpmagic clump 0, aa 0
getting dblock from c4000
preading c4000
loadclump
getting dblock from c4000
found c4000 in cache
readclumpinfo
preading 200be000

clumpmagic clump 1, aa 4e
getting dblock from c4000

It is, but not as much as I thought. The double reads are avoidable, but it's apparently caching entire 8KiB sections, so all the smaller blocks that fit inside get prefetched at once.

Actually, it's doing something else _worse_ than I thought.


clumpmagic clump 15, aa 6ca
loadclump
getting dblock from c4000
found c4000 in cache
getting dblock from c6000
preading c6000

clumpmagic clump 16, aa 26f0
loadclump
getting dblock from c6000
found c6000 in cache
getting dblock from c8000
preading c8000

clumpmagic clump 17, aa 4716
loadclump
getting dblock from c8000
found c8000 in cache
getting dblock from c8000
found c8000 in cache
getting dblock from ca000
preading ca000
readclumpinfo
found 200be000 in cache
clumpmagic clump 18, aa 673c
loadclump
getting dblock from ca000
found ca000 in cache
getting dblock from ca000
found ca000 in cache
getting dblock from cc000
preading cc000
readclumpinfo
found 200be000 in cache
clumpmagic clump 19, aa 8762
loadclump
getting dblock from cc000
found cc000 in cache
getting dblock from cc000
found cc000 in cache
getting dblock from ce000
preading ce000
readclumpinfo
found 200be000 in cache
clumpmagic clump 20, aa a788
loadclump
getting dblock from ce000
found ce000 in cache
getting dblock from ce000
found ce000 in cache
getting dblock from d0000
preading d0000
readclumpinfo
found 200be000 in cache

It is not maintaining alignment _at all_. This is terrible for performance, I think? A single block read should only need to read from one sector, not two? Should probably test this. Regardless, this is unimportant for the read path, and should only impact how I design the write path.

The important question, though, is why we're failing to pread. 

...fd is bad. Jeez. The address might actually be right?? What the heck...

... oh for fucks sake.

for(int i = numarenas; i >= 0; i -=1)

EXCLUSIVE BOUND. NOT INCLUSIVE BOUND.

Whoops.
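
i.e. it should start one below the count:

	for(int i = numarenas - 1; i >= 0; i -= 1)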