Let's get venti lookups working. At present, we're not finding the entry in the bucket, even though the score was taken directly from fossil, which I'm typing this into, so it definitely is there.

Walk backwards through the logic in oldventi, compare the results, find the problem, document this here.

lookupscore in icache.c checks the index cache, then falls back to loadientry if it's not found there. That takes in the index, score, and type, and generates an index entry; let's check that out in /sys/src/cmd/venti/srv/index.c:652...

First, it checks the bloom filter; that's used to detect absence early, so we can skip it.

Then, it loads the index bucket - double check that we're grabbing the right bucket, then check that we're looking up the entry in the bucket correctly...


loadibucket is a thin wrapper around loadibucket1, which processes the args a bit for loadibucket0; that's where the real work happens.

loadibucket0 is given the index, the bucket number, an out pointer to which the index section is written if found, an out pointer to which the section-relative bucket number is written if found, an out pointer into which the index bucket is unpacked from the raw bytes, and the mode with which to retrieve the disk block.

The absolute bucket index is the hashbits(score, 32) divided by the index divisor. Aside from that, all parameters are passed through from loadibucket exactly. The score is a u8*, and I think I used char* for neoventi. This is quite probably a bug even if it's not the one I'm looking for. Nothing else is of interest from loadientry, parameter-wise.

Let's go in order, though. Double check the hash...

Given the u8* score and the number of result bits, the hash is the first four bytes of the score interpreted as a big-endian u32. The maximum hash length is 32 bits; if fewer bits are requested, the value is simply shifted right.
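
In other words, something like this (a sketch reconstructed from that description, not copied from venti):

	u32int
	hashbits(u8int *score, int bits)
	{
		u32int v;

		/* first four bytes of the score, most significant byte first */
		v = (score[0] << 24) | (score[1] << 16) | (score[2] << 8) | score[3];
		if(bits < 32)
			v >>= 32 - bits;	/* fewer bits requested: drop the low bits */
		return v;
	}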

...we're using U32GET(score) to retrieve the hash, but score is char*. U32GET only casts the _result_ to u32, not the parameter. Fix that, compare the resulting bucket index...

That was definitely _one_ of the issues; this moves us from bucket 4880611 to bucket 4226740. There's likely other stupid type errors in here still >_>

Going to preemptively switch to using u8int* for the score everywhere. That might be enough to solve this, and even if not, it will prevent other stupid bugs in the future.

Didn't fix this, but I wasn't really expecting it, so it's fine ☺ 

Idea: maybe write a frontend to venti/srv code that dumps intermediate info for comparison? That should be relatively easy - there are other venti/srv frontends (wrarena, for instance) I can use as a template - and it would make verification easy.

Yeah, that was a good idea:

% ./6.tester
found in bucket 1913977

It COMPLETELY disagrees? Need to double-check units; venti/tester is reporting a section-relative bucket number, and I don't think I am in neoventi.

Nope, not the problem.

venti/tester modified reports: found in bucket 1913977, [0, 4880645], 1684300339/880 = 1913977

The index divisor is 880, the 'hash' is 1684300339.

neoventi reports 'Looking in bucket 3719531835 / 880 = 4226740'.

The hash is wrong??

THE SCORE IS WRONG! >_>

We have it in string form, we want it in raw bytes!

...no, actually, we have it in raw bytes. I was embedding it in string form in the tester >_> Oops.

With that fixed:

% ./6.tester
found in bucket 4226740, [0, 4880645], 3719531835/880 = 4226740

We're looking in the right bucket! Yay!

That means we're misreading the bucket? Not scanning it correctly?

Bucket data comes after the u16 entry count and the u32 magic; IBucketSize in venti/srv is U32Size+U16Size. unpackibucket sets n to the u16 at the front, and data to buf+IBucketSize; the magic is checked as the u32 after the first u16.
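
For reference, the unpack as I understand it (a sketch, not the actual neoventi code):

	/* An index bucket on disk: u16 entry count, u32 magic, then packed entries. */
	static int
	unpackbucket(u8int *buf, u32int bucketmagic, u8int **entries, int *n)
	{
		*n = U16GET(buf);
		if(U32GET(buf + 2) != bucketmagic)
			return -1;	/* wrong magic: not a valid bucket for this section */
		*entries = buf + 6;	/* IBucketSize = U16Size + U32Size */
		return 0;
	}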

We're doing all that right.

We're reading the bucket with one entry in it. Probably our bucketlookup that's wrong, then?

...uh. Or... maybe not?

Extended the tester to unpack and report the bucket contents:

	IBucket b;
	unpackibucket(&b, no->data, sect->bucketmagic);
	print("bucket entries: %d\n", b.n);
	int h = bucklook(score, -1, b.data, b.n);
	print("bucklook: %d\n", h);
	if(h&1){
		h ^= 1;
		print("Found at offset %d\n", h);
	} else {
		print("not found in bucket\n");
	}

Result:

bucket entries: 1
bucklook: 38
not found in bucket

venti/srv doesn't see that score, either??? What the hell?
vacfs: vacfsopen: no block with score ddb38d3b21f95644b181052dc096ebdf2862dd83/16 exists

Yeah, confirmed. Huh??? I copied that score from fossil/l... fossil/last. Are its output scores invalid? ...yep. Oops.

Let's try with a new score! 4091f8b8aafc7b8a815f38534b70a171e4ae3e44 taken from fscons.

Establish a baseline first, which I should ALWAYS do: vacfs is able to mount that one. okay. venti/tester does not see it, though. Are vac scores not direct venti scores?

I'm fairly sure they should be, and a quick skim through vacfs code supports that. Okay, let's try a different approach: venti/tester expansion to loadientry directly, none of the manual bucket futzing:

	IEntry e;
	if(loadientry(index, score, -1, &e) < 0){
		print("failed to load ientry\n");
	} else {
		print("ientry loaded\n");
	}

failed to load ientry

...hmmm. Is it possible for data in the latest arena to be unindexed? Perhaps the issue is the assumption that the data should be in the index to begin with.

Walking through the source; readlump=>lookupscore, which, if the entry is not in the cache, will loadientry; if loadientry fails, it passes an error up, and readlump fails.

...data in the latest arena has to be loaded into the index on startup, doesn't it? is it injecting it directly into the index cache and leaving it dirty??

Yep, syncindex(). Okay, let's add a syncindex call to venti/tester...

That's not working, either. let's go a step higher: try readlump()?

failed to read lump: no block with score %V%/4542084 exists

what. 
seterr(EOk, "no block with score %V/%d exists", score, type);

Ahhhhhh, %V format handler not installed, so score isn't printed, and it's printing the score as the type.
ventifmtinstall();

failed to read lump: no block with score 4091f8b8aafc7b8a815f38534b70a171e4ae3e44/-1 exists

much better! Now, copy-paste that score and vacfs it again...

 % vacfs -h tcp!127.1!13031 vac:4091f8b8aafc7b8a815f38534b70a171e4ae3e44
% ls /n/vac
/n/vac/active
/n/vac/archive
/n/vac/snapshot

and lo! This is bullshit! Ah wait, reading the lump before syncindex, let's fix that...
	if(syncindex(index) < 0)
		sysfatal("sync failed: %r");
	Packet *data = readlump(score, -1, -1, nil);
	if(data == nil)
		print("failed to read lump: %r\n");
	else
		print("successfully read lump!\n");
	IEntry e;
	if(loadientry(index, score, -1, &e) < 0){
	...


and huh!

successfully read lump!
failed to load ientry

Progress!

...is it being inserted into the lump cache on startup, but not indexed?

	int cached;
	Packet *data = readlump(score, -1, -1, &cached);
	if(data == nil)
		print("failed to read lump: %r\n");
	else
		print("successfully read lump, cached: %d\n", cached);
successfully read lump, cached: 0

The plot thickens! It's NOT in the lump cache! readlump then uses lookupscore to find the index address... which checks the index cache, then loads the ientry. Let's check the icache?
	IAddr addr;
	if(lookupscore(score, -1, &addr) == 0)
		print("found score in index cache at %llud, size %lud, type %d, blocks %d\n", addr.addr, addr.size, addr.type, addr.blocks);
	else
		print("not found in index cache\n");

found score in index cache at 75373226395, size 300, type 16, blocks 1
failed to load ientry

Yeeeep, there it is. it's inserted into the index cache on startup. That's separate from syncindex, though?

Nope, removing syncindex results in it not being in the index cache. Whelp, pretty sure that's a bug in venti/venti: I don't think it will allow you to read the latest data when in read-only mode, where it skips syncindex.

Okay, I don't intend to have this behavior, necessarily. Let's just do a full flush. 
% hget http://127.1:13032/flushicache

Note: the index flusher is disgustingly slow. Disk I/O and CPU were both at effectively zero for minutes while it ran. I didn't realize it was _working_, it's so slow O_o.

venti/read -h tcp!127.1!14011 vac:4091f8b8aafc7b8a815f38534b70a171e4ae3e44
kill 108118
Initializing neoventi build 1...
reading 1416 map entries
found 1416 arenas
Parsing index v1...
reading 1 map entries
reading 1416 map entries
received a connection at /net/tcp/18, fd 7
handling read of score 4091f8b8aafc7b8a815f38534b70a171e4ae3e44
Looking in bucket 1083308216 / 880 = 1231032
relative bucket: 1231032, section [0, 4880645]
entries: 3
arena arenas0.140, start 212543899, size 17
WE HAVE THE DATA!!!!!

Okay! We're able to read the data now. All RPC calls start with the RPC size, then message type, then tag. For Rread, the rest of the message is the data.

	dbuf = malloc(4 + size);
	dbuf[0] = (size+2)>>8;
	dbuf[1] = (size+2) & 0xFF;
	dbuf[2] = VtRread;
	dbuf[3] = buf[3];
	vtreadarena(arena, addr, dbuf+4, &size);
	print("WE HAVE THE DATA!!!!! size: %ud\n", size);
	// just slam the dbuf packet over the wire!
	if(write(conn.fd, dbuf, size + 4) != size+4){
		print("failed to write data\n");

...


venti/read: could not read block: read asked for 4091f8b8aafc7b8a815f38534b70a171e4ae3e44 got db1a96f91d93fca56f68c352a428a5fbdb71f0d5

Oh, lovely. Bucket's good, but actual data read's busted. Back to venti/tester...
	Packet *data = readlump(score, -1, -1, &cached);
	if(data == nil)
		print("failed to read lump: %r\n");
	else
		print("successfully read lump, cached: %d, size: %lud\n", cached, packetsize(data));
successfully read lump, cached: 0, size: 300

okay. back to readlump...

Looks up the score as an IAddr, which is... an address into the index? It then uses the index to map that to an address inside of an arena, before actually loading the clump from that arena.

Our vtreadlookup combines these steps: given a score, it looks it up in the index, finds the arena+address, and yields that info all to the caller. So, let's compare the intermediate results.

First off, lookupscore() already gives us the right size. venti/tester already exposes it:

	if(lookupscore(score, -1, &addr) == 0)
		print("found score in index cache at %llud, size %lud, type %d, blocks %d\n", addr.addr, addr.size, addr.type, addr.blocks);

 found score in index cache at 75374255700, size 300, type 0, blocks 1

lookupscore() grabs the score from loadientry(), which is just loading the bucket and finding the entry in the bucket.

Found the issue while reading the code, no test needed :P

	*addr = U64GET((buf + 6 + (entry * IEntrySize) + 26));
	*size = U16GET((buf + 6 + (entry * IEntrySize) + 28));

The address is eight bytes, not two, so the size field's offset should be 34, not 28.
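
Presumably the corrected read looks like this (same local names as above; the +6 skips the bucket header, 26 and 34 are the in-entry offsets):

	*addr = U64GET((buf + 6 + (entry * IEntrySize) + 26));
	*size = U16GET((buf + 6 + (entry * IEntrySize) + 34)); /* size follows the 8-byte address */

With that fixed: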

 arena arenas0.140, start 212543899, size 300
WE HAVE THE DATA!!!!! size: 300

could not read block: read asked for 4091f8b8aafc7b8a815f38534b70a171e4ae3e44 got 746311957a8fafc724bc122884778f97d4a5c769

That's a different wrong hash, at least? and now the size is correct! Either the address is wrong, the conversion is wrong, or the way it's being read/sent is wrong, I think.

tester update:
	u64int aa;
	Arena *arena = amapitoa(index, addr.addr, &aa);
	if(arena)
		print("mapped to arena addr %llud\n", aa);

mapped to arena addr 213573204

On our side, we've got:
	arena arenas0.140, start 212543899, size 300

That address looks wrong?

Let's expand the tester a bit:
	Clump cl;
	u8int readscore[20];
	ZBlock *zb = loadclump(arena, aa, addr.blocks, &cl, readscore, 1);
	if(zb)
		print("read clump from disk: size %lud\n", cl.info.uncsize);

read clump from disk: size 300

okay. let's double check that it is the offset that's the problem by hardcoding it and seeing what happens.

venti/read: could not read block: read asked for 4091f8b8aafc7b8a815f38534b70a171e4ae3e44 got f8e0d5ed942a9fe3c234569b14b18de0536cdd61

We just get a different wrong hash >_>

Okay, that was maybe a silly approach. let's figure out _why_ the addresses differ...

First, let's double check our pre-mapping addresses.

tester gives > found score in index cache at 75374255700, size 300, type 0, blocks 1

75374255700.

we've got > found base addr 75373226395

whoops. Our index entry is wrong, then? Let's double check the bucket, and where in it we are...

tester gives offset 38 for the bucket.

we've got... found in bucket at offset 0

...huh. Oops? One entry is 38 bytes; we're finding it as the first entry in the bucket, while venti/srv finds it as the second?

Ahhhh, the type is probably wrong. The hash for both entries is the same? Looks like the same data was written with multiple types??

...ohhhh, the clump on disk isn't just the raw bytes! Oops!

It'll probably work without checking the type if I fix the disk read?

Yeaaaaaaah, loadclump:
• Reads blocks, not bytes
• calls unpackclump to actually get the data out
• uncompresses if necessary

We can't ignore any of that. The clump encoding may need more than the listed size.

...we can try something stupid, though. The clump on disk consists of:
4 bytes of magic
1 byte of type
2 bytes for size
2 bytes for uncompressed size
20 bytes for the score
1 byte for the encoding (compression info)
4 bytes for the creator
4 bytes for the time
and then the data
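
Written out as a struct for reference (field names are mine; this only documents the layout, since the on-disk form is packed and big-endian, not memory-mappable):

	typedef struct {
		u32int magic;		/* offset 0 */
		u8int  type;		/* offset 4 */
		u16int size;		/* offset 5: stored (possibly compressed) size */
		u16int uncsize;		/* offset 7: uncompressed size */
		u8int  score[20];	/* offset 9 */
		u8int  encoding;	/* offset 29: compression info */
		u32int creator;		/* offset 30 */
		u32int time;		/* offset 34 */
		/* data follows at offset 38 */
	} DiskClump;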

For an uncompressed block, we shooooould be able to just read 38 bytes plus the data, and skip the first 38 bytes?

Didn't work; still had the wrong hash. Hopefully that means it's compressed?
	char clump[38];
	vtreadarena(arena, addr, clump, &size);
	print("clump encoding: %d\n", clump[29]);

Didn't even need the tester for this; the encoding is 2, which is - according to venti/srv/dat.h - ClumpECompress. It's compressed and needs unwhacking.

Let's just copy unwhack in wholesale. That's ~200 lines of code, and there's no point reimplementing _that_ or, bluntly, understanding it.

it fucking wooooorks!

static int
readclump(uchar *dst, VtArena arena, u64int addr, u8int blocks)
{
	uchar *buf = malloc(blocks << ABlockLog);
	u16int size;
	size = blocks<<ABlockLog;
	vtreadarena(arena, addr, buf, &size);
	// clump header: magic 0-3, type 4, size 5-6, uncsize 7-8, score 9-28, encoding 29
	size = U16GET(buf+7);
	if(buf[29] == 2){
		// ClumpECompress: unwhack into dst
		Unwhack uw;
		unwhackinit(&uw);
		if(unwhack(&uw, dst, size, buf+38, U16GET(buf+5)) != size)
			sysfatal("decompression failed: %s", uw.err);
	} else if(buf[29] == 1)
		// ClumpENone: raw data follows the 38-byte header
		memcpy(dst, buf+38, size);
	free(buf);
	return 1;
}

# Sun Dec 24 19:09:00 EST 2023

Okay! Everything works and has been partly cleaned up and refactored. Needs more of that, but getting the write path working is a higher priority right now.

Probably going to want to buffer writes, and actually handle Tsync as a flush call. If I'm going to want a server-side cache as well, I should probably have them unified, and have a separate set of dirty entries? The question is, what layer do I put the cache at?

I could do a disk-level cache, which tracks data by address on disk, with a dirty bit on each entry - this would be a singular cache which handles both index and arena data. IIRC this would be akin to venti's dcache, and I'm not sure I want that.

The index cache is the most useful, though if it does not get flushed, we need to do a scan on startup to populate it the same way venti does, and this has been one of the behaviors that has irked me. Blocking until all data is available is reasonable in this situation, but getting *into* that situation is kinda crazy.

One of the major annoyances with original venti is how poorly it utilizes the storage disk. While writing data, it will take a minute before it starts flushing _any_ of it out. We should be trying to write data more immediately than that.

On the other hand, always writing immediately may not be ideal? but eh, I'd rather try anyways and then see how it works out. Venti is WORM, so normal patterns fall into two general categories in my experience:

• File system running on top
	- This flushes periodically
	- Most of the time, it will be totally idle
	- When it writes, it writes all data at once
• One-time upload of data
	- Archives, vacballs
	- This should also just fire data at us full-speed

In both situations, we should expect to have a constant stream of data when we're writing. We have two streams of data to write:

• The log
	• This is written linearly, chunk after chunk after chunk
	• Each chunk has a checksum, but calculating those is trivial
• The index
	• This writes to totally random parts of the disk
		• _scatter storage addressing_, after all!

Buffering the log might be reasonable, since it allows for writing larger blocks at a time.

but... why speculate when I can measure? Start by just writing data the instant it comes in. If that's not fast enough, look into using multiple procs for flushing data. If that's not fast enough, look into other options. Overengineering is Bad™.

Just double checked venti and it looks like it has one writer per partition, so it shouldn't be terribly hard to beat its performance without even resorting to multithreading - just by having a better implementation of the single-threaded writing.

Caching / prefetching / some such system will definitely be needed to win on the read side - or, well, *probably* be needed. Parallelized readers might be sufficient, though that may require changes to libventi / vacfs / etc to enable multiple reads in flight at a time. But again, measure before designing.

Worst case, neoventi's prototype has demonstrated that I _can_ do this, even if I have to change the design a bunch more. And, I kinda want to do the final implementation in Hare anyways. I'll need to set up a new neoventi for write testing, because testing it on the root drive would be moronic - I'd have to boot the "real" venti read-only for that, and I've no idea how fossil would handle that. That makes this as good a time as any to actually implement the config handling...

# Mon Dec 25 10:08:18 EST 2023

Actually, ended up focusing on the refactor / cleanup, still. That's sufficient for now, though. And, took a break to fix album art loading in zuke :P which broke during a botched merge.

I should probably also take a break to implement a neoventi formatter, and disk structure checkers, to replace the venti tools. I can test those against existing venti, and it'll be useful for setting up the test system instead of relying on venti's tools.

I also need to take a pause, and reread the original venti paper, in full, before doing so. That will be important for improving neoventi design. Having a better understanding of what I'm replacing is important, and will be critical for writing the paper.

TASK(fossil): neofossil should use spare blocks as cache. After a daily sync, my used blocks is 0, which is ridiculous - I should not need to reach out to venti to access e.g. /amd64/bin/rc! 
	Mon Dec 25 15:24:36 EST 2023:
	Turns out that this is intentional, per the venti paper. "Venti is as fast as disk anyways." Given that their own numbers show it as, at BEST, half as fast, this is ridiculous. I want to make it true, but it's *not* currently true, and it's silly to waste network bandwidth when cache space is so cheap anyways.


TASK(paper): look at more modern competitors than WAFL/AFS/etc. Look into gefs and ZFS, in particular.

TASK(neoventi): investigate the possibility of storing large blocks, and e.g. streaming media files directly out of venti. (Also useful for e.g. torrents?)

TASK(neoventi, research): multi-level blocks? e.g. reserve space for a torrent as a giant block, and _also_ for its individual blocks, pointing to the same storage on disk?

TASK(neofossil, research): look into gear chunking more thoroughly. Wanna understand the math on this one :)

TASK(neofossil, research): need to look into improving security, as well. Actually implementing authentication is a good starting point.

Need to shut off as the train is arriving now, but I'm on page 4.

Finished reading the paper; got a few interesting things from it I missed previously, but it's mostly unhelpful. DESIGN(neoventi): need to figure out what to replace the type with / how better to serve the "look up root scores" use case in a secure manner. Per-user lists in read-write storage, with a 'public' bit on each entry, might be acceptable?

Going to implement the checkers _before_ the formatter. That'll ensure I have a total awareness of the disk format when I go to write the formatter.

Checkarenas first. venti/checkarenas operates on a _singular arena partition_ as follows:

• initarenapart
• for arena ∈ arenapart:
	• if verbose, print arena info
	• if scanning the entire arena:
		• For clump ∈ arena:
			• Verify the clump's magic
			• If the clump's magic is wrong, and we're set to fix, assume that there was a failed write, and mark the clump as free?
				• TODO(venti/check, correctness): what effect does this have if there's a corrupt clump in the middle of correct data in the active arena? Would we end up appending to it?
				• TODO(venti/check, correctness): what happens if we run this on a sealed arena that has been corrupted in the middle?
			• Load the clump's data from the log
			• Verify the clump's score
			• If the score is wrong _and the encoding is none_, assume that the block was only partially written, and ignore it.
				• TODO(venti/check, correctness): check if this assumption holds - and, also, what to expect if we crash in various places.
			• detect invalid types
				• Mark them as corrupt, if set to fix
	• Verify that the header is correct
	• If the memstats and diskstats are out of sync, replace the diskstats.
		• TODO: check why this is needed. Seems likely that this is to recover from when we think we wrote data but it didn't get synced properly? That shouldn't be possible, though, since we should not update stats before writing out the blocks they refer to, presumably? venti is crazy enough that it wouldn't surprise me if it violates that requirement though.

So what I actually want for my replacement, then:

• For arena ∈ arenapart:
	• For clump ∈ arena:
		• Read clump from log, decompressing if necessary
		• Read clump info from arena directory trailer
		• Cross-reference clump
		• Validate score
		• If compressed:
			• Recompress, double check result
	• Verify the header
	• Verify the trailer
	• If sealed, verify the arena's checksum

TASK(neoventi, optimization, unimportant): optimize arena partition loading? Takes like 600ms right now, which is hardly terrible, but it can probably be better.

neoventi/checkarenas is reading the wrong address. While investigating venti/checkarenas behavior, I noticed it was reading the same address repeatedly (and hitting the disk cache on most of them), and assumed it was doing something wrong.

clumpmagic clump 0, aa 0
getting dblock from c4000
preading c4000
loadclump
getting dblock from c4000
found c4000 in cache
readclumpinfo
preading 200be000

clumpmagic clump 1, aa 4e
getting dblock from c4000

It is, but not as much so as I thought. The double-reads are avoidable, but it's actually caching entire 8KiB sections, so all the smaller blocks that fit in one get prefetched at once.

Actually, it's doing something else _worse_ than I thought.


clumpmagic clump 15, aa 6ca
loadclump
getting dblock from c4000
found c4000 in cache
getting dblock from c6000
preading c6000

clumpmagic clump 16, aa 26f0
loadclump
getting dblock from c6000
found c6000 in cache
getting dblock from c8000
preading c8000

clumpmagic clump 17, aa 4716
loadclump
getting dblock from c8000
found c8000 in cache
getting dblock from c8000
found c8000 in cache
getting dblock from ca000
preading ca000
readclumpinfo
found 200be000 in cache
clumpmagic clump 18, aa 673c
loadclump
getting dblock from ca000
found ca000 in cache
getting dblock from ca000
found ca000 in cache
getting dblock from cc000
preading cc000
readclumpinfo
found 200be000 in cache
clumpmagic clump 19, aa 8762
loadclump
getting dblock from cc000
found cc000 in cache
getting dblock from cc000
found cc000 in cache
getting dblock from ce000
preading ce000
readclumpinfo
found 200be000 in cache
clumpmagic clump 20, aa a788
loadclump
getting dblock from ce000
found ce000 in cache
getting dblock from ce000
found ce000 in cache
getting dblock from d0000
preading d0000
readclumpinfo
found 200be000 in cache

It is not maintaining alignment _at all_. This is terrible for performance, I think? A single block read should only need to read from one sector, not two? Should probably test this. Regardless, this is unimportant for the read path, and should only impact how I design the write path.

The important question, though, is why we're failing to pread. 

...fd is bad. Jeez. The address might be actually right?? What the heck...

... oh for fucks sake.

for(int i = numarenas; i >= 0; i -=1)

EXCLUSIVE BOUND. NOT INCLUSIVE BOUND.

Whoops.

...My approach here has been completely wrong. I should not be blindly reimplementing venti/checkarenas on top of neoventi's code base at all. I should be understanding the disk format, and then writing a from-scratch corruption checker.

For now, I should probably take another glance at the venti paper, and at my actual venti, and get a concrete spec for what the current disk format is. ...the paper isn't nearly complete enough, though. To /sys/src, of course.

The config file - which may be an actual file, if not inside of venti, so let's assume that for simplicity - specifies the roots.

arenas /dev/kaladin/arenas
isect /dev/kaladin/isect

arenas entries are loaded as ArenaParts in venti, which is done via initarenapart on a Part, which is just an abstraction over an opened file, effectively (there's a minor caveat that partitions in venti can actually be _parts of a file_, and not an entire file - i.e. a single disk can be partitioned _by venti_, and each part tracks the offset and size into the underlying fd).

Similarly, the index section is loaded as an ISect by initisect.

Each partition should have 256KiB of unused space at the beginning. Let's verify that in practice...

The space is not wiped by formatting, and is not truly _blank_. The config file exists at the very end of this space, in fact!

After that, arena partitions have a 16-byte header, consisting of four four-byte big-endian fields:

u32 magic
u32 version
u32 blocksize
u32 arenabase

The magic is 0xa9e4a5e7U.
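
A sketch of reading it (names and error handling are mine; fields are big-endian, and the header sits at PartBlank = 256KiB):

	enum { PartBlank = 256*1024, ApMagic = 0xa9e4a5e7 };

	static int
	readapheader(int fd, u32int *version, u32int *blocksize, u32int *arenabase)
	{
		u8int buf[16];

		if(pread(fd, buf, 16, PartBlank) != 16)
			return -1;
		if(U32GET(buf) != ApMagic)
			return -1;	/* not an arena partition */
		*version = U32GET(buf+4);
		*blocksize = U32GET(buf+8);
		*arenabase = U32GET(buf+12);
		return 0;
	}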

% dd -if /dev/kaladin/arenas -bs 1024 -skip 256 -count 1 | xd
1+0 records in
1+0 records out
0000000  a9e4a5e7 00000003 00002000 000c2000
0000010  00000000 00000000 00000000 00000000

As can be seen from the venti I'm typing this on, the current arenaparts version is 3. My block size is set to 8KiB, and the first arenabase is 0xc2000. I'm not certain what the arena base is; likely the offset of the first arena, but relative to the partition? The header?

There's also a table which lists all arenas in the partition, which appears to be aligned to the first block after the header.

ap->tabbase = (PartBlank + HeadSize + ap->blocksize - 1) & ~(ap->blocksize - 1);

The table is required to be _before_ the arenabase. As tabbase is relative to the partition, the arenabase appears to be as well.

0xc2000 puts the arena base 776KiB into the file; let's test that. The arena header magic is 0xd15c4eadU.
% dd -if /dev/kaladin/arenas -bs 1024 -skip 776 -count 1 | xd
1+0 records in
1+0 records out
0000000  d15c4ead 00000005 6172656e 6173302e
...

Okay, so there's definitely an arena there. What's the use of arenabase, though, if all arenas and their offsets are listed in the table? For recovery in case the table is corrupt? Ahh, no, it's really meant to mark the _end of the table_.

The arena partition is just a set of arenas with a table to allow quick lookups by name, apparently?

The arena table is... textual, of course. Cool. It should be easy to find it; 256KiB into the file, plus 512 bytes, then round up to the next 8KiB block; that's just 264KiB in, and - since the arena base is 776KiB in - it should be exactly 512KiB? Suspiciously exact, but let's check.

% dd -if /dev/kaladin/arenas -bs 1024 -skip 264 -count 512 >/tmp/table

While /tmp/table is 512KiB, most of it is NULL. The partitioner appears to round up to 512KiB no matter what. 

The table itself is formatted simply enough. It's a u32 which lists the number of entries, followed by that number of entries, each consisting of a textual name, a tab character, a u64 start, a tab, a u64 stop, and a newline delimiter.

% cat /tmp/table
1416
arenas0.0	794624	537665536

The first entry is listed as starting at 794624, which happens to be exactly 776KiB into the file, exactly equal to the arenabase - this confirms that addresses in the map are, in fact, direct offsets into the partition.
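
Parsing the table is simple enough; roughly (a sketch, assuming the table has already been read into a NUL-terminated buffer, with a hypothetical callback):

	// Textual arena table: a decimal entry count on the first line, then one
	// "name <tab> start <tab> stop" line per arena.
	static int
	parsetable(char *tab, void (*addarena)(char *name, u64int start, u64int stop))
	{
		char *p, *e;
		int i, n;
		u64int start, stop;

		n = strtoul(tab, &p, 10);
		for(i = 0; i < n; i += 1){
			if((p = strchr(p, '\n')) == nil)
				return -1;
			p += 1;
			if((e = strchr(p, '\t')) == nil)
				return -1;
			*e = 0;	// terminate the name in place
			start = strtoull(e+1, &e, 10);
			stop = strtoull(e+1, &e, 10);
			addarena(p, start, stop);
			p = e;
		}
		return n;
	}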

I'm going to define an ArenaParts version=4 that uses a binary table instead, but that's a later project. TODO.

Arenas contain both a header and a trailer. Both occupy one block. NOTE: the usage of blocks as a unit for fundamental structures imposes a limit on the block size.

The header and trailer _both_ contain some of the information, for redundancy.

The trailer structure is as follows:

u32 magic = 0xf2a14eadU
u32 version = (4|5)
[64]u8 arena name
Disk stats:
	u32 clumps
	u32 cclumps
u32 ctime
u32 wtime
?version == 5: u32 clumpmagic
	u64 used
	u64 uncompressed size
	bool sealed
bool has_memstats
?has_memstats: Memory stats
	u32 clumps
	u32 cclumps
	u64 used
	u64 uncsize
	bool sealed

Arenas are indexed live. The disk stats represents what is committed fully to disk: i.e., what the index sees. The memory stats reveal clumps that are in the arena, but not in the index.

The arena header structure is as follows:

u32 magic = 0xd15c4eadU
u32 version
[64]u8 arena name
u32  blocksize
u64 size
?version==5: u32 clumpmagic
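
As a struct, for reference (again, just documenting the packed, big-endian on-disk layout):

	typedef struct {
		u32int magic;		// 0xd15c4ead
		u32int version;		// 4 or 5
		char   name[64];	// NUL-padded arena name
		u32int blocksize;
		u64int size;
		u32int clumpmagic;	// only present when version == 5
	} ArenaHead;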

The first arena begins at 794624, which is 97 blocks in. The table ends at the end of block 95. 
% dd -if /dev/kaladin/arenas -bs 8192 -skip 96 -count 1 > /tmp/foo

Block 96 is totally empty. This may be intentional, to prevent an overwrite?

Block 97, on the other hand, contains the arena header, as expected - followed by a lot of empty space.

0000000  d15c4ead 00000005 6172656e 6173302e
0000010  30000000 00000000 00000000 00000000
0000020  00000000 00000000 00000000 00000000
0000030  00000000 00000000 00000000 00000000
0000040  00000000 00000000 00002000 00000000
0000050  20000000 05228fa3 00000000 00000000

All the space after that is empty, and reading it is pointless.

dat.h:114: 	ArenaSize4		= 2 * U64Size + 6 * U32Size + ANameSize + U8Size,
dat.h:115: 	ArenaSize5			= ArenaSize4 + U32Size,
dat.h:116: 	ArenaSize5a		= ArenaSize5 + 2 * U8Size + 2 * U32Size + 2 * U64Size,

Could just be reading one 512-byte sector instead of 16 of them >_> Operating at the block level could make some sense, but the readpart call specifies the block size explicitly. Specifying just the arena header size would likely be better >_>

Anywho, it's pretty clear that this arena is version 5, named `arenas0.0', has block size 0x2000 (8KiB), size is 1/8th of 4GiB => 512MiB, and the clump magic is 05228fa3. After this block should just be clumps?

%  dd -if /dev/kaladin/arenas -bs 8192 -skip 98 -count 1 > /tmp/foo

0000000  05228fa3 02002800 507b589d 11008702

Next block certainly begins with the clump magic!

% xd /tmp/foo | grep 05228fa3
0000000  05228fa3 02002800 507b589d 11008702
00000b0  841026d0 05228fa3 02004100 50b52af7
0000180  05228fa3 02006100 f0561e10 bb976037
0000400  05228fa3 01004601 2cf747e8 0c4f54aa
0000460  abf77a54 c018482b db007048 05228fa3
0000500  83e22783 e02fc928 09fce8a0 05228fa3

But it also contains that signature multiple times, and I'm pretty sure that those are all legitimately different clumps. A single block can contain multiple clumps. Can a clump cross blocks, though?

% dd -if /dev/kaladin/arenas -bs 8192 -skip 99 -count 1 > /tmp/foo
% xd /tmp/foo
0000000  9bcaea2e c32578e8 de947f10 d9dcc43c

Yep. Lovely. TODO: we should _try_ to pack clumps efficiently, and buffer them to do so, if needed. Having a read-write buffer area in venti would be useful for other purposes as well (e.g. score management).

As far as reading goes, though, I see why venti requires a block cache, and operates at the block level. Going to add one to neoventi now.

I want a unified cache for neoventi - for the index, and the arenas. For disk blocks, I want a map of (arena, block index) to (data block). For index, I want a map of (score) to (index entry). What does venti use for its caches?

Venti just uses the disk address as the cache key, and has the partition pointer in the cache entry for confirmation - so, if two partitions happen to have a block at the same offset, both will go in the same bucket, and one will simply be skipped during lookups.

The block size means that we've got at least a few bits we can use for the fd in the cache key, instead of needing to put the fd in the cache entry.

Need two things for a good cache: a good hash table to back it up, and a good eviction algorithm. The fifty tabs of research I've got open suggests that S3-FIFO is probably the best eviction algorithm, but I need to figure out the hash table first.

I'll probably want to take some inspiration from CLHT (the cache-line hash table) for optimal throughput, but I need to do more research and design.

One option for modification is to store keys and values in separate cache lines; this worsens best-case / common time from one cache line to two, but in exchange it should significantly increase what proportion of accesses hit the best-case time, and reduce space overhead.

It's probably reasonable to just implement a CLHT-like, write some benchmarking code, and then worry about improving it later. A CLHT-alike should be more than good enough, and far better than the shit that original venti is doing anyways.

So, short-term: we want 64 bytes per bucket. The key must contain the block index as well as the partition that the block is from. If we require a minimum block size of 256 bytes, which seems reasonable, we can use 8 bytes for the key: 7 for the block index, one for the partition index. The value for the disk cache should be a block; it's reasonable to use a pointer for now. That means 16 bytes per key-value.

We need room for a lock, though. Our libc uses an int for that, so we only strictly need 4 bytes. That still leaves inadequate room for a full fourth entry, short of changing the key or value format. I could use some of that room for chaining, as CLHT does, but I think linear probing should be better.

The other option is to try to reduce the key or value size. If we require one byte per key for the fd, can we reasonably reduce the block index to less than seven bytes? The "obvious" answer of three bytes would give a mere 4096 block indices, nowhere near sufficient. Can the value be reduced? Well, the cache size is static. If we had, say, 32GiB of block cache at 8KiB per entry, that'd be 32*1024^3 / 8*1024 = 4*1024*1024 = 4M entries, which requires 22 bits for indices, or three bytes! Using three full bytes would be fine up to ~128GiB of disk cache, and I'm more than happy setting that as a limit and optimizing for realistic hardware. ...even though my PC could realistically have more than 128GiB if I paid for more RAM >_>

using four bytes gives nicer math, so what if we did that? 8 bytes per key + 4 for value index = 12 bytes per entry; 5 entries + an int for locking gives us our nice bucket size of 64 bytes :D

But! That would make this cache specific to the disk blocks! Could it be reused more generally, or would it need to be a u64->u32 map?

Well, actually, would a u64->u32 map be sufficient for the other caches? For the index cache, in particular? It should be fine for clumps; if we use the on-disk address for clumps, we can just do the same trick for that map as for the disk blocks. For the index, though, keys are 20 bytes. A dedicated cache with key-buckets and value-buckets might work better there.

Three keys plus a lock for the key buckets, 15 values plus a lock for value buckets? Could work decently well. Having the disk cache should be enough to make a huge difference regardless, since it'd mean not needing any I/O at all for redundant index accesses, even though it'd mean redundant memory processing - but with a really good disk cache, that might be enough to catch up to venti on some of our tests anyways >_>

So! Short order of business is a u64→u32 map. Buckets contain a Lock, and five keys and five values. Keys are (block_index<<8)|(fd&0xff); and we assert that no fds are greater than 0xff; values are indices into the raw disk cache buffer, and refer to [blocksize]-sized entries, not bytes.

typedef struct {
	Lock;
	// index into raw cache
	u32int value[5];
	// block_index<<8 | fd; assert(fd < 0xff);
	u64int key[5];
} bucket_t;

The cache is managed as one giant array of buckets; if an entry is missing in a given bucket, we scan forwards to the next bucket. If it's missing from that bucket, we repeat until we find either an empty slot or the correct entry.

We'll start with a hardcoded table size of 1GiB, and hardcoded block size of 8KiB. This gives us 1024^3 / 8*1024 = 128K entries. With five per bucket, that means ~26K buckets, or ~1.6MiB of overhead.

...sizeof(Lock) in libc is actually eight. We can just copy lock.c and #pragma pack it, I guess. _tas won't care, probably? Unless it wants alignment? Which would make sense, and the lock should probably be aligned for maximum performance anyways. Could reduce entry size to four, I guess? Or use three bytes per value? Or try to reduce the key size...

Given 32TiB of storage, and 8K block size, maximum block index is 32*1024^4 / 8*1024 = 4*1024^3 = 4GiB = 4 bytes. We... can probably reduce key size to a u32, too? That would make us a u32→u32 map, and need 8 bytes per entry. It'd mean seven entries per bucket, too. That should work?

Ah! Except we need a byte for the fd, still. We could do nine bytes per entry, and still get six entries total?

typedef struct {
	Lock;
	// index into raw cache
	u32int value[6];
	u32int index[6];
	u8int fd[6];
} bucket_t;

It'd mean two wasted bytes per bucket, though.

Three bytes for the value would give a limit of 128G of cache, but that's probably fine. Three bytes value + one byte fd + four bytes index = 8 bytes per entry again. Lots of finagling, and lower limits, but more than adequate.

typedef struct {
	Lock;
	// block index (the key)
	u32int index[7];
	// 24 bits of value index plus 8 bits of fd
	u32int value[7];
} bucket_t;

When caching disk blocks, we don't want to rely on the score, do we? Probably saner to use the block index as the key there for bucket selection, too; a block will contain multiple scores (clumps / index entries) anyways.

Fibonacci hashing on the (index|fd) ought to be sufficient for spreading entries among buckets.

Can probably clean up that bucket format a bit, but the lack of a u24int makes it harder :/

We also need a free list to track available slots in the main data bucket. Instead of a linked list, we can make this a large statically-sized stack; the map size can never change. Given 128*1024 entries, we need 128*1024 slots on the stack, maximum, plus a current length, and each slot needs merely to hold a slot index, so a u32 is fine. This will use 512KiB; it could be reduced if we reduced entry count to 64*1024, but that's not a limitation I'm eager to enforce.
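
Something like this, maybe (a sketch; the names and the eviction placeholder are mine):

	enum { CacheEntries = 128*1024 };

	static u32int freestack[CacheEntries];	// indices of unused cache entries
	static u32int nfree;
	static Lock freelock;

	static u32int
	grabfree(void)
	{
		u32int index;

		lock(&freelock);
		if(nfree == 0)
			sysfatal("cache full; eviction not implemented yet");
		index = freestack[--nfree];
		unlock(&freelock);
		return index;
	}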

To insert, we need to do a few things:

- Pick a bucket
- Lock the bucket
- Check if it already contains the key
	- If so, update the value, and evict the old cache entry
- Otherwise, grab an empty slot, and set the key
	- Lock the free list
	- Grab an empty cache entry from the free list
	- Remove the entry from the free list
	- Unlock the free list
	- Copy the data into that entry
	- Set the value of the slot to that cache entry

Buckets are picked via a Fibonacci hash of (index<<8 | fd).
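
As a rough sketch of those steps (getbucket, grabfree, data, NENTRIES, and BLOCKSIZE are assumed helpers/constants, and eviction of the replaced entry is elided):

	static void
	cachemapinsert(int fd, u64int blockindex, char *block)
	{
		u64int key = (blockindex << 8) | (fd & 0xff);
		bucket_t *bucket = getbucket(key);	// Fibonacci hash of the key
		u32int cacheindex;
		int i;

		lock(&bucket->Lock);
		for(i = 0; i < NENTRIES; i += 1){
			if(bucket->key[i] == key){
				// already present: refresh the cached data in place
				memcpy(&data[bucket->value[i] * BLOCKSIZE], block, BLOCKSIZE);
				unlock(&bucket->Lock);
				return;
			}
			if(bucket->key[i] == 0)
				break;	// empty slot: claim it below
		}
		if(i == NENTRIES)
			sysfatal("bucket full: linear probing not shown here");
		cacheindex = grabfree();
		memcpy(&data[cacheindex * BLOCKSIZE], block, BLOCKSIZE);
		bucket->key[i] = key;
		bucket->value[i] = cacheindex;
		unlock(&bucket->Lock);
	}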

In order to be able to test this, I need to adapt the disk reading functions to operate in blocks - which they should already be doing, but were not since I misunderstood venti's operation due to rushing through things.

Currently, to read blocks, we allocate space for blocks, read the data into them, then grab the data we want and store it in the destination buffer. Ideally, we'd have the cache lookup in there, such that we either grab the data from cache or reserve a cache entry, read directly into the cache entries, and then extract from the cache entry into the dbuf.

For now, going to just extract the block reading, and have it insert into cache...

That was easy enough.

Going to use the zero index to indicate an absent entry; the first block in a given partition never has important data anyways :)

Quickly changing the API to match intended usage, as well. cachemapinsert should take the fd, the block index, and a cache entry. We should then have a cachegrab(fd, index) that yields a buffer and an indicator about whether it is free or needs to be read into. Usage would look like:

char *buf;
if(cachegrab(&buf, fd, index)){
	// data is in buf
} else {
	// read data into buf, and call cachemapinsert(fd, index, buf)
}

Ah. Right. Cache insertion needs not the buffer but the _index_. Could do math to get it back but that seems silly. So... cachegrab should yield a cache index?

Also, it'd be nice to be able to use the memory directly within the cache without needing to copy to a destination buffer. Since some chunks can cross blocks, that'd require a way to use multiple sequential blocks in the cache at once, and to ensure they will not be evicted before we're done with them.

The latter is important anyways, and a refcount per entry is likely the way to go. The former requirement is a lot harder. And, to e.g. handle a read request, we'll need to copy from the cache into the destination buffer anyways.

So, easier to design the interface such that you can grab a block, read it straight into the cache, then do a _single_ copy into the destination buffer, and repeat for other needed blocks. Oh! And cachegrab can also set up the cache entry, so that block reading could look something like:

u32int cacheindex;
if(cachelookup(&cacheindex, fd, blockindex))
	return cacheindex;
pread(fd, cachebuf(cacheindex), ...;
return cacheindex;

And that's it. Cachelookup can, on a miss, grab a spot from the free list / trigger an eviction if needed. The caller can then cacheunref(cacheindex) when done with the entry.

The one downside of reading straight into cache is that under extreme cache pressure, reads will become blocked by the rate of cache evictions - that is, if the cache is so small that the set of active reads cannot fit into it, there will be a major limit to throughput. At sizes just above that, the cache's utility will be negligible as most entries will be evicted quickly.

The upside is that it reduces memory requirements, and it ought to be trivial to ensure that never happens. 1GiB is enough for >100K entries, which is far more than sufficient for purpose. Even just a few megs ought to be plenty to avoid total thrashing.

We also need write locking on reads into the cache. Given two concurrent reads to the same data, one will read from disk into cache, both should yield the same cache index, and the memory-reader must block on the disk-reader to finish.

A spin lock, held when yielding only to writers, should be sufficient, and require just eight bytes per entry - 1MiB, given a 1GiB cache. On the other hand, this is bumping the complexity / cost of avoiding copying, so it might be better to just...not? Holding the bucket lock while inserting would be... maybe fine, actually?

u32int cacheindex;
if(cachelookup(&cacheindex, fd, blockindex))
	return cachebuf(cacheindex);
pread(fd, cachebuf(cacheindex), ...;
// unlocks the bucket
cachewritten(fd, blockindex);
return cachebuf(cacheindex);

...except, we no longer need the cache index, since we no longer are responsible for inserting it!

char *buf;
if(!cachelookup(&buf, fd, index)){
	pread(fd, buf, ...);
	cachewritten(fd, index);
}
return buf;

We don't even need any sync on the other reader, since one of the parallel cachelookup calls will block on the other thread reading the data :)

We still need a way to track buffer references, though. We can stuff a secret eight bytes before each buffer, and then just atomically increment it? For now, lacking atomics, we can toss a lock and adjust it under the lock. Entries can only be reclaimed when they have no active readers.

Ooh, one option for tracking readers without too much locking is to have a list of items being read. It could be designed to fit pretty compactly, and should never be huge, since we don't expect to have more than a dozen active readers at once. Might be easier to manage.

...short term, can just lock the bucket during a read, too? That doesn't help for evictions, though? Ah, sure it does; pick a random bucket, try to lock it, advance until we find a bucket we can lock, evict its first entry >_> Suboptimal, but might work.

and, having cachedone require the fd and block index instead of the buffer is not great.

With the cache disabled as follows, I measured time to walk the fossil root I've been using for testing.

char*
vtreadarenablock(VtArena *arena, u32int blockindex)
{
	char *buf = malloc(8192);
//	if(!cachelookup(&buf, arena->fd, blockindex)){
		if(pread(arena->fd, buf, arena->blocksize, arena->base+(blockindex*arena->blocksize)) != arena->blocksize)
			return nil;
	//}
	return buf;
}

It took 29.96s.

With the cache enabled, I get a compression error >_>

Lots of pain and a month later, figured out the issue (the pain wasn't related to neoventi, mostly, just depression and grief. YAYYY!)

Short story: the assumptions were wrong. The fd was useless as part of the key, because all the arenas are in the same file, and the block index is _arena-local_. Drop the fd, use a bespoke cache-specific arena index alongside the block index, and voila, the key goes from five bytes to four bytes and everything works!

(With cachedone() function modified to linearly probe.)

Bonus, since the value is three bytes, this takes entry size from eight to seven bytes; the 56 bytes are now enough for eight entries instead of seven :D

(and, if the lock is reduced from eight bytes to one, we can trivially add a ninth entry per cache line!)

This reduces the _cold_ time for walk /n/vac from ~30s to ~23s, and warm time from ~30s to ~19s. Profiling indicates that >70% of time is still spent in read() - including on the warm run when all data is in cache! - which heavily implies that the index access is the current bottleneck. So, we need to stuff the index entries into the disk cache as well as a first step.

First idea is to give index sections a cache index as with arenas - there's definitely room in the address space for that! - and then use the index bucket as the block index (which it is!)

Problem with that is that, unlike with arenas, where 16 bits is enough for the block index, a typical venti has a single large index section, not many small ones, and so we can not fit the bucket into 16 bits.

A few approaches are obvious:

- Split the index section on disk. Adjust the formatter to default to generating many smaller sections.
	- This requires regenerating the index, and breaks compatibility. hard pass.
- _Fake_ splitting the index section. Generate many index sections in memory that all correspond to the same real index section, each with a subset of buckets.
	- This results in roughly one index section per 10GiB of storage. For 32TiB, as we specced out for the maximum storage in arenas that can fit in the cache as is, we'd need ~3200 index sections. This won't pollute the cachekey space at all. (For comparison, for arenas, we definitionally need ~1 entry per 512MiB!)
	- This can be approximated by faking the key space. Each index section is given a _base_ cache index, and its range is calculated (so that future sections can be added with a higher base). For my system, I'd need ~75 fake index sections.
	- This can be accomplished by splitting the bucket into the lower sixteen bits (simulated block index), and then at _most_ 12 bits are needed for the index base offset _and that's if 32TiB are represented in the singular index section!_
	- This seems like a simple enough solution.

Did that, and now it's 44s cold and 27s hot. No access to disk is needed on subsequent walks, but it's still slower. Going to do a dozen warm runs and then check the profile...

Oh! and that's 44s/27s _with profiling enabled!!!_ That's kinda bullshit.

regardless, down to 41s/24s by reducing how long locks are held. Need to revisit the cache locking, because this is still not great. ...it also implies there _is_ some lock contention, somehow??? Or, perhaps, it's the lack of linear probing after getting a cache entry (prevents some cache misses and computation).

35.5% in read(), with 1.2M calls?? That's odd, there shouldn't be that many, I think?
27.5% in cachelookup, 15% in lock, 7% in unlock, and 6% in cachedone. That's ~55% of cache lookup overhead. 3.2 _billion_ calls to lock and unlock.

Ah right. those read calls are network related :P They're 1/3rd of execution time, so might be worth looking into fixing that up (/ pipelining: start reading next request while handling current one! TODO!), but for now, the cache problem is much more severe.

With locking outright removed (We're single threaded for now anyways!!), cold is down to 28s _with profiling!_, and 11.5 warm _WITH PROFILING_!!!

Down to 27.15s/10.8s by removing some hot allocation.

Venti can do it in ~8s warm. Can we beat that without parallelizing more? ...

74.6% of time is spent on read. A bit of that was the from-disk, but most is network; let's call it ~70% network overhead. 15% is the cachelookup function. 8% is strcmp+arenafromindex. 1% is processing index entries from the bucket; adding an index cache would have minimal effect on this workload, at least. .6% is decompression. Everything else is trivial.

strcmp can probably be avoided. What are we using that for that there's 1.2B calls to it in the profile??

Mapping arenas from name to index, apparently. That's silly. We should just have the arena map entry contain the VtArena*. Then we only need to do the indexing at startup.
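
i.e. something along these lines (the entry type is a sketch, not the current code):

	typedef struct {
		char name[64];
		u64int start;
		u64int stop;
		VtArena *arena;	// resolved once at startup; no strcmp per lookup
	} MapEntry;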

That brings us down to 24s cold (WITH PROFILING) and 7.74s warm! venti is BEAT!

Without profiling, 24.4s cold, and... 8s warm???

the hell?

meh. close enough. It's plausible that the profiling is actually faster here for some reason lol

Interestingly, the profile has changed a bit: 62% read, 31% cachelookup, 3.3% vtreadlookup, 1.6% decompression, and then a long tail of who cares.

Going to split cachelookup into pieces so I can see what's actually taking time. Lack of inlining means that might actually make things a tad slower, but the code will be cleaner so I don't really care.

static bucket_t*
getbucket(u32int key)
{
	// Fibonacci hashing!
	return &buckets[key * 11400714819323198485LL >> BITS];
}

static void
put(bucket_t *bucket, int i, u32int key, u32int cacheindex)
{
	bucket->index[i] = key;
	bucket->values[i*3] = cacheindex&0xFF;
	bucket->values[i*3+1] = (cacheindex>>8)&0xFF;
	bucket->values[i*3+2] = cacheindex>>16;
}

// Looks up a cache entry. If the entry is not present in cache,
// this grabs a free buffer, inserts it into cache, and yields
// its index so that the caller may fill it up.
// Returns whether the data was already in cache.
int
cachelookup(char **buf, u16int arenaindex, u32int blockindex)
{
	u32int cacheindex;
	u32int key = (arenaindex << 16) | blockindex;
	bucket_t *bucket = getbucket(key);
	int existed;
	while(bucket != bucketend){
		for(int i = 0; i < NENTRIES; i += 1){
			if(bucket->index[i] == 0){
				// Key does not exist
				cacheindex = grabfree();
				*buf = &data[cacheindex*8192];
				put(bucket, i, key, cacheindex);
				return 0;
			}
			if(bucket->index[i] == key){
				u32int cacheindex = bucket->values[i*3] | (bucket->values[i*3+1]<<8) | (bucket->values[i*3+2]<<16);
				*buf = &data[cacheindex * 8192];
				return 1;
			}
		}
		bucket++;
	}
//FIXME not unreachable
//if the final bucket is full, we must handle it
	sysfatal("unreachable");
	return 0;
}

 0.0    0.028  1102923	getbucket
 0.0    0.000    41077	put
14.5   70.896  1102923	cachelookup

Iiiiinteresting. Picking the bucket and inserting into it are trivial. Let's split this further...


static void
get(bucket_t *bucket, int i, char **buf)
{
	u32int cacheindex = bucket->values[i*3] | (bucket->values[i*3+1]<<8) | (bucket->values[i*3+2]<<16);
	*buf = &data[cacheindex * 8192];
}

static int
trybucket(bucket_t *bucket, int *i, u32int key)
{
	for(*i = 0; *i < NENTRIES; *i += 1){
		if(bucket->index[*i] == key)
			return 1;
		if(bucket->index[*i] == 0)
			return 0;
	}
	return -1;
}

// Looks up a cache entry. If the entry is not present in cache,
// this grabs a free buffer, inserts it into cache, and yields
// its index so that the caller may fill it up.
// Returns whether the data was already in cache.
int
cachelookup(char **buf, u16int arenaindex, u32int blockindex)
{
	u32int cacheindex;
	int i;
	u32int key = (arenaindex << 16) | blockindex;
	bucket_t *bucket = getbucket(key);
	while(bucket != bucketend){
		switch(trybucket(bucket, &i, key)){
		case -1:
			// miss. linear probe.
			bucket++;
			continue;
		case 0:
			// found an empty slot
			cacheindex = grabfree();
			*buf = &data[cacheindex*8192];
			put(bucket, i, key, cacheindex);
			return 0;
		case 1:
			// already exists
			get(bucket, i, buf);
			return 1;
		}
	}
//FIXME not unreachable
//if the final bucket is full, we must handle it
	sysfatal("unreachable");
	return 0;
}

I'll be surprised if this isn't noticeably slower again.

Ha, yeah. 32.1s cold, 15.75 warm. Yikes. Knew it would be slower but 2x slower on the warm path? should be a good profile, at least.

45.7  224.300  1368400	read
31.2  153.185 3565574884	trybucket
19.4   95.175    1427308	cachelookup
 1.6    7.741   684185	vtreadlookup

...oh. This is probably the weakness of linear probing with insufficient entries; there's an average of 3000 bucket lookups needed to find a good slot. Oops.

Also, BITS is set wrong. We're able to use all buckets, but only really _trying_ to use the first 255, because we're shifting over by 56. We have 64K entries, and we were reserving entries/4 buckets (since we had seven entries per bucket, and /7 would not give power of two bucket count!), so we should have had that set to 50.

Changing that gives cold time of 20.7s even with the patches! Warm time is 4.39s.

If I set entries to 0x40000, things break. That should be ~2GiB of memory; I'd think it was malloc related but I'm not using malloc, and the issue occurs later. Invalid address in a pread syscall...

The buffer 0x7fffdfff is apparently invalid, for some reason? 

data is at 10ded90, ends at 410ded90. Regardless of whether it's valid, it shouldn't actually be returned anywhere. Weird...

...max index into data should be 0x40000 * 0x2000 is 2GiB. Are array indices of type int? If so, it might be wrapping >_>

Ahhhhh, sbrk is failing to allocate 2GiB?

Right. The _sbrk_ parameter was an int >_> Made the size computation use (usize)0x2000 and now it works.

Also, with the cache fixed so that linear probing is less bad:

90.0  187.229  1368400	read
 4.7    9.844   684185	vtreadlookup
 1.8    3.760   664759	unwhack
 1.6    3.338   743123	vtreadarenablock
 0.3    0.717        1	listen
 0.3    0.673     1416	loadarena
 0.2    0.377   766800	memcpy
 0.1    0.280   684185	aindexfromaddr
 0.1    0.266  1427308	cachelookup
 0.1    0.148   684185	bucketlookup
 0.1    0.143   684185	vtread
 0.1    0.141   684187	vtrecv
 0.1    0.133  1572426	memcmp
 0.1    0.127   684185	vtreadarena

Next up is to reduce the number of read()s. Can do this by reading multiple packets per call (read into a 64K buffer, track its length, grab packets from it, rehydrate as needed). Might be able to reduce that 187s to closer to 5s. There were more than 20 runs in that profile, so the result would likely be a total runtime of ~1s.

Nooope. reading from the TCP file will only give us a packet at a time.

This will at best reduce total time by ~45%, because it combines the 2-byte read with the read of the rest of the packet. Still, not bad.
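
Something like this, then (a sketch; it assumes each read() on the TCP data file returns at most one message, and only tops up on a short read):

	// One venti message: a 2-byte big-endian length prefix (excluding itself),
	// then the body.
	static int
	readpacket(int fd, u8int *buf, int nbuf)
	{
		int n, m, len;

		n = read(fd, buf, nbuf);
		if(n < 2)
			return -1;
		len = (buf[0] << 8) | buf[1];
		if(len + 2 > nbuf)
			return -1;	// message too large for the buffer
		while(n < len + 2){
			m = read(fd, buf + n, len + 2 - n);
			if(m <= 0)
				return -1;
			n += m;
		}
		return len;
	}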

Okay, lol. From 23.68s cold and 4.45s warm to 20.69s cold and 4.35s warm. Time to grab another profile :/

...the profile is treating time read() spends _blocked_ as time spent reading. Oi. Going to run neoventi and then run this as quickly as possible:

mk vac; i=();while(! ~ $#i 50){time walk /n/vac > /dev/null; i=($i ())} ; slay vacfs | rc

Should take around 4-5 minutes, and the profile might be a tad more accurate.

92.7 1415.727 11089337	read

We're blocked on the client :P Most of the runtime is waiting for vacfs to get back to us :D

What the hell is wrong with the config??

It should be the last Maxconfig bytes of the first PartBlank!

readifile() →  readpart(PartBlank-Maxconfig, Maxconfig) → pread(fd, buf, Maxconfig, PartBlank-Maxconfig)
dat.h:40: 	PartBlank		= 256*1024,	/* untouched section at beginning of partition */
	Maxconfig = 8 * 1024,

% venti/conf /dev/jasnah/arenas 
index main
arenas /dev/jasnah/arenas
isect /dev/jasnah/isect
mem 256m
bcmem 256m
icmem 512m
httpaddr tcp!*!13012
addr tcp!*!13011

% dd -if /dev/jasnah/arenas -bs 8192 -count 1 -skip 248
[garbage]

% dd -if /dev/jasnah/arenas -bs 8192 -count 1 -iseek 248
[garbage]

% dd -if /dev/jasnah/arenas -bs 1 -count 8192 -skip 253952
[garbage]

I'm preeeetty sure I'm using dd right >_<

Do I have the address wrong??

...oh wait. oops. 248KiB in. not 248*8K in. dummy.

% dd -if /dev/jasnah/arenas -bs 1024 -count 8 -skip 248

that works. lol.

Okay, config parsing is good. Let's fix multiclient support.

With a foreground proc for handling, I - oh. This is why testing is important! neoventi shuts down on disconnect >_<

Ohhhhh on vthangup we're bouncing() with 2 as the parameter, which triggers exits(). That was easy lmao.

annnnd with that fixed it crashes on hangup with one thread, too!

7.neoventi 55793: suicide: sys: trap: misaligned pc pc=0x5
mk: ./7.neoventi -c /dev/jasnah/arenas  : exit status=7.neoventi 55793:

... huh.

waiting for packet...Received 0 bytes!
hanging up: unknown control request

unknown control request is not a neoventi error, that's a driver error. Preeeetty sure it's just from the connection dropping, though. Probably. Maybe not.

The trampoline appears to be going to the wrong place... the heck?

inittrampoline calls setjmp(). The longjmp is correctly returning to that setjmp, but the return in inittrampoline is not returning to handleproc a second time??

          Longjmp restores the environment saved by the last call of
          setjmp. It then causes execution to continue as if the call
          of setjmp had just returned with value val. The invoker of
          setjmp must not itself have returned in the interim.  All
          accessible data have values as of the time longjmp was
          called.
The invoker of
          setjmp must not itself have returned in the interim.

ohhhh dumb. duuumb.
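Concretely, the broken shape was roughly this (illustrative sketch only, not the real code; the actual diff shows up further down). setjmp's environment is only valid while the function that called setjmp is still on the stack:

static int
inittrampoline(VtConn *conn)
{
	if(setjmp(conn->bounce) == 0)
		return 1;	/* this return is what invalidates conn->bounce */
	return 0;
}

static void
handleproc(VtConn *conn)
{
	if(!inittrampoline(conn))
		return;
	/* any later longjmp(conn->bounce, n) now targets a dead stack frame;
	 * the setjmp has to happen directly in this function instead */
}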

manually inlining inittrampoline() fixed that. Multithread??

Yep!!! :D

Technically works. Ish. So long as only one actually does anything at a time >_<

Cache not locking right, maybe?

I don't think I ever restored bucket locking...

Just going to lock around puts, since there's no eviction yet... annnnnd it hangs. lol.

> locking bucket...locked.
> bucket put, unlocking...done

The only line that locks a bucket has that print before it, so... this isn't hanging on lock??

hanging in pread() after grabbing the block from the cache???? what the hell!

yeah, print() debugging confirms.
locking bucket...locked.
bucket put, unlocking...done
reading data into cache...

the hell?!

Okay, reverted and that's still happening, so... um. What??

Fully reverted everything and it's still going on. Confusion.

Rebooting is a _really_ stupid potential "solution." This is an issue.

I can read from the file using e.g. dd, so this _probably_ isn't e.g. a kernel issue where something is locked because i unplugged the RNDIS device while it was in use :P

eughhhh

tangentially, I need to make my acme scripts clean up when acme dies >_< makes looking through the process list much more annoying than it really needs to be.

Good news is that it's still broken after reboot. Did something get accidentally added to a commit and break it? :/

8e31146d45d0612b82a5438012e6ed03be3995a9 works. cool, let's bisect this.
cdc922649bc14a2e25abd4593b244f36fffff1d0 does not. Okay, did ca929 break it?

Nope, that one works. Um. cdc922649bc14a2e25abd4593b244f36fffff1d0 broke it??

git/export cdc922649bc14a2e25abd4593b244f36fffff1d0
From: Noam Preil <noam@pixelhero.dev>
Date: Thu, 04 Jul 2024 21:07:18 +0000
Subject: [PATCH] server: handle each client in a separate proc

---
diff ca929c334fedc59fc1240b2c64b887082ef331f2 cdc922649bc14a2e25abd4593b244f36fffff1d0
--- a/notebook
+++ b/notebook
@@ -1155,3 +1155,46 @@
 % dd -if /dev/jasnah/arenas -bs 1024 -count 8 -skip 248
 
 that works. lol.
+
+Okay, config parsing is good. Let's fix multiclient support.
+
+With a foreground proc for handling, I - oh. This is why testing is important! neoventi shuts down on disconnect >_<
+
+Ohhhhh on vthangup we're bouncing() with 2 as the parameter, which triggers exits(). That was easy lmao.
+
+annnnd with that fixed it crashes on hangup with one thread, too!
+
+7.neoventi 55793: suicide: sys: trap: misaligned pc pc=0x5
+mk: ./7.neoventi -c /dev/jasnah/arenas  : exit status=7.neoventi 55793:
+
+... huh.
+
+waiting for packet...Received 0 bytes!
+hanging up: unknown control request
+
+unknown control request is not a neoventi error, that's a driver error. Preeeetty sure it's just from the connection dropping, though. Probably. Maybe not.
+
+The trampoline appears to be going to the wrong place... the heck?
+
+inittrampoline calls setjmp(). The longjmp is correctly returning to that setjmp, but the return in inittrampoline is not returning to handleproc a second time??
+
+          Longjmp restores the environment saved by the last call of
+          setjmp. It then causes execution to continue as if the call
+          of setjmp had just returned with value val. The invoker of
+          setjmp must not itself have returned in the interim.  All
+          accessible data have values as of the time longjmp was
+          called.
+The invoker of
+          setjmp must not itself have returned in the interim.
+
+ohhhh dumb. duuumb.
+
+manually inlining initrampoline() fixed that. Multithread??
+
+Yep!!! :D
+
+Technically works. Ish. So long as only one actually does anything at a time >_<
+
+Cache not locking right, maybe?
+
+Yep!
--- a/server.c
+++ b/server.c
@@ -17,6 +17,7 @@
 	msg = vsmprint(msg, args);
 	werrstr(msg);
 	free(msg);
+	fprint(2, "error: %r\n");
 	va_end(args);
 	if(tbuf != nil){
 		len = snprint(tbuf+6, 0x10000, "neoventi: %r");
@@ -171,38 +172,31 @@
 	}
 }
 
-static int
-inittrampoline(VtConn *conn)
+static void
+handleproc(void *fd)
 {
-	switch(setjmp(conn->bounce)){
+	char buf[MaxPacketSize];
+	VtConn conn;
+	conn.fd = (int)(usize)fd;
+	switch(setjmp(conn.bounce)){
 	case 0:
-		return 1;
+		vthello(conn, buf);
+		vtloop(conn, buf);
+		abort();
 	case 1:
 		fprint(2, "abandoning client: %r\n");
+		break;
 	case 2:
-		exits(nil);
-		return 0;
+		fprint(2, "hanging up: %r\n");
+		break;
 	default:
 		fprint(2, "internal error: unexpected bounce code\n");
-		return 0;
+		break;
 	}
+	close(conn.fd);
 }
 
 static void
-handleproc(void *fd)
-{
-	char buf[MaxPacketSize];
-	VtConn conn;
-	conn.fd = (int)(usize)fd;
-	if(!inittrampoline(&conn)){
-		close(conn.fd);
-		return;
-	}
-	vthello(conn, buf);
-	vtloop(conn, buf);
-}
-
-static void
 handle(int ctl, char *dir)
 {
 	int fd = accept(ctl, dir);
@@ -210,8 +204,7 @@
 		fprint(2, "failed to accept connection\n");
 		return;
 	}
-	handleproc((void*)fd);
-//	proccreate(handleproc, (void*)fd, mainstacksize);
+	proccreate(handleproc, (void*)fd, mainstacksize);
 }
 
 void


but, but that's just what it's supposed to be?!?!

Dammit. Isolate this into smaller chunks :(

% git/branch front
% git/revert -c ca929c334fedc59fc1240b2c64b887082ef331f2 server.c
% mk all; mk fg
% mk vac

Works. Okay. Um.

% git/revert server.c
% mk all; mk fg
% mk vac

Hangs. Yep. That confirms it.

-	handleproc((void*)fd);
-//	proccreate(handleproc, (void*)fd, mainstacksize);
+//	handleproc((void*)fd);
+	proccreate(handleproc, (void*)fd, mainstacksize);
 }

That does it??? really?? but it worked fine!!

Is it because I'm passing the conn - and thus the jmp_buf - around on the stack?

Ohhh my gosh it's the stack size >_<

I had it bumped to see if that fixed a different issue, then reverted the bump when committing >_<

It needs to be bumped, though. Which makes sense; we've got two packet-sized buffers, and those are 64K each, so 128K isn't big enough. Let's bump mainstacksize to, I dunno, 256K should be plenty.

Yep, that's good. Cool!

Though, if I run two vacfses of the same score, start one of them walking /n/vac, give it a minute, then start the other doing so, it works fine until the second one catches up (which of course it does because our cache ACTUALLY WORKS *cough* venti, take notes *cough*), at which point it crashes as expected due to the total lack of protection on the buckets.

but now i can test that again!!

walk: /n/vac/active/.git/objects/86: read asked for 284e1b75fc49962be625ee7b0d85237a2201d682 got 6355d846c98f941707c

...at least it didn't crash this time? :/

panic: D2B called on non-block 81680df4 (double-free?)

neoventi did, though, eventually >_<

but... only sorta???

one of the vacfses is still working. One of the procs panicked though???? What the HELL?

ahhh, one of the procs handling connections crashed, but its panic _does not affect_ the other procs for some reason?
% acid 8503
/proc/8503/text:arm64 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/arm64
acid: lstk()
abort()+0xc /sys/src/libc/9sys/abort.c:6
ppanic(fmt=0x362b1)+0x124 /sys/src/libc/port/malloc.c:162
	pv=0x300f0
	msg=0x368d0
	v=0x81680d28
	n=0x81680d280000002f
D2B(v=0x81680df4)+0x70 /sys/src/libc/port/pool.c:926
	a=0x81680dec
poolfreel(v=0x81680df4,p=0x34f40)+0x20 /sys/src/libc/port/pool.c:1152
	ab=0x300f0
poolfree(p=0x34f40,v=0x81680df4)+0x38 /sys/src/libc/port/pool.c:1287
free()+0x1c /sys/src/libc/port/malloc.c:250
readclump(dst=0x816a0f2c,addr=0x145880)+0xe4 /usr/glenda/src/disk/neoventi/disk.c:170
	buf=0x1292060d67435ec9
	size=0x10e74000012a6
vtread(buf=0x816a0f28,conn=0xfefefefe00000009)+0x98 /usr/glenda/src/disk/neoventi/server.c:90
	addr=0x145880
vtconnhandle(buf=0x816a0f28,conn=0xfefefefe00000009)+0xf8 /usr/glenda/src/disk/neoventi/server.c:100
vtloop(conn=0xfefefefe00000009,packetbuf=0x816a0f28)+0x1dc /usr/glenda/src/disk/neoventi/server.c:161
	sz=0xfefe0000fefe001c
	buf=0xec310706000c1a00
	offset=0xc1a00fefe0000
	len=0xfefe001cfefe001a
handleproc()+0x2d0 /usr/glenda/src/disk/neoventi/server.c:183
	conn=0xfefefefe00000009
	buf=0x7afc5656000c1a00
launcherarm64(arg=0x9,f=0x11298)+0xc /sys/src/libthread/arm64.c:8
launcherarm64(arg=0x9,f=0x11298)+0xfffffffffffffff8 /sys/src/libthread/channel.c:583
abort /sys/src/libc/9sys/abort.c:4


Oh my _gosh_. That free should not exist, that's a static buffer!!

Decompression failed due to some race in the cache, I think :/

sbrk is shared between the procs, it has to be or none of the reads would succeed.

Ohhh, right, inserting the buffer into cache doesn't mean we've read into it yet, need to implement the cacheunlock API properly.
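Roughly the usage that implies (hand-written sketch, not a paste from the actual code; signatures inferred from cachelookup/cacheunlock):

/* On a miss, cachelookup hands back a buffer whose bucket stays locked
 * until the caller has filled it and called cacheunlock. */
static void
readcachedblock(int fd, u16int arenaindex, u32int blockindex, vlong addr, uchar **buf)
{
	if(cachelookup(arenaindex, blockindex, buf) == 0){
		/* miss: the slot is ours and still locked; fill it, then release */
		if(pread(fd, *buf, 8192, addr) != 8192)
			sysfatal("read failed: %r");
		cacheunlock(arenaindex, blockindex);
	}
	/* on a hit, the data was already there */
}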

Okay, two clients both walking at once annnnnd seems good! :D

added two more vacfses and it seems stable :)

though... one of the first two is hanging. um.

cool. Did a fresh test with three vacfses and two are hanging. drugs to the rescue!

% ps | grep vacfs
glenda         6575    0:00   0:00     2944K Pread    vacfs
glenda         7024    0:05   0:01     3008K Pread    vacfs
glenda         7045    0:05   0:01     3008K Pread    vacfs
glenda         7597    0:06   0:02     3008K Pread    vacfs
glenda         8492    0:15   0:05     3008K Pread    vacfs
glenda         8504    0:00   0:00     3008K Pread    vacfs
glenda        10666    0:06   0:02     3008K Pread    vacfs
glenda        10697    0:10   0:03     3008K Pread    vacfs
glenda        11284    0:03   0:01     3008K Pread    vacfs
glenda        11331    0:03   0:01     3008K Pread    vacfs
glenda        12268    0:05   0:02     3008K Pread    vacfs
glenda        12286    0:05   0:02     3008K Pread    vacfs
glenda        12319    0:06   0:02     3008K Pread    vacfs

Three most recent are probably the way to go.

/proc/12286/text:arm64 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/arm64
acid: lstk()
pread()+0xc /sys/src/libc/9syscall/pread.s:6
read(buf=0x5be7e,n=0x4000000002)+0x18 /sys/src/libc/9sys/read.c:7
_vtrecv(z=0x5aa78)+0x74 /sys/src/libventi/send.c:82
	p=0x5b860
	size=0x5b86000000000
	b=0x5be7e
	n=0x5be7e00000000
	buf=0x5aa7800000000
	len=0x2a16c
vtrecv(z=0x5aa78)+0x8c /sys/src/libventi/send.c:209

Cool, blocked on neoventi. That makes sense. Ish.

Neoventi?

glenda        12250    0:00   0:01   397696K Open     7.neoventi
glenda        12267    0:01   0:14   397696K Sleep    7.neoventi
glenda        12285    0:01   0:10   397696K Sleep    7.neoventi
glenda        12318    0:01   0:14   397696K Pread    7.neoventi

Probably the two that are asleep?

sleep()+0xc /sys/src/libc/9syscall/sleep.s:6
lock(lk=0x810a0440)+0xb0 /sys/src/libc/arm64/lock.c:28
	i=0x11a58000003e8
cachelookup(arenaindex=0x6be,blockindex=0x1938,buf=0x81690df8)+0x3c /usr/glenda/src/disk/neoventi/cache.c:128
	key=0x806be1938
	tried=0x1
	bucket=0x810a0440

Sure seems that way!!

I'm not grabbing the arena block while holding an index lock, am I? ...even if I were, that wouldn't double-lock.

That's from vtreadlookup. Let's check the other proc?

vtreadlookup(score=0x81660ec4,addr=0x81650de0)+0x94 /usr/glenda/src/disk/neoventi/disk.c:70
	bucket=0x321938

Both are lookups of that bucket.

...oh, shoot, did it linearly probe to a different bucket? I need to linearly probe on unlock, too, to make sure I'm unlocking the correct bucket.

and also unlock() should error if already unlocked, no??

void
unlock(Lock *lk)
{
	lk->val = 0;
}

NO?!

Fuck that, I'm making it do so on my branch.

--- a/sys/src/libc/port/lock.c
+++ b/sys/src/libc/port/lock.c
@@ -35,5 +35,7 @@
 void
 unlock(Lock *lk)
 {
+	if(canlock(lk))
+		sysfatal("double-unlock!");
 	lk->val = 0;
 }

Should be easy to test, if my hypothesis about the neoventi bug is correct anyways.

...okay, so that didn't get triggered, though two are hanging. hmmm. Acid time :/

glenda        15888    0:01   0:09   397696K Sleep    7.neoventi
glenda        15908    0:01   0:13   397696K Sleep    7.neoventi

/proc/15888/text:arm64 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/arm64
acid: sleep()+0xc /sys/src/libc/9syscall/sleep.s:6
lock(lk=0x810a0440)+0x8c /sys/src/libc/arm64/lock.c:24
	i=0x11a58000002f5
cachelookup(arenaindex=0x6be,blockindex=0x1938,buf=0x81650d90)+0x3c /usr/glenda/src/disk/neoventi/cache.c:128
	key=0x806be1938
	tried=0x1
	bucket=0x810a0440
	i=0x8
vtreadlookup(score=0x81660ec4,addr=0x81650de0)+0x94 /usr/glenda/src/disk/neoventi/disk.c:70
	bucket=0x321938
	s_sect=0xf1df8
	buf=0x44738
	key=0xf1df8000006be
	entry=0x4473800000000
vtread(buf=0x81660ec0,conn=0xfefefefe00000008)+0x14 /usr/glenda/src/disk/neoventi/server.c:88
	addr=0x8
vtconnhandle(buf=0x81660ec0,conn=0xfefefefe00000008)+0xf8 /usr/glenda/src/disk/neoventi/server.c:100
vtloop(conn=0xfefefefe00000008,packetbuf=0x81660ec0)+0x1dc /usr/glenda/src/disk/neoventi/server.c:161
	sz=0xfefe0000fefe001c
	buf=0x78a7bfe5000c1a00
	offset=0xc1a00fefe0000
	len=0xfefe001cfefe001a
handleproc()+0x2d0 /usr/glenda/src/disk/neoventi/server.c:183
	conn=0xfefefefe00000008
	buf=0x78a7bfe5000c1a00
launcherarm64(arg=0x8,f=0x11298)+0xc /sys/src/libthread/arm64.c:8
launcherarm64(arg=0x8,f=0x11298)+0xfffffffffffffff8 /sys/src/libthread/channel.c:583
lock+0x8c /sys/src/libc/arm64/lock.c:24
acid: 

/proc/15908/text:arm64 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/arm64
acid: sleep()+0xc /sys/src/libc/9syscall/sleep.s:6
lock(lk=0x810a0440)+0x8c /sys/src/libc/arm64/lock.c:24
	i=0x11a58000003a7
cachelookup(arenaindex=0x6be,blockindex=0x1938,buf=0x81690df8)+0x3c /usr/glenda/src/disk/neoventi/cache.c:128
	key=0x806be1938
	tried=0x1
	bucket=0x810a0440
	i=0x8
vtreadlookup(score=0x816a0f2c,addr=0x81690e48)+0x94 /usr/glenda/src/disk/neoventi/disk.c:70
	bucket=0x321938
	s_sect=0xf1df8
	buf=0x45c78
	key=0xf1df8000006be
	entry=0x45c7800000000
vtread(buf=0x816a0f28,conn=0xfefefefe00000009)+0x14 /usr/glenda/src/disk/neoventi/server.c:88
	addr=0x9
vtconnhandle(buf=0x816a0f28,conn=0xfefefefe00000009)+0xf8 /usr/glenda/src/disk/neoventi/server.c:100
vtloop(conn=0xfefefefe00000009,packetbuf=0x816a0f28)+0x1dc /usr/glenda/src/disk/neoventi/server.c:161
	sz=0xfefe0000fefe001c
	buf=0x78a7bfe5000c1a00
	offset=0xc1a00fefe0000
	len=0xfefe001cfefe001a
handleproc()+0x2d0 /usr/glenda/src/disk/neoventi/server.c:183
	conn=0xfefefefe00000009
	buf=0x78a7bfe5000c1a00
launcherarm64(arg=0x9,f=0x11298)+0xc /sys/src/libthread/arm64.c:8
launcherarm64(arg=0x9,f=0x11298)+0xfffffffffffffff8 /sys/src/libthread/channel.c:583
lock+0x8c /sys/src/libc/arm64/lock.c:24
acid: 
echo kill > /proc/15908/ctl

Okay, so both are on bucket 0x810a0440, after a linear probe. I _know_ I'm not unlocking the correct bucket. I assumed maybe this was racing the other proc and had unlocked a bucket locked by the other, though that's laughable in hindsight.

Probably either didn't apply the libc patch properly or that patch doesn't work...

Tested the patch on its own and it's not working?? da heck?

...waaaait a sec is there an asm impl?

arm64/lock.c. ah.

yep, now that patch works :P

So now this should just easily confirm that suspicion.

annnnd I crashed acme by using up all of RAM lol. had 800MiB of acme buffers >_< didn't lose anything other than probably a few lines in here. I should remember to turn autosave on.

Annnd now to retest :P

./7.neoventi: double unlock

yep, as expected :)

probably shouldn't sysfatal. abort would be better so acid works.

double-unlock!
7.neoventi 24985: suicide: sys: trap: fault read addr=0x0 pc=0x13fe4

<lstk 24985
/proc/24985/text:arm64 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/arm64
acid: abort()+0xc /sys/src/libc/9sys/abort.c:6
unlock(lk=0x810a0440)+0x20 /sys/src/libc/arm64/lock.c:41
cacheunlock(blockindex=0x1938)+0x18 /usr/glenda/src/disk/neoventi/cache.c:111
vtreadlookup(score=0x81660ed4,addr=0x81650df0)+0x128 /usr/glenda/src/disk/neoventi/disk.c:76
	bucket=0x321938
	s_sect=0xf1e08
	buf=0x6d2f2e90
	key=0xf1e08000006be
	entry=0x6d2f2e9000000000
vtread(buf=0x81660ed0,conn=0xfefefefe00000008)+0x14 /usr/glenda/src/disk/neoventi/server.c:88
	addr=0x8
vtconnhandle(buf=0x81660ed0,conn=0xfefefefe00000008)+0xf8 /usr/glenda/src/disk/neoventi/server.c:100
vtloop(conn=0xfefefefe00000008,packetbuf=0x81660ed0)+0x1dc /usr/glenda/src/disk/neoventi/server.c:161
	sz=0xfefe0000fefe001c
	buf=0x78a7bfe5000c1a00
	offset=0xc1a00fefe0000
	len=0xfefe001cfefe001a
handleproc()+0x2d0 /usr/glenda/src/disk/neoventi/server.c:183
	conn=0xfefefefe00000008
	buf=0x78a7bfe5000c1a00
launcherarm64(arg=0x8,f=0x11298)+0xc /sys/src/libthread/arm64.c:8
launcherarm64(arg=0x8,f=0x11298)+0xfffffffffffffff8 /sys/src/libthread/channel.c:583
abort /sys/src/libc/9sys/abort.c:4
acid: 
echo kill > /proc/24985/ctl

Voila, now this is trivial to find :D
void
cacheunlock(u16int arenaindex, u32int blockindex)
{
	u32int key = (arenaindex << 16) | blockindex;
	bucket_t *bucket = getbucket(key);
	int index;
	while(trybucket(bucket, &index, key) != 1)
		bucket++;
	unlock(bucket);
}

That should work...

I'm deliberately treating buckets with openings as still needing probing, in case a key got removed by eviction.

...though that means trybucket needs to check the whole bucket even if it finds a gap, doesn't it, if I don't want to move the end of the bucket forwards, which I don't think I do.
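Something like this for trybucket, then (a sketch; the keys[] layout and the EMPTY sentinel are guesses, the real bucket_t may track occupancy differently):

/* Returns 1 if the key is present, 0 if it's absent but a free slot exists,
 * -1 if the bucket is full and the caller should probe onwards. */
enum { Bucketslots = 7, EMPTY = 0 };

static int
trybucket(bucket_t *bucket, int *index, u32int key)
{
	int i, firstfree = -1;
	for(i = 0; i < Bucketslots; i++){
		if(bucket->keys[i] == key){
			*index = i;
			return 1;
		}
		if(bucket->keys[i] == EMPTY && firstfree == -1)
			firstfree = i;	/* remember the gap, but keep scanning */
	}
	if(firstfree >= 0){
		*index = firstfree;
		return 0;
	}
	return -1;
}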

on the other hand, evictions should be relatively rare, since I plan on S3-FIFOing.

Also yeah that totally fixed it :D

...also I can TOTALLY make walk of /n/vac faster, I just took the wrong approach! I need to rewrite vacfs with a new libventi next ;P

With everything cached on the neoventi side, I'm seeing 26 seconds for a full read??? Coulda sworn it was like five, but that was a different image, I think, and the Reform instead of the Thinkpad, though this is... generally comparable performance-wise, so I'm uncertain. It'll need further testing.

regardless, seeing six seconds of CPU usage there, which can likely be cut down - and can probably eliminate the network roundtrips by doing some explicit in-fs prefetching to cut it down even further.

...worth comparing against venti again :P

neoventi: 69s cold, 26s warm.
venti:    37.17s cold??? it's winning?! 29 s warm. okay.

_sob_

oh wait this is off of /root, so it's not fully cold. right.

That's _concerningly_ close. What happened, neoventi?? Is it because reform??

...oh my gosh, this is with profiling enabled :O

...identical numbers without, though. the hell?

This might just be hardware, since it's only using one core? If there are latency bubbles, this might be amplified way beyond...

Of the 69s for a cold read, I'm only seeing 1.06s of CPU time, of which >80% is decompression.

CPU usage on a warm run is basically identical; neoventi only needs one second of CPU time for this. That's not terrible at all.

doesn't explain why a full core is at 100% during this process, though. vacfs isn't using that much either...

probably.

 % tprof 49587
total: 6420

six seconds. So, between neoventi and vacfs, that's only seven seconds of CPU time?? ...walk itself, maybe??

not going to try to write prof to its ctl quickly enough (though it'd be easy with a while(! ps | awk...)), just going to link walk with -p...

might need to tprof, getting a prof error??

% time walk /n/vac > /dev/null
1 Prof errors
12.46u 3.62s 27.04r 	 walk /n/vac

a prof... file was generated, but is invalid apparently??

file system isn't even close to full, so that's not it. ah well.

WALKLOOP:

rm /sys/log/walk.prof ; while(! ps | grep -s walk)sleep 0.01; pid=`{ps | grep walk | awk '{print($2)}'};echo profile > /proc/$pid/ctl;while(tprof $pid >> /sys/log/walk.prof){ sleep 0.5}

...um.

total: 12760
TEXT 00010000
    ms      %   sym
 11460	 89.8	seen
   220	  1.7	_profin
   200	  1.5	_profout
   110	  0.8	_callpc
   110	  0.8	strchr
    80	  0.6	convM2D
    70	  0.5	treesplay
    60	  0.4	memcpy
    40	  0.3	_restore
    30	  0.2	pooldel
    30	  0.2	blocksetdsize
    30	  0.2	blockcheck
    20	  0.1	D2B
    20	  0.1	treelookupgt
    20	  0.1	Bputc
    20	  0.1	poolfreel
    20	  0.1	dirpackage
    20	  0.1	unlock
    20	  0.1	walk
    20	  0.1	statcheck
    20	  0.1	close
    10	  0.0	pooladd
    10	  0.0	dofile
    10	  0.0	memset
    10	  0.0	canlock
    10	  0.0	trim
    10	  0.0	punlock
    10	  0.0	open
    10	  0.0	create
    10	  0.0	dirread
    10	  0.0	plock
    10	  0.0	fullrune
    10	  0.0	poolallocl
    10	  0.0	memccpy
    10	  0.0	pread

walk is ~13s of CPU time?!

oh right wait that's with -p lol

ummmm

total: 12130
TEXT 00010000
    ms      %   sym
 11530	 95.0	seen
   120	  0.9	memmove
    70	  0.5	convM2D
    50	  0.4	blockcheck
    40	  0.3	dirpackage
    40	  0.3	walk
    40	  0.3	treesplay
    20	  0.1	getdsize
    20	  0.1	poolfreel
    20	  0.1	blocksetdsize
    20	  0.1	pooldel
    20	  0.1	treelookupgt
    20	  0.1	strchr
    20	  0.1	s_append
    10	  0.0	poolalloc
    10	  0.0	s_terminate
    10	  0.0	Bwrite
    10	  0.0	s_free
    10	  0.0	dirread
    10	  0.0	dsize2bsize
    10	  0.0	poolrealloc
    10	  0.0	pread
    10	  0.0	_tas
    10	  0.0	blockmerge

okay so the profiler is shockingly efficient, tbh. expected way worse

but also walk is 12s of CPU time?!

What if I rip out that code and have it walk directories even if it's seen them already? Dangerous for recursive hierarchies, but probably fine for this...

even worse lol, of course.

also, for future reference, /n/vac/archive/2023/0707/usr/glenda/src/clone/.git is legitimately damaged on disk. both venti and neoventi yield the same error, e.g. 

walk: couldn't open /n/vac/archive/2023/0707/usr/glenda/src/clone/.git/index9: file does not exist: '/n/vac/archive/2023/0707/usr/glenda/src/clone/.git/index9'

Going to need a more efficient seen() implementation?

Two easy ideas: switch it to Fibonacci hashing for picking a cache bucket, and bump the number of buckets from 256 to e.g. 2048...

Interrupted that walk at 300s lol. Much worse without seen() :P

Testing with 2048 buckets... ~16.4s haha, that's almost a 50% reduction >_<

Obvious question: what about 16384 buckets? lol

also, of note: this cut the warm venti time to 19s, too!

64K buckets, neoventi warm: ~15s, too. Dropping that to 16K seems good. Overhead is on the order of eight bytes per bucket, so this should be less than 1M of additional RAM usage, and - bonus - I suspect that this scales better since a larger number of smaller buckets, with conservative allocation, should end up wasting less space, given that buckets start with no space allocated. That's conjecture though.

Regardless, it's nearly twice as fast!

it also extended neoventi's lead a bit, I think? venti went from 29 to 19, neoventi went from 26 to 15 ;P

walk uses `dir->qid.path^dir->qid.vers` as the key to the cache table. That's... I have questions about how uniformly distributed this will be. Let's try fibonaccizing it.
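Concretely, fibonaccizing would look something like this (a sketch; the name and the bucket count are mine, walk's real table code differs):

/* Fibonacci hashing: multiply by 2^64/phi and keep the top bits, which
 * spreads even lumpy keys like qid.path^qid.vers across the buckets. */
enum { Nbucketbits = 13 };	/* 8192 buckets, or whatever count wins */

static uint
seenbucket(uvlong path, ulong vers)
{
	uvlong key = path ^ vers;
	return (key * 0x9E3779B97F4A7C15ULL) >> (64 - Nbucketbits);
}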

_crosses fingers..._

With 4096 buckets, it's... still the same, just worse. Neato.

8K buckets seems sufficient.

Can fuck with bucket growth rate? or initial capacity? (to reduce reallocs/memmoves)

no clear impact, which makes sense; 95% of runtime is in seen() itself, not memory management / shuffling.

Could try caching just by QID, and then subcaching by version? An extra layer of indirection, but fewer comparisons, in principle.

Actually, let's just try removing version from bucket selection...

...also, oops, may have fucked up networking badly by unplugging phone while usb/ether was active?? looks like that may be causing vacfs to hang??

aux/dial is hanging dialing 127.1!14011, I think it may be trying to use the nonexistent phone for it :/

also fossil might be hanging. um.

I ran a `fsys main sync` and... it can't reach venti, maybe???

home fs is fine. I'm going to save this and shut down. Well, after a minute. Can't run a proper sync :(

Sanity test: is everything good?

% sync
main...
prompt: srv: already serving 'fscons'
srv -A boot
prompt: archive vac:10d05a5ed61e46b309dc374daaccc362dbb48783
archive vac:c28717a76c153eed834d684c7eff35814cb25b4c
fsys main sync
	main sync: wrote 2 blocks
done
home...
sys home sync
	home sync: wrote 1 blocks
done

Okay, cool.

% mk fg
./7.neoventi -c /dev/jasnah/arenas
Initializing neoventi build 5... TODO: validate initial stateinitialized, launching server...overridding tcp address for simultaneous testing! tcp!*!14011...

% mk vac && walk /n/vac
% mk vac && walk /n/vac
% mk vac && walk /n/vac
% mk vac && walk /n/vac

Got four walks running and it seems all fine. CPU load is at maybe 50%? Should be better, could be a lot worse.

Should import ori's walk patch.

(All four walks finished without issue :)

Walk seems to be using ~1s of CPU time at the moment. I suspect that'll be even better with ori's approach, since it'll reduce the number of entries we're keeping; we also won't need so many buckets there except when we _want_ full deduplication.

total: 990
TEXT 00010000
    ms      %   sym
   380	 38.3	seen
   120	 12.1	convM2D
    80	  8.0	memmove
    40	  4.0	pread
    40	  4.0	blockcheck
    30	  3.0	Bwrite
    30	  3.0	treesplay
    20	  2.0	poolreallocl
    20	  2.0	dofile
    20	  2.0	poolallocl
    20	  2.0	open
    20	  2.0	close
    20	  2.0	dirpackage
    20	  2.0	strchr
    20	  2.0	strlen
    10	  1.0	strcmp
    10	  1.0	s_newalloc
    10	  1.0	memset
    10	  1.0	_tas
    10	  1.0	walk
    10	  1.0	pooladd
    10	  1.0	pooldel
    10	  1.0	poolcompactl
    10	  1.0	s_free
    10	  1.0	blocksetdsize
    10	  1.0	dsize2bsize

Okay, what about vacfs?

total: 6340
TEXT 00010000
    ms      %   sym
   670	 10.5	_sha1block
   290	  4.5	treesplay
   260	  4.1	blocksetdsize
   250	  3.9	blockcheck
   230	  3.6	lock
   190	  2.9	trim
   180	  2.8	pooladd
   170	  2.6	pooldel
   170	  2.6	memcmp
   150	  2.3	blockmerge
   150	  2.3	_tas
   130	  2.0	memset
   130	  2.0	vdunpack
   120	  1.8	dsize2bsize
   120	  1.8	memmove
   120	  1.8	poolallocl
   120	  1.8	vtentryunpack
   110	  1.7	unlock
   110	  1.7	poolfree
   110	  1.7	poolfreel
    90	  1.4	strchr
    90	  1.4	plock
    80	  1.2	vtcacheglobal
    80	  1.2	D2B
    80	  1.2	convD2M
    80	  1.2	treelookupgt
    80	  1.2	blocksetsize
    70	  1.1	vtblockput
    70	  1.1	io
    60	  0.9	qunlock
    60	  0.9	mallocz
    60	  0.9	newfid
    60	  0.9	punlock
    50	  0.7	packettrim
    50	  0.7	pread
    50	  0.7	vdcopy
    40	  0.6	vacfilewalk
    40	  0.6	_barrier
    40	  0.6	pwrite
    40	  0.6	vtmalloc
    40	  0.6	getdsize
    40	  0.6	stringunpack
    40	  0.6	upheap
    40	  0.6	vtfileunlock
    40	  0.6	convM2S
    40	  0.6	vtfileblock
    30	  0.4	poolalloc
    30	  0.4	vtfilelock2
    30	  0.4	packetsplit
    30	  0.4	meunpack
    30	  0.4	memalloc
    30	  0.4	setmalloctag
    30	  0.4	heapdel
    30	  0.4	vacdirread
    30	  0.4	pqid
    20	  0.3	vderead
    20	  0.3	direntrysize
    20	  0.3	permf
    20	  0.3	rwalk
    20	  0.3	runlock
    20	  0.3	B2D
    20	  0.3	mbresize
    20	  0.3	packetsize
    20	  0.3	fileloadblock
    20	  0.3	_vtrecv
    20	  0.3	checksize
    20	  0.3	vtcachebumpblock
    20	  0.3	strcmp
    20	  0.3	setrealloctag
    20	  0.3	free
    20	  0.3	vtparsescore
    20	  0.3	_vtsend
    20	  0.3	blockwalk
    20	  0.3	downheap
    10	  0.1	vdeclose
    10	  0.1	read
    10	  0.1	strlen
    10	  0.1	vtlog
    10	  0.1	sha1
    10	  0.1	vtsend
    10	  0.1	vdcleanup
    10	  0.1	vtlogopen
    10	  0.1	vacfileisdir
    10	  0.1	fileopensource
    10	  0.1	malloc
    10	  0.1	vacstat
    10	  0.1	ropen
    10	  0.1	fullrune
    10	  0.1	chksource
    10	  0.1	vacfilegetmcount
    10	  0.1	filemetalock
    10	  0.1	vtstrdup
    10	  0.1	filelock
    10	  0.1	vtblockduplock
    10	  0.1	vtfcallfmt
    10	  0.1	panicblock
    10	  0.1	wlock
    10	  0.1	memfree
    10	  0.1	sizetodepth
    10	  0.1	fileload
    10	  0.1	vtfiletruncate
    10	  0.1	shrinkdepth
    10	  0.1	mkindices
    10	  0.1	wunlock
    10	  0.1	vtblockwrite
    10	  0.1	gstring
    10	  0.1	getcallerpc
    10	  0.1	packetfragments
    10	  0.1	fragalloc
    10	  0.1	canlock
    10	  0.1	packetappend
    10	  0.1	sizeS2M
    10	  0.1	convS2M
    10	  0.1	_vtrpc

Six seconds in vacfs, yikes. Neoventi?

total: 820
TEXT 00010000
    ms      %   sym
   590	 71.9	unwhack
   110	 13.4	aindexfromaddr
    40	  4.8	memcpy
    30	  3.6	pread
    20	  2.4	vtreadlookup
    10	  1.2	vtread
    10	  1.2	vtreadarenablock
    10	  1.2	_tas

0.8s, of which most is decompression, but... why is pread in there? Everything is definitely in cache, there's no eviction, we shouldn't be touching disk!!

total: 1950
TEXT 00010000
    ms      %   sym
  1440	 73.8	unwhack
   160	  8.2	aindexfromaddr
   120	  6.1	memcpy
    40	  2.0	vtreadlookup
    40	  2.0	_tas
    40	  2.0	memcmp
    40	  2.0	pread
    10	  0.5	abort
    10	  0.5	vtread
    10	  0.5	vtreadarenablock
    10	  0.5	vtreadarena
    10	  0.5	pwrite
    10	  0.5	trybucket
    10	  0.5	readclump

ahh! read() calls pread() :P that's waiting on the command from the client, then :P

Hm. 8% of that is also the linear scan of the arena map. Turning that into a hash table would help, too. Caching _after_ decompression would come in handy, too. Can we hack that, without adding a second layer of cache? Probably not; that'd mess with addresses of subsequent blocks. A second layer should be trivial, though. Worst case, cache.c is ~150 lines; another copy is no big deal. Making it aware of which blocks are compressed could be even better.

Should be able to eliminate at least 0.6s from the neoventi side, but that's... basically nothing.

...performance can wait. This is good, and a 2x improvement's nothing to sneeze at >_<

Cache eviction should be next. Get the read side _fully done_, and I should be able to get fossil running off of it while I implement write support, probably?

128 1024 1024 * *
8 1024 * /
8 / 
4 *
p
8192
8192

2 13 ^ p
8192
8192

Okay, so 128M of cache requires 13 bits of addressing.

An RWLock would require much more memory; the easiest solution for protecting buffers during reads is probably to just take a normal Lock. This is not ideal, but it's probably okay; during a read, we only need the lock long enough to copy the data. Copying 8KiB should be fast enough that the lock _does not matter_. Let's test that.

% cat /tmp/bench.c

#include <u.h>
#include <libc.h>

void
main(void)
{
	uchar src[8192], dest[8192];
	vlong start = nsec();
	for(int i = 0; i < 8192; i++)
		src[i] = rand() & 0xFF;
	vlong fillend = nsec();
	memcpy(dest, src, 8192);
	vlong end = nsec();
	print("fill time: %f, copy time:  %f\n", ((double)fillend-(double)start) / 1000000000, ((double)end-(double)fillend) / 1000000000);
}

...
copy time:  0.000004

4 microseconds. Holding a lock for 4 microseconds on read should be _completely fine_. That's on the Reform!! I'm not worried that there will be heavy contention here. I can revisit this if proven wrong.

Performance of the simple `walk` test is unchanged, at least. This seems more than fine. committing :)

Okay, read side is now working, very performant, and multi-reader. Now for the real fun: writing!

There are a few things we need to do to facilitate this. The overall idea is to have the VtWrite call. A VtWrite call should, _for now_, be totally synchronous; we can optimize it later. It should use the cache in a write-through fashion, and immediately write data to disk. For testing, we also need to be able to report what writes we _would_ do _without actually doing them_, and... preferably, to test NOT ON THE LIVE SYSTEM. No hurt Jasnah.

Actually, no, it can just bypass the cache! Addresses are unique. So, if we write to disk _without going through the cache_, any subsequent read will simply be uncached!

... Though, no, we're hash-addressed. We need to compute the hash _and_ check if it already exists. I missed a step here; let's just go over the desired properties of the VtWrite call:

               VtTwrite tag[1] type[1] pad[3] data[]
               VtRwrite tag[1] score[20]

Data comes in, with a type and 3 bytes of padding, which is stupid, but whatever. Type is more of a debugging tool than anything else and I don't think I need it, so we'll ignore it.

So the basic call, then, is

VtScore VtWrite([]byte data);

We're given a block to insert. We need to:

- Hash the block
- When done, return that hash, or an error.
- Check if we already have a copy (attempt to read it)
	- Double check that the data matches (is not a hash collision)
		- If hash collision, return an error.
		- Otherwise, exit early.
- If we don't already have a copy:
	- Reserve space in the arena
	- Write the block to the arena
	- Append to the log
	- Handle arena maintenance
		- e.g. if finishing this arena:
			- jump the address to the next one
			- in the background, start the process of "sealing" the current arena.

This implies, necessarily, that we need to know where the latest write finished. To do that, we need to find the active arena, and then scan it to find the correct position.
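Roughly the flow I'm aiming for (hand-written sketch; arenareserve and arenaappend are placeholders for machinery that doesn't exist yet):

VtAddress arenareserve(u16int len);
void arenaappend(VtAddress addr, uchar *data, u16int len);

static int
vtwriteblock(uchar *data, u16int len, u8int *score, uchar *scratch)
{
	VtAddress addr;
	sha1(data, len, score, nil);
	if(vtreadlookup(score, &addr)){
		/* hash already present: make sure it's really the same data */
		if(addr.size != len || !readclump(scratch, addr))
			return -1;
		if(memcmp(scratch, data, len) != 0)
			return -1;	/* actual collision */
		return 0;	/* duplicate write; nothing to do */
	}
	/* new data: reserve space at the end of the active arena, write the
	 * clump, and log it so the index can be updated */
	addr = arenareserve(len);
	arenaappend(addr, data, len);
	return 0;
}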

Orrrr neoventi could fail to start up on here. Lovely.

Added debug logs annnd

./7.neoventi: failed to parse arena map

readarenatable is... OH! I reverted the devfs change, didn't I?

This isn't a neoventi bug, this is an unaligned read bug in devfs/crypt!

What if I make this an aligned read? Can I do that with biobuf? Maybe read 512 bytes, then Bseek back?

Confirmed that it's a buffer read failure. I think it's midblock.

Tabbase 270336; it's a multiple of 512, which I think is the block size.

Confirmed that the patch is reverted, though. I guess I'll just apply it for now?

Yep, works with the patch reapplied, hello from Jasnah/14!

I'll add "make neoventi work with disks that require aligned reads" to the todo list and ignore for now :)

Okay, so, now I have it spinning up with the write shim to make sure we have paramters parsed correctly...
static void
vtwrite(VtConn conn, char *buf, u16int len)
{
	char *data;
	data = &buf[8];
	vterr(conn, buf, "TODO: write(data len %d)", len);
}

vacfs ain't gonna cut it this time. venti/write to the rescue!

venti(1):
          Write writes at most 56 kilobytes of data from standard
          input to the server host and prints the resulting score to
          standard output.  If the -t option is omitted, write uses
          type 0, denoting a data block.  If the -z option is given,
          write zero truncates the block before writing it to the
          server.

so, `echo hello | venti/write -h tcp!127.1!14011` ought to do it :)

./7.neoventi: TODO partial packet

hehehe or bugs. Cool. Not another regression / environment issue this time, since mk vac works just fine.

I have a commented out vtrecv under that sysfatal in the loop, guess we'll see what happens if I trust past me without reading it? :P

annnd loops forever. cool. add some debugging prints...

Reading packet from buffer, size 8, current offset in packet is 0
Read length from packet: 12
partly missing?
read 2, expected 8


okay, how about now?

abandoning client: TODO: write(data len 6)

that's more like it!

It's even better, actually:


-		sysfatal("TODO partial packet");
-//		vtrecv(conn, packetbuf);
+//		sysfatal("TODO partial packet");

just had to remove the sysfatal, the logic was actually correct. IIRC I'd added that just because I couldn't test it at the time :P

% echo -n hello | venti/write -h tcp!127.1!14011
venti/write: vtwrite: neoventi: TODO: write(data len 5)

Perfect, now to just print the data to be super sure...

Well, after committing the fix :)

Also, noted that venti/write at the least is doing two write()s to venti instead of just the one. Unsure how I feel about this. It should be possible to do that more optimally without any tradeoffs i think. TODO.

% echo -n hello | venti/write -h tcp!127.1!14011
venti/write: vtwrite: neoventi: TODO: write(data len 5 / 'hello')

Perfect :)

Step one: compute the score... do I have logic for this yet? I don't think so. It's just sha256 tho iirc? SHA1, actually, 20 bytes / 160 bits.

venti/venti code:

void
scoremem(u8int *score, u8int *buf, int n)
{
	DigestState s;

	memset(&s, 0, sizeof s);
	sha1(buf, n, score, &s);
}

This is just sechash(2) stuff, except done wrong lmfao.

			The first call to the
          routine should have nil as the state parameter.  This call
          returns a state which can be used to chain subsequent calls.

This doesn't mean "zero init a ds", it means "pass nil."

the correct invocation here is just sha1(buf, len, score, nil);

xD okay.

Here's what we can do: we're live-testing against the venti I'm typing this into. We can:

- Compute the score for 'hello'
- Verify that venti/read gives no data for it from neoventi
- Verify the same for oldventi
- use venti/write against the oldventi instance, note down the score
- use venti/read against neoventi, and verify that looking it up succeeds on the write path too

...okay, actually, looking through the sha1 source code, venti might have a point. DigestStates explicitly track whether they were malloced; passing in nil results in it mallocing a state and then freeing it when the digest is done. This seems like a really silly workaround, though. If the allocation isn't desired, let's patch libsec:

diff df8a21821d8b1728953afd49929ba5b69c925ca1 uncommitted
--- a/sys/src/libsec/port/sha1.c
+++ b/sys/src/libsec/port/sha1.c
@@ -18,12 +18,19 @@
 	int i;
 	uchar *e;
 
+	SHA1state ns;
+
 	if(s == nil){
-		s = malloc(sizeof(*s));
-		if(s == nil)
-			return nil;
-		memset(s, 0, sizeof(*s));
-		s->malloced = 1;
+		if(digest != nil){
+			s = &ns;
+			memset(s, 0, sizeof(*s));
+		} else {
+			s = malloc(sizeof(*s));
+			if(s == nil)
+				return nil;
+			memset(s, 0, sizeof(*s));
+			s->malloced = 1;
+		}
 	}
 
 	if(s->seeded == 0){

This adds sizeof(DigestState) - ~350 bytes - of stack usage to each call, even when not needed, but that shouldn't practically matter, and it removes one alloc+free pair for the specific case of one-shot hashes.

Arguably, this is optimizing libsec for venti's needs, but I don't think that's really a problem here, since it does not cause any problems for any other usage. I'll need to poke cinap and/or moody, though; iirc moody's been through libsec?

regardless, the score should be correct...
static void
vtwrite(VtConn conn, char *buf, u16int len)
{
	char *data;
	u8int score[20];
	data = &buf[8];
	data[len] = 0;
	sha1((uchar*)data, len, score, nil);
	char scorebuf[41];
	for(int i = 0; i < 20; i += 1){
		sprint(&scorebuf[i*2], "%02x", score[i]);
	}
	scorebuf[40] = 0;
	vterr(conn, buf, "TODO: write(data len %d / '%s', score '%s')", len, data, scorebuf);
}
venti/write: vtwrite: neoventi: TODO: write(data len 5 / 'hello', score 'aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d')

% echo -n hello | sha1sum
aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d

Yep!

Now for the fun part: abstract the read logic out of vtread, and then call it from vtwrite, and check if the data already exists...

actually, no, the logic is so simple that there's no need.

	VtAddress addr;
	int exists;
	if(vtreadlookup(score, &addr))
		// exists

Then the handling for validating the data, and for actually writing.

Okay! This is working just fine, but we need to actually make sure it doesn't fail to detect data which _does_ exist. Grab fossil/last?

annnnd fossil/last is giving the wrong result. lovely.



static void
vtwrite(VtConn conn, char *buf, u16int len)
{
	u8int score[20];
	VtAddress addr;
	sha1((uchar*)buf+8, len, score, nil);
	if(vtreadlookup(score, &addr)){
		vterr(conn, buf, "TODO: handle vtwrite for existing data");
	} else {
		vterr(conn, buf, "TODO: insert data");
	}
}

At least the code is stupidly simple!

archive vac:dc294a9b60bc62b368036181784c7a42688c05c8

Okay, let's use this one then.

% venti/read -h tcp!127.1!14011 dc294a9b60bc62b368036181784c7a42688c05c8

That should actually read the block, so

% venti/read -h tcp!127.1!14011 dc294a9b60bc62b368036181784c7a42688c05c8 | venti/write -h tcp!127.1!14011

should trigger the other path

./7.neoventi: index is corrupt: invalid bucket magic: sect 839a5928, buck cbc5e0b7

uhhhhh

okay, not great. Is this what happens when venti modifies the index underneath neoventi? >_<

same happens after relaunching neoventi. Trigger a venti/venti index flush?

% hget http://127.1!13012/flushicache

Of note is that the NVMe activity light is on maybe like 10% of the time during this process. This is just bad.

Okay, but that was the issue :)

Read length from packet: 26
Read length from packet: 306
Read length from packet: 306
abandoning client: TODO: handle vtwrite for existing data


but also venti/write is even worse than I thought, is it writing 306 byte chunks???

ah wait! that was two requests lol

...oh. no, no it wasn't, only the 26-byte one is from the read? eh, not looking into this now.

Anyways, confirmed that this is correct :) Commit and next step :D

Read side first, since it's easier.

Hahah super easy actually

		if(addr.size != len)
			vterr(conn, buf, "hash collision detected");
		if(!readclump(buf2, addr))
			vterr(conn, buf, "clump read failed: %r");
		if(memcmp(buf2, buf+8, len) != 0)
			vterr(conn, buf, "hash collision detected");

Have the general structure be
	
	if(hash exists){
		
	} else if (data is new) {
		
	}
	send success packet

so now we need that success packet so writes of existing data succeed correctly...

               VtRwrite tag[1] score[20]

first two bytes are the packet length (which doesn't count itself), then one byte for Rwrite, one for tag, and 20 for the score; so the length field is 22.
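So the reply buffer ends up laid out like this (my notes; assuming vtsend fills in the size, type, and tag):

/*	buf[0..1]	length field = 22 (size prefix doesn't count itself)
 *	buf[2]		VtRwrite
 *	buf[3]		tag
 *	buf[4..23]	score
 */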

soo

ah wait no

I was about to write code to manipulate the buffer but I outsmarted myself!

just vtsend lol

memcpy(buf+4, score, 20);
vtsend(conn, buf, 22, VtRwrite, 0);

annnnd crash.

% lstk

ppanic(fmt=0x30bd1)+0x120 /sys/src/libc/port/malloc.c:162
	pv=0x30150
	msg=0x36950
	v=0x1fffefd90
	n=0xfffefd9000000030
D2B(v=0x10131d6f0)+0x6c /sys/src/libc/port/pool.c:926
	a=0x10131d6e8
poolfreel(v=0x10131d6f0,p=0x352d0)+0x18 /sys/src/libc/port/pool.c:1152
	ab=0x30150
poolfree(p=0x352d0,v=0x10131d6f0)+0x34 /sys/src/libc/port/pool.c:1287
free()+0x18 /sys/src/libc/port/malloc.c:250
_schedinit()+0xd0 /sys/src/libthread/sched.c:69
	p=0xde6d8
	t=0xde838
main(argc=0x3,argv=0x1fffeff70)+0x84 /sys/src/libthread/main.c:36
_callmain+0x24 /sys/src/libc/9sys/callmain.c:20
acid: 
echo kill > /proc/484831/ctl

okay that's actually useless. acid -lthread and grab stacks()?

<stdin>:2: (error) setproc: open /proc/484718/mem: file does not exist: '/proc/484718'

uh.

venti/read -h tcp!127.1!14011 dc294a9b60bc62b368036181784c7a42688c05c8 | venti/write -h tcp!127.1!14011

write successful, memcpying
vtsending
panic: D2B called on non-block 10131d720 (double-free?)

Okay, vtsend is doing a Bad. Hmm.

write successful, memcpying
vtsending
sending, marking size...
settting type
write syscall... written!
panic: D2B called on non-block 10131d770 (double-free?)

No, actually, vtsend is working just fine. What's going on??

The request is actually succeeding, we're just crashing _afterwards_?

% venti/read -h tcp!127.1!14011 dc294a9b60bc62b368036181784c7a42688c05c8 | venti/write -h tcp!127.1!14011
dc294a9b60bc62b368036181784c7a42688c05c8

yep!

Request handled!panic: D2B called on non-block 10131d780 (double-free?)

Mmmm, a bug in the partial packet handling, maybe?

acid: stacks()
p=(Proc)0x3c588    pid 27808  Running
	t=(Thread)0x3c6e8    Running    /usr/glenda/src/disk/neoventi/server.c:251 serve [threadmain]
		open()+0xc /sys/src/libc/9syscall/open.s:6
		listen(dir=0x5c710,newdir=0x5c6e8)+0x34 /sys/src/libc/9sys/announce.c:93
			buf=0x7063742f74656e2f
			ctl=0x104c000000007
			n=0x700000009
		serve()+0x64 /usr/glenda/src/disk/neoventi/server.c:251
			adir=0x7063742f74656e2f
			dir=0x7063742f74656e2f
			ctl=0x7
		threadmain(argv=0x1fffeff70)+0x34 /usr/glenda/src/disk/neoventi/neoventi.c:118
		mainlauncher()+0x10 /sys/src/libthread/main.c:17
		launcherarm64(arg=0x1fffeff70,f=0x1a98c)+0x8 /sys/src/libthread/arm64.c:8
		launcherarm64(arg=0x1fffeff70,f=0x1a98c)+0xfffffffffffffff8 /sys/src/libthread/channel.c:583
		listen+0x34 /sys/src/libc/9sys/announce.c:93


p=(Proc)0xde9a8    pid 27838  Sched

hm yeah going to need prints here.

write syscall... written!
Request handled!
reading packet...
packet read
panic: D2B called on non-block 10135d808 (double-free?)

Oh interesting, packet read, n 0, sz 0

packet read, n 0, sz 0
hanging up
panic: D2B called on non-block 10135d838 (double-free?)

So now we're just crashing when hanging up. Lovely.

venti/read on its own works fine and doesn't cause this, though.

hmm

poolfreel(v=0x10131d6f0,p=0x352d0)+0x18 /sys/src/libc/port/pool.c:1152
	ab=0x30150
poolfree(p=0x352d0,v=0x10131d6f0)+0x34 /sys/src/libc/port/pool.c:1287
free()+0x18 /sys/src/libc/port/malloc.c:250
_schedinit()+0xd0 /sys/src/libthread/sched.c:69
	p=0xde6d8
	t=0xde838

			free(t->stk);
			free(t->cmdname);

That's freeing either the stack or the command name, depending on how accurate that line number is.

Stack corruption, maybe? Adding in the vtwrite() buffer for data comparison is probably overflowing.

yep, that was it.

% git/diff server.c
-	proccreate(handleproc, (void*)fd, 256*1024);
+	proccreate(handleproc, (void*)fd, 512*1024);

fixed :P

also, TODO: remove libbio usage, it's probably overkill for what we actually need.

Shorter term TODO: see about bumping buffer size to account for Bungetsz to make reads aligned so devfs patch can be reverted. (Performance improvement, since aligned reads are faster.)