shithub: purgatorio

ref: ec35f468e0eba87c9f09cbbe5fa8af2591e6f914
dir: /doc/perform/perform.ms/

View raw version
.TL
Reliable Benchmarking with Limbo on Inferno
.AU
John Bates
Vita Nuova Limited
.SP
3rd March 1999
.br
Revised 20th February 2000
.AB
Our goal was to identify a mechanism for performing
accurate, reliable and realistic benchmarking measurements
with Limbo on Inferno.
The measurements would need to be \fIaccurate\fR in the sense that 
we could be confident that they were sufficiently close to the real time
taken to perform the measured operation.
We would consider a measurement to be \fIreliable\fR if it could be obtained
repeatedly when carried out under the same conditions at a later date.
Additionally, a measurement is only useful if it can be claimed
to be related to the expected performance of a real applicable operation,
we would call such a measurement \fIrealistic\fR.
The emphasis in this document is on achieving a reliable measurement.
.AE
.SH
Expectation
.LP
We would like to be able to make measurements with code like:
.P1
s := Sample.new(NREP);
for(i:=0; i<NREP; i++) {
	t = tstamp();
	sys->sleep(0);
	t = tstamp() - t;
	s.add(t-base);
}
.P2
Where 
.CW s
is an object that can be used to record up to
.CW NREP
measurements and
.CW base
is our predetermined estimate of the cost of actually making the
measurement. The Limbo function
.CW tstamp()
uses \fIdevbench\fR to obtain a microsecond timestamp.
Such a measurement would give us an idea of the cost of making a
.CW sys->sleep(0)
call and by examining a sample of such measurements we could reason about
the true cost of the call. In particular, we would get a feel for
just how reliable our estimate of the true cost was.
.LP
The first task was to determine a suitable value to use for
.CW base
by measuring how long it took to do \fInothing\fR\(dg.
.FS
.vs 9
.FA
\(dg This benchmark is called
.CW BASE
in the \fIdevbench\fR benchmark suite.
.FE
.P1
s := Sample.new(NREP);
for(i:=0; i<NREP; i++) {
	t = tstamp();
	t = tstamp() - t;
	s.add(t-base);
}
(n, mean, min, max, std) := s.stat();
base = min;
.P2
By repeating the measurement a large number of times and then by
choosing the minimum measurement we could hope to choose a
sensible value for
.CW base.
Our confidence in the chosen value 
and in the results of subsequent benchmarking
would be greatest if the
distribution of measurements was \fInarrow\fR with most
of the values close to the minimum.
.SH
Initial Results
.LP
We ran the initial tests on a 150MHz Pentium PC with a standard
Inferno 2.0 native kernel and later with an Inferno 2.3 native kernel.
The commands
.P1
bench -o >x
bgrep BASE x | plot
.P2
run the benchmark suite leaving the observations in a file called
.CW x
and then extract the line containing the results for the test called
.CW BASE
and plot them to a Wm window under Inferno\(dg.
.FS
.vs 9
.FA
\(dg Actually, to generate the graphs in this document I used plot -g which
produces grap(1) output.
.br
.vs
.FE
The command
.CW bgrep
is used in place of
.CW grep
because the Inferno
.CW grep
uses
.CW sys->print
to output lines and so inherits a very small upper bound on line length.
The
.CW -o
option to 
.CW bench
delivers the individual observations rather than summary statistics.
The
.CW BASE
test repeats the above empty loop 1000 times.
Plotting the observations in the order in which they were observed
gives the surprising result shown in Figure 1.
.KF
.G1
copy "x0base.gr"
.G2
.I "Figure 1. Showing the unexpected variation in the BASE results."
.KE
The majority of values appear to be small, but the graph is dominated
by the relatively small number of very large irregular observations which
lie at around 6 milliseconds to 8 milliseconds
and a clearly discernible second set of values
which look to form a more regular pattern at about
the 1 millisecond level.
The smaller set of values can be seen more clearly by removing the
large values and rescaling the graph.
The
.CW -o
option to
.CW Plot
invokes a crude scheme to remove a layer of outliers.
It does this by calculating the sample mean and standard deviation and removing values which lie more than two standard deviations away from the mean. The process is repeated for each occurrence of the
.CW -o
option to
.CW plot.
Thus
.P1
plot -ooo
.P2
will remove three layers of outliers and then plot the remaining values.
.KS
Running
.P1
bgrep BASE x | plot -o
.P2
gives the output shown in figure 2.
.KE
.KF
.G1
copy "x0baseo.gr"
.G2
.I "Figure 2. Great variability in the BASE results, \
even with the large outliers removed."
.KE
Again, a large number of small values (less than 500 microseconds)
and a smaller number of larger values lying at around about the 1 millisecond
level.
Running the commands
.P1
bgrep BASE x | plot -p 1h
bgrep BASE x | plot -p 1h -o -z
.P2
gives a histogram showing the distribution of the results.
Firstly, with all the values
and then with one level of outliers removed \(dg.
.FS
.vs 9
.FA
\(dg The
.CW -z
option to \fIplot\fR forces a zero origin to be part of the plot.
.FE
.KF
.G1
copy "x0baseh.gr"
.G2
.I "Figure 3. Histogram showing the distribution of results \
from the BASE benchmark."
.KE
In Figure 3 we can see a small set of values up above the 7 millisecond
level and then a slightly more focused set at around about 1 millisecond but
together these groups make up much less than 10% of the total values.
.KF
.G1
copy "x0baseho.gr"
.G2
.I "Figure 4. Histogram showing the distribution of results \
from the BASE benchmark with the large outliers removed."
.KE
By removing a level of outliers we can see the distribution of the
lower values more clearly, as shown in Figure 4.
Even this lower cluster ranges from about 220 microseconds up to
about 500 microseconds.
.LP
We can conclude from these initial results
that it is not going to be possible to use this mechanism to
perform accurate or reliable benchmarking unless its performance
can be improved.
.SH
The Measurement Process
.LP
At this stage we're not actually measuring anything real, but looking
to obtain some confidence in our measurements.
We have started by observing times to complete the measurement of
\fInothing\fR. We do this by writing a Limbo program (\fIbench.b\fR)
which contains the following code:
.P1
t0, t1, base: big;
NREP: con 1000;

s := Sample.new(NREP);
treset();
for(i:=0; i<NREP; i++) {
	t0 = tstamp();
	t1 = tstamp();
	s.add(t1-t0);
}
(nil, nil, base, nil, nil) = s.stat();

iob.puts("BASE:1l:Observation::Time:us ");
s.obs(iob);
iob.puts("\\n");
iob.flush();
.P2
The middle section of code performs the benchmark by taking
two consecutive timestamps and noting the difference between them.
This is repeated 1000 times and the observations accumulated
in the sample variable
.CW s.
The function
.CW s.stat()
returns a tuple containing count, mean, minimum, maximum
and standard deviation for the accumulated values.
We assign the minimum value to our variable
.CW base.
The function
.CW treset()
arranges for the timestamp code to start returning
small values and so reduces the likelihood of overflow
or signing related issues.
.LP
The last section of code outputs one line containing the individual
observations separated by spaces and with a suitably formatted
initial \fIword\fR which labels the line so that
.CW plot
can make a useful interpretation of the results.
.KS
.LP
The functions
.CW treset()
and
.CW tstamp()
make use of files served by \fIdevbench\fR and look like this:
.P1
tsfd: ref Sys->FD;
treset()
{
	tsfd = sys->open("/dev/busec", sys->OWRITE);
	if(tsfd == nil)
		return;
	buf := array[1] of byte;
	n := sys->write(tsfd, buf, len buf);
	tsfd = sys->open("/dev/busec", sys->OREAD);
}
.P3
tstamp(): big
{
	buf := array[128] of byte;
	n := sys->read(tsfd, buf, len buf);
	if(n < 0)
		return big 0;

	return big string buf[0:n];
}
.P2
.KE
\fIDevbench\fR must have been previously bound into the
namespace
.P1
bind -b '#x' /dev
.P2
so that the files
.P1
/dev/bctl
/dev/bdata
/dev/busec
.P2
are available.
A microsecond timestamp is provided by reading from
.CW /dev/busec
and is reset to zero 
each time anthing is written to it.
The function
.CW treset()
resets the timestamp by writing to the file and then opens it for reading.
The function
.CW tstamp()
reads the timestamp from the file and converts it into a
.CW big
value for use in the program.
The device driver ignores the offset when reading from the file so that it
is not necessary to
.CW seek
to the beginning of the file before each read.
The important part of the benchmark consists of two consecutive calls to
.CW tstamp().
We need to identify where time is being spent; either in the calls to
.CW tstamp()
or else inside the function itself.
.SH
Costing
.LP
The \fIdevbench\fR suite of programs includes one for timing
individual lines of Limbo code and breaking down the time
spent on each line into times for each of the Dis operations
invoked by that line; the program is called \fIcost\fR.
To find out how long each part of the benchmark timing loop
takes we can invoke
.P1
cost -f bench.b -s bench
.P2
Which results in the following output - edited to just
show the timing loop and the function
.CW tstamp()
\&.
The first line is the output from \fIbench\fR giving the
mean, minimum, maximum and standard deviation of microsecond times for
1000 repetitions of the loop. We see a minimum of 402 microseconds
and a maximum of just over 9 milliseconds.
The remaining lines are the output of \fIcost\fR; those beginning
with a dash are Dis operation times for the preceding Limbo
source code line. Dis operation lines contain an operation name,
an execution count and the mean, minimum, maximum and
standard deviation of execution times.
.P1
BASE:summary:Sample:n:Time:us 1000 486 402 9053 499

-- bench.b
\&...
	treset();
- frame     1   2.13 us   2.13 us   2.13 us      0 ns
- call      1    807 ns    807 ns    807 ns      0 ns
	s := Sample.new(NREP*10);
- mframe    1   4.85 us   4.85 us   4.85 us      0 ns
- movw      1   2.27 us   2.27 us   2.27 us      0 ns
- lea       1   1.57 us   1.57 us   1.57 us      0 ns
- mcall     1   2.71 us   2.71 us   2.71 us      0 ns
	for(i:=0; i<NREP*10; i++) {
- movw      1   1.99 us   1.99 us   1.99 us      0 ns
- blew   1001   1.60 us   1.51 us  34.04 us   1.03 us
		t0 = tstamp();
- frame  1000   2.43 us   2.29 us   3.40 us    133 ns
- lea    1000   1.00 us    953 ns  27.74 us    847 ns
- call   1000   1.45 us   1.29 us   2.24 us    153 ns
		t1 = tstamp();
- frame  1000   3.45 us   3.01 us   5.57 us    293 ns
- lea    1000   1.22 us   1.02 us   1.61 us    173 ns
- call   1000   1.44 us   1.29 us   1.75 us    140 ns
		s.add(t1-t0);
- mframe 1000   3.65 us   3.37 us  24.73 us    700 ns
- movp   1000   1.99 us   1.48 us   2.96 us    267 ns
- subl   1000   2.43 us   2.23 us   4.80 us    173 ns
- mcall  1000   2.11 us   1.95 us   2.51 us    120 ns
- addw   1000   1.62 us   1.48 us   2.50 us    127 ns
- jmp    1000   1.50 us   1.37 us   1.81 us    120 ns
	}
	(nil, nil, base, nil, nil) = s.stat();
- mframe    1   2.66 us   2.66 us   2.66 us      0 ns
- movp      1   1.39 us   1.39 us   1.39 us      0 ns
- lea       1   1.41 us   1.41 us   1.41 us      0 ns
- mcall     1   1.37 us   1.37 us   1.37 us      0 ns
- movl      1   1.56 us   1.56 us   1.56 us      0 ns
	if(sflag)
- beqw      1   2.14 us   2.14 us   2.14 us      0 ns
		iob.puts("BASE:summary:Sample:n:Time:us "
                                       + s.str() + "\\n");
- mframe    1   4.01 us   4.01 us   4.01 us      0 ns
- movp      1   2.87 us   2.87 us   2.87 us      0 ns
- mframe    1   1.75 us   1.75 us   1.75 us      0 ns
- movp      1   1.69 us   1.69 us   1.69 us      0 ns
- lea       1   1.15 us   1.15 us   1.15 us      0 ns
- mcall     1   2.35 us   2.35 us   2.35 us      0 ns
- addc      1  19.57 us  19.57 us  19.57 us      0 ns
- addc      1   8.55 us   8.55 us   8.55 us      0 ns
- lea       1   1.49 us   1.49 us   1.49 us      0 ns
- mcall     1   3.65 us   3.65 us   3.65 us      0 ns
	iob.flush();
- mframe    1   4.50 us   4.50 us   4.50 us      0 ns
- movp      1   2.13 us   2.13 us   2.13 us      0 ns
- lea       1    967 ns    967 ns    967 ns      0 ns
- mcall     1   1.89 us   1.89 us   1.89 us      0 ns
	}
\&...
.P3

# return timestamp in microseconds
tstamp(): big
{
	buf := array[128] of byte;
- newa   2000   6.98 us   5.37 us  80.91 us   2.43 us
	n := sys->read(tsfd, buf, len buf);
- mframe 2000   3.05 us   2.59 us   4.05 us    227 ns
- movp   2000   2.33 us   2.14 us   3.21 us    120 ns
- movp   2000   1.19 us   1.14 us   2.93 us     73 ns
- lena   2000   1.07 us    907 ns  34.48 us    760 ns
- lea    2000    893 ns    860 ns  33.00 us    907 ns
- mcall  2000 133.81 us  89.88 us   8.61 ms 514.44 us
	if(n < 0)
- blew   2000   2.27 us   1.76 us   3.43 us    313 ns
		return big 0;

	return big string buf[0:n];
- movp   2000   2.21 us   1.67 us   3.38 us    253 ns
- slicea 2000  10.06 us   8.35 us  39.31 us   1.40 us
- cvtac  2000  12.02 us  11.13 us  16.63 us    553 ns
- cvtcl  2000  16.39 us  15.78 us  45.16 us    787 ns
- ret    2000  10.52 us   9.79 us  13.97 us    333 ns
}
.P2
We can see from these figures that times for both calls
to
.CW tstamp()
are reasonably low. But looking at the call to
.CW "sys->read()"
in 
.CW tstamp()
we can see a maximum of just over 8½ milliseconds for the \fImcall\fR
operation. We see too that this operation also has a high standard deviation
of 514 microseconds across 2000 calls and so it looks like it may be
the cause of the large observed delays.
.LP
In order to get a second view of what was happening we surrounded the
timing code with calls to enable kernel profiling.
.P1
profon();
for(i:=0; i<NREP*10; i++) {
	t0 = tstamp();
	t1 = tstamp();
	s.add(t1-t0);
}
profoff();
\&...
profon()
{
	if(sys->write(kpctl, array of byte "startclr", 8) < 8) 
		error("kprof start");
}

profoff()
{
	if(sys->write(kpctl, array of byte "stop", 4) < 4)
		error("kprof stop");
}
.P2
Having previously bound in \fIdevkprof\fR
.P1
bind -b '#T' /dev
.P2
and opened \fI/dev/kpctl\fR for writing.
By examining the contents of \fI/dev/kpdata\fR with \fIkprof\fR
we see that the garbage collection functions \fImarkheap\fR
and \fIrungc\fR are taking up a surprisingly large amount
of kernel time for this operation.
.P1
total: 1460	in kernel text: 1460	outside kernel text: 0
KTZERO 80100000
ms	  %	sym
130	  8.9	markheap
110	  7.5	iunlock
100	  6.8	xec
90	  6.1	rungc
80	  5.4	chartorune
70	  4.7	c2string
60	  4.1	numbconv
50	  3.4	string2c
40	  2.7	sched
40	  2.7	wakeup
30	  2.0	splhi
30	  2.0	freeptrs
30	  2.0	memset
30	  2.0	dodiv
30	  2.0	_divvu
30	  2.0	strtoll
20	  1.3	_mulv
20	  1.3	vmachine
20	  1.3	unlock
20	  1.3	runproc
20	  1.3	DEA
10	  0.6	kwrite
.P2
.SH
Interpreting the Results
.LP
In order to progress we need to look more closely at what is
happening inside the call to
.CW "sys->read".
As far as the kernel is concerned, there are two kinds of \fImcall\fR
operations; those which result in a call to a builtin function and those
which result in a call to a non-builtin function in another module.
.CW "Sys->read"
is of the former kind, it is a system builtin whose definition lies in the
function \fISys_read\fR.
When executing \fImcall\fR for a normal function the virtual machine
primes the register set so that on the next iteration the operations
will be fetched from the code associated with the new function.
When executing \fImcall\fR for a builtin the virtual machine actually executes
the associated function before returning to the next operation.
The code for \fISys_read\fR looks like this:
.P1
void
Sys_read(void *fp)
{
	int n;
	F_Sys_read *f;

	f = fp;
	n = f->n;
	if(f->buf == (Array*)H) {
		*f->ret = 0;
		return;		
	}
	if(n > f->buf->len)
		n = f->buf->len;

	release();
	*f->ret = kread(fdchk(f->fd), f->buf->data, n);
	acquire();
}
.P2
The important thing to note is that in common with all builtin
functions which must perform some kernel function
\fISys_read\fR releases the interpreter before carrying out
its task and then acquires it again before returning to
the body of the \fImcall\fR operation.
Releasing the interpreter means removing this Dis thread
from the list of threads to be run and making ready
another virtual machine from the list of kernel processes
waiting to use the interpreter, if necessary by first creating a new
one.
Once the interpreter has been released the process carries
out its operation, in this case a call to \fIkread\fR,
and then attempts to reacquire the interpreter.
If no other process is using the interpreter it will get it back
immediately, otherwise the process sets its state to \fIQueueing\fR
and calls
.CW sched()
to wait for its turn.
How long it has to wait depends on how many other threads are running and
whether they are compiled or interpreted.
If another process wants to use the interpreter then it may run
for a short period of time before making a call to \fIrungc()\fR
to perform
a small amount of garbage collection
and then yielding control to our process.
If the other process is running an interpreted thread then it may execute
for up to 2048 Dis instructions or for a more variable time if
it is a compiled thread.
We can see the likely effect of this by timing calls to \fIrungc\fR.
The following code does this inside \fIdevbench\fR\(dg:
.FS
.vs 9
.FA
\(dg This benchmark is called
.CW GCSCHED
in the \fIdevbench\fR benchmark suite.
.FE
.P1
log("GCSCHED:1d:Observation:n:Time:us");
for(i=0; i<1000; i++) {
	(*ts)(&t0);
	rungc(head);
	(*ts)(&t1);
	log(" %.2f",  ts2us(t1-t0));
	release();
	acquire();
}
log("\\n");
.P2
The calls to
.CW release()
and
.CW acquire()
allow other Dis threads to run and so exercise the memory allocation
code in much the same way that the same calls in
.CW tstamp()
do.
On each call to
.CW rungc()
the garbage collector visits 50 blocks in the heap.
The times for 1000 calls to rungc are shown in Figure 5.
.KF
.G1
copy "x9gc.gr"
.G2
.I "Figure 5. Times for 1000 calls to rungc()."
.KE
Running
.P1
cat results | bcut 1-201 | plot 
.P2
will give us just the first 200 data points in a little more detail,
these are shown in Figure 6.
.KF
.G1
copy "x9gc200.gr"
.G2
.I "Figure 6. Times for the first 200 calls to rungc()."
.KE
We can see that the time for calls to rungc has a very repeatable pattern,
but there is no sign of the large 6 millisecond delay.
.LP
Garbage collection is also done during idle virtual machine cycles.
If no thread is ready to run the interpreter calls
.CW execatidle()
and then sleeps until there is something to do.
.CW Execatidle()
attempts to complete 3 epochs of garbage collection, checking
after each call to
.CW rungc()
to see if any thread is ready to run; if a thread is ready, normal interpretation
is resumed immediately.
.KS
The following code in \fIdevbench\fR is used to determine
how long it might take the garbage collector to run 3 epochs\(dg:
.P1
int
idlegc(void *p)
{
	int done;
	Prog *head;
	vlong t0, t1, tot;
	USED(p);

	head = progn(0);	/* isched.head */
	done = gccolor + 3;
	tot = 0;
	while(gccolor < done && gcruns()) {
		if(tready(nil))
			break;
		(*ts)(&t0);
		rungc(head);
		(*ts)(&t1);
		tot += t1-t0;
	}
	log(" %.2f",  ts2us(tot));
	nidle--;
	if(nidle == 0) {
		log("\n");
		return 1;
	}
	return 0;
}
.P2
.KE
.FS
.vs 9
.FA
\(dg This benchmark is called
.CW GCIDLE
in the \fIdevbench\fR benchmark suite.
.FE
The code is invoked by assigning 100 to
.CW nidle
and then calling
.P1
atidle(idlegc, 0);
.P2
so that the interpreter will call this code when idle
in the same way that it calls the normal garbage collection code.
.KS
The function
.CW tready()
is defined as
.P1
tready(void *a)
{
	return isched.runhd != nil || isched.yield != 0;
}
.P2
.KE
returning \fItrue\fR whenever there is a thread in the run queue
or when another interpreter kernel process has called \fIacquire()\fR
to request that control of the interpreter be \fIyielded\fR to it.
In fact, the code mirrors the garbage collection code in the interpreter,
breaking when another thread is ready to be run.
.KF
.G1
copy "x31.gr"
.G2
.I "Figure 7. Times for performing three epochs of garbage collection."
.KE
.LP
Figure 7 shows the times for performing three epochs of garbage collection.
It is clear from these results that idle garbage collection can take
some considerable time even when the system is relatively quiet.
The times for completing 3 epochs of garbage collection
were of the same order of magnitude as the large delays we were
noticing with the timestamping earlier.
However, the system is clearly designed to relinquish control
whenever another thread needs to run.
It was not, therefore, clear that this could be the cause of these delays.
.SH
An Explanation
.LP
Clock interrupts happen on this system at regular intervals
of 10 milliseconds.
At each such interrupt the handler checks to see if any
kernel processes are ready to run, and if so it calls \fIsched\fR.
This gives the system its preemptive scheduling.
Suppose we are executing a call to
.CW sys->read().
If after calling \fIrelease\fR but before returning
from the call to \fIkread\fR and calling \fIacquire\fR a clock interrupt goes off
then another process may grab the interpreter and complete a scheduler
quantum.
If no other threads are ready to run
because, like us, they are still in the kernel then
.CW tready()
will return \fIfalse\fR
and so it will choose to perform idle garbage collection.
There is an upper bound on how long this may go on for;
it is the minimum of the time it takes to perform three epochs
of garbage collection and the time to the next clock interrupt
which may be as much as 10 milliseconds.
This would give rise to the periodic high values we have seen
and would likely result in delays of the same order of magnitude
as those that we have seen.
.LP
One way to check this hypothesis would be to modify the condition
upon which the idle garbage collection is halted to include a test
for kernel processes that are ready to run.
.KS
The existing code in the function \fIport/dis.c:execatidle()\fR breaks
from the process of garbage collection only when another interpreter
requests control or when another thread is ready to run.
.P1
done = gccolor+3;
while(gccolor < done && gcruns()) {
	if(isched.yield != 0 || isched.runhd != isched.runtl)
		break;

	rungc(isched.head);
}
.P2
.KE
The result of this is that when a thread makes a call that takes
it into the kernel it is effectively given a lower priority than the idle time
garbage collection. This effect can only be seen if a clock interrupt happens
which results in control being switched to an interpreter thread which completes
its execution quantum and enters idle time garbage collection. Ordinarily,
a thread in the kernel will either run to completion or will, itself, call
.CW sched()
while waiting for an event to occur.
.LP
This effect will be visible in any Inferno thread that makes a call to a system
function which then results in a \fIrelease\fR of the interpreter - not
just this benchmarking application.
.SH
Improving the Results
.LP
We have changed the code in \fIexecatidle\fR to call \fIsched\fR
on detection of
ready kernel processes.
.KS
It now looks like this:\(dg
.P1
done = gccolor+3;
while(gccolor < done && gcruns()) {
	if(isched.yield != 0 || isched.runhd != isched.runtl)
		break;
	rungc(isched.head);
	sched();
}
.P2
.KE
.FS
.vs 9
.FA
\(dg A first attempt checked the number of processes ready
to run, making the call to
.I sched
conditional, but that did not correctly account for their priorities
relative to the process running the collector.
.FE
Processes in the kernel running at the
same priority as the collector are interleaved with the collector.
If such a process returns from a
kernel operation and attempts to \fIacquire\fR the interpreter then
this loop will exit the next time around because
.CW isched.yield
will be set.
Kernel processes which do not require the interpreter,
such as the network timers, will get to run without interrupting
the garbage collection.
The effect of this change can be, readily, seen in much smaller delays
shown in Figure 8.
.KF
.G1
copy "x27base.gr"
.G2
.I "Figure 8. Improved results from idle garbage collection change"
.KE
The large 6 millisecond delays have disappeared but regular 1 millisecond
delays still remain. Looking more closely at the first 200
times, in Figure 9, we can see that the pattern looks very similar to the pattern
we know that we get from repeatedly calling \fIrungc()\fR.
.KF
.G1
copy "x27base200.gr"
.G2
.I "Figure 9. Improved results from idle garbage collection change - \
the first 200 values"
.KE
Indeed, this is most likely the cause of the regular pattern.
The call to \fIrungc\fR after completing each quantum of scheduler
activity results in quite a variable delay which means that the real
time taken in executing
.CW tstamp()
will be difficult to predict.
This will be true for any Limbo function that calls a system builtin
which in turn releases the interpreter.
.LP
I suspect that the single large delay will occur when the garbage
collector completes an epoch and runs through marking each
root as a propagator.
There may be ways to improve the situation by smoothing this operation,
perhaps by making it incremental or by temporarily disabling garbage
collection or by optimising the conditions under which garbage collection
is done.
However, garbage collection has to be done sometime, and it is not
clear that, in general, deferring it is always the best thing to do - we may get good
results when the system is quiet but very bad results when garbage
collection is required.
It would seem likely that there may have been some work done on this aspect
of the virtual machine in later incarnations of the Inferno system.
.LP
There are two ways forward. We could accept that garbage collection
is an integral part of the system and not try to measure small units
of time - instead, we could measure the time for a lot of operations
and average them out and then subtract the known garbage collection
background count from the overall time to get a feel for how much time
taken performing our operation of interest.
There are several problems with this approach.
Firstly, it seems unreasonable not to be able to
measure something small with Inferno; frustrating too, when the problem
is caused by a visually regular disturbance.
Secondly, measuring something over a long period of time gives
ample opportunity for other factors to interfere with the operation
of interest (for example, interrupts from external interfaces).
.LP
The second way forward is to modify our measurement process so that
its interaction with the virtual machine is minimised in such a way that
it can be used more predictably and, hence, is more reliable.
We did this by providing a builtin function
.CW bench->microsec()
and by providing an interface which allowed garbage collection to be
disabled for a period of time.
.KS
The Limbo module prototype looks like this:
.P1
{
	PATH:	con "$Bench";

	microsec:	fn(): big;
	disablegc:	fn();
	enablegc:	fn();
};
.P2
.KE
.KS
and the Limbo function to return a timestamp just calls the builtin.
.P1
xstamp(): big
{
	return bench->microsec();
}
.P2
.KE
Using \fIxstamp()\fR in place of \fItstamp()\fR
without disabling garbage collection results in a significant
improvement to the times returned by the
.CW BASE
benchmark\(dg.
These results are shown in Figure 10.
.FS
.vs 9
.FA
\(dg The kernel used to make this recording (and all subsequent in this document)
has had the modifications made to the idle garbage collection code
described earlier. We found that switching to using a builtin with the
old kernel did result in much better times but could still occasionally
fall foul of a clock interrupt.
.FE
.KF
.G1
copy "x28base.gr"
.G2
.I "Figure 10. Further improvement by using a builtin timestamp."
.KE
I suspect that the few large values are as a result of external interrupts
or scheduler quantum garbage collection.
The distribution of these values is shown in Figure 11.
.KF
.G1
copy "x28basehd.gr"
.G2
.I "Figure 11. The distribution of the results shown in Figure 10"
.KE
Out of the 1000 measurements all but 6 of them were
recorded at 6 microseconds.
.KS
.LP
By wrapping the entire benchmark with calls to
.P1
bench->disablegc()
.P2
and
.P1
bench->enablegc()
.P2
which disable and then reenable all garbage collection
the figures can be improved even further.
.KE
.KF
.G1
copy "x29basehd.gr"
.G2
.I "Figure 12. The distribution of results obtained when garbage collection \
is completely disabled."
.KE
The histogram of timestamps obtained when garbage collection
is completely disabled is shown in Figure 12.
It is clear that the values we are now getting
from \fIxstamp()\fR are much more acceptable.
Out of the 1000 measurements, five were recorded at 12 microseconds,
one at 8 microseconds and the remainder at either 6 or 7 microseconds.
.BP
.SH
Conclusion
.LP
Comparing the histogram for the original measurements
with the one obtained by making a simple change to
the interpreter and by using a builtin instead of using
.CW sys->read()
it is clear that considerable improvements have been made.
This behaviour will be visible in any Inferno thread that makes a
call to a system function which, in turn, causes the interpreter
to be released. These effects will be present in any Inferno application
that uses, for example,
.CW "sys->read(), sys->write(), sys->mount()"
and
.CW "sys->bind()"
or any other such builtin system functions.
.LP
The decision to enable or disable garbage collection whilst benchmarking
is a difficult one.
Disabling it results in slightly better measurement, but is somewhat less realistic.
The improvement gained by disabling it doesn't seem good enough to
justify the likely criticism which might be levelled at the subsequent results.
.LP
It would appear that to enable reliable benchmarking
of Limbo programs under Inferno two changes should be applied.
Firstly, a simple change should be made to the interpreter to prevent idle
garbage collection from grabbing a large time slot when scheduling
occurs at a clock interrupt
and one or more interpreter threads are in kernel operations.
Secondly, making the timestamp function a builtin
results in a simpler interface to the kernel and results in
a measurement process whose execution time is reliable.
Applied together, these changes make it possible to perform
reliable benchmarking of Limbo programs under Inferno.