.so tmacs
.BC 9 "More tools
.BS 2 "Regular expressions
.LP
We have used
.CW sed
.ix [sed]
to replace one string with another. But, what happens here? 
.P1
; echo foo.xcc | sed 's/.cc/.c/g'
foo..c
; echo focca.x | sed 's/.cc/.c/g'
f.ca.x
.P2
.LP
We need to learn more.
.PP
.ix "text matching
In addresses of the form
.CW /text/
and in commands like
.CW s/text/other/ ,
the string
.CW text
is not a string for
.CW sed .
This happens to many other programs that search for things.
.ix "text search
For example, we have used
.CW grep
.ix [grep]
to print only lines containing a string. Well, the
.I string
given to grep, like in
.P1
; grep string file1 file2 ...
.P2
.LP
is
.I not
a string. It is a
.B "regular expression" .
A regular expression is a little language. It is very useful to master it, because
many commands employ regular expressions to let you do complex things
in an easy way.
.PP
The text in a regular expression represents many different strings. You have
already seen something similar. The
.CW *.c
in the shell, used for globbing, is very similar to a regular expression,
.ix globbing
although it has a slightly different meaning. But you know that in the shell,
.CW *.c
\fBmatches\fP
with many different strings. In this case, those that are file names in the current
directory that happen to terminate with the characters “\f(CW.c\fP”. That is what
regular expressions, or
.I regexps ,
are for. They are used to select or match text, expressing the kind of text
to be selected in a simple way. They are a language on their own.
A regular expression, as known by
.CW sed ,
.CW grep ,
and many others,
is best defined recursively, as follows.
.IP •
Any single character
.I matches
the string consisting of that character. For example,
.CW a
matches
.CW a ,
but not
.CW b .
.IP •
A single dot, “\f(CW.\fP”, matches
.I any
single character. For example, “\f(CW.\fP” matches
.CW a
and
.CW b ,
but not
.CW ab .
.IP •
.ix "character set
A set of characters, specified by writing a string within brackets, like
.CW [abc123] ,
matches
.I any
character in the string. This example would match
.CW a ,
.CW b ,
or
.CW 3 ,
but not
.CW x .
A set of characters, but starting with
.CW ^ ,
matches any character
.I not
in the set. For example,
.CW [^abc123]
matches
.CW x ,
but not
.CW 1 ,
which is in the string that follows the
.CW ^ .
A range may be used, like in
.CW [a-z0-9] ,
which matches any single character that is a letter or a digit.
.ix "character range
.IP •
.ix "start~of text
.ix "end~of text
.ix "start~of line
.ix "end~of line
A single
.CW ^ ,
matches the start of the text. And a single
.CW $ ,
matches the end of the text. Depending on the program using the
regexp, the text may be a line or a file. For example, when using
.CW grep ,
.CW a
matches the character
.CW a
at
.I any
place. However,
.CW ^a
matches
.CW a
only when it is the first character in a line, and
.CW ^a$
also requires it to be the last character in the line.
.IP •
Two regular expressions concatenated match
any text matching the first regexp followed by any text
matching the second. This is harder to say
than it is to understand. The expression
.CW abc
matches
.CW abc
because
.CW a
matches
.CW a ,
.CW b
matches
.CW b ,
and so on. The expression
.CW [a-z]x
matches any two characters where the first one matches
.CW [a-z] ,
and the second one is an
.CW x .
.IP •
Adding a
.CW *
after a regular expression, matches zero or more
strings that match the expression. For example,
.CW x*
matches the empty string, and also
.CW x ,
.CW xx ,
.CW xxx ,
etc. Beware,
.CW ab*
matches
.CW a ,
.CW ab ,
.CW abb ,
etc. But it does
.I not
match
.CW abab .
The
.CW *
applies to the preceding regexp, which is just
.CW b
in this case.
.IP •
Adding a
.CW +
after a regular expression, matches one or more
strings that match the previous regexp. It is like
.CW * ,
but there has to be at least one match. For example,
.CW x+
does not match the empty string, but it matches every other thing
matched by
.CW x* .
.IP •
.ix "optional string
Adding a
.CW ?
after a regular expression, matches either the empty string or
one string matching the expression. For example,
.CW x?
matches
.CW x
and the empty string. This is used to make parts optional.
.IP •
Different expressions may be surrounded by parentheses, to alter
grouping. For example,
.CW (ab)+
matches
.CW ab ,
.CW abab ,
etc.
.IP •
Two expressions separated by
.CW |
match anything matched either by the first, or the second regexp. For example,
.CW ab|xy
matches
.CW ab ,
or
.CW xy .
.IP •
.ix backslash
.ix "escape character
A backslash removes the special meaning for any character used for
syntax. This is called an
.I escape
character.
For example,
.CW (
is not a well-formed regular expression, but
.CW \e(
is, and matches the string
.CW ( .
To use a backslash as a plain character, and not as an escape, use the
backslash to escape itself, like in
.CW \e\e .
.LP
That was a long list, but it is easy to learn regular expressions just by using
them. First, let's fix the ones we used in the last section. This is what happened to us.
.P1
; echo foo.xcc | sed 's/.cc/.c/g'
foo..c
; echo focca.x | sed 's/.cc/.c/g'
f.ca.x
.P2
.LP
But we wanted to replace
.CW .cc ,
and not
.I any
character and a
.CW cc .
Now we know that the first argument to the
.CW sed
command
.CW s ,
is a regular expression.
We can try to fix our problem.
.P1
; echo foo.xcc | sed 's/\e.cc/.c/g'
foo.xcc
; echo focca.x | sed 's/\e.cc/.c/g'
focca.x
.P2
.LP
It seems to work. The backslash removes the special meaning for the dot,
and makes it match just one dot. But this may still happen.
.P1
; echo foo.cc.x | sed 's/\e.cc/.c/g'
foo.c.x
.P2
.LP
And we wanted to replace only the extension for file names ending in
.CW .cc .
We can modify our expression to match
.CW .cc
only when immediately before the end of the line (which is the string being
matched here).
.P1
; echo foo.cc.x | sed 's/\e.cc$/.c/g'
foo.cc.x
; echo foo.x.cc | sed 's/\e.cc$/.c/g'
foo.x.c
.P2
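.LP
As a quick check, the two names that confused us at the beginning are now
left alone, and a name really ending in
.CW .cc
is still changed.
.P1
; echo foo.xcc | sed 's/\e.cc$/.c/g'
foo.xcc
; echo focca.x | sed 's/\e.cc$/.c/g'
focca.x
; echo foo.cc | sed 's/\e.cc$/.c/g'
foo.c
.P2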
.LP
.ix "inner expression
.ix "sub-expression match
Sometimes, it is useful to be able to refer to text that matched part
of a regular expression. Suppose you want to replace the variable name
.CW text
with
.CW word
in a program.
You might try with
.CW s/text/word/g ,
but it would change other identifiers, which is not what you want.
.P1
; cat f.c
void
printtext(char* text)
{
	print("[%s]", text);
}
; sed 's/text/word/g' f.c
void
printword(char* word)
{
	print("[%s]", word);
}
.P2
.LP
The change is only to be done if
.CW text
is not surrounded by characters that may be part of an identifier in the
program. For simplicity, we will assume that these characters are just
.CW [a-z0-9_] .
We can do what follows.
.P1
; sed 's/([^a-z0-9_])text([^a-z0-9_])/\e1word\e2/g' f.c
void
printtext(char* word)
{
	print("[%s]", word);
}
.P2
.LP
.ix "identifier
The regular expression 
.CW [^a-z0-9_]text[^a-z0-9_]
means “any character that may not be part of an identifier”, then
.CW text ,
and then “any character that may not be part of an identifier”.
Because the substitution affects
.I all
the regular expression, we need to substitute the matched string with
another one that has
.CW word
instead of
.CW text ,
but keeping the characters matching
.CW [^a-z0-9_]
before and after the string
.CW text .
This can be done by enclosing both occurrences of
.CW [^a-z0-9_]
in parentheses.
Later, in the destination string, we may use
.CW \e1
to refer to the text matching the first regexp within parentheses, and
.CW \e2
to refer to the second.
.PP
Because
.CW printtext
is not matched by
.CW [^a-z0-9_]text[^a-z0-9_] ,
it was untouched. However, “\f(CW␣text)\fP” was matched. In the destination string,
.CW \e1
was a white space,
because that is what matched the first parenthesized part. And
.CW \e2
was a right parenthesis, because that is what matched the second one.
As a result, we left those characters untouched, and used them as
.I context
to determine when to do the substitution.
.ix "match context
.PP
Regular expressions make it easy to clean up source files.
In many cases, it makes no sense to keep white space at the end of lines.
This removes it.
.P1
; sed 's/[ \t]*$//'
.P2
.LP
We saw that a script
.CW t+
can be used to indent text in Acme.
Here it is.
.P1
; cat /bin/t+
#!/bin/rc
sed 's/^/\t/'
;
.P2
.LP
This other script removes one level of indentation.
.ix "text indent
.ix [t+]
.ix [t-]
.P1
; cat /bin/t-
#!/bin/rc
sed 's/^\t//'
;
.P2
.LP
How many mounts and binds are performed by the standard namespace?
How many others of your own did you add? The file
.CW /lib/namespace
.ix [/lib/namespace]
.ix "[namespace] file
is used to build an initial namespace for you. But this file has comments, on lines
starting with
.CW # ,
and may have empty lines.
The simplest thing would be to search just for what we want, and count the lines.
.P1
; sed 7q /lib/namespace
# root
mount -aC #s/boot /root $rootspec
bind -a $rootdir /
bind -c $rootdir/mnt /mnt

# kernel devices
bind #c /dev
; grep '^(bind|mount)' /lib/namespace
mount -aC #s/boot /root $rootspec
bind -a $rootdir /
bind -c $rootdir/mnt /mnt
.I ...
; grep '^(bind|mount)' /lib/namespace | wc -l
     41
; grep '^(bind|mount)' /proc/$pid/ns | wc -l
     72
.P2
.LP
We had 41 binds/mounts in the standard namespace, and the one used by
our shell (as reported by its
.CW ns
file) has 72 binds/mounts. It seems we added quite a few in our profile.
.LP
There are many other uses for regular expressions, as you will be
able to see from here to the end of this book. In many cases, your C
programs can be made more flexible by accepting regular expressions
for certain parameters instead of mere strings. For example, an editor
might accept a regular expression that determines if the text is to be
shown using a
.CW "constant width font"
or a
.I "proportional width font" .
For file names matching, say
.CW .*\e.[ch] ,
it could use a constant width font.
.PP
It turns out that it is
.I trivial
to use regular expressions in a C program, by using the
.CW regexp
.ix [regexp]
library. The expression is
.I compiled
into a description more amenable to the machine, and the resulting
data structure (called a
.CW Reprog )
.ix [Reprog]
can be used for matching strings against the expression. This program
accepts a regular expression as a parameter, and then reads one line
at a time. For each such line, it reports if the string read matches the
regular expression or not.
.so progs/match.c.ms
.ix [match.c]
.LP
The call to
.CW regcomp
.ix [regcomp]
.ix "regular expression compiler"
.I compiles
the regular expression into
.CW prog .
Later,
.CW regexec
.I executes
the compiled regular expression to determine if it matches the string
just read in
.CW buf .
The parameter
.CW sub
points to an array of structures that keeps information about the match.
The whole string matching starts at the character pointed to by
.CW sub[0].sp
and terminates right before the one pointed to by
.CW sub[0].ep .
Other entries in the array report which substring matched the
first parenthesized expression in the regexp,
.CW sub[1] ,
which one matched the second one,
.CW sub[2] ,
etc. They are similar to
.CW \e1 ,
.CW \e2 ,
etc.
This is an example session with the program.
.P1
; 8.match '*.c'
regerror: missing operand for *	\fRThe * needs something on the left!\fP

; 8.match '\e.[123]'
!!x123
no match
!!.123
matched: '.1'
!!x.z
no match
!!x.3
matched: '.3'
.P2
.BS 2 "Sorting and searching
.LP
.ix sorting
.ix searching
One of the most useful tasks achieved with a few shell commands is inspecting
the system to find out things. In what follows we are going to learn how to do this,
using several assorted examples.
.PP
Running out of disk space? It is not likely, given the big disks we have today.
But anyway, which ones are the biggest files you have created at your home
directory?
.PP
The command
.CW du
(disk usage)
.ix [du]
.ix "disk usage
reports disk usage, measured in disk blocks. A disk block is usually 8 or 16 Kbytes,
depending on your file system. Although
.CW "du -a"
reports the size in blocks for each file, it is a burden to scan by yourself through the
whole list of files to search for the biggest one. The command
.CW sort
.ix [sort]
.ix "text sort
is used to sort lines of text, according to some criteria. We can ask
.CW sort
to sort the output of
.CW du
numerically (\f(CW-n\fP) in decreasing order (\f(CW-r\fP), with
biggest numbers first, and then use
.ix "[sort] flag~[-n]
.ix "[sort] flag~[-r]
.CW sed
to print just the first few lines. Those correspond to the biggest files, which
we are interested in.
.P1
; du -a bin | sort -nr | sed 15q
4211	bin
3085	bin/arm
864	bin/arm/enc
834	bin/386
333	bin/arm/madplay
320	bin/arm/madmix
319	bin/arm/deco
316	bin/386/minimad
316	bin/arm/minimad
280	bin/arm/mp3
266	bin/386/minisync
258	bin/rc
212	bin/arm/calc
181	bin/arm/mpg123
146	bin/386/r2bib
;
.P2
.LP
This includes directories as well, but points us quickly to files like
.CW bin/arm/enc
which seems to occupy 864 disk blocks!
.PP
But in any case, if the disk is filling up,
it is a good idea to locate the users that created files (or added data to them),
to alert them. The flag
.CW -m
for
.CW ls
lists the name of the user who last modified the file. We may collect user names for
all the files in the disk, and then notify them. We are going to play with commands
until we complete our task, using
.CW sed
to print just a few lines until we know how to process all the information.
The first step is to use the output
of
.CW du
as the initial data, the list of files. If we remove everything up to the file names,
we obtain a list of files to work with. 
.P1
; du -a bin | sed 's/.*	//' | sed 3q
bin/386/minimad
bin/386/minisync
bin/386/r2bib
.P2
.LP
Now we want to list the user who modified each file. We can change our data
to produce the commands that do that, and send them to a shell.
.P1
.ps -1
; du -a bin | sed 's/.*	//' | sed 's/^/ls -m /' | sed 3q
ls -m bin/386/minimad
ls -m bin/386/minisync
ls -m bin/386/r2bib
;
; du -a bin | sed 's/.*	//' | sed 's/^/ls -m /' | sed 3q | rc
[nemo] bin/386/minimad
[none] bin/386/minisync
[nemo] bin/386/r2bib
;
.ps +1
.P2
.LP
We still have to work a little bit more. And our command line is growing. Being
able to edit the text at any place in a Rio window does help, but it can be
convenient to define a
.B "shell function"
.ix [fn]
that encapsulates what we have done so far. A shell function is like a function
in any other language. The difference is that a shell function receives arguments
as any other command, in the command line. Besides, a shell function has
command lines in its body, which is not a surprise. Defining a function
for what we have done so far can save some typing in the near future.
Furthermore, the command we have just built, to list all the files within a given
directory, is useful by itself.
.P1
; fn lr {
;; du -a $1 | sed 's/.*	//' | sed 's/^/ls -m /' | rc
;; }
; 
.P2
.LP
This defined a function, named
.CW lr ,
.ix [lr]
that executes exactly the command line we developed.
In the function
.CW lr ,
we removed the
.CW "sed 3q"
because it is not reasonable for a function listing all files recursively to
stop after listing three of them. If we want to play, we can always add a
final
.CW sed
in a pipeline. Arguments given to the function are accessed like they would be
in a shell script. The difference is that the function is executed by the shell
where we call it, and not by a child shell. By the way, it is preferable to create
useful commands by creating in a shell, functions can not be edited as scripts, and
are not automatically shared among all shells like files are. Functions are handy to
make modular scripts.
.PP
.CW Rc
stores the function
definition using an
.ix "function definition
environment variable. Thus, most things said for environment variables apply for
functions as well (e.g., think about
.CW "rfork e" ).
.P1
; cat /env/'fn#lr'
fn lr {du -a $1|sed 's/.*	//'|sed 's/^/ls -m /'|rc}
;
.P2
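.LP
Because the definition travels in the environment, a child shell inherits the
function as well. A quick check of this idea (just a sketch):
.P1
; rc -c 'lr bin | sed 1q'
[nemo] bin/386/minimad
;
.P2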
.LP
The builtin function
.CW whatis
.ix [whatis]
is more appropriate to find out what a name is for
.CW rc .
It prints the value
associated with the name in a form that can be used as a command. For example,
here is what
.CW whatis
says about several names known to us.
.P1
; whatis lr
fn lr {du -a $1|sed 's/.*	//'|sed 's/^/ls -m /'|rc}
; whatis cd
builtin cd
; whatis echo path
/bin/echo
path=(. /bin)
;
.P2
.LP
This is more convenient than looking through
.CW /bin ,
.CW /env ,
and the
.I rc (1)
manual page to see what a name is.
Let's try our new function.
.P1
; lr bin
[nemo] bin/386/minimad
[none] bin/386/minisync
[nemo] bin/386/r2bib
[nemo] bin/386/rc2bin
.I "...and many other lines of output..."
;
.P2
.LP
To obtain our list of users, we may remove everything but the user name.
.P1
; lr bin | sed 's/.([a-z0-9]+).*/\e1/' | sed 3q
nemo
none
nemo
;
.P2
.LP
And now, to get a list of users, we must drop duplicates. The program
.CW uniq
.ix [uniq]
.ix "remove duplicates
.ix "unique lines
knows how to do it: it reads lines and prints them, but lines showing up more than once
in the input are printed just once. This program
needs its input lines to be sorted. Therefore, we do what we just did, then sort
the lines and remove duplicate ones.
.P1
; lr bin | sed 's/.([a-z0-9]+).*/\e1/' | sort | uniq
esoriano
nemo
none
;
.P2
.LP
Note that we removed
.CW "sed 3q"
from the pipeline, because now the command does what we wanted to do, and
we want to process the whole file tree, not just the first three files.
It happens that
.CW sort
also knows how to remove duplicate lines, after sorting them. The flag
.CW -u
asks
.CW sort
.ix "[sort] flag~[-u]
to print a unique copy of each output line. We can optimize our command to list
file owners a little bit.
.P1
; lr bin  | sed 's/.([a-z0-9]+).*/\e1/' | sort -u
.P2
.LP
What if we want to list the users owning files at several file trees? Say,
.CW /n/fs1
and
.CW /n/fs2 .
We may have several file servers but might want to list file owners for all of them.
It takes time for
.CW lr
to scan an entire file tree, and it is desirable to process all trees in parallel. The
strategy may be to use several command lines like the one above, to produce
a sorted user list for each file tree. The combined user list can be obtained by
merging both lists, removing duplicates. This is depicted in figure [[!sort merge!]].
.LS
.PS
right
S: [
	down
	FS1: [ right ; box "lr /n/fs1" ; arrow right .2 ; box "sed" ; arrow right .2  ; box "sort" ]
	move
	FS2: [ right ; box "lr /n/fs2" ; arrow right .2  ; box "sed" ; arrow right .2  ; box "sort" ]
]
move
M: box "sort -mu" ; arrow ; box invis "sorted"
arrow from S.FS1.e to M.w+0,.1
arrow from S.FS2.e to M.w-0,.1
.PE
.LE F Obtaining a file owner list using sort to merge two lists for \f(CWfs1\fP and \f(CWfs2\fP
.PP
We define a function
.CW lrusers
.ix [lrusers]
.ix "non-linear pipe
to run each branch of the pipeline. This provides a compact way of executing it,
saves some typing, and improves readability. The output from the two pipelines
is merged using the flag
.CW -m
of
.CW sort ,
which merges two sorted files to produce a single list. The flag
.CW -u
(unique) must be added as well, because the same user could own files in both
file trees, and we want each name to be listed once.
.P1
; fn lrusers { lr $1 | sed 's/.([a-z0-9]+).*/\e1/' | sort }
; sort -mu <{lrusers /n/fs1} <{lrusers /n/fs2}
esoriano
nemo
none
paurea
;
.P2
.LP
For
.CW sort ,
each \f(CW<{\fP...\f(CW}\fP construct is just a file name (as we saw). This is a
simple way to let us use two pipes as the input for a single process.
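.LP
Not just
.CW sort :
other commands expecting file names can take these constructs as arguments too.
A tiny check:
.P1
; cat <{echo one} <{echo two}
one
two
;
.P2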
.PP
To do something different,
we can revisit the first example in the last chapter, finding function definitions.
This script does just that, if we follow the style convention for declaring
functions that was shown at the beginning of this chapter. First, we try
to use
.CW grep
to print just the source line where the function
.CW cat
is defined in the file
.CW /sys/src/cmd/cat.c .
Our first try is this.
.P1
; grep cat /sys/src/cmd/cat.c
cat(int f, char *s)
	argv0 = "cat";
		cat(0, "<stdin>");
			cat(f, argv[i]);
.P2
.LP
Which is not too helpful. All the lines contain the string
.CW cat ,
but we want only the lines where
.CW cat
is at the beginning of line, followed by an open parenthesis. Second
attempt.
.P1
; grep '^cat\e(' /sys/src/cmd/cat.c
cat(int f, char *s)
.P2
.LP
At least, this prints just the line of interest to us. However, it is useful
to get the file name and line number before the text in the line. That
output can be used to point an editor to that particular file and line
number. Because
.CW grep
prints the file name when more than one file is given, we could use
.CW /dev/null
as a second file in which to search for the line. The line would not be there,
but it would make
.CW grep
print the file name. 
.P1
; grep  '^cat\e(' /sys/src/cmd/cat.c /dev/null
/sys/src/cmd/cat.c:cat(int f, char *s)
.P2
.LP
Giving the option
.CW -n
to
.CW grep
.ix "[grep] flag~[-n]
.ix "line number
makes it print the line number. Now we can really search for functions,
like we do next.
.P1
; grep -n '^cat\e(' /sys/src/cmd/*.c
/sys/src/cmd/cat.c:5: cat(int f, char *s)
.P2
.LP
And because this seems useful, we can package it as a shell script. It
accepts as arguments the names for functions to be located. The command
.CW grep
is used to search for those functions in all the source files in the current directory.
.P1
#!/bin/rc
rfork e
for (f in $*)
	grep -n '^'$f'\e('  *.[cCh]
.P2
.LP
How can we use
.CW grep
to search for
.CW -n ?
If we try,
.CW grep
would get confused, thinking that we are supplying an option. To avoid this,
the
.CW -e
option tells
.CW grep
.ix "[grep] flag~[-e]
that what follows is a regexp to search for.
.P1
; cat text
Hi there
How can we grep for -n?
Who knows!
; grep -n text
; grep -e -n text
How can we grep for -n?
.P2
.LP
This program has other useful options. For example, we may want to
locate lines in the file for a chapter of this book where we mention
figures. However, if the word
.CW figure
is in the middle of a sentence it would be all lower-case. When it is
starting a sentence, it would be capitalized. We must search both for
.CW Figure
and
.CW figure .
The flag
.CW -i
makes
.CW grep
.ix "case insensitive
.ix "[grep] flag~[-i]
become case-insensitive. All the text read is converted to lower-case
before matching the expression.
.P1
; grep -i figure ch1.ms
Each window shows a file or the output of commands.  Figure
figure are understood by acme itself. For commands
shown in the figure would be
.I "...and other matching lines
.P2
.LP
A popular searching task is determining whether a file containing a mail message
is spam or not. Today, this would not work, because spammers employ heavy
.ix spam
armoring, and even send their text encoded in multiple images sent as HTML
mail. However, it used to be popular to see if a mail message contained certain
expressions; if it did, it was considered spam.
Because there will be many expressions, we may keep them in a file.
The option
.CW -f
for
.CW grep
.ix "[grep] flag~[-f]
takes as an argument a file containing all the expressions to search for.
.P1
; cat patterns
Make money fast!
Earn 10+ millions
(Take|use) viagra for a (better|best) life.
; if (grep -i -f patterns $mail ) echo $mail is spam
.P2
.ix "[patterns] file
.BS 2 "Searching for changes
.LP
.ix "file differences
.ix "file comparation
A different kind of search is looking for differences. There are
several tools that can be used to compare files. We saw
.CW cmp ,
.ix [cmp]
that compares two files. It
does not give much information, because it is meant to compare files that
are binary and not textual, and the program reports just the first
byte that makes the files different. However, there is another tool,
.CW diff ,
.ix [diff]
that is more useful than
.CW cmp
when applied to text files. Many times,
.CW diff
is used just to compare two files to search for differences. For example, we can
compare the two files
.CW /bin/t+
and
.CW /bin/t- ,
that look similar, to see how they differ. The tool reports what changed
in the first file to obtain the contents in the second one.
.P1
; diff /bin/t+ /bin/t-
2c2,3
< exec sed 's/^/	/'
---
> exec sed 's/^	//'
> 
.P2
.LP
The output shows the minimum set of differences between both files; here we see
just one. Each difference reported starts with a line like
.CW 2c2,3 ,
which explains which lines differ. This tool tries to show
a minimal set of differences, and it will try to aggregate runs of lines that change.
In this way, it can simply say that
several (contiguous) lines in the first file have changed and correspond to a different
set of lines in the second file. In this case, line 2 in the first file (\f(CWt+\fP)
has changed in favor of lines 2 and 3 in the second file. If we replace line 2 in
.CW t+
with lines 2 and 3 from
.CW t- ,
both files would have the same contents.
.PP
After the initial summary,
.CW diff
shows the relevant lines that differ in the first file, preceded by an initial
.CW <
sign to show that they come from the file on the left in the argument list, i.e.,
the first file. Finally, the lines that differ in this case for the second file are
shown. 
Line 3 is an extra
empty line, but for
.CW diff
that is a difference.
If we remove the last empty line in
.CW t- ,
this is what
.CW diff
says:
.P1
; diff /bin/t^(+ -)
2c2
< exec sed 's/^/	/'
---
> exec sed 's/^	//'
.P2
.LP
Let's improve the script. It takes no arguments, so it would be better
to print a diagnostic and exit when any are given.
.so progs/tab.ms
.LP
This is what
.CW diff
says now.
.P1
; diff /bin/t+ tab
1a2,5
> if (! ~ $#* 0){
> 	echo usage: $0 >[1=2]
> 	exit usage
> }
;
.P2
.ix "script diagnostics
.LP
In this case, no line has to
.I change
in
.CW /bin/t+
to obtain the contents of
.CW tab .
However, we must
.I add
lines 2 to 5 from
.CW tab
after line 1 of
.CW /bin/t+ .
This is what
.CW 1a2,5
means.
Reversing the arguments of
.CW diff
produces this:
.P1
; diff tab /bin/t+
2,5d1
< if (! ~ $#* 0){
< 	echo usage: $0 >[1=2]
< 	exit usage
< }
.P2
.LP
Lines 2 to 5 of
.CW tab
must be deleted (they would be after line 1 of
.CW /bin/t+ ),
if we want
.CW tab
to have the same contents of
.CW /bin/t+ .
.PP
Usually, it is more convenient to run
.CW diff
supplying the option
.CW -n ,
.ix "[diff] flag~[-n]
which makes it print the file names along with the line numbers. This is
very useful to easily open any of the files being compared by addressing
the editor to the file and line number.
.P1
; diff -n /bin/t+ tab
/bin/t+:1 a tab:2,5
> if (! ~ $#* 0){
> 	echo usage: $0 >[1=2]
> 	exit usage
> }
.P2
.LP
Some people, though, prefer the
.CW -c
.ix "context [diff]
(context) flag, which makes it clearer what changed by printing a few lines
of context around the ones that changed.
.P1
; diff -nc /bin/t+ tab
/bin/t+:1,2 - tab:1,6
  #!/bin/rc
+ if (! ~ $#* 0){
+ 	echo usage: $0 >[1=2]
+ 	exit usage
+ }
  exec sed 's/^/	/'
;
.P2
.LP
Searching for differences is not restricted to comparing just two files. In many cases
we want to compare two file trees, to see how they differ. For example, after installing
a new Plan 9 in a disk, and using it for some time, you might want to see if there are
changes that you made by mistake. Comparing the file tree in the disk with that used
as the source for the Plan 9 distribution would let you know if that is the case.
.PP
This tool,
.CW diff ,
can be used to compare two directories by giving their names. It works like
above, but compares all the files found in one directory with those in the other.
Of course, now it may be that a given file is just at one directory, but not at
the other.
We are going to copy our whole
.CW $home/bin
to a temporary place to play with changes, instead of using the whole file system. 
.P1
; @{ cd ; tar c bin } | @{ cd /tmp ; tar x }
;
.P2
.LP
Now, we can change
.CW t+
in the temporary copy, by copying the
.CW tab
script we recently made. We will also add a few files to the new file tree and
remove a few other ones.
.P1
; cp tab /tmp/bin/rc/t+
; cp rcecho /tmp/bin/rc
; rm /tmp/bin/rc/^(d2h h2d)
;
.P2
.LP
So, what changed?
The option
.CW -r
asks
.CW diff
to go even further and compare two entire file trees, and not just two directories. It
descends when it finds a directory and recurses to continue the search for differences.
.P1
; diff -r ($home /tmp)^/bin
Only in /usr/nemo/bin/rc: d2h
Only in /usr/nemo/bin/rc: h2d
Only in /tmp/bin/rc: rcecho
diff /usr/nemo/bin/rc/t+ /tmp/bin/rc/t+
1a2,5
> if (! ~ $#* 0){
> 	echo usage: $0 >[1=2]
> 	exit usage
> }
;
.P2
.LP
The files
.CW d2h
and
.CW h2d
are only at
.CW $home/bin/rc ,
because we removed them from the copied tree. The file
.CW rcecho
is only at
.CW /tmp/bin/rc
instead. We created it there. For
.CW diff ,
it would be the same if it had existed at
.CW $home/bin/rc
and we had removed
.CW rcecho
from there.
Also, there is a file that is different,
.CW t+ ,
as we could expect. Everything else remains the same.
.PP
It is now trivial to
answer questions like, which files have been added to our copy of the file tree?
.P1
; diff -r ($home /tmp)^/bin | grep '^Only in /tmp/bin'
Only in /tmp/bin/rc: rcecho
;
.P2
.LP
This is useful for security purposes. From time to time we might check that
a Plan 9 installation does not have files altered by malicious programs or by
user mistakes. If we process the output of
.CW diff ,
comparing the original file tree with the one that exists now,
we can generate the commands needed to restore the tree to its
original state. Here we do this to our little file tree. Files that are only in the new tree
must be deleted to get back to our original tree.
.P1
.ps -2
; diff -r ($home /tmp)^/bin >/tmp/diffs
; grep '^Only in /tmp/' /tmp/diffs | sed -e 's|Only in|rm|' -e 's|: |/|'
rm /tmp/bin/rc/rcecho
.ps +2
.P2
.LP
Files that are only in the old tree have probably been deleted in the new tree,
assuming we did not create them in the old one. We must copy them again.
.P1
.ps -2
; d=/usr/nemo/bin
; grep '^Only in '^$d /tmp/diffs | 
;;  sed 's|Only in '^$d^'/(.+): ([^ ]+)|cp '^$d^'/\e1/\e2 /tmp/bin/\e1|'
cp /usr/nemo/bin/rc/d2h /tmp/bin/rc
cp /usr/nemo/bin/rc/h2d /tmp/bin/rc
.ps +2
.P2
.LP
In this command,
.CW \e1
is the path for the file, relative to the directory being compared, and
.CW \e2
is the file name. We have not used
.CW $home
to keep the command as clear as feasible. To complete our job, we must undo
any change to any file by copying the files that differ.
.P1
; grep '^diff ' /tmp/diffs | sed 's/diff/cp/'
cp /usr/nemo/bin/rc/t+ /tmp/bin/rc/t+
.P2
.LP
All this can be packaged into a script that we might call
.CW restore .
.so progs/restore.ms
.LP
And this is how we can use it.
.P1
; restore
rm /tmp/bin/rc/rcecho
cp /usr/nemo/bin/rc/d2h /tmp/bin/rc
cp /usr/nemo/bin/rc/h2d /tmp/bin/rc
cp /usr/nemo/bin/rc/t+ /tmp/bin/rc/t+
; restore|rc	\fR after having seen what this is going to do!\fP
.P2
.LP
We have a nice script, but pressing
.I Delete
while the script runs may leave an unwanted temporary file.
.ix "temporary file
.P1
; restore $home/bin /tmp/bin
\fBDelete\fP
; lc /tmp
.links				omail.11326.body
A1030.nemoacme			omail.2558.body
ch6.ms				restore.1425
;
.P2
.LP
To fix this
problem, we need to install a note handler like we did before in C. The shell
.ix "shell note~handler
.ix [sighup]
.ix [sigint]
.ix [sigalrm]
gives special treatment to functions with names
.CW sighup ,
.CW sigint ,
and 
.CW sigalrm .
A function
.CW sighup
is called by
.CW rc
when it receives a
.CW hangup
.ix "[hangup] note
note. The same happens for
.CW sigint
with respect to the
.CW interrupt
.ix "[interrupt] note
note and
.CW sigalrm
for the
.CW alarm
note. Adding this to our script makes it remove the temporary file
when the window is deleted or
.I Delete
is pressed.
.P1
fn sigint { rm $diffs }
fn sighup { rm $diffs }
.P2
.LP
This must be done after defining
.CW $diffs .
.BS 2 "AWK
.LP
.ix AWK
There is another extremely useful tool, which remains to be seen.
It is a programming language called
.I AWK .
Awk is meant to process text files consisting of records with multiple fields.
Most data in system and user databases, and much data generated by commands
looks like this. Consider the output of
.CW ps .
.P1
; ps | sed 5q
nemo   1    0:00   0:00     1392K Await    bns
nemo   2    1:09   0:00        0K Wakeme   genrandom
nemo   3    0:00   0:00        0K Wakeme   alarm
nemo   5    0:00   0:00        0K Wakeme   rxmitproc
nemo   6    0:00   0:00      268K Pread    factotum
.P2
.LP
We have multiple lines, which would be records for AWK.
All the lines we see contain different parts carrying different data, tabulated. In this
case, each different part in a line is delimited by white space. For AWK, each
part would be a field. This is our first AWK program. It prints the  user names for
owners of processes running in this system. Similar to what could be achieved
by using
.CW sed .
.P1
; ps | awk '{print $1}'
nemo
nemo
.I ...
; ps | sed 's/ .*//'
nemo
nemo
.I ...
.P2
.LP
The program for AWK was given as its only argument, quoted to escape it from
the shell.
AWK executed the program to process its standard input, because no file to process
was given as an argument. In this case, the program prints the first field for any
line. As you can see, AWK is very handy to cut columns of files for further processing.
There is a command in most UNIX machines named
.CW cut ,
that does precisely this, but using AWK suffices.
If we sort the set of user names and remove duplicates, we can know who is
using the machine.
.P1
; ps | awk '{print $1}' | sort -u
nemo
none
;
.P2
.LP
In general, an AWK program consists of a series of statements, of the form
.ix "AWK statement
.P1
.I "pattern \f(CW{\fP action \f(CW}\fP".
.P2
.LP
Each record is matched against the
.I pattern ,
and the
.I action
is executed for every record that matches it. In our program, there was no
pattern. In this case, AWK executes the action for
.I all
the records. Actions are programmed using a syntax similar to C, using functions
that are either built into AWK or defined by the user. The most commonly used one
is
.CW print ,
which prints its arguments.
.PP
In AWK we have some predefined variables, and we can define our own.
.ix "AWK variables
Variables can be strings, integers, floating point numbers, and arrays. As a convenience,
AWK defines a new variable the first time you use it; there is no need to declare it.
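.PP
For example, an array indexed by user name can count how many processes each
user owns, without resorting to
.CW sort .
This one-liner is just a sketch, not one of the sessions above.
.P1
; ps | awk '{ n[$1]++ } END { for (u in n) print u, n[u] }'
.I "...one line per user, with its process count"
.P2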
.PP
The predefined variable
.CW $1
is a string with the text from the first field. Because the action where
.CW $1
appears is executed for each
record,
.CW $1
would be the first field of the record being processed. In our program, each time
.CW "print $1"
is executed for a line,
.CW $1
refers to the first field for that line. In the same way,
.CW $2
is the second field and so on. This is how we can list the names for the
processes in our system.
.P1
; ps | awk '{print $7}'
genrandom
alarm
rxmitproc
factotum
fossil
.I ...
.P2
.LP
It may be easier to use
.ix "line fields
AWK to cut fields than using sed, because splitting a line into fields is a natural thing
for the former. White space between different fields might be repeated to tabulate the data,
but AWK managed nicely to identify field number 7.
.PP
The predefined variable
.CW $0
represents the whole record. We can use it along with the variable
.CW NR ,
which holds an integer with the record number, to number the lines in a file.
.so progs/number.ms
.LP
We have used the AWK function
.CW printf ,
which works like the one in the C library. It provides more control for the output
format. Also, we pass the entire argument list to AWK, which would process the
files given as arguments or the standard input depending on how we call the
script.
.P1
; number number
   1 #!/bin/rc
   2 awk '{ printf("%4d %s\n", NR, $0); }' $*
;
.P2
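.LP
Because
.CW $*
expands to nothing when no file names are given, the same script also works as
a filter for its standard input. A quick check:
.P1
; seq 3 | number
   1 1
   2 2
   3 3
;
.P2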
.LP
In general, it is usual to wrap AWK programs using shell scripts. The input
for AWK may be processed by other shell commands, and the same might
happen to its output.
.PP
To operate on arbitrary records, you may specify a pattern for an action.
A pattern is a relational expression, a regular expression, or a combination of
both kinds of expressions. This example
uses
.CW NR
to print only records 3 to 5.
.ix "AWK pattern
.P1
; awk 'NR >= 3 && NR <=5 {print $0}' /LICENSE
with the following notable exceptions:

1. No right is granted to create derivative works of or
.P2
.LP
Here,
.CW "NR >=3 && NR <= 5"
is a relational expression. It does an
.I and
of two expressions. Only records with
.CW NR
between 3 and 5 match the pattern. As a result,
.CW print
is executed just for lines 3 through 5.
Because syntax is like in C, it is easy to get started. Just try.
Printing the entire record, i.e.,
.CW $0 ,
is so common, that
.CW print
prints that by default. This is equivalent to the previous command.
.P1
; awk 'NR >=3 && NR <= 5 {print}' /LICENSE
.P2
.LP
Even more, the default action is to print the entire record. This is also
equivalent to our command.
.P1
; awk 'NR >=3 && NR <= 5' /LICENSE
.P2
.LP
By the way, in this particular case, using
.CW sed
might have been simpler.
.P1
; sed -n 3,5p /LICENSE
with the following notable exceptions:

1. No right is granted to create derivative works of or
;
.P2
.LP
Still, AWK may be preferred if more complex processing is needed, because it provides
a full programming language. For example, this prints only even lines and stops at
line 6.
.P1
.ps -2
; awk 'NR%2 == 0 && NR <= 6' /LICENSE
Lucent Public License, Version 1.02, reproduced below,

   to redistribute (other than with the Plan 9 Operating System)
.ps +2
.P2
.LP
It is common to search for processes with a given name. We used grep for this task.
But in some cases, unwanted lines may get through.
.P1
.ps -1
; ps | grep rio
nemo             39    0:04   0:16     1160K Rendez   rio
nemo            275    0:01   0:07     1160K Pread    rio
nemo           2602    0:00   0:00      248K Await    rioban
nemo            277    0:00   0:00     1160K Pread    rio
nemo           2607    0:00   0:00      248K Await    brio
nemo            280    0:00   0:00     1160K Pread    rio
.I ...
.ps +1
.P2
.LP
We could filter them out using a better
.CW grep
pattern.
.P1
.ps -1
; ps | grep 'rio$'
nemo             39    0:04   0:16     1160K Rendez   rio
nemo            275    0:01   0:07     1160K Pread    rio
nemo            277    0:00   0:00     1160K Pread    rio
nemo           2607    0:00   0:00      248K Await    brio
nemo            280    0:00   0:00     1160K Pread    rio
.I ...
; ps | grep ' rio$'
nemo             39    0:04   0:16     1160K Rendez   rio
nemo            275    0:01   0:07     1160K Pread    rio
nemo            277    0:00   0:00     1160K Pread    rio
nemo            280    0:00   0:00     1160K Pread    rio
.I ...
.ps +1
.P2
.LP
But AWK just knows how to split a line into fields.
.P1
.ps -1
; ps | awk '$7 ~ /^rio$/'
nemo             39    0:04   0:16     1160K Rendez   rio
nemo            275    0:01   0:07     1160K Pread    rio
nemo            277    0:00   0:00     1160K Pread    rio
nemo            280    0:00   0:00     1160K Pread    rio
.I ...
.ps +1
.P2
.LP
This AWK program uses a pattern that requires field number 7
to match the pattern
.CW /^rio$/ .
As you know, by default, the action is to print the matching record.
The operator
.CW ~
yields true when the string on its left matches the expression on its right,
which is a regular expression enclosed between two slashes. The pattern we used required
.I all
of field number 7 to be just
.CW rio ,
because we used
.CW ^
and
.CW $
to require
.CW rio
to be right after the
.I start
of the field, and before the
.I end
of the field. As we said,
.CW ^
and
.CW $
mean the start of the text being matched and its end. Whether the
text is just a field, a line, or the entire file depends on the program using the
regexp.
.LP
It is easy now to list process pids for
.CW rio
that belong to user
.CW nemo .
.P1
; ps | awk '$7 ~ /^rio$/ && $1 ~ /^nemo$/ {print $2}'
39
275
277
280
.I ...
.P2
.LP
How do we kill broken processes? AWK may help.
.P1
.ps -1
; ps |awk '$6 ~ /Broken/ {printf("echo kill >/proc/%s/ctl\n", $2);}'
echo kill >/proc/1010/ctl
echo kill >/proc/2602/ctl
.ps +1
.P2
.LP
The 6th field must be
.CW Broken ,
.ix [Broken]
and to kill the process we can write
.CW kill
.ix [kill]
to the process control file. The 2nd field is the pid and can be used to
generate the file path. Note that in this case the expression matched against
the 6th field is just
.CW /Broken/ ,
which matches any string containing
.CW Broken .
That suffices here, and we do not need to use
.CW ^
and
.CW $ .
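.PP
As we did before with the
.CW ls
commands, these generated command lines can be sent to a shell to actually
kill the processes (after checking what is going to run, of course).
.P1
.ps -1
; ps |awk '$6 ~ /Broken/ {printf("echo kill >/proc/%s/ctl\n", $2);}' | rc
;
.ps +1
.P2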
.PP
Which one is the biggest process, in terms of memory consumption? The 5th
field from the output of
.CW ps
reports how much memory a process is using. We could use our known tools
to answer this question. The argument
.CW +4r
for
.CW sort
asks it to sort the lines using as the key the text starting after the fourth field,
i.e., the memory field. This is a lexical
sort, but it suffices. The
.ix "reverse sort
.CW r
means
.I reverse
sort, to get biggest processes first. And we can use
.CW sed
to print just the first line and only the memory usage.
.P1
.ps -1
; ps | sort +4r 
nemo           3899    0:01   0:00    11844K Pread    gs
nemo             18    0:00   0:00     9412K Sleep    fossil
.I "...and more fossils
nemo             33    0:00   0:00     1536K Sleep    bns
nemo             39    0:09   0:33     1276K Rendez   rio
nemo            278    0:00   0:00     1276K Rendez   rio
nemo            275    0:02   0:14     1276K Pread    rio
.I "...and many others.
; ps | sort +4r | sed 1q
nemo           3899    0:01   0:00    11844K Pread    gs
; ps | sort +4r | sed -e 's/.* ([0-9]+K).*/\1/' -e 1q
11844K
.ps +1
.P2
.LP
We exploited that the memory usage field terminates in an upper-case
.CW K ,
and is preceded by a white space. This is not perfect, but it works.
We can improve this by using AWK, which is simpler and works better.
.P1
; ps | sort +4r | sed 1q | awk '{print $5}'
11844K
.P2
.LP
The
.CW sed
can be removed if we ask AWK to exit after printing the 5th field for the first
record, because that is the biggest one.
.P1
; ps | sort +4r | awk '{print $5; exit}'
11844K
.P2
.LP
And we could get rid of
.CW sort
as well. We can define a variable in the AWK program to keep track of
the maximum memory usage, and output that value after all the records have
.ix "memory usage
been processed. But we need to learn more about AWK to achieve this.
.PP
To compute the maximum of a set of numbers, assuming one number per
input line, we may set a ridiculously low initial value for the maximum and update
its value as we see a bigger value. It is better to take the first value as the initial
maximum, but let's forget about it. We can use two special patterns,
.CW BEGIN ,
and
.CW END .
.ix "[BEGIN] pattern
.ix "[END] pattern
The former executes its action
.I before
processing any record from the input. The latter executes its action
.I after
processing all the input. Those are nice placeholders to put code that must
be executed initially or at the end. For example, this AWK program computes the
total sum and average for a list of numbers.
.P1
; seq 5000 | awk '
;; BEGIN { sum=0.0 }
;; { sum += $1 }
;; END { print sum, sum/NR }
;; '
12502500 2500.5
.P2
.LP
Remember that
.CW ;;
is printed by the shell, and not part of the AWK program. We have used
.CW seq
.ix [seq]
to print some numbers to test our script. And, as you can see, the syntax for actions
is similar to that of C. But note that a statement is also delimited by a newline or a closing
brace, and
we do not need to add semicolons to terminate them. What did this program do?
Before even processing the first line,
the action of
.CW BEGIN
was executed. This sets the variable
.CW sum
to
.CW 0.0 .
Because the value is a floating point number, the variable has that type. Then, record
after record, the action without a pattern was executed, updating
.CW sum .
At last, the action for
.CW END
printed the outcome. By dividing the total by the number of records (i.e., of lines, or numbers)
we compute the average.
.PP
As an aside,
it can be funny to note that there are many AWK programs with only an action
for
.CW BEGIN .
That is a trick played to exploit this language to evaluate complex expressions
from the shell. Another contender for hoc.
.P1
; awk 'BEGIN {print sqrt(2) * log(4.3)}'
2.06279
; awk 'BEGIN {PI=3.1415926; print PI * 3.7^2}'
43.0084
.P2
.LP
This program is closer to what we want to do to determine
which process is the biggest one. It computes the maximum of a list of numbers.
.ix "maximum
.P1
; seq 5000 | awk '
;; BEGIN { max=0 }
;; {	if (max < $1)
;; 		max=$1
;; }
;; END { print max }
;; '
5000	\fICorrect?\fP
.P2
.LP
This time, the action for all the records in the input updates
.CW max ,
to keep track of the biggest value. Because
.CW max
was first used in a context requiring an integer (assigned 0), it is an integer. Let's now
try our real task.
.P1
; ps | awk '
;; BEGIN { max=0 }
;; {	if (max < $5)
;; 		max=$5
;; }
;; END { print max }
;; '
9412K	\fIWrong! because it should have said...\fP
; ps | sort +4r | awk '{print $5; exit}'
11844K
.P2
.LP
What happens is that
.CW 11844K
is not bigger than
.CW 9412K .
Not as a string.
.P1
; awk 'BEGIN { if ("11844K" > "9412K") print "bigger" }'
;
.P2
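.LP
Forcing both values to be numbers (by adding 0 to them, which is what we do
next) makes the comparison behave as expected.
.P1
; awk 'BEGIN { if ("11844K"+0 > "9412K"+0) print "bigger" }'
bigger
;
.P2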
.LP
Watch out for this kind of mistake. It is common, and is a side effect of AWK's efforts
to simplify things for you by inferring variable types as you use
them.
We must force AWK to take the 5th field as a number, and not as a string.
.P1
; ps | awk '
;; BEGIN { max=0 }
;; {	mem= $5+0
;; 	if (max < mem)
;; 		max=mem
;; }
;; END { print max }
;; '
11844
.P2
.LP
Adding
.CW 0
to
.CW $5
forced the (string) value in
.CW $5
to be understood as a integer value. Therefore,
.CW mem
is now an integer with the numeric value from the 5th field. Where is the “\f(CWK\fP”?
When converting the string to an integer, AWK stopped when it found the “\f(CWK\fP”.
Therefore, this forced conversion has the nice side effect of getting rid of the final
letter after the memory size.
It seems simple to compute the average process (memory)
size, doesn't it? AWK lets you do many things, easily.
.ix "average process
.P1
; ps | awk '
;; BEGIN { tot=0}
;; { tot += $5+0 }
;; END { print tot, tot/NR }
;; '
319956 2499.66
.P2
.BS 2 "Processing data
.LP
.ix "data processing
.ix "student account
Each semester, we must open student accounts to let them use the machines.
This seems to be just the job for AWK and a few shell commands,
and that is the tool we use. We take the list for students in the weird format
that each semester the bureaucrats in the administration building invent just
to keep us entertained. This format may look like this list.
.so progs/list.ms
.ix "[list] AWK~script
.LP
We want to write a program, called
.CW list2usr
that takes this list as its input and helps to open the student accounts. But
before doing anything, we must get rid of empty lines and the comments nicely
placed after
.CW #
signs in the original file.
.P1
; awk '
;; /^#/	{ next }
;; /^$/	{ next }
;; 	{ print }
;; ' list
2341|Rodolfo Martínez|Operating Systems|B|ESCET
6542|Joe Black|Operating Systems|B|ESCET
23467|Luis Ibáñez|Operating Systems|B|ESCET
23341|Ricardo Martínez|Operating Systems|B|ESCET
7653|José Prieto|Computer Networks|A|ESCET
.P2
.LP
.ix "AWK program
There are several new things in this program. For the first time, we have multiple patterns
for input lines. The first pattern matches lines with an initial
.CW # ,
and the second matches empty lines. Each pattern is just a regular expression,
which is a shorthand for matching it against
.CW $0 .
This is equivalent to the first statement of our program.
.P1
$0 ~ /^#/	{ next }
.P2
.LP
Second, we have used
.CW next
to skip an input record. When a line matches the comment pattern, AWK executes
.CW next .
This skips to the next input record, effectively throwing away the input line.
But look at this other program.
.P1
; awk '
;; 	{ print }
;; /^#/	{ next }
;; /^$/	{ next }
;; ' list
# List of students in the random format for this semester
# you only know the format when you see it.
.ix "[next] AWK~command
.ix "skip record
.I ...
.P2
.LP
It does
.I not
ignore comments nor empty lines. AWK executes the statements in the order you
.ix "ignore comment
wrote them. It reads one record after another and executes, in order,
all the statements
with a matching pattern. Lines with comments match the first and the second statement.
But it does not help to skip to the
.CW next
input record once you have printed it. The same happens to empty lines.
.ix "input record
.PP
Now that we know how to get rid of weird lines, we can proceed.
To create accounts for all students in the course in Operating Systems, group B,
we must first select lines for that course and group. This semester, fields are
delimited by a vertical bar, the course field is the 3rd, and the group field is the 4th.
This may help.
.P1
; awk '-F|' '
;; /^#/	{ next }
;; /^$/	{ next }
;; $3 ~ /Operating Systems/ && $4 ~ /B/	{ print $2 }
;; ' list
Rodolfo Martínez
Joe Black
Luis Ibáñez
Ricardo Martínez
;
.P2
.LP
We had to tell AWK how fields are delimited using
.ix "field delimiter
.ix "[awk] flag~[-F]
.CW -F| ,
quoting it from the shell. This option sets the characters used to delimit fields,
i.e., the field delimiter. Although it admits as an argument a regular expression, saying
just
.CW |
suffices for us now.
We also had to
match the 3rd and 4th fields against desired values, and print the student name
for matching records.
.PP
Our plan is as follows. We are going to assume that a program
.CW adduser
exists. If it does not, we can always create it for our own purposes. Furthermore,
we assume that we must give the desired user name and the full student name
as arguments to this program, like in
.P1
; adduser rmartinez Rodolfo Martínez
.P2
.LP
Because it is
not clear how to do all this, we experiment using the shell before placing all the
bits and pieces into our
.CW list2usr
shell script.
.ix [list2usr] [rc]~script
.PP
One way to invent a user name for each student
is to pick the initial for the first name, and add the last name.
We can use
.CW sed
for the job.
.P1
; name='Luis Ibáñez'
; echo $name |  sed 's/(.)[^ ]+[ ]+(.*)/\e1\e2/'
LIbáñez
; name='José Martínez'
; echo $name | sed 's/(.)[^ ]+[ ]+(.*)/\e1\e2/'
JMartínez
.P2
.LP
But the user name looks funny; we should translate it to lower case and, to avoid
problems for this user name when used in UNIX, translate accented characters to
their ascii equivalents. Admittedly, this works only for Spanish names, because other
names might use different non-ascii characters and we wouldn't be helping our UNIX
systems.
.P1
; echo LIbáñez | tr A-Z a-z | tr '[áéíóúñ]' '[aeioun]'
libanez
;
.P2
.LP
But the generated user name may be already taken by another user.
If that is the case, we might try to take the first name, and add the initial from
the last name.
If this user name is also already taken, we might try a few other
combinations, but we won't do it here.
.P1
; name='Luis Ibáñez'
; echo $name | sed 's/([^ ]+)[ ]+(.).*/\e1\e2/' |
;; tr A-Z a-z | tr '[áéíóúñ]' '[aeioun]'
luisi
.P2
.LP
How do we know if a user name is taken? That depends on the system where
the accounts are to be created. In general, there is a text file on the system that
lists user accounts. In Plan 9, the file
.CW /adm/users
lists users known to the file server machine. This is an example.
.P1
; sed 4q /adm/users
adm:adm:adm:elf,sys
aeverlet:aeverlet:aeverlet:
agomez:agomez:agomez:
albertop:albertop::
.P2
.LP
The second field is the user name, according to the manual page for our file
server program,
.I fossil (4).
.ix [fossil]
As a result, this is how we can know if a user name can be used for a new user.
.P1
.ps -1
; grep -s '^[^:]+:'^$user^':' /adm/users && echo $user exists 
nemo exists
; grep -s '^[^:]+:'^rjim^':' /adm/users && echo rjim exists 
.ps +1
.P2
.LP
The flag
.CW -s
asks
.CW grep
.ix "[grep] flag~[-s]
.ix "silent [grep]
to remain silent, and only report the appropriate exit status, which is what
we want.
In our little experiment, searching for
.CW $user
in the second field of
.CW /adm/users
succeeds, as it could be expected. On the contrary, there is no
.CW rjim
known to our file server. That could be a valid user name to add.
.PP
There is still a little bit of a problem. User names that we add can no longer
be used for new user names. What we can do is to maintain our own
.CW users
file, created initially by copying
.CW /adm/users ,
.ix [/adm/users]
and adding our own entry to this file each time we produce an output line
to add a new user name.
.PP
We have all the pieces. Before discussing this any further, let's show the resulting
script.
.so progs/list2usr.ms
.LP
We have defined several functions, instead of merging it all in a single, huge,
command line. The
.CW listusers
function is our starting point. It encapsulates nicely the AWK program to list
just the student names for our course and group. The script arguments are given
to the function, which passes them to AWK. The next couple of commands are our
translations to use only lower-case ascii characters for user names.
.PP
The functions
.CW uname1
and
.CW uname2
encapsulate our two methods for generating a user name. They receive the full
student name and print the proposed user name. But we may need to try both
if the first one yields an existing user name. What we do is to read, one line
at a time, the output from
.P1
listusers $* | tr A-Z a-z | tr '[áéíóúñ]' '[aeioun]'
.P2
.LP
using a
.CW while
loop and the
.CW read
command, which reads a single line from the input. Each line read is placed in
.CW $name ,
to be processed in the body of the
.CW while .
And now we can try to add a user using each method.
.PP
To try to add an account, we defined the function
.CW add .
It determines if the account exists as we saw. If it does, it sets
.CW status
to a non-null value, which is taken as a failure by the one calling the function.
Otherwise, it sets a null status after printing the command to add the account,
and adding a fake entry to our
.CW users
file. In the future, this user name will be considered to exist, even though it may not
be in the real
.CW /adm/users .
.PP
Finally, note how the script catches
.CW interrupt
and
.CW hangup
.ix "[interrupt] note
.ix "[hangup] note
.ix "[rc] note~handler
notes by defining two functions, to remove the temporary file for the user list. Note
also how we print a message when the program fails to determine a user name
for the new user. And this is it!
.P1
; list2usr list
adduser rmartinez rodolfo martinez
adduser jblack joe black
adduser libanez luis ibanez
adduser ricardom ricardo martinez
.P2
.LP
We admit that, depending on the number of students, it might be more trouble
to write this program than to open the accounts by hand. However, in
.I all
semesters to follow, we can prepare the student accounts amazingly fast!
And there is another thing to take into account. Humans make mistakes; programs
do so less often. Using our new tool we are not likely to make mistakes by
adding an account with a duplicate user name.
.PP
After each semester, we must issue grades to students. Depending on the
course, there are several separate parts (e.g., problems in an exam) that
contribute to the total grade. We can reuse a lot from our script to prepare
a text file where we can write down grades.
.so progs/list2grades.ms
.ix "[list2grades] [rc]~script
.LP
Note how we integrated
.CW $nquestions
in the AWK program, by closing the quote for the program right before
it, and reopening it again.
This program produces this output.
.P1
; list2grades list
Name                   	Q-1	Q-2	Q-3	Total
Rodolfo Martínez        -	-	-	-
Joe Black         	-	-	-	-
Luis Ibáñez           	-	-	-	-
Ricardo Martínez       	-	-	-	-
.P2
.LP
We just have to fill in the blanks with the grades. And of course, it does not pay
to compute the final (total) grade by hand. The resulting file may be processed
using AWK to do anything you want. You might send the grades by email
to students, by keeping their user names within the list. You might convert this
into HTML and publish it via your web server, or do any other thing you see fit.
Once the scripts are done after the first semester, they can be used forever.
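.LP
For instance, once the grades are filled in, AWK could compute the Total column.
This is just a sketch; it assumes the file is called
.CW grades ,
that its columns are separated by tabs, and that there are three question columns.
.P1
; awk -F'\t' 'NR > 1 { print $1, $2+$3+$4 }' grades
.I "...one line per student, with the total grade"
.P2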
.PP
And what happens when the bureaucrats change the format for the input list?
You just have to tweak a little bit
.CW listusers ,
and it all will work. If this happens often, it might pay to put
.CW listusers
into a separate script so that you do not need to edit all the scripts using it.
.BS 2 "File systems
.LP
There are many other tools available. Perhaps surprisingly (or not?) they are just
file servers.
As we saw, a
.B "file server"
is just a process serving files. In Plan 9, a file server
serves a file tree
to provide some service. The tree is implemented by a particular
data organization, perhaps just kept in the memory of the file server process.
This data organization used to serve files is known as a
.B "file system" .
Before reading this book, you might have thought that a file system is just some way
to organize files on a disk. Now you know that this does not need to be the case.
In many cases, the program that understands (e.g., serves) a particular file system
is also called a file system, perhaps confusingly. But that is just to avoid saying
“the file server program that understands the file system...”
.PP
All device drivers, listed in section 3 of the manual, provide their interface through
the file tree they serve. Many device drivers correspond
to real hardware devices. Others provide a particular service, implemented
purely in software. But in any case, as you saw before, it is a matter of knowing which
files provide the interface for the device of interest, and how to use them.
The same idea applies in many other cases.
Many tools in Plan 9, listed in section 4
of the manual, adopt the form of a file server.
.PP
For example, various archive formats are understood by programs like
.CW fs/tarfs
.ix [tarfs]
(which understands tape archives with
.I tar (1)
format),
.CW fs/zipfs
.ix [zipfs]
(which understands ZIP files), etc.
Consider the tar file with music that we created some time ago,
.P1
; tar tf /tmp/music.tar
alanparsons/
alanparsons/irobot.mp3
alanparsons/whatgoesup.mp3
pausini/
pausini/trateilmare.mp3
supertramp/
supertramp/logical.mp3
.P2
.LP
We can use
.CW tarfs
to browse through the archive as if files were already extracted.
The program
.CW tarfs
reads the archive and provides a (read-only) file system that reflects the
contents in the archive. It mounts itself by default at
.CW /n/tapefs ,
but we may ask the program to mount itself at a different path using the
.CW -m
option.
.P1
; fs/tarfs -m /n/tar /tmp/music.tar
; ns | grep tar
mount -c '#|/data1' /n/tar 
.P2
.LP
The device
.CW #|
is the
.I pipe (3)
device. Pipes are created by mounting this device (this is what
.ix "pipe device
.ix "[#|] device~driver
.I pipe (2)
does). The file
.CW '#|/data1'
.ix "[#|/data1]
.ix "end~of pipe
is one end of a pipe, which was mounted by
.CW tarfs
at
.CW /n/tar .
At the other end of the pipe,
.CW tarfs
is speaking 9P, to supply the file tree for the archive that we have mounted.
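.LP
We can play with the pipe device on its own. Each bind of
.CW #|
creates a new pipe, seen as a little directory holding both ends (the
mount point used here is arbitrary):
.P1
; bind '#|' /n/mypipe
; lc /n/mypipe
data	data1
; echo hello >/n/mypipe/data
; read </n/mypipe/data1
hello
.P2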
.PP
The file tree at
.CW /n/tar
permits browsing the files in the archive, and doing anything with them
(other than writing or modifying the file tree).
.P1
; lc /n/tar
alanparsons	pausini		supertramp
; lc /n/tar/alanparsons
irobot.mp3	whatgoesup.mp3
; cp /n/tar/alanparsons/irobot.mp3 /tmp
;
.P2
.LP
The program terminates itself when its file tree is finally unmounted.
.P1
; ps | grep tarfs
nemo            769    0:00   0:00  88K Pread    tarfs
; unmount /n/tar
; ps | grep tarfs
;
.P2
.LP
The shell, along with the many commands that operate on files, represents
a useful toolbox for getting things done. Even more so if you consider the various file
servers that are included in the system.
.PP
Imagine that you have an audio CD and want to store its songs, in MP3 format,
at
.CW /n/music/album .
The program
.CW cdfs
.ix [cdfs]
.ix "CD file~system
provides a file tree to operate on CDROMs. After inserting an audio CD in the
CD reader, accessed through the file
.CW /dev/sdD0 ,
we can list its contents at
.CW /mnt/cd .
.P1
; cdfs -d /dev/sdD0
; lc /mnt/cd
a000	a002	a004	a006	a008	a010
a001	a003	a005	a007	a009	ctl
.P2
.LP
Here, files
.CW a000
to
.CW a010
correspond to
.I audio
tracks in the CD. We can convert each file to MP3 using a tool like
.CW mp3enc .
.ix "CD burn
.ix "audio CD
.P1
; for (track in /mnt/cd/a*) {
;;  name=`{basename $track}
;;  mp3enc $track /n/music/album/$name.mp3
;; }
.I "...all tracks being encoded in MP3..."
.P2
.LP
It happens that
.CW cdfs
knows how to (re)write CDs. This example, taken from the
.I cdfs (4)
manual page, shows how to duplicate an audio CD.
.P1
.I "First, insert the source audio CD.
; cdfs -d /dev/sdD0
; mkdir /tmp/songs
; cp /mnt/cd/a* /tmp/songs
; unmount /mnt/cd
.I "Now, insert a blank CD.
; cdfs -d /dev/sdD0
; lc /mnt/cd
ctl    wa     wd
; cp /tmp/songs/* /mnt/cd/wa	\fRto copy songs as audio tracks\fP
; rm /mnt/cd/wa			\fRto fixate the disk contents\fP
; unmount /mnt/cd
.P2
.LP
For a blank CD,
.ix "blank CD
.CW cdfs
presents two directories in its file tree:
.CW wa
and
.CW wd .
Files copied into
.CW wa
are burned as audio tracks. Files copied into
.CW wd
are burned as data tracks. Removing either directory fixates the disk, closing
the disk table of contents.
.PP
If the disk is re-writable, and had some data on it, we can even get rid of
the previous contents by sweeping through the whole disk, blanking it. It would
be as good as new (a little bit thinner, admittedly).
.P1
; echo blank >/mnt/cd/ctl
.I "blanking in progress...
.P2
.LP
When you know that this will not be the
last time you do something, writing a small shell script will save time
in the future. Copying a CD seems to be a popular enough task to qualify.
.so progs/cdcopy.ms
.ix "CD copy
.ix "[cdcopy] [rc]~script
.PP
The script copies a lot of data at
.CW /tmp/songs.$pid .
Hitting
.I Delete
might leave those files there by mistake. One fix would be to define a
.CW sigint
function. However, 
provided that machines have plenty of memory, 
there is another file system that might help.
The program
.CW ramfs
.ix [ramfs]
.ix "ram file~system
supplies a read/write file system that is kept in memory. It uses dynamic memory
to hold the data for the files created in its file tree.
.CW Ramfs
mounts itself by default at
.CW /tmp .
So, adding a line
.P1
ramfs -c
.P2
.LP
before using
.CW /tmp
in the script will ensure that no files are left by mistake in
.CW $home/tmp
(which is what is mounted at
.CW /tmp
by convention).
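.LP
That is, the start of the script could look like this sketch (the
.CW rfork
keeps the new mount private to the script, so your own
.CW /tmp
is not touched):
.P1
#!/bin/rc
rfork n			# work on a copy of the name space
ramfs -c		# memory-only /tmp for this run
songs=/tmp/songs.$pid
mkdir $songs
.P2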
.PP
Like most other file servers listed in section 4 of the manual,
.CW ramfs
accepts flags
.CW -abc
to mount itself
.I after
or
.I before
what is already mounted at the mount point, and to allow file
.I creation .
Two other popular options are
.CW -m
.I dir ,
to choose where to mount its file tree, and
.CW -s
.I srvfile ,
to ask
.CW ramfs
to post a file at
.CW /srv ,
for mounting it later.
Using these flags, we may be able to compile programs in directories where we do not
have permission to write.
.P1
; ramfs -bc -m /sys/src/cmd
; cd /sys/src/cmd
; 8c -FVw cat.c
; 8l -o 8.cat cat.8
; lc 8.* cat.*
8.cat	cat.8	cat.c
; rm 8.cat cat.8
.P2
.LP
After mounting
.CW ramfs
with
.CW -bc
at
.CW /sys/src/cmd ,
new files created in this directory will be created in the file tree served by
.CW ramfs ,
and not in the real
.CW /sys/src/cmd .
The compiler and the loader will be able to create their output files, and we will
neither require permission to write in that directory, nor leave unwanted object
files there.
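.LP
The
.CW -s
option mentioned above can be handy as well. Assuming the name given is the
one posted at
.CW /srv ,
we could keep a scratch tree around and attach it only when and where needed:
.P1
; ramfs -s scratch
; mount -c /srv/scratch /n/scratch	\fRattach it, here or in another window\fP
.P2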
.PP
The important point here is not how to copy a CD, or how to use
.CW ramfs .
It is to note that there are many different
programs that allow you to use devices and to do things through a file interface.
.PP
When undertaking a particular task, it is useful to know which
file system tools are available. Browsing through the system manual, just to see
what is there,
will prove to be an invaluable help and will save you time in the future.
.SH
Problems
.IP 1
Write a script that copies all the files at
.CW $home/www
terminated in
.CW .htm
to files terminated in
.CW .html .
.IP 2
Write a script that edits the HTML in those files to refer always to
.CW .html
files and not to
.CW .htm
files.
.IP 3
Write a script that checks that URLs in your web pages are not broken. Use
the
.CW hget
command to probe your links.
.IP 4
Write a script to replace duplicate empty lines with a single empty line.
.IP 5
Write a script to generate (empty) C function definitions from text containing
the function prototypes.
.IP 6
Do the opposite. Generate C function prototypes from function definitions.
.IP 7
Write a script to alert you by e-mail when there are new messages in a web
discussion group. The mail must contain a portion of the relevant text and
a link to jump to the relevant web page.
.br
.I Hint:
The program
.CW htmlfmt
may be of help.
.IP 8
Improve the scripts you wrote as answers to the problems of the previous chapter,
using regular expressions.
.ds CH
.bp
 .
.bp