TGCW48
				   
   Procedure for Loading Texts Received from Data Capture Agencies
				   
			    Dominic Dunlop
				   
			     March, 1993


This note documents what to do here at OUCS on receiving a tape,
diskettes, or anything else carrying corpus files from a data capture
agency (Chambers, Longman, OUP).  It applies both to original texts,
and to texts resubmitted, having been bounced back to the data capture
agency the first time around.  (See also TGCW49 on resubmitted texts.)
I apologise in advance for the arbitrary and less than logical nature
of the procedure -- it just grew, rather than having been designed.

I've assumed throughout that you're using the C shell (csh).

 0.  You can do all of this except the last step logged in as yourself:
     you do not have to be su'ed to natcorp, root, or any other user ID.
     But it's OK to do it su'ed to natcorp if you want to.

 1.  Take a wild guess at the volume of data involved.  If you take the
     supposed number of words, and divide by 100, you'll have a
     conservative estimate of the number of kilobytes.
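
     As a rough sketch of that arithmetic (written for sh rather than
     csh; estimate_kb is a made-up name, not an existing tool):
     dividing the word count by 100 allows roughly ten bytes per word.

```shell
# estimate_kb WORDS: conservative size estimate in kilobytes (words / 100).
# The function name is illustrative only.
estimate_kb() {
    echo $(( $1 / 100 ))
}

estimate_kb 1000000   # a supposed million words -> 10000 (kilobytes)
```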

 2.  Run  cdf.  This command lists the disk partitions dedicated to
     corpus data, and shows how much free space there is in each.  For
     example:

	% ~natcorp/bin/cdf
	Filesystem            kbytes    used   avail capacity  Mounted on
	/dev/sd5b             152798  123003   14515    89%    /partitions/05
	/dev/sd1b             152798  110429   27089    80%    /partitions/09
	/dev/sd2a             152798   89390   48128    65%    /partitions/12
	/dev/sd2b             152798  111192   26326    81%    /partitions/13
	/dev/sd6a             243205   68112  150772    31%    /partitions/16
	/dev/sd6b             243205      16  218868     0%    /partitions/17
	/dev/sd6d             243205      16  218868     0%    /partitions/18
	/dev/sd6e             243205      16  218868     0%    /partitions/19
	/dev/sd6f             243205      16  218868     0%    /partitions/20
	/dev/sd6h             243205  128137   90747    59%    /partitions/22

     The fourth column gives the number of kilobytes available.

 3.  Pick a filesystem which will not become more than 75% full if the new
     data is added to it.  (The 25% headroom is required for workspace
     during processing, and to accommodate material returned from
     Lancaster.)  Move to its top-level directory.  If, say, you picked
     /dev/sd6a (partition a of disk drive 6, but you don't need to
     know that), you would then need to

	cd /partitions/16

     That is, cd to the name given in the "Mounted on" column.
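
     One way to apply the 75% rule mechanically is to filter the cdf
     listing with awk.  The sketch below is for sh, not csh, and
     pick_partitions is a hypothetical helper.  It assumes the
     six-column layout shown in step 2, and it measures fullness the
     way the capacity column appears to, as used / (used + avail).

```shell
# pick_partitions NEW_KB: read a cdf-style listing on standard input and
# print the mount points that would be at most 75% full once NEW_KB
# kilobytes have been added.  Hypothetical helper; columns as in step 2.
pick_partitions() {
    awk -v new="$1" '
        NR > 1 && ($3 + new) * 100 <= 75 * ($3 + $4) { print $6 }
    '
}

# Usage: ~natcorp/bin/cdf | pick_partitions 10000
```

     The header line is skipped (NR > 1), and the comparison is kept
     in whole numbers so that no rounding is involved.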

 4.  Execute the command

	umask 0007

     This ensures that the files you install can be read only by those
     associated with the BNC project.
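
     If you want to see what the mask does, the following sketch (sh,
     not csh; modes_under_umask is a made-up name) creates a file and
     a directory in a scratch area and prints their permissions.  It
     runs in a subshell, so your own umask is left alone.

```shell
# modes_under_umask MASK: create a file and a directory under MASK in a
# scratch directory, then print the permission string of each.
modes_under_umask() (
    umask "$1"
    demo=$(mktemp -d)
    touch "$demo/text"
    mkdir "$demo/dir"
    ls -ld "$demo/text" | cut -c1-10    # the file
    ls -ld "$demo/dir"  | cut -c1-10    # the directory
    rm -rf "$demo"
)

modes_under_umask 0007
```

     With a mask of 0007 the owner and group keep full access and
     everyone else gets none; who counts as "the group" depends on how
     group membership is set up locally.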

 5.  Execute the command

	mkdir -p Incoming/agency_date/{As_received,Renamed}

     where agency and date are specific to the incoming data.  For example,
     if receiving material from Chambers on 1st March, 1993, the command
     would be

	mkdir -p Incoming/Chambers_930301/{As_received,Renamed}

 6.  Move to the new As_received directory.  For example,

	cd Incoming/Chambers_930301/As_received

 7.  Load the data at that point.  For example, if reading a tar archive
     from a tape cartridge:

	tar xvf /dev/rst1

 8.  Run  ls -a  to see what you've got.  If it turns out to be in
     subdirectories, move it up to the current level and delete the
     subdirectories.  If the data capture agency has given you a list of
     what's supposed to have been supplied, this is the time to check it.
     If any original documents have been supplied, you might also like
     to check they're all there and put them on the shelf.  (They go in
     case-independent alphabetical order, right to left.)

 9.  Make the received files read-only, so that they don't get altered
     accidentally:

	chmod 0440 *

     (Plus  chmod 440 .[A-Za-z0-9]*  if any received file names begin with a
     dot.)

10.  If the files include an OUP database report with a name like
     classif.dat, move it up a level to separate it from the text files, and
     rename it as  classif, whatever its original name.  For example:

	mv classif.dat ../classif

11.  Make links (alternative names for the same files) in the Renamed
     directory for all the files in the As_received directory:

	ln * ../Renamed

     (Note: should any of the incoming files have names beginning with
     a dot, this will miss them: pick them up with

	ln .[A-Za-z0-9]* ../Renamed
     )

12.  Move to the Renamed directory:

	cd ../Renamed

13.  Rename the files as necessary so that they have 5- or 6-character
     alphanumeric names which, independent of letter case, do not clash
     with any existing corpus file names.  How you do this is largely up
     to you.  There is an ugly program, ~natcorp/bin/fixnames, which,
     when run in the As_received directory, partially automates the
     process for files received from OUP:

	cd ../As_received
	ls | ~natcorp/bin/fixnames DESTDIR=../Renamed | sh

     To help you eliminate duplicate names, run

	~natcorp/bin/clash *

     in the As_received directory.

     This will list (in upper-case) existing BNC text names which clash
     with file names in the current directory.  Amend filenames until
     the  clash  command produces no output.  You may find looking at
     the BNC text name list in ~natcorp/filenames useful in picking a
     new, non-clashing name.  Bear in mind that, for magazines, the first
     issue of a particular title that we receive gets named MagazA (or
     whatever), the second MagazB, and so on.  Thus, you may have to move
     the final letter of magazine text names some way down the alphabet to
     avoid a clash if we already have a number of issues of the same
     title.  (Note that it's not the earliest issue that gets called
     MagazA; just the first we receive.)
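
     To make the naming rule concrete, here is a minimal sketch in sh
     (not csh).  It is NOT the real ~natcorp/bin/clash, whose workings
     aren't described here: is_valid_name and list_clashes are made-up
     names, and the name list is assumed to hold one text name per
     line, which may not match the actual format of ~natcorp/filenames.

```shell
# is_valid_name NAME: succeed if NAME is 5 or 6 alphanumeric characters.
is_valid_name() {
    case "$1" in
        *[!A-Za-z0-9]*) return 1 ;;    # contains a non-alphanumeric character
    esac
    [ "${#1}" -ge 5 ] && [ "${#1}" -le 6 ]
}

# list_clashes NAMELIST NAME...: print, in upper case, each NAME that
# matches a whole line of NAMELIST when letter case is ignored.
list_clashes() {
    namelist=$1; shift
    for name in "$@"; do
        if grep -qixF -- "$name" "$namelist"; then
            printf '%s\n' "$name" | tr 'a-z' 'A-Z'
        fi
    done
}

# Usage, assuming a one-name-per-line list in ./names:
#   list_clashes names *
```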

14.  Go to the top level of the disk partition chosen in step 3, and move
     down to the subdirectory named for the data capture agency.  For
     example:

	cd /partitions/16/chambers

15.  Make a directory named for the date on which the data was received,
     and move to it:

	mkdir 930301; cd 930301

16.  If you consider that the received data needs subdividing according
     to some criterion, make subdirectories.  For example

	mkdir books magazines

     Note: there is a control list of allowed directory names.  It is
     currently

	books		c-g	demographic	magazines	newspapers
	selective	spoken	written		r_*

     The names are used in classifying texts, and names not on the list
     give the database indigestion.  But I can easily extend the list if
     necessary.

17.  Make links from, or move, the renamed files into the target
     directory or directories.  (Do NOT copy the files -- the automatic
     database text registration procedure relies on the As_received name
     and the BNC text name being alternative names for the same file;
     having two different files with identical contents won't do.)
     As an unlikely example, suppose files with names beginning with the
     letters A to M are books, and those beginning N to Z are magazines:

	ln /partitions/16/Incoming/Chambers_930301/Renamed/[A-M]* books
	ln /partitions/16/Incoming/Chambers_930301/Renamed/[N-Z]* magazines

     (Making links using ln is better if few steps are involved; moving,
     with mv, may be a good idea if there are many target directories, and
     you want to pick off renamed files until the Renamed directory is
     empty.)
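
     The difference between a link and a copy can be checked directly.
     The sketch below is for sh rather than csh and works in a scratch
     directory; same_file is a made-up name, and it relies on the -ef
     operator of the test command, which bash, ksh and dash all
     provide.

```shell
# same_file A B: succeed if A and B are two names for one and the same file.
same_file() {
    [ "$1" -ef "$2" ]
}

demo=$(mktemp -d)
echo 'sample text' > "$demo/original"
ln "$demo/original" "$demo/linked"     # alternative name for the same file
cp "$demo/original" "$demo/copied"     # separate file, identical contents

same_file "$demo/original" "$demo/linked" && echo "linked: same file"
same_file "$demo/original" "$demo/copied" || echo "copied: different file"
```

     Only the linked name passes the check; the copy, although its
     contents are identical, is a different file.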

18.  Put a dot in front of all the text file names.  The  rename  command
     (see its man page to learn more) is useful for this.  If the files
     are in the current directory:

	rename 's/^/./' *

     If the files are in subdirectories:

	rename 's#/#/.#' */*
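
     If  rename  isn't to hand, a plain loop does the same job.  This
     is a sketch for sh, not csh, and dot_prefix is a made-up name; it
     touches only regular files whose names don't already begin with a
     dot.

```shell
# dot_prefix DIR: rename each regular, non-hidden file NAME in DIR to .NAME.
dot_prefix() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue        # skip subdirectories and the empty case
        mv "$f" "$(dirname "$f")/.$(basename "$f")"
    done
}

# Files in the current directory:   dot_prefix .
# Files in subdirectories:          for d in */; do dot_prefix "$d"; done
```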

19.  If an OUP database classification report was identified in step 10,
     link it into the current directory.  For example,

	ln /partitions/16/Incoming/Chambers_930301/classif .

20.  If you want to create A_ files from the dot files using some sort of
     processing, do it now.  Document what you do in individual Z_ files
     for each text, or combined Z_ files covering all files sharing
     some characteristic.

21.  Move to /corpus/Incoming:

	cd /corpus/Incoming

22.  Make a symbolic link so as to link the incoming file directories
     created in step 5 into the /corpus/Incoming hierarchy.  For example:

	ln -s ../../16/Incoming/Chambers_930301 .

     This needs a bit of explaining.  It puts an entry named
     Chambers_930301 in /corpus/Incoming.  That entry states that the
     actual file is found by going up two levels (which takes you to the
     /partitions directory, although it doesn't look as if it should
     unless you know that /corpus is itself a symbolic link to
     /partitions/05 -- sorry about that), then down to disk partition
     16, and so on.
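
     The resolution can be tried out safely in a scratch directory.
     The sketch below (sh, not csh) builds a miniature of the layout
     under a temporary directory; the names mirror the example, but
     nothing in the real /corpus or /partitions is touched.

```shell
root=$(mktemp -d)                       # scratch stand-in for /
mkdir -p "$root/partitions/05/Incoming" \
         "$root/partitions/16/Incoming/Chambers_930301/As_received"
ln -s partitions/05 "$root/corpus"      # corpus is itself a symbolic link

# Same command as in step 22, run from the stand-in for /corpus/Incoming.
( cd "$root/corpus/Incoming" && ln -s ../../16/Incoming/Chambers_930301 . )

# The link's target is resolved from the directory that really holds it
# (partitions/05/Incoming), so ../.. is partitions, not the scratch root.
ls "$root/corpus/Incoming/Chambers_930301"    # lists As_received
```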

23.  Move to the directory under /corpus/Work named for the data capture
     agency.  For example

	cd /corpus/Work/chambers

24.  Make a symbolic link so as to link the corpus file directories
     created in step 15 into the /corpus/Work hierarchy.  For example:

	ln -s ../../../16/chambers/930301 .

     The explanation is similar to that in step 22.

25.  Edit ~natcorp/.fileids to add symbols giving a short name or names for
     the corpus directory or directories you have created.  You do need
     to be natcorp to do this.  One way to do it is

	su natcorp -c "emacs ~natcorp/.fileids"

     In the example given, you might like to add lines

	setenv chambers930301  /corpus/Work/chambers/930301/books
	setenv chambers930301m /corpus/Work/chambers/930301/magazines

     Having finished the edit, if you want to use these names at once, type

	source ~natcorp/.fileids

     otherwise the changes do not take effect until you next log in.

That's it -- as it damned well should be after all those steps.  A lot more
happens, but it happens automatically as soon as the overnight database
update software notices that new files and directories have been attached to
the /corpus/Work hierarchy.  Look at TGCW50, the man page for  overnight
(by executing  nroff -man ~natcorp/bin/overnight) if you want to know what
happens next.