TGCW48

Procedure for Loading Texts Received from Data Capture Agencies

Dominic Dunlop
March, 1993

This note documents what to do here at OUCS on receiving a tape, diskettes, or anything else carrying corpus files from a data-capture agency (Chambers, Longman, OUP). It applies both to original texts and to texts resubmitted after being bounced back to the data capture agency the first time around. (See also TGCW49 on resubmitted texts.)

I apologise in advance for the arbitrary and less than logical nature of the procedure -- it just grew, rather than having been designed. I've assumed throughout that you're using a cshell.

0. You can do all of this except the last step logged in as yourself: you do not have to be su'ed to natcorp, root, or any other user ID. But it's OK to do it su'ed to natcorp if you want to.

1. Take a wild guess at the volume of data involved. If you take the supposed number of words, and divide by 100, you'll have a conservative estimate of the number of kilobytes.

2. Run cdf. This command lists the disk partitions dedicated to corpus data, and shows how much free space there is in each. For example:

    % ~natcorp/bin/cdf
    Filesystem   kbytes    used    avail  capacity  Mounted on
    /dev/sd5b    152798  123003    14515     89%    /partitions/05
    /dev/sd1b    152798  110429    27089     80%    /partitions/09
    /dev/sd2a    152798   89390    48128     65%    /partitions/12
    /dev/sd2b    152798  111192    26326     81%    /partitions/13
    /dev/sd6a    243205   68112   150772     31%    /partitions/16
    /dev/sd6b    243205      16   218868      0%    /partitions/17
    /dev/sd6d    243205      16   218868      0%    /partitions/18
    /dev/sd6e    243205      16   218868      0%    /partitions/19
    /dev/sd6f    243205      16   218868      0%    /partitions/20
    /dev/sd6h    243205  128137    90747     59%    /partitions/22

The fourth column gives the number of kilobytes available.

3. Pick a filesystem which will not become more than 75% full if the new data is added to it. (The 25% headroom is required for workspace during processing, and to accommodate material returned from Lancaster.) Move to its top level directory.
If, say, you picked /dev/sd6a (partition a of disk drive 6, but you don't need to know that), you would then need to

    cd /partitions/16

That is, cd to the name given in the ``Mounted on'' column.

4. Execute the command

    umask 0007

This ensures that the files you install can be read only by those associated with the BNC project.

5. Execute the command

    mkdir -p Incoming/agency_date/{As_received,Renamed}

where agency and date are specific to the incoming data. For example, if receiving material from Chambers on 1st March, 1993, the command would be

    mkdir -p Incoming/Chambers_930301/{As_received,Renamed}

6. Move to the new As_received directory. For example,

    cd Incoming/Chambers_930301/As_received

7. Load the data at that point. For example, if reading a tar archive from a tape cartridge:

    tar xvf /dev/rst1

8. Run ls -a to see what you've got. If it turns out to be in subdirectories, move it up to the current level and delete the subdirectories. If the data capture agency has given you a list of what's supposed to have been supplied, this is the time to check it. If any original documents have been supplied, you might also like to put them on the shelf and check that they're all there. (They go in case-independent alphabetical order, right to left.)

9. Make the received files read-only, so that they don't get altered accidentally:

    chmod 0440 *

(Plus

    chmod 0440 .[A-Za-z0-9]*

if any received file names begin with a dot.)

10. If the files include an OUP database report with a name like classif.dat, move it up a level to separate it from the text files, and rename it as classif, whatever its original name. For example:

    mv classif.dat ../classif

11. Make links (alternative names for the same files) in the Renamed directory for all the files in the As_received directory:

    ln * ../Renamed

(Note: should any of the incoming files have names beginning with a dot, this will miss them: pick them up with

    ln .[A-Za-z0-9]* ../Renamed

)

12. Move to the Renamed directory:

    cd ../Renamed

13.
Rename the files as necessary so that they have 5- or 6-character alphanumeric names which, independent of letter case, do not clash with any existing corpus file names. How you do this is largely up to you. There is an ugly program, ~natcorp/bin/fixnames, which, when run in the As_received directory, partially automates the process for files received from OUP:

    cd ../As_received
    ls | ~natcorp/bin/fixnames DESTDIR=../Renamed | sh

To help you eliminate duplicate names, run

    ~natcorp/bin/clash *

in the As_received directory. This will list (in upper-case) existing BNC text names which clash with file names in the current directory. Amend filenames until the clash command produces no output. You may find looking at the BNC text name list in ~natcorp/filenames useful in picking a new, non-clashing name.

Bear in mind that, for magazines, the first issue of a particular title that we receive gets named MagazA (or whatever), the second MagazB, and so on. Thus, you may have to move the final letter of magazine text names some way down the alphabet to avoid a clash if we already have a number of issues of the same title. (Note that it's not the earliest issue that gets called MagazA; just the first we receive.)

14. Go to the top level of the disk partition chosen in step 3, and move down to the subdirectory named for the data capture agency. For example:

    cd /partitions/16/chambers

15. Make a directory named for the date on which the data was received, and move to it:

    mkdir 930301; cd 930301

16. If you consider that the received data needs subdividing according to some criterion, make subdirectories. For example

    mkdir books magazines

Note: there is a control list of allowed directory names. It is currently

    books c-g demographic magazines newspapers selective spoken written r_*

The names are used in classifying texts, and names not on the list give the database indigestion. But I can easily extend the list if necessary.

17.
Make links from, or move, the renamed files into the target directory or directories. (Do NOT copy the files -- the automatic database text registration procedure relies on the As_received name and the BNC text name being alternative names for the same file; having two different files with identical contents won't do.) As an unlikely example, suppose files with names beginning with the letters A to M are books, and those beginning N to Z are magazines:

    ln /partitions/16/Incoming/Chambers_930301/Renamed/[A-M]* books
    ln /partitions/16/Incoming/Chambers_930301/Renamed/[N-Z]* magazines

(Making links using ln is better if few steps are involved; moving, with mv, may be a good idea if there are many target directories, and you want to pick off renamed files until the Renamed directory is empty.)

18. Put a dot in front of all the text file names. The rename command (see its man page to learn more) is useful for this. If the files are in the current directory:

    rename 's/^/./' *

If the files are in subdirectories:

    rename 's#/#/.#' */*

19. If an OUP database classification report was identified in step 10, link it into the current directory. For example,

    ln /partitions/16/Incoming/Chambers_930301/classif .

20. If you want to create A_ files from the dot files using some sort of processing, do it now. Document what you do in individual Z_ files for each text, or combined Z_ files covering all files sharing some characteristic.

21. Move to /corpus/Incoming:

    cd /corpus/Incoming

22. Make a symbolic link so as to link the incoming file directories created in step 5 into the /corpus/Incoming hierarchy. For example:

    ln -s ../../16/Incoming/Chambers_930301 .

This needs a bit of explaining. It puts an entry named Chambers_930301 in /corpus/Incoming.
That entry states that the actual file is found by going up two levels (which takes you to the /partitions directory, although it doesn't look as if it should unless you know that /corpus is itself a symbolic link to /partitions/05 -- sorry about that), then down to disk partition 16, and so on.

23. Move to the directory under /corpus/Work named for the data capture agency. For example

    cd /corpus/Work/chambers

24. Make a symbolic link so as to link the corpus file directories created in step 15 into the /corpus/Work hierarchy. For example:

    ln -s ../../../16/chambers/930301 .

The explanation is similar to that in step 22.

25. Edit ~natcorp/.fileids to add symbols giving a short name or names for the corpus directory or directories you have created. You do need to be natcorp to do this. One way to do it is

    su natcorp -c "emacs ~natcorp/.fileids"

In the example given, you might like to add lines

    setenv chambers930301 /corpus/Work/chambers/930301/books
    setenv chambers930301m /corpus/Work/chambers/930301/magazines

Having finished the edit, if you want to use these names at once, type

    source ~natcorp/.fileids

otherwise the changes do not take effect until you next log in.

That's it -- as it damned well should be after all those steps. A lot more happens, but it happens automatically as soon as the overnight database update software notices that new files and directories have been attached to the /corpus/Work hierarchy. Look at TGCW50, the man page for overnight (by executing nroff -man ~natcorp/bin/overnight) if you want to know what happens next.
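
Appendix (illustrative only). The arithmetic in steps 1 to 3 can be sketched as a pair of shell functions. This is a rough sketch, not part of the installed tools: it is written in Bourne shell rather than csh, the function names estimate_kb and pick_partitions are made up for the example, and it assumes that cdf prints the six columns shown in step 2 and that the capacity figure is used / (used + avail), which is what the sample figures suggest.

```shell
#!/bin/sh
# Hypothetical helpers for steps 1-3; not part of ~natcorp/bin.

# estimate_kb WORDS
# Step 1's rule of thumb: number of words divided by 100 gives a
# conservative estimate of the number of kilobytes.
estimate_kb() {
    echo $(( $1 / 100 ))
}

# pick_partitions NEED_KB
# Reads cdf-style output on standard input (assumed columns:
# Filesystem kbytes used avail capacity Mounted-on) and prints each
# mount point that would be no more than 75% full after NEED_KB
# kilobytes are added, with the projected percentage (truncated).
pick_partitions() {
    awk -v need="$1" '
        NR > 1 && ($3 + $4) > 0 {
            # Assumption: capacity = used / (used + avail).
            after = ($3 + need) * 100 / ($3 + $4)
            if (after <= 75)
                printf "%s %d%%\n", $6, after
        }'
}
```

With these functions defined, something like

    ~natcorp/bin/cdf | pick_partitions `estimate_kb 1000000`

would list the candidate partitions for a delivery of about a million words. The choice among them is still yours.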