British National Corpus
Indexing a corpus for SARA
This is a preliminary draft of a document explaining how to build a SARA index for any TEI-encoded corpus. It is not complete and should not be relied on for anything other than general indications. Definitive information will be provided in the SARA Technical Manual, currently (July 2001) in production.
SARA (SGML Aware Retrieval Application) is a system for providing reasonably fast access to very large amounts of SGML-tagged language corpus data. It is not particularly suited to small (less than 10 Mb) datasets or data which is not structured in any way, though it can be used with such data. Using the system involves three components:
To build a SARA system you need the following ingredients:
To begin with, we recommend that you create a single folder named after your corpus, and then create three subdirectories called Text, Index, and Etc within it. The following discussion assumes you have done that, and that the top level folder is called myCorps. Note that all names in SARA are case-sensitive.
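If you prefer to script this step, the folder layout just described can be created with a few lines of Python. This is only a sketch and is not part of the SARA distribution; the function name make_corpus_layout is invented for illustration.

```python
from pathlib import Path


def make_corpus_layout(root):
    """Create root/Text, root/Index and root/Etc; return the three paths."""
    root = Path(root)
    created = []
    # Names are case-sensitive in SARA, so spell them exactly as in the text.
    for name in ("Text", "Index", "Etc"):
        sub = root / name
        sub.mkdir(parents=True, exist_ok=True)  # harmless if it already exists
        created.append(sub)
    return created
```

Calling make_corpus_layout("myCorps") from the directory in which you want the corpus to live creates the myCorps folder and its three subdirectories, and leaves any existing contents untouched.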
Put your corpus files in the Text folder. Note carefully the following constraints:
You must also create a corpus header file. This can be placed anywhere, but it is convenient to put it in the same folder as the texts. A corpus header has exactly the same structure as any other TEI header; it is typically used to supply definitions for any code books or other encoded data used across the whole corpus. A minimal corpus header looks like this:
<teiHeader type="corpus">
 <fileDesc>
  <titleStmt>
   <title><!-- title for your corpus here--></title>
   <respStmt>
    <resp>Corpus built by</resp><name><!-- Your Name Here--></name>
   </respStmt>
  </titleStmt>
  <editionStmt><p> First TEI-conformant version </p></editionStmt>
  <publicationStmt>
   <authority>Distributed by the compiler</authority>
   <availability status="restricted">
    <p>Availability limited to compiler</p>
   </availability>
  </publicationStmt>
  <sourceDesc><p><!-- describe your source material here --></p></sourceDesc>
 </fileDesc>
</teiHeader>
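Before indexing, it can be worth checking that the header you have typed is at least well-formed. The sketch below uses Python's standard XML parser for that purpose, on the assumption that your header, like the minimal example above, is written in XML-compatible syntax. SGML allows constructs (such as omitted end-tags) that an XML parser will reject, so a failure here does not necessarily mean that SARA would refuse the file. The function name check_corpus_header is invented for illustration.

```python
import xml.etree.ElementTree as ET


def check_corpus_header(text):
    """Parse header text as XML and return the title string (may be empty)."""
    root = ET.fromstring(text)  # raises ET.ParseError if not well-formed
    if root.tag != "teiHeader":
        raise ValueError("outermost element is %r, expected teiHeader" % root.tag)
    title = root.find("./fileDesc/titleStmt/title")
    if title is None:
        raise ValueError("no fileDesc/titleStmt/title element found")
    # A title holding only a comment has no text; report it as empty.
    return (title.text or "").strip()
```

An empty return value is a reminder that the placeholder comment in the title element has not yet been replaced with a real title.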
The behaviour of different parts of SARA is controlled by two files: the corpus parameter file and the corpus description file. You must create these files next. You can do this with any text editor you like (notepad, emacs, whatever comes to hand). Both files have roughly the same format, consisting of a series of lines each of which supplies a parameter and a value. The order of the lines is unimportant, but in other matters (case, spacing, etc.) it is safest to follow the examples closely.
The corpus parameter file is used by each part of the SARA system in order to locate the files to be operated on. You must specify this file either explicitly or implicitly for SARA to work, and its contents must correctly identify the location of the other SARA system files.
This file also contains values for some internal settings used by the Indexer, which must also be available to the server. For this reason it is essential that the same file is used by both server/client and indexer.
NAME=myCorps
TXT=/SARA/myCorps/Text/
HDR=corphdr # name for corpus header file (within TXT path)
ETC=/SARA/myCorps/Etc/
IDX=/SARA/myCorps/Index/
ACC=/SARA/Adm/ # path to Account files (only needed for networked systems)
SORT=/temp/ # path to temporary sort space (needed when indexing)
TMP=/temp/ # path to scratch space used by server/client
#do not change the following settings!
HASHSHIFT=3
HASHLAST=6
IGRAN=100
ILOC=30000
GRAN=1000000
If you name the file corpus.prm, then you won't have to specify its name to the indexer or server. Obviously, you should change the path names in the above example to correspond with those you are actually using.
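Since a wrong path in the parameter file is an easy mistake to make, you may want to check the file before running the indexer. The following Python sketch reads a parameter file of the form shown above and reports which of the TXT, ETC and IDX folders do not exist. It is not part of SARA, and it rests on two assumptions drawn only from the example: that each setting is a PARAMETER=value pair, and that a # character begins a comment. The function names are invented for illustration.

```python
import os


def parse_prm(text):
    """Return a dict of PARAMETER -> value from corpus parameter file text."""
    params = {}
    for raw in text.splitlines():
        # Assumption: '#' starts a comment, as in the example file.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a PARAMETER=value line: %r" % raw)
        params[key.strip()] = value.strip()
    return params


def missing_directories(params, keys=("TXT", "ETC", "IDX")):
    """Return the path parameters that are unset or are not directories."""
    return [k for k in keys if not os.path.isdir(params.get(k, ""))]
```

For example, missing_directories(parse_prm(open("corpus.prm").read())) returns an empty list when all three folders are in place. A path that itself contains a # character would be truncated by this sketch.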
The corpus description file contains a detailed description of the corpus data itself: for example, what tags it contains, which elements function as words or sentences, which attribute is used to identify citations, what character entities are present, and a whole host of other things. You can use the indexer to make a default description file, and then modify it to match your requirements.
ver 101
scope P
scope S
wtag w type
This assumes that your corpus has elements P and S marking paragraphs and sentences, and that POS tagging is included as the value of the type attribute on <w> elements.
The indexer is a standalone program supplied with the latest release of the SARA system. On Unix systems, it is built at the same time as the server and other utilities, and is executed from the command line in the same way. On Windows systems, you need to start up a Windows command processor or DOS prompt and then type the appropriate commands. For example, if you have put the file indexer.exe in the folder C:\sara, your corpus folder is C:\sara\myCorps, and your corpus.prm is in the corpus folder, then you would type the following commands in the DOS window:
cd \sara\myCorps
\sara\indexer
C:\sara\mycorps>..\indexer
Opened index OK
Found 3 texts
DSC file is /SARA/myCorps/Etc/myCorps.dsc
Utility menu
1. Rescan file list
2. Build index
3. Build hash index files
4. Build and merge
5. Sort hash index files
6. Compress index files
7. Build dictionary
8. Pack file index
9. Index signature
10. Index bibliography
11. Statistics
12. Build frob file
13. Delete all files
14. Complete new build
15. Test a single text
16. Register a subcorpus
17. Deregister a subcorpus
18. Exit
Enter option number:
The line beginning "Found" should give the number of text files found in the Text folder, together with the corpus header. Make sure this is correct. The cursor sits at the end of a list of choices: you want option 14 (Complete new build), so type 14 and press return.
Now have a look in the folder you named as your Etc folder in the corpus parameter file. You will see that it contains a large number of files which were not there previously. In particular, there will be a file called unknowns.txt, which you now need to append to the corpus.dsc file: it contains default entries for all the SGML elements, character entities, POS codes, etc. that are actually present in your data but not defined in your existing description.
Open the description file using the text editor of your choice, and append the contents of unknowns.txt to the end of it. If you want to edit any of the lines (for example, to supply a textual description which the client can display along with the tag name) you can do so now or later. See the reference manual for full information about what you can type in the description file.
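If you would rather not do the appending by hand, the same step can be sketched in Python as follows. This is an illustration only, not a SARA utility; the function name append_unknowns is invented, and you supply the paths of your own description file and unknowns.txt.

```python
from pathlib import Path


def append_unknowns(dsc_path, unknowns_path):
    """Append the text of unknowns_path to dsc_path; return lines appended."""
    dsc = Path(dsc_path)
    # latin-1 maps every byte to one character, so any content passes
    # through unchanged whatever its real encoding.
    addition = Path(unknowns_path).read_text(encoding="latin-1")
    existing = dsc.read_text(encoding="latin-1") if dsc.exists() else ""
    # Make sure the old content ends with a newline so no two lines are fused.
    if existing and not existing.endswith("\n"):
        existing += "\n"
    dsc.write_text(existing + addition, encoding="latin-1")
    return len(addition.splitlines())
```

Run it once only for each indexing pass: calling it a second time with the same unknowns.txt appends the same entries again.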
Now run the indexer again, in exactly the same way as before. This time, you should not see any error messages: if you do, you should try to resolve them. (These messages are also written to a file called corpus.log if you want to review those which have scrolled off the top of the window.)
No matter the size of the corpus, SARA will always fill up the Index folder with a large number of subdirectories used as hash buckets, each of which has a number in the same range. Within these numbered hash buckets, SARA places large numbers of files with the extension .HID and .sid. If you are short of disk space, you can safely delete the .HID files once the index run is complete.
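Because the .HID files are spread over many numbered subdirectories, deleting them by hand is tedious. The Python sketch below walks the Index folder and lists them, deleting them only when asked to. It is not part of SARA, and the function name remove_hid_files is invented. It also assumes that the extension should be matched without regard to case, which the text above does not state; check the list it returns before deleting anything.

```python
from pathlib import Path


def remove_hid_files(index_dir, dry_run=True):
    """List (and, unless dry_run, delete) .HID files under index_dir."""
    found = sorted(
        p for p in Path(index_dir).rglob("*")
        # Assumption: match .HID, .hid, etc. regardless of case.
        if p.is_file() and p.suffix.lower() == ".hid"
    )
    if not dry_run:
        for p in found:
            p.unlink()
    return found
```

With the default dry_run=True nothing is removed, so you can inspect the result first and then call the function again with dry_run=False once the index run is complete.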
The index you have built can be used unchanged on any platform for which a SARA server has been correctly built. This includes a variety of Unix and Linux systems as well as any Microsoft Windows 32-bit environment. You can even use it under Mac OS X, but this tutorial won't tell you how.
Start up the SARA Windows client. On the first screen that appears, press Menu rather than OK. You should see a list of the servers for which your client is currently registered. You need to tell it to look at the new corpus you have just indexed. Press the ADD button. Type the name of your corpus (or some other string) into the Name window and check the box marked "Local". Then press the Browse button, and navigate to the location where you have stored the corpus parameter file for your new corpus. Press OK. You will be returned to the list of available corpora, in which your new corpus should now be included. Select the name of your corpus in the list by clicking on it, and then press the OK button.
On Unix systems, even if you are the only user, you have to set up the server for network access. This means you must first run the corpadm program (which is also built at the same time as the server and the indexer) to create the account directories. These directories are created in the path specified by the ACC directive in your corpus parameter file. There is detailed documentation of the corpadm program on the BNC web site.
% corpadm
corpadm> add guest guest
corpadm> quit
% sarad
Started server
% solve fishy
Connected!
2 solutions
...
%