Introducing SARA
an SGML-Aware Retrieval Application for the British National Corpus

Lou Burnard
Oxford University Computing Services

This is a lightly revised version of a paper presented at the second conference on Teaching and Language Corpora, (held at Lancaster University 9-12 August 1996) in a session jointly organized with Guy Aston. Guy's paper The BNC as a language learner resource complements this one by describing how the software described here was actually used in a real learner environment. A brief report on the whole conference is also available.

Background

From the start of the BNC project in 1990, it was tacitly assumed that some kind of retrieval software would need to be delivered along with the corpus. The original project proposal talks of ``simple processing tools'' and an informal specification for an ``information search and retrieval processor'' was also drawn up by the UCREL team early on. In the event, the need to complete delivery of the corpus on time (or at least, not too late), meant that development of any such software beyond that needed for the immediate needs of the project was increasingly deferred. It was argued that the lack of such software might be only transient, since the corpus was to be delivered in SGML form, tools for which were already becoming widely available, as a result of the widespread adoption of this standard both within the language engineering research community and elsewhere.

However, a major stated goal of the project was to make the corpus available and usable as widely as possible, that is, not just at a low cost, but also within as wide a variety of environments as possible. It seemed to us that the potential user community for large scale corpora like the BNC extended considerably as far beyond the Natural Language Processing research community as it did beyond the immediate needs of commercial lexicographers, although it was largely on behalf of these groups that the project had originally been funded and largely therefore these groups which had determined the manner in which it should be delivered. This conference testifies to the growing interest of the English Language teaching community in the availability and use of large mixed corpora. In wider fields of both human and social sciences, there is an equally rapid growth of interest in the relevance of corpus-based methods for all forms of cultural and linguistic studies. Even in the general public, the amount of media attention which publication of the corpus produced cannot be attributed solely to the efforts of Longman's, Chambers' and OUP's publicity departments, coinciding as it did with such matters of public debate as Professor Aitchison's Reith Lectures, and Prince Charles' pronouncements on the deplorable decline in English language standards.

It seemed to us that the software needs of some of the potential users of the BNC would be only partially met by the generic SGML software available in late 1994 (and to a large extent still today). The choice lay amongst highly specialised, but high performance, application development toolkits which given sufficient expertise could be customised to suit the needs of niche markets in NLP or lexicography, but which were somewhat beyond the needs, comprehension, or indeed purse, of the ELT user on the Clapham omnibus; generic SGML browse and display engines, designed originally for electronic publication or delivery over the web, often with very attractive and user-friendly interfaces but generally unable to handle the full complexity and scale of the BNC; or simple concordancing tools which were equally unable to take advantage of the added value we had so painfully put into the encoding and organization of the corpus. Moreover existing software was either very expensive (being aimed at large scale electronic publishing environments), or free, but requiring considerable technical expertise for anything beyond the most trivial of applications. As discussed further below, the scale and complexity of the BNC (with its 100 million tagged words, six and a quarter million sentences, and 4124 interlinked texts) seemed likely to stretch the capacity of most simple text-based concordancers available at that time.

We were fortunate enough to obtain funding, initially from the British Library R & D Department, and subsequently from the British Academy, to produce a software package which might go some way to fill the gaps identified. Development of the system was carried out by Tony Dodd, with valuable input from members of the original BNC Consortium, and from early users of the software. The system is called SARA, for SGML-Aware Retrieval Application, to make explicit that although aware of the SGML markup present in the corpus, it is not a native SGML database. In this respect, however, it is no better or worse than a number of other current software packages.

2 The SARA system

The SARA system was designed for client/server mode operation, typically in a distributed computing environment, where one or more work-stations or personal computers are used to access a central server over a network. This is, of course, the kind of environment which is most widely current in academic (and other) computing milieux today. The success of the World Wide Web, which uses an identical design philosophy, is vivid testimony to the effectiveness of this approach.

The system has four chief components:

the indexing program, which generates an index of tokens from an SGML marked-up text;
the server program, which accepts messages in the Corpus Query Language (see below) and returns results from the SGML text;
the SARA protocol, a formally defined set of message types which determines legal interactions between the client and server programs; this protocol makes use of a high-level query language known as CQL (for Corpus Query Language);
one or more client programs, with which a user interacts in any appropriate platform-specific way, and which communicate with the server program using the protocol.

2.1 The SARA index

Computationally, the best-understood method of accessing a text the size and complexity of the BNC is to use an index file, in which search terms are associated with their location in the main text file, and into which rapid access can be obtained using hashing techniques. Such methods have been employed for decades in mainstream information retrieval systems, with the consequence that the advantages and disadvantages of the various ways of implementing the underlying technology are well known and very stable.

The SARA index is a conventional index of this type. Entries in the index are created by the indexing program, using the SGML markup to determine how the input text is to be tokenized. The tokens indexed include the content of every <w> or <c> element, together with the part of speech code allocated to it by the CLAWS program. For example, there will be one entry in the index for ``lead'' as a noun, and another for ``lead'' tagged as a verb. The index is not case-sensitive, so occurrences of `Lead' may appear in either entry. The tokenization is entirely dependent on that carried out by CLAWS, which accounts for the presence of a few oddities in the index where CLAWS failed to segment sentences entirely.

The SGML tags (other than those for individual tokens) themselves are also indexed, as are their attribute values. For example, there is an entry in the index for every <text> start- and end-tag, and for every <s> start- and end-tag, etc. This makes it possible to search for words appearing within the scope of a particular SGML element type. For some very frequent element-types (notably <s> and <p> ) whose locations are particularly important when delimiting the context of a hit, additional secondary indexes called accelerator files are maintained.

The index supplied with the first version of the BNC occupies 33,000 files and 2.5 gigabytes of disk space, i.e. slightly more than the size of the text itself. Building the index is a complex and computationally expensive process, requiring much larger amounts of disk space or several sort/merge intermediate phases. This was one reason for delivering the completed index together with the corpus itself on the first release of the BNC, even though development of the client software was not at that stage complete. More compact indexing might have been possible, at the expense of either a loss in performance or an increase in complexity: in practice, the indexing algorithm used provides equally good retrieval times for any kind of query, independent of the size of the corpus indexed. The index included on the published CDs necessarily assumes that the server accessing it has certain hardware characteristics (in particular, word length and byte addressing order). To cater for machines for which these assumptions are incorrect, a localization program is now included with the software. This can either make a once for all modification to the index or be used by the server to make the necessary modifications ``on the fly''.

The indexer program is intended to operate on generic SGML texts, that is, not just on the particular set of tags defined for use in the BNC. However, we have not yet attempted to use it for corpora using other tag sets, and there are almost certainly some features of its behaviour which are currently specific to the BNC.

2.2 The SARA server

The SARA server program was written originally in the ANSI C language, using BSD sockets to implement network connexions, with a view to making it as portable as possible. The current version, release no 928, has been implemented on several different flavours of the Unix operating system, including Solaris, Digital Unix, and Linux, which appear to be the most popular variations. The software is now delivered with detailed installation and localization instructions, and can be downloaded freely from the BNC's web site (see http://info.ox.ac.uk/bnc/sara.html), though it is not yet of much interest to anyone other than BNC licensees.

The server has several distinct functions, amongst which the following are probably the most important:

it allows registered users to log on or off and to change their passwords;
it implements the key functions required of the Corpus Query Language, in particular:
- looking for tokens in the index;
- solving a query;
- supplying bibliographic information about a text;
- displaying some or all of a text at a given location;
- thinning or filtering the result set from a query.
it handles all housekeeping, allowing concurrent access by several different users.

The server listens on a specified socket (usually 7000) for login calls from a client. When such a call is received, the server tries to create a process to accept further data packages. If it succeeds, the client is logged on and set up messages are exchanged which define for example, the names and characteristics of SGML elements in the server's database. Following this, the client sends queries in the Corpus Query Language, and receives data packets containing solutions to them. Once a connexion has been established in this way, the server expects to receive regular messages from the client, and will time out if it does not. The client can also request the server to interrupt certain transactions prematurely.

2.3 The Corpus Query Language

The Corpus Query Language (CQL) is a fairly typical Boolean style retrieval language, with a number of additional features particularly useful for corpus work. It is emphatically not intended for human use. Like many other such languages, its syntax is designed for convenience of machine processing, rather than elegance or perspicuousness. A brief summary of its functionality only is given here.

A query is made up of one or more atomic queries. An atomic query may be one of the following:

a single L-word (that is, a token as recognised by the indexer: this may or may not correspond to an orthographic word);
a wildcard character , which will match any single L-word;
a delimited string of L-words;
an L-word+POS pair, e.g. CAN=NN1;
a regular expression;
an SGML query, that is, a search for a start- or end-tag, possibly including attribute name-value pairs.

Four unary operators are allowed in CQL:

case The $ operator makes the query which is its operand case-sensitive;
header The # operator makes the query which is its operand search within headers as well as in the bodies of texts (it thus assumes that a TEI-conformant dtd is in use);
optional The ? operator matches zero or one solutions to the query which is its operand; it makes no sense unless the query is combined with another;
not The ! operator matches anything which is not a solution to the query which is its operand; it makes no sense unless the query is combined with another;

A CQL expression containing more than one query may use the following binary operators:

sequence one or more blanks between two queries matches cases where solutions to the first immediately precede solutions to the second.
disjunction The [verbar] operator between two queries matches cases where either query is satisfied.
join The [star ] operator between two queries matches cases where both queries are satisfied in the order specified; the operator between two queries matches cases where both queries are satisfied in either order.

When queries are joined, the scope of the expression may be defined in one of the following ways:

SGML element A join query followed by a operator and an SGML query matches cases where the joined query is satisfied within the scope of the SGML query.
number A join query followed by a operator and a number matches cases where the joined query is satisfied within the number of words specified.

If no scope is supplied for a join query, the default scope is a single <bncDoc> element.

2.4 SARA client programs

The standard SARA installation includes a very rudimentary client program called solve, for Unix. This provides a command line interface at which CQL expressions can be typed for evaluation, returning result sets on the standard Unix output channel, for piping to a formatter of the user's choice, or display at a terminal. This client is provided mainly for debugging purposes, and also as a model of how to construct such software. The SARA client program which has been most extensively developed and used runs in the Microsoft Windows environment, and it is this which forms the subject of the remainder of this paper.

In designing the Windows client, we attempted to make sure that as much of the basic functionality of the CQL protocol could be retained, while at the same time making the package easy to use for the novice. We also recognized that we could not implement all of the features which corpus specialists would require at the same time as providing a simple enough interface to attract corpus novices. In retrospect, there are several features and functions we would liked to have added (of which some are discussed below ); but no doubt, had we done so, there would be several aspects of the user interface we would now be equally dissatisfied with.

The SARA client follows standard Microsoft Windows application guidelines, and is written in Microsoft C++, using the standard object classes and libraries. It thus looks very similar to any other Windows application, with the same conventions for window management, buttons, menus, etc. It runs under any version of Windows more recent than 3.0, and there are both 16 and 32 bit versions. A TCP/IP stack (such as Winsock) to implement connexion to the server is essential, and a colour screen highly desirable. The software uses only small amounts of disk or memory, except when downloading or sorting result sets containg very many (more than a few hundred) or very long (more than 1Kb) hits.

The Windows client allows the user to

search the word index and check what tokens it contains;
define, save, re-use, or modify a query (effectively, a CQL expression to be evaluated);
view, sort, save, or print all or some of the results returned by a query;
configure and manipulate the display of results in a variety of ways;
view contextual and bibliographic data for any one text;
combine simple queries to form a complex one, using a visual interface.

A brief description of each of these functions is given below; more information is available from the built-in help file, the BNC Handbook (a detailed tutorial guide co-authored by Guy Aston and Lou Burnard). A brief technical summary is also available at http://info.ox.ac.uk/bnc/saradoc.html.

2.4.1 Types of Query

The Windows client distinguishes five types of query, and allows for their combination as a complex query. The basic query types are:

word query this searches the SARA word index, either by stem (right hand truncation only is performed) or by pattern (see below). All index-entries matching the string entered are returned, and the user can then select all or some of them for dispatch to the server as CQL queries against the corpus;
phrase query A phrase query behaves superficially like a word query, in that it searches for occurrences of a particular word or phrase. It differs in that it can be case-sensitive, can search text headers as well as bodies, can include punctuation, and is aware of the tokenization rules used by the CLAWS tagger. A phrase query can also include a `wild card' character to match any word in a phrase.
pattern query A pattern query allows for queries using a simple subset of UNIX-style regular expressions, for example to find variant spellings of a word. Some limitations on the kind of pattern which can usefully be searched for are imposed by the nature of the index: for example, left hand truncation of the search term always implies a scan through the entire index, and is therefore not allowed.
POS query: A part of speech (POS) query carries out a word query, further restricted by a given POS code or code, for example to find occurrences of `lead' tagged as a noun. It should be stressed that this is only feasible for a specified word, since the POS code is only a secondary key in the SARA word index --- it is not possible to search for (say) all nouns with the current system.
SGML query An SGML query carries out a search for a given SGML tag in the corpus, optionally qualified by particular combinations of attribute values, for example to find all occurrences of <event> elements in which the desc attribute has the value ``laughing'' or ``laughter''. It is particularly useful when restricting searches to texts of a particular type, since text type information is typically carried by SGML attributes in the BNC.

One or more of the above types of query may be combined to form a complex query, using the special purpose Query Builder visual interface, in which the parts of a complex query are represented by nodes of various types. A Query Builder query always has at least two nodes: one, the scope node, defines the the context within which a complex query is to be evaluated. This may be expressed either as an SGML element, or as a span of some number of words. The other nodes are known as content nodes, and correspond with the simple queries from which the complex query is built. Content nodes may be linked together horizontally, to indicate alternation, or vertically to indicate concatenation. In the latter case, different arc types are drawn, to indicate whether the terms are to be satisfied in either order, in one order only, or directly, i.e. with no intervening terms.

Query Builder enables one to solve queries such as ``find the word `fork' followed by the word `knife' as a noun, within the scope of a single <u> element''. It can be used to find occurrences of the words `anyhow' or `anyway' directly following laughter at the start of a sentence; to constrain searches to texts of particular types, or contexts, and so forth.

For completeness, the Windows client also allows the skilled (or adventurous) user to type a CQL expression directly: this is the only form of simple query which is not permitted within the Query Builder interface.

2.4.2 Display and manipulation of queries

By whatever method it is posed, any SARA query returns its results in the same way. Results may be displayed in one of line or page modes, i.e. in a conventional KWIC display, or one result at a time. The amount of context returned for each result is specified as a maximum number of characters, within which a whole sentence or paragraph will usually be displayed. Results can be displayed in one of four different formats:

plain text-only display which effectively ignores and suppresses all markup;
POS individual words are colour-coded according to their part of speech and a user-defined colour scheme;
SGML all SGML encoding in the original is displayed uninterpreted;
custom the SGML encoding is interpreted according to a simple user-supplied specification.

It will often be the case that the number of results found for a query is unmanageably large. To handle this, the SARA client offers the following facilities. A global limit is defined on the number of results to be returned. When this limit is exceeded, the user can choose

to over-ride the limit temporarily for this result set, specifying how many solutions are required, discarding any surplus from the end of the result set;
to discard all but the first solution in each text;
to take a random sample of specified size from the available solutions.

When the last of these is repeated for a given large result set, it will return a different random sample each time.

Once downloaded to the client, a set of results may be manipulated in a number of ways. It may be sorted according to the keyword which defined the query, by varying extents of the left or right context for this keyword, or by combinations of these keys. Sorting can be carried out either by the orthographic form, in case-insensitive manner, or by the POS code of words. This enables the user to group together all occurrences of a word in which it is followed by a particular POS code, for example. It is also possible to scroll through a result set, manually identifying particular solutions for inclusion or exclusion, or to thin it automatically in the same way as when the limit on the number of solutions is exceeded.

A result set may simply be printed out, or saved to a file in SGML format, for later processing by some SGML-aware formatter or further processor. Named bookmarks may be associated with particular solutions (as in other Windows applications) to facilitate their rapid recovery. The queries generating a result set, together with any associated thinning of it, any bookmarks, and any additional documentary comment, can all be saved together as named queries on the client, which can then be reactivated as required.

2.4.3 Additional features of the client

The main bibliographic information about each text from which a given concordance line has been extracted can be displayed with a single mouse click. It is also possible to browse directly the whole of the text and its associated header, which is presented as a hierarchic menu, reflecting its SGML structure. The user can either start from the position where a hit was found, expanding or contracting the elements surrounding it, or start from the root of the document tree, and move down to it.

A limited range of statistical features are provided. Word frequencies and z-scores are provided for word-form lookups, and there is a useful collocation option which enables one to calculate the absolute and relative frequencies with which a specified term co-occurs within a specified number of words of the current query focus.

3 Limitations of the current system and future plans

As noted above, the current client lacks some facilities which are widely used in particular fields of corpus-based research. This is particularly true of statistical information. There is no facility for the automatic generation of collocate lists, or any of the other forms of more sophisticated forms of statistical analysis now widely used. Neither is there any form of linguistic knowledge built into the system (other than the POS tagging): there is no lemmatized index, or lemmatizing component, though clearly it would be desirable to add one. For those sufficiently technically minded, or motivated, the construction of such facilities (whether using SGML-aware tools or not) is relatively straightforward; the problem is that no simple interface or hook exists to build them into the current Windows client.

Similarly, it is not possible to define, save and re-use subcorpora, except by saving and re-using the queries which define them. The SARA client can address only the whole of the SARA index, which indexes the whole of the BNC. This is a design issue, which has yet to be addressed. If queries become very complex, involving manipulation of many very large result streams, they may exceed the limits of what can be handled by the server. This has not yet arisen in practice however.

A more common complaint about the current system is that it cannot be used to search for patterns of POS codes, independently of the particular word forms to which they are attached. This is fundamentally an indexing problem, which may be addressed in the next major release of the system. The performance problems associated with queries containing very high frequency words are derived from the same problem, and may be addressed in the same way. And again, it is a trivial exercise for a competent programmer to write special purpose code which will search for such patterns across the whole of the BNC.

Despite these limitations, the system has attracted great enthusiasm when tested and demonstrated, despite performance problems and difficulties of access, perhaps owing largely to the intrinsic interest of the BNC data itself. At the time of writing, July 1996, the current software system appears stable enough for general release, not only to BNC licensees for their own internal use, but also to suitably qualified users wishing to access a national online service. Plans are already well advanced for the establishment of such a service as part of the British Library's ``Initiatives for Access'' programme. Plans have also been mooted for the further development of the SARA system, enabling it to be used with other SGML document type definitions, and on other platforms. Development of other SARA clients, in particular for the World Wide Web, is a further exciting possibility for the system. SARA, who came late into the BNC's world, seems likely to be equally late to leave it.

For up to date information on the availabity of the SARA system, consult our web pages at http://info.ox.ac.uk/bnc/sara.html.

Introducing SARA an SGML-Aware Retrieval Application for the British National Corpus

Lou Burnard Oxford University Computing Services

Background

Introducing SARA
an SGML-Aware Retrieval Application for the British National Corpus

Lou Burnard
Oxford University Computing Services