1 The BNC via SARA: a brief outline

1.1 The BNC

The BNC contains over 4000 texts, with a total of approximately 100 million words, of which 10% are broad transcriptions of speech. Its composition is designed to reflect the variety of users and uses of contemporary British English, in the tradition of the Survey of English Usage and the Brown and LOB corpora (Kucera and Francis 1967; Johansson 1980). Its encoding is TEI-conformant (Sperberg-McQueen and Burnard 1994), with SGML mark-up indicating such features as:

1.2 SARA

SARA (SGML-Aware Retrieval Application) runs in a client-server environment, with a Windows client and a Unix server. The user formulates a query on the client and transmits it to the server, which searches the corpus for solutions that satisfy the query. The server then returns counts and downloads the solutions (as a concordanced set of contexts) to the client. The user can display, sort and thin these solutions on the client in various ways, and can also request additional information about specific solutions from the server.
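To make the idea of a concordanced set of contexts concrete, the following is a minimal sketch in Python of how solutions might be built, sorted and thinned. It is not SARA's own code: the function names (kwic, sort_by_right, thin), the fixed context width and the "keep every n-th line" thinning rule are all assumptions made for illustration.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for each hit.

    Hypothetical helper: matching is case-insensitive and the context
    is a fixed number of tokens on each side (an assumption).
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def sort_by_right(lines):
    """Sort concordance lines alphabetically on the right-hand context."""
    return sorted(lines, key=lambda line: line[2].lower())


def thin(lines, every=2):
    """Thin the set by keeping every n-th line (one possible thinning rule)."""
    return lines[::every]


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the cat".split()
    for left, key, right in sort_by_right(kwic(text, "cat", width=2)):
        print(f"{left:>10} [{key}] {right}")
```

Sorting on the right-hand context is only one of several plausible orderings; sorting on the left-hand context or on the keyword itself would follow the same pattern.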

Queries can look for text, markup, or combinations of the two. As well as occurrences of single words, phrases, or markup elements, it is possible to find:

The server can report the number of solutions, the number of texts in which they occur, and the frequency of specified collocates without downloading the solutions themselves. It also incorporates an index of all the word-forms in the corpus, which can be queried to list the forms (up to a maximum of 200) that match a particular pattern (specified as a regular expression), together with their overall frequencies in the corpus and their z-scores.
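The word-form index lookup can be illustrated with a short Python sketch. It is a toy model under stated assumptions, not SARA's implementation: the names build_index and lookup are hypothetical, forms are lower-cased, the pattern must match the whole form, and results are ordered by descending frequency. Only the 200-form limit comes from the description above; z-scores are omitted because their exact definition is not given here.

```python
import re
from collections import Counter

MAX_FORMS = 200  # the maximum number of forms listed, as stated in the text


def build_index(tokens):
    """Map each word-form to its frequency (lower-casing is an assumption)."""
    return Counter(tok.lower() for tok in tokens)


def lookup(index, pattern, limit=MAX_FORMS):
    """List (form, frequency) pairs whose form matches the regular expression.

    Assumptions: the pattern must match the entire form, and results are
    ordered by descending frequency, then alphabetically.
    """
    rx = re.compile(pattern)
    matches = [(form, freq) for form, freq in index.items() if rx.fullmatch(form)]
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return matches[:limit]


if __name__ == "__main__":
    corpus = "the cat sat on the mat with a cap and the cats".split()
    index = build_index(corpus)
    for form, freq in lookup(index, r"ca.*"):
        print(form, freq)
```

Because the lookup touches only the index, it returns forms and frequencies without retrieving any contexts, which mirrors the distinction drawn above between obtaining counts and downloading solutions.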

The number of solutions to be downloaded can be specified in various ways, including random selection. Each solution is downloaded with a text-identifier code indicating its source and the number of the sentence within that text from which it is taken. Users can also look up the full bibliographic details for any solution on the server and browse the full source text.
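The selection of solutions for download can be sketched as follows. This is an illustrative Python model, not SARA's code: the Solution record, the select_solutions function and the "initial" mode (take the first n solutions) are assumptions. Only random selection, the text-identifier code and the sentence number are taken from the description above.

```python
import random
from typing import NamedTuple


class Solution(NamedTuple):
    text_id: str      # code identifying the source text
    sentence_no: int  # number of the sentence within that text
    context: str      # the concordance line itself


def select_solutions(solutions, max_download, mode="initial", seed=None):
    """Choose at most max_download solutions.

    mode="initial" keeps the first ones found (a hypothetical mode);
    mode="random" draws a random sample, kept in corpus order.
    """
    if max_download >= len(solutions):
        return list(solutions)
    if mode == "initial":
        return list(solutions[:max_download])
    if mode == "random":
        rng = random.Random(seed)
        picked = sorted(rng.sample(range(len(solutions)), max_download))
        return [solutions[i] for i in picked]
    raise ValueError(f"unknown selection mode: {mode}")


if __name__ == "__main__":
    found = [Solution(f"T{i:02d}", i + 1, "...") for i in range(10)]
    for sol in select_solutions(found, 3, mode="random", seed=1):
        print(sol.text_id, sol.sentence_no)
```

Sampling indices rather than the records themselves keeps the downloaded set in its original order, so a random selection still reads as a sequence through the corpus.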

