Text Corpora and the British National Corpus

There is nothing particularly new in large collections of texts for academic research: for centuries people have been collecting manuscripts, books and newspapers for analysis, which has often been very laborious work. Thankfully, as technological advances make the computerized storage and retrieval of large quantities of information easier, the construction and use of text corpora continue to increase, and the potential for research has widened considerably.

A corpus can be thought of as a collection of texts gathered according to particular principles for some particular purpose. This contrasts with an archive -- the Oxford Text Archive, for example -- which is a general collection where all kinds of texts are stored simply because each one is of interest in itself. Someone interested in `Caisleáin Óir' by Séamus Ó Grianna (1) can retrieve it from the Text Archive; a linguist might reasonably expect to find all or some of it in a corpus of (for example) twentieth-century Donegal Gaelic.

Linguistics is one area where text corpora are being developed with enthusiasm on a large scale, and the results are proving invaluable in many different ways. One of the clearest examples is lexicography, where traditional methods are being radically revised to take account of the linguistic evidence provided by text corpora. The `Collins Cobuild English Language Dictionary' (2), published in 1987, was a completely new dictionary, constructed entirely on the basis of a 7.3 million word text corpus (3). Using a concordance generated from the corpus, lexicographers were able to examine occurrences of each word in context (a simple sketch of the idea appears below), something which helped in a number of ways. First, the concordance was a new aid in identifying the main sense or senses in which any particular word was used, a task previously left largely to the lexicographer's observation and intuition alone. Second, a suitable example could be selected to accompany the definition written by the lexicographer -- the important difference being that the example would be a naturally-occurring usage illustrating the given definition rather than one formulated by the lexicographer.

Corpora are becoming invaluable not only in the writing of traditional paper dictionaries, but also in the construction of computerized databases and lexicons. The CELEX project at the University of Nijmegen has built large databases with extensive information on Dutch, English and German (4), and corpora of texts in modern-day language were used to determine which words were finally included in the database. Moreover, parts of the corpora used (for Dutch, the Instituut voor Nederlandse Lexicologie's 40 million word corpus, and for English, Cobuild's extended 18 million word corpus) were disambiguated to provide sophisticated information on the frequency of words, both canonical forms and wordforms.

The use of a corpus as an `on-tap' supply of natural language is important in many fields. In language teaching, as in lexicography, a corpus can provide examples of language as it is really used. Textbooks and teaching materials can be prepared using authentic examples, the advantage being that naturally-occurring language is almost invariably more vigorous and colourful than made-up examples. One problem for some computational linguists is parsing and disambiguating a text corpus, for which a reliable corpus-based computer lexicon is necessary: this is a vicious circle which the parallel development of corpora and lexical databases can begin to solve.
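To make the concordance idea concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is an illustration only: the toy text, the function name and the layout are invented for this article, and it is not the software used by Cobuild or by the British National Corpus project.

    # Minimal keyword-in-context (KWIC) concordance sketch.
    # The toy text and the column width are invented for illustration.
    def kwic(text, keyword, width=30):
        """List each occurrence of the keyword with `width` characters of
        context on either side, in the style of a printed concordance."""
        lines = []
        lowered, target = text.lower(), keyword.lower()
        start = lowered.find(target)
        while start != -1:
            left = text[max(0, start - width):start]
            hit = text[start:start + len(target)]
            right = text[start + len(target):start + len(target) + width]
            lines.append(f"{left:>{width}}  {hit}  {right}")
            start = lowered.find(target, start + len(target))
        return lines

    sample = ("The corpus provides evidence of real usage. A corpus, unlike "
              "an archive, is built to a design; each corpus serves a purpose.")
    for line in kwic(sample, "corpus"):
        print(line)

A real concordance program would of course run over many millions of words, and would usually sort the resulting lines by the words to the left or right of the keyword, but the principle is the same.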
Work of this kind on corpora and lexical databases will prove invaluable to those involved with artificial intelligence and natural language processing, since to enable computers to `understand' or produce real language, a rich store of lexical information has to be available. Corpora consisting of transcribed speech are of particular interest to researchers in a wide range of fields, including stylistics, dialectology and sociolinguistics, depending on the design and contents of the corpus used.

Given the many possibilities presented by corpus analysis, and the consequent demand for larger and better corpora, it is no surprise that work has begun to develop a new, national resource, called the British National Corpus. Within three years the project hopes to have constructed a corpus of about 100 million words from a wide range of subjects and styles, for general use within the academic community throughout the country as well as in commercial research and development. The British National Corpus project is a collaborative, pre-competitive initiative carried out by Oxford University Press, Longman Group UK Ltd, Chambers, Lancaster University's Unit for Computer Research in the English Language, Oxford University Computing Services and the British Library. The project receives funding from the UK Department of Trade and Industry and the Science and Engineering Research Council within their Joint Framework for Information Technology. In view of the breadth of participation in the project and the promise of a product available for general public research rather than private commercial research, the title of National Corpus seems most appropriate.

The British National Corpus will be a general-purpose corpus for linguistic research. Its 100 million words will be made up of about 90 million words of written texts (or samples from texts which are longer than 40,000 words) and about 10 million words of transcribed spoken material. Current plans for the written component of the corpus envisage the inclusion of a wide range of texts, most of which will date from the mid 1970s to the present day. Books and periodicals are the most important, but other published items such as leaflets, manuals and adverts will be included, as well as unpublished material such as letters and memos, reports and essays. All levels of British English will be represented -- from the technical and highbrow through to the simple and lowbrow.

For the spoken component, a special project is under way. The British Market Research Bureau has been called in to recruit volunteers throughout the UK who are willing to use a Walkman and a small microphone to record all their conversations over a few days. The areas where volunteers are sought are carefully selected to provide a representative social cross-section of the nation. Early tests have produced good results, and recordings are already being transcribed and converted into digital format by Longman UK Ltd in Harlow. Other spoken material -- recordings of meetings, speeches, sermons and broadcasts -- is also being collected and transcribed.

Material collected is sent to Oxford University Computing Services, where its formatting is standardized and checked. The format used is SGML, the `Standard Generalized Markup Language' (5), implemented in broad accordance with the recommendations of the TEI, or Text Encoding Initiative (6).
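As a rough illustration of what such mark-up might look like, the short Python sketch below uses element names (div, head, p, s) chosen along general TEI lines; both the fragment and the code are invented for this article, and they are not the project's own encoding scheme or software.

    # Illustrative only: the element names follow general TEI practice,
    # not necessarily the scheme adopted by the BNC project.
    import re

    fragment = """<div type="chapter">
    <head>A hypothetical marked-up fragment</head>
    <p>
    <s n="1">The mark-up records structure, not appearance.</s>
    <s n="2">Any SGML-aware program can therefore make use of the text.</s>
    </p>
    </div>"""

    # Pull out the individual sentences by their structural tags ...
    for sentence in re.findall(r'<s n="\d+">(.*?)</s>', fragment):
        print(sentence)

    # ... or strip the tags altogether if plain text is all that is wanted.
    print(re.sub(r"<[^>]+>", "", fragment))

The second step anticipates a point made below: the tags are an aid rather than an obstacle, and a researcher who has no use for them can remove or rewrite them at will.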
The mark-up is concerned with the structure of the text -- its sections, paragraphs and sentences, for example -- and it also helps ensure that the corpus will be usable whatever the local computational set-up. Tagging of this sort therefore helps everyone: researchers in history or sociology who use the corpus need it just as much as linguists do. Specifically linguistic information is added subsequently at Lancaster University, where basic syntactic codes are generated by a modified version of the CLAWS2 parser and tagger (7). While all the SGML tags, both structural and linguistic, are intended to be an aid to research, they can be rewritten or removed according to the purpose of the researcher.

Once all the formatting and syntactic tagging is complete, texts are added to the corpus on the project's computers housed at Oxford University Computing Services. Within a few years, analysis of the data will begin: new dictionaries and language reference books will appear, and new research into many linguistic and related fields will be undertaken, all based on a National Corpus which will provide fresh insights into the way the nation is currently using its language.

Gavin Burnage
British National Corpus
Oxford University Computing Services
13 Banbury Road
OXFORD OX2 6NN
Tel: 0865-273280
Fax: 0865-273275
E-mail: GBURNAGE@NATCORP.OX.AC.UK

REFERENCES

(1) Ó Grianna, Séamus, `Caisleáin Óir', Dundalk Press, 1924.
(2) Sinclair, John (ed.), `Collins Cobuild English Language Dictionary', Collins Publishers, 1987.
(3) For more details see Sinclair, John (ed.), `Looking Up', Collins Publishers, 1987.
(4) For more details see Burnage, Gavin, `CELEX -- A Guide for Users', Centre for Lexical Information, University of Nijmegen, 1990.
(5) For more details see Goldfarb, Charles F., `The SGML Handbook', Oxford University Press, 1990.
(6) For more details see Sperberg-McQueen, C. Michael & Burnard, Lou (eds), `Guidelines for the Encoding and Interchange of Machine-Readable Texts', Chicago & Oxford, ALLC-ACH-ACL Text Encoding Initiative, 1990.
(7) For more details see Garside, Leech & Sampson (eds), `The Computational Analysis of English', Longman, 1987.