Text Corpora and the British National Corpus

There is nothing particularly new in large collections of texts for academic research: for centuries people have been collecting manuscripts, books and newspapers for analysis, which has often been very laborious work. Thankfully, as technological advances make the computerized storage and retrieval of large quantities of information easier, the construction and use of text corpora continue to increase, and the potential for research has widened considerably.

A corpus can be thought of as a collection of texts gathered according to particular principles for some particular purpose. This contrasts with an archive -- the Oxford Text Archive, for example -- which is a general collection where all kinds of texts are stored simply because each one is of interest in itself. Someone interested in `Caisleáin Óir' by Séamus Ó Grianna (1) can retrieve it from the Text Archive; a linguist might reasonably expect to find all or some of it in a corpus of (for example) twentieth-century Donegal Gaelic.

Linguistics is one area where text corpora are being developed with enthusiasm on a large scale, and the results are proving invaluable in many different ways. One of the clearest examples is lexicography, where traditional methods are being radically revised to take account of the linguistic evidence provided by text corpora. The `Collins Cobuild English Language Dictionary' (2), published in 1987, was a completely new dictionary, constructed entirely on the basis of a 7.3 million word text corpus (3). Using a concordance generated from the corpus, lexicographers were able to examine occurrences of each word in context (a simple sketch of the idea appears below), something which helped in a number of ways. First, the concordance was a new aid in identifying the main sense or senses in which any particular word was used, a task previously left largely to the lexicographer's observation and intuition alone. Second, a suitable example could be selected to accompany the definition written by the lexicographer -- the important difference being that the example would be a naturally-occurring usage illustrating the given definition rather than one formulated by the lexicographer.

Corpora are becoming invaluable not only in the writing of traditional paper dictionaries, but also in the construction of computerized databases and lexicons. The CELEX project at the University of Nijmegen has built large databases with extensive information on Dutch, English and German (4), and corpora of texts in modern-day language were used to determine which words were finally included in the database. Moreover, parts of the corpora used (for Dutch, the Instituut voor Nederlandse Lexicologie's 40 million word corpus, and for English, Cobuild's extended 18 million word corpus) were disambiguated to provide sophisticated information on the frequency of words, both canonical forms and wordforms.

The use of a corpus as an `on-tap' supply of natural language is important in many fields. In language teaching, as in lexicography, a corpus can provide examples of language as it is really used. Textbooks and teaching materials can be prepared using authentic examples, the advantage being that naturally-occurring language is almost invariably more vigorous and colourful than made-up examples. One problem for some computational linguists is parsing and disambiguating a text corpus, for which a reliable corpus-based computer lexicon is necessary: this is a vicious circle which the parallel development of corpora and lexical databases can begin to solve.
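To make the concordance idea concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is an illustration only: the toy text, the function name and the layout are invented for this article, and it is not the software used by Cobuild or by the British National Corpus project.

    # Minimal keyword-in-context (KWIC) concordance sketch.
    # The toy text and the column width are invented for illustration.
    def kwic(text, keyword, width=30):
        """List each occurrence of the keyword with `width` characters of
        context on either side, in the style of a printed concordance."""
        lines = []
        lowered, target = text.lower(), keyword.lower()
        start = lowered.find(target)
        while start != -1:
            left = text[max(0, start - width):start]
            hit = text[start:start + len(target)]
            right = text[start + len(target):start + len(target) + width]
            lines.append(f"{left:>{width}}  {hit}  {right}")
            start = lowered.find(target, start + len(target))
        return lines

    sample = ("The corpus provides evidence of real usage. A corpus, unlike "
              "an archive, is built to a design; each corpus serves a purpose.")
    for line in kwic(sample, "corpus"):
        print(line)

A real concordance program would of course run over many millions of words, and would usually sort the resulting lines by the words to the left or right of the keyword, but the principle is the same.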
Work of this kind on corpora and lexical databases will prove invaluable to those involved with artificial intelligence and natural language processing, since to enable computers to `understand' or produce real language, a rich store of lexical information has to be available. Corpora consisting of transcribed speech are of particular interest to researchers in a wide range of fields, including stylistics, dialectology and sociolinguistics, depending on the design and contents of the corpus used.

Given the many possibilities presented by corpus analysis, and the consequent demand for larger and better corpora, it is no surprise that work has begun to develop a new, national resource, called the British National Corpus. Within three years the project hopes to have constructed a corpus of about 100 million words from a wide range of subjects and styles, for general use within the academic community throughout the country as well as in commercial research and development. The British National Corpus project is a collaborative, pre-competitive initiative carried out by Oxford University Press, Longman Group UK Ltd, Chambers, Lancaster University's Unit for Computer Research in the English Language, Oxford University Computing Services and the British Library. The project receives funding from the UK Department of Trade and Industry and the Science and Engineering Research Council within their Joint Framework for Information Technology. In view of the breadth of participation in the project and the promise of a product available for general public research rather than private commercial research, the title of National Corpus seems most appropriate.

The British National Corpus will be a general-purpose corpus for linguistic research. Its 100 million words will be made up of about 90 million words of written texts (or samples from texts which are longer than 40,000 words) and about 10 million words of transcribed spoken material. Current plans for the written component of the corpus envisage the inclusion of a wide range of texts, most of which will date from the mid 1970s to the present day. Books and periodicals are the most important, but other published items such as leaflets, manuals and adverts will be included, as well as unpublished material such as letters and memos, reports and essays. All levels of British English will be represented -- from the technical and highbrow through to the simple and lowbrow.

For the spoken component, a special project is under way. The British Market Research Bureau has been called in to recruit volunteers throughout the UK who are willing to use a Walkman and a small microphone to record all their conversations over a few days. The areas where volunteers are sought are carefully selected to provide a representative social cross-section of the nation. Early tests have produced good results, and recordings are already being transcribed and converted into digital format by Longman UK Ltd in Harlow. Other spoken material -- recordings of meetings, speeches, sermons and broadcasts -- is also being collected and transcribed.

Material collected is sent to Oxford University Computing Services, where its formatting is standardized and checked. The format used is SGML, the `Standard Generalized Markup Language' (5), implemented in broad accordance with the recommendations of the TEI, or Text Encoding Initiative (6).
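As a rough illustration of what such mark-up might look like, the short Python sketch below uses element names (div, head, p, s) chosen along general TEI lines; both the fragment and the code are invented for this article, and they are not the project's own encoding scheme or software.

    # Illustrative only: the element names follow general TEI practice,
    # not necessarily the scheme adopted by the BNC project.
    import re

    fragment = """<div type="chapter">
    <head>A hypothetical marked-up fragment</head>
    <p>
    <s n="1">The mark-up records structure, not appearance.</s>
    <s n="2">Any SGML-aware program can therefore make use of the text.</s>
    </p>
    </div>"""

    # Pull out the individual sentences by their structural tags ...
    for sentence in re.findall(r'<s n="\d+">(.*?)</s>', fragment):
        print(sentence)

    # ... or strip the tags altogether if plain text is all that is wanted.
    print(re.sub(r"<[^>]+>", "", fragment))

The second step anticipates a point made below: the tags are an aid rather than an obstacle, and a researcher who has no use for them can remove or rewrite them at will.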
The mark-up is concerned with the structure of the text -- its sections, paragraphs and sentences, for example -- and it also helps ensure that the corpus will be usable whatever the local computational set-up. Tagging of this sort therefore helps everyone: researchers in history or sociology who use the corpus need it just as much as linguists do. Specifically linguistic information is added subsequently at Lancaster University, where basic syntactic codes are generated by a modified version of the CLAWS2 parser and tagger (7). While all the SGML tags, both structural and linguistic, are intended to be an aid to research, they can be rewritten or removed according to the purpose of the researcher.

Once all the formatting and syntactic tagging is complete, texts are added to the corpus on the project's computers housed at Oxford University Computing Services. Within a few years, analysis of the data will begin: new dictionaries and language reference books will appear, and new research into many linguistic and related fields will be undertaken, all based on a National Corpus which will provide fresh insights into the way the nation is currently using its language.

Gavin Burnage
British National Corpus
Oxford University Computing Services
13 Banbury Road
OXFORD OX2 6NN
Tel: 0865-273280
Fax: 0865-273275
E-mail: GBURNAGE@NATCORP.OX.AC.UK

REFERENCES

(1) Ó Grianna, Séamus, `Caisleáin Óir', Dundalk Press, 1924.
(2) Sinclair, John (ed.), `Collins Cobuild English Language Dictionary', Collins Publishers, 1987.
(3) For more details see Sinclair, John (ed.), `Looking Up', Collins Publishers, 1987.
(4) For more details see Burnage, Gavin, `CELEX -- A Guide for Users', Centre for Lexical Information, University of Nijmegen, 1990.
(5) For more details see Goldfarb, Charles F., `The SGML Handbook', Oxford University Press, 1990.
(6) For more details see Sperberg-McQueen, C. Michael & Burnard, Lou (eds), `Guidelines for the Encoding and Interchange of Machine-Readable Texts', Chicago & Oxford, ALLC-ACH-ACL Text Encoding Initiative, 1990.
(7) For more details see Garside, Leech & Sampson (eds), `The Computational Analysis of English', Longman, 1987.