BNCX29

TITLE
The British National Corpus: Problems in Producing a Large Text Corpus (Panel session)

AUTHORS/AFFILIATIONS
Burnage, Gavin, Oxford University Computing Services
Garside, Roger, Lancaster University
Woodall, Ray, Oxford University Press

CONTACT ADDRESS
Gavin Burnage
Oxford University Computing Services
13 Banbury Road
OXFORD OX2 6NN

E-MAIL gburnage@natcorp.ox.ac.uk
FAX NUMBER +44 865 273275
PHONE NUMBER +44 865 273280

The British National Corpus (BNC) project is now over halfway through its planned development phase, the aim of which is the construction of a 100 million word corpus of modern British English for use in linguistic research and lexicography. Partners in this collaborative venture are Oxford University Press (OUP), Longman Group UK Ltd., W & R Chambers, Oxford University Computing Services (OUCS), Lancaster University's Unit for Computer Research in the English Language (UCREL), and the British Library. Funding comes from the UK government's Advanced Technology Programme under the DTI/SERC Joint Framework for Information Technology.

The design of the BNC was drawn up with the intention of producing a 100 million word corpus covering a wide range of written and spoken English. The intention is that it should be broadly "representative" (in the general sense of the word) of modern British English; or, to put it another way, that it should "characterize" modern British English. 90 million words of the corpus will be written material, and up to 10 million words will be transcribed spoken material.

Texts are included in the Written Corpus on the basis of four selection features. The first is the domain of a text -- its subject matter. The second is the date of first publication: most of the material dates from 1975 to the present (though in the case of works of fiction, some text dates back as far as 1960). The third is the medium of the text: many types of text (such as books, periodicals, brochures, leaflets, letters, reports, and so on) are to be included. The fourth is the general level (high, middle or low): literature or detailed technical writing is considered "high", while some tabloid newspapers or short leaflets are considered "low". Target percentages have been set for the various sub-categorizations to help ensure that the desired wide range of texts is covered. In addition, information about each text, its size, author, and composition is recorded so that inadvertent imbalances can be rectified where necessary.

In order to maximise the varieties of text within the 90 million word limit, the number of words taken from any one text does not normally exceed 40,000. Using samples rather than full texts also helps placate publishers worried about the electronic distribution of copyright texts, though obtaining permission to use texts in the corpus continues to be a difficult and lengthy procedure.

Half of the written material is systematically selected using sources such as bestseller lists, public and academic library borrowing information, lists of prize-winning books, and exhaustive reference works such as the British National Bibliography and Whitaker's Books in Print. The other half is selected at random, again using Books in Print. The British Library has been advising the project on how to apply the selection criteria with these sources of information, and has recommended the use of Dewey Decimal codes to help in the selection and classification of the texts.
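
As an informal illustration of these written-text selection rules, the following Python sketch applies the publication-date criterion and the 40,000-word sampling cap, and compares observed proportions against targets for one selection feature. The field names, target proportions, and helper names are hypothetical; the project's actual selection procedures are considerably more elaborate.

    MAX_SAMPLE_WORDS = 40000        # normal upper limit per sampled text
    GENERAL_CUTOFF = 1975           # earliest publication date for most material
    FICTION_CUTOFF = 1960           # fiction may date back further

    # Hypothetical target proportions for the "level" feature only; the
    # real figures differ and cover all four selection features.
    LEVEL_TARGETS = {"high": 0.30, "middle": 0.45, "low": 0.25}

    def meets_date_criterion(domain, year):
        """Apply the publication-date rule described above."""
        cutoff = FICTION_CUTOFF if domain == "fiction" else GENERAL_CUTOFF
        return year >= cutoff

    def take_sample(words):
        """Truncate a text to the normal per-text word limit."""
        return words[:MAX_SAMPLE_WORDS]

    def imbalance(counts):
        """Compare observed word counts per level against the targets,
        so that inadvertent imbalances can be spotted and rectified."""
        total = sum(counts.values()) or 1
        return {level: counts.get(level, 0) / total - share
                for level, share in LEVEL_TARGETS.items()}

For example, imbalance({"high": 40, "middle": 40, "low": 20}) would show the "high" category running above its assumed target and "low" slightly below, flagging where further selection effort should go.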
The Spoken Corpus comprises a demographic section and a context-governed section. The demographic section consists of conversation recorded by volunteers of different social backgrounds throughout the UK; this work was carried out by Longman Group UK Ltd in conjunction with the British Market Research Bureau. The context-governed section consists of recordings of a wide range of talks, demonstrations, meetings, lectures, and so on. A considerable amount of material has been recorded, and transcription of the material is continuing. There is a great deal of interest in the provision of spoken data such as this; only the expense and time involved prevent the BNC from supplying even more such data.

The three publishers in the BNC consortium (OUP, Longman and Chambers) are responsible for the BNC's data capture. Three sources of electronic data were envisaged at the start of the project: existing electronic text, scanned text, and keyed-in text. It has become apparent that the first source is not as plentiful as had been thought, since either the material was encoded in formats too difficult to unscramble, or the texts available did not match the stipulated design criteria. Consequently there has been a greater reliance on scanning and keying text, activities which bring problems of their own -- training keyboarders and scanner operators to work consistently is one common problem, and agreeing on the markup to be used when creating electronic text is another. In the case of spoken data, keying in was the only option from the start, and this has proved to be as time-consuming as expected.

All three data suppliers use their own internal markup systems for their data capture. The role of OUCS has been to suggest a standard markup scheme for all BNC texts, and to convert and correct all incoming text to that standard markup. In an attempt to make the BNC a truly national (and international) resource, the format agreed upon (the Corpus Document Interchange Format, or CDIF) is an application of the Standard Generalized Markup Language (SGML) conforming to the recommendations of the Text Encoding Initiative.

Problems encountered in designing and implementing CDIF have come in the level of detail required. CDIF has also had to accommodate widely different text types, from academic articles to illustrated leaflets to domestic conversation. CDIF allows for a wealth of information to be recorded about each text, but the time available precludes the application of all the available tags. To help ease the burden on the data suppliers, the tags available have been classified according to their perceived usefulness and applicability. Some -- such as headings, chapter or other division breaks, and paragraphs -- are "required" parts of any CDIF document: when such features occur in a text, they must be marked up. Others -- such as sub-divisions in the text, lists, poems, and notes about editorial correction -- are "recommended", and should be marked up if at all possible. Finally, some tags are considered "optional" -- for example, dates, proper names, and citations which are easily identifiable.

OUCS accepts data from the three suppliers on condition that it can be automatically converted to at least the "required" level of CDIF encoding. With this conversion in mind, the suppliers have agreed with OUCS on the format their keyboarders and scanner operators aim for. A large part of OUCS's work is to check, correct, and improve the markup wherever possible, using an SGML parser to test each text's conformance to the CDIF DTD and human inspection to ensure that the markup has been properly applied.
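
By way of illustration, the Python sketch below mimics one small part of that checking step: it tests only whether each "required" feature occurs somewhere in a document. The element names used (div, head, p) are TEI-style guesses rather than the exact CDIF names, and a genuine conformance test would parse the text against the CDIF DTD with a full SGML parser rather than a crude presence check like this one.

    import re

    # "Required" CDIF features per the classification above; element
    # names are illustrative guesses, not necessarily the CDIF names.
    REQUIRED = ("div", "head", "p")

    def missing_required(sgml_text):
        """Return the required element names that never occur in a text.
        This is only a presence test, not true DTD validation."""
        present = {name.lower()
                   for name in re.findall(r"<([A-Za-z][A-Za-z0-9]*)", sgml_text)}
        return [tag for tag in REQUIRED if tag not in present]

    sample = "<div><head>Chapter 1</head><p>Some text.</p></div>"
    assert missing_required(sample) == []
    assert missing_required("<div><p>No heading here.</p></div>") == ["head"]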
The next stage of the BNC production process takes place under UCREL at Lancaster University, where part-of-speech tagging is carried out on all texts intended for inclusion in the corpus, using a modified version of the Claws word-tagging system. (Here "tagging" refers to linguistic annotation; at OUCS "tagging" refers to the annotation of textual features.) The tagset (C5) used for all BNC texts consists of about 60 tags. In addition, a more extensive tagset (C6) of about 160 tags is to be used for a million-word sub-corpus derived from the full BNC. C6 is based on tagsets used at Lancaster in the past for generating the Lancaster-Oslo/Bergen and Lancaster Parsed corpora. C5 makes fewer fine distinctions (between various classes of nouns, for example) so that tagging accuracy remains adequately high, given the high volume of data to be processed in a relatively short time. Implementing a more detailed tagset would require extensive hand-editing which cannot be financed under current conditions. During Claws' tagging procedures, temporary use is made of "process tags": these make distinctions required during processing, typically in the disambiguation of the segment structure of the texts, but do not appear in the resulting fully-tagged corpus.

The linguistic tagging process is complicated by various factors, including the use of SGML markup, the wish to preserve all the features recorded in the texts, and the open-ended nature of the texts themselves -- personal and place names, long words, foreign expressions, unusual orthography, and so on all have to be handled. The annotation of Spoken Corpus data is particularly challenging. Manual post-editing of Claws output is a necessary part of the process. It is simplified to some extent by the use of "portmanteau" tags, which present two alternative part-of-speech codes in one tag at points where automatic analysis could not make a final decision (see the sketch at the end of this abstract). In addition to hand-editing, there are mechanisms for checking output from the processing and the form of the resulting text -- the segment structure and the part-of-speech markings.

Once linguistic annotation is complete, texts are returned to OUCS for further CDIF checks, the addition of bibliographic information, and, finally, accession to the BNC. The production phase of the project is due to end in April 1994, after which the corpus will become available for academic and some commercial research. Some software tools for analysing the corpus will also be distributed.

At this panel session, three speakers working on the BNC project will discuss the foregoing issues and problems in more detail, and will also review the current state of the BNC and the effectiveness of the processing strategies adopted.
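
As a concrete illustration of the portmanteau convention described above, the Python sketch below shows one plausible way a tagger might fall back on a combined tag when two candidate analyses score too closely to separate. The probability figures, the function, and the 0.2 margin are invented for the example; only the C5 tag names (AJ0 for adjective, NN1 for singular common noun) and the hyphenated two-code form follow BNC practice.

    AMBIGUITY_MARGIN = 0.2   # invented threshold, not Claws' actual figure

    def choose_tag(candidates):
        """candidates: (tag, probability) pairs from automatic analysis.
        Emit a portmanteau tag when the top two scores are too close for
        a confident decision, leaving the choice to manual post-editing."""
        ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
        (best, p1), (second, p2) = ranked[0], ranked[1]
        if p1 - p2 < AMBIGUITY_MARGIN:
            return best + "-" + second   # portmanteau tag, e.g. AJ0-NN1
        return best

    print(choose_tag([("AJ0", 0.48), ("NN1", 0.42)]))   # -> AJ0-NN1
    print(choose_tag([("NN1", 0.90), ("VVB", 0.05)]))   # -> NN1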