BNCX29

TITLE
The British National Corpus: Problems in Producing a Large Text Corpus (Panel session)

AUTHORS/AFFILIATIONS
Burnage, Gavin, Oxford University Computing Services
Garside, Roger, Lancaster University
Woodall, Ray, Oxford University Press

CONTACT ADDRESS
Gavin Burnage
Oxford University Computing Services
13 Banbury Road
OXFORD OX2 6NN

E-MAIL gburnage@natcorp.ox.ac.uk
FAX NUMBER +44 865 273275
PHONE NUMBER +44 865 273280

The British National Corpus (BNC) project is now over halfway through its planned development phase, the aim of which is the construction of a 100 million word corpus of modern British English for use in linguistic research and lexicography. Partners in this collaborative venture are Oxford University Press (OUP), Longman Group UK Ltd., W & R Chambers, Oxford University Computing Services (OUCS), Lancaster University's Unit for Computer Research in the English Language (UCREL), and the British Library. Funding comes from the UK government's Advanced Technology Programme under the DTI/SERC Joint Framework for Information Technology.

The design of the BNC was drawn up with the intention of producing a 100 million word corpus covering a wide range of written and spoken English. The intention is that it should be broadly "representative" (in the general sense of the word) of modern British English; or, to put it another way, that it should "characterize" modern British English. 90 million words of the corpus will be written material, and up to 10 million words will be transcribed spoken material.

Texts are included in the Written Corpus on the basis of four selection features. The first is the domain of a text -- its subject matter. The second is the date of first publication: most of the material dates from 1975 to the present (though in the case of works of fiction, some text dates back as far as 1960). The third is the medium of the text: many types of text (such as books, periodicals, brochures, leaflets, letters, reports, and so on) are to be included. The fourth is the general level (high, middle or low): literature or detailed technical writing is considered "high", while some tabloid newspapers or short leaflets are considered "low". Target percentages have been set for the various sub-categorizations to help ensure that the desired wide range of texts is covered. In addition, information about each text, its size, author, and composition is recorded so that inadvertent imbalances can be rectified where necessary.

In order to maximise the varieties of text within the 90 million word limit, the number of words taken from any one text does not normally exceed 40,000. Using samples rather than full texts also helps placate publishers worried about the electronic distribution of copyright texts, though obtaining permission to use texts in the corpus continues to be a difficult and lengthy procedure.

Half of the written material is systematically selected using sources such as bestseller lists, public and academic library borrowing information, lists of prize-winning books, and exhaustive reference works such as the British National Bibliography and Whitaker's Books in Print. The other half is selected at random, again using Books in Print. The British Library has been advising the project on how to apply the selection criteria with these sources of information, and has recommended the use of Dewey Decimal codes to help in the selection and classification of the texts.
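
As an informal illustration of these written-text selection rules, the following Python sketch applies the publication-date criterion and the 40,000-word sampling cap, and compares observed proportions against targets for one selection feature. The field names, target proportions, and helper names are hypothetical; the project's actual selection procedures are considerably more elaborate.

    MAX_SAMPLE_WORDS = 40000        # normal upper limit per sampled text
    GENERAL_CUTOFF = 1975           # earliest publication date for most material
    FICTION_CUTOFF = 1960           # fiction may date back further

    # Hypothetical target proportions for the "level" feature only; the
    # real figures differ and cover all four selection features.
    LEVEL_TARGETS = {"high": 0.30, "middle": 0.45, "low": 0.25}

    def meets_date_criterion(domain, year):
        """Apply the publication-date rule described above."""
        cutoff = FICTION_CUTOFF if domain == "fiction" else GENERAL_CUTOFF
        return year >= cutoff

    def take_sample(words):
        """Truncate a text to the normal per-text word limit."""
        return words[:MAX_SAMPLE_WORDS]

    def imbalance(counts):
        """Compare observed word counts per level against the targets,
        so that inadvertent imbalances can be spotted and rectified."""
        total = sum(counts.values()) or 1
        return {level: counts.get(level, 0) / total - share
                for level, share in LEVEL_TARGETS.items()}

For example, imbalance({"high": 40, "middle": 40, "low": 20}) would show the "high" category running above its assumed target and "low" slightly below, flagging where further selection effort should go.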
The Spoken Corpus comprises a demographic section and a context-governed section. The demographic section consists of conversation recorded by volunteers of different social backgrounds throughout the UK; this work was carried out by Longman Group UK Ltd in conjunction with the British Market Research Bureau. The context-governed section consists of recordings of a wide range of talks, demonstrations, meetings, lectures, and so on. A considerable amount of material has been recorded, and transcription of the material is continuing. There is a great deal of interest in the provision of spoken data such as this; only the expense and time involved prevent the BNC from supplying even more such data.

The three publishers in the BNC consortium (OUP, Longman and Chambers) are responsible for the BNC's data capture. Three sources of electronic data were envisaged at the start of the project: existing electronic text, scanned text, and keyed-in text. It has become apparent that the first source is not as plentiful as had been thought, since either the material was encoded in formats too difficult to unscramble, or the texts available did not match the stipulated design criteria. Consequently there has been a greater reliance on scanning and keying text, activities which bring problems of their own -- training keyboarders and scanner operators to work consistently is one common problem, and agreeing on the markup to be used when creating electronic text is another. In the case of spoken data, keying in was the only option from the start, and this has proved to be as time-consuming as expected.

All three data suppliers use their own internal markup systems for their data capture. The role of OUCS has been to suggest a standard markup scheme for all BNC texts, and to convert and correct all incoming text to that standard markup. In an attempt to make the BNC a truly national (and international) resource, the format agreed upon (the Corpus Document Interchange Format, or CDIF) is an application of the Standard Generalized Markup Language (SGML) conforming to the recommendations of the Text Encoding Initiative.

Problems encountered in designing and implementing CDIF have come in the level of detail required. CDIF has also had to accommodate widely different text types, from academic articles to illustrated leaflets to domestic conversation. CDIF allows for a wealth of information to be recorded about each text, but the time available precludes the application of all the available tags. To help ease the burden on the data suppliers, the tags available have been classified according to their perceived usefulness and applicability. Some -- such as headings, chapter or other division breaks, and paragraphs -- are "required" parts of any CDIF document: when such features occur in a text, they must be marked up. Others -- such as sub-divisions in the text, lists, poems, and notes about editorial correction -- are "recommended", and should be marked up if at all possible. Finally, some tags are considered "optional" -- for example, dates, proper names, and citations which are easily identifiable.

OUCS accepts data from the three suppliers on condition that it can be automatically converted to at least the "required" level of CDIF encoding. With this conversion in mind, the suppliers have agreed with OUCS on the format their keyboarders and scanner operators aim for. A large part of OUCS's work is to check, correct, and improve the markup wherever possible, using an SGML parser to test each text's conformance to the CDIF DTD and human inspection to ensure that the markup has been properly applied.
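
By way of illustration, the Python sketch below mimics one small part of that checking step: it tests only whether each "required" feature occurs somewhere in a document. The element names used (div, head, p) are TEI-style guesses rather than the exact CDIF names, and a genuine conformance test would parse the text against the CDIF DTD with a full SGML parser rather than a crude presence check like this one.

    import re

    # "Required" CDIF features per the classification above; element
    # names are illustrative guesses, not necessarily the CDIF names.
    REQUIRED = ("div", "head", "p")

    def missing_required(sgml_text):
        """Return the required element names that never occur in a text.
        This is only a presence test, not true DTD validation."""
        present = {name.lower()
                   for name in re.findall(r"<([A-Za-z][A-Za-z0-9]*)", sgml_text)}
        return [tag for tag in REQUIRED if tag not in present]

    sample = "<div><head>Chapter 1</head><p>Some text.</p></div>"
    assert missing_required(sample) == []
    assert missing_required("<div><p>No heading here.</p></div>") == ["head"]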
The next stage of the BNC production process takes place under UCREL at Lancaster University, where part-of-speech tagging is carried out on all texts intended for inclusion in the corpus, using a modified version of the Claws word-tagging system. (Here "tagging" refers to linguistic annotation; at OUCS "tagging" refers to the annotation of textual features.) The tagset (C5) used for all BNC texts consists of about 60 tags. In addition, a more extensive tagset (C6) of about 160 tags is to be used for a million-word sub-corpus derived from the full BNC. C6 is based on tagsets used at Lancaster in the past for generating the Lancaster-Oslo/Bergen and Lancaster Parsed corpora. C5 makes fewer fine distinctions (between various classes of nouns, for example) so that tagging accuracy remains adequately high, given the high volume of data to be processed in a relatively short time. Implementing a more detailed tagset would require extensive hand-editing which cannot be financed under current conditions. During Claws' tagging procedures, temporary use is made of "process tags": these make distinctions required during processing, typically in the disambiguation of the segment structure of the texts, but do not appear in the resulting fully-tagged corpus.

The linguistic tagging process is complicated by various factors, including the use of SGML markup, the wish to preserve all the features recorded in the texts, and the open-ended nature of the texts themselves -- personal and place names, long words, foreign expressions, unusual orthography, and so on all have to be handled. The annotation of Spoken Corpus data is particularly challenging. Manual post-editing of Claws output is a necessary part of the process. It is simplified to some extent by the use of "portmanteau" tags, which present two alternative part-of-speech codes in one tag at points where automatic analysis could not make a final decision (see the sketch at the end of this abstract). In addition to hand-editing, there are mechanisms for checking output from the processing and the form of the resulting text -- the segment structure and the part-of-speech markings.

Once linguistic annotation is complete, texts are returned to OUCS for further CDIF checks, the addition of bibliographic information, and, finally, accession to the BNC. The production phase of the project is due to end in April 1994, after which the corpus will become available for academic and some commercial research. Some software tools for analysing the corpus will also be distributed.

At this panel session, three speakers working on the BNC project will discuss the foregoing issues and problems in more detail, and will also review the current state of the BNC and the effectiveness of the processing strategies adopted.
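
As a concrete illustration of the portmanteau convention described above, the Python sketch below shows one plausible way a tagger might fall back on a combined tag when two candidate analyses score too closely to separate. The probability figures, the function, and the 0.2 margin are invented for the example; only the C5 tag names (AJ0 for adjective, NN1 for singular common noun) and the hyphenated two-code form follow BNC practice.

    AMBIGUITY_MARGIN = 0.2   # invented threshold, not Claws' actual figure

    def choose_tag(candidates):
        """candidates: (tag, probability) pairs from automatic analysis.
        Emit a portmanteau tag when the top two scores are too close for
        a confident decision, leaving the choice to manual post-editing."""
        ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
        (best, p1), (second, p2) = ranked[0], ranked[1]
        if p1 - p2 < AMBIGUITY_MARGIN:
            return best + "-" + second   # portmanteau tag, e.g. AJ0-NN1
        return best

    print(choose_tag([("AJ0", 0.48), ("NN1", 0.42)]))   # -> AJ0-NN1
    print(choose_tag([("NN1", 0.90), ("VVB", 0.05)]))   # -> NN1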