The British National Corpus

The British National Corpus
Consortium member: The British Library Board
Consortium member: W & R Chambers Ltd.
Consortium member: Longman Group UK Ltd.
Lead partner in consortium: Oxford University Press
Consortium member: The University of Lancaster
Consortium member: The University of Oxford
0.1
One hundred million words: ninety million from written sources, ten million from spoken sources.
Archive site: Oxford University Computing Services

13 Banbury Road, Oxford OX2 6NN U.K. Telephone: +44 491 273280 Facsimile: +44 491 273275 Internet mail: natcorp@ox.ac.uk

bnc0.1

Available at nominal charge for academic research purposes throughout the EC subject to a signed permissions agreement having been received by Oxford University Computing Services, from whom blank forms and supporting materials are available.

Availability for commercial research and exploitation only where terms have been agreed with the BNC Consortium Exploitation Committee. Apply in the first instance to Oxford University Computing Services.

1993-04-17

We have yet to think of anything to put here. But we're keeping it, just in case.

The corpus, considered as a text in its own right, has no source: it was originated in electronic form.

See the source descriptions to component texts in order to trace the sources of those texts.

The British National Corpus project is a pre-competitive collaboration between commercial and academic partners in the U.K. running from 1991 to 1994. Its purpose is to assemble a corpus of written and spoken modern British English, balanced according to agreed criteria, marked up in a manner conformant with the recommendations of the Text Encoding Initiative, and with each word tagged according to its part of speech.

The resulting resource will be available, subject to copyright restrictions, at nominal cost for use in academic research in the EC; and, subject to agreement on licensing and to copyright restrictions, for commercial use.

Funding for the British National Corpus has been provided by the commercial partners in the consortium, by the U.K. Department of Trade and Industry, by the Science and Engineering Research Council, and by the British Library.

Where a source text is a book, no more than 40,000 words are sampled. This is true even where the book contains a collection of works from a selection of authors. The sample begins and ends at a chapter-level boundary unless this would make the sample unacceptably short. In such cases, the sample begins at a chapter-level boundary, and ends at an arbitrary point. The remainder of the chapter-level unit is marked as deleted with a del tag.
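The book-sampling rule above can be sketched roughly as follows. This is an illustration only, not the project's own tooling: the function name sample_book, the representation of a book as a list of chapters (each a list of words), and the min_words threshold standing in for "unacceptably short" are all assumptions made for the example.

```python
MAX_WORDS = 40_000  # per-book limit stated in the text


def sample_book(chapters, start=0, min_words=20_000):
    """Return (sampled_words, deleted_words) for a sample starting at chapter `start`.

    chapters:  list of chapters, each a list of word strings (assumed representation).
    min_words: hypothetical threshold below which a sample counts as
               "unacceptably short"; the source does not give a figure.
    """
    sample = []
    for chapter in chapters[start:]:
        if len(sample) + len(chapter) <= MAX_WORDS:
            # The whole chapter fits, so the sample can still end on a boundary.
            sample.extend(chapter)
            continue
        if len(sample) >= min_words:
            # Long enough already: stop at the previous chapter-level boundary.
            return sample, []
        # Too short: end at an arbitrary point inside this chapter.
        room = MAX_WORDS - len(sample)
        # The remainder of the chapter is what would be marked with a del tag.
        return sample + chapter[:room], chapter[room:]
    return sample, []
```

The second element of the returned pair is empty whenever the sample ends cleanly on a chapter-level boundary.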

No more than 120,000 words from any one author, whether individual or corporate, and whether writing individually or collaboratively, appear in the corpus. Where the author of an analytic or monographic work cannot be identified, it is assumed that the author is not represented elsewhere in the corpus.
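A minimal sketch of the per-author limit, under simplifying assumptions: each candidate text is reduced to a single author key and a word count, collaborative authorship is not modelled, and the name select_within_author_cap is hypothetical. Texts whose author cannot be identified (represented here as None) are accepted without counting, following the assumption stated above that such an author is not represented elsewhere.

```python
from collections import defaultdict

AUTHOR_CAP = 120_000  # per-author limit stated in the text


def select_within_author_cap(candidates):
    """Filter (author, word_count) pairs so no author exceeds AUTHOR_CAP.

    An author of None stands for an unidentified author (assumed encoding).
    """
    totals = defaultdict(int)
    accepted = []
    for author, words in candidates:
        if author is None:
            # Unidentified author: assumed not represented elsewhere.
            accepted.append((author, words))
            continue
        if totals[author] + words <= AUTHOR_CAP:
            totals[author] += words
            accepted.append((author, words))
        # Otherwise the text is skipped because it would breach the cap.
    return accepted
```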

Front and back matter from written material is generally not captured. Where it is, front and back tags are used.

Where a source text is a book shorter than 40,000 words, the whole text is captured, and ten per cent is then excised. The structure is preserved, and deleted material is marked with del tags.

The corpus contains approximately equal numbers of beginning, middle and end samples.


Where the source text is a magazine or newspaper, the whole of the editorial text is captured.


The length of demographically-sampled spoken samples is limited by recording technology to ninety minutes. No word-count limit is applied.

Samples of context-governed spoken material are truncated to no longer than 40,000 words. There is no indication that material has been deleted if truncation has taken place.

When noticed during encoding, errors or suspected errors in the original text are tagged with sic.

No normalization applied.

Transcription uses standard English spelling, and does not reflect pronunciation, except for a control list of dialectal forms and vocalized pauses.

Open quote marks, however rendered in the source, are represented by the entity &bquo; in the electronic text; close quote marks by &equo;.
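As a small illustration of this convention, the sketch below maps typographic double quotation marks to the two entity references. It is deliberately narrow: the function name normalize_quotes is hypothetical, and straight quotes and single quotes are left untouched because telling a closing single quote from an apostrophe needs context that a character-for-character mapping does not have.

```python
# Map typographic double quotes to the entity references named in the text.
# Only U+201C and U+201D are handled in this sketch.
_QUOTE_TABLE = str.maketrans({
    "\u201c": "&bquo;",  # left double quotation mark  -> open quote entity
    "\u201d": "&equo;",  # right double quotation mark -> close quote entity
})


def normalize_quotes(text):
    """Replace curly double quotes with &bquo; / &equo; entity references."""
    return text.translate(_QUOTE_TABLE)
```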

Quotation in spoken material may be represented using shift tags.

Line-end hyphens are elided if they occur in a word appearing in a control list of frequent words, in a word which appears un-hyphenated elsewhere in the electronic text, or if manual editing identifies them as soft. Line-end hyphens are retained if they occur in a form that appears as hyphenated (other than at the end of a line) elsewhere in the electronic text, or if manual editing identifies them as hard. Where no decision can be made, line-end hyphens are represented by the entity &rehy;.
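The decision procedure in this declaration might be sketched as below. The function and parameter names are hypothetical, the control list and the set of forms found elsewhere in the text are assumed to be lower-cased sets supplied by the caller, and checking the elision conditions before the retention conditions is an assumption: the declaration does not say which takes precedence when both apply.

```python
def resolve_line_end_hyphen(left, right, frequent_words, text_forms, manual=None):
    """Decide how to render a word split as `left-` / `right` across a line break.

    frequent_words: control list of frequent words (lower-cased set).
    text_forms:     forms occurring elsewhere in the text, not at a line end
                    (lower-cased set; may contain hyphenated forms).
    manual:         optional editor's verdict, "soft" or "hard".
    """
    joined = left + right
    hyphenated = left + "-" + right
    # Elide: frequent word, attested un-hyphenated, or judged soft by an editor.
    if (joined.lower() in frequent_words
            or joined.lower() in text_forms
            or manual == "soft"):
        return joined
    # Retain: attested hyphenated elsewhere, or judged hard by an editor.
    if hyphenated.lower() in text_forms or manual == "hard":
        return hyphenated
    # Undecided: mark the hyphen with the entity reference.
    return left + "&rehy;" + right
```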

All line-end hyphens are elided. Subsequent editing reinstates them if the resulting form is not a dictionary word, or if manual intervention is called in to correct an anomaly identified during text enrichment.
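The automatic part of this alternative policy is simpler and can be sketched in a few lines. The name elide_then_reinstate and the dictionary argument (a lower-cased set of known words) are assumptions; the manual correction step mentioned above is not modelled.

```python
def elide_then_reinstate(left, right, dictionary):
    """Join the two halves; put the hyphen back if the result is not a known word."""
    joined = left + right
    if joined.lower() in dictionary:
        return joined            # elision produced a dictionary word: keep it
    return left + "-" + right    # otherwise reinstate the hyphen
```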

Inapplicable; or source material contains no line-end hyphenation.

Part-of-speech information corresponding to the CLAWS C5 tag set is appended to each word and punctuation mark in the content of the electronic text, whether it is part of the source text or results from editorial intervention. Part-of-speech information is not given for words appearing as attribute values (for example, as a result of regularization).

Overlapping speech is marked when two or three speakers are speaking simultaneously. The fourth and subsequent simultaneous utterances are not marked.

Part-of-speech information corresponding to the CLAWS C6 tag set is appended to each word and punctuation mark in the content of the electronic text, whether it is part of the source text or results from editorial intervention. Part-of-speech information is not given for words appearing as attribute values (for example, as a result of regularization).


Segmentation divides the entire electronic text into arbitrary units. These units correspond closely to orthographic sentences where the source text (as amended by editing) is sentential.

No standard values are supplied.

Text type: Written; Spoken
Medium (written only): Books & periodicals; Miscellaneous; Written to be spoken
Domain (written only): Imaginative; Applied science; Arts; Belief & thought; Commerce & finance; Leisure; Natural & pure science; Social science; World affairs
Level (written only): High; Medium; Low
Sample type (written only): Beginning; Middle; End; Whole; Whole less ten percent
Date of origination/publication (written only): 1960–1975; 1975–1993
Publication status (written miscellaneous only): Published; Unpublished
Selection method (written books & periodicals only): Chosen on grounds of circulation/influence; Chosen at random
Medium (spoken only): Demographic; Context-governed
Region (spoken only): North; Middle; South
Respondent social class (spoken demographic only): AB; C1; C2; DE
Respondent age (spoken demographic only): 16–25; 26–40; 41–55; 56+
Respondent gender (spoken demographic only): Female; Male
Domain (spoken context-governed only): Educational/informative; Business; Public/institutional; Leisure
Interaction (spoken context-governed only): Monologue; Dialogue

Modern British English

The language of the British National Corpus is modern British English. Words, fragments, and passages from many other languages, both ancient and modern, occur within the corpus where these may be represented using a Latin alphabet. Long passages in these languages, and material in other languages, are generally silently deleted. In no case is the lang attribute used to indicate the language of a word, phrase or passage, nor are alternate writing system definitions used.

Fred: British English; East Midlands dialect; Rushden, Northants, England; To age 14; Retired
Florence: British English; East Midlands dialect; Retired
Steven: British English; East Midlands dialect; Office manager
Emily: British English; East Midlands dialect; Student
Sandra: British English; East Midlands dialect; Housewife
Julie: British English; East Midlands dialect; Student

1993-05-17 DFD: Extremely preliminary release

Wings - an electronic sample
Data capture: Oxford University Press
Encoding, storage and distribution: Oxford University Computing Services
Text enrichment: Unit for Computer Research into the English Language, University of Lancaster
1.0
37875 words, 460 kbytes
Archive site: Oxford University Computing Services

13 Banbury Road, Oxford OX2 6NN U.K. Telephone: +44 491 273280 Facsimile: +44 491 273275 Internet mail: natcorp@ox.ac.uk

A73 Wingss

Additional restrictions relating to a particular work (if any) are summarized here.

Available only as part of the British National Corpus at nominal charge for academic research purposes throughout the EC subject to a signed permissions agreement having been received by Oxford University Computing Services, from whom blank forms and supporting materials are available.

1993-03-17

We haven't yet thought of anything to go here, either.

Terry Pratchett. Wings. First paperback edition, published 1991, reprinted 1992. Corgi, 1992. ISBN 0 552 52649 5. pp. 13–172.

See project description in corpus header for information about the British National Corpus project.

Any editorial practice specific to a single text is described here. All other practices are referenced through decls on the text tag or by default.

1993-03-17 OUP: Passed to OUCS
1993-04-07 OUCS: Passed to Lancaster
1993-05-30 Lancaster: Passed to OUCS
1993-06-15 OUCS: Accession to corpus

…

The Guardian, edition of 1989-11-08 - an electronic collection of material related to world affairs
Data capture: Oxford University Press
Encoding, storage and distribution: Oxford University Computing Services
Text enrichment: Unit for Computer Research into the English Language, University of Lancaster
1.0
850 words, 12 kbytes
Archive site: Oxford University Computing Services

13 Banbury Road, Oxford OX2 6NN U.K. Telephone: +44 491 273280 Facsimile: +44 491 273275 Internet mail: natcorp@ox.ac.uk

B9H GaWldA

Additional restrictions relating to a particular analytic text or texts (if any) are summarized here. The decls attribute cross-references the analytic texts affected. Further paragraphs may summarize different restrictions applying to different analytic texts.

1993-03-30

Quote… [The Guardian, electronic edition of 1989-11-08] Guardian Newspapers Ltd. 0261 3077 23 The Guardian
Diary Andrew Moncur [The Guardian, electronic edition of 1989-11-08] Guardian Newspapers Ltd. 0261 3077 23 The Guardian

See project description in corpus header for information about the British National Corpus project.

Any editorial practice specific to a single text is described here. All other practices are referenced through decls on the text tag or by default.

1992-02-19 OUP: Passed to OUCS
1992-04-03 OUCS: Passed to Lancaster
1993-03-30 Lancaster: Passed to OUCS
1993-06-02 OUCS: Accession to corpus

Quote…

…

Diary Andrew Moncur

…

[Spoken material from respondent Fred, sample 026211]
Data capture and transcription: Longman Dictionaries
Encoding, storage and distribution: Oxford University Computing Services
Text enrichment: Unit for Computer Research into the English Language, University of Lancaster
1.0
992 words, 20 kbytes
Archive site: Oxford University Computing Services

13 Banbury Road, Oxford OX2 6NN U.K. Telephone: +44 491 273280 Facsimile: +44 491 273275 Internet mail: natcorp@ox.ac.uk

QA0 026211

1992-03-15 17:05

Recorded by respondent on Walkman compact cassette recorder; dubbed for archival to Digital Audio Tape at 48 kHz sampling rate; redubbed for transcription to compact cassette.

See project description in corpus header for information about the British National Corpus project.

Rushden, Northants, U.K.; 17:05; Home; Making tea/playing games

1993-01-15 Longman: Passed to OUCS
1993-04-04 OUCS: Passed to Lancaster
1993-05-31 Lancaster: Passed to OUCS
1993-06-30 OUCS: Accession to corpus

… … …