Available at nominal charge for academic research purposes throughout the EC subject to a signed permissions agreement having been received by Oxford University Computing Services, from whom blank forms and supporting materials are available.
Availability for commercial research and exploitation only where terms have been agreed with the BNC Consortium Exploitation Committee. Apply in the first instance to Oxford University Computing Services.
We have yet to think of anything to put here. But we're keeping it, just in case.
The corpus, considered as a text in its own right, has no source: it was originated in electronic form.
See the source descriptions to component texts in order to trace the sources of those texts.
The British National Corpus project is a pre-competitive collaboration between commercial and academic partners in the U.K. running from 1991 to 1994. Its purpose is to assemble a corpus of written and spoken modern British English, balanced according to agreed criteria, marked up in a manner conformant with the recommendations of the Text Encoding Initiative, and with each word tagged according to its part of speech.
The resulting resource will be available, subject to copyright restrictions, at nominal cost for use in academic research in the EC; and, subject to agreement on licensing and to copyright restrictions, for commercial use.
Funding for the British National Corpus has been provided by the commercial partners in the consortium, by the U.K. Department of Trade and Industry, by the Science and Engineering Research Council, and by the British Library.
Where a source text is a book, no more than 40,000 words are sampled. This is true even where the book contains a collection of works from a selection of authors. The sample begins and ends at a chapter-level boundary unless this would make the sample unacceptably short. In such cases, the sample begins at a chapter-level boundary, and ends at an arbitrary point. The remainder of the chapter-level unit is marked as deleted with a del tag.
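The book-sampling rule can be illustrated with a minimal sketch. It is not the project's own procedure. The function name `sample_book`, the `min_words` threshold standing in for "unacceptably short", and the choice of the 40,000-word cap as the "arbitrary point" are all assumptions made for illustration.

```python
from typing import List, Tuple

MAX_WORDS = 40_000  # cap stated in the sampling rule


def sample_book(chapter_lengths: List[int], start: int = 0,
                min_words: int = 10_000) -> Tuple[List[Tuple[int, int]], int]:
    """Return (kept, deleted_words) for one book.

    kept is a list of (chapter_index, words_kept) pairs.
    deleted_words is the size of the chapter remainder that would be
    marked as deleted. min_words is a hypothetical stand-in for the
    undefined notion of an "unacceptably short" sample.
    """
    kept: List[Tuple[int, int]] = []
    total = 0
    for i in range(start, len(chapter_lengths)):
        length = chapter_lengths[i]
        if total + length <= MAX_WORDS:
            # The whole chapter fits under the cap.
            kept.append((i, length))
            total += length
            continue
        if total >= min_words:
            # Long enough already: end at the chapter-level boundary.
            return kept, 0
        # Too short: take part of this chapter (assumption: up to the cap)
        # and report the rest of the chapter as deleted.
        take = MAX_WORDS - total
        kept.append((i, take))
        return kept, length - take
    return kept, 0
```

For example, `sample_book([5_000, 60_000])` keeps the first chapter whole, takes 35,000 words of the second, and reports the remaining 25,000 words as deleted.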
No more than 120,000 words from any one author, whether individual or corporate, and whether writing individually or collaboratively, appear in the corpus. Where the author of an analytic or monographic work cannot be identified, it is assumed that the author is not represented elsewhere in the corpus.
Front and back matter from written material is generally not captured. Where it is, front and back tags are used.
Where a source text is a book shorter than 40,000 words, the whole text is captured, and ten per cent is then excised. The structure is preserved, and deleted material is marked with del tags.
The corpus contains approximately equal numbers of beginning, middle and end samples.
No more than 120,000 words from any one author, whether individual or corporate, and whether writing individually or collaboratively, appear in the corpus. Where the author of an analytic or monographic work cannot be identified, it is assumed that the author is not represented elsewhere in the corpus.
Front and back matter from written material is generally not captured. Where it is, front and back tags are used.
Where the source text is a magazine or newspaper, the whole of the editorial text is captured.
No more than 120,000 words from any one author, whether individual or corporate, and whether writing individually or collaboratively, appear in the corpus. Where the author of an analytic or monographic work cannot be identified, it is assumed that the author is not represented elsewhere in the corpus.
Front and back matter from written material is generally not captured. Where it is, front and back tags are used.
The length of demographically-sampled spoken samples is limited by recording technology to ninety minutes. No word-count limit is applied.
Samples of context-governed spoken material are truncated to no longer than 40,000 words. There is no indication that material has been deleted if truncation has taken place.
When noticed during encoding, errors or suspected errors in the original text are tagged with sic.
No normalization is applied.
Transcription uses standard English spelling, and does not reflect pronunciation, except for a control list of dialectal forms and vocalized pauses.
Open quote marks, however rendered in the source, are represented by the entity &bquo in the electronic text; close quote marks by &equo.
Quotation in spoken material may be represented using shift tags.
Line-end hyphens are elided if they occur in a word appearing in a control list of frequent words, in a word which appears un-hyphenated elsewhere in the electronic text, or if manual editing identifies them as soft. Line-end hyphens are retained if they occur in a form that appears as hyphenated (other than at the end of a line) elsewhere in the electronic text, or if manual editing identifies them as hard. Where no decision can be made, line-end hyphens are represented by the entity &rehy.
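The decision procedure described above can be sketched as follows. This is an illustrative reading only: the function and parameter names are hypothetical, and the order of precedence (manual judgement first, then the elision tests, then the retention test) is an assumption, since the declaration does not say which test wins when more than one applies.

```python
from enum import Enum
from typing import Optional, Set


class Hyphen(Enum):
    ELIDE = "elide"      # join the two halves without a hyphen
    RETAIN = "retain"    # keep the hyphen as a hard hyphen
    UNDECIDED = "&rehy"  # no decision possible: use the entity


def resolve_line_end_hyphen(left: str, right: str,
                            frequent_words: Set[str],
                            unhyphenated_forms: Set[str],
                            hyphenated_forms: Set[str],
                            manual: Optional[str] = None) -> Hyphen:
    """Classify a line-end hyphen between `left` and `right`.

    `manual` is "soft", "hard", or None (no editorial judgement).
    Assumption: a manual judgement overrides the automatic tests.
    """
    joined = (left + right).lower()
    hyphenated = (left + "-" + right).lower()
    if manual == "soft":
        return Hyphen.ELIDE
    if manual == "hard":
        return Hyphen.RETAIN
    # Control list of frequent words, or seen un-hyphenated elsewhere.
    if joined in frequent_words or joined in unhyphenated_forms:
        return Hyphen.ELIDE
    # Seen hyphenated elsewhere (not at a line end).
    if hyphenated in hyphenated_forms:
        return Hyphen.RETAIN
    return Hyphen.UNDECIDED
```

Under these assumptions, `resolve_line_end_hyphen("well", "known", set(), set(), {"well-known"})` yields `Hyphen.RETAIN`, while a form found in none of the lists yields `Hyphen.UNDECIDED`.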
All line-end hyphens are elided. Subsequent editing reinstates them if the resulting form is not a dictionary word, or if manual intervention is called in to correct an anomaly identified during text enrichment.
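A minimal sketch of this alternative policy, assuming a hypothetical `dictionary` set of known word forms. The later manual correction mentioned in the declaration is not modelled here.

```python
from typing import Set


def elide_then_reinstate(left: str, right: str, dictionary: Set[str]) -> str:
    """Elide the line-end hyphen, then put it back if the result is unknown.

    `dictionary` is a hypothetical set of lower-cased word forms.
    """
    joined = left + right
    if joined.lower() in dictionary:
        return joined
    # The joined form is not a dictionary word: reinstate the hyphen.
    return left + "-" + right
```

With `dictionary = {"information"}`, the halves `infor` and `mation` join to `information`, whereas `self` and `evident` come back as `self-evident`.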
Inapplicable; or source material contains no line-end hyphenation.
Part-of-speech information corresponding to the CLAWS C5 tag set is appended to each word and punctuation mark in the content of the electronic text, whether it is part of the source text or results from editorial intervention. Part-of-speech information is not given for words appearing as attribute values (for example, as a result of regularization).
Part-of-speech information corresponding to the CLAWS C5 tag set is appended to each word and punctuation mark in the content of the electronic text, whether it is part of the source text or results from editorial intervention. Part-of-speech information is not given for words appearing as attribute values (for example, as a result of regularization).
Overlapping speech is marked when two or three speakers are speaking simultaneously. The fourth and subsequent simultaneous utterances are not marked.
Part-of-speech information corresponding to the CLAWS C6 tag set is appended to each word and punctuation mark in the content of the electronic text, whether it is part of the source text or results from editorial intervention. Part-of-speech information is not given for words appearing as attribute values (for example, as a result of regularization).
Part-of-speech information corresponding to the CLAWS C6 tag set is appended to each word and punctuation mark in the content of the electronic text, whether it is part of the source text or results from editorial intervention. Part-of-speech information is not given for words appearing as attribute values (for example, as a result of regularization).
Overlapping speech is marked when two or three speakers are speaking simultaneously. The fourth and subsequent simultaneous utterances are not marked.
Segmentation divides the entire electronic text into arbitrary units. These units correspond closely to orthographic sentences where the source text (as amended by editing) is sentential.
No standard values are supplied.
The language of the British National Corpus is modern British English. Words, fragments, and passages from many other languages, both ancient and modern, occur within the corpus where these may be represented using a Latin alphabet. Long passages in these languages, and material in other languages, are generally silently deleted. In no case is the lang attribute used to indicate the language of a word, phrase or passage, nor are alternate writing system definitions used.
Additional restrictions relating to a particular work (if any) are summarized here.
Available only as part of the British National Corpus at nominal charge for academic research purposes throughout the EC subject to a signed permissions agreement having been received by Oxford University Computing Services, from whom blank forms and supporting materials are available.
Availability for commercial research and exploitation only where terms have been agreed with the BNC Consortium Exploitation Committee. Apply in the first instance to Oxford University Computing Services.
We haven't yet thought of anything to go here, either.
See project description in corpus header for information about the British National Corpus project.
Any editorial practice specific to a single text is described here. All other practices are referenced through decls on the text tag or by default.
…
Additional restrictions relating to a particular analytic text or texts (if any) are summarized here. The decls attribute cross-references the analytic texts affected. Further paragraphs may summarize different restrictions applying to different analytic texts.
Available only as part of the British National Corpus at nominal charge for academic research purposes throughout the EC subject to a signed permissions agreement having been received by Oxford University Computing Services, from whom blank forms and supporting materials are available.
Availability for commercial research and exploitation only where terms have been agreed with the BNC Consortium Exploitation Committee. Apply in the first instance to Oxford University Computing Services.
See project description in corpus header for information about the British National Corpus project.
Any editorial practice specific to a single text is described here. All other practices are referenced through decls on the text tag or by default.
…
…
Available only as part of the British National Corpus at nominal charge for academic research purposes throughout the EC subject to a signed permissions agreement having been received by Oxford University Computing Services, from whom blank forms and supporting materials are available.
Availability for commercial research and exploitation only where terms have been agreed with the BNC Consortium Exploitation Committee. Apply in the first instance to Oxford University Computing Services.
Recorded by respondent on Walkman compact cassette recorder; dubbed for archival to Digital Audio Tape at 48 kHz sampling rate; redubbed for transcription to compact cassette.
See project description in corpus header for information about the British National Corpus project.