British National Corpus

British National Corpus The National Corpus Initiative Lead Organization Oxford University Press Participating Organization The British Library Chambers Ltd. Unit for Computer Research on the English Language Lancaster University Longman Group UK Ltd. Oxford University Computing Services The University of Oxford Chambers Ltd. The Department of Trade and Industry Longman UK Ltd. Oxford University Press The Science & Engineering Research Council
Version 0.1 of 1991-10-17. This development version of the Corpus is not a published work.
This Corpus is available from: British National Corpus Project, Oxford University Computing Services, 13, Banbury Road, Oxford OX2 6NN, U.K. Telephone: +44 865 273280. Facsimile: +44 865 273275. E-mail: natcorp@vax.ox.ac.uk
This development version is not for distribution in whole or in part outside the National Corpus Initiative. The individual texts in this Corpus contain copyright material, and may be used only in accordance with the permissions granted for their use. These permissions may vary from text to text, and are noted explicitly, or by reference to entities in this Corpus header, in individual text headers. It is the responsibility of each user of the Corpus to ensure that all such conditions are observed for Corpus materials in their possession.
Texts having this permission may be used freely and without restriction. Texts having this permission may be used for the purposes of bona fide research by registered users of the Corpus anywhere in the world. Reproduction in published works must be limited to short verbatim excerpts used for illustrative purposes only. See individual text headers.
The primary initial application area of the British National Corpus is lexicography, but use in the following areas is also anticipated and encouraged: 1. Reference (book) publishing. 2. Academic linguistic research. 3. Language teaching. 4. Artificial intelligence. 5. Natural language processing. 6. Information retrieval. 
The following types of linguistic information may be derived from the Corpus: 1. Lexical. 2. Semantic/pragmatic. 3. Syntactic. 4. Morphological. 5. Graphological/written form/orthographical. Written material collected in accordance with specification given in document BNCW08. (See also sampling declarations.) Spoken material collected from sample of UK population in accordance with specification given in document TGAW14. Spoken material captured in selected contexts (situations, locations) in accordance with specification given in document TGAW14. Captured texts are converted to Corpus Document Interchange (CDIF) markup. This markup is enriched by the identification and tagging of a variety of features and structures, and then subjected to word-class tagging and segment marking. Imaginative written material constitutes 20-30% of the written part of the Corpus -- that is, 18-27 million words. The decision to classify a text as imaginative is made on the basis of information available at the time of data capture. Imaginative material is not subdivided in any way -- for example, into genres. Informative written material dealing primarily with subject matter related to natural and pure science constitutes 3-7% of the written part of the Corpus -- that is, 2.7-6.3 million words. The decision to classify a text under this heading is made on the basis of information available at the time of data capture. Informative written material dealing primarily with subject matter related to applied science constitutes 3-7% of the written part of the Corpus -- that is, 2.7-6.3 million words. The decision to classify a text under this heading is made on the basis of information available at the time of data capture. Informative written material dealing primarily with subject matter related to social science constitutes 13-17% of the written part of the Corpus -- that is, 11.7-15.3 million words. 
The decision to classify a text under this heading is made on the basis of information available at the time of data capture. Informative written material dealing primarily with subject matter related to world affairs constitutes 13-17% of the written part of the Corpus -- that is, 11.7-15.3 million words. The decision to classify a text under this heading is made on the basis of information available at the time of data capture. Informative written material dealing primarily with subject matter related to commerce and finance constitutes 8-12% of the written part of the Corpus -- that is, 7.2-10.8 million words. The decision to classify a text under this heading is made on the basis of information available at the time of data capture. Informative written material dealing primarily with subject matter related to the arts constitutes 8-12% of the written part of the Corpus -- that is, 7.2-10.8 million words. The decision to classify a text under this heading is made on the basis of information available at the time of data capture. Informative written material dealing primarily with subject matter related to belief and thought constitutes 3-7% of the written part of the Corpus -- that is, 2.7-6.3 million words. The decision to classify a text under this heading is made on the basis of information available at the time of data capture. Informative written material dealing primarily with subject matter related to leisure constitutes 8-12% of the written part of the Corpus -- that is, 7.2-10.8 million words. The decision to classify a text under this heading is made on the basis of information available at the time of data capture. Material first published (or, in the case of unpublished works, apparently written) between 1960 and 1974 inclusive makes up between 23 and 27% of the imaginative material in the Corpus -- that is, 4.1-7.2 million words. This is between 4.6 and 8.1% of the written part of the Corpus. 
Sampling procedures do not attempt to balance the Corpus with respect to the number of texts published or written in particular years. Material first published (or, in the case of unpublished works, apparently written) between 1975 and 1993 inclusive makes up between 72 and 77% of the imaginative material in the Corpus -- that is, 13.1-20.8 million words. This is between 14.6 and 23.1% of the written part of the Corpus. All informative material in the Corpus is first published (or, in the case of unpublished works, apparently written) between 1975 and 1993 inclusive. This material makes up 70 to 80% of the written part of the Corpus -- 62-72 million words. Sampling procedures do not attempt to balance the Corpus with respect to the number of texts published or written in particular years. Between 26 and 34% of the material in the written part of the Corpus -- 23.4-30.6 million words -- is taken from published books randomly selected from comprehensive UK publication catalogues. Between 26 and 34% of the material in the written part of the Corpus -- 23.4-30.6 million words -- is taken from published books selected by OUP staff with the aim of balancing selection criteria, and of providing variety in the areas covered by classification criteria. Between 8 and 17% of the material in the written part of the Corpus -- 7.2-15.3 million words -- is taken from published periodicals randomly selected from comprehensive UK publication catalogues. Between 8 and 17% of the material in the written part of the Corpus -- 7.2-15.3 million words -- is taken from published periodicals selected by OUP staff with the aim of balancing selection criteria, and of providing variety in the areas covered by classification criteria. Between 5 and 10% of the material in the written part of the Corpus -- 4.5-9 million words -- is taken from miscellaneous published material (brochures, leaflets, manuals, advertisements etc.) collected ... (HOW?) 
Between 5 and 10% of the material in the written part of the Corpus -- 4.5-9 million words -- is taken from miscellaneous unpublished material (letters, memos, reports, minutes, essays etc.) collected ... (HOW?) Between 2 and 7% of the material in the written part of the Corpus -- 1.8-6.3 million words -- is taken from material written to be spoken -- speeches, plays, broadcast programme scripts etc. Between 31 and 35% of the imaginative works in the written part of the Corpus are judged, according to information available at the time of data capture, to be targeted primarily to an intellectual and highly literate audience (a ``high-brow'' audience). This category accounts for between 6.2 and 10.5% of the written part of the Corpus -- 5.6-9.5 million words. Between 28 and 32% of the informative works in the written part of the Corpus are judged, according to information available at the time of data capture, to be targeted primarily to an audience which has a high degree of knowledge of the topic covered. This category accounts for between 20 and 26% of the written part of the Corpus -- 18-23.4 million words. Between 31 and 35% of the imaginative works in the written part of the Corpus are judged, according to information available at the time of data capture, to be targeted primarily to an audience of average literacy and intellect (a ``middle-brow'' audience). This category accounts for between 6.2 and 10.5% of the written part of the Corpus -- 5.6-9.5 million words. Between 48 and 52% of the informative works in the written part of the Corpus are judged, according to information available at the time of data capture, to be targeted primarily to an audience which has some knowledge of the topic covered, or of related topics. This category accounts for between 34 and 42% of the written part of the Corpus -- 30.6-37.8 million words. 
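The word counts quoted alongside each percentage band follow from simple arithmetic. The sketch below illustrates that arithmetic only; it assumes a written part of 90 million words, a figure this header never states directly but which is implied by "20-30% ... that is, 18-27 million words". The function names are hypothetical and used purely for illustration.

```python
# Assumption: the written part totals 90 million words. The header does not
# say so explicitly; the figure is inferred from "20-30% = 18-27 million words".
WRITTEN_PART_MILLIONS = 90.0


def band_to_millions(low_pct, high_pct, total=WRITTEN_PART_MILLIONS):
    """Convert a percentage band of the written part into millions of words."""
    return (round(total * low_pct / 100, 1), round(total * high_pct / 100, 1))


def nested_band(outer, inner):
    """Express a band `inner` (a share of category `outer`) as a share of the whole.

    Both arguments are (low_pct, high_pct) tuples.
    """
    return (outer[0] * inner[0] / 100, outer[1] * inner[1] / 100)


# Imaginative material: 20-30% of the written part.
print(band_to_millions(20, 30))            # (18.0, 27.0)

# ``High-brow'' imaginative works: 31-35% of the imaginative 20-30%.
print(nested_band((20, 30), (31, 35)))     # (6.2, 10.5)
```

Because each declaration gives a band rather than a single value, the lower and upper bounds are carried through separately; the quoted ranges are therefore the widest the stated percentages allow, not a prediction of the final composition.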
Between 31 and 35% of the imaginative works in the written part of the Corpus are judged, according to information available at the time of data capture, to require little in the way of literacy or intellect from their audience (a ``low-brow'' audience). This category accounts for between 6.2 and 10.5% of the written part of the Corpus -- 5.6-9.5 million words. Between 18 and 22% of the informative works in the written part of the Corpus are judged, according to information available at the time of data capture, to be targeted primarily to a popular audience and to assume little or no knowledge of the topic covered. This category accounts for between 13 and 18% of the written part of the Corpus -- 11.7-16.2 million words. Between 31 and 35% of Corpus texts which, in their original form, are longer than approximately 40,000 words, are represented in the Corpus by a sample of approximately 40,000 words starting at the beginning of the original text and continuing, without omission (except as noted under editorial principles) until a convenient structural breakpoint (for example, the end of a chapter) approximately 40,000 words later. There is no sampling criterion determining the ratio between texts sampled in this way and texts which are included in their entirety or as collections. Between 31 and 35% of Corpus texts which, in their original form, are longer than approximately 40,000 words, are represented in the Corpus by a sample of approximately 40,000 words starting at a randomly-chosen, but structurally-convenient point in the original text (for example, the beginning of a chapter other than the first) and continuing, without omission (except as noted under editorial principles) until a convenient structural breakpoint (for example, the end of a chapter, but not the end of the text) approximately 40,000 words later. There are no criteria determining the ratio between texts sampled in this way and texts which are included in their entirety or as collections. 
Between 31 and 35% of Corpus texts which, in their original form, are longer than approximately 40,000 words, are represented in the Corpus by a sample of approximately 40,000 words starting at a structurally-convenient point in the original text (for example, the beginning of a chapter) approximately 40,000 words from the end of the text and continuing, without omission (except as noted under editorial principles) until the end of the text. There are no criteria determining the ratio between texts sampled in this way and texts which are included in their entirety or as collections. Single written texts which are shorter than approximately 40,000 words are included in the Corpus in their entirety (except for omissions as noted under editorial principles). There are no criteria determining the ratio between full texts and texts which are represented by a 40,000 word sample, or which form part of a collection. Individual texts which are considerably shorter than 40,000 words may be combined by the Corpus compilers with other short texts which are alike with respect to selection features, and, where possible, with respect to classification features, forming collections totaling no more than approximately 40,000 words. It should be noted that texts which are themselves collections (anthologies of poetry, conference proceedings etc.) when received by the Corpus compilers are not classified under this heading. There are no criteria determining the ratio between texts which form part of a collection, full texts, and texts which are represented by a 40,000 word sample. On the basis of information available at the time of data capture, at least one author of the written text is judged to be well-known. (Has previously published successfully; is a media, sporting, political, military, academic or religious personality; has won a prize; has been the subject of media interest; has featured in a best-seller list...) 
While sampling procedures ensure that this author status is represented in the Corpus, there is no criterion governing the ratio of the occurrence of this author status to that of others. On the basis of information available at the time of data capture, none of the authors of the written text is judged to be well-known. While sampling procedures ensure that this author status is represented in the Corpus, there is no criterion governing the ratio of the occurrence of this author status to that of others. The written text is the work of a sole author. While sampling procedures ensure that works by sole authors are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this authorship status to that of others. The written text is the work of multiple authors. While sampling procedures ensure that multiply-authored works are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this authorship status to that of others. The authorship of the written text is attributed to a company, government organization, or other corporate body, rather than to an individual or individuals. While sampling procedures ensure that works authored by corporate bodies are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this authorship status to that of others. The author of the text, and whether authorship is sole, multiple, or corporate, is unknown. The sole author of a written text is male; or all authors of a multiply-authored written text are male. While sampling procedures ensure that works written by males are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this author gender to that of others. The sole author of a written text is female; or all authors of a multiply-authored written text are female. 
While sampling procedures ensure that works written by females are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this author gender to that of others. The multiply-authored written text is known to have both male and female authors, and may have authors of unknown gender. The gender of the sole author (or genders of all the multiple authors) of a written text is unknown. The age of the author (or the ages of all multiple authors) of a written text at the time of publication (or, for unpublished works, at the time of writing) is 20 years or less. While sampling procedures ensure that works by authors in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this age group to that of others. The age of the author (or the ages of all multiple authors) of a written text at the time of publication (or, for unpublished works, at the time of writing) is between 21 and 35 years. While sampling procedures ensure that works by authors in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this age group to that of others. The age of the author (or the ages of all multiple authors) of a written text at the time of publication (or, for unpublished works, at the time of writing) is between 36 and 50 years. While sampling procedures ensure that works by authors in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this age group to that of others. The age of the author (or the ages of all multiple authors) of a written text at the time of publication (or, for unpublished works, at the time of writing) is between 51 and 65 years. While sampling procedures ensure that works by authors in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this age group to that of others. 
The age of the author (or the ages of all multiple authors) of a written text at the time of publication (or, for unpublished works, at the time of writing) is 66 years or greater. While sampling procedures ensure that works by authors in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this age group to that of others. The text is written by multiple authors with ages falling into two or more of the alternative author age categories. The age of the author (or, in the case of multiple authorship, at least one of the authors) of a written text at the time of publication (or, for unpublished works, at the time of writing) is unknown. The sole author of the text, or all authors of a multiply-authored text, was brought up in the south-west of England. While sampling procedures ensure that works by authors of this ethnic origin are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this ethnic origin to that of others. The sole author of the text, or all authors of a multiply-authored text, was brought up in the south-east of England. While sampling procedures ensure that works by authors of this ethnic origin are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this ethnic origin to that of others. The sole author of the text, or all authors of a multiply-authored text, was brought up in Greater London, England. While sampling procedures ensure that works by authors of this ethnic origin are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this ethnic origin to that of others. The sole author of the text, or all authors of a multiply-authored text, was brought up in Wales. While sampling procedures ensure that works by authors of this ethnic origin are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this ethnic origin to that of others. 
The sole author of the text, or all authors of a multiply-authored text, was brought up in the west Midland area of England. While sampling procedures ensure that works by authors of this ethnic origin are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this ethnic origin to that of others. The sole author of the text, or all authors of a multiply-authored text, was brought up in the east Midland area of England. While sampling procedures ensure that works by authors of this ethnic origin are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this ethnic origin to that of others. The sole author of the text, or all authors of a multiply-authored text, was brought up in the East Anglian area of England. While sampling procedures ensure that works by authors of this ethnic origin are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this ethnic origin to that of others. The sole author of the text, or all authors of a multiply-authored text, was brought up in Northern Ireland. While sampling procedures ensure that works by authors of this ethnic origin are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this ethnic origin to that of others. The sole author of the text, or all authors of a multiply-authored text, was brought up in the north-west of England. While sampling procedures ensure that works by authors of this ethnic origin are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this ethnic origin to that of others. The sole author of the text, or all authors of a multiply-authored text, was brought up in the north-east of England. While sampling procedures ensure that works by authors of this ethnic origin are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this ethnic origin to that of others. 
The sole author of the text, or all authors of a multiply-authored text, was brought up in the north of England. While sampling procedures ensure that works by authors of this ethnic origin are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this ethnic origin to that of others. The sole author of the text, or all authors of a multiply-authored text, was brought up in Scotland. While sampling procedures ensure that works by authors of this ethnic origin are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this ethnic origin to that of others. The sole author (or all multiple authors) of the text was brought up in England (exact region unknown). The sole author (or all multiple authors) of the text was brought up in Great Britain. The multiple authors of the text have a variety of ethnic backgrounds. The ethnic origin of the sole author (or all multiple authors) of the text is unknown. The sole author (or all multiple authors) of the text was brought up in the Indian subcontinent. The sole author (or all multiple authors) is a second-generation member of the British Asian community. The sole author (or all multiple authors) is a member of the Jewish community. The mother tongue of the sole author (or all multiple authors) of the text is French. The sole author (or all multiple authors) of the text was brought up in the USA. The sole author of the text, or all authors of a multiply-authored text, lived in the south-west of England at the time the text was published (or, in the case of an unpublished text, was written). While sampling procedures endeavour to ensure that works by authors of this domicile are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this domicile to that of others. 
The sole author of the text, or all authors of a multiply-authored text, lived in the south-east of England at the time the text was published (or, in the case of an unpublished text, was written). While sampling procedures endeavour to ensure that works by authors of this domicile are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this domicile to that of others. The sole author of the text, or all authors of a multiply-authored text, lived in Greater London, England at the time the text was published (or, in the case of an unpublished text, was written). While sampling procedures endeavour to ensure that works by authors of this domicile are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this domicile to that of others. The sole author of the text, or all authors of a multiply-authored text, lived in Wales at the time the text was published (or, in the case of an unpublished text, was written). While sampling procedures endeavour to ensure that works by authors of this domicile are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this domicile to that of others. The sole author of the text, or all authors of a multiply-authored text, lived in the west Midland area of England at the time the text was published (or, in the case of an unpublished text, was written). While sampling procedures endeavour to ensure that works by authors of this domicile are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this domicile to that of others. The sole author of the text, or all authors of a multiply-authored text, lived in the east Midland area of England at the time the text was published (or, in the case of an unpublished text, was written). 
While sampling procedures endeavour to ensure that works by authors of this domicile are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this domicile to that of others. The sole author of the text, or all authors of a multiply-authored text, lived in the East Anglian area of England at the time the text was published (or, in the case of an unpublished text, was written). While sampling procedures endeavour to ensure that works by authors of this domicile are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this domicile to that of others. The sole author of the text, or all authors of a multiply-authored text, lived in Northern Ireland at the time the text was published (or, in the case of an unpublished text, was written). While sampling procedures endeavour to ensure that works by authors of this domicile are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this domicile to that of others. The sole author of the text, or all authors of a multiply-authored text, lived in the north-west of England at the time the text was published (or, in the case of an unpublished text, was written). While sampling procedures endeavour to ensure that works by authors of this domicile are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this domicile to that of others. The sole author of the text, or all authors of a multiply-authored text, lived in the north-east of England at the time the text was published (or, in the case of an unpublished text, was written). While sampling procedures endeavour to ensure that works by authors of this domicile are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this domicile to that of others. 
The sole author of the text, or all authors of a multiply-authored text, lived in the north of England at the time the text was published (or, in the case of an unpublished text, was written). While sampling procedures endeavour to ensure that works by authors of this domicile are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this domicile to that of others. The sole author of the text, or all authors of a multiply-authored text, lived in Scotland at the time the text was published (or, in the case of an unpublished text, was written). While sampling procedures endeavour to ensure that works by authors of this domicile are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this domicile to that of others. The sole author (or all multiple authors) of the text lived in England (exact region unknown) at the time the text was published (or, in the case of an unpublished text, was written). The sole author (or all multiple authors) of the text lived in Great Britain (country and region unknown) at the time the text was published (or, in the case of an unpublished text, was written). The domiciles of the multiple authors of the text corresponded to a variety of the alternative categories at the time the text was published (or, in the case of an unpublished text, was written). The domicile of the sole author (or some of the multiple authors) of the text at the time of its publication (or, in the case of an unpublished text, at the time of its writing) is unknown. The sole author of the text, or all authors of a multiply-authored text, lived in Europe, but outside Great Britain, at the time the text was published (or, in the case of an unpublished text, was written). The sole author of the text, or all authors of a multiply-authored text, lived in Africa at the time the text was published (or, in the case of an unpublished text, was written). 
The sole author of the text, or all authors of a multiply-authored text, lived in Asia at the time the text was published (or, in the case of an unpublished text, was written). The sole author of the text, or all authors of a multiply-authored text, lived in North America at the time the text was published (or, in the case of an unpublished text, was written). The sole author of the text, or all authors of a multiply-authored text, lived in South America at the time the text was published (or, in the case of an unpublished text, was written). The text is targeted at an audience below school age. While sampling procedures endeavour to ensure that works for a target audience in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this target age group to that of others. The text is targeted at an audience of primary school age (5 to 11 years inclusive). While sampling procedures endeavour to ensure that works for a target audience in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this target age group to that of others. The text is targeted at children. While sampling procedures endeavour to ensure that works for a target audience in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this target age group to that of others. The text is targeted at an audience in its early to mid teens (12 to 16 years inclusive). While sampling procedures endeavour to ensure that works for a target audience in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this target age group to that of others. The text is targeted at teenagers. 
While sampling procedures endeavour to ensure that works for a target audience in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this target age group to that of others. The text is targeted at an audience in its late teens or early adulthood (17-24 years inclusive). While sampling procedures endeavour to ensure that works for a target audience in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this target age group to that of others. The text is targeted at young adults. While sampling procedures endeavour to ensure that works for a target audience in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this target age group to that of others. The text is targeted at adults. While sampling procedures endeavour to ensure that works for a target audience in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this target age group to that of others. The text is targeted at a predominantly middle-aged audience. While sampling procedures endeavour to ensure that works for a target audience in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this target age group to that of others. The text is targeted at older adults. While sampling procedures endeavour to ensure that works for a target audience in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this target age group to that of others. The text is targeted at pensionable adults. While sampling procedures endeavour to ensure that works for a target audience in this age group are represented in the Corpus, there is no criterion governing the ratio of the occurrence of this target age group to that of others. 
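The sampling declarations earlier in this header describe three ways of taking a sample of approximately 40,000 words from a longer text: from the beginning, from a randomly chosen structural point, and from the end. The following is a minimal sketch of those three strategies, not the compilers' actual procedure. It assumes a text is available as a list of chapter word counts, with chapters standing in for "convenient structural breakpoints"; the function name and strategy labels are hypothetical.

```python
import random

TARGET = 40_000  # approximate sample size in words, as given in the header


def sample_chapters(chapter_lengths, strategy, rng=None):
    """Return (start, stop) chapter indices, stop exclusive, for one sample.

    chapter_lengths: word count of each chapter (hypothetical structural units).
    strategy: "start", "middle" or "end" (labels invented for this sketch).
    """
    n = len(chapter_lengths)
    if sum(chapter_lengths) <= TARGET:
        return (0, n)  # shorter texts are included in their entirety

    if strategy == "end":
        # Walk back from the end of the text until about TARGET words are covered.
        total, start = 0, n
        while start > 0 and total < TARGET:
            start -= 1
            total += chapter_lengths[start]
        return (start, n)

    if strategy == "start":
        start = 0
    elif strategy == "middle":
        rng = rng or random.Random()
        # Eligible starts: any chapter other than the first from which about
        # TARGET words can be taken without reaching the end of the text.
        candidates = [i for i in range(1, n)
                      if sum(chapter_lengths[i:-1]) >= TARGET]
        if not candidates:
            raise ValueError("no mid-text starting point leaves room for a sample")
        start = rng.choice(candidates)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Extend forward chapter by chapter until about TARGET words are covered.
    total, stop = 0, start
    while stop < n and total < TARGET:
        total += chapter_lengths[stop]
        stop += 1
    return (start, stop)


book = [10_000] * 12  # a hypothetical 120,000-word text in twelve chapters
print(sample_chapters(book, "start"))  # (0, 4)
print(sample_chapters(book, "end"))    # (8, 12)
print(sample_chapters(book, "middle", random.Random(0)))
```

Stopping only at chapter boundaries means a sample may run somewhat over 40,000 words, which matches the header's repeated use of "approximately".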
Printed material converted to machine-readable format by means of Optical Character Recognition techniques. Rekeyed printed material. Existing machine-readable text from the Oxford Pilot Corpus. Existing machine-readable materials from the Longman-Lancaster Corpus. Recorded spoken material transcribed to machine-readable format. Body text is included. Front and back matter are omitted. Chapter, section and other headings are included, appropriately tagged, at the point at which they appear in the text. Fixed captions (those which appear to belong at a particular point in a text) are included at the point at which they appear. Floating captions (those which appear not to belong at any particular point in a text) are included at the end of the text. Tables containing English sentences or phrases are included with appropriate tagging. Table of contents material is omitted. Credits, acknowledgements and copyright notices are omitted. Foot- and end-notes, and indication of the points at which they are referenced, are omitted. Pictures, figures, formulas, and any table not containing English phrases or sentences, are omitted, being replaced with a note tag which states what has been omitted. Running titles and other running material are omitted. Advertising material is omitted, unless otherwise stated in the header of a particular text. Following capture using Optical Character Recognition techniques or rekeying, texts are proof-read with the intention of correcting errors which are a consequence of the data capture method. The MS-DOS WordPerfect 5.1 British English spelling checker is used to detect words which are misspelled in the captured data. Errors which are not in the original text are corrected. Where a clear typographic error, or a spelling error -- whether unintentional, or intended as a point of style -- is detected in the original text during data capture or proof-reading, it is tagged with a sic tag. No corrected version is suggested. 
The proof-reading is not exhaustive: residual transcription errors remain in the texts. The intention is that the spelling in the texts is not normalized; it is as it appears in the original text. However, the proof-reading process may introduce a few inadvertent normalizations. During data capture, all forms of opening quotation mark are replaced with a single character (`); all forms of closing quotation mark with another character ("). Subsequent processing replaces as many of these pairs as possible with tagging. Some residual marks may remain after this process, however. Similarly, processing is intended to detect and replace rendition tags used to indicate quotation, but some residual uses of rendition tags for this purpose may remain. Soft hyphens in original texts are silently omitted from captured data. Hard hyphens are retained. Standard numeric values are not provided. Standard date values are not provided. Part of speech tagging has been added by the CLAWS2 program. Segment demarcation has been applied by the CLAWS2 program, except inside structures, such as tables, where such analysis does not produce useful results. The intent of the analysis is to tag sentences. Variant encoding is not applicable to the material in the Corpus. Page numbers are tagged and appear ahead of any text taken from the page, irrespective of where the page number is printed (if at all) on the page. Foreign words identifiable as such during data capture or subsequent processing are tagged. This process is not exhaustive: untagged foreign words are likely to remain in the Corpus. Where technically feasible, changes to and from bold, underlined and italic rendition in body text are captured. Other typeface changes -- for example, to small caps or a different font -- are ignored, as is the rendition of headings, headlines, captions and the like. 
Subsequent processing attempts to replace rendition tags with structural tags indicating the reason for the change in rendition (quotation, citation, emphasis etc.). This process is not exhaustive: rendition tags may remain in the text. All text is included, with the exceptions listed below. Article, paragraph, and other headings are included, appropriately tagged, at the point at which they appear in the text. Fixed captions (those which appear to belong at a particular point in a text) are included at the point at which they appear. Floating captions (those which appear not to belong at any particular point in a text) are included at the end of the text. Tables containing English sentences or phrases are included with appropriate tagging. Cover material from periodicals having covers is included, tagged appropriately. Table of contents material from periodicals having tables of contents is included, tagged appropriately. Foot- and end-notes, and indication of the points at which they are referenced, are omitted. Pictures, figures, formulas, and any table not containing English phrases or sentences, are omitted, being replaced with a note tag which states what has been omitted. Running titles and other running material are omitted. Advertising material is omitted, unless otherwise stated in the header of a particular text. Following capture using Optical Character Recognition techniques or rekeying, texts are proof-read with the intention of correcting errors which are a consequence of the data capture method. The MS-DOS WordPerfect 5.1 British English spelling checker is used to detect words which are misspelled in the captured data. Errors which are not in the original text are corrected. Where a clear typographic error, or a spelling error -- whether unintentional, or intended as a point of style -- is detected in the original text during data capture or proof-reading, it is tagged with a sic tag. No corrected version is suggested. 
The proof-reading is not exhaustive: residual transcription errors remain in the texts. The intention is that the spelling in the texts is not normalized; it is as it appears in the original text. However, the proof-reading process may introduce a few inadvertent normalizations. During data capture, all forms of opening quotation mark are replaced with a single character (`); all forms of closing quotation mark with another character ("). Subsequent processing replaces as many of these pairs as possible with tagging. Some residual marks may remain after this process, however. Similarly, processing is intended to detect and replace rendition tags used to indicate quotation, but some residual uses of rendition tags for this purpose may remain. Soft hyphens in original texts are silently omitted from captured data. Hard hyphens are retained. Standard numeric values are not provided. Standard date values are not provided. Part of speech tagging has been added by the CLAWS2 program. Variant encoding is not applicable to the material in the Corpus. Page numbers are tagged and appear at the head of each article starting on that page. Where an article extends over several pages, tagged page numbers appear at the points at which page boundaries are crossed. Foreign words identifiable as such during data capture or subsequent processing are tagged. This process is not exhaustive: untagged foreign words are likely to remain in the Corpus. Where technically feasible, changes to and from bold, underlined and italic rendition in body text are captured. Other typeface changes -- for example, to small caps or a different font -- are ignored, as is the rendition of headings, headlines, captions and the like. Subsequent processing attempts to replace rendition tags with structural tags indicating the reason for the change in rendition (quotation, citation, emphasis etc.). This process is not exhaustive: rendition tags may remain in the text. All text is included, except as noted below. 
References in speech of the names of people who, given information available at the time of transcription, are considered to be public personalities (media personalities, politicians, ``captains of industry'', clerics, published authors and academics etc.), are preserved. Names, addresses and telephone numbers which might identify individual speakers, their places of work, or other private individuals are omitted, being replaced with a note tag describing the omission. Speech which cannot be transcribed because it is unclear or inaudible is omitted, being replaced by a note tag stating the reason for omission. Telephone conversations are omitted unless it is possible to transcribe both sides. In general, such transcription is possible when special recording equipment has been used. Recordings consisting entirely of one-sided telephone conversations are not transcribed, and do not appear in the Corpus; one-sided telephone conversations which form part of an interaction with other people who are present and can be heard may be transcribed if the telephone interaction appears to the transcriber to be incidental. Alternatively, the whole of an interaction consisting primarily of a one-sided telephone conversation may be replaced by a note tag describing the omission. Not applicable to spoken material. Where words which appear to be a part of standard English are spoken either with standard pronunciation, or with a regional or ethnic accent, their spelling is normalized to that suggested by the MS-DOS WordPerfect 5.1 British English spelling checker. No attempt is made to represent the actual pronunciation. (Researchers interested in this aspect of the material are referred to the sound recordings, which the British National Corpus Project hopes to be able to publish.) 
Words which are not a part of standard English, such as slang and dialect, and words with which the transcriber is unfamiliar, such as technical terms or place names, are represented with a phonetic spelling. Control lists of acceptable and normalized forms are used in the transcription process, and transcribers should tag words not appearing in these lists so as to indicate uncertain spelling. The representation of words not appearing in the lists can be expected to vary from transcriber to transcriber (that is, from text to text), and may also vary within a single text. However, all utterances of a given person are transcribed by a single transcriber, making it likely that there will be little variation in the representation of non-standard words uttered by a given person. Word fragments are spelled as the corresponding fragment of the standard English word of which they appear to be part, or, if they do not appear to be part of such a word, phonetically. Fragments are terminated with a hyphen followed by a space or end of line. Conventional punctuation is applied to texts as they are transcribed: -- commas are used to mark short pauses in utterances; -- apostrophes are used to mark possessives and contractions; -- full stops are used to mark the perceived end of utterances; -- exclamation marks are used to mark the perceived end of exclamatory utterances; -- question marks are used to mark the perceived end of interrogative utterances. Pauses which cannot appropriately be indicated through punctuation -- typically because they are too long, or do not occur in a situation where punctuation is conventionally used -- are represented by entities corresponding to pauses which, given the normal speed of speech of the speaker, are short, medium, or long. Very long pauses, during which nobody speaks for an extended period, are tagged with their duration, provided that it is possible to capture this information at the time the material is transcribed. 
(It is not possible to capture it if, for example, the recorder is turned off for some or all of the pause.) Vocalized pauses are tagged, and are represented orthographically. Paralinguistic features, such as coughing, laughing, and sneezing, are tagged. Hard hyphens appear as appropriate. Hard hyphens never appear as the last character on a line. Soft hyphens are not applicable to spoken material. Numeric values are always spelled out as spoken -- for example, eleven hundred, one thousand one hundred. No standard representation is given. Dates are always spelled out as spoken -- for example, the fifth of May, nineteen seventy-two; May five, seventy-two. No standard representation is given. Overlapping speech is represented using the TEI timeline approach, using relative, not absolute times. Segment demarcation has been applied by the CLAWS2 program. This demarcation, which is intended to identify sentences, is dependent to some extent upon the conventional punctuation applied as part of the normalization process. Not applicable to spoken material. Foreign words identifiable as such during transcription or subsequent processing are tagged. This process is not exhaustive: untagged foreign words are likely to remain in the Corpus. The primary reference system for the written part of the Corpus operates at the segment level, a unique segment being identified by its id attribute. A secondary, hierarchical, reference system to the segment level also exists. A unique segment may be identified by combining its n attribute with those of all enclosing divs, and that of the text of which it is part. While page numbers from the source text are preserved if any pages in the source text are numbered, line numbers are not recorded, and line breaks are not preserved. Users are advised against using page numbers as a reliable reference system, and are reminded that text may have been captured from popular, rather than definitive, editions of works. 
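The secondary, hierarchical reference system for written texts can be illustrated with a minimal sketch. It is not the Corpus's own tooling: the element names `text`, `div` and `s`, the sample id and n values, and the use of a full stop as separator are all assumptions made for illustration. The sketch only shows the idea of combining a segment's n attribute with those of its enclosing divs and its text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup: element names and attribute values are
# invented for illustration and are not taken from the Corpus itself.
SAMPLE = """
<text n="T1">
  <div n="1">
    <div n="2">
      <s id="T1.0001" n="1">First segment.</s>
      <s id="T1.0002" n="2">Second segment.</s>
    </div>
  </div>
  <div n="2">
    <s id="T1.0003" n="1">Third segment.</s>
  </div>
</text>
"""


def hierarchical_refs(element, path=()):
    """Yield (id, reference) pairs for every segment below `element`.

    The reference joins the n attribute of the text, of each enclosing
    div, and of the segment itself (separator "." is an assumption).
    """
    path = path + (element.get("n", "?"),)
    if element.tag == "s":
        yield element.get("id"), ".".join(path)
        return
    for child in element:
        yield from hierarchical_refs(child, path)


refs = dict(hierarchical_refs(ET.fromstring(SAMPLE)))
for seg_id, ref in refs.items():
    print(seg_id, ref)
```

In this sketch the primary reference is simply the id attribute, while the hierarchical reference is rebuilt from the nesting, so either can be used to locate the same segment.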
The primary reference system for the spoken part of the Corpus operates at the utterance level, a unique utterance being identified by its id attribute. A secondary, hierarchical, reference system to the utterance level also exists. An utterance may be identified by combining its n attribute with that of the text of which it is part.

To be supplied...

The Wimbledon Poisoner
British paperback edition 1991
ISBN 0-571-16131-6
Nigel Williams
Accession to Oxford Pilot Corpus: Jeremy Clear, OUP
Automatic conversion to prototype CDIF format: Gavin Burnage, OUCS
Incorporation into straw man for British National Corpus: Dominic Dunlop, OUCS
Version 0.1, accession 91-11-01
37,404 orthographic words
Faber & Faber, London, England, 1990
Maura Less
Dominic Dunlop 91-11-01 Straw man version 0.1

PART ONE

Innocent Enjoyment

“When a felon's not engaged in his employment
Or maturing his felonious little plans,
His capacity for innocent enjoyment,
Is just as great as any honest man's!”

W. S. Gilbert, Pirates of Penzance

CHAPTER ONE

Henry Farr did not, precisely, decide to murder his wife. It was simply that he could think of no other way of prolonging her absence from him indefinitely.

He had quite often, in the past, when she was being more...