BNC User Reference Guide

5 The header


The header of a TEI-conformant text provides a structured description of its contents, analogous to the title page and front matter of a book. The component elements of a TEI header are intended to provide in machine-processable form all the information needed to make sensible use of the Corpus.

Every separate text in the British National Corpus (i.e. each <bncDoc> element) has its own header, referred to below as a text header. In addition, the corpus itself has a header, referred to below as the corpus header, containing information which is applicable to the whole corpus. Both corpus and text headers are represented by <teiHeader> elements.

The corpus header is supplied in a separate file called bncHdr.xml, whereas text headers are prefixed to each file in the Texts directory. In the remainder of this section, we describe the components of the <teiHeader> element, as used within the BNC. A TEI header contains a file description (section 5.1 The file description), an encoding description (section 5.2 The encoding description), a profile description (section 5.3 The profile description) and a revision description (section 5.4 The revision description), represented by the <fileDesc>, <encodingDesc>, <profileDesc> and <revisionDesc> elements respectively.
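As a rough illustration, the main components of a header can be listed with Python's standard xml.etree.ElementTree module. The sample document below is a hypothetical, heavily cut-down stand-in for a real BNC text file, the function name is our own, and the names of the four child elements follow standard TEI usage.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal stand-in for a BNC text file; real headers carry far more detail.
SAMPLE = """<bncDoc xml:id="A0A">
  <teiHeader>
    <fileDesc/>
    <encodingDesc/>
    <profileDesc/>
    <revisionDesc/>
  </teiHeader>
</bncDoc>"""


def header_components(xml_text):
    """Return the tag names of the direct children of <teiHeader>, in document order."""
    root = ET.fromstring(xml_text)
    header = root.find("teiHeader")
    if header is None:
        raise ValueError("no <teiHeader> element found")
    return [child.tag for child in header]


print(header_components(SAMPLE))
```

The sketch assumes that the elements are not in an XML namespace; a namespaced file would need qualified names in the call to find.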

5.1 The file description

The file description (<fileDesc>) is the first of the four main constituents of the header. It is intended to document an electronic file i.e. (in the case of a corpus header) the whole corpus, or (in the case of a text header) any characteristics peculiar to an individual file within it. In each case, it contains the following five subdivisions:
  • <titleStmt> (title statement) groups information about the title of a work and those responsible for its intellectual content.
  • <editionStmt> (edition statement) groups information relating to one edition of a text.
  • <extent> specifies the approximate size of the text, in orthographic words, <w> elements, and <s> elements.
  • <publicationStmt> (publication statement) groups information concerning the publication or distribution of an electronic or other text.
  • <sourceDesc> supplies a description of the source text(s) from which an electronic text was derived or generated.

Further detail for each of these is given in the following subsections.

5.1.1 The title statement

The title statement (<titleStmt>) element of a BNC text contains one or more <title> elements, optionally followed by <author>, <editor>, or <respStmt> elements. These sub-elements are used throughout the header, wherever the title of a work or a statement of responsibility is required.

For the corpus header, the title statement looks like this:
  <titleStmt>
   <title>The British National Corpus: XML Edition</title>
   <respStmt>
    <resp>Lead partner in consortium</resp>
    <name>Oxford University Press</name>
   </respStmt>
   <respStmt>
    <resp>Text selection for miscellaneous and unpublished written materials</resp>
    <name>W R Chambers</name>
   </respStmt>
   <respStmt>
    <resp>Text selection, data capture and transcription for spoken texts and for 14% of published written texts</resp>
    <name>Longman ELT</name>
   </respStmt>
   <respStmt>
    <resp>Text selection for 86% published written texts</resp>
    <name>Oxford University Press</name>
   </respStmt>
   <respStmt>
    <resp>Data capture and transcription for all miscellaneous and unpublished written texts and for 86% of published written texts</resp>
    <name>Oxford University Press</name>
   </respStmt>
   <respStmt>
    <resp>XML conversion, encoding, storage and distribution</resp>
    <name>Oxford University Computing Services</name>
   </respStmt>
   <respStmt>
    <resp>Text enrichment</resp>
    <name>Unit for Computer Research into the English Language, University of Lancaster</name>
   </respStmt>
  </titleStmt>
In individual corpus texts, the title statement follows a pattern like the following:
  <titleStmt>
   <title>The National Trust Magazine. Sample containing about 21015 words from a periodical (domain: arts) </title>
   <respStmt>
    <resp> Data capture and transcription </resp>
    <name>Oxford University Press </name>
   </respStmt>
  </titleStmt>

The content of the <title> element includes the title of the source, followed by the phrase "Sample containing about", the approximate word count for the sample, and further information about the text type and domain, all extracted from other parts of the header. This is followed by responsibility statements showing which of the BNC Consortium members was responsible for capturing the text originally.

Here are some typical examples:
 <title> How we won the open: the caddies' stories. Sample containing about 36083 words from a book (domain: leisure) </title>
<!-- ASA-->
 <title> Harlow Women's Institute committee meeting. Sample containing about 246 words speech recorded in public context</title>
 <title> The Scotsman: Arts section. Sample containing about 48246 words from a periodical (domain: arts) </title>
 <title>32 conversations recorded by `Frank' (PS09E) between 21 and 28 February 1992 with 9 interlocutors, totalling 3193 s-units, 20607 words, and 3 hours 22 minutes 23 seconds of recordings.</title>
 <title>[Leaflets advertising goods and products]. Sample containing about 23409 words of miscellanea (domain: commerce)</title>
A <respStmt> element is used to indicate each agency responsible for any significant effort in the creation of the text. Since responsibilities for data encoding and storage, and for enrichment, are the same for all texts, they are not repeated in each text header. The responsibility for original data capture and transcription varies text by text, and is therefore stated in each text header, in the following manner:
  <respStmt>
   <resp> Data capture and transcription </resp>
   <name> Longman ELT </name>
  </respStmt>

Author and editor information for the source from which a text is derived (e.g. the author of a book) is not included in the <titleStmt> element but in the <sourceDesc> element discussed below (5.1.5 The source description).

5.1.2 The edition statement

The <editionStmt> element is used to specify an edition for each file making up the corpus. It takes the same form in both the corpus header and individual text headers:
  <edition>BNC XML Edition, January 2007</edition>

5.1.3 The extent statement

The <extent> element is used in each text header to specify the size of the text to which it is attached, as in the following example:
 <extent> 21015 tokens; 21247 w-units; 957 s-units </extent>
These counts do not include the size of the header itself. The number of ‘tokens’ is generated by the Unix wc utility, which simply counts blank-delimited strings; the other figures give the number of <w> and <s> elements respectively.
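The following minimal sketch shows one way of reading the three figures from an <extent> string, together with a count of blank-delimited strings in the manner of wc. The function names are our own and purely illustrative; the sketch assumes the "number label" layout seen in the example above.

```python
import re


def parse_extent(text):
    """Split an <extent> string such as '21015 tokens; 21247 w-units; 957 s-units'
    into a dictionary of integer counts keyed by label."""
    counts = {}
    for number, label in re.findall(r"(\d+)\s+([\w-]+)", text):
        counts[label] = int(number)
    return counts


def count_tokens(text):
    """Count blank-delimited strings, as `wc -w` does."""
    return len(text.split())


extent = parse_extent(" 21015 tokens; 21247 w-units; 957 s-units ")
print(extent["tokens"], extent["w-units"], extent["s-units"])
```

Because the token figure counts blank-delimited strings while the w-unit figure counts <w> elements, the two numbers need not agree, as the example shows.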

5.1.4 The publication statement

The <publicationStmt> element is used to specify publication and availability information for an electronic text. It contains the following three elements:
  • <distributor> supplies the name of a person or other agency responsible for the distribution of a text.
  • <availability> supplies information about the availability of a text, for example any restrictions on its use or distribution, its copyright status, etc.
  • <idno> (identifying number) supplies an identifying code for a text.
Individual text headers contain the following fixed text for the first two of these:
 <distributor>Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium.</distributor>
 <availability> This material is protected by international copyright laws and may not be copied or redistributed in any way. Consult the BNC Web Site at for full licencing and distribution conditions.</availability>
For contractual reasons, the corpus header includes a somewhat longer rehearsal of the terms and conditions under which the BNC is made available.
For individual text headers, two identification numbers are supplied, distinguished by the value of their type attribute.
 <idno type="bnc">A0A</idno>
 <idno type="old">CAMfct</idno>

The second identifier (of type old) is the old-style mnemonic or numeric code attached to BNC texts during the production of the corpus, and is still used to label the original printed source materials in the BNC Archive. The first identifier (of type bnc), a three-character code, is the standard BNC identifier. It is used both as the name of the file in which the text is stored and as the value of the xml:id attribute on the <bncDoc> element containing the whole text, and should always be used to cite the text. The code is a completely arbitrary identifier, and does not indicate anything about the nature of the text.
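As an illustration, the two identifiers can be retrieved by the value of their type attribute. This is a sketch only: the fragment reuses the values from the example above, and the function name is our own.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the example above, wrapped in its parent <publicationStmt>.
FRAGMENT = """<publicationStmt>
  <idno type="bnc">A0A</idno>
  <idno type="old">CAMfct</idno>
</publicationStmt>"""


def idnos_by_type(xml_text):
    """Map each <idno> type attribute to the identifier it carries."""
    root = ET.fromstring(xml_text)
    return {idno.get("type"): (idno.text or "").strip() for idno in root.iter("idno")}


ids = idnos_by_type(FRAGMENT)
print(ids["bnc"])  # the identifier that should be used to cite the text
```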

5.1.5 The source description

The <sourceDesc> element is used to supply bibliographic details for the original source material from which an electronic text derives. In the case of a BNC text, this might be a book, pamphlet, newspaper etc., or a recording. One of the following elements available within the <sourceDesc> will be used, as appropriate:
  • <recordingStmt> (recording statement) describes a set of recordings used in transcription of a spoken text.
  • <bibl> (bibliographic citation) contains any bibliographic reference, occurring either within the header of a written corpus text in which case it has a fixed substructure, or within the body of a corpus text, in which case it contains only s elements.

These elements are not used within the corpus header, which simply contains a note about the sources from which the corpus was derived, tagged as a <para> (paragraph). The headers of individual texts each contain one of the above elements to specify their source.

Context-governed spoken texts derived from broadcast or similar ‘published’ material may have either a recording statement or a bibliographic record as their source.

All bibliographic data supplied in the individual text headers is collected together and reproduced in section 10 List of Sources below.

The recording statement

The recording statement (<recordingStmt>) element contains one or more <recording> elements:
  • <recording> (recording event) details of an audio or video recording event used as the source of a spoken text, either directly or from a public broadcast.
    n tape number.
    date date of the recording in standardized form.
    time time of day the recording was made.
    type kind of recording.
    dur duration of the recording in seconds.

The value of the n attribute here provides the number of the audio tape holding the original recording, as deposited with the British Library's Sound Archive in London.

In the following simple example, typical of most of the ‘context-governed’ parts of the BNC, the <recording> element has no content at all:

When, as is often the case for the spoken demographic parts of the BNC, a text has been made up by transcribing several different recordings made by a single respondent over a period of time, each such recording will have its own <recording> element, as in the following example:


<!-- ... -->


<!-- ... -->
Note the presence of an xml:id attribute on each of the above recordings. The value given here is used to indicate the recording from which a given part of the text was transcribed. Each recording is transcribed as a distinct <div> (division) element within an <stext>. In that element, the identifier of the source recording is supplied as the value of a decls attribute. Thus, in the spoken text derived from the above mentioned recordings, there will be a <div> element starting as follows:
 <div decls="KB7RE0077"> ...</div>
which will contain the part of text transcribed from that recording. As noted above, the identifier supplied on the n attribute is quite distinct, and identifies the tape on which the original recording was made, and by which it is referenced in the British Library's Sound Archive.

Structured bibliographic record

In addition to its usage within the corpus texts (see 3.2.7 Bibliographic references), the <bibl> element is also used to record bibliographic information for each non-spoken component of the BNC. In this case, its structure is constrained to contain only the following elements in the order specified:
  • <title> contains the full title of a work of any kind.
  • <editor> secondary statement of responsibility for a bibliographic item, for example the name of an individual, institution or organization, (or of several such) acting as editor, compiler, translator, etc.
  • <author> in a bibliographic reference, contains the name of the author(s), personal or corporate, of a work; the primary statement of responsibility for any bibliographic item.
  • <imprint> groups information relating to the publication or distribution of a bibliographic item.
  • <pp> supplies page numbers for a bibliographic citation.

During production of the BNC, the n attribute was used with both <author> and <imprint> elements to supply a six-letter code identifying the author or imprint concerned. The values used should be unique across the corpus, but this is not validated in the current release of the DTD.

The <imprint> element is supplied for published texts only and contains the following elements in the order given:
  • <pubPlace> contains the name of the place where a bibliographic item was published.
  • <publisher> provides the name of the organization responsible for the publication or distribution of a bibliographic item.
  • <date> contains a date in any format.
The following example demonstrates how these elements are used to record bibliographic details for a typical book:
  <bibl>
   <title>It might have been Jerusalem. </title>
   <author n="HealyT1" domicile="Scotland">Healy, Thomas</author>
   <imprint n="POLYGO1">
    <publisher>Polygon Books</publisher>
    <date value="1991">1991</date>
   </imprint>
  </bibl>
<!-- BNC -->
The following example is typical of the case where a collection of leaflets or newsletters has been treated as a single text:
  <bibl>
   <title>[Potato Marketing Board leaflets]</title>
   <imprint n="POTATO1">
    <publisher>Potato Marketing Board</publisher>
    <date value="1991">1991</date>
   </imprint>
  </bibl>
<!-- EEA -->
Occasionally, a bibliographic item has two titles, for example a series title as well as an individual title, or multiple authors. In the BNC such cases are treated simply by repeating the element concerned, sometimes using the level attribute to distinguish the bibliographic ‘level’ of the title:
  <bibl>
   <title>Damages for personal injury and death: </title>
   <title level="a">Damages on death</title>
   <author n="SauntT1">Saunt, Thomas</author>
   <editor>Kemp, David</editor>
   <imprint n="LONGMA1">
    <publisher>Longman Group UK Ltd</publisher>
    <date value="1993">1993</date>
   </imprint>
  </bibl>
<!-- J6W -->

Where ‘series’ information is available for a given title, this is not normally tagged distinctly. Instead the series title is given as part of the monographic title, usually preceded by a colon.

This level of bibliographic description has not been carried out with complete consistency across the current release of the corpus.

5.2 The encoding description

The second major component of the TEI header is the encoding description (<encodingDesc>). This contains information about the relationship between an encoded text and its original source and describes the editorial and other principles employed throughout the corpus. It also contains reference information used throughout the corpus.

The BNC <encodingDesc> element has the following seven components:
  • <projectDesc> (project description) describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
  • <samplingDecl> (sampling declaration) contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
  • <editorialDecl> (editorial practice declaration) provides details of editorial principles and practices applied during the encoding of a text.
  • <tagsDecl> (tagging declaration) provides information about the XML elements actually used within a BNC text.
  • <refsDecl> (references declaration) provides documentation for the reference system applicable to the corpus.
  • <classDecl> (classification declarations) contains one or more taxonomies defining any classificatory codes used elsewhere in the text.
  • <xairaSpecification> specifies additional information needed by XAIRA.

In the BNC, one of each of these elements appears in the corpus header. Only the <tagsDecl> element appears in the individual text headers.

5.2.1 Documentary components of the encoding description

The <projectDesc> element for the corpus gives a brief description of the goals, organization and results of the BNC project. The <samplingDecl>, <editorialDecl> and <refsDecl> elements similarly supply brief prose descriptions of the sampling procedures, editorial practices and referencing system applied in the project. This information is also summarized elsewhere in this documentation.

5.2.2 The tagging declaration

The tagging declaration (<tagsDecl>) element is used slightly differently in corpus and in text headers. In the corpus header, it is used to list every element name actually used within the corpus, together with a brief description of its function. In text headers, it is used to specify the number of elements actually tagged within each text. In either case it consists of a <namespace> element, containing a number of <tagUsage> elements, defined as follows:
  • <namespace> supplies the formal name of the namespace to which the elements documented by its children belong.
  • <tagUsage> (tag usage) supplies information about the usage of a specific element within a text.
    gi the name (generic identifier) of the element indicated by the tag.
    occurs specifies the number of occurrences of this element within the text.
In the corpus header, each <tagUsage> element contains a brief description of the element named by its gi attribute; the occurs attribute is not supplied, as in the following extract:
 <tagUsage gi="event"> Non-verbal event in spoken text </tagUsage>
 <tagUsage gi="gap"> Point where source material has omitted </tagUsage>
 <tagUsage gi="head"> Header or headline in written text </tagUsage>
In text headers, the <tagUsage> elements are empty, but the occurs attribute is always supplied, and indicates the number of such elements which appear within the text, as in the following example, taken from a typical written text:
  <namespace name="">
   <tagUsage gi="c" occurs="5750"/>
   <tagUsage gi="corr" occurs="1"/>
   <tagUsage gi="div" occurs="115"/>
   <tagUsage gi="gap" occurs="3"/>
   <tagUsage gi="head" occurs="156"/>
   <tagUsage gi="hi" occurs="147"/>
   <tagUsage gi="l" occurs="2"/>
   <tagUsage gi="lg" occurs="1"/>
   <tagUsage gi="mw" occurs="256"/>
   <tagUsage gi="p" occurs="680"/>
   <tagUsage gi="quote" occurs="3"/>
   <tagUsage gi="s" occurs="2415"/>
   <tagUsage gi="w" occurs="41799"/>
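The occurrence counts recorded in a text header can be collected into a simple table, as in the following sketch. The fragment is a shortened copy of the example above, wrapped in its parent <tagsDecl>; the function name is our own.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example above, wrapped in its parent <tagsDecl>.
FRAGMENT = """<tagsDecl>
  <namespace name="">
    <tagUsage gi="c" occurs="5750"/>
    <tagUsage gi="s" occurs="2415"/>
    <tagUsage gi="w" occurs="41799"/>
  </namespace>
</tagsDecl>"""


def tag_counts(xml_text):
    """Map each element name (gi) to its occurrence count.

    <tagUsage> elements without an occurs attribute, as found in the
    corpus header, are skipped.
    """
    root = ET.fromstring(xml_text)
    return {
        usage.get("gi"): int(usage.get("occurs"))
        for usage in root.iter("tagUsage")
        if usage.get("occurs") is not None
    }


counts = tag_counts(FRAGMENT)
print(counts["w"], counts["s"])
```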

5.2.3 The reference and classification declarations

The <refsDecl> element for the corpus header defines the approved format for references to the corpus. It takes the following form:
  <para>Canonical references in the British National Corpus are to text segment (s) elements, and are constructed by taking the value of the xml:id attribute of the bncDoc element containing the target text, and concatenating a dot separator, followed by the value of the n attribute of the target s element. </para>
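The rule stated above can be sketched in a few lines. The function names are our own, and the s-unit number used in the example is hypothetical.

```python
def canonical_reference(doc_id, s_number):
    """Build a canonical reference from the xml:id of a <bncDoc>
    and the n value of the target <s> element."""
    return f"{doc_id}.{s_number}"


def split_reference(ref):
    """Split a canonical reference back into (text identifier, s-unit number)."""
    doc_id, _, s_number = ref.partition(".")
    if not doc_id or not s_number:
        raise ValueError(f"not a canonical reference: {ref!r}")
    return doc_id, s_number


# "A0A" is the text identifier used earlier in this section; 12 is an arbitrary s-unit number.
print(canonical_reference("A0A", 12))
```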
The standard TEI <classDecl> element is used in the BNC Corpus Header to formally define several text classification schemes which are used in the corpus. Each scheme or taxonomy defines a number of code/description pairs, applicable to a text in the corpus. For example, the written domain taxonomy defines twelve subject domains ("Imagination", "Informative: natural science", "Informative: applied science" etc.) and each written text is assigned to one of them. Each taxonomy is defined in the corpus header, using the following elements:
  • <taxonomy> (taxonomy) defines a typology used to classify texts either implicitly, by means of a bibliographic citation, or explicitly by a structured taxonomy.
  • <desc> (description) supplies explanatory text associated with a category or other component defined in the corpus header.
  • <category> (category) defines a single category within a taxonomy of texts.
  • <bibl> (bibliographic citation) contains any bibliographic reference, occurring either within the header of a written corpus text in which case it has a fixed substructure, or within the body of a corpus text, in which case it contains only s elements.
Here, for example, is the start of the <taxonomy> element defining the Written domain classification system as it appears in the corpus header:
 <taxonomy xml:id="WRIDOM">
  <desc>Written Domain</desc>
  <category xml:id="WRIDOM1">
  <category xml:id="WRIDOM2">
   <catDesc>Informative: natural & pure science</catDesc>
  <category xml:id="WRIDOM3">
   <catDesc>Informative: applied science</catDesc>

For a complete list of the taxonomies used in the BNC and the number of texts etc. classified according to them, refer to the corpus header and to chapter 1 Design of the corpus.

The classification categories applicable to a given text are specified by the <catRef> element within the associated text header. Its target lists the identifiers of all <category> elements applicable to that text. For example, the header of a written text assigned to the social science domain which has a corporate author will include a <catRef> element like the following:
 <catRef target="... WRIATY1 WRIDOM4..."/>
(The dots above represent the identifiers of all other category codes applicable to this text).
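The codes in a target value can be separated and grouped by taxonomy, as in the following sketch. It rests on an assumption not stated in the text: that each category identifier consists of the identifier of its taxonomy followed by a number (as WRIDOM4 does for the WRIDOM taxonomy), and that a text carries at most one code per taxonomy. The function name is our own.

```python
import re


def group_category_codes(target):
    """Split a catRef target into its codes and key each one by its
    assumed taxonomy prefix, e.g. 'WRIDOM4' -> {'WRIDOM': 4}."""
    grouped = {}
    for code in target.split():
        match = re.fullmatch(r"([A-Z]+)(\d+)", code)
        if match is None:
            continue  # ignore anything that does not look like PREFIX + number
        grouped[match.group(1)] = int(match.group(2))
    return grouped


print(group_category_codes("WRIATY1 WRIDOM4"))
```

If a text could carry several codes from the same taxonomy, the dictionary values would need to be lists rather than single numbers.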

A full list of all category codes can be found in a separate document, and the numbers of texts so classified in the current release of the corpus are provided in section 9.6 Text and genre classification codes.

Further information about the classification and categorization of individual texts is provided within the <textClass> element discussed below (5.3.5 Text classification).

5.2.4 The Xaira Specification

The Xaira Specification element is used by the XAIRA indexing software to index the BNC. A brief description of its components is provided in xairaspec below; for full information, consult the Xaira documentation.

5.3 The profile description

The third component of a TEI header is the profile description. In the BNC this is used to provide the following elements:
  • <creation> contains information about the creation of a text.
  • <langUsage> (language usage) describes the languages, sublanguages, registers, dialects etc. represented within a text.
  • <particDesc> (participation description) describes the identifiable speakers, voices, or other participants in a linguistic interaction.
    n in demographic texts, supplies the respondent number used to identify the batch of tapes.
  • <settingDesc> (setting description) describes the setting or settings within which a language interaction takes place, either as a prose description or as a series of setting elements.
  • <textClass> (text classification) groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.

5.3.1 The creation element

This element is provided to record the date of publication for texts originally published separately, and any details concerning the origination of any spoken or written texts, whether or not covered elsewhere. It is supplied in every text header, although the details provided vary. As a minimum, a date (tagged with the standard <date> element) will be included; this gives the date the content of this text was first created. For a spoken text, this will be the same as the date of the recording; for a written text, it will normally be the date of first publication of the edition, which may not be the same as the date of publication of the copy used.

Here are two typical examples:
  <date>1971</date>: originally published by Jonathan Cape.

Note that the BNC contains modernized editions of some classic texts such as Defoe's Robinson Crusoe (FRX); the creation date specified here is that of the creation of the modernized version rather than that of the 18th-century original.

For imaginative works, the creation date is also the date used to classify the text (by means of the WRITIM category). For other written works, such as textbooks, which are likely to have been extensively revised since their first publication, the date used to classify the text will be that of the edition described in the <sourceDesc>, but the original date will also be recorded within the <creation> element.

5.3.2 The <langUsage> element

Unlike the other elements of the profile description, the language usage element occurs only in the corpus header. It contains the following text:
 <langUsage> The language of the British National Corpus is modern British English. Words, fragments, and passages from many other languages, both ancient and modern, occur within the corpus where these may be represented using a Latin alphabet. Long passages in these languages, and material in other languages, are generally silently deleted. In no case is the lang attribute used to indicate the language of a word, phrase or passage, nor are alternate writing system definitions used. </langUsage>

5.3.3 The participant description

The participant description (<particDesc>) element is used to provide information about speakers of texts transcribed for the BNC. It appears only within individual spoken text headers to define the participants specific to those texts.

It contains a series of <person> elements describing the participants whose speech is transcribed in this text.

The person element

Each <person> element describes a single participant in a language interaction. It carries a number of attributes which are used to provide encoded values for some key aspects of the person concerned:
  • <person> provides information about an identifiable individual, for example a participant in a language interaction, or a person referred to in a historical source.
    ageGroup specifies the age group to which the participant belongs.
    dialect specifies the dialect or accent of a participant's speech, as identified by the respondent.
    firstLang specifies the country of origin of the participant, as identified by the respondent.
    n internal identifier.
    educ specifies the age at which the participant ceased full-time education.
    soc specifies the social class of the participant.
    sex specifies the sex of the participant.
    role describes the relationship or role of this participant with respect to the respondent.
    xml:id provides the unique identifier for this element.

The xml:id attribute is required for each participant whose speech is included in a text, and its value is unique within the corpus. Although a given individual will always have the same identifier within a single text, there is no way of identifying the same individual should they appear in different texts. Since all demographically sampled conversations collected by a single respondent are treated together as a single text, and respondents were recruited from many different social contexts, the probability of the same person being recorded by different respondents is rather low, though not completely impossible.

On many occasions the speaker of a given utterance cannot be identified. A special code is used to indicate an unknown speaker, but, for consistency, this is also made unique to each text. Thus, an "unknown speaker" in one text will have a different identifying code from an "unknown speaker" in another. As far as possible, different speakers are given different identifying codes, even where they cannot be identified with any confidence; thus there may be more than one "unidentified" speaker in the same text.

Where several speakers speak together, if they are identified, then all of the relevant codes are given; if however they are not, then a special "unknown speaker group" code is used.

Where it is available, additional information about a participant is provided by one or more of the following elements, appearing within the <person> element:
  • <persName> (personal name) contains a proper noun or proper-noun phrase referring to a person, possibly including any or all of the person's forenames, surnames, honorifics, added names, etc.
  • <age> specifies the age in years of a recorded participant at the time of the recording in which they participate.
  • <occupation> contains an informal description of a person's trade, profession or occupation.
  • <dialect> contains an informal description of the regional variety of English used by a participant in a spoken text.
  • <persNote> contains any additional information supplied about a participant in a spoken text.

In each case, the information provided is that given by the respondent and is taken from the log books issued to all participants in the demographic part of the corpus. It has not been normalized.

Here is a typical example from the demographic part of the corpus:

Here is a typical example from the context-governed part of the corpus:

  <persName>frank harasikwa</persName>
  <persNote>Euro candidate presenting self for selection</persNote>
Any recorded relationship between speakers in the demographically sampled part of the corpus is specified by means of the role attribute, which indicates how the speaker concerned is related to the respondent, for example as a friend, colleague, brother, wife, etc. For example, the participant information recorded in the header for a text (KSU) comprising conversations between four participants (Michael and Steve, who are brothers, their mother Christine and their aunt Leslie) is as follows:
 <particDesc n="708">


   <occupation>credit controller</occupation>



In the context-governed part of the corpus, however, there is no respondent, and relationship information must be deduced from the other information provided. The role attribute for <person> elements in these texts will usually have the value unspecified.

5.3.4 The setting description

The <settingDesc> element is used to describe the context within which a spoken text takes place. It appears once in the header of each spoken text, and contains one or more <setting> elements for each distinct recording.
  • <setting> (setting) describes one particular setting in which a language interaction takes place.
    who indicates the person, or group of people, to whom the element content is ascribed.
    n an internal identifier for a setting.
    xml:id provides the unique identifier for this element.
The content of each <setting> element supplies additional details about the place, time of day, and other activities going on, using the following additional elements:
  • <date> contains a date in any format.
  • <locale> contains a brief informal description of the nature of a place for example a room, a restaurant, a park bench etc.
  • <activity> contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything.
    spont level of spontaneity or informality of the context as assessed by transcriber.
  • <placeName> contains an absolute or relative place name.
Some typical examples follow:
 <setting n="020901" who="PS000 DCJPS000 DCJPS001">
  <name>Essex: Harlow </name>
  <locale> Harlow College</locale>
  <activity spont="M"> A'level lecture </activity>
 <setting xml:id="KDFSE002" n="063505" who="PS0M6">
  <name>Lancashire: Morecambe </name>
  <locale> at home </locale>
  <activity spont="H"> watching television </activity>

5.3.5 Text classification

The TEI provides a number of ways in which classification or text-type information may be specified for a text, grouped together within a <textClass> element, which appears once in the header of each text. Classifications may be represented using references to internally defined classifications provided in the <classDecl> element (such as the BNC classification scheme described in section 5.2.3 The reference and classification declarations), by reference to some other predefined classification system, or by an open set of keywords. All three methods are used in the BNC, using the following elements:
  • <catRef> (category reference) provides a list of codes identifying the categories to which this text has been assigned, each code referencing a category element declared in the corpus header (list available as a separate document).
  • <classCode> contains the classification code used for this text in some standard classification system.
    scheme identifies the classification system or taxonomy in use.
  • <keywords> contains a list of keywords or phrases identifying the topic or nature of a text.

A <catRef> element is provided in the header of each text. Its target attribute contains values for each of the classification codes defined in the corpus header. In each case, the classification code consists of a code used as the identifier of a <category> element within a <taxonomy> element defined in the corpus header. For example: ALLTIM1 indicates ‘dated 1960-1974’. A list of the values used is given in section 9.6 Text and genre classification codes below.
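
The lookup described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the corpus's own tooling: the taxonomy fragment is invented for the example (only the code ALLTIM1 and its gloss come from the text), the taxonomy identifier EXAMPLE and the code NOSUCHCODE are made up, the use of a <catDesc> child to hold each description follows general TEI practice and is assumed here, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined in XML; ElementTree expands it to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Illustrative fragment only: a single category taken from the text, inside a
# taxonomy whose identifier is invented for this example.
TAXONOMY_XML = """
<taxonomy xml:id="EXAMPLE">
  <category xml:id="ALLTIM1"><catDesc>dated 1960-1974</catDesc></category>
</taxonomy>
"""


def category_descriptions(taxonomy):
    """Map each category identifier to its description (assumes <catDesc>)."""
    return {
        cat.get(XML_ID): (cat.findtext("catDesc") or "").strip()
        for cat in taxonomy.iter("category")
    }


def resolve(codes, descriptions):
    """Pair each white-space separated code with its description, or None."""
    return [(code, descriptions.get(code)) for code in codes.split()]


descriptions = category_descriptions(ET.fromstring(TAXONOMY_XML))
# NOSUCHCODE is a made-up value showing what happens to an undeclared code.
resolved = resolve("ALLTIM1 NOSUCHCODE", descriptions)
print(resolved)
```

Returning None for an undeclared code, rather than raising an error, makes it easy to list any codes in a text header that have no matching <category> in the corpus header.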

This taxonomy is that originally defined for selection and description of texts during the design of the corpus, as further discussed elsewhere. It is of course possible to classify the texts in many other ways, and no claim is made that this method is universally applicable or even generally useful, though it does serve to identify broadly distinct sub-parts of the corpus for investigation. The reader is also cautioned that, although an attempt has been made in the current edition of the corpus to correct the more egregious classification errors noted in the first edition, unquestionably many errors and inconsistencies remain. In particular, the categories WRILEV (perceived level of difficulty) and WRISTA (estimated circulation size) were incorrectly differentiated during the preparation of the corpus and cannot be relied on.

A <classCode> element is also provided for every text in the corpus. This contains the code assigned to the text in a genre-based analysis carried out at Lancaster University by David Lee since publication of the first edition of the BNC. Lee's scheme classes the texts more delicately in most cases, since it takes into account their topic or subject matter (see further 9.6 Text and genre classification codes below).

Lee's scheme is also used as the basis of a very simple categorization for each text, which is provided by means of the type attribute on its <wtext> or <stext> element. This categorization distinguishes six categories for written text (fiction, academic prose, non-academic prose, newspapers, other published, unpublished), and two for spoken text (conversation, other). It offers a convenient way of distinguishing the major text types represented in the corpus: see further 9.1 XML tag usage by text type.

In the first release of the BNC, most texts were assigned a set of descriptive keywords, tagged as <term> elements within the <keywords> element. These terms were not taken from any particular descriptive thesaurus or closed vocabulary; the words or phrases used are those which seemed useful to the data preparation agency concerned, and are thus often inconsistent or even misleading. They have been retained unchanged in the present version of the BNC, pending a more thorough revision. In the World (second) Edition this set of keywords was complemented for most written texts by a second set, also tagged using a <keywords> element, but with a value for its scheme attribute of COPAC, indicating that the terms so tagged are derived from a different source. The source used was COPAC, a major online library catalogue service. Like other public access catalogue systems, COPAC uses a well-defined controlled list of keywords for its subject indexing, details of which are not further given here.

Here is an example showing how one text (BND) is classified in each of these ways:
 <teiHeader>... <textClass>

   <classCode scheme="DLee">W_religion</classCode>
   <keywords scheme="COPAC">
    <term>Marriage - Religious aspects - Christianity</term>
    <term>Marriage - Christian viewpoints</term>
    <term>Christian guide to marriage</term>
   </keywords>
  </textClass>... </teiHeader>
 <wtext type="NONAC">...</wtext>
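
The following Python sketch suggests one way of gathering this classification information from a single text. It is an illustration under stated assumptions: the sample document is a much reduced stand-in for a real text (the intermediate header elements are omitted, and only the values shown in the example are used), and the function name classify and the layout of its result are hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced stand-in for one text: only the parts relevant here are present.
DOC_XML = """
<bncDoc>
  <teiHeader>
    <textClass>
      <classCode scheme="DLee">W_religion</classCode>
      <keywords scheme="COPAC">
        <term>Marriage - Religious aspects - Christianity</term>
        <term>Marriage - Christian viewpoints</term>
        <term>Christian guide to marriage</term>
      </keywords>
    </textClass>
  </teiHeader>
  <wtext type="NONAC"/>
</bncDoc>
"""


def classify(doc):
    """Gather class codes, keywords and text type for one text (hypothetical)."""
    text_class = doc.find(".//textClass")
    class_codes = {
        cc.get("scheme"): (cc.text or "").strip()
        for cc in text_class.iter("classCode")
    }
    # A <keywords> element without a scheme attribute is stored under None.
    keywords = {
        kw.get("scheme"): [(t.text or "").strip() for t in kw.iter("term")]
        for kw in text_class.iter("keywords")
    }
    # Written texts use <wtext>, spoken texts <stext>; test against None
    # explicitly, because an element without children is falsy.
    body = doc.find("wtext")
    if body is None:
        body = doc.find("stext")
    text_type = body.get("type") if body is not None else None
    return {"class_codes": class_codes, "keywords": keywords, "text_type": text_type}


result = classify(ET.fromstring(DOC_XML))
print(result)
```

The search expression .//textClass is used so that the sketch does not depend on exactly where <textClass> sits within the header.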

5.4 The revision description

The revision description (<revisionDesc>) element is the fourth and final element of a standard TEI header. In the BNC, it consists of a series of <change> elements.
  • <change> summarizes a particular change or correction made to a particular version of an electronic text which is shared between several researchers.
    date supplies the date of the change in standard form, i.e. yyyy-mm-dd.
    who indicates the person, or group of people, to whom the element content is ascribed.
Here is part of a typical example:
  <change date="2006-10-21" who="#OUCS">Tag usage updated for BNC-XML</change>
  <change date="2000-12-13" who="#OUCS">Last check for BNC World first release</change>
  <change date="1999-12-25" who="#OUCS">corrected tagUsage</change>
  <change date="1999-09-13" who="#UCREL">POS codes revised for BNC-2; header updated</change>
  <change date="1994-11-24" who="#dominic">Initial accession to corpus</change>
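
Because the date attribute uses the yyyy-mm-dd form, a revision history of this kind can be sorted and queried directly. The Python sketch below reads the example above and orders the changes from oldest to newest. The enclosing <revisionDesc> wrapper is supplied to make the fragment parseable, the function name read_changes is hypothetical, and removing the leading # from the who value assumes that it is a pointer to an identifier declared elsewhere.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The example from the text, wrapped in its parent element.
REVISION_XML = """
<revisionDesc>
  <change date="2006-10-21" who="#OUCS">Tag usage updated for BNC-XML</change>
  <change date="2000-12-13" who="#OUCS">Last check for BNC World first release</change>
  <change date="1999-12-25" who="#OUCS">corrected tagUsage</change>
  <change date="1999-09-13" who="#UCREL">POS codes revised for BNC-2; header updated</change>
  <change date="1994-11-24" who="#dominic">Initial accession to corpus</change>
</revisionDesc>
"""


def read_changes(revision_desc):
    """Return (date, who, note) tuples, oldest first (hypothetical helper)."""
    changes = []
    for change in revision_desc.iter("change"):
        changes.append((
            date.fromisoformat(change.get("date")),
            # Assumption: the leading '#' marks a pointer to an identifier.
            (change.get("who") or "").lstrip("#"),
            (change.text or "").strip(),
        ))
    return sorted(changes, key=lambda item: item[0])


changes = read_changes(ET.fromstring(REVISION_XML))
for when, who, note in changes:
    print(when.isoformat(), who, note)
```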

edited by Lou Burnard. Date: January 2007
This page is copyrighted