BNC

British National Corpus User Reference Guide

5. The header

  Author: edited by Lou Burnard (revised LB) Date: (revised 19-22 Nov 2003)


The header of a TEI-conformant text generally provides a highly structured description of its contents, analogous to the title page and front matter provided for conventional printed books. Such information is all too often missing in electronic texts; or if supplied, provided only in the form of external documentation such as this manual. The component elements of a TEI header are intended to provide in machine-processable form all the information needed to make sensible use of the corpus.

Every separate text in the BNC-baby (i.e. each <bncDoc> element) has its own header, referred to below as a text header. In addition, the corpus itself has a header, referred to below as the corpus header, containing information which is applicable to the whole corpus. Both corpus and text headers are represented by <teiHeader> elements. This element carries the following attributes:

type
specifies the kind of document to which the header is attached. Legal values are:

corpus
the header is attached to the corpus.
text
the header is attached to a single text.

creator
specifies the agency responsible for creating the header.
status
specifies the revision status of the associated document. Legal values are:

new
for the first release of the corpus
update
for all subsequent releases.

update
specifies the date on which the header content was last changed or created.
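
As an illustration of the attribute values listed above, the following Python sketch checks the type and status attributes of a <teiHeader> start tag. It assumes the header is available as well-formed XML and that both attributes are present; the function name and the sample attribute values are invented for this example and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Legal values as listed in this section.
LEGAL_TYPES = {"corpus", "text"}
LEGAL_STATUS = {"new", "update"}

def check_header_attributes(header_xml):
    """Return a list of problems found in the attributes of a <teiHeader>."""
    header = ET.fromstring(header_xml)
    problems = []
    if header.get("type") not in LEGAL_TYPES:
        problems.append(f"unexpected type: {header.get('type')!r}")
    if header.get("status") not in LEGAL_STATUS:
        problems.append(f"unexpected status: {header.get('status')!r}")
    return problems

# Hypothetical header: creator and update values are made up for illustration.
sample = '<teiHeader type="text" creator="example" status="new" update="2003-11-19"/>'
problems = check_header_attributes(sample)
```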

In the remainder of this section, we describe the components of the <teiHeader> element, as used within the BNC-baby. A TEI header contains a file description (section 5.1. The file description), an encoding description (section 5.2. The encoding description), a profile description (section 5.3. The profile description) and a revision description (section 5.4. The revision description), represented by the following four elements:

<fileDesc>
contains a full bibliographic description of the corpus itself or of a text within it.
<encodingDesc>
documents the relationship between an electronic text and the source or sources from which it was derived.
<profileDesc>
provides further information about various aspects of a text, specifically the language used, the situation and date of its production, the participants and their setting, and a descriptive classification for it.
<revisionDesc>
summarizes the revision history of a file.

5.1. The file description

The file description (<fileDesc>) is the first of the four main constituents of the header. It documents the electronic file itself, i.e. (in the case of a corpus header) the whole corpus, or (in the case of a text header) any characteristics peculiar to an individual file within it. In each case, it contains the following five subdivisions:

<titleStmt>
contains title information, identifying the corpus, or a text within it.
<editionStmt>
contains additional information relating to a particular version of the corpus (not used with individual corpus texts).
<extent>
describes the approximate size of the electronic file as stored on some carrier medium.
<publicationStmt>
formally describes the publication or distribution of the corpus and its constituent texts.
<sourceDesc>
supplies a bibliographic description for the copy text(s) from which a particular corpus text was derived or generated.

Further detail for each of these is given in the following subsections.

5.1.1. The title statement

As used in the BNC, the title statement (<titleStmt>) element has a simpler structure than the equivalent TEI element: it contains one or more <title> elements, followed by zero or more <author>, <editor>, or <respStmt> elements. These sub-elements are used throughout the header, wherever the title of a work or a statement of responsibility is required.

<title>
the title or chief name of a work, including any alternative titles or subtitles.
<respStmt>
supplies information about any person or institution responsible for the intellectual content of a text, edition, or electronic transcription, using the following two sub-elements:

<resp>
contains a phrase describing the nature of a person's or institution's intellectual responsibility.
<name>
proper name of a person, place or institution.

A standardized form of words has been used for the <title> elements which supplies the following components:

  • for written texts, a (possibly shortened) version of the original source title, or, if there is none, a descriptive phrase enclosed in square brackets
  • an indication of the size and type of the document
  • a note indicating the domain or subject matter of the document

Here are some typical examples:

<title>  Man at the sharp end. Sample containing 
  about 37622 words from a book (domain: imaginative) </title> 

<title>  Belfast Telegraph: Religious affairs stories. Sample 
containing about 2180 words from a periodical (domain: 
belief and thought) </title>        

<title>   6 conversations recorded by `Pauline' (PS0N3) between 
21 and 24 February 1992 with 8 interlocutors, totalling 1668 
s-units, 16234 words, and over 1 hour 49 minutes 24 seconds 
of recordings.  </title>
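
Because the written-text titles follow a standardized form of words, their components can be recovered mechanically. The Python sketch below is one possible approach, not part of the BNC software: the regular expression and the function name are assumptions based only on the two written examples above, and titles in any other form (such as the spoken example) simply return None.

```python
import re

# Written-text form: "<source title> Sample containing about N words
# from a <medium> (domain: <domain>)"
WRITTEN_TITLE = re.compile(
    r"^(?P<source>.*?)\s*Sample\s+containing\s+about\s+(?P<words>\d+)\s+words\s+"
    r"from\s+an?\s+(?P<medium>.+?)\s+\(domain:\s*(?P<domain>.+?)\)\s*$"
)

def parse_written_title(title):
    """Split a written-text title into its components, or return None."""
    normalized = " ".join(title.split())  # collapse line breaks and repeated blanks
    match = WRITTEN_TITLE.match(normalized)
    if match is None:
        return None
    return {
        "source": match["source"],
        "words": int(match["words"]),
        "medium": match["medium"],
        "domain": match["domain"],
    }

book = parse_written_title(
    "  Man at the sharp end. Sample containing \n"
    "  about 37622 words from a book (domain: imaginative) "
)
```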
A <respStmt> element is used to indicate each agency responsible for any significant effort in the creation of the text. Since responsibilities for data encoding and storage, and for enrichment, are the same for all texts, they are not repeated in each text header. The responsibility for original data capture and transcription varies text by text, and is therefore stated in each text header, in the following manner:
 <respStmt> 
  <resp>Data capture and transcription </resp> 
  <name> Longman ELT </name>
</respStmt> 

Author and editor information for the source from which a text is derived (e.g. the author of a book) is not included in the <titleStmt> element but in the <sourceDesc> element discussed below (5.1.5. The source description).

5.1.2. The edition statement

The standard TEI <editionStmt> element is used to specify an edition for each file making up the corpus. For the corpus header this takes the following form:

<editionStmt>
<edition>First Edition</edition>
</editionStmt>

Since the header of each text has been taken unchanged from the BNC World edition, the form of words used for each text header in the current release of the BNC-baby corpus is as follows:

<editionStmt><para>BNC World Edition: 
Header automatically generated by mkhdr 0.30
</para></editionStmt>

5.1.3. The extent statement

The standard TEI <extent> element is used to specify the size of the whole corpus, in the corpus header, or of an individual text, in each text header, as in the following example:

<extent>Approximately 88 Kbytes running text, 
containing about 5890 orthographically-defined words; 
for encoding details see &lt;tagUsage&gt; element.
</extent>
The specified size does not include the size of the header itself. As the text specifies, the size in Kbytes is only approximate (and may vary on different operating systems). The number of words is calculated according to a simple algorithm which defines words as blank-delimited strings. It is not identical to the number of <w> elements actually tagged in the text. For example, the sequence ‘she's’ would count as one word for the purposes of calculating the extent since it does not contain a blank, but it would be tagged as two distinct <w> elements, whereas the sequence ‘in spite of’ counts as three orthographic words, although this sequence is treated as a single <w> element.
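
The counting rule described above can be sketched in a line of Python. This is an illustration of the stated definition (blank-delimited strings), not the program actually used to generate the headers; treating any run of whitespace as a delimiter is an assumption of this sketch.

```python
def orthographic_word_count(text):
    """Count blank-delimited strings, as the <extent> word figure is defined."""
    return len(text.split())

contracted = orthographic_word_count("she's")        # one string, no blank inside
multiword = orthographic_word_count("in spite of")   # three blank-delimited strings
```

The two results differ from the corresponding numbers of <w> elements (two and one respectively), which is why the extent figure and the <w> count for a text do not agree.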

Counts are provided for each element actually tagged in a text, as further discussed below (5.2.2. The tagging declaration).

5.1.4. The publication statement

The standard TEI <publicationStmt> element is used to specify publication and availability information for an electronic text. It contains information about the name and address of the distributor, identification numbers etc., notes on availability and publication dates.

The TEI <distributor> and <address> elements are used to record information about the publisher; for BNC-baby in both corpus and text headers, the name and address given is as follows:

<distributor>
Oxford University Computing Services
</distributor>
<address>
<addrLine>13 Banbury Road, Oxford OX2 6NN U.K.</addrLine>
<addrLine>Telephone:     +44 1865 273221</addrLine>
<addrLine>Facsimile:     +44 1865 273275</addrLine>
<addrLine>Internet mail: natcorp@oucs.ox.ac.uk</addrLine>
</address>

The standard TEI <idno> element is used to identify the item being published. For the corpus header only one such element is specified, as follows:

<idno type="bnc">BNC-baby</idno>
For individual text headers, two identification numbers are supplied, distinguished by the value for the type attribute.
<idno type="bnc">A0A</idno>
<idno type="old">CAMfct</idno>
The second identifier (of type old) is the old-style mnemonic or numeric code attached to BNC texts in early releases of the corpus, and used to label original printed source materials in the BNC Archive. The first, a three-character code (of type bnc), is the standard BNC identifier. It is used both as the name of the file in which the text is stored and as the value supplied for the id attribute on the <bncDoc> element containing the whole text.

As regards availability, for contractual reasons, the corpus header includes a brief rehearsal of the terms and conditions under which the BNC is made available; this is reproduced in section 7.3. The BNC corpus header below. A similar brief notice is also provided in the same place for each individual text:

<availability status="restricted"><para>
Available worldwide
THIS TEXT IS AVAILABLE THROUGHOUT THE WORLD only as part of
the British National Corpus at nominal charge FOR ACADEMIC
RESEARCH PURPOSES SUBJECT TO A SIGNED END USER LICENCE
HAVING BEEN RECEIVED BY OXFORD UNIVERSITY COMPUTING
SERVICES, from whom forms and supporting materials are
available.
THIS TEXT IS NOT AVAILABLE FOR COMMERCIAL RESEARCH AND
EXPLOITATION unless terms have first been agreed with the
BNC Consortium Exploitation Committee. Apply in the first
instance to Oxford University Computing Services.
It is your responsibility, as a user, to ensure that an End
User Licence is in place. For your information, the Terms
of the End User Licence are set out in the corpus header, 
which is likely to have a file name similar to "corphdr" or 
"CORPHDR".
Distribution of any part of the corpus under the terms of
the Licence must include a copy of the corpus header.
Distribution of this corpus text under the terms of the
Licence must include this header embodying this notice.
Permissions grantor for World:
CAMRA (imprint) of St Albans
</para></availability>
Note the inclusion at the end of the notice of the name and address of the agency owning rights in the text concerned, which has granted permission for its inclusion in the corpus. If no such agency is named, permission for rights additional to those explicitly given by the licensing arrangements in place should be sought from the BNC Consortium in the first instance. BNC-baby includes only texts for which world rights have been cleared by the BNC Consortium.

5.1.5. The source description

The standard TEI <sourceDesc> element is used to supply bibliographic details for the original source material from which an electronic text derives. In the case of a BNC text, this might be a book, pamphlet, newspaper etc., or a recording. One of the following two elements available within the <sourceDesc> will be used, as appropriate:

<recordingStmt>
describes a set of recordings used in transcription of a spoken text, either as a series of paragraphs or as a formally structured recording element (5.1.5.1. The recording statement).
<biblStruct>
contains a structured bibliographic citation, in which only bibliographic sub-elements appear and in a specified order. (5.1.5.2. Structured bibliographic record)

These elements are not used within the corpus header, which simply contains a note about the sources from which the corpus was derived, tagged as a <para> (paragraph). The headers of individual texts each contain one of the above elements to specify their source.

5.1.5.1. The recording statement

The recording statement (<recordingStmt>) element contains one or more <recording> elements, defined as follows:

<recording>
details of a particular audio recording used as the source of a spoken text, either directly or from a public broadcast. Attributes include:

date
specifies the date of the recording e.g. 1991-03-06
dur
specifies the duration of the recording, in seconds

time
specifies the time of day when the recording was made, e.g. 11:30+
type
characterizes the recording in terms of the equipment used to make it. In BNC-baby all recordings were made on a Walkman.

The standard TEI global attribute n is used (for this element only) to provide the number of the audio tape holding the original recording, as deposited with the National Sound Archive in London.

The standard TEI global attribute id is used to provide a unique identifier for this recording. The same value is also used on the n attribute of the <settDesc> element, to link recordings made in the same setting, and on the decls attribute of a <div> element within the transcription, as further described below, in section 5.3.4. The setting description.

The BNC version of this TEI element has two additional attributes (date and time) and it may contain only character data, rather than the more complex substructure permitted by the TEI equivalent.

When, as is often the case for the spoken demographic parts of the BNC, a text has been made up by transcribing several different recordings made by a single respondent over a period of time, each such recording will have its own <recording> element, as in the following example:

<recordingStmt>
  <recording n="018201" dur="322" date="1991-11-28" 
        time="18:15+" type="Walkman" id="KB7RE000"></recording>
  <recording n="018202" dur="253" date="1991-11-28" 
        time="18:15+" type="Walkman" id="KB7RE001"></recording>
  <!-- ... -->
  <recording n="018207" dur="630" date="1991-11-29" 
        time="10:15+" type="Walkman" id="KB7RE006"></recording>
  <recording n="018301" dur="75" date="1991-11-29" 
        time="12:15+" type="Walkman" id="KB7RE007"></recording>
  <!-- ... -->
</recordingStmt>

Note the presence of an id attribute on each of the above recordings. The value given here is used to indicate the recording from which a given part of the text was transcribed. Each recording is transcribed as a distinct <div> (division) element within an <stext>, with its identifier supplied as the value of a decls attribute. Thus, in the body of the text from which the above example was taken, there will be a <div> element starting as follows:

<div decls="KB7RE007">
which will contain the part of text transcribed from that recording. As noted above the identifier supplied on the n attribute is quite distinct, and specifies the original tape on which the recording was made.
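
The link between a <div> and its recording can be followed programmatically. The Python sketch below indexes a shortened copy of the recording statement shown above by id, then looks up the recording named by a decls value. It assumes the header fragment is well-formed XML; the function and variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example above (three of the recordings).
RECORDING_STMT = """
<recordingStmt>
  <recording n="018201" dur="322" date="1991-11-28"
        time="18:15+" type="Walkman" id="KB7RE000"></recording>
  <recording n="018202" dur="253" date="1991-11-28"
        time="18:15+" type="Walkman" id="KB7RE001"></recording>
  <recording n="018301" dur="75" date="1991-11-29"
        time="12:15+" type="Walkman" id="KB7RE007"></recording>
</recordingStmt>
"""

def index_recordings(stmt_xml):
    """Map each recording id to its attributes."""
    root = ET.fromstring(stmt_xml)
    return {rec.get("id"): dict(rec.attrib) for rec in root.iter("recording")}

recordings = index_recordings(RECORDING_STMT)

# Resolve the decls value of a <div> to the recording it was transcribed from.
source = recordings["KB7RE007"]
tape_number = source["n"]  # the original tape, distinct from the recording id

# dur is given in seconds, so durations can be summed directly.
total_seconds = sum(int(rec["dur"]) for rec in recordings.values())
```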

5.1.5.2. Structured bibliographic record

The standard TEI <biblStruct> element is used to record bibliographic information for each non-spoken component of the BNC. As defined in the TEI, this element has a complex structure designed to support a wide range of standard bibliographic practices. In the BNC, its structure is restricted as further described below.

At the highest level, every BNC <biblStruct> element contains a <monogr> element holding other elements that describe the item in question.

At least one <monogr> element must be present in a <biblStruct> element. It may contain the following elements:

<title>
the title or chief name of a work, including any alternative titles or subtitles; this must be given first. In several cases, a generated title or descriptive paraphrase is used, generally enclosed within square brackets. In the current version of the corpus, subtitles, alternative or series titles are not distinguished from the main title, other than by the use of conventional punctuation.
<author>
the name of an author (personal or corporate) of a work; names are generally given in canonical form, with surnames preceding forenames. Unlike the TEI equivalent element of the same name, the BNC version has two additional attributes:

domicile
specifies the author's domicile, as established for the purposes of the BNC ‘Britishness’ test.
born
specifies the author's year of birth, where available.

<editor>
the name of the editor (personal or corporate) for a work.
<imprint>
groups information relating to the publication or distribution of a bibliographic item.
<biblScope>
defines the scope of a bibliographic reference, for example as a list of page numbers: the only value used in BNC-baby is pp for ‘page numbers’.

A <title> element must be present and is always given first. None of the other components is mandatory, but if any of them are supplied, they must be in the following order, following the title:

  • one or more statements of intellectual responsibility (i.e. <author> or <editor> elements)
  • one or more <imprint> elements

The n attribute is used with both <author> and <imprint> elements to supply a six-letter code identifying the author or imprint concerned. The values used should be unique across the corpus, but this is not validated by the current release of the DTD.

For published texts at least one <imprint> element is supplied, containing the following elements in the order given:

<publisher>
name of a publisher.
<pubPlace>
place of publication.
<date>
date of publication of the edition transcribed, usually given in normalized format. Note that this may not be the same as the date specified by the <creation> element. Attributes include:

value
specifies a standard value for this date in ISO 8601 format.

The following example demonstrates how these elements are used to record bibliographic details for a typical book:

<biblStruct default="NO">
          <monogr>
            <title>Matrices and engineering dynamics. </title>
            <author n="SimpsA1">Simpson, A</author>
            <author n="CollaA1">Collar, A R</author>
            <imprint n="ELLISH1">
              <publisher>Ellis Horwood Ltd</publisher>
              <pubPlace>Chichester</pubPlace>
              <date value="1987">1987</date>
            </imprint>
            <biblScope type="pp">11-195</biblScope>
          </monogr>
        </biblStruct>

The following example is typical of the case where a collection of newspaper pages has been treated as a single text:

<biblStruct default="NO">
          <monogr>
            <title>Belfast Telegraph: Business section.</title>
            <imprint>
              <publisher>u.p.</publisher>
            </imprint>
          </monogr>
        </biblStruct>
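
Since most components of <monogr> are optional, any program reading these records has to allow for missing elements. The following Python sketch extracts a few fields from the two examples above; it assumes well-formed XML, and the function name and the shape of the returned dictionary are choices made for this illustration.

```python
import xml.etree.ElementTree as ET

BOOK = """
<biblStruct default="NO">
  <monogr>
    <title>Matrices and engineering dynamics. </title>
    <author n="SimpsA1">Simpson, A</author>
    <author n="CollaA1">Collar, A R</author>
    <imprint n="ELLISH1">
      <publisher>Ellis Horwood Ltd</publisher>
      <pubPlace>Chichester</pubPlace>
      <date value="1987">1987</date>
    </imprint>
    <biblScope type="pp">11-195</biblScope>
  </monogr>
</biblStruct>
"""

NEWSPAPER = """
<biblStruct default="NO">
  <monogr>
    <title>Belfast Telegraph: Business section.</title>
    <imprint><publisher>u.p.</publisher></imprint>
  </monogr>
</biblStruct>
"""

def summarize_source(bibl_xml):
    """Collect title, authors, publisher and year; absent items become empty or None."""
    monogr = ET.fromstring(bibl_xml).find("monogr")
    date = monogr.find("imprint/date")
    return {
        "title": monogr.findtext("title", default="").strip(),
        "authors": [(a.text or "").strip() for a in monogr.findall("author")],
        "publisher": monogr.findtext("imprint/publisher"),
        "year": date.get("value") if date is not None else None,
    }

book = summarize_source(BOOK)
newspaper = summarize_source(NEWSPAPER)
```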

Where ‘series’ information is available for a given title, this is not normally tagged distinctly. Instead the series title is given as part of the monographic title, usually preceded by a colon.

5.2. The encoding description

The second major component of the TEI header is the encoding description (<encodingDesc>). This contains information about the relationship between an encoded text and its original source and describes the editorial and other principles employed throughout the corpus. It also contains reference information used throughout the corpus.

The standard TEI <encodingDesc> element has the following six components:

<projectDesc>
describes in detail the purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
<samplingDecl>
contains a prose description of the rationale and methods used in sampling texts in the creation of the corpus.
<editorialDecl>
provides details of editorial principles and practices applied during the encoding of a text.
<tagsDecl>
provides detailed information about the tagging applied to a corpus text.
<refsDecl>
specifies how canonical references are constructed for a text.
<classDecl>
contains a series of <category> elements, defining the classification codes used for texts within the corpus.

In the BNC, one of each of these elements appears in the corpus header, with the exception of the <tagsDecl> element which is also given in the individual text headers.

5.2.1. Documentary components of the encoding description

The <projectDesc> element for the corpus gives a brief description of the goals, organization and results of the original BNC project. It is reproduced in section 7.3. The BNC corpus header below.

In the original BNC header, the <samplingDecl> and <editorialDecl> elements were used to document a range of different sampling and editorial policies. This was necessary because the corpus was originally constructed by different people using slightly different principles. The policies applicable to a given text or part of a text are encoded as a series of five-character codes, supplied as the value of a decls (declarations) attribute on the element concerned; the significance of each code is given in the appropriate part of the corpus header.

Although of largely historical interest, these policy codes have been retained in BNC-baby itself, and the original declarations for them have therefore been retained in the corpus header. The subset of the defined codes actually used in BNC-baby is listed here for convenience:

CN000
Errors tagged with <sic> when seen; no normalization
CN001
Errors tagged with <sic> if seen; normalisation with <corr>
CN002
Normalized to standard British English or control list member
CN004
Corrections and normalizations applied silently
HN000
Smart elision of line-end hyphens; &rehy used for remainder
HN001
Dumb elision of line-end hyphens; true hyphens hand-reinstated
HN002
Line-end hyphens removed by hand where appropriate
QN000
Open and close quote normalized to &bquo, &equo
QN001
Open and close quote normalized to &quo
QN002
Quotation may be represented using <shift>
SN000
Segmentation carried out by CLAWS5.
TN006
Recording transcribed into Longman format; automatically translated to SGML
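
A program that needs to interpret these codes can hold the list above as a lookup table. The Python sketch below does so; it assumes that a decls value carries its codes separated by blanks, which is an assumption of this example rather than something stated in this section, and the sample decls value is invented.

```python
# Policy codes used in BNC-baby, as listed above.
POLICY_CODES = {
    "CN000": "Errors tagged with <sic> when seen; no normalization",
    "CN001": "Errors tagged with <sic> if seen; normalisation with <corr>",
    "CN002": "Normalized to standard British English or control list member",
    "CN004": "Corrections and normalizations applied silently",
    "HN000": "Smart elision of line-end hyphens; &rehy used for remainder",
    "HN001": "Dumb elision of line-end hyphens; true hyphens hand-reinstated",
    "HN002": "Line-end hyphens removed by hand where appropriate",
    "QN000": "Open and close quote normalized to &bquo, &equo",
    "QN001": "Open and close quote normalized to &quo",
    "QN002": "Quotation may be represented using <shift>",
    "SN000": "Segmentation carried out by CLAWS5.",
    "TN006": "Recording transcribed into Longman format; "
             "automatically translated to SGML",
}

def describe_policies(decls):
    """Pair each code in a decls value with its description (None if not listed)."""
    return [(code, POLICY_CODES.get(code)) for code in decls.split()]

# Hypothetical decls value, for illustration only.
policies = describe_policies("CN004 HN001 QN000 SN000")
```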

5.2.2. The tagging declaration

The standard tagging declaration (<tagsDecl>) element is used slightly differently in corpus and in text headers. In the corpus header, it is used to list every element name actually used within the corpus, together with a brief description of its function. In text headers, it is used to specify the number of elements actually tagged within each text. In either case it consists of a number of <tagUsage> elements, defined as follows:

<tagUsage>
supplies information about the usage of a specific element within a <text>. Attributes include:

gi
the name (generic identifier) of the element indicated by the tag.
occurs
the number of occurrences of this element within the text.

In the corpus header, each <tagUsage> element contains a brief description of the element specified by its gi attribute; the occurs attribute is not supplied, as in the following extract:

<tagUsage gi="div4">
Written text division, level 4
</tagUsage>
<tagUsage gi="event">
Non-verbal event in spoken text
</tagUsage>
<tagUsage gi="gap">
Point where source material has been omitted
</tagUsage>
<tagUsage gi="head">
Header or headline in written text 
</tagUsage>
In text headers, the <tagUsage> elements are empty, but the occurs attribute is always supplied, and indicates the number of such elements which appear within the text, as in the following example, taken from a typical written text:
<tagsDecl>
<tagUsage gi="body" occurs="1"></tagUsage>
<tagUsage gi="c" occurs="4649"></tagUsage>
<tagUsage gi="caption" occurs="77"></tagUsage>
<tagUsage gi="div1" occurs="2"></tagUsage>
<tagUsage gi="div2" occurs="10"></tagUsage>
<tagUsage gi="div3" occurs="49"></tagUsage>
<tagUsage gi="gap" occurs="1"></tagUsage>
<tagUsage gi="head" occurs="55"></tagUsage>
<tagUsage gi="hi" occurs="50"></tagUsage>
<tagUsage gi="item" occurs="26"></tagUsage>
<tagUsage gi="list" occurs="4"></tagUsage>
<tagUsage gi="note" occurs="3"></tagUsage>
<tagUsage gi="p" occurs="378"></tagUsage>
<tagUsage gi="pb" occurs="65"></tagUsage>
<tagUsage gi="ptr" occurs="79"></tagUsage>
<tagUsage gi="s" occurs="1713"></tagUsage>
<tagUsage gi="sic" occurs="1"></tagUsage>
<tagUsage gi="text" occurs="1"></tagUsage>
<tagUsage gi="w" occurs="40394"></tagUsage>
</tagsDecl>
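
The occurs values make it possible to obtain simple statistics for a text without parsing its body. The Python sketch below reads a shortened copy of the tagging declaration above into a dictionary of counts; it assumes well-formed XML, and the names used are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example above.
TAGS_DECL = """
<tagsDecl>
<tagUsage gi="c" occurs="4649"></tagUsage>
<tagUsage gi="p" occurs="378"></tagUsage>
<tagUsage gi="s" occurs="1713"></tagUsage>
<tagUsage gi="w" occurs="40394"></tagUsage>
</tagsDecl>
"""

def tag_counts(tags_decl_xml):
    """Map each element name (gi) to the number of times it occurs in the text."""
    root = ET.fromstring(tags_decl_xml)
    return {usage.get("gi"): int(usage.get("occurs")) for usage in root.iter("tagUsage")}

counts = tag_counts(TAGS_DECL)

# Average number of <w> elements per <s> element in this text.
words_per_sentence = counts["w"] / counts["s"]
```

Note that the figure for <w> here counts tagged words, which is not the same as the orthographic word count given in the <extent> element (see 5.1.3. The extent statement).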

5.2.3. The reference and classification declarations

The <refsDecl> element for the corpus header defines the approved format for references to the corpus. It takes the following form:

<refsDecl>
<para>Canonical references in the British National Corpus
are to text segment (&lt;s&gt;) elements, and
are constructed by taking the value of the n attribute
of the &lt;cdif&gt; element containing the target text,
and concatenating a dot separator, followed by the value
of the n attribute of the target &lt;s&gt; element.
</para></refsDecl>
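
The rule stated in the <refsDecl> amounts to a simple string concatenation. The Python sketch below builds and splits such references; the function names and the sample values are hypothetical and chosen only to show the shape of a reference.

```python
def canonical_reference(text_n, s_n):
    """Join the n value of the containing text and the n value of the <s> with a dot."""
    return f"{text_n}.{s_n}"

def split_reference(reference):
    """Split a canonical reference at its last dot into (text n, <s> n)."""
    text_n, _, s_n = reference.rpartition(".")
    return text_n, s_n

# Hypothetical values, for illustration only.
ref = canonical_reference("A0A", "12")
parts = split_reference(ref)
```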

The standard TEI <classDecl> element is used in the BNC Corpus Header to formally define the text classification scheme applied to the corpus, and the particular codes within it. It consists of a set of <category> elements, each representing a particular textual classification feature and a value for that feature. The classification scheme defined here is that used for the original design of the BNC: consequently, not all the categories defined here are actually used in the BNC-baby.

<category>
contains an individual descriptive category or feature-value pair.
<catDesc>
describes some category within a taxonomy or text typology, in the form of a brief prose description.

For example, the following <category> elements appear within the BNC <classDecl> element in the header:

<category id="wridom">
<catDesc>Domain for written corpus texts</catDesc>
<category id="wridom1">
<catDesc>Imaginative</catDesc>
</category>
<category id="wridom2">
<catDesc>Informative: natural &amp; pure science</catDesc>
</category>
<category id="wridom3">
<catDesc>Informative: applied science</catDesc>
</category>
<category id="wridom4">
<catDesc>Informative: social science</catDesc>
</category>
<category id="wridom5">
<catDesc>Informative: world affairs</catDesc>
</category>
<category id="wridom6">
<catDesc>Informative: commerce &amp; finance</catDesc>
</category>
<category id="wridom7">
<catDesc>Informative: arts</catDesc>
</category>
<category id="wridom8">
<catDesc>Informative: belief &amp; thought</catDesc>
</category>
<category id="wridom9">
<catDesc>Informative: leisure</catDesc>
</category>
</category>
The <catDesc> element contained by the outer <category> element here (that with identifier wridom) is understood to apply also to each <catDesc> contained by each of its constituent (daughter) <category> elements. That is, the full description for category wridom3 is ‘Domain for written corpus texts: Informative: applied science’.
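
The way a daughter category inherits the description of its parent can be sketched as a recursive walk over the nested <category> elements. The Python example below uses a shortened copy of the declaration above and assumes well-formed XML; the function name and the choice of ": " as the joining string are assumptions of this illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example above.
CLASS_DECL = """
<category id="wridom">
<catDesc>Domain for written corpus texts</catDesc>
<category id="wridom2">
<catDesc>Informative: natural &amp; pure science</catDesc>
</category>
<category id="wridom3">
<catDesc>Informative: applied science</catDesc>
</category>
</category>
"""

def full_descriptions(category, inherited=()):
    """Yield (id, description) pairs, prefixing each description with its ancestors'."""
    parts = inherited + (category.findtext("catDesc", default="").strip(),)
    yield category.get("id"), ": ".join(parts)
    for child in category.findall("category"):
        yield from full_descriptions(child, parts)

descriptions = dict(full_descriptions(ET.fromstring(CLASS_DECL)))
applied_science = descriptions["wridom3"]
```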

The category descriptions applicable to a given text are specified by the <catRef> element within its header (see 5.3.5. Text classification). Its target attribute lists the identifiers of all <category> elements applicable to that text. Thus, the header of a written text assigned to the social science domain which has a corporate author will include a <catRef> element like the following:

<catRef target="... wriaty1 wridom4 ...">
The dots above represent the identifiers of all other category codes applicable to this text.

A full list of all category codes, with the number of texts so classified in BNC-baby, is provided in section 6.4. Text and genre classification codes; see also the classification tables in 2. Design of BNC-baby.

Information about the classification and categorization of an individual text is held within the <textClass> element discussed below (5.3.5. Text classification).

5.3. The profile description

The third component of the TEI header is the profile description (<profileDesc>) element, which has the following components:

<creation>
contains information about the creation of a text.
<langUsage>
describes the languages, sublanguages, registers, dialects etc. represented within a text.
<particDesc>
describes the identifiable participants in a linguistic interaction together with their relationships, where known.
<settDesc>
describes the setting or settings within which a language interaction takes place.
<textClass>
groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.

These elements are all used in the BNC, as further described in the following sections.

5.3.1. The creation element

This element is provided to record the date of first publication of individual published texts, and any details concerning the origination of any spoken or written texts, whether or not covered elsewhere. It is supplied in every text header, although the details provided vary. As a minimum, a date (tagged with the standard <date> element) will be included; this gives the date the content of this text was first created. For a spoken text, this will be the same as the date of the recording; for a written text, it will normally be the date of first publication.

Here are two typical examples:

<creation><date>1992-02-11</date>: 
</creation> 

<creation><date>1971</date>:
originally published by Jonathan Cape.
</creation>

For imaginative works, the creation date is also the date used to classify the text (by means of the writim category). For other written works, such as textbooks, which are likely to have been extensively revised since their first publication, the date used to classify the text will be that of the edition described in the <sourceDesc>, but the original date will also be recorded within the <creation> element.

5.3.2. The <langUsage> element

Unlike the other elements of the profile description, the language usage element occurs only in the corpus header. It contains the following text:

<langUsage>
The language of the British National Corpus is modern
British English.  Words, fragments, and passages from many
other languages, both ancient and modern, occur within the
corpus where these may be represented using a Latin
alphabet.  Long passages in these languages, and material
in other languages, are generally silently deleted.  In no
case is the lang attribute used to indicate the language
of a word, phrase or passage, nor are alternate writing
system definitions used.
</langUsage>

5.3.3. The participant description

The participant description (<particDesc>) element is used to provide information about speakers of texts transcribed for the BNC. In its basic structure it is close to the element defined by the TEI but it has been modified to include some more specific elements provided for the BNC. It appears both within the corpus header, to define the generic ‘unknown participant’, and also within individual spoken text headers to define the participants specific to those texts.

It contains a series of <person> elements describing the participants whose speech is transcribed in this text, followed by an optional <particLinks> element describing any relationships or links amongst them.

5.3.3.1. The person element

Each <person> element describes a single participant in a language interaction and may take one or more of the following attributes:

id
(mandatory) supplies a unique code used to identify this speaker and their utterances in the transcription.
role
specifies the role of this participant with respect to the respondent, as specified by the respondent.
sex
specifies the sex of the participant. Possible values are:

m
male
f
female
u
unknown or inapplicable

age
specifies the age group to which the participant belongs. Possible values are:

0
Under 15 years
1
15 to 24 years
2
25 to 34 years
3
35 to 44 years
4
45 to 59 years
5
Over 59 years
X
Unknown

soc
specifies the social class of the participant. Legal values are:

AB
AB (top or middle management, administrative or professional)
C1
C1 (junior management, supervisory or clerical)
C2
C2 (skilled manual)
DE
DE (semi-skilled or unskilled)
UU
Class unknown

The global id attribute is required for each participant whose speech is included in a text, and its value is unique within the corpus. Although a given individual will always have the same identifier within a single text, there is no way of identifying the same individual appearing in different texts. This is one reason why all demographically sampled conversations collected by a single respondent are treated together as a single text.

The value for the dialect attribute is a three-letter code: a full list of codes used and their meanings is given in section 6.3. Regional codes.

Where available, any additional information about a participant will be provided as text within the <person> element, enclosed within a <para> element. The following extra elements are also provided for this purpose:

age
specified more exactly than by the age attribute, which groups respondents into age bands.
name
a proper name used for the person.
occupation
characterization of the person's occupation.
dialect
characterization of the person's dialect.

In each case, the information provided is that given by the respondent and is taken from the log books issued to all participants in the demographic part of the corpus. It has not been normalized.

Here is a typical example from the demographic part of the corpus:

<person id="PS028" role="self" sex="m" soc="C2" age="1" dialect="XLO">
  <name>Andrew</name>
  <age>16</age>
  <occupation>student</occupation>
  <dialect>London</dialect>
</person>
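
As a rough illustration of how these attribute codes might be expanded when reading a header, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The lookup tables are copied from the attribute descriptions above; the helper name describe_person and the keys of the dictionary it returns are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Code tables copied from the attribute descriptions in this section.
SEXES = {"m": "male", "f": "female", "u": "unknown or inapplicable"}
AGE_BANDS = {
    "0": "Under 15 years",
    "1": "15 to 24 years",
    "2": "25 to 34 years",
    "3": "35 to 44 years",
    "4": "45 to 59 years",
    "5": "Over 59 years",
    "X": "Unknown",
}
SOCIAL_CLASSES = {
    "AB": "AB (top or middle management, administrative or professional)",
    "C1": "C1 (junior management, supervisory or clerical)",
    "C2": "C2 (skilled manual)",
    "DE": "DE (semi-skilled or unskilled)",
    "UU": "Class unknown",
}

# The example <person> element shown above.
PERSON_XML = """<person id="PS028" role="self" sex="m" soc="C2" age="1" dialect="XLO">
  <name>Andrew</name>
  <age>16</age>
  <occupation>student</occupation>
  <dialect>London</dialect>
</person>"""


def describe_person(person):
    """Hypothetical helper: expand the coded attributes of one <person>."""
    return {
        "id": person.get("id"),
        "role": person.get("role"),
        "sex": SEXES.get(person.get("sex")),
        "age_band": AGE_BANDS.get(person.get("age")),
        "social_class": SOCIAL_CLASSES.get(person.get("soc")),
        "dialect_code": person.get("dialect"),
        # Child elements hold the respondent's own, unnormalized wording.
        "name": person.findtext("name"),
        "stated_age": person.findtext("age"),
        "occupation": person.findtext("occupation"),
        "stated_dialect": person.findtext("dialect"),
    }


info = describe_person(ET.fromstring(PERSON_XML))
print(info["id"], info["age_band"], info["social_class"])
```

Missing attributes or child elements simply come back as None, which seems a reasonable default given that the log-book information is not always available.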

5.3.4. The setting description

The TEI <settDesc> element is used to document the context within which a spoken text takes place. It appears once in the header of each spoken text, and contains one or more <setting> elements for each distinct recording.

<setting>
describes one particular setting in which a language interaction takes place. The following attributes are used:

who
supplies the identifiers of the participants at this setting.
id
supplies a unique identifier for this setting.
n
supplies the identifier used for the <recording> element corresponding with this setting.

The content of each <setting> element supplies additional details about the place, time of day, and other activities going on, using the following additional elements:

<name>
contains a place name, usually prefixed by the name of the English county in which it is located.
<locale>
contains a brief informal description of the nature of a place, for example a room, a restaurant, a park bench etc.
<activity>
contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything. Bears an additional attribute:

spont
indicates the degree of spontaneity associated with the activity, as H (high), M (medium), or L (low).

Thus, the following example provides additional information about the setting in which the recording on tape number 063505 was made. The person recorded (PS0M6) is watching television at home in Morecambe, Lancs.

        <setting id="KDFSE002" n="063505" who="PS0M6">
          <name>Lancashire:  Morecambe </name>
          <locale> at home </locale>
          <activity spont="H"> watching television </activity>
        </setting>
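
A setting of this kind could be read along the following lines. This is only a sketch: the function names clean and read_setting are hypothetical, and it assumes that a who attribute naming several participants separates their identifiers with white space, which the example above (with a single participant) does not confirm.

```python
import xml.etree.ElementTree as ET

# The example <setting> element shown above.
SETTING_XML = """<setting id="KDFSE002" n="063505" who="PS0M6">
  <name>Lancashire:  Morecambe </name>
  <locale> at home </locale>
  <activity spont="H"> watching television </activity>
</setting>"""

# Values of the spont attribute, as described in this section.
SPONTANEITY = {"H": "high", "M": "medium", "L": "low"}


def clean(text):
    """Collapse runs of white space; return None for missing text."""
    return " ".join(text.split()) if text else None


def read_setting(setting):
    """Hypothetical helper: collect the details of one <setting>."""
    activity = setting.find("activity")
    return {
        "id": setting.get("id"),
        "recording": setting.get("n"),
        # Assumption: multiple identifiers are separated by white space.
        "participants": (setting.get("who") or "").split(),
        "place": clean(setting.findtext("name")),
        "locale": clean(setting.findtext("locale")),
        "activity": clean(activity.text) if activity is not None else None,
        "spontaneity": (
            SPONTANEITY.get(activity.get("spont")) if activity is not None else None
        ),
    }


setting = read_setting(ET.fromstring(SETTING_XML))
print(setting["place"], "|", setting["locale"], "|", setting["activity"])
```

The value of n can then be used to find the corresponding <recording> element, and each entry in participants to find the matching <person>.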

5.3.5. Text classification

The TEI provides a number of ways in which classification or text-type information may be specified for a text, grouped together within a <textClass> element, which appears once in the header of each text. Classifications may be represented using references to internally defined classifications provided in the <classCode> element (such as the BNC classification scheme described in section 5.2.3. The reference and classification declarations), by reference to some other predefined classification system, or by an open set of keywords. All three methods are used in the BNC, using the following elements:

<catRef>
specifies one or more defined categories within some taxonomy or text typology. Attributes include:

target
identifies all the categories concerned.

<classCode>
contains the code used for this text in an externally-defined classification system: in this release of the BNC, the genre codes defined by David Lee are used.
<keywords>
contains a list of keywords or phrases identifying the topic or nature of a text, each of which is tagged as a <term> element.

A <catRef> element is provided in the header of each text. Its target attribute contains values for each of the classification codes listed in the following table and defined in the corpus header. In each case, the classification code consists of an alphabetic prefix (e.g. alltim) identifying the category (e.g. "date"), followed by a single digit indicating a value for that category. Thus the code alltim1 indicates ‘dated 1960-1974’. The value 0 is always used to indicate missing or unknown values. A list of the values used is given in section 6.4. Text and genre classification codes below.
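
The structure of these codes (alphabetic prefix plus one digit, with 0 meaning missing or unknown) can be unpacked mechanically. The following Python sketch shows one way of doing so; the function names split_code and decode_target are hypothetical, and the sample target string is abbreviated from the example for text EWW given later in this section.

```python
import re

# A classification code: a lower-case alphabetic prefix, then one digit.
CODE_PATTERN = re.compile(r"([a-z]+)([0-9])")


def split_code(code):
    """Split a code such as 'alltim1' into ('alltim', 1)."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a classification code: {code!r}")
    prefix, digit = match.groups()
    return prefix, int(digit)


def decode_target(target):
    """Map each category prefix in a target value to its digit.

    The value 0 marks a missing or unknown classification, so it is
    returned here as None.
    """
    decoded = {}
    for code in target.split():
        prefix, value = split_code(code)
        decoded[prefix] = None if value == 0 else value
    return decoded


decoded = decode_target("alltim3 allava0 alltyp3")
print(decoded)
```

The meaning of each digit still has to be looked up in the tables of section 6.4; the sketch only separates category from value.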

This taxonomy is that originally defined for selection and description of texts during the design of the corpus, as further discussed elsewhere. It is of course possible to classify the texts in many other ways, and no claim is made that this method is universally applicable or even generally useful, though it does serve to identify broadly distinct sub-parts of the corpus for investigation. The reader is also cautioned that, although an attempt has been made in the current edition of the corpus to correct the more egregious classification errors noted in the first edition, unquestionably many errors and inconsistencies remain. In particular, the categories wrilev (perceived level of difficulty) and wrista (estimated circulation size) were incorrectly differentiated during the preparation of the corpus and cannot be relied on.

A <classCode> element is also provided for every text in the corpus. It contains the code assigned to this text in David Lee's genre-based analysis carried out at Lancaster University since publication of the first edition of the BNC.

In the first release of the BNC, most texts were assigned a set of descriptive keywords, tagged within the <keywords> element. These terms were not taken from any particular descriptive thesaurus or closed vocabulary; the words or phrases used are those which seemed useful to the data preparation agency concerned, and are thus often inconsistent or even misleading. They have been retained unchanged in the present version of the BNC, pending a more thorough revision.

In this edition of the BNC, a second set of keywords has been supplied for the majority of written texts. These keywords are also tagged using a <keywords> element, but with a value for the scheme attribute of COPAC, indicating that the terms so tagged are derived from a different source. The source used is a major online library catalogue service (see http://www.copac.ac.uk), from which we have taken the subject keywords provided for each title identifiable as forming part of the BNC. Like other public access catalogue systems, COPAC uses a well-defined controlled list of keywords for its subject indexing, details of which are not further given here.

Here is an example showing how one text (EWW) is classified in each of these three ways:

<textClass default="NO">
  <catRef target="alltim3 allava0 alltyp3 wriaag0 
     wriad0 wriase1 wriaty2 wriaud3 wridom2 wrilev2 
     wrimed1 wripp1 wrisam2 wrista0 writas0"/>
  <classCode scheme="DLee">W ac tech engin</classCode>
  <keywords scheme="COPAC">
    <term>Applied dynamics. Applications of matrices</term>
  </keywords>
  <keywords>
    <term> maths </term>
  </keywords>
</textClass>
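
The three kinds of classification in this example might be pulled apart as in the Python sketch below. It follows the attribute names used in the example itself (scheme on both <classCode> and <keywords>); the function name read_text_class and the label "original", used here for keyword lists that carry no scheme attribute, are hypothetical.

```python
import xml.etree.ElementTree as ET

# The example <textClass> element for text EWW shown above.
TEXT_CLASS_XML = """<textClass default="NO">
  <catRef target="alltim3 allava0 alltyp3 wriaag0
     wriad0 wriase1 wriaty2 wriaud3 wridom2 wrilev2
     wrimed1 wripp1 wrisam2 wrista0 writas0"/>
  <classCode scheme="DLee">W ac tech engin</classCode>
  <keywords scheme="COPAC">
    <term>Applied dynamics. Applications of matrices</term>
  </keywords>
  <keywords>
    <term> maths </term>
  </keywords>
</textClass>"""


def read_text_class(text_class):
    """Hypothetical helper: return (codes, genre, keywords) for one text."""
    cat_ref = text_class.find("catRef")
    # The target value is a white-space separated list of codes.
    codes = cat_ref.get("target", "").split() if cat_ref is not None else []

    genre = text_class.findtext("classCode")
    genre = genre.strip() if genre else None

    keywords = {}
    for group in text_class.findall("keywords"):
        # "original" is a label invented here for lists without a scheme.
        scheme = group.get("scheme", "original")
        terms = [term.text.strip() for term in group.findall("term") if term.text]
        keywords.setdefault(scheme, []).extend(terms)
    return codes, genre, keywords


codes, genre, keywords = read_text_class(ET.fromstring(TEXT_CLASS_XML))
print(len(codes), genre, sorted(keywords))
```

Keeping the two keyword lists apart matters in practice, since only the COPAC terms come from a controlled vocabulary.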

5.4. The revision description

The revision description (<revisionDesc>) element is the fourth and final element in the standard TEI header. In the BNC, it consists of a series of <change> elements, each containing a <date>, a <respStmt>, and a <para> element. A new <change> element was added at the start of the list for each major change made in the text or header during preparation of the BNC World edition, and also during each stage of the XML conversion of the files making up the BNC-baby.

Here is the start of a typical example:

<change>
<date>2003-08-10</date>
<respStmt><resp>ed</resp><name>OUCS</name></respStmt>
<para>Replace all character entities; fix revisionDesc</para>
</change>
<change>
<date>2003-04-12</date>
<respStmt><resp>ed</resp><name>OUCS</name></respStmt>
<para>XML conversion by pretty print stylesheet</para>
</change>
<change>
<date>2000-12-13</date>
<respStmt><resp>ed</resp><name>OUCS</name></respStmt>
<para>Last check for BNC World first release</para>
</change>
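
Because each new <change> is added at the start of the list, the first entry should be the most recent. The Python sketch below reads such a list and checks that ordering. It wraps the three entries above in a <revisionDesc> element, as described in the text; the function name read_changes is hypothetical, and the sketch assumes that every <date> holds a full YYYY-MM-DD value, as the ones shown here do.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The three <change> entries shown above, wrapped in <revisionDesc>.
REVISION_XML = """<revisionDesc>
<change>
<date>2003-08-10</date>
<respStmt><resp>ed</resp><name>OUCS</name></respStmt>
<para>Replace all character entities; fix revisionDesc</para>
</change>
<change>
<date>2003-04-12</date>
<respStmt><resp>ed</resp><name>OUCS</name></respStmt>
<para>XML conversion by pretty print stylesheet</para>
</change>
<change>
<date>2000-12-13</date>
<respStmt><resp>ed</resp><name>OUCS</name></respStmt>
<para>Last check for BNC World first release</para>
</change>
</revisionDesc>"""


def read_changes(revision_desc):
    """Hypothetical helper: list the <change> entries in document order."""
    changes = []
    for change in revision_desc.findall("change"):
        changes.append({
            # Assumption: dates are always given as YYYY-MM-DD.
            "date": date.fromisoformat(change.findtext("date").strip()),
            "resp": change.findtext("respStmt/resp"),
            "name": change.findtext("respStmt/name"),
            "note": change.findtext("para"),
        })
    return changes


changes = read_changes(ET.fromstring(REVISION_XML))
latest = changes[0]
# True when every entry is at least as recent as the one after it.
newest_first = all(a["date"] >= b["date"] for a, b in zip(changes, changes[1:]))
print(latest["date"], latest["note"], newest_first)
```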

When any significant change is made to any component of the corpus, the following steps are taken:


