Basic structure


The original British National Corpus was delivered as an application of ISO 8879, the Standard Generalized Mark-Up Language (SGML). This international standard provides, amongst other things, a method of specifying an application-independent document grammar, in terms of the elements which may appear in a document, their attributes, and the ways in which they may legally be combined. SGML is also a superset of XML, the extensible markup language defined by the World Wide Web Consortium for general use on the World Wide Web.

BNC-baby is delivered in an XML format which is documented in this manual in section 3.1. Markup conventions below; more detailed information about XML itself is readily available in many places.

The original BNC encoding format was strongly influenced by the proposals of the Text Encoding Initiative (TEI). This international research project resulted in the development of a set of comprehensive guidelines for the encoding and interchange of a wide range of electronic texts amongst researchers. An initial report appeared in 1991, and a substantially revised and expanded version in early 1994. A conscious attempt was made to conform to TEI recommendations, where these had already been formulated, but in the first version of the BNC there were a number of differences in tag names and content models. In the second edition of the BNC (BNC World), the tagging scheme was changed to conform as far as possible with the published Recommendations of the TEI, and this has been followed in BNC-baby. Unless otherwise stated, elements in the markup scheme with the same name as those in the published TEI scheme have the same meaning.

Section 3.1. Markup conventions describes the basic structure of BNC-baby, in terms of the XML elements and attributes distinguished and the tags used to mark them. Section 4.1. Written texts describes features which are peculiar to written texts, and section 4.2. Spoken texts those peculiar to spoken texts. In each case, a distinction is made between those elements which are marked up in all texts and those which (for technical or financial reasons) are not always so distinguished, and hence appear in some texts only.

Section 5. The header describes the structure of the <teiHeader> element attached to each component of the corpus, and also to the whole corpus itself. Sections 4.1. Written texts and 4.2. Spoken texts informally describe the elements specific to written and to spoken texts respectively. It should be noted that by no means all of the features described here will be present in every text of the corpus, nor, if present, will they necessarily be tagged. A list of elements actually used in the whole corpus is given below in 6.1. Elements defined by the BNC DTD.

3.1. Markup conventions

BNC-baby is marked up in XML and encoded in Unicode. These formats are now so pervasive as to need little explication here; for the sake of completeness, however, we give a brief summary of their chief characteristics. We strongly recommend the use of XML-aware processing tools to process the corpus.

An XML document, such as BNC-baby, consists of a single root element, within which are nested occurrences of other element types. All element occurrences are delimited by tags. There are two forms of tag: a start-tag, marking the beginning of an element, and an end-tag, marking its end. Tags are delimited by the characters < and >, and contain the name of the element (its gi, for generic identifier), preceded by a solidus (/) in the case of an end-tag.

For example, a heading or title in a written text will be preceded by a tag of the form <head> and followed by a tag in the form </head>. Everything between these two tags is regarded as the content of an element of type <head>.

Attributes applicable to element instances, if present, are also indicated within the start-tag, and take the form of an attribute name, an equals sign and the attribute value, in the form of a quoted literal. Attribute values are used for a variety of purposes, notably to represent the part of speech codes allocated to particular words by the CLAWS tagging scheme.

For example, the <head> element may take an attribute type which categorizes it in some way. A main heading will thus appear with a start tag <head type="main">, and a subheading with a start tag <head type="sub">.

The names of elements and attributes are case-significant. The style adopted throughout BNC-baby is to use lower-case letters for identifiers, unless they are derived from more than one word, in which case the first letter of the second and any subsequent word is capitalized (as in teiHeader or bncDoc).

Unless it is empty, every occurrence of an element must have both a start-tag and an end-tag. Empty elements use a special syntax in which the start-tag and end-tag are combined: for example, the point at which a page break occurs in an original source is marked <pb/> rather than <pb></pb>.

BNC-baby is delivered in UTF-8 encoding: this means that almost all characters in the corpus are represented directly as the appropriate Unicode character. The chief exceptions are the ampersand (&), which is always represented by the special string &amp;, the double quotation mark, which is always represented by the special string &quot;, and the arithmetic less-than sign, which always appears as &lt;. These `named entity references' are a syntactic convention of XML which is followed by this version of the corpus.
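As an illustration of why XML-aware tools are recommended, the following minimal Python sketch shows a standard parser resolving the three entity references automatically, and the standard library re-escaping them on output. The fragment is invented for the purpose and is not taken from the corpus.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Invented fragment (not from the corpus) using the three entity references.
fragment = '<s n="1">Marks &amp; Spencer: &quot;2 &lt; 3&quot;</s>'

# An XML parser replaces each reference with the character it stands for.
decoded = ET.fromstring(fragment).text
print(decoded)

# Going the other way: escape() handles & and < (and >) by default;
# the extra mapping adds the double quotation mark.
encoded = escape(decoded, {'"': "&quot;"})
print(encoded)
```

Software which reads the files as plain text rather than as XML would have to perform this substitution itself.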

Finally, although this is not mandated by either XML or SGML, in the present form of the corpus, tags are never broken across linebreaks. Additionally, an attempt has been made to avoid linebreaks within the content of a single <s> element, so as to simplify processing of the text.

3.2. An example

As an example, here is the opening of text J10 (a novel by Michael Pearce):

  <text decls="CN001 QN000 SN000">
    <body>
      <pb n="5"/>
      <div1 type="u">
        <head><s n="1">
          <w type="NN1" lemma="chapter">CHAPTER </w>
          <w type="CRD" lemma="1">1</w></s>
        </head>
        <p><s n="2"><c type="PUQ">‘</c>
          <w type="CJC" lemma="but">But</w>
          <c type="PUN">,</c>
          <c type="PUQ">’ </c>
          <w type="VVD" lemma="say">said </w>
          <w type="NP0" lemma="owen">Owen</w>
          <c type="PUN">, </c>
          <c type="PUQ">‘</c>
          <w type="AVQ" lemma="where">where </w>
          <w type="VBZ" lemma="be">is </w>
          <w type="AT0" lemma="the">the </w>
          <w type="NN1" lemma="body">body</w>
          <c type="PUN">?</c>
          <c type="PUQ">’</c></s>
        </p>
    ....
 </body></text>

(To aid legibility of the example, line breaks have been introduced after each <w> and <c> element: in the original source, each <s> element is on a single line.)

This example shows the start tag for a <text> element, which bears a decls attribute; inside the <text> is the start of a <body> element, which contains as its first child an empty <pb> element, followed by the start of a <div1> element. As the definitions elsewhere in this manual make clear, this tagging simply indicates the start of a new written text, which in its original source form began on a page numbered 5. The start of the first subdivision of this text is shown in the example, and consists of a heading (marked by a <head> element) and a paragraph (marked by the <p> element).

Each distinct word and punctuation mark in the text, as identified by the CLAWS tagger, has been separately tagged with a <w> or <c> element as appropriate. These elements both bear a type attribute, used to indicate the POS code for the word or punctuation mark; <w> elements also bear a lemma attribute, which is used to indicate the root form of the word. The sequence of words and punctuation marks making up a complete segment is tagged as an <s> element, and bears an n attribute, which supplies its sequence number within the text. A combination of text identifier (the three-character code) and <s> number may be used to reference any part of the corpus: the example above contains J10 1 and J10 2.

This is not, of course, a complete text: in particular, it lacks the TEI header which is prefixed to each text file making up the corpus. Its purpose is to indicate how the corpus is encoded. Any XML aware processing software, including common Web browsers, should be able to operate directly on BNC texts in XML format.
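The following Python sketch suggests one way in which such a fragment might be processed with a standard XML parser. It assumes a well-formed version of the J10 extract above (the elided material is dropped and the open elements are closed), and the function name tokens is our own invention. It lists each word or punctuation mark with its reference, POS code and lemma, and rebuilds the running text of each segment.

```python
import xml.etree.ElementTree as ET

# The J10 extract, closed off so that it is well-formed on its own.
SAMPLE = (
    '<text decls="CN001 QN000 SN000"><body><pb n="5"/><div1 type="u">'
    '<head><s n="1"><w type="NN1" lemma="chapter">CHAPTER </w>'
    '<w type="CRD" lemma="1">1</w></s></head>'
    '<p><s n="2"><c type="PUQ">‘</c><w type="CJC" lemma="but">But</w>'
    '<c type="PUN">,</c><c type="PUQ">’ </c>'
    '<w type="VVD" lemma="say">said </w><w type="NP0" lemma="owen">Owen</w>'
    '<c type="PUN">, </c><c type="PUQ">‘</c>'
    '<w type="AVQ" lemma="where">where </w><w type="VBZ" lemma="be">is </w>'
    '<w type="AT0" lemma="the">the </w><w type="NN1" lemma="body">body</w>'
    '<c type="PUN">?</c><c type="PUQ">’</c></s></p>'
    '</div1></body></text>'
)


def tokens(root, text_id):
    """Yield (reference, tag, form, type, lemma) for every <w> and <c>."""
    for s in root.iter("s"):
        ref = f"{text_id} {s.get('n')}"  # e.g. "J10 2"
        for el in s.iter():              # descendants, in document order
            if el.tag in ("w", "c"):
                # <c> elements carry no lemma, so el.get("lemma") is None
                yield ref, el.tag, el.text, el.get("type"), el.get("lemma")


root = ET.fromstring(SAMPLE)
rows = list(tokens(root, "J10"))
for row in rows:
    print(row)

# Spacing is carried inside the <w> and <c> content, so concatenating the
# text of a segment restores its running form.
sentences = {f"J10 {s.get('n')}": "".join(s.itertext()) for s in root.iter("s")}
print(sentences["J10 2"])
```

Because each <s> element occupies a single line in the distributed files, no whitespace intervenes between the <w> and <c> elements there; a pretty-printed copy such as the one shown above would need its line breaks removed before the text was rebuilt in this way.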

The remainder of this manual describes in more detail the intended semantics for each of the XML elements used in the corpus, with examples of their use. To aid legibility of these examples, the <w> and <c> tags may be omitted in some cases: however, they are present throughout the corpus itself.

3.3. Global attributes

Three global attributes are defined in the TEI scheme, each of which may potentially be specified for any element. In practice their use is limited to certain specific functions, which are discussed at the appropriate place below, but for convenience their use is also summarized here:

id: system-generated identifier of an item, unique within the corpus
n: any name or identifier for an element, not necessarily unique within the corpus
rend: the rendition or appearance of an element.

3.4. Corpus and text elements

BNC-baby contains a large number of text samples, some spoken and some written. Each such sample has some associated descriptive or bibliographic information particular to it, and there is also a large body of descriptive information which applies to the whole corpus.

In XML terms, the corpus consists of a single element, tagged <bnc>. This element contains a single <teiHeader> element, containing metadata which relates to the whole corpus, followed by a sequence of <bncDoc> elements. Each such <bncDoc> element contains its own <teiHeader>, containing metadata relating to that specific text, followed by either a <text> element (for written texts) or an <stext> element (for spoken texts). The last named element is an extension of the TEI scheme, but the others are all standard TEI elements, possibly renamed as permitted by the TEI scheme.

The components of the header are fully documented in section 5. The header.

Note that different elements are used for spoken and written texts because each has a different substructure; this represents a departure from TEI recommended practice.

Both <text> and <stext> elements take the following attributes in addition to the attributes globally available:

org

specifies how the content of the text is organised. Legal values are:

composite: composite content: i.e. no claim is made about the sequence in which elements inferior to this one are to be processed, or their inter-relationships
seq: sequential content: i.e. the elements contained by this one form a logical unit, to be processed in the sequence given

decls

supplies the identifiers of any specific encoding or editorial conventions defined in the corpus header and applicable to this specific text

The org attribute is used to characterize the internal organization of written texts. All demographically collected spoken texts have the same internal organization: each <stext> element collects together all the conversations for a given respondent, each distinct conversation being represented by a <div> element (see further 4.2.1. Basic structure). Since the order of these <div> elements is not significant, the org attribute for such texts always has the value ‘composite’.
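The overall structure just described might be traversed along the following lines. This is a sketch only: the skeleton document is invented, its headers are left empty, and the decls and who values in it are placeholders rather than codes taken from the corpus. The function name describe is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented skeleton: one written and one spoken text, one segment each.
CORPUS = (
    '<bnc><teiHeader/>'
    '<bncDoc><teiHeader/>'
    '<text org="seq" decls="CN001 QN000 SN000"><body><div1 type="u">'
    '<p><s n="1"><w type="ITJ" lemma="hello">Hello</w></s></p>'
    '</div1></body></text></bncDoc>'
    '<bncDoc><teiHeader/>'
    '<stext org="composite" decls="CN001"><div><u who="PS000">'
    '<s n="1"><w type="ITJ" lemma="hello">Hello</w></s>'
    '</u></div></stext></bncDoc>'
    '</bnc>'
)


def describe(bnc_root):
    """Return (mode, org, decls identifiers, segment count) per <bncDoc>."""
    summary = []
    for doc in bnc_root.findall("bncDoc"):
        mode, content = "written", doc.find("text")
        if content is None:
            mode, content = "spoken", doc.find("stext")
        if content is None:
            continue  # neither <text> nor <stext>: skip
        decls = content.get("decls", "").split()  # space-separated identifiers
        n_segments = sum(1 for _ in content.iter("s"))
        summary.append((mode, content.get("org"), decls, n_segments))
    return summary


summary = describe(ET.fromstring(CORPUS))
for entry in summary:
    print(entry)
```

Testing for <text> before <stext> is an arbitrary choice here, since each <bncDoc> contains exactly one of the two.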

3.5. Segments and words

At the lowest level, the corpus consists of <w> (word) and <c> (punctuation) elements, grouped into <s> (segment) elements:

<s>

a segment of spoken or written text as identified by the CLAWS segmentation scheme. The global n attribute is always supplied for <s> elements.

<w>

represents a grammatical (not necessarily orthographic) word. Note that the CLAWS definition of a `word' does not correspond with the conventional orthographic definition. Attributes include:

type: specifies the word class assigned to this form by the CLAWS system.
lemma: specifies the root form of this word.

<c>

represents a punctuation character. Attributes include:

type: specifies the class assigned to this character by the CLAWS system.

A detailed description of the tagging procedures and their application in this and other versions of the BNC is provided by the Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging by Geoffrey Leech and Nicholas Smith, which is distributed with BNC World, and is also available from the BNC web site at ../../docs/bnc2postag_manual.htm. A short list of the POS codes used for the type attribute on <w> and <c> is also provided in section 6.5. Word class codes below.

The <s> element is the basic organizational principle for the whole corpus: every text, spoken or written, may be regarded as an end-to-end sequence of <s> elements, possibly grouped into higher-level constructs, such as paragraphs or utterances.

In most cases, <s> elements will correspond with ordinary orthographic sentences, and <w> elements with conventional orthographic words. However, it should be noted that several common phrases are treated as single <w> elements, typically prepositional phrases such as ‘in spite of’, while some single orthographic forms such as ‘can't’ and possessive forms such as ‘man's’ are decomposed into two <w> elements. Further discussion of these non-orthographic word forms is given in the accompanying Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging by Geoffrey Leech and Nicholas Smith.

Fragmentary sentences such as headings or labels in lists are also encoded as <s> elements, as in the following example:

      <div1 type="u">
        <head type="MAIN">
        <s n="835">Serious fit of giggles</s>
        </head>
        <p><s n="836">A PAIR of TV newsreaders have 
        been suspended from duty after they burst out 
        laughing during the evening news.</s></p>
<!-- CBE -->

Note that in this and subsequent examples the <w> and <c> tags present in the original have been suppressed and extra white space has been introduced in order to aid legibility. A comment has been added to give the three character identifier (CBE in this case) of the text from which the example has been taken.
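The difference between <w> elements and orthographic words noted above can be made concrete with a short Python sketch. The segment below is invented, and its type codes and lemma values are given for illustration only (section 6.5. Word class codes lists the codes actually used); it combines a multi-word unit treated as one <w> with a contracted form split into two.

```python
import xml.etree.ElementTree as ET

# Invented segment; type codes and lemma values are illustrative only.
SEGMENT = (
    '<s n="1"><w type="PRP" lemma="in spite of">In spite of </w>'
    '<w type="DT0" lemma="that">that </w>'
    '<w type="PNP" lemma="he">he </w>'
    '<w type="VM0" lemma="can">ca</w>'
    '<w type="XX0" lemma="not">n\'t </w>'
    '<w type="VVI" lemma="go">go</w>'
    '<c type="PUN">.</c></s>'
)

s = ET.fromstring(SEGMENT)

# Running text of the segment, rebuilt from its element content.
surface = "".join(s.itertext())

# Grammatical words as marked up, against whitespace-delimited forms.
w_forms = [w.text.strip() for w in s.iter("w")]
orthographic = surface.split()

print(surface)
print(len(w_forms), w_forms)
print(len(orthographic), orthographic)
```

Word counts derived from <w> elements will therefore not in general agree with counts obtained by splitting the text on white space.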

3.6. Editorial indications

Editorial changes made to the texts during transcription are recorded using the following elements:

<gap>

marks the spot where some part of the original source text has been omitted for some reason. Attributes include:

desc: brief description of the material omitted e.g. "name and address".
reason: brief explanation for the omission: if specified, this will be either "anonymization", or "sampling strategy".
resp: code identifying the agency responsible for marking up the omission: if specified, this will be either OUCS, or OUP.

The <gap> element is typically used to indicate where words identifying persons or places have been removed during transcription, where labels etc. have been suppressed for ease of processing, or where material has simply not been transcribed because it is inaudible, illegible or not transcribable (e.g. figures, graphs).

<corr>

any editorial correction or regularization, e.g. of material obviously mistranscribed or misspelled, or of variant spellings. Attributes include:

sic: supplies the original form of the word or phrase marked.
resp: code identifying the agency responsible for making the correction.

<sic>

a word or phrase which has not been corrected, but which is in doubt; for example, a spoken word which the transcribers cannot recognise, or a dubious spelling. Attributes include:

corr: supplies a corrected form for the word or phrase marked.
resp: code identifying the agency responsible for making the correction (either OUCS or LONGMAN).

In general, the <corr> element is used wherever a word appears to be misspelled in the source, and the <sic> element where the transcriber is uncertain of the correction, but believes the original to be erroneous. The <sic> element is also used to mark words which are intentionally misspelled, for example to indicate non-standard pronunciation; in such cases, the corr attribute is always used to supply a standard spelling.

Slightly different transcription policies have been followed by different transcribers, and consequently these elements may not appear in all texts. The <editorialDecl> element of the header described in section 5.2.1. Documentary components of the encoding description gives further details of the editorial principles applied across the corpus. The value of the decls attribute for an individual text will indicate which principle or set of principles applies to it. The <tagsDecl> element in each text's header may also be consulted for an indication of the usage of these and other elements within it (see further section 5.2. The encoding description).

Users are cautioned that the corpus contains a significant number of errors, both in transcription and encoding. Every attempt has been made to reduce the incidence of such errors to an acceptable level, using a number of automatic and semi-automatic validation and correction procedures, but exhaustive proof-reading of a corpus of this size was not economically feasible. The corrections indicated by the tags discussed above are included only where errors have been detected, and no claim should be inferred that no other errors remain.

3.6.1. Some examples

In the following example, the first three chapters have been omitted for sampling reasons:

<body>
<div1 n="1" type="u">
      <head><s n="1">Friday 16 September to Tuesday 20
      September</s></head>
      <gap desc="Chapters 1-3 of Book 1" reason="sampling strategy"/>
      <pb n="17"/>
      <div2 n="4" type="u">
      <p><s n="2">Once free of the knotted tentacles of the eastern
      suburbs, Dalgliesh made good time and by three he was driving 
      through Lydsett village.</s>
<!-- C8T -->

In the following example, a proper name has been omitted:

<u who="PS1AD"><s n="2203">Er, you know 
Wayne <gap desc="last or full name" reason="anonymization"/>?</s>
<!-- KBC -->

In the following example, a telephone number has been omitted:

<p><s n="34">They can be contacted on 
Rhyl 35 <gap desc="telephone number"/>.</s></p>
<!-- K3C -->

In the following example, a typographic error in the original has been corrected:

<s n="48">From such a position, deviant behaviour 
could be extremely good or <corr sic="herioc"> heroic </corr>
behaviour, such as risking one's life to help others 
or showing courage over and above the line of duty.</s>
<!-- B17 -->

In the following example, typographic variation in the original has been regularized:

<p><s n="1380">He used the telephone to ring 
his own number and Celia's, on the 
<corr sic="offchance" resp="OUCS"> off chance </corr>
that Dougal had gone to Primrose Hill.</s>
<!-- GUU -->

In the following example, the transcriber has expressed a doubt as to the spelling of the word `tableclothes', but no correction has been made:

<s n="1348">I was petrified because all my sheets and 
<sic> tableclothes </sic>and all my cutlery and silver 
were all stamped with the Embassy crest.</s>
<!-- B17 -->
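
Software reading the corpus must decide how to treat these editorial elements. The following Python sketch shows one possible policy, under our own assumptions: omissions are shown as a bracketed description, and a flag selects between the corrected and the original reading wherever a sic or corr attribute supplies the alternative. The function name render is hypothetical, and the two segments are abridged from the KBC and B17 examples above.

```python
import xml.etree.ElementTree as ET


def render(segment, corrected=True):
    """Return the text of a segment, resolving <gap>, <corr> and <sic>."""
    parts = []

    def walk(node):
        if node.tag == "gap":
            # Nothing was transcribed here: show the description instead.
            parts.append("[" + node.get("desc", "gap") + "]")
        elif node.tag == "corr" and not corrected and node.get("sic") is not None:
            parts.append(node.get("sic"))   # original reading requested
        elif node.tag == "sic" and corrected and node.get("corr") is not None:
            parts.append(node.get("corr"))  # suggested correction requested
        else:
            parts.append(node.text or "")
            for child in node:
                walk(child)
                parts.append(child.tail or "")

    walk(segment)
    # Collapse runs of white space left around the editorial elements.
    return " ".join("".join(parts).split())


gap_example = ET.fromstring(
    '<s n="2203">Er, you know Wayne '
    '<gap desc="last or full name" reason="anonymization"/>?</s>'
)
corr_example = ET.fromstring(
    '<s n="48">could be extremely good or '
    '<corr sic="herioc"> heroic </corr> behaviour</s>'
)

print(render(gap_example))
print(render(corr_example))
print(render(corr_example, corrected=False))
```

A <sic> element with no corr attribute, as in the last example above, is simply passed through with its content unchanged under either setting.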

3.7. Pointers

Parts of a text are normally transcribed in the same order as they appear in the source text. In certain circumstances, however, parts of a text have been moved from the position in which they appear in the source to simplify linguistic processing. There are two common situations where this is necessary:

where a caption or note appears in the middle of a syntactic unit
where speakers overlap

Where re-ordering of the first type has occurred, the moved element is generally re-located to the end of the paragraph or similar element in which it appears. Its original position is recorded using a pointer element (<ptr>), an empty tag whose target attribute supplies the identifier of the relocated element.

This mechanism is also used to represent captions or notes which interrupt the normal reading sequence. By far the commonest use of the <ptr> element, however, is to represent alignment of synchronous speech; see further section 4.2.4. Alignment of overlapping speech.
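
A pointer may be resolved by looking up its target among the identifiers present in the text. The Python sketch below assumes that the relocated element carries the global id attribute described in section 3.3; the sample paragraph, the <note> element name and the identifier N1 are all invented for illustration, and resolve_pointers is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Invented paragraph: a note moved to the end, with a pointer left behind.
SAMPLE = (
    '<p><s n="1">The results <ptr target="N1"/>were clear.</s>'
    '<note id="N1"><s n="2">See table 3.</s></note></p>'
)


def resolve_pointers(root, id_attr="id"):
    """Pair each <ptr> target with the element bearing that identifier.

    If the files in use carry xml:id rather than id, ElementTree exposes it
    as "{http://www.w3.org/XML/1998/namespace}id", which can be passed as
    id_attr instead.
    """
    by_id = {
        el.get(id_attr): el for el in root.iter() if el.get(id_attr) is not None
    }
    # Unresolvable targets map to None rather than raising an error.
    return [
        (ptr.get("target"), by_id.get(ptr.get("target")))
        for ptr in root.iter("ptr")
    ]


root = ET.fromstring(SAMPLE)
resolved = resolve_pointers(root)
for target, element in resolved:
    text = "".join(element.itertext()) if element is not None else None
    print(target, text)
```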