Up: Contents Previous: 1 Design of the corpus Next: 3 Written texts
The original British National Corpus was provided as an application of ISO 8879, the Standard Generalized Mark-Up Language (SGML). This international standard provides, amongst other things, a method of specifying an application-independent document grammar, in terms of the elements which may appear in a document, their attributes, and the ways in which they may legally be combined. SGML was a predecessor of XML, the extensible markup language defined by the World Wide Web Consortium and now in general use on the World Wide Web. XML was originally designed as a means of distributing SGML documents on the web.
This XML edition of the BNC is delivered in an XML format which is documented in this manual in section 2.1 Markup conventions below; more detailed information about XML itself is readily available in many places.1
The original BNC encoding format was also strongly influenced by the proposals of the Text Encoding Initiative (TEI). This international research project resulted in the development of a set of comprehensive guidelines for the encoding and interchange of a wide range of electronic texts amongst researchers. An initial report appeared in 1991, and a substantially revised and expanded version in early 1994. A conscious attempt was made to conform to TEI recommendations, where these had already been formulated, but in the first version of the BNC there were a number of differences in tag names and models. In the second edition of the BNC (BNC World), the tagging scheme was changed to conform as far as possible with the published Recommendations of the TEI ([23]). In the XML edition, this process has continued, and the corpus schema is now supplied in the form of a TEI customization: see further 12 Formal Specification of the BNC XML schema.
The BNC XML edition is marked up in XML and encoded in Unicode. These formats are now so pervasive as to need little explication here; for the sake of completeness however, we give a brief summary of their chief characteristics. We strongly recommend the use of XML-aware processing tools to process the corpus; see further 7 Software for the BNC.
An XML document, such as the BNC, consists of a single root element, within which are nested occurrences of other element types. All element occurrences are delimited by tags. There are two forms of tag: a start-tag, marking the beginning of an element, and an end-tag, marking its end (in the case of ‘empty elements’, the two may be combined; see below). Tags are delimited by the characters < and >, and contain the name of the element (its gi, for generic identifier), preceded by a solidus (/) in the case of an end-tag.
For example, a heading or title in a written text will be preceded by a tag of the form <head> and followed by a tag of the form </head>. Everything between these two tags is regarded as the content of an element of type <head>.
Attributes applicable to element instances, if present, are also indicated within the start-tag, and take the form of an attribute name, an equals sign and the attribute value, in the form of a quoted literal. Attribute values are used for a variety of purposes, notably to represent the part of speech codes allocated to particular words by the CLAWS tagging scheme.
For example, the <head> element may take an attribute type which categorizes it in some way. A main heading will thus appear with a start-tag <head type="MAIN">, and a subheading with a start-tag <head type="SUB">.
The names of elements and attributes are case-significant, as are attribute values. The style adopted throughout the BNC scheme is to use lower-case letters for identifiers, unless they are derived from more than one word, in which case the first letter of the second and any subsequent word is capitalized: examples include <teiHeader> or <particDesc> (for ‘participant description’).
Unless it is empty, every occurrence of an element must have both a start-tag and an end-tag. Empty elements may use a special syntax in which the start-tag and end-tag are combined: for example, the point at which a page break occurs in an original source is marked <pb/> rather than <pb></pb>.
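The conventions above (start-tags and end-tags, quoted attribute values, case-significant names, and the combined form for empty elements) can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the fragment is invented for illustration and is not taken from the corpus, and the n attribute shown on <pb> is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration only (not corpus data).
# The n attribute on <pb> is an assumption.
fragment = '<div><pb n="5"/><head type="MAIN">Chapter 1</head></div>'
root = ET.fromstring(fragment)

head = root.find("head")
print(head.tag)          # the element name (generic identifier)
print(head.get("type"))  # the attribute value given in the start-tag
print(head.text)         # the content between start-tag and end-tag

# Names are case-significant: there is no element called HEAD here.
upper_case_match = root.find("HEAD")
print(upper_case_match is None)

# The combined form <pb/> and the long form <pb></pb> parse identically:
# no text content and no child elements.
short_form = ET.fromstring("<pb/>")
long_form = ET.fromstring("<pb></pb>")
both_empty = (
    short_form.text is None
    and long_form.text is None
    and len(short_form) == 0
    and len(long_form) == 0
)
print(both_empty)
```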
The BNC is delivered in UTF-8 encoding: this means that almost all characters in the corpus are represented directly by the appropriate Unicode character. The chief exceptions are the ampersand (&), which is always represented by the special string &amp;, the double quotation mark ("), which is sometimes represented by the special string &quot;, and the arithmetic less-than sign (<), which always appears as &lt;. These ‘named entity references’ use a syntactic convention of XML which is followed by this version of the corpus. All other characters, including accented letters such as é or special characters such as —, are represented directly.
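An XML-aware parser resolves these entity references automatically, so application code sees the plain characters. The sketch below, again using Python's standard library, parses an invented UTF-8 string (not corpus data) and shows the reverse step, escaping, that is needed when writing text back out as XML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical UTF-8 encoded fragment for illustration only (not corpus data).
raw = '<s>Fish &amp; chips, 3 &lt; 5, &quot;quoted&quot;, café</s>'.encode("utf-8")

elem = ET.fromstring(raw)
decoded = elem.text
print(decoded)  # entity references are replaced by the characters they name

# When producing XML, the ampersand and less-than sign must be escaped again.
escaped = escape("3 < 5 & 6")
print(escaped)
```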
In the original files, a new line appears only at the start of each <s> element, rather than (as here) at the start of each element. The original files also lack the extra white space at the start of each line, used in the above example to indicate how the XML elements nest within one another.
The example begins with the start-tag for a <wtext> (written text) element, which bears a type attribute, the value of which is FICTION, the code used for texts derived from published fiction. The start-tag is followed by an empty <pb> element, which provides the page number in the original source text. This in turn is followed by the start of a <div> element, which contains the first subdivision (chapter) of this text. This first chapter begins with a heading (marked by a <head> element) followed by a paragraph (marked by the <p> element). Further details and examples are provided for all of these elements and their functions elsewhere in this documentation.
Each distinct word and punctuation mark in the text, as identified by the CLAWS tagger, has been separately tagged with a <w> or <c> element as appropriate. These elements both bear a c5 attribute, which indicates the code from the CLAWS C5 tagset allocated to that word by the CLAWS POS-tagger; <w> elements also bear a pos attribute, which provides a less fine-grained part of speech classification for the word, and an hw attribute, which indicates the root form of the word. For example, the word ‘said’ in this example has the CLAWS 5 code VVD, the simplified POS tag VERB, and the headword say. The sequence of words and punctuation marks making up a complete segment is tagged as an <s> element, and bears an n attribute, which supplies its sequence number within the text. A combination of text identifier (the three letter code) and <s> number may be used to reference any part of the corpus: the example above contains J10 1 and J10 2.
This is not, of course, a complete text: in particular, it lacks the TEI header which is prefixed to each text file making up the corpus. Its purpose is to indicate how the corpus is encoded. Any XML aware processing software, including common Web browsers, should be able to operate directly on BNC texts in XML format.
The remainder of this manual describes in more detail the intended semantics for each of the XML elements used in the corpus, with examples of their use.
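As an illustration of how XML-aware software can read the word-level markup described in this section, the following minimal sketch lists the attributes of each <w> and <c> element in a segment. Only the values for ‘said’ (VVD, VERB, say) come from this manual; the rest of the fragment, including the other codes, is invented for illustration, and token_table is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the attribute values for 'said' are taken
# from the manual; the remaining tokens and codes are illustrative.
fragment = (
    '<s n="2">'
    '<w c5="VVD" hw="say" pos="VERB">said </w>'
    '<w c5="NP0" hw="owen" pos="SUBST">Owen</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)

def token_table(s_elem):
    """Return (tag, text, c5, pos, hw) for every <w> and <c> in a segment."""
    rows = []
    for tok in s_elem.iter():
        if tok.tag in ("w", "c"):
            text = (tok.text or "").strip()
            # <c> elements carry no pos or hw attribute, so get() gives None.
            rows.append((tok.tag, text, tok.get("c5"), tok.get("pos"), tok.get("hw")))
    return rows

segment = ET.fromstring(fragment)
rows = token_table(segment)
for row in rows:
    print(row)
```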
The BNC contains a large number of text samples, some spoken and some written. Each such sample has some associated descriptive or bibliographic information particular to it, and there is also a large body of descriptive information which applies to the whole corpus.
In XML terms, the corpus consists of a single element, tagged <bnc>. This element contains a single <teiHeader> element, containing metadata which relates to the whole corpus, followed by a sequence of <bncDoc> elements. Each such <bncDoc> element contains its own <teiHeader>, containing metadata relating to that specific text, followed by either a <wtext> element (for written texts) or an <stext> element (for spoken texts).
The components of the TEI header are fully documented in section 5 The header.
Note that different elements are used for spoken and written texts because each has a different substructure; this represents a departure from TEI recommended practice.
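The nesting just described can be walked with a few lines of code. The sketch below is one possible way to do so with Python's standard library, not a prescribed method: it builds a tiny invented <bnc> document (the identifiers AAA and BBB are placeholders, not real texts) and reports, for each <bncDoc>, whether it holds a written or a spoken text. The function name iter_texts is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature corpus; AAA and BBB are placeholder identifiers.
sample = (
    '<bnc><teiHeader/>'
    '<bncDoc xml:id="AAA"><teiHeader/><wtext type="FICTION"/></bncDoc>'
    '<bncDoc xml:id="BBB"><teiHeader/><stext/></bncDoc>'
    '</bnc>'
)

def iter_texts(bnc_root):
    """Yield (text id, kind, header, body) for each <bncDoc> in the corpus."""
    for doc in bnc_root.findall("bncDoc"):
        header = doc.find("teiHeader")
        body = doc.find("wtext")
        kind = "written"
        if body is None:
            body = doc.find("stext")
            kind = "spoken"
        yield doc.get(XML_ID), kind, header, body

corpus = ET.fromstring(sample)
summary = [(text_id, kind) for text_id, kind, _, _ in iter_texts(corpus)]
print(summary)
```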
The <s> element is the basic organizational principle for the whole corpus: every text, spoken or written, is represented as a sequence of <s> elements, possibly grouped into higher-level constructs, such as paragraphs or utterances. Each <s> element in turn contains <w> or <c> elements representing words and punctuation marks.
The n attribute is used to provide a sequential number for the <s> element to which it is attached. To identify any part of the corpus uniquely, therefore, all that is needed is the three-character text identifier (given as the value of the attribute xml:id on the <bncDoc> containing the text), followed by the value of the n attribute of the <s> element containing the passage to be identified.
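A reference of this kind can be assembled mechanically. The following sketch assumes Python's standard xml.etree.ElementTree, which exposes xml:id under the XML namespace; the <bncDoc> shown is a cut-down invented stand-in for text J10, and s_references is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, heavily abridged stand-in for a <bncDoc>; not corpus data.
sample = (
    '<bncDoc xml:id="J10"><teiHeader/><wtext type="FICTION"><div>'
    '<head><s n="1"><w c5="NN1" hw="chapter" pos="SUBST">CHAPTER </w></s></head>'
    '<p><s n="2"><w c5="VVD" hw="say" pos="VERB">said </w></s></p>'
    '</div></wtext></bncDoc>'
)

def s_references(bnc_doc):
    """Pair each <s> element with a 'text-id n' reference string."""
    text_id = bnc_doc.get(XML_ID)
    return [(f"{text_id} {s.get('n')}", s) for s in bnc_doc.iter("s")]

doc = ET.fromstring(sample)
refs = [ref for ref, _ in s_references(doc)]
print(refs)
```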
These numbers are, as far as possible, preserved across versions of the corpus, to facilitate referencing. This implies that the sequence numbering may have gaps, where duplicate sequences or segmentation errors have been identified and removed from the corpus. In a few (about 700) cases, sequences formerly regarded as a single <s> have subsequently been split into two or more <s> units. For compatibility with previous versions of the corpus, the same number is retained for each new <s>, but it is suffixed by a fragment number. For example, in text A18, the <s> formerly numbered 1307 has now been replaced by two <s> elements, numbered 1307_1 and 1307_2 respectively.
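Because of these suffixes, the value of n cannot always be treated as a plain integer. The sketch below shows one way software might order such values and recover the number used in earlier versions; it assumes that every n value is either a whole number or a whole number followed by an underscore and a fragment number, as in the A18 example, and both function names are hypothetical.

```python
def s_sort_key(n):
    """Split an n value such as '1307_2' into (1307, 2); plain '1308' gives (1308, 0)."""
    base, _, fragment = n.partition("_")
    return int(base), int(fragment) if fragment else 0

def original_number(n):
    """Return the sequence number the <s> carried before any split."""
    return s_sort_key(n)[0]

# Values in the style of the A18 example, deliberately out of order,
# with a gap in the numbering (1309 is absent).
values = ["1308", "1307_2", "1310", "1306", "1307_1"]
ordered = sorted(values, key=s_sort_key)
print(ordered)
print(original_number("1307_2"))
```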
<s> elements, as in the following example from text CBE:
<w> (word) and <c> (punctuation) elements, grouped into <s> (segment) elements. Each <w> element carries three attributes to indicate its morphological class or part of speech, as determined by the CLAWS tagger, a simplified form of that POS code, and an automatically-derived root form or lemma. Each <c> element also carries a code for part of speech, but not for lemma. For example, the word ‘corpora’ wherever it appears in the BNC is presented like this:
<w> tag, as in the previous example. White space is not added if no space is present in the source, as in the following example:
Each <w> element encloses a single token as identified by the CLAWS tagger. Usually this will correspond with a word as conventionally spelled; there are, however, two important exceptions. Firstly, CLAWS regards certain common abbreviated or enclitic forms such as ‘'s’ in ‘he's’ or ‘dog's’ as distinct tokens, thus enabling it to distinguish them as being an auxiliary verb in the first case, and a genitive marker in the second. For example, ‘It's’ is encoded as follows:
<w> elements in earlier versions of the corpus; in the present version however a new element <mw> (for multiword) has been introduced to mark them explicitly. The individual components of an <mw> sequence are also tagged as <w> elements in the same way as elsewhere. Thus, the phrase ‘in terms of’, which in earlier editions of the BNC would have been encoded as a single <w> element, is now encoded as follows:
Element | Attribute | Description
<s> | n | sequence number.
<w> | pos | supplies a simplified part-of-speech code.
<w> | c5 | supplies the CLAWS 5 code associated with this word.
<w> | hw | specifies the headword under which this lexical unit is conventionally grouped, where known.
<c> | c5 | the CLAWS 5 code associated with this punctuation mark.
<mw> | c5 | supplies the CLAWS 5 code associated with this word.
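The following sketch illustrates how the <w>, <c> and <mw> elements might be handled together in software. Everything in the fragment is an assumption made for illustration: the sentence is invented, the attribute values are plausible rather than verified, and trailing white space is assumed to be stored inside the content of each <w> element. The helper names surface_text and lexical_units are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical segment (not corpus data); all codes are illustrative.
# Trailing white space is assumed to sit inside the <w> content.
fragment = (
    '<s n="1">'
    '<w c5="PNP" hw="it" pos="PRON">It</w>'
    '<w c5="VBZ" hw="be" pos="VERB">\'s </w>'
    '<w c5="AJ0" hw="clear" pos="ADJ">clear </w>'
    '<mw c5="PRP">'
    '<w c5="PRP" hw="in" pos="PREP">in </w>'
    '<w c5="NN2" hw="term" pos="SUBST">terms </w>'
    '<w c5="PRF" hw="of" pos="PREP">of </w>'
    '</mw>'
    '<w c5="NN1" hw="cost" pos="SUBST">cost</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)

def surface_text(s_elem):
    """Rebuild the running text by concatenating every <w> and <c>, at any depth."""
    return "".join(el.text or "" for el in s_elem.iter() if el.tag in ("w", "c"))

def lexical_units(s_elem):
    """List the segment's units, treating each <mw> sequence as one unit."""
    units = []
    for child in s_elem:
        if child.tag == "mw":
            joined = "".join(w.text or "" for w in child.iter("w"))
            units.append((joined.strip(), child.get("c5")))
        elif child.tag in ("w", "c"):
            units.append(((child.text or "").strip(), child.get("c5")))
    return units

segment = ET.fromstring(fragment)
text = surface_text(segment)
units = lexical_units(segment)
print(text)
print(units)
```

Keeping the component <w> elements inside <mw> means that both views are available: word-by-word processing can ignore the wrapper, while phrase-level processing can read the single code on the <mw> element.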
Despite the best efforts of its creators, any corpus as large as the BNC will inevitably contain many errors, both in transcription and encoding. Every attempt has been made to reduce the incidence of such errors to an acceptable level, using a number of automatic and semi-automatic validation and correction procedures, but exhaustive proof-reading of a corpus of this size remains economically infeasible.
Editorial interventions in the marked-up texts take three forms. On a few occasions, where markup or commentary introduced by transcribers during the process of creating the corpus may be helpful to subsequent users, it has been retained in the form of an XML comment. On some occasions, encoders have decided to correct material evidently wrong in their copy text: such corrections are marked using the <corr> element. And on several occasions, sampling, anonymization or other concerns have led to the omission of significant parts of the original source; such omissions are marked by means of the <gap> element.
The transcription and editorial policies defined for the corpus may not have been applied uniformly by different transcribers and consequently the usage of these elements is not consistent across all texts. The <tagsDecl> element in each text's header may be consulted for an indication of the usage of these and other elements within it (see further section 5.2 The encoding description). Their absence should not be taken to imply that the text is either complete or perfectly transcribed.
Element | Attribute | Description
<gap> | desc | briefly describes the material which has been omitted.
<gap> | reason | gives further details of the reason for omission.
<gap> | resp | indicates the agency responsible for the intervention or interpretation, for example an editor or transcriber.
<corr> | sic | contains verbatim text which has been corrected, or an empty string if the correction consists of an addition.
<corr> | rend | a code briefly characterising the way the element content was originally presented.
<corr> | resp | a code identifying the agency responsible for making the correction.
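Software that needs to know where a text has been corrected or cut can collect these elements directly. The sketch below is illustrative only: the fragment is invented, the attribute values are placeholders, and the way <corr> and <gap> are positioned relative to <w> elements here is an assumption rather than a statement about the corpus. The function name editorial_interventions is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment (not corpus data); attribute values are placeholders
# and the placement of <corr> and <gap> among the tokens is an assumption.
fragment = (
    '<p><s n="1">'
    '<w c5="PNP" hw="we" pos="PRON">We </w>'
    '<corr sic="recieved">received </corr>'
    '<w c5="AT0" hw="a" pos="ART">a </w>'
    '<w c5="NN1" hw="letter" pos="SUBST">letter </w>'
    '<gap desc="name" reason="anonymization"/>'
    '<c c5="PUN">.</c>'
    '</s></p>'
)

def editorial_interventions(root):
    """Collect corrections as (original, corrected) and gaps as (desc, reason)."""
    corrections = [
        (c.get("sic"), "".join(c.itertext()).strip()) for c in root.iter("corr")
    ]
    gaps = [(g.get("desc"), g.get("reason")) for g in root.iter("gap")]
    return corrections, gaps

paragraph = ET.fromstring(fragment)
corrections, gaps = editorial_interventions(paragraph)
print(corrections)
print(gaps)
```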
The <sic> element used in preceding editions of the BNC is no longer used.