Practical Considerations in the Use of TEI headers in a Large Corpus
BNCX35
Dominic Dunlop
Oxford University Computing Services
Submission for CHum
Draft of 17 June, 1993
Abstract
As the first large corpus developed using mark-up conforming to the
guidelines of the Text Encoding Initiative (TEI), the British National
Corpus is a test-bed for many of the mechanisms that those guidelines
describe. This is particularly true in the case of the TEI header,
which has three intended applications -- to describe a corpus, to
describe an individual text, and as a free-standing bibliographic
record -- all of them used by the corpus.
This paper describes the application of the TEI header to the British
National Corpus. It is intended that this information should, through a
description of experience on a practical project, serve as a guide for
those wishing to use TEI headers in the documentation and management of
other corpora and collections of texts.
Biographical statement
Dominic Dunlop is project manager for the British National Corpus at
Oxford University Computing Services. Prior to assuming this position, he
worked in a variety of positions related to development and support of
the UNIX operating system, and was active in the POSIX initiative for the
standardization of UNIX.
Keywords
bibliographic records, electronic texts, electronic title page, large
corpora, Standard Generalized Markup Language, Text Encoding
Initiative
1. Introduction
The British National Corpus (BNC) project is currently constructing a 100
million word corpus of modern British English for use in linguistic
research. It is a collaborative, pre-competitive initiative carried out
by Oxford University Press (OUP), Longman Group UK Ltd., W R Chambers,
Lancaster University's Unit for Computer Research in the English Language
(UCREL), Oxford University Computing Services (OUCS), and the British
Library. The project receives funding from the UK Department of Trade and
Industry and the Science and Engineering Research Council within their
Joint Framework for Information Technology, and from the British Library.
The constitution of the BNC is described in [Burnage, Dunlop, 1993]. The
same paper provides examples of the application of TEI-conformant mark-up
to written and spoken texts, and discusses the processing steps necessary
to transform a source text into a form which may be included in the
corpus. These issues are not discussed further here, except where they
bear on the contents of text and corpus headers.
The Text Encoding Initiative (TEI) header, described in [Giordano, 1993],
is required in all TEI-conformant texts. While its default structure is
adequate to the needs of many applications, there are two groups of
optional additions or considerations. These cover the needs of language
corpora, and of the electronic exchange of bibliographic information.
Both of these additions are required to meet the needs of the British
National Corpus.
The overall SGML [Goldfarb, 1990] structure of the header and its
immediate descendants are shown in figure 1. Sections 4-7 of this paper
discuss each part of the header as it applies to the corpus header and to
three example text headers. Section 8 briefly describes applications of
free-standing headers in the British National Corpus.
1.1. Notes on the figures
The figures in this paper consist of fragments of a preliminary version
of the BNC corpus header and three example text headers. Their contents
are subject to change prior to publication of the corpus. Indentation is
used to show structure, and, with the exception of
, end-tags are
shown, even where their omission is allowed. Ellipses (...) show where
material has been omitted from some of the figures for reasons of
brevity. Explanatory text is set in italic type; it is not part of the
headers.
2. General Issues
2.1. Portability
[Burnard, Sperberg-McQueen, 1993] allows conformant materials to exist in
two forms: a local storage format and an interchange format. The former
will typically use a richer character set than the latter, and may use
SGML options such as end-tag omission [Goldfarb, 1990]. In the interests
of portability across networks and between computer architectures, an
interchange format often uses a restricted character repertoire, and,
making few assumptions about the capabilities of processing software,
does not use end-tag omission or other SGML options. The promotion of
portability in this manner makes for larger files, and may compromise
intelligibility to humans.
Because a primary goal of the BNC is that it should easily be
interchangeable between archive and user sites, no local storage format
has been specified: the corpus exists only in an interchange format which
uses the character repertoire of the International Reference Version of
[ISO, 1991]. All examples given in this paper are presented in this form.
Those building corpora for which interchange is a less important
consideration may wish to specify a richer, more compact, local storage
format.
2.2. Language
The language of the British National Corpus is modern British English. As
no other language is explicitly accommodated, and because the corpus does
not provide information about, for example, prosody, which might require
a repertoire of special characters, a single writing system declaration
[Burnard, Sperberg-McQueen, 1993] suffices. This declaration is
comparatively simple, as differences between the character repertoire of
the interchange format (see previous subsection) and that of the source
text for the corpus can, for the most part, be handled by means of SGML
public entity sets [Goldfarb, 1990]. The BNC writing system declaration
is not discussed further here.
Compilers of corpora of source texts written in languages requiring a
richer character repertoire than English, compilers of multi-lingual
corpora, and those using special marks for annotation will need to devote
greater effort to the writing system and language usage declarations (see
4.3) than has been necessary for the BNC.
3. Issues Specific to Large or Modern Corpora
3.1. Accommodating practical SGML processors and computing platforms
Conceptually, the BNC is a single, SGML-compliant, document of around two
gigabytes in extent. As such, it should be possible to submit it as a
whole to SGML-aware software in order to accomplish user-specified
processing. In practice, current SGML-aware software and computing
hardware cannot handle the corpus as if it were a single document.
Further, [ISO, 1986] specifies that conforming software must meet or
exceed capacity limits which are very low: in some cases limits are
unavoidably exceeded by individual texts in the BNC. Consequently, the
BNC must be decomposable into manageable pieces in order to be processed
at all, and steps must be taken to ensure that capacity limit
requirements placed on processing software are not gratuitously inflated.
The likely needs of users of the corpus suggest that such a manageable
piece should consist of the corpus header, other global structures such
as the writing system declaration (see 2.2), and a sub-corpus selected
according to arbitrary user criteria from the totality of texts in the
BNC. Such a scheme implies that no text in the corpus may require the
presence of any other text (for example because of a cross-reference
using the TEI element) in order that it can be processed: such
requirements would constrain the composition of user-specified
sub-corpora. Instead, a text may reference only information in its own
header or in the corpus header.
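
The constraint can be checked mechanically. The following is a minimal sketch in Python, not part of the BNC tooling: it assumes that the identifiers referenced by a text, those defined in the text, and those defined in the corpus header have already been extracted. The function name and the sample identifiers are hypothetical.

```python
def unresolved_references(references, own_ids, corpus_header_ids):
    """Return, sorted, the referenced identifiers that are defined
    neither in the text itself nor in the corpus header."""
    resolvable = set(own_ids) | set(corpus_header_ids)
    return sorted(set(references) - resolvable)


# Hypothetical identifiers, for illustration only.
corpus_header_ids = {"SAMP1", "EDIT1"}
own_ids = {"ABCBIB01"}
references = ["SAMP1", "ABCBIB01", "XYZBIB01"]

# "XYZBIB01" belongs to some other text, so this text could not be
# processed in a sub-corpus that omits that other text.
print(unresolved_references(references, own_ids, corpus_header_ids))
```

A text for which the result is empty can be placed in any user-specified sub-corpus alongside the corpus header.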
While [Burnard, Sperberg-McQueen, 1993] requires a greater value for the
NAMELEN limit -- thirty-two -- than the minimum specified in [ISO, 1986]
-- eight, the BNC has elected to stay with the lower limit. This has two
consequences. Firstly, those TEI element and attribute names longer than
eight characters are renamed in the BNC. [Burnard, Sperberg-McQueen,
1993] describes how this may be done. This paper uses TEI names for
attributes and elements, rather than the shortened versions. Secondly,
the identifiers through which one SGML element may reference another are
limited in length to eight characters, and so are necessarily terse. The
scheme used is described in the next subsection.
3.2. Ensuring the uniqueness of identifiers
Identifiers are heavily used in the BNC, and are likely to be so in any
large corpus. As stated in the previous subsection, they are limited in
length to eight characters. Rather than pick names at random, a
name-generation scheme is used. Identifiers for corpus header elements
have two fields encoding the element name and its instance; identifiers
in a text prepend a third field unique to that text. Table I describes
the scheme.
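
By way of illustration only, a generator for identifiers of this general shape might be sketched in Python as follows. The field widths used here (a three-character text field, a short element code, and a three-digit instance number) are assumptions made for the sketch; they are not the layout given in Table I.

```python
NAMELEN = 8  # identifier length limit retained by the BNC


def make_id(element_code, instance, text_code=""):
    """Build an identifier from an optional text field, an element field
    and an instance field.

    The field widths are assumptions made for this sketch, not the
    layout specified in Table I.
    """
    identifier = f"{text_code}{element_code}{instance:03d}"
    if len(identifier) > NAMELEN:
        raise ValueError(f"{identifier!r} exceeds {NAMELEN} characters")
    return identifier


print(make_id("SD", 1))         # element in the corpus header: SD001
print(make_id("SD", 1, "ABC"))  # same element type inside a text: ABCSD001
```

Because the text field is unique to a text, two texts can use the same element code and instance number without their identifiers colliding.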
3.3. Minimizing file size
Some aspects of the interchange format of [Burnard, Sperberg-McQueen,
1993] are likely to increase file size. This is unfortunate since, other
things being equal, it is easier to transport a smaller file than a
larger over a network or on some physical medium.
Happily, SGML provides a means by which files can be shortened through
the replacement of repeated sequences of text by entity references which
expand when processed to the replaced sequence. As subsequent sections
show, several parts of a typical text header consist of "boilerplate"
which may conveniently be replaced with entity references. This paper
points out such situations as they occur, but the figures show the full
text in such cases, rather than abbreviating it to an entity reference.
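
The effect of such substitution can be sketched in a few lines of Python. This is a simplified illustration, not an SGML parser: it performs a single, non-recursive pass, requires the closing semicolon, and leaves unknown references untouched. The entity name and replacement text are invented for the example.

```python
import re

# An entity reference: "&", a name, then ";".
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_entities(text, entities):
    """Replace each known entity reference in text with its replacement
    text; references to unknown entities are left as they are."""
    def substitute(match):
        name = match.group(1)
        return entities.get(name, match.group(0))
    return ENTITY_REF.sub(substitute, text)


# Hypothetical entity, for illustration only.
entities = {"boiler": "text shared by many headers"}
print(expand_entities("Start &boiler; end &unknown;", entities))
# prints: Start text shared by many headers end &unknown;
```

The saving grows with the number of headers sharing the same replacement text, since each header carries only the short reference.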
A second means of reducing file size is to ensure that text headers
contain a minimum of redundant information. This means that, where
identical information would otherwise be repeated in a number of text
headers, the information is instead moved into the corpus header and
simply referenced by the text headers concerned. Again, the following
sections draw the reader's attention to situations in which this
mechanism is used.
3.4. Copyright
All the material in the BNC is subject to copyright considerations. After
a text has been selected for inclusion in the corpus, but before it is
converted to electronic form, considerable effort is put into tracing
copyright holders and obtaining permission for a number of types of
world-wide use.
In order that users of the corpus are reminded of the responsibilities
set out in [BNC, 1992b], copyright information is included in the header
of each text and of the corpus itself. The information appears as plain
text so as to be human-readable, in spite of the duplication this
engenders. (See also 5.1.)
4. The Corpus Header
4.1. The file description
The TEI file description has two main functions: firstly, it serves as a
bibliographic record for an electronic text as a work in its own right; and
secondly, it describes the source text or texts from which an electronic
text was derived. The second function is not applicable to a corpus as a
whole, unless it has appeared elsewhere in alternative electronic forms.
This is not the case for the BNC, so the relevant part of the corpus
header -- the source description, shown in figure 2 -- consists simply of
a statement that the corpus as a whole has no source text.
The source description is the final element of the file description. The
elements which precede it are applicable to the corpus header. The first of these
is the title statement, shown in figure 3. It gives the name of the
corpus, and lists those responsible for its intellectual content. It is
followed by the edition statement, which gives the release identifier of
the corpus. This will change as the corpus is revised, with the initial
published version being release 1.0. The figure also shows the extent
statement, which describes the size of the corpus.
The final element of the bibliographic record provided by the file
description is the publication statement, shown in figure 4. Since no
organization can strictly be identified as the publisher, distributor or
release authority of the corpus, these TEI-suggested elements have been
discarded in favour of using the general-purpose element to
identify an archive site. The publication statement also contains
availability information. The description shown is an example only:
ultimate constraints on availability may differ.
4.2. The encoding description
The encoding description describes the manner in which a text or corpus
has been rendered into electronic form. The description in the corpus
header may be used simply to give those parts of the description which
apply to all texts in the corpus. If it does only this, text-specific
information must appear in the headers of affected texts, with the result
that it will be duplicated if particular encoding practices apply to
several texts, but not to every text in the corpus. This duplication may
be eliminated by moving descriptions of such practices to the corpus
header, and using the TEI's declarable element mechanism to associate
particular practices with particular texts. This methodology is heavily
used in the BNC, leaving only encoding information specific to single
texts to appear in specific texts' headers.
The first element of the encoding description is the project description.
Shown in figure 5, it gives a prose description of the British National
Corpus project.
A single project description covers the corpus and all texts in it. This
is not the case with the sampling declarations which follow: there are
five of these, covering, respectively, books longer than 40,000 words;
books shorter than 40,000 words; written material from sources other than
books; demographically-sampled spoken material; and context-governed
spoken material. Figure 6 shows them all. Two features are of note.
Firstly, each declaration has an ID attribute in order that it may be
referenced from those texts to which it applies; secondly, descriptions
which apply to more than one sampling method are repeated in each
declaration to which they apply. (The limit on the number of words from
any one author is a case in point.) This duplication is necessary because
each text must refer to exactly one sampling declaration; it is not
possible to factor common parts of two declarations into a third
declaration, and then point from each text to two declarations.
Duplication of this type occurs at several points in corpus and text
headers, and, while it may be shown in the figures, will not generally
subsequently be mentioned in the text of this paper.
[Burnard, Sperberg-McQueen, 1993] specifies a defaulting mechanism for
declarable elements. This allows texts described by a default declaration
to omit an explicit reference to that declaration. Although each sampling
declaration applies to roughly equal numbers of texts, a default has been
specified because BNC policy is that defaults should always be provided.
(Designers of other corpora may wish to pursue a different policy.)
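
The interaction between defaults and explicit references can be sketched as follows. This Python fragment is an illustration under stated assumptions rather than a description of any TEI-aware processor: each declaration in the corpus header is modelled as a dictionary with an identifier, a kind, and a default flag, and all identifiers and kinds shown are hypothetical.

```python
def applicable_declarations(header_declarations, explicit_refs=()):
    """Map each kind of declaration to the identifier that applies to
    one text, given the identifiers that the text references explicitly."""
    by_id = {decl["id"]: decl for decl in header_declarations}
    # Start from the defaults supplied in the corpus header.
    chosen = {decl["kind"]: decl["id"]
              for decl in header_declarations if decl["default"]}
    # An explicit reference replaces the default for that kind.
    for ref in explicit_refs:
        chosen[by_id[ref]["kind"]] = ref
    return chosen


# Hypothetical identifiers and kinds, for illustration only.
header = [
    {"id": "SAMP1", "kind": "sampling", "default": True},
    {"id": "SAMP2", "kind": "sampling", "default": False},
    {"id": "HYPH1", "kind": "hyphenation", "default": True},
]
print(applicable_declarations(header))             # defaults only
print(applicable_declarations(header, ["SAMP2"]))  # one explicit reference
```

A text described entirely by the defaults therefore needs no explicit references at all, which is the saving the defaulting mechanism is intended to provide.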
Figure 7 shows portions of the single editorial declaration in the BNC
corpus header. [Burnard, Sperberg-McQueen, 1993] allows multiple
declarations. This is useful in the case where a corpus contains a number
of distinct text types, and uniform editing conventions have been applied
to all texts of a particular type. In such a case, a text need only
reference the single declaration which applies. A single declaration
describing a variety of editorial conventions is more appropriate to the
BNC, in which different practices apply to different texts of the same
type. For example, written texts obtained from existing electronic
archives are treated in a different manner from those captured
specifically for the BNC.
A consequence of the use of a single editorial declaration is that each
text must reference exactly one of each type of child element of
editorialDecl. This reference may be implicit if there is only one such
element, as is the case with in the figure; or if a default
declaration is supplied and is applicable. In other cases -- ,
for example -- explicit references are required. The example texts
(sections 5-7) show these references. They also show that, while
[Burnard, Sperberg-McQueen, 1993] gives editorial declarations in an
individual text's header higher precedence than those in the corpus
header, no BNC text has such declarations in its header; the declarations
in the corpus header always apply.
The encoding description continues with a reference declaration, shown in
figure 8. The BNC reference scheme provides for a unique reference to any
segment in the corpus. (A segment is broadly analogous to an orthographic
sentence. All parts of each text in the corpus, whether conventionally
sentential or not, are divided into segment elements.) These references
should be used when material in the corpus is cited. The reference
declaration specifies an algorithm for constructing such references by
concatenating a hyphen separator and a five-character segment name with a
six-character text name using the stepwise method described by [Burnard,
Sperberg-McQueen, 1993]. (Segment names consist of five digits, and are
unique within a particular text; text names consist of six alphanumeric
characters, and are unique across the corpus.) Sample references to
segments in the example texts discussed in sections 5-7 are Wingss-01009,
GaWldA-00035 and 026211-0005. The reference declaration for the released
BNC may have an extra element in order that a particular revision of the
corpus may be specified.
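
The construction of a reference from its two parts can be illustrated with a short Python sketch. It follows the description above (a six-character alphanumeric text name, a hyphen, and a five-digit segment name); the function name and the validation it performs are assumptions of the sketch and are not taken from the reference declaration itself.

```python
def segment_reference(text_name, segment_number):
    """Build a reference of the form TEXTNM-00000 from a six-character
    text name and a segment number."""
    if len(text_name) != 6 or not (text_name.isascii() and text_name.isalnum()):
        raise ValueError("text name must be six alphanumeric characters")
    if not 0 <= segment_number <= 99999:
        raise ValueError("segment number must fit in five digits")
    return f"{text_name}-{segment_number:05d}"


print(segment_reference("Wingss", 1009))  # Wingss-01009
print(segment_reference("GaWldA", 35))    # GaWldA-00035
```

Zero-padding the segment number keeps every reference the same length, so that references sort in document order within a text.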
[Burnard, Sperberg-McQueen, 1993] allows a single header to contain more
than one reference declaration. This could be useful if some texts in the
corpus had a reference mechanism applying to their source form -- by page
and line number, for example -- and this information was preserved in the
corpus. The BNC, however, does not heed its sources' reference schemes
(in general, they have none), so a single reference declaration suffices.
The corpus header encoding description ends with the classification
declaration, which contains one or more taxonomy elements. These define
the dimensions along which texts in the corpus are classified, and legal
values along each dimension. As figure 9 suggests, the necessary
declarations can be lengthy, but the result is that the specification of
the profile of a particular text relative to a taxonomy can be very
compact -- see sections 5.3 and 7.3.
The taxonomy of the BNC is complex, with each text being classified along
many dimensions, although, while [Burnard, Sperberg-McQueen, 1993]
provides for a single text to assume multiple values in a given
dimension, values assumed by a BNC text are always single. Some
dimensions are important in balancing the content of the corpus relative
to its design criteria. For example, [BNC, 1991a] specifies that the
numbers of spoken texts recorded in the north, middle and south of
Britain should be approximately equal. Such criteria are known as balance
criteria. Other dimensions, while not used in balancing the content of
the corpus, classify texts in manners which may be useful to corpus
users. An example here is author sex for written works: while the design
documents for the corpus do not specify a ratio of female to male authors
(or, indeed, of single-sex to mixed-sex collaborations), [BNC, 1991b]
provides for such information to be recorded, although the examples in
this paper do not show these classification criteria. Either balance or
classification criteria may apply only to particular text types. For
example, interaction type -- monologue or dialogue -- applies only to
spoken texts.
A complex taxonomy can be specified in a number of ways. It may be
specified as a single large taxonomy in which not all the dimensions
specified apply to each text; or it may be split into a number of
taxonomies where, if one dimension in a particular taxonomy applies to a
particular text, then all the dimensions in that taxonomy apply. The BNC
follows the latter course. Figure 9 shows a selection of the taxonomies
defined in the corpus header.
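
The rule that every dimension of an applicable taxonomy must be given exactly one legal value lends itself to a simple check. The Python sketch below is illustrative only: the data model (nested dictionaries) and the taxonomy name are assumptions, and the dimensions shown are simplified from the examples mentioned above.

```python
def check_profile(taxonomies, profile):
    """Return a list of problems found in one text's classification.

    taxonomies: {taxonomy: {dimension: set of legal values}}
    profile:    {taxonomy: {dimension: value}} for one text
    """
    problems = []
    for taxonomy, assigned in profile.items():
        dimensions = taxonomies.get(taxonomy)
        if dimensions is None:
            problems.append(f"unknown taxonomy {taxonomy}")
            continue
        # If a taxonomy applies, every one of its dimensions must be set.
        for dimension, legal in dimensions.items():
            if dimension not in assigned:
                problems.append(f"{taxonomy}: missing {dimension}")
            elif assigned[dimension] not in legal:
                problems.append(
                    f"{taxonomy}: illegal value {assigned[dimension]!r} for {dimension}")
        for dimension in assigned:
            if dimension not in dimensions:
                problems.append(f"{taxonomy}: unknown dimension {dimension}")
    return problems


# Hypothetical taxonomy, simplified from the examples in the text.
taxonomies = {
    "spoken": {
        "interaction": {"monologue", "dialogue"},
        "region": {"north", "middle", "south"},
    },
}
print(check_profile(taxonomies, {"spoken": {"interaction": "dialogue"}}))
# prints: ['spoken: missing region']
```

Storing one value per dimension in a dictionary enforces the BNC practice that a text assumes a single value along each dimension.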
4.3. The profile description
The profile description in the header of an individual text serves mainly
to classify that text against criteria specified in the corpus header
encoding description and elsewhere. The corpus header profile description
might be expected to contain only profile information common to all texts
in a corpus, and hence to be quite short. The language usage element,
shown in figure 10, is a case in point: it states that all texts in the
BNC are examples of modern British English. It references a writing
system declaration. As a modern, monolingual, corpus using a single
alphabet, the BNC requires only a simple definition of its language
usage: other types of corpus would be likely to require greater
complexity in this area. (See 2.2.)
In addition to giving information applying to all texts in a corpus, the
profile description can group information which describes aspects of
particular texts or, more usefully, groups of texts. The types of
information provided for by [Burnard, Sperberg-McQueen, 1993] are more
useful for grouping spoken material than for written, and it is for this
purpose that the remainder of the profile description is used in the BNC.
Figure 11 shows information which is common to a number of spoken texts
in the corpus, and which consequently is encoded in the corpus header. It
details the participants in a series of spoken texts, and the
relationships between those participants. Texts in which these
participants appear reference the descriptions by means of the WHO
attribute on the (utterance) tag, as shown in section 7.3.
4.4. The revision description
Figure 12 shows the corpus header revision description, which documents
changes to the corpus header itself, and to the corpus as a whole.
(Changes to individual texts in the corpus are described in the texts'
own headers -- see section 5.4.)
While the revision description in the initial released corpus may be
expected to be larger than the stub shown in the figure, it is not likely
to be greatly expanded. Future revisions of the corpus, should there be
any, will have more detailed revision descriptions which describe the
global and corpus header changes made between successive revisions.
5. Text Header for a Non-Composite Written Text
This section is the first of three dealing with sample headers for three
texts from the BNC. The subsections of this section discuss all parts of
the text header; subsections for the second and third examples discuss
only those aspects of the header which are of interest in connection with
particular types of text.
The first example text is Wings [Pratchett, 1991], a novel. Because the
source text contains more than 40,000 words, a whole number of chapters
starting at the beginning of the book and totalling a little less than
40,000 has been captured. This example is a sample from the beginning of
the source text; the BNC contains approximately equal numbers of
beginning, middle and end samples.
As with all written published texts in the BNC, [Pratchett, 1991] was
chosen by applying selection criteria to a number of lists of
publications. [BNC, 1991a] describes the methodology. Half the texts in
the corpus are chosen at random using the method described in [BNC,
1992a]; the remainder are chosen from lists which suggest that a
publication is influential in some manner (frequently borrowed from
libraries; recommended reading on courses...). [Pratchett, 1991], which
has sold well, is an example of the latter class.
The header for the text is introduced by the mark-up shown in figure 13.
The name of the text, given by the N attribute of the element, is
for use in constructing references. (See 4.2.) For convenience, the value
of the N attribute of duplicates information to be found in
the edition statement. (See next section.)
5.1. The file description
The file description of an electronic text begins with a statement of the
title of the text, and of those responsible for its intellectual content
as an electronic text. Thus, as shown in figure 14, the title has the
words "an electronic sample" appended so as to distinguish it from that
of the source text. Following similar reasoning, the author of the source
text is not named here; rather, those responsible for creating an
electronic sample from the source text are named. Note here that it is
the policy of the BNC to name the organisation responsible.
There are several thousand written texts with identical statements of
responsibility in the BNC. Consequently, actual headers use an entity
reference which expands to the text shown here.
The edition statement and extent statement, also shown in figure 14,
follow the title statement. All texts in the initial release version of
the BNC will be at revision 1.0. Should there be further releases, some
-- but possibly not all -- of the texts in a given release will have a
revision level greater than 1.0, signifying that their content has
changed relative to that of the corresponding text in the initial release
of the corpus. Such changes should be described by the texts' revision
descriptions. (See 5.4.)
The extent statement gives the size of the text (without header or
mark-up) in orthographic words, and in kilobytes. The latter figure is
intended to be useful for users who examine free-standing headers (see 8)
prior to copying complete texts to local storage. Information about the
range of pages captured from the source text appears in the source
statement. (See below.)
The next element of the file description, shown in figure 15, is the
publication statement. As with the title statement, this information
pertains to the electronic, not the source, text, and so describes how
the electronic text may be obtained, and the usage restrictions attaching
to it. Although much of the material in the publication statement is
identical for every text in the BNC, there is provision to insert usage
restrictions specific to a particular text if necessary. (See also 3.4.)
The publication statement also contains two elements giving
identifying numbers for the electronic text. The first of these consists
of three characters, and is used in generating unique identifiers (values
for ID attributes) for elements of the text and its header, as described
in 3.2. The second identifier is the six-character name used in creating
references to material in the corpus. (See section 4.2.) For written
texts, these names are generally derived from the title or author name
for the text.
The file description provides for a notes statement. A possible use is
shown in figure 16. Here, a characteristic of the text which was noted
during transcription, and which may be confusing to future users, is
described.
The final element of the file description is the source statement, a
bibliographic record for the source text. As figure 17 shows, the
information is recorded in a element, one of a number of
element types offered by [Burnard, Sperberg-McQueen, 1993]. Note that the
element gives the starting and ending pages of the electronic
sample; the start and end pages for the complete source text (if
appropriate, and if known) are given as a note in order that users may
know how great a proportion of the source text is contained in the
electronic sample.
5.2. The encoding description
As shown in figure 18, the encoding description for a typical corpus text
is very short, all necessary declarations having been made in the corpus
header. (See 4.2.) The relevant definitions are referenced via the DECLS
attribute of the element. (See 5.5.)
A project description is given, taking the form of a human-readable
pointer to information in the corpus header. The semantics defined by
[Burnard, Sperberg-McQueen, 1993] specify that information given in a
text header overrides corresponding information given in the corpus
header. Thus, although helpful to users "eyeballing" a text, giving a
project description in the text header prevents SGML- and TEI-aware
processing software from finding the fuller information in the corpus
header. The problem is overcome by listing the ID of the corpus header
project description among the declarations on the element, as
shown in 5.5. This makes TEI-aware processors ignore the information in
the text header.
The figure also shows an editorial declaration. Even this should not be
required, being necessary only if some editorial procedure specific to a
single corpus text has been applied. (There are currently no corpus texts
for which this is the case.) Explicit or implicit references from the
element to editorial declarations in the corpus header (see 4.2
and 5.5) should be sufficient.
The encoding description ends with a series of tag usage elements. For
each type of element which is a descendant of the element of a BNC
text, a element is given. It lists the number of occurrences
of the element, and the number of these where the element has an ID
attribute -- presumably indicating that it is referenced by some other
element. [Burnard, Sperberg-McQueen, 1993] allows elements to
have content describing the circumstances and manner in which a
particular element is used. This information is not provided here in the
BNC, but instead is provided globally as tag set documentation. (See
[Burnard, Sperberg-McQueen, 1993]).
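
The counts recorded in tag usage elements can be derived mechanically. The sketch below uses the HTML parser from the Python standard library as a stand-in for a real SGML parser, which is an assumption of convenience: it is adequate here only because the interchange format omits no tags, and it folds element names to lower case. The sample mark-up is invented.

```python
from collections import Counter
from html.parser import HTMLParser


class TagUsageCounter(HTMLParser):
    """Count start-tags per element type, and those carrying an ID."""

    def __init__(self):
        super().__init__()
        self.occurrences = Counter()
        self.with_id = Counter()

    def handle_starttag(self, tag, attrs):
        self.occurrences[tag] += 1
        if any(name == "id" for name, _ in attrs):
            self.with_id[tag] += 1


def tag_usage(markup):
    """Return {element name: (occurrences, occurrences with an ID)}."""
    counter = TagUsageCounter()
    counter.feed(markup)
    counter.close()
    return {tag: (count, counter.with_id[tag])
            for tag, count in counter.occurrences.items()}


# Invented fragment, for illustration only.
sample = '<text><p id="a1">One</p><p>Two</p><s>Three</s></text>'
print(tag_usage(sample))
# prints: {'text': (1, 0), 'p': (2, 1), 's': (1, 0)}
```

Each pair in the result corresponds to the two figures given for an element type: the number of occurrences, and the number of those with an ID attribute.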
5.3. The profile description
The profile description for a written corpus text consists of a sequence
of elements, as shown in figure 19. These elements point back to
elements of taxonomies in the corpus header, and so categorize the text
relative to the dimensions defined there. Decoding the description of the
example text, it is a written text (first ); it is from a book or
periodical, it is imaginative, it is addressed to a wide audience, it is
a beginning sample, and it was published between 1975 and 1993 (second
); and it was chosen for inclusion in the corpus because of its
circulation or influence (third ). Some of the corresponding
category definitions may be seen in figure 9.
Currently, no provision has been made in the BNC header for
classification information such as author age, author sex, author
domicile, audience age, and so on. Some of this information, such as
author sex, can be handled with further elements, while the
remainder will be encoded by using elements to describe the
author and the audience as participants in the interaction mediated by
the text. (See 4.3 for an example of the use of in
connection with spoken texts.)
5.4. The revision description
Figure 20 shows the final element in the text header, the revision
description. During development of the corpus, this lists processing
steps. This information will not appear in the published corpus. Should
it be necessary to revise a text within the corpus after the corpus has
been published, the revision description for the text in the next
published version of the corpus will describe the changes made.
5.5. The text
Figure 21 shows the start and end of the actual text. The element
has a DECLS attribute which references those declarations -- sampling,
editorial and so on -- which apply to the text. In the case of this
example text, just one declaration is required -- that for the project
description. (The reason for this is explained in 5.2.) In all the other
cases where declarations might be required, the default declarations
describe this particular text, and so need not be enumerated.
6. Text Header for a Composite Written Text
The second example is a short extract from The Guardian, a British
national daily newspaper. The start of its header is shown in figure 22.
It is the practice of the BNC to partition newspaper material into texts
which group a number of stories dealing with the same type of subject
matter -- here, world affairs. The example is atypical, containing just
two articles -- most newspaper-derived texts contain many more.
6.1. The file description
The title statement for the text, shown in figure 23, is very similar to
that for the first example text, discussed in 5.1. The publication
statement of figure 24 is more interesting. As stated in 5.1, specific
permissions information may appear in the publication statements of
individual non-composite texts. The same is true for composite texts, but
is complicated by the fact that each element of the composite may
potentially have its own specific restrictions on usage. The example
shows how, by using the DECLS attribute on a paragraph, it may be
associated with a particular element (or elements) of the composite. The
identifiers (IDs) point to the bibliographic records (see below) for the
elements in question. In the absence of a DECLS attribute, a paragraph
applies to all elements of the composite text.
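As a rough illustration of this scoping rule, the following Python sketch returns the elements of a composite text to which a permissions paragraph applies. The identifiers bib1 and bib2 and the function name are invented for the example; only the rule itself (no DECLS attribute means every element) comes from the description above.

```python
def texts_covered(paragraph_decls, analytic_ids):
    """Return the analytic-text IDs to which a permissions paragraph applies.

    paragraph_decls is the paragraph's DECLS value (a space-separated list
    of IDs), or None when the attribute is absent, in which case the
    paragraph applies to every element of the composite text.
    """
    if paragraph_decls is None:
        return list(analytic_ids)
    wanted = set(paragraph_decls.split())
    return [ident for ident in analytic_ids if ident in wanted]


# Hypothetical identifiers for the two bibliographic records of a composite.
ANALYTIC_IDS = ["bib1", "bib2"]
```

Calling texts_covered(None, ANALYTIC_IDS) yields both identifiers, while texts_covered("bib2", ANALYTIC_IDS) yields only the second.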
The file description ends with a structured bibliographic record for each
element of (analytic text in) the composite. These are shown in figure
25. Each record has an identifier (ID) attribute, so that it can be
referenced by permission information (see above), and from the relevant
part of the body of the text (see 6.2). Where an analytic text has an
attribution, as in the second case, this is given. The second
bibliographic record shows how an SGML entity reference may be used to
replace material common to many records. This greatly reduces header size
in cases where a single monographic text contains many analytic texts,
each with its own bibliographic record.
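The saving can be estimated with a short Python sketch. The entity name GuardnA is taken from figure 25, but the replacement text used here is a stand-in chosen for illustration, and the helper names are hypothetical. The estimate is deliberately crude: it counts characters, and ignores the markup of the entity declaration itself.

```python
import re

# Hypothetical entity table. The name GuardnA appears in figure 25; the
# replacement text is a shortened stand-in, not the real BNC entity.
ENTITIES = {"GuardnA": "The Guardian, electronic edition of 1989-11-08"}

# An entity reference is "&name", optionally closed by ";".
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")


def expand(text, entities=ENTITIES):
    """Replace each known entity reference with its text; leave others alone."""
    return REFERENCE.sub(
        lambda match: entities.get(match.group(1), match.group(0)), text
    )


def chars_saved(entity_name, uses, entities=ENTITIES):
    """Estimate the characters saved by referencing an entity `uses` times.

    Each use costs len("&name;") instead of the full text, and the text
    itself still has to be stored once in the entity declaration.
    """
    full = len(entities[entity_name])
    reference = len(entity_name) + 2  # the "&" and the ";"
    return uses * (full - reference) - full
```

Under these assumptions a single use costs more than it saves, because the replacement text must still be declared once; the benefit appears only when many records share the same material.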
6.2. The text
The profile and revision descriptions for the composite written text are
much as for the non-composite text discussed in 5.3 and 5.4 respectively,
and so are not discussed here.
Fragments of the composite text are shown in figure 26. The two analytic
texts may be seen, each encoded as a element. (See also 10.) These
elements have DECLS attributes which point back to the relevant
bibliographic record in the text header. (See 6.1.)
The enclosing element also has a DECLS attribute. As well as
selecting the correct project description (see 5.5), it overrides the
default declarations for sampling and for the editorial treatment of
hyphenation. (See figure 6 for the former; the latter is not among the
editorial declarations shown in figure 7.)
7. Text Header for a Spoken Text
The third and final example is a spoken text transcribed from a cassette
tape recorded by a volunteer picked using a demographic sampling method.
The BNC also includes context-governed spoken material, which is not
discussed in this paper. [Burnage, Dunlop, 1993] gives an overview of
sampling methods for spoken material, which are fully described in [BNC,
1991a].
The start of the corpus text is shown in figure 27. The text name, given
by the N attribute to is numeric for all spoken texts in the BNC.
This is just a convention, serving to distinguish spoken texts from the
alphanumerically-named written texts.
7.1. The file description
Figure 28 shows the initial elements in the file description for the
spoken text. This information is similar to that in the written examples
presented in 5.1 and 6.1, and so is not discussed here. The source
description of figure 29, however, is quite different from that for a
written text. It consists of a recording statement rather than a
bibliographic record. As is the case for many header elements, [Burnard,
Sperberg-McQueen, 1993] provides for the recording statement either to be
a sequence of paragraphs, or to be structured. The BNC has elected to use
the former approach in this case, with the first paragraph giving the
date and time of recording, and the second the recording method. A more
structured approach involving the element (not shown in the
example) may be adopted for broadcast material included in the BNC.
7.2. The profile description
The profile description for spoken text is, as figure 30 shows, rather
more complex than that for a written text. In addition to the category
references which describe the text, text and setting descriptions are
required further to describe the interaction represented by the text.
The text description is provided by [Burnard, Sperberg-McQueen, 1993] for
use with any text, but is used by the BNC only in connection with spoken
texts. Even here, some elements -- , , and
-- contain default values, as the data collection methodology does not
capture this information. The developers of other spoken corpora may wish
to provide useful values for these elements.
The setting description describes one or more settings in which an
interaction takes place. In the case of most spoken material in the BNC,
a single element always suffices: precise information about the
location of individual participants (for example "under the sink", "on
the doorstep") is not available, and no recordings have been made of
interactions in which participants are widely separated (telephone
conversations). This situation may change if, for example, recordings of
broadcast interviews in which interviewer and interviewee are in
different studios are obtained.
The category references which end the profile description characterize
the text as spoken (first ); demographic, Midland region (second
); and as produced by a male aged 56 or over in social class DE
(third ). (See also 4.3 and figure 9.)
Information about participants in demographic spoken material is held in
the corpus, rather than the text, header because it is common to a number
of texts: the BNC demographic data collection methodology results in the
same speakers appearing in many texts. This is not the case with
context-governed material, in which participants (or participant groups)
typically appear only in a single text. For this material (not shown in
this paper), participant information appears in the text header profile
description.
7.3. The text
Figure 31 shows the framework of the spoken text. The DECLS attribute of
its element overrides default declarations for project description
(see 5.5), sampling, and for a number of editorial practices:
normalization, hyphenation, quotation and analysis. (See 4.3.) The
defaults in the BNC favour written texts, resulting in a greater need to
override them for spoken texts.
The body of the spoken text consists of a sequence of utterances. Each
element has a WHO attribute to identify the speaker. These attributes
correspond to ID attributes on elements in the corpus
header. (See 4.3 and 7.2.)
8. Free-standing headers and the BNC
[Burnard, Sperberg-McQueen, 1993] discusses the use of headers as
free-standing bibliographic records. It is intended that, in addition to
their use in conjunction with their associated texts, BNC headers should
be usable in this manner. In particular, a corpus user should, by
reference to the BNC corpus header and a collection of BNC text headers
alone, be able to select texts which conform to some set of
user-designated criteria (for example, imaginative written texts targeted
at a wide audience and containing quotations). The advantages of this
approach are two-fold: firstly, the user saves a great deal of local
storage through not needing the full corpus in order to make selections
of this type; secondly, the user need not undertake the copyright
responsibility defined by the BNC end user agreement [BNC, 1992b] for
texts which are not of interest -- only selected texts need be fetched
from a central archive after they have been identified by reference to
their headers.
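Selection from headers alone might look like the following Python sketch. It assumes, purely for illustration, that each text header has already been reduced to a text name and a set of category codes. The record layout and the category codes are hypothetical; the text names follow those shown in figures 15, 24 and 28.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeaderRecord:
    """Stand-in for the classification data carried by one text header."""
    text_name: str
    categories: frozenset  # hypothetical category codes


def select_texts(headers, required):
    """Return the names of texts whose headers carry every required code."""
    wanted = frozenset(required)
    return [h.text_name for h in headers if wanted <= h.categories]


# Text names follow figures 15, 24 and 28; the category codes are invented.
HEADERS = [
    HeaderRecord("Wingss", frozenset({"written", "imaginative"})),
    HeaderRecord("GaWldA", frozenset({"written", "world-affairs"})),
    HeaderRecord("026211", frozenset({"spoken", "demographic"})),
]
```

Only the texts named in the result would then need to be fetched from the central archive.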
9. The Automatic Generation of Headers
All of the information presented in BNC text headers, and much of the
information in the corpus header, is derived from data stored in a
relational database under the control of the Ingres database manager
[Ingres, 1989]. The headers are, in effect, database reports delivered in
a format which conforms to SGML syntax and to [Burnard, Sperberg-McQueen,
1993]. The structure of the database [BNC, 1992c] is intended to
eliminate duplication of data, even where duplicated information appears
in headers. (For example, in the bibliographic records of composite
source texts -- see 6.1.)
Use of a database makes it possible to identify information which is
duplicated, and so identify candidates for movement out of text headers
into the corpus header. A case in point might be an author whose work
appears in one or more books sampled for the corpus, and in newspapers or
journals. Participant information for this author (see 5.3) could be
centralized in the corpus header (4.3) rather than being duplicated in
text headers. (However, it should be noted that [Burnard,
Sperberg-McQueen, 1993], while stating that it is intended that
participant descriptions should be useful for written, as well as for
spoken, texts, proposes no mechanism for linking elements of written texts
to participant descriptions.)
The database can also help to identify elements which, because they have
identical content in many headers, are candidates for replacement with
space-saving entity references. (See 3.3 and 6.1.)
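The search for shared content can be sketched as a database query. The example below uses Python's built-in sqlite3 module as a stand-in for Ingres, and the two-table schema, column names and sample rows are assumptions made for illustration; the structure of the real BNC database is not reproduced here.

```python
import sqlite3


def demo_connection():
    """Build a small in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE source (id INTEGER PRIMARY KEY, description TEXT);
        CREATE TABLE analytic (
            title TEXT,
            source_id INTEGER REFERENCES source(id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO source VALUES (?, ?)",
        [(1, "The Guardian, edition of 1989-11-08"), (2, "Another source")],
    )
    conn.executemany(
        "INSERT INTO analytic VALUES (?, ?)",
        [("Quote", 1), ("Diary", 1), ("Single item", 2)],
    )
    return conn


def entity_candidates(conn, min_uses=2):
    """Return (description, uses) for sources shared by several records.

    Each source is stored once; the count shows how often its description
    would be repeated in generated headers, and so how much an entity
    reference might save.
    """
    query = """
        SELECT s.description, COUNT(*) AS uses
        FROM source AS s JOIN analytic AS a ON a.source_id = s.id
        GROUP BY s.id, s.description
        HAVING COUNT(*) >= ?
        ORDER BY uses DESC, s.description
    """
    return conn.execute(query, (min_uses,)).fetchall()
```

Because each source description is held once in the database, the count of referring records is a direct measure of how often it would otherwise be repeated in the generated headers.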
10. Grouping in composite texts
[Burnard, Sperberg-McQueen, 1993] describes a element, which
allows the encoding of composite texts of arbitrary complexity.
Unfortunately, this element was specified by the TEI after encoding of
the composite texts in the BNC had commenced. Consequently, composite
texts are represented in the BNC as elements having the attribute
ORG=COMPOS (composite, rather than the default of sequential,
organization), and containing a sequence of or elements.
Those engaged in future corpus development projects should take advantage
of s where possible.
11. Summary and Conclusions
Experience in developing corpus and text headers for a large and
relatively diverse corpus has shown that the model recommended by the TEI
is adequate to the needs of such undertakings. Areas in which the TEI
model was feared to be unnecessarily verbose -- for example, the need to
give full bibliographic records for each element of a composite text --
have proved in the event not to present problems. One important area
remains to be addressed, however: there is no mechanism allowing
information in the corpus header to link together all texts sharing some
feature in common. Such a feature would be useful to corpus users wishing
to identify all texts by a given author, all editions of a particular
newspaper, all conversations involving a particular participant, and so
on. While some queries of this type can be satisfied by textual searches
on fields in text headers, others currently require time-consuming
examination of the text of large parts of the corpus. Work continues in
co-operation with the Text Encoding Initiative to address this issue.
References
Copies of British National Corpus project documents may be obtained by
sending electronic mail to the author at natcorp@vax.ox.ac.uk.
BNC, 1991a
TGAW15: Spoken corpus design specification. British National Corpus
project document, 1991
BNC, 1991b
BNCW08: Written corpus design specification. British National Corpus
project document, 1991
BNC, 1992a
TGAP21: Selecting titles for the British National Corpus. British
National Corpus project document, 1992
BNC, 1992b
TGBP05: BNC Permissions Request. British National Corpus project
document, 1992
BNC, 1992c
TGDW36: The new BNC database. British National Corpus project
document, 1992
Burnage, Dunlop, 1993
Burnage, Gavin and Dunlop, Dominic. "Encoding the British National
Corpus". English language corpora: design, analysis and exploitation.
Aarts, Jan and de Haan, Pieter and Oostdijk, Nelleke (eds.).
Amsterdam and Atlanta: Editions Rodopi, 1993, 79-95
Burnard, Sperberg-McQueen, 1993
Burnard, Lou and Sperberg-McQueen, Michael (eds.). TEI P2: Guidelines
for Electronic Text Encoding and Interchange, Draft version 2.
Oxford, Chicago: The Text Encoding Initiative, 1993
Giordano, 1993
Giordano, Richard. "????". This edition of Computers and the
Humanities (1993), xx-yy [Editor, please fill in title and page numbers]
Goldfarb, 1990
Goldfarb, Charles F. The SGML Handbook. Oxford: Oxford University
Press, 1990
Ingres, 1989
Introducing Ingres for the UNIX and VMS operating systems. Alameda,
CA: Relational Technology Inc., 1989
ISO, 1986
ISO 8879:1986 Information processing -- Text and office systems --
Standard Generalized Markup Language (SGML). Geneva: International
Organization for Standardization, 1986
ISO, 1991
ISO 646:1991 Information processing -- ISO 7-bit coded character set
for information interchange. Geneva: International Organization for Standardization, 1991
Pratchett, 1991
Pratchett, Terry. Wings. London: Corgi, 1991
Table I
Positions  Function                  Notes
1-3        Uniquely identify text    Derived from 6-character text name
           in which ID is defined    by a mapping or hash. First
                                     character alphabetic; others
                                     alphanumeric. Mapping given
                                     explicitly by text header s -- see
                                     figure 15. Omitted in corpus header.
4-5        Uniquely identify the     Somewhat mnemonic
           element type (GI) to
           which the ID belongs
6-8        Make the identifier       Effectively a separate base-33
           unique across the         counter per GI and per text.
           entire BNC
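The identifier scheme of Table I can be sketched in Python as follows. The 33-symbol alphabet (digits and capital letters without I, O and Z) and the element code used in the example are assumptions, since the table specifies only the field widths and the base of the counter. The text code A73 is the one shown in figure 15.

```python
# Hypothetical 33-symbol alphabet: digits plus capitals, omitting I, O and Z.
ALPHABET = "0123456789ABCDEFGHJKLMNPQRSTUVWXY"
BASE = len(ALPHABET)  # 33


def encode_counter(value, width=3):
    """Encode a non-negative counter as a fixed-width base-33 string."""
    if not 0 <= value < BASE ** width:
        raise ValueError("counter does not fit in %d positions" % width)
    symbols = []
    for _ in range(width):
        value, remainder = divmod(value, BASE)
        symbols.append(ALPHABET[remainder])
    return "".join(reversed(symbols))


def make_id(text_code, element_code, counter):
    """Build an identifier: text code (1-3), element code (4-5), counter (6-8).

    Pass an empty text_code for identifiers defined in the corpus header,
    where Table I says positions 1-3 are omitted.
    """
    if len(text_code) not in (0, 3):
        raise ValueError("text code must be empty or three characters")
    if len(element_code) != 2:
        raise ValueError("element code must be two characters")
    return text_code + element_code + encode_counter(counter)
```

Under these assumptions, make_id("A73", "bi", 1) would produce an eight-character identifier for the second element of type "bi" in that text, with three positions allowing 33 cubed (35,937) identifiers per element type and per text.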
Figures
Figure 1 Content model for TEI header
Mandatory description of electronic text and
any source texts
Optional description of the means by which the
source texts have been rendered into electronic form
Optional description of the characteristics of an
electronic text or of the texts in a corpus
Optional text or corpus revision history
Figure 2 Corpus header source statement
The corpus, considered as a text in its own right,
has no source: it was originated in electronic
form.
See the source descriptions to component texts
in order to trace the sources of those texts.
Figure 3 Corpus header title statement
The British National Corpus
Consortium member The British Library Board
... details of five further partners omitted
Consortium member The University of Oxford
0.1 (initial alpha test version)
One hundred million words — ninety million from
written sources, ten million from spoken sources.
Occupies approximately two gigabytes (eight-bit bytes) of
computer storage.
Figure 4 Corpus header publication statement
Archive site
Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN U.K.
Telephone: +44 491 273280
Facsimile: +44 491 273275
Internet mail: natcorp@ox.ac.uk
bnc0.1
Available at nominal charge for academic research
purposes throughout the EC subject to a signed
permissions agreement having been received by
Oxford University Computing Services, from which
blank forms and supporting materials are available.
Availability for commercial research and
exploitation only where terms have been agreed
with the BNC Consortium Exploitation Committee.
Apply in the first instance to Oxford University
Computing Services.
1993-04-17
Figure 5 Corpus header project description
The British National Corpus project is a pre-
competitive collaboration between commercial
and academic partners in the U.K. running from
1991 to 1994.
further descriptive material omitted...
Funding for the British National Corpus has been
provided by ...
Figure 6 Corpus header sampling declarations
Where a source text is a book, no more than
40,000 words are sampled. This is true even
where the book contains a collection of works
from a selection of authors ...
No more than 120,000 words from any one author,
whether individual or corporate, and whether
writing individually or collaboratively, appear
in the corpus ...
Where a source text is a book shorter than
40,000 words, the whole text is captured, and
ten per cent is then excised ...
No more than 120,000 words from any one author,
whether individual or corporate, and whether
writing individually or collaboratively, appear
in the corpus ...
Where the source text is a magazine or newspaper,
the whole of the editorial text is captured ...
The length of demographically-sampled spoken
samples is limited by recording technology to
ninety minutes. No word-count limit is applied.
Samples of context-governed spoken material
are truncated to no longer than 40,000 words ...
Figure 7 Corpus header editorial declarations
When noticed during encoding, errors or
suspected errors in the original text are
tagged with sic.
No normalization applied.
Transcription uses standard English spelling,
except for a control list of dialectal forms
and vocalized pauses and does not reflect
pronunciation.
...
Part-of-speech information corresponding to the
CLAWS C5 tag set is appended to each word ...
Overlapping speech is marked when two or three
speakers are speaking simultaneously. The
fourth and subsequent simultaneous
utterances are not marked.
Part-of-speech information corresponding to the
CLAWS C5 tag set is appended to each word ...
...
Figure 8 Corpus header reference declaration
Figure 9 Corpus header classification declaration
Text type
  Written published
  Written unpublished
  Spoken
Medium (written published only)
  Books & periodicals
  Miscellaneous
  Written to be spoken
Domain (written published only)
  Imaginative
  Applied science
...
Medium (spoken only) Demographic Context-governed
...
Figure 10 Corpus header language usage declaration
Modern British English
The language of the British National Corpus
is modern British English. Words, fragments,
and passages from many other languages, both
ancient and modern, occur within the corpus
where these may be represented using a Latin
alphabet. Long passages in these languages,
and material in other languages, are generally
silently deleted. In no case is the LANG
attribute used to indicate the language of a
word, phrase or passage, nor are alternate
writing system definitions used.
Figure 11 Corpus header participant descriptions
Fred British English; East Midlands dialect Northants, England To age 14 Retired
Florence British English; Midlands dialect Retired
...
Steven British English; East Midlands dialect Office manager
...
...
Figure 12 Corpus header revision description
1993-05-17 DFD Internal alpha test version
Figure 13 Header start for non-composite written text
Figure 14 Non-composite written file description start
Wings &mdash an electronic sample
Data capture Oxford University Press
Encoding, storage and distribution Oxford University Computing Services
Text enrichment Unit for Computer Research into the English
Language, University of Lancaster
1.0
37875 words
460 kbytes
Figure 15 Non-composite written text publication statement
Archive site Oxford University Computing Services
...
A73 Wingss
Additional restrictions relating to a particular
work (if any) are summarized here.
Available only as part of the British National
Corpus at nominal charge for academic research
purposes throughout the EC ...
1993-03-17
Figure 16 Non-composite written text notes statement
Attributions on the epigraphs at the start of each
chapter refer to characters in the text.
Figure 17 Non-composite written text source statement
Terry Pratchett Wings First paperback edition,
published 1991, reprinted 1992
Corgi 1991 13-115 0 552 52649 5 Source page range: 13-172
Figure 18 Non-composite written text encoding description
See project description in corpus header for
information about the British National Corpus
project.
Any editorial practice specific to a single
text is described here. All other practices
are referenced through decls on the text tag
or by default.
...
...
Figure 19 Non-composite written text profile description
Figure 20 Non-composite written text revision description
1993-03-17 OUP Passed to OUCS
1993-04-07 OUCS Passed to Lancaster
1993-05-30 UCREL Passed to OUCS
1993-06-15 OUCS Accession to corpus
Figure 21 Non-composite written text
...
Figure 22 Header start for composite written text
Figure 23 Composite written file description start
The Guardian, edition of 1989-11-08 &mdash an
electronic collection of material related to
world affairs
... as for first example text
1.0
850 words
12 kbytes
Figure 24 Composite written text publication statement
... as for first example text
B9H GaWldA
Additional restrictions relating to a particular
analytic text or texts (if any) are summarized
here. The decls attribute cross-references the
analytic texts affected. Further paragraphs
may summarize different restrictions applying to
different analytic texts.
... common information as for first example text
1993-03-30
Figure 25 Composite written text source statements
Quote&hellip
Following encoded as &GuardnA in second bibl.struct
[The Guardian, electronic edition
of 1989-11-08&rsqb Guardian Newspapers Ltd. 23 The Guardian 0261 3077
Diary Andrew Moncur
&GuardnA
Figure 26 Composite written text
Quote&hellip
...
Diary
Andrew Moncur
...
Figure 27 Header start for spoken text
Figure 28 Start of file description for spoken text
Spoken material from respondent Fred, sample
026211 Data capture and transcription Longman Dictionaries
... as for first example text
1.0
992 words
20 kbytes
... as for first example
QA0 026211
... as for first example
1993-05-04
Figure 29 Source description for spoken text
1992-03-15
Recorded by respondent on Walkman compact
cassette recorder; dubbed for archival to
Digital Audio tape at 44.1 kHz sampling rate;
redubbed for transcription to compact
cassette.
Figure 30 Profile description for spoken text
Rushden, Northants, U.K. Home Making tea/playing games
Figure 31 Spoken text
...
...
...