The British National Corpus

Practical Considerations in the Use of TEI headers in a Large Corpus

BNCX35

Dominic Dunlop
Oxford University Computing Services

Submission for CHum
Draft of 17 June, 1993

Abstract

As the first large corpus developed using mark-up conforming to the guidelines of the Text Encoding Initiative (TEI), the British National Corpus is a test-bed for many of the mechanisms that those guidelines describe. This is particularly true in the case of the TEI header, which has three intended applications -- to describe a corpus, to describe an individual text, and as a free-standing bibliographic record -- all of them used by the corpus. This paper describes the application of the TEI header to the British National Corpus. It is intended that this information should, through a description of experience on a practical project, serve as a guide for those wishing to use TEI headers in the documentation and management of other corpora and collections of texts.

Biographical statement

Dominic Dunlop is project manager for the British National Corpus at Oxford University Computing Services. Prior to assuming this position, he worked in a variety of positions related to development and support of the UNIX operating system, and was active in the POSIX initiative for the standardization of UNIX.

Keywords

bibliographic records, electronic texts, electronic title page, large corpora, Standard Generalized Markup Language, Text Encoding Initiative

1. Introduction

The British National Corpus (BNC) project is currently constructing a 100 million word corpus of modern British English for use in linguistic research. It is a collaborative, pre-competitive initiative carried out by Oxford University Press (OUP), Longman Group UK Ltd., W R Chambers, Lancaster University's Unit for Computer Research in the English Language (UCREL), Oxford University Computing Services (OUCS), and the British Library.
The project receives funding from the UK Department of Trade and Industry and the Science and Engineering Research Council within their Joint Framework for Information Technology, and from the British Library. The constitution of the BNC is described in [Burnage, Dunlop, 1993]. The same paper provides examples of the application of TEI-conformant mark-up to written and spoken texts, and discusses the processing steps necessary to transform a source text into a form which may be included in the corpus. These issues are not discussed further here, except where they bear on the contents of text and corpus headers. The Text Encoding Initiative (TEI) header, described in [Giordano, 1993], is required in all TEI-conformant texts. While its default structure is adequate to the needs of many applications, there are two groups of optional additions or considerations. These cover the needs of language corpora, and of the electronic exchange of bibliographic information. Both of these additions are required to meet the needs of the British National Corpus. The overall SGML [Goldfarb, 1990] structure of the header and its immediate descendants is shown in figure 1. Sections 4-7 of this paper discuss each part of the header as it applies to the corpus header and to three example text headers. Section 8 briefly describes applications of free-standing headers in the British National Corpus.

1.1. Notes on the figures

The figures in this paper consist of fragments of a preliminary version of the BNC corpus header and three example text headers. Their contents are subject to change prior to publication of the corpus. Indentation is used to show structure, and, with the exception of

, end-tags are shown, even where their omission is allowed. Ellipses (...) show where material has been omitted from some of the figures for reasons of brevity. Explanatory text is set in italic type; it is not part of the headers.

2. General Issues

2.1. Portability

[Burnard, Sperberg-McQueen, 1993] allows conformant materials to exist in two forms: a local storage format and an interchange format. The former will typically use a richer character set than the latter, and may use SGML options such as end-tag omission [Goldfarb, 1990]. In the interests of portability across networks and between computer architectures, an interchange format often uses a restricted character repertoire, and, making few assumptions about the capabilities of processing software, does not use end-tag omission or other SGML options. The promotion of portability in this manner makes for larger files, and may compromise intelligibility to humans. Because a primary goal of the BNC is that it should easily be interchangeable between archive and user sites, no local storage format has been specified: the corpus exists only in an interchange format which uses the character repertoire of the International Reference Version of [ISO, 1991]. All examples given in this paper are presented in this form. Those building corpora for which interchange is a less important consideration may wish to specify a richer, more compact, local storage format.

2.2. Language

The language of the British National Corpus is modern British English. As no other language is explicitly accommodated, and because the corpus does not provide information about, for example, prosody, which might require a repertoire of special characters, a single writing system declaration [Burnard, Sperberg-McQueen, 1993] suffices.
This declaration is comparatively simple, as differences between the character repertoire of the interchange format (see previous subsection) and that of the source text for the corpus can, for the most part, be handled by means of SGML public entity sets [Goldfarb, 1990]. The BNC writing system declaration is not discussed further here. Compilers of corpora of source texts written in languages requiring a richer character repertoire than English, compilers of multi-lingual corpora, and those using special marks for annotation will need to devote greater effort to the writing system and language usage declarations (see 4.3) than has been necessary for the BNC.

3. Issues Specific to Large or Modern Corpora

3.1. Accommodating practical SGML processors and computing platforms

Conceptually, the BNC is a single, SGML-compliant, document of around two gigabytes in extent. As such, it should be possible to submit it as a whole to SGML-aware software in order to accomplish user-specified processing. In practice, current SGML-aware software and computing hardware cannot handle the corpus as if it were a single document. Further, [ISO, 1986] specifies that conforming software must meet or exceed capacity limits which are very low: in some cases limits are unavoidably exceeded by individual texts in the BNC. Consequently, the BNC must be decomposable into manageable pieces in order to be processed at all, and steps must be taken to ensure that capacity limit requirements placed on processing software are not gratuitously inflated. The likely needs of users of the corpus suggest that such a manageable piece should consist of the corpus header, other global structures such as the writing system declaration (see 2.2), and a sub-corpus selected according to arbitrary user criteria from the totality of texts in the BNC.
Such a scheme implies that no text in the corpus may require the presence of any other text (for example because of a cross-reference using the TEI element) in order that it can be processed: such requirements would constrain the composition of user-specified sub-corpora. Instead, a text may reference only information in its own header or in the corpus header. While [Burnard, Sperberg-McQueen, 1993] requires a greater value for the NAMELEN limit -- thirty-two -- than the minimum specified in [ISO, 1986] -- eight -- the BNC has elected to stay with the lower limit. This has two consequences. Firstly, those TEI element and attribute names longer than eight characters are renamed in the BNC. [Burnard, Sperberg-McQueen, 1993] describes how this may be done. This paper uses TEI names for attributes and elements, rather than the shortened versions. Secondly, the identifiers through which one SGML element may reference another are limited in length to eight characters, and so are necessarily terse. The scheme used is described in the next subsection.

3.2. Ensuring the uniqueness of identifiers

Identifiers are heavily used in the BNC, and are likely to be so in any large corpus. As stated in the previous subsection, they are limited in length to eight characters. Rather than pick names at random, a name-generation scheme is used. Identifiers for corpus header elements have two fields encoding the element name and its instance; identifiers in a text prepend a third field unique to that text. Table I describes the scheme.

3.3. Minimizing file size

Some aspects of the interchange format of [Burnard, Sperberg-McQueen, 1993] are likely to increase file size. This is unfortunate since, other things being equal, it is easier to transport a smaller file than a larger one over a network or on some physical medium.
Happily, SGML provides a means by which files can be shortened through the replacement of repeated sequences of text by entity references which expand when processed to the replaced sequence. As subsequent sections show, several parts of a typical text header consist of "boilerplate" which may conveniently be replaced with entity references. This paper points out such situations as they occur, but the figures show the full text in such cases, rather than abbreviating it to an entity reference. A second means of reducing file size is to ensure that text headers contain a minimum of redundant information. This means that, where identical information would otherwise be repeated in a number of text headers, the information is instead moved into the corpus header and simply referenced by the text headers concerned. Again, the following sections draw the reader's attention to situations in which this mechanism is used.

3.4. Copyright

All the material in the BNC is subject to copyright considerations. After a text has been selected for inclusion in the corpus, but before it is converted to electronic form, considerable effort is put into tracing copyright holders and obtaining permission for a number of types of world-wide use. In order that users of the corpus are reminded of the responsibilities set out in [BNC, 1992b], copyright information is included in the header of each text and of the corpus itself. The information appears as plain text so as to be human-readable, in spite of the duplication this engenders. (See also 5.1.)

4. The Corpus Header

4.1. The file description

The TEI file description has two main functions: firstly, it serves as a bibliographic record for an electronic text as a work in its own right; and secondly, it describes the source text or texts from which an electronic text was derived. The second function is not applicable to a corpus as a whole, unless it has appeared elsewhere in alternative electronic forms.
This is not the case for the BNC, so the relevant part of the corpus header -- the source description, shown in figure 2 -- consists simply of a statement that the corpus as a whole has no source text. The source description is the final element of the file description. The elements which precede it are applicable to the corpus header. The first of these is the title statement, shown in figure 3. It gives the name of the corpus, and lists those responsible for its intellectual content. It is followed by the edition statement, which gives the release identifier of the corpus. This will change as the corpus is revised, with the initial published version being release 1.0. The figure also shows the extent statement, which describes the size of the corpus. The final element of the bibliographic record provided by the file description is the publication statement, shown in figure 4. Since no organization can strictly be identified as the publisher, distributor or release authority of the corpus, these TEI-suggested elements have been discarded in favour of using the general-purpose element to identify an archive site. The publication statement also contains availability information. The description shown is an example only: ultimate constraints on availability may differ.

4.2. The encoding description

The encoding description describes the manner in which a text or corpus has been rendered into electronic form. The description in the corpus header may be used simply to give those parts of the description which apply to all texts in the corpus. If it does only this, text-specific information must appear in the headers of affected texts, with the result that it will be duplicated if particular encoding practices apply to several texts, but not to every text in the corpus. This duplication may be eliminated by moving descriptions of such practices to the corpus header, and using the TEI's declarable element mechanism to associate particular practices with particular texts.
This methodology is heavily used in the BNC, leaving only encoding information specific to single texts to appear in specific texts' headers. The first element of the encoding description is the project description. Shown in figure 5, it gives a prose description of the British National Corpus project. A single project description covers the corpus and all texts in it. This is not the case with the sampling declarations which follow: there are five of these, covering respectively, books longer than 40,000 words; books shorter than 40,000 words; written material from sources other than books; demographically-sampled spoken material; and context-governed spoken material. Figure 6 shows them all. Two features are of note. Firstly, each declaration has an ID attribute in order that it may be referenced from those texts to which it applies; secondly, descriptions which apply to more than one sampling method are repeated in each declaration to which they apply. (The limit on the number of words from any one author is a case in point.) This duplication is necessary because each text must refer to exactly one sampling declaration; it is not possible to factor common parts of two declarations into a third declaration, and then point from each text to two declarations. Duplication of this type occurs at several points in corpus and text headers, and, while it may be shown in the figures, will not generally subsequently be mentioned in the text of this paper. [Burnard, Sperberg-McQueen, 1993] specifies a defaulting mechanism for declarable elements. This allows texts described by a default declaration to omit an explicit reference to that declaration. Although each sampling declaration applies to roughly equal numbers of texts, a default has been specified because BNC policy is that defaults should always be provided. (Designers of other corpora may wish to pursue a different policy.) Figure 7 shows portions of the single editorial declaration in the BNC corpus header. 
[Burnard, Sperberg-McQueen, 1993] allows multiple declarations. This is useful in the case where a corpus contains a number of distinct text types, and uniform editing conventions have been applied to all texts of a particular type. In such a case, a text need only reference the single declaration which applies. A single declaration describing a variety of editorial conventions is more appropriate to the BNC, in which different practices apply to different texts of the same type. For example, written texts obtained from existing electronic archives are treated in a different manner from those captured specifically for the BNC. A consequence of the use of a single editorial declaration is that each text must reference exactly one of each type of child element of editorialDecl. This reference may be implicit if there is only one such element, as is the case with in the figure; or if a default declaration is supplied and is applicable. In other cases -- , for example -- explicit references are required. The example texts (sections 5-7) show these references. They also show that, while [Burnard, Sperberg-McQueen, 1993] gives editorial declarations in an individual text's header higher precedence than those in the corpus header, no BNC text has such declarations in its header; the declarations in the corpus header always apply. The encoding description continues with a reference declaration, shown in figure 8. The BNC reference scheme provides for a unique reference to any segment in the corpus. (A segment is broadly analogous to an orthographic sentence. All parts of each text in the corpus, whether conventionally sentential or not, are divided into segment elements.) These references should be used when material in the corpus is cited.
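As a rough illustration of what such a reference looks like, the following Python sketch builds and parses references of the form described in this section: a six-character alphanumeric text name, a hyphen, and a five-digit segment name. It is not part of the BNC or of the TEI guidelines. The function names are invented for illustration, and the zero-padding of the segment number is an assumption drawn from sample references such as Wingss-01009.

```python
import re
from typing import Tuple

# Assumed layout: six alphanumeric characters, a hyphen, five digits.
TEXT_NAME = re.compile(r"[A-Za-z0-9]{6}")
REFERENCE = re.compile(r"([A-Za-z0-9]{6})-([0-9]{5})")


def make_reference(text_name: str, segment_number: int) -> str:
    """Join a text name and a segment number into a citation reference."""
    if TEXT_NAME.fullmatch(text_name) is None:
        raise ValueError("text name must be six alphanumeric characters")
    if not 0 <= segment_number <= 99999:
        raise ValueError("segment number must fit in five digits")
    # Zero-pad the segment number to five digits (an assumption).
    return f"{text_name}-{segment_number:05d}"


def split_reference(reference: str) -> Tuple[str, int]:
    """Split a reference back into its text name and segment number."""
    match = REFERENCE.fullmatch(reference)
    if match is None:
        raise ValueError(f"not a well-formed reference: {reference!r}")
    return match.group(1), int(match.group(2))
```

Because text names are unique across the corpus and segment names are unique within a text, a reference built in this way identifies exactly one segment.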
The reference declaration specifies an algorithm for constructing such references by concatenating a hyphen separator and a five-character segment name with a six-character text name using the stepwise method described by [Burnard, Sperberg-McQueen, 1993]. (Segment names consist of five digits, and are unique within a particular text; text names consist of six alphanumeric characters, and are unique across the corpus.) Sample references to segments in the example texts discussed in sections 5-7 are Wingss-01009, GaWldA-00035 and 026211-0005. The reference declaration for the released BNC may have an extra element in order that a particular revision of the corpus may be specified. [Burnard, Sperberg-McQueen, 1993] allows a single header to contain more than one reference declaration. This could be useful if some texts in the corpus had a reference mechanism applying to their source form -- by page and line number, for example -- and this information was preserved in the corpus. The BNC, however, does not heed its sources' reference schemes (in general, they have none), so a single reference declaration suffices. The corpus header encoding description ends with the classification declaration, which contains one or more taxonomy elements. These define the dimensions along which texts in the corpus are classified, and legal values along each dimension. As figure 9 suggests, the necessary declarations can be lengthy, but the result is that the specification of the profile of a particular text relative to a taxonomy can be very compact -- see sections 5.3 and 7.3. The taxonomy of the BNC is complex, with each text being classified along many dimensions, although, while [Burnard, Sperberg-McQueen, 1993] provides for a single text to assume multiple values in a given dimension, values assumed by a BNC text are always single. Some dimensions are important in balancing the content of the corpus relative to its design criteria.
For example, [BNC, 1991a] specifies that the numbers of spoken texts recorded in the north, middle and south of Britain should be approximately equal. Such criteria are known as balance criteria. Other dimensions, while not used in balancing the content of the corpus, classify texts in manners which may be useful to corpus users. An example here is author sex for written works: while the design documents for the corpus do not specify a ratio of female to male authors (or, indeed, of single-sex to mixed-sex collaborations), [BNC, 1991b] provides for such information to be recorded, although the examples in this paper do not show these classification criteria. Either balance or classification criteria may apply only to particular text types. For example, interaction type -- monologue or dialogue -- applies only to spoken texts. A complex taxonomy can be specified in a number of ways. It may be specified as a single large taxonomy in which not all the dimensions specified apply to each text; or it may be split into a number of taxonomies where, if one dimension in a particular taxonomy applies to a particular text, then all the dimensions in that taxonomy apply. The BNC follows the latter course. Figure 9 shows a selection of the taxonomies defined in the corpus header.

4.3. The profile description

The profile description in the header of an individual text serves mainly to classify that text against criteria specified in the corpus header encoding description and elsewhere. The corpus header profile description might be expected to contain only profile information common to all texts in a corpus, and hence to be quite short. The language usage element, shown in figure 10, is a case in point: it states that all texts in the BNC are examples of modern British English. It references a writing system declaration.
As a modern, monolingual, corpus using a single alphabet, the BNC requires only a simple definition of its language usage: other types of corpus would be likely to require greater complexity in this area. (See 2.2.) In addition to giving information applying to all texts in a corpus, the profile description can group information which describes aspects of particular texts or, more usefully, groups of texts. The types of information provided for by [Burnard, Sperberg-McQueen, 1993] are more useful for grouping spoken material than for written, and it is for this purpose that the remainder of the profile description is used in the BNC. Figure 11 shows information which is common to a number of spoken texts in the corpus, and which consequently is encoded in the corpus header. It details the participants in a series of spoken texts, and the relationships between those participants. Texts in which these participants appear reference the descriptions by means of the WHO attribute on the (utterance) tag, as shown in section 7.3.

4.4. The revision description

Figure 12 shows the corpus header revision description, which documents changes to the corpus header itself, and to the corpus as a whole. (Changes to individual texts in the corpus are described in the texts' own headers -- see section 5.4.) While the revision description in the initial released corpus may be expected to be larger than the stub shown in the figure, it is not likely to be greatly expanded. Future revisions of the corpus, should there be any, will have more detailed revision descriptions which describe the global and corpus header changes made between successive revisions.

5. Text Header for a Non-Composite Written Text

This section is the first of three dealing with sample headers for three texts from the BNC.
The subsections of this section discuss all parts of the text header; subsections for the second and third examples discuss only those aspects of the header which are of interest in connection with particular types of text. The first example text is Wings [Pratchett, 1991], a novel. Because the source text contains more than 40,000 words, a whole number of chapters starting at the beginning of the book and totalling a little less than 40,000 words has been captured. This example is a sample from the beginning of the source text; the BNC contains approximately equal numbers of beginning, middle and end samples. As with all written published texts in the BNC, [Pratchett, 1991] was chosen by applying selection criteria to a number of lists of publications. [BNC, 1991a] describes the methodology. Half the texts in the corpus are chosen at random using the method described in [BNC, 1992a]; the remainder are chosen from lists which suggest that a publication is influential in some manner (frequently borrowed from libraries; recommended reading on courses...). [Pratchett, 1991], which has sold well, is an example of the latter class. The header for the text is introduced by the mark-up shown in figure 13. The name of the text, given by the N attribute of the element, is for use in constructing references. (See 4.2.) For convenience, the value of the N attribute of duplicates information to be found in the edition statement. (See next section.)

5.1. The file description

The file description of an electronic text begins with a statement of the title of the text, and of those responsible for its intellectual content as an electronic text. Thus, as shown in figure 14, the title has the words "an electronic sample" appended so as to distinguish it from that of the source text. Following similar reasoning, the author of the source text is not named here; rather, those responsible for creating an electronic sample from the source text are named.
Note here that it is the policy of the BNC to name the organisation responsible. There are several thousand written texts with identical statements of responsibility in the BNC. Consequently, actual headers use an entity reference which expands to the text shown here. The edition statement and extent statement, also shown in figure 14, follow the title statement. All texts in the initial release version of the BNC will be at revision 1.0. Should there be further releases, some -- but possibly not all -- of the texts in a given release will have a revision level greater than 1.0, signifying that their content has changed relative to that of the corresponding text in the initial release of the corpus. Such changes should be described by the texts' revision descriptions. (See 5.4.) The extent statement gives the size of the text (without header or mark-up) in orthographic words, and in kilobytes. The latter figure is intended to be useful for users who examine free-standing headers (see 8) prior to copying complete texts to local storage. Information about the range of pages captured from the source text appears in the source statement. (See below.) The next element of the file description, shown in figure 15, is the publication statement. As with the title statement, this information pertains to the electronic, not the source, text, and so describes how the electronic text may be obtained, and the usage restrictions attaching to it. Although much of the material in the publication statement is identical for every text in the BNC, there is provision to insert usage restrictions specific to a particular text if necessary. (See also 3.4.) The publication statement also contains two elements giving identifying numbers for the electronic text. The first of these consists of three characters, and is used in generating unique identifiers (values for ID attributes) for elements of the text and its header, as described in 3.2. 
The second identifier is the six-character name used in creating references to material in the corpus. (See section 4.2.) For written texts, these names are generally derived from the title or author name for the text. The file description provides for a notes statement. A possible use is shown in figure 16. Here, a characteristic of the text which was noted during transcription, and which may be confusing to future users, is described. The final element of the file description is the source statement, a bibliographic record for the source text. As figure 17 shows, the information is recorded in a element, one of a number of element types offered by [Burnard, Sperberg-McQueen, 1993]. Note that the element gives the starting and ending pages of the electronic sample; the start and end pages for the complete source text (if appropriate, and if known) are given as a note in order that users may know how great a proportion of the source text is contained in the electronic sample.

5.2. The encoding description

As shown in figure 18, the encoding description for a typical corpus text is very short, all necessary declarations having been made in the corpus header. (See 4.2.) The relevant definitions are referenced via the DECLS attribute of the element. (See 5.5.) A project description is given, taking the form of a human-readable pointer to information in the corpus header. The semantics defined by [Burnard, Sperberg-McQueen, 1993] specify that information given in a text header overrides corresponding information given in the corpus header. Thus, although helpful to users "eyeballing" a text, giving a project description in the text header prevents SGML- and TEI-aware processing software from finding the fuller information in the corpus header. The problem is overcome by listing the ID of the corpus header project description among the declarations on the element, as shown in 5.5. This makes TEI-aware processors ignore the information in the text header.
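The precedence rules just described can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions, not a description of any real TEI processor: the Declaration class, the resolve function and every identifier value are hypothetical. The model assumes that an explicit DECLS reference takes priority, that a declaration in the text's own header comes next, and that the corpus header default applies otherwise.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Declaration:
    ident: str             # ID attribute value (hypothetical, max. 8 chars)
    kind: str              # e.g. "project", "sampling"
    default: bool = False  # marked as the default in the corpus header


def resolve(kind: str,
            decls: List[str],
            corpus_header: List[Declaration],
            text_header: List[Declaration]) -> Optional[Declaration]:
    """Pick the declaration of the given kind that applies to one text."""
    by_id = {d.ident: d for d in corpus_header}
    # 1. A corpus header declaration listed explicitly in DECLS wins.
    explicit = [by_id[i] for i in decls
                if i in by_id and by_id[i].kind == kind]
    if len(explicit) > 1:
        raise ValueError(f"more than one {kind} declaration referenced")
    if explicit:
        return explicit[0]
    # 2. Otherwise a declaration in the text's own header takes precedence.
    for declaration in text_header:
        if declaration.kind == kind:
            return declaration
    # 3. Otherwise fall back to the corpus header default, if there is one.
    for declaration in corpus_header:
        if declaration.kind == kind and declaration.default:
            return declaration
    return None


# Hypothetical corpus header and text header contents.
corpus_header = [
    Declaration("CPRJ01", "project", default=True),
    Declaration("CSAM01", "sampling", default=True),
    Declaration("CSAM03", "sampling"),
]
text_header = [Declaration("TPRJ01", "project")]
```

With these sample values, a text whose DECLS attribute lists CPRJ01 resolves its project description to the corpus header, while the same text with an empty DECLS list would resolve to the short pointer in its own header.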
The figure also shows an editorial declaration. Even this should not be required, being necessary only if some editorial procedure specific to a single corpus text has been applied. (There are currently no corpus texts for which this is the case.) Explicit or implicit references from the element to editorial declarations in the corpus header (see 4.2 and 5.5) should be sufficient. The encoding description ends with a series of tag usage elements. For each type of element which is a descendant of the element of a BNC text, a element is given. It lists the number of occurrences of the element, and the number of these where the element has an ID attribute -- presumably indicating that it is referenced by some other element. [Burnard, Sperberg-McQueen, 1993] allows elements to have content describing the circumstances and manner in which a particular element is used. This information is not provided here in the BNC, but instead is provided globally as tag set documentation. (See [Burnard, Sperberg-McQueen, 1993]).

5.3. The profile description

The profile description for a written corpus text consists of a sequence of elements, as shown in figure 19. These elements point back to elements of taxonomies in the corpus header, and so categorize the text relative to the dimensions defined there. Decoding the description of the example text, it is a written text (first ); it is from a book or periodical, it is imaginative, it is addressed to a wide audience, it is a beginning sample, and it was published between 1975 and 1993 (second ); and it was chosen for inclusion in the corpus because of its circulation or influence (third ). Some of the corresponding category definitions may be seen in figure 9. Currently, no provision has been made in the BNC header for classification information such as author age, author sex, author domicile, audience age, and so on.
Some of this information, such as author sex, can be handled with further elements, while the remainder will be encoded by using elements to describe the author and the audience as participants in the interaction mediated by the text. (See 4.3 for an example of the use of in connection with spoken texts.)

5.4. The revision description

Figure 20 shows the final element in the text header, the revision description. During development of the corpus, this lists processing steps. This information will not appear in the published corpus. Should it be necessary to revise a text within the corpus after the corpus has been published, the revision description for the text in the next published version of the corpus will describe the changes made.

5.5. The text

Figure 21 shows the start and end of the actual text. The element has a DECLS attribute which references those declarations -- sampling, editorial and so on -- which apply to the text. In the case of this example text, just one declaration is required -- that for the project description. (The reason for this is explained in 5.2.) In all the other cases where declarations might be required, the default declarations describe this particular text, and so need not be enumerated.

6. Text Header for a Composite Written Text

The second example is a short extract from The Guardian, a British national daily newspaper. The start of its header is shown in figure 22. It is the practice of the BNC to partition newspaper material into texts which group a number of stories dealing with the same type of subject matter -- here, world affairs. The example is atypical, containing just two articles -- most newspaper-derived texts contain many more.

6.1. The file description

The title statement for the text, shown in figure 23, is very similar to that for the first example text, discussed in 5.1. The publication statement of figure 24 is more interesting.
As stated in 5.2, specific permissions information may appear in the publication statements of individual non-composite texts. The same is true for composite texts, but matters are complicated by the fact that each element of the composite may potentially have its own specific restrictions on usage. The example shows how a paragraph may, by means of its DECLS attribute, be associated with a particular element (or elements) of the composite. The identifiers (IDs) point to the bibliographic records (see below) for the elements in question. In the absence of a DECLS attribute, a paragraph applies to all elements of the composite text. The file description ends with a structured bibliographic record for each element of (analytic text in) the composite. These are shown in figure 25. Each record has an identifier (ID) attribute, so that it can be referenced by permission information (see above), and from the relevant part of the body of the text (see 6.2). Where an analytic text has an attribution, as in the second case, this is given. The second bibliographic record shows how an SGML entity reference may be used to replace material common to many records. This greatly reduces header size in cases where a single monographic text contains many analytic texts, each with its own bibliographic record. 6.2. The text The profile and revision descriptions for the composite written text are much as for the non-composite text discussed in 5.3 and 5.4 respectively, and so are not discussed here. Fragments of the composite text are shown in figure 26. The two analytic texts may be seen, each encoded as a element. (See also 10.) These elements have DECLS attributes which point back to the relevant bibliographic record in the text header. (See 6.1.) The enclosing element also has a DECLS attribute. As well as selecting the correct project description (see 5.5), these override the default declarations for sampling and for the editorial treatment of hyphenation.
(See figure 5 for the former; the latter is not among the editorial declarations shown in figure 7.) 7. Text Header for a Spoken Text The third and final example is a spoken text transcribed from a cassette tape recorded by a volunteer picked using a demographic sampling method. The BNC also includes context-governed spoken material, which is not discussed in this paper. [Burnage, Dunlop, 1993] gives an overview of sampling methods for spoken material, which are fully described in [BNC, 1991a]. The start of the corpus text is shown in figure 27. The text name, given by the N attribute to is numeric for all spoken texts in the BNC. This is just a convention, serving to distinguish spoken texts from the alphanumerically-named written texts. 7.1. The file description Figure 28 shows the initial elements in the file description for the spoken text. This information is similar to that in the written examples presented in 5.1 and 6.1, and so is not discussed here. The source description of figure 29, however, is quite different from that for a written text. It consists of a recording statement rather than a bibliographic record. As is the case for many header elements, [Burnard, Sperberg-McQueen, 1993] provides for the recording statement either to be a sequence of paragraphs, or to be structured. The BNC has elected to use the former approach in this case, with the first paragraph giving the date and time of recording, and the second the recording method. A more structured approach involving the element (not shown in the example) may be adopted for broadcast material included in the BNC. 7.2. The profile description The profile description for spoken text is, as figure 30 shows, rather more complex than that for a written text. In addition to the category references which describe the text, text and setting descriptions are required further to describe the interaction represented by the text. 
The text description is provided by [Burnard, Sperberg-McQueen, 1993] for use with any text, but is used by the BNC only in connection with spoken texts. Even here, some elements -- , , and -- contain default values, as the data collection methodology does not capture this information. The developers of other spoken corpora may wish to provide useful values for these elements. The setting description describes one or more settings in which an interaction takes place. In the case of most spoken material in the BNC, a single element always suffices: precise information about the location of individual participants (for example "under the sink", "on the doorstep") is not available, and no recordings have been made of interactions in which participants are widely separated (telephone conversations). This situation may change if, for example, recordings of broadcast interviews in which interviewer and interviewee are in different studios are obtained. The category references which end the profile description characterize the text as spoken (first ); demographic, Midland region (second ); and as produced by a male aged 56 or over in social class DE (third ). (See also 4.3 and figure 9.) Information about participants in demographic spoken material is held in the corpus, rather than the text, header because it is common to a number of texts: the BNC demographic data collection methodology results in the same speakers appearing in many texts. This is not the case with context-governed material, in which participants (or participant groups) typically appear only in a single text. For this material (not shown in this paper), participant information appears in the text header profile description. 7.3. The text Figure 31 shows the framework of the spoken text. The DECLS attribute of its element overrides default declarations for project description (see 5.5), sampling, and for a number of editorial practices: normalization, hyphenation, quotation and analysis. (See 4.3.) 
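The override behaviour just described can be illustrated with a minimal sketch. It assumes that every declaration in the corpus header carries an identifier, belongs to one kind (sampling, hyphenation and so on), and is flagged if it is the default for its kind, and that DECLS is a whitespace-separated list of identifiers. The identifiers, the kind labels and the function name effective_declarations are hypothetical; they are not taken from the BNC headers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Declaration:
    ident: str        # hypothetical ID value carried by the declaration
    kind: str         # assumed label, e.g. "sampling" or "hyphenation"
    is_default: bool  # True if it applies when DECLS does not name its kind


def effective_declarations(corpus_decls, decls_attr):
    """Return {kind: Declaration} for one text.

    Start from the default declaration of each kind, then let every
    declaration named in the DECLS attribute replace the default of its kind.
    """
    by_id = {d.ident: d for d in corpus_decls}
    result = {d.kind: d for d in corpus_decls if d.is_default}
    for ident in decls_attr.split():
        chosen = by_id[ident]  # a KeyError here signals a dangling reference
        result[chosen.kind] = chosen
    return result


# Hypothetical corpus-header declarations; the defaults favour written texts.
CORPUS_DECLS = [
    Declaration("SAMWRI", "sampling", True),
    Declaration("SAMSPO", "sampling", False),
    Declaration("HYPWRI", "hyphenation", True),
    Declaration("HYPSPO", "hyphenation", False),
    Declaration("NORWRI", "normalization", True),
]

written = effective_declarations(CORPUS_DECLS, "")
spoken = effective_declarations(CORPUS_DECLS, "SAMSPO HYPSPO")
print(sorted(d.ident for d in written.values()))
print(sorted(d.ident for d in spoken.values()))
```

In this sketch a written text with an empty DECLS value simply inherits every default, while the spoken text has to name each declaration that differs from the default.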
The defaults in the BNC favour written texts, resulting in a greater need to override them for spoken texts. The body of the spoken text consists of a sequence of utterances. Each element has a WHO attribute to identify the speaker. These attributes correspond to ID attributes on elements in the corpus header. (See 4.3 and 7.2.) 8. Free-standing headers and the BNC [Burnard, Sperberg-McQueen, 1993] discusses the use of headers as free-standing bibliographic records. It is intended that, in addition to their use in conjunction with their associated texts, BNC headers should be usable in this manner. In particular, a corpus user should, by reference to the BNC corpus header and a collection of BNC text headers alone, be able to select texts which conform to some set of user-designated criteria. (Imaginative written texts targeted at a wide audience and containing quotations, for example.) The advantages of this approach are two-fold: firstly, the user saves a great deal of local storage through not needing the full corpus in order to make selections of this type; secondly, the user need not undertake the copyright responsibility defined by the BNC end user agreement [BNC, 1992b] for texts which are not of interest -- only selected texts need be fetched from a central archive after they have been identified by reference to their headers. 9. The Automatic Generation of Headers All of the information presented in BNC text headers, and much of the information in the corpus header, is derived from data stored in a relational database under the control of the Ingres database manager [Ingres, 1989]. The headers are, in effect, database reports delivered in a format which conforms to SGML syntax and to [Burnard, Sperberg-McQueen, 1993]. The structure of the database [BNC, 1992c] is intended to eliminate duplication of data, even where duplicated information appears in headers.
(For example, in the bibliographic records of composite source texts -- see 6.1.) Use of a database makes it possible to identify information which is duplicated, and so identify candidates for movement out of text headers into the corpus header. A case in point might be an author whose work appears in one or more books sampled for the corpus, and in newspapers or journals. Participant information for this author (see 5.3) could be centralized in the corpus header (4.3) rather than being duplicated in text headers. (However, it should be noted that [Burnard, Sperberg-McQueen, 1993], while stating that it is intended that participant descriptions should be useful for written, as well as for spoken, texts, proposes no mechanism for linking elements of written texts to participant descriptions.) The database can also help to identify elements which, because they have identical content in many headers, are candidates for replacement with space-saving entity references. (See 3.3 and 6.1.) 10. Grouping in composite texts [Burnard, Sperberg-McQueen, 1993] describes a element, which allows the encoding of composite texts of arbitrary complexity. Unfortunately, this element was specified by the TEI after encoding of the composite texts in the BNC had commenced. Consequently, composite texts are represented in the BNC as elements having the attribute ORG=COMPOS (composite, rather than the default of sequential, organization), and containing a sequence of or elements. Those engaged in future corpus development projects should take advantage of s where possible. 11. Summary and Conclusions Experience in developing corpus and text headers for a large and relatively diverse corpus has shown that the model recommended by the TEI is adequate to the needs of such undertakings.
Areas in which the TEI model was feared to be unnecessarily verbose -- for example, the need to give full bibliographic records for each element of a composite text -- have proved in the event not to present problems. One important area remains to be addressed, however: there is no mechanism allowing information in the corpus header to link together all texts sharing some feature in common. Such a mechanism would be useful to corpus users wishing to identify all texts by a given author, all editions of a particular newspaper, all conversations involving a particular participant, and so on. While some queries of this type can be satisfied by textual searches on fields in text headers, others currently require time-consuming examination of the text of large parts of the corpus. Work continues in co-operation with the Text Encoding Initiative to address this issue.

References

Copies of British National Corpus project documents may be obtained by sending electronic mail to the author at natcorp@vax.ox.ac.uk.

BNC, 1991a TGAW15: Spoken corpus design specification. British National Corpus project document, 1991.
BNC, 1991b BNCW08: Written corpus design specification. British National Corpus project document, 1991.
BNC, 1992a TGAP21: Selecting titles for the British National Corpus. British National Corpus project document, 1992.
BNC, 1992b TGBP05: BNC Permissions Request. British National Corpus project document, 1992.
BNC, 1992c TGDW36: The new BNC database. British National Corpus project document, 1992.
Burnage, Dunlop, 1993 Burnage, Gavin and Dunlop, Dominic. "Encoding the British National Corpus". English language corpora: design, analysis and exploitation. Aarts, Jan and de Haan, Pieter and Oostdijk, Nelleke (eds.). Amsterdam and Atlanta: Editions Rodopi, 1993, 79-95.
Burnard, Sperberg-McQueen, 1993 Burnard, Lou and Sperberg-McQueen, Michael (eds.). TEI P2: Guidelines for Electronic Text Encoding and Interchange, Draft version 2.
Oxford, Chicago: The Text Encoding Initiative, 1993.
Giordano, 1993 Giordano, Richard. "????". This edition of Computers and the Humanities (1993), xx-yy [Editor, please fill in title and page numbers]
Goldfarb, 1990 Goldfarb, Charles F. The SGML Handbook. Oxford: Oxford University Press, 1990.
Ingres, 1989 Introducing Ingres for the UNIX and VMS operating systems. Alameda, CA: Relational Technology Inc., 1989.
ISO, 1986 ISO 8879:1986 Information processing -- Standard Generalized Markup Language. Geneva: International Organization for Standardization, 1986.
ISO, 1991 ISO 646:1991 Information processing -- ISO 7-bit coded character set for information interchange. Geneva: International Organization for Standardization, 1991.
Pratchett, 1991 Pratchett, Terry. Wings. London: Corgi, 1991.

Table I

Positions  Function                       Notes
1-3        Uniquely identify text in      Derived from 6-character text name by a
           which ID is defined            mapping or hash. First character alphabetic;
                                          others alphanumeric. Mapping given explicitly
                                          by text header s -- see figure 15. Omitted
                                          in corpus header.
4-5        Uniquely identify the element  Somewhat mnemonic
           type (GI) to which the ID
           belongs
6-8        Make the identifier unique     Effectively a separate base-33 counter per
           across the entire BNC          GI and per text.

Figures

Figure 1 Content model for TEI header

Mandatory description of electronic text and any source texts
Optional description of the means by which the source texts have been rendered into electronic form
Optional description of the characteristics of an electronic text or of the texts in a corpus
Optional text or corpus revision history

Figure 2 Corpus header source statement

The corpus, considered as a text in its own right, has no source: it was originated in electronic form.

See the source descriptions to component texts in order to trace the sources of those texts.

Figure 3 Corpus header title statement

The British National Corpus Consortium member The British Library Board ... details of five further partners omitted Consortium member The University of Oxford 0.1 (initial alpha test version) One hundred million words -- ninety million from written sources, ten million from spoken sources. Occupies approximately two gigabytes (eight-bit bytes) of computer storage.

Figure 4 Corpus header publication statement

Archive site Oxford University Computing Services

13 Banbury Road, Oxford OX2 6NN U.K. Telephone: +44 491 273280 Facsimile: +44 491 273275 Internet mail: natcorp@ox.ac.uk

bnc0.1

Available at nominal charge for academic research purposes throughout the EC subject to a signed permissions agreement having been received by Oxford University Computing Services, from which blank forms and supporting materials are available.

Availability for commercial research and exploitation only where terms have been agreed with the BNC Consortium Exploitation Committee. Apply in the first instance to Oxford University Computing Services. 1993-04-17

Figure 5 Corpus header project description

The British National Corpus project is a pre-competitive collaboration between commercial and academic partners in the U.K. running from 1991 to 1994. further descriptive material omitted...

Funding for the British National Corpus has been provided by ...

Figure 6 Corpus header sampling declarations

Where a source text is a book, no more than 40,000 words are sampled. This is true even where the book contains a collection of works from a selection of authors ...

No more than 120,000 words from any one author, whether individual or corporate, and whether writing individually or collaboratively, appear in the corpus ...

Where a source text is a book shorter than 40,000 words, the whole text is captured, and ten per cent is then excised ...

No more than 120,000 words from any one author, whether individual or corporate, and whether writing individually or collaboratively, appear in the corpus ...

Where the source text is a magazine or newspaper, the whole of the editorial text is captured ...

The length of demographically-sampled spoken samples is limited by recording technology to ninety minutes. No word-count limit is applied.

Samples of context-governed spoken material are truncated to no longer than 40,000 words ...

Figure 7 Corpus header editorial declarations

When noticed during encoding, errors or suspected errors in the original text are tagged with sic.

No normalization applied.

Transcription uses standard English spelling, except for a control list of dialectal forms and vocalized pauses and does not reflect pronunciation. ...

Part-of-speech information corresponding to the CLAWS C5 tag set is appended to each word ...

Overlapping speech is marked when two or three speakers are speaking simultaneously. The fourth and subsequent simultaneous utterances are not marked.

Part-of-speech information corresponding to the CLAWS C5 tag set is appended to each word ... ...

Figure 8 Corpus header reference declaration

Figure 9 Corpus header classification declaration

Text type: Written published; Written unpublished; Spoken
Medium (written published only): Books & periodicals; Miscellaneous; Written to be spoken
Domain (written published only): Imaginative; Applied science; ...
Medium (spoken only): Demographic; Context-governed; ...

Figure 10 Corpus header language usage declaration

Modern British English

The language of the British National Corpus is modern British English. Words, fragments, and passages from many other languages, both ancient and modern, occur within the corpus where these may be represented using a Latin alphabet. Long passages in these languages, and material in other languages, are generally silently deleted. In no case is the LANG attribute used to indicate the language of a word, phrase or passage, nor are alternate writing system definitions used.

Figure 11 Corpus header participant descriptions

Fred British English; East Midlands dialect Northants, England To age 14 Retired
Florence British English; Midlands dialect Retired ...
Steven British English; East Midlands dialect Office manager ...
...

Figure 12 Corpus header revision description

1993-05-17 DFD Internal alpha test version

Figure 13 Header start for non-composite written text

Figure 14 Non-composite written file description start

Wings &mdash an electronic sample Data capture Oxford University Press Encoding, storage and distribution Oxford University Computing Services Text enrichment Unit for Computer Research into the English Language, University of Lancaster 1.0 37875 words 460 kbytes

Figure 15 Non-composite written text publication statement

Archive site Oxford University Computing Services

...

A73 Wingss

Additional restrictions relating to a particular work (if any) are summarized here.

Available only as part of the British National Corpus at nominal charge for academic research purposes throughout the EC ... 1993-03-17

Figure 16 Non-composite written text notes statement

Attributions on the epigraphs at the start of each chapter refer to characters in the text.

Figure 17 Non-composite written text source statement

Terry Pratchett Wings First paperback edition, published 1991, reprinted 1992 Corgi 1991 13-115 0 552 52649 5 Source page range: 13-172

Figure 18 Non-composite written text encoding description

See project description in corpus header for information about the British National Corpus project.

Any editorial practice specific to a single text is described here. All other practices are referenced through decls on the text tag or by default. ... ...

Figure 19 Non-composite written text profile description

Figure 20 Non-composite written text revision description

1993-03-17 OUP Passed to OUCS
1993-04-07 OUCS Passed to Lancaster
1993-05-30 UCREL Passed to OUCS
1993-06-15 OUCS Accession to corpus

Figure 21 Non-composite written text

...

Figure 22 Header start for composite written text

Figure 23 Composite written file description start

The Guardian, edition of 1989-11-08 &mdash an electronic collection of material related to world affairs ... as for first example text 1.0 850 words 12 kbytes

Figure 24 Composite written text publication statement

... as for first example text B9H GaWldA

Additional restrictions relating to a particular analytic text or texts (if any) are summarized here. The decls attribute cross-references the analytic texts affected. Further paragraphs may summarize different restrictions applying to different analytic texts.

... common information as for first example text 1993-03-30

Figure 25 Composite written text source statements

Quote&hellip Following encoded as &GuardnA in second bibl.struct [The Guardian, electronic edition of 1989-11-08&rsqb Guardian Newspapers Ltd. 23 The Guardian 0261 3077 Diary Andrew Moncur &GuardnA

Figure 26 Composite written text

Quote&hellip

... Diary Andrew Moncur

...

Figure 27 Header start for spoken text

Figure 28 Start of file description for spoken text

Spoken material from respondent Fred, sample 026211 Data capture and transcription Longman Dictionaries ... as for first example text 1.0 992 words 20 kbytes ... as for first example QA0 026211 ... as for first example 1993-05-04

Figure 29 Source description for spoken text

1992-03-15 17:05

Recorded by respondent on Walkman compact cassette recorder; dubbed for archival to Digital Audio tape at 44.1 kHz sampling rate; redubbed for transcription to compact cassette.

Figure 30 Profile description for spoken text

Rushden, Northants, U.K. 17:05 1992-03-15 Home Making tea/playing games

Figure 31 Spoken text

... ... ...
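
The header-only selection discussed in section 8 can be sketched as follows. The sketch assumes that the category references in each text header have already been extracted into a set of category identifiers per text. The text names are those appearing in figures 15, 24 and 28, but the category identifiers (WRI, IMAG, WIDE and so on) and the function name select_texts are hypothetical and are not the values used in the BNC.

```python
def select_texts(header_categories, required):
    """Return the names of texts whose headers reference every required category.

    header_categories maps a text name to the set of category identifiers
    referenced from that text's profile description.
    """
    wanted = set(required)
    return sorted(name for name, cats in header_categories.items()
                  if wanted <= cats)


# Hypothetical category identifiers, as if extracted from three text headers.
HEADERS = {
    "Wingss": {"WRI", "BOOK", "IMAG", "WIDE"},
    "GaWldA": {"WRI", "PERIOD", "WORLD", "WIDE"},
    "026211": {"SPO", "DEMOG"},
}

# Imaginative written texts addressed to a wide audience.
print(select_texts(HEADERS, ["WRI", "IMAG", "WIDE"]))
```

Only the texts named in the result would then need to be fetched from the central archive; the bodies of the other texts are never consulted.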