BNC User Reference Guide

12 Formal Specification of the BNC XML schema


The structure of the XML edition of the British National Corpus is described by a single XML schema, which is expressed in three different schema languages: the traditional DTD language, which XML inherits from SGML; the more recently defined ISO schema language known as RELAX NG; and the schema language defined by the W3C (W3C XML Schema). The three schema files are all generated from the same TEI-conformant XML source file, which is also used to generate the present documentation.

This section of the document contains the TEI-conformant reference specification for all components of the BNC schema. These include definitions for attribute classes, model classes, and macro patterns, as well as definitions for elements and their associated attributes and possible value lists. A full description of these concepts, and of how they are used to define and document XML encoding schemes, is given by the TEI Guidelines (in particular, in chapter TD); the following summary provides only basic information about them.

When several elements in a schema share attributes of the same name, with values drawn from a common set, they are considered to form an attribute class. The members of such a class can then all reference the same class definition rather than each repeating the same information. In the BNC, for example, the elements <bibl>, <corr>, <div>, <head>, <hi>, and half a dozen others all have the same attribute rend, which takes a coded value drawn from the same short list of possibilities. Rather than repeat this definition for each element, the relevant elements are all said to be members of a class att.rendered, which is defined independently of those elements (but includes a list of its members). In the same way, the <w> and <mw> elements, as members of the att.c5coded class, share the same definition for the possible CLAWS5 codes specified by their c5 attribute. Note however that the element <c>, although it has an attribute c5, is not a member of this class, because the possible values for this attribute on this element are entirely different.
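The relationship between attribute classes and their members can be pictured with a small lookup table. The following Python sketch is purely illustrative and is not part of the BNC schema source: the names ATT_CLASSES, OWN_ATTRIBUTES, and attributes_for are hypothetical, only the class members named in the paragraph above are listed, and no attribute values are reproduced.

```python
# Illustrative model of attribute classes; not the actual BNC schema source.
# Only the members named in the text are listed; the real classes have more.
ATT_CLASSES = {
    "att.rendered": {
        "attributes": ["rend"],
        "members": {"bibl", "corr", "div", "head", "hi"},
    },
    "att.c5coded": {
        "attributes": ["c5"],
        "members": {"w", "mw"},
    },
}

# Attributes declared directly on an element rather than inherited from a class.
OWN_ATTRIBUTES = {
    "c": ["c5"],  # same attribute name as in att.c5coded, but a different value list
}


def attributes_for(element):
    """Map each attribute of `element` to the place where it is defined."""
    resolved = {}
    for class_name, spec in ATT_CLASSES.items():
        if element in spec["members"]:
            for att in spec["attributes"]:
                resolved[att] = class_name
    for att in OWN_ATTRIBUTES.get(element, []):
        resolved[att] = "local"
    return resolved


print(attributes_for("w"))   # c5 comes from the shared class definition
print(attributes_for("c"))   # c5 is defined locally on <c>
```

In this sketch, <w> and <mw> resolve c5 to the single shared definition, whereas <c> resolves it to a local one, mirroring the distinction drawn above.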

In any reasonably large schema, and particularly one derived from the TEI model, several elements are likely to have very similar content models, since it will often be the case that at a given point in the document hierarchy any one of several possible elements will be permissible. The specific subset of elements (<w>, <mw>, <c> and a few others) which may appear within an <s> element in the BNC is different from the subsets of elements which may appear within a <p> or <div> element. However, there are several elements which can appear in the same places as a <p>. Following TEI practice, we call the set of elements which can appear together (in sequence or alternation) at a specific place in the document hierarchy a model class. For example, since <l>, <lg>, <list>, <p>, <quote>, and <sp> are all permitted as immediate components of a <div> element, we define a class model.divPart, of which these six elements are all members. Wherever convenient, content models are defined in terms of these model classes.
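As a rough illustration of how a model class groups elements that may occur at the same place, the Python sketch below picks out the immediate children of a <div> that belong to model.divPart. The six member names come from the paragraph above; the function name members_of_div_part and the sample fragment are invented for this example, and the sketch makes no claim about what else a <div> may contain.

```python
import xml.etree.ElementTree as ET

# The six members of model.divPart named in the text.
MODEL_DIV_PART = {"l", "lg", "list", "p", "quote", "sp"}


def members_of_div_part(div):
    """Return the tags of the immediate children of `div` that belong to model.divPart."""
    return [child.tag for child in div if child.tag in MODEL_DIV_PART]


# Invented fragment, for illustration only.
sample = ET.fromstring(
    "<div><head>Title</head><p>Some text</p><lg><l>A line of verse</l></lg></div>"
)
print(members_of_div_part(sample))
```

Here <p> and <lg> are reported as members of the class, while <head> is not, and <l> is not reported because it is not an immediate child of the <div>.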

As noted above, this usage of model classes is a distinctive and pervasive feature of the TEI encoding scheme. Because the BNC derives from the TEI scheme, it uses the same names and (as far as is practicable) the same model classes throughout. Although this introduces an occasionally redundant degree of indirection in the resulting schema, it also makes clearer the relationship between the components defined for the BNC and their origins in the TEI scheme.

Finally, we define here a few macros for commonly encountered content models. These are also taken from the TEI encoding scheme, though in a few cases with different meanings. In the TEI, for example, the macro macro.phraseSeq is defined as a mixture of various ‘phrase level’ elements and plain text; in the BNC scheme, it has been redefined as plain text only. The places where this macro is referenced, however, are unchanged; in this respect, therefore, the BNC schema is a proper subset of the full TEI schema.

The remainder of this section lists in alphabetical order all of the attribute classes, model classes, elements, and macros defined for the BNC encoding scheme, using a method of display similar to that of the full TEI Guidelines. For each component, we give a brief description and also a usage example. Note that many of the elements listed here appear only in the corpus header rather than in the texts, and may thus be safely disregarded by applications which operate on the texts alone or in isolation.
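An application that works on the texts alone might skip the header along the following lines. This is a minimal Python sketch under stated assumptions: it assumes a <bncDoc> whose immediate children are a <teiHeader> and either a <wtext> or an <stext>, the function name words_in_text is hypothetical, and the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample document, for illustration only.
SAMPLE = """<bncDoc>
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <wtext>
    <div><p><s><w c5="AT0">The </w><w c5="NN1">cat</w><c c5="PUN">.</c></s></p></div>
  </wtext>
</bncDoc>"""


def words_in_text(doc):
    """Yield (c5 code, word form) for every <w> found outside the header."""
    for child in doc:
        # Only descend into the text proper; <teiHeader> is never visited.
        if child.tag in ("wtext", "stext"):
            for w in child.iter("w"):
                yield w.get("c5"), (w.text or "").strip()


doc = ET.fromstring(SAMPLE)
print(list(words_in_text(doc)))
```

Because the loop never enters <teiHeader>, none of the header-only elements need to be handled by such an application.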

Classes defined

[att.ascribed] [att.c5coded] [att.editLike] [att.identifiable] [att.rendered] [att.timed] [att.uniqueId] [model.addressLike] [model.assertLike] [model.biblLike] [model.dateLike] [model.divPart] [model.divPart.spoken] [model.divWrapper] [model.encodingPart] [] [] [model.glossLike] [model.headerPart] [model.hiLike] [model.imprintPart] [model.inter] [model.lLike] [model.listLike] [model.milestoneLike] [model.nameLike] [model.nameLike.agent] [model.pLike] [] [model.pPart.edit] [model.persStateLike] [model.personPart] [model.phrase] [model.profileDescPart] [model.ptrLike] [model.publicationStmtPart] [model.qLike] [model.recordingPart] [model.respLike] [model.segLike] [model.settingPart] [model.sourceDescPart] [model.stageLike]

Elements defined

[<activity>] [<address>] [<age>] [<align>] [<attDef>] [<attList>] [<attributePolicy>] [<author>] [<availability>] [<bibl>] [<bncDoc>] [<c>] [<catDesc>] [<catRef>] [<category>] [<change>] [<classCode>] [<classDecl>] [<collate>] [<corr>] [<creation>] [<date>] [<defaultVal>] [<desc>] [<dialect>] [<distributor>] [<div>] [<edition>] [<editionStmt>] [<editor>] [<editorialDecl>] [<elementPolicy>] [<email>] [<encodingDesc>] [<event>] [<extent>] [<fileDesc>] [<gap>] [<gi>] [<head>] [<hi>] [<ident>] [<idno>] [<imprint>] [<item>] [<joinTo>] [<keywords>] [<l>] [<label>] [<labelGen>] [<langUsage>] [<language>] [<lg>] [<list>] [<locale>] [<mw>] [<name>] [<nameList>] [<namespace>] [<note>] [<occupation>] [<p>] [<para>] [<particDesc>] [<pause>] [<pb>] [<persName>] [<persNote>] [<person>] [<placeName>] [<pp>] [<profileDesc>] [<projectDesc>] [<pubPlace>] [<publicationStmt>] [<publisher>] [<quote>] [<recording>] [<recordingStmt>] [<refsDecl>] [<resp>] [<respStmt>] [<revisionDesc>] [<s>] [<samplingDecl>] [<setting>] [<settingDesc>] [<shift>] [<sourceDesc>] [<sp>] [<speaker>] [<stage>] [<stext>] [<tagUsage>] [<tagsDecl>] [<taxonomy>] [<bnc>] [<teiHeader>] [<term>] [<textClass>] [<title>] [<titleStmt>] [<tokenize>] [<trunc>] [<u>] [<unclear>] [<valItem>] [<valList>] [<valSource>] [<vocal>] [<w>] [<wtext>] [<xairaItem>] [<xairaList>] [<xairaSpecification>]

Macros defined

[data.count] [data.enumerated] [data.language] [] [data.namespace] [data.pointer] [data.pointers] [data.temporal] [data.word] [macro.fileDescPart] [macro.paraContent] [macro.phraseSeq] [mix.spoken]


edited by Lou Burnard. Date: January 2007
This page is copyrighted