BNC User Reference Guide

11 The Xaira Specification


The <xairaSpecification> supplied in the corpus header determines the behaviour of the XAIRA indexer, and hence of the XAIRA-indexed system delivered with the BNC. In this section, we document that specification as it applies to the BNC only. The information provided here is for reference purposes only, and is of no interest unless you are using the XAIRA system to index the BNC or a similar corpus. Note however that this document is not an exhaustive description of the capabilities of the XAIRA system: for more information on that, please consult the project web site at http://www.xaira.org/

The <xairaSpecification> element is a member of the model.encodingPart class, and may therefore be included within the <encodingDesc> element of the TEI Header for any corpus. It is organized as a number of <xairaList> elements, each of which contains a number of <xairaItem> elements. Both of these latter elements have a type attribute which specifies more exactly the function of the item or list, by supplying one of a number of predefined codes, as further described in this section.

<xairaSpecification> specifies additional information needed by XAIRA.
<xairaList> contains a list of XAIRA parameters of a particular type.
type indicates the function of this part of the specification.

<xairaItem> provides data needed to define one part of a XAIRA specification.

type	indicates what is defined by this part of the specification.
ns	supplies the namespace within which the generic identifier is to be found.
ident	supplies an element's generic identifier, or one of the codes * (meaning all elements), or name() meaning that the name of the referenced element is to be used rather than its value.

The following values are defined for the type attribute on <xairaList>:

elementSpec: lists and glosses the elements, attributes, and codebooks used in a corpus (11.1 Element specification)
keySpec: specifies how items are to be indexed (11.2 Key specification)
regionSpec: specifies any predefined regions to be made available to the client (11.4 Region Specification)
lemmaSpec: specifies any lemmatization schemes used (11.3 Lemma Scheme Specification)
refSpec: specifies how items are to be referenced (11.5 Reference specification)
indexSpec: specifies any special indexing policies (11.6 Indexing Policies)
langSpec: specifies any language-specific rules (11.7 Language specification)

All of these are used in the BNC.

The following values are defined for the type attribute on the <xairaItem> element:

element: an element
form: a lexical form for the indexer
addKey: an additional key for the indexer
lemmaScheme: a lemmatization scheme
region: a region
textRef: a reference identifying a document
unitRef: a reference identifying a low level unit within a document
scopeRef: a reference identifying a low level unit used to delimit results obtained when querying a corpus
indexPol: defines an index policy
defaultLang: specifies the default language for a corpus
langRules: specifies non-default tokenization or collation rules for a language used in a corpus

All of these except the last are used in the BNC.

11.1 Element specification

A XAIRA element specification consists of a <xairaList> of type elementSpec containing one or more <xairaItem> elements, one for each element that the Xaira indexer or client needs to be aware of. Elements which are not mentioned within the Xaira element specification may however appear within a corpus. When the indexer finds such an element, it will index it using all default options; the client will not have access to any explanatory text or gloss for such elements. Equally, the specification may include definitions for elements which do not appear within the corpus.

The simplest form of element specification just provides a description for the element:

<xairaItem type="element" ident="pause">
<desc>marks a pause in the transcription</desc>
</xairaItem>

More usually, an element specification will also supply glosses for the attributes of an element. These are supplied by an <attList> element embedded within the <xairaItem>, consisting of one <attDef> element for each attribute concerned:

<xairaItem type="element" ident="w">
  <desc>a lexical token as identified by CLAWS</desc>
  <attList>
    <attDef ident="c5">
      <desc>the part of speech assigned to a token by CLAWS</desc>
    </attDef>
    <attDef ident="hw">
      <desc>the base form of a word as determined by the Lancaster lemmatization scheme</desc>
    </attDef>
    <attDef ident="pos">
      <desc>a simplified part of speech code derived from the CLAWS C5 tag</desc>
    </attDef>
  </attList>
</xairaItem>

Descriptions may also be supplied for the values indexed for given attributes. This is accomplished by providing a <valList> element within the <attDef>, as in the following example:

<xairaItem type="element" ident="person">
  <desc>a person whose speech is recorded in the corpus</desc>
  <attList>
    <attDef ident="ageGroup">
      <desc>the age group to which a person belongs</desc>
      <valList>
        <valItem ident="A0">unknown age</valItem>
        <valItem ident="A1">under 16 years old</valItem>
        <valItem ident="A2">aged 16 to 35 years</valItem>
        <valItem ident="A3">aged 36 to 45 years</valItem>
        <valItem ident="A4">46 or older</valItem>
      </valList>
    </attDef>
  </attList>
</xairaItem>

The values A0, A1 etc. supplied by the ident attribute on <valItem> need not be unique across the corpus.
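As an illustration of how a client might use such glosses, the following Python sketch turns a value list into a lookup table. It is not part of Xaira: the function name value_glosses is hypothetical, and the fragment is a shortened copy of the example above. Keying the table by element and attribute shows why the codes need not be unique across the corpus.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <person> element specification shown above.
SPEC = """
<xairaItem type="element" ident="person">
  <desc>a person whose speech is recorded in the corpus</desc>
  <attList>
    <attDef ident="ageGroup">
      <desc>the age group to which a person belongs</desc>
      <valList>
        <valItem ident="A0">unknown age</valItem>
        <valItem ident="A1">under 16 years old</valItem>
      </valList>
    </attDef>
  </attList>
</xairaItem>
"""

def value_glosses(item_xml):
    """Map (element, attribute) pairs to {code: gloss} dictionaries."""
    item = ET.fromstring(item_xml)
    element = item.get("ident")
    glosses = {}
    for att_def in item.iter("attDef"):
        codes = {
            val.get("ident"): (val.text or "").strip()
            for val in att_def.iter("valItem")
        }
        glosses[(element, att_def.get("ident"))] = codes
    return glosses

GLOSSES = value_glosses(SPEC)
```

Because each table is held under its own (element, attribute) pair, a code such as A1 could be reused with a different gloss on another attribute without conflict.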

A single definition may be supplied for ‘global’ attributes which appear on any element by using the following syntax:

<xairaItem type="element" ident="*">
  <desc>global attributes</desc>
  <attList>
    <attDef ident="n">
      <desc>a name or number used to label any element</desc>
    </attDef>
  </attList>
</xairaItem>

If the element or attribute to be defined is taken from some non-default namespace, the ns attribute must be supplied on the <xairaItem> element:

<xairaItem
  type="element"
  ident="*"
  ns="http://www.w3.org/XML/1998/namespace">
  <desc>global attributes</desc>
  <attList>
    <attDef ident="id">
      <desc>a name or number used to label any element</desc>
    </attDef>
  </attList>
</xairaItem>

Here the globally-available xml:id attribute is explicitly associated with the namespace http://www.w3.org/XML/1998/namespace

A type attribute may also be specified on the <valList> element to indicate whether the list of values it contains is exhaustive or exemplary; at present, however, Xaira does not use this information.

In this section, we have introduced the following elements:

<desc> (description) supplies explanatory text associated with a category or other component defined in the corpus header.
<attList> contains documentation for all the attributes associated with this element, as a series of attDef elements.
<attDef> (attribute definition) provides the definition for a single attribute.
ident supplies an element's generic identifier, or one of the codes * (meaning all elements), or name() meaning that the name of the referenced element is to be used rather than its value.

<valList> (value list) contains one or more valItem elements defining possible values for an attribute.

ident	supplies an element's generic identifier, or one of the codes * (meaning all elements), or name() meaning that the name of the referenced element is to be used rather than its value.
type	specifies the extensibility of the list of attribute values specified.
copyOf	supplies the identifier of a previously-defined value list to be used at this point.

<valItem> (value definition) contains a single value and gloss pair for an attribute.

ident	supplies an element's generic identifier, or one of the codes * (meaning all elements), or name() meaning that the name of the referenced element is to be used rather than its value.
ns	supplies the namespace within which the generic identifier is to be found.

11.2 Key specification

A Xaira key specification is used to define how the indexer should identify which parts of the input documents are to be regarded as lexical forms and what additional keys should be associated with those forms. Additional keys are used to distinguish otherwise identical forms in the index (for example, the same spelling with two different POS codes); they are also used to build up lemma schemes and regions, on which see below.

The key specification consists of a <xairaList> of type keySpec. If no specification is given, the indexer will assume default implicit tokenization is in force and no additional keys are defined.

If a key specification is supplied, it contains at least one <xairaItem type="form">, optionally followed by one or more <xairaItem type="addKey"> elements, each of which may contain a <desc> element to document its purpose, and should also contain a <valSource> element to specify an element or attribute within the corpus being indexed which is to be used as the source for the values to be used as a key.

The BNC index specification begins by specifying that the elements <w> and <c> delimit the forms which the indexer must index:

The <valSource> element specifies where the indexer is to find the value which is to be treated as the form part of the index entry. In both cases, it is found as element content, of a <c> or <w> element. Since no further information is given about where such elements are to be found, this will apply to every occurrence of a <w> or <c> element, irrespective of its context. Since no namespace is specified, the element is assumed to be in the current or default namespace.

Next, the BNC index specification defines three additional keys, corresponding with the attributes c5, hw, and pos. First, the CLAWS C5 code which is supplied as the value of a c5 attribute on the elements <w> and <c>:

This defines an additional key called c5, the value of which is supplied by the attribute also called c5, but only when that attribute is supplied on an element called <w> or <c> and at any point in the document structure. Other attributes called c5 (such as that on <mw>) will not be used for this purpose.

When an additional key value is required, but no value is available, because the attribute or element specified does not exist or has no value, the literal content of the <defaultVal> element (XXX in the example above) will be used instead. In the BNC, this should not happen, and this value should not therefore appear.
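The behaviour described here can be modelled in a few lines of Python. This is only a sketch of the rule as stated in the text, not Xaira code: the function name c5_keys and the sample sentence are invented, and the <w> element lacking a c5 attribute is deliberately contrived to show the default value, which should not occur in the BNC itself.

```python
import xml.etree.ElementTree as ET

# Invented sample: a multiword unit <mw> wrapping two words, then a
# word with no c5 attribute (contrived), then a punctuation mark.
SENTENCE = ET.fromstring(
    '<s n="1"><mw c5="AV0"><w c5="PRP">of </w><w c5="NN1">course</w></mw>'
    '<w>yes</w><c c5="PUN">.</c></s>'
)

def c5_keys(root, names=("w", "c"), default="XXX"):
    """Pair each indexed form with the value of its c5 additional key."""
    keys = []
    for element in root.iter():
        if element.tag not in names:
            continue  # e.g. <mw>: its c5 attribute is never consulted
        form = (element.text or "").strip()
        # Fall back to the default when the attribute is absent or empty.
        keys.append((form, element.get("c5") or default))
    return keys

KEYS = c5_keys(SENTENCE)
```

The c5 value on <mw> plays no part in the result, because only elements named in the list of source elements contribute key values.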

The remaining two additional keys are defined in much the same way, except that they derive from attributes specified only for the <w> element:

These additional keys are used in the BNC lemma scheme specification discussed below (11.3 Lemma Scheme Specification).

The caseFold attribute is used to specify that forms should be case folded before indexing, so that forms differing only in letter case will be stored identically.

The last additional key defined in the BNC index specification is derived from a source other than an element or attribute:

<xairaItem ident="region" type="addKey">
  <desc>defines the additional keys used to support filtering of text from different regions of selected texts</desc>
  <valSource caseFold="false" ident="name()" type="pseudo">
    <nameList>
      <gi>stext</gi>
      <gi>teiHeader</gi>
      <gi>wtext</gi>
    </nameList>
    <defaultVal>nowhere</defaultVal>
  </valSource>
</xairaItem>

The effect of this is to define an additional key called region, the value of which on a given form in the index will be one of the strings stext, teiHeader, wtext, or nowhere depending on the location of the form being indexed. The name() identifier here indicates that it is the name of the associated elements which is to be used as the value of the key, rather than their content. If no <nameList> were provided, then the key generated would contain the name of the nearest ancestor element. This key is used in the subsequent region specification (see 11.4 Region Specification).
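A rough Python model of this pseudo key may make the rule concrete. It assumes that the value is taken from the nearest enclosing element named in the <nameList>; the function name region_keys and the miniature document are invented for illustration, and the stray <w> outside any listed element is contrived to show the default value.

```python
import xml.etree.ElementTree as ET

# Invented miniature document; real BNC texts are far richer.
SAMPLE = ET.fromstring(
    "<bncDoc>"
    "<w>stray</w>"
    "<teiHeader><title><w>Sample</w></title></teiHeader>"
    "<wtext><p><s><w>Hello</w><c>.</c></s></p></wtext>"
    "</bncDoc>"
)

def region_keys(element, names=("stext", "teiHeader", "wtext"),
                default="nowhere", current=None):
    """Yield (form, region key) pairs for every <w> and <c> element."""
    if element.tag in names:
        current = element.tag  # nearest enclosing listed element so far
    if element.tag in ("w", "c"):
        yield (element.text or "").strip(), current or default
    for child in element:
        yield from region_keys(child, names, default, current)

REGION_KEYS = list(region_keys(SAMPLE))
```

Each form thus carries exactly one region value, which is what allows the region specification to select forms by that value alone.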

11.3 Lemma Scheme Specification

Any combination of additional keys may be used to form a lemma scheme. This enables the values of the nominated keys to be treated as alternate forms for the associated index entries. For example, occurrences of words such as "dogs", "dogged", "dogging" etc in the BNC all have the value "dog" for an additional key called "Headword". To distinguish verbal senses from nominal ones, this additional key would need to be combined with another key giving the part of speech (noun or verb) for each occurrence. The resulting lemma scheme would then distinguish forms of "dog (noun)" from forms of "dog (verb)".

Xaira supports the definition of multiple lemma schemes, but only a single one is defined for the BNC. All lemma schemes are defined together in a single <xairaList type="lemmaSpec"> element, containing one <xairaItem type="lemmaScheme"> for each scheme. This element contains an optional <desc>, followed by a <nameList> containing the names of the additional keys used to constitute the scheme. (The name of the additional key is the name supplied by the ident attribute when the key was defined.) Thus, the lemma scheme defined for the BNC has the following specification:

<xairaList type="lemmaSpec">
  <xairaItem type="lemmaScheme" ident="BNC">
    <nameList>
      <att>Headword</att>
      <att>pos</att>
    </nameList>
  </xairaItem>
</xairaList>

This defines a lemma scheme called BNC which is based on the combination of the values given by the additional keys Headword and pos which were defined in the previous section.
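The effect of combining two keys can be sketched in Python as a grouping of forms by the tuple of their key values. The index entries below are invented, and the pos values shown are illustrative; group_by_lemma is a hypothetical helper, not a Xaira function.

```python
from collections import defaultdict

# Hypothetical index entries: each form with its additional key values.
ENTRIES = [
    {"form": "dogs", "Headword": "dog", "pos": "SUBST"},
    {"form": "dogged", "Headword": "dog", "pos": "VERB"},
    {"form": "dogging", "Headword": "dog", "pos": "VERB"},
    {"form": "dog", "Headword": "dog", "pos": "SUBST"},
]

def group_by_lemma(entries, scheme=("Headword", "pos")):
    """Group forms under the tuple of key values named by the scheme."""
    lemmas = defaultdict(set)
    for entry in entries:
        lemma = tuple(entry[key] for key in scheme)
        lemmas[lemma].add(entry["form"])
    return dict(lemmas)

LEMMAS = group_by_lemma(ENTRIES)
HEADWORD_ONLY = group_by_lemma(ENTRIES, scheme=("Headword",))
```

With the two-key scheme the noun and verb forms fall under separate lemmas; with the headword alone they would all be merged under a single entry.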

11.4 Region Specification

A region is a collection of possibly discontinuous sections of a corpus defined by the XML tagging within it. For example, each BNC document contains a <teiHeader> element and either a <wtext> or an <stext> element. We say that all the parts of each document contained by a <teiHeader> element constitute one region. All the parts contained by either a <wtext> or an <stext> element constitute another region. Regions (unlike partitions) span document boundaries, and are not made up of whole texts but of defined parts of them.

A region is defined by means of a <xairaItem> of type region. The ident attribute on the <xairaItem> supplies a name for the region, which can be used by the client to limit searches to locations within the named region.

The definition of the region is contained within a <nameList>. It combines the name of a previously-defined additional key (region in the case of the BNC) which is tagged as an <ident> element, with a list of one or more values. Word occurrences whose region additional key has the value specified will be considered to fall within the region being defined. Since these values are element names, they are tagged within the <nameList> using the <gi> element.

For example:

<xairaItem type="region" ident="speechOnly">
  <nameList>
    <ident>region</ident>
    <gi>stext</gi>
  </nameList>
</xairaItem>

This part of the BNC region specification defines a region called speechOnly. Any word for which the additional key region has the value stext will fall within this region.

Two other regions are defined in a very similar way in the BNC:

<xairaList type="regionSpec">
  <xairaItem type="region" ident="headerOnly">
    <nameList>
      <ident>region</ident>
      <gi>teiHeader</gi>
    </nameList>
  </xairaItem>
  <xairaItem type="region" ident="textOnly">
    <nameList>
      <ident>region</ident>
      <gi>stext</gi>
      <gi>wtext</gi>
    </nameList>
  </xairaItem>
</xairaList>

The first of these defines the region headerOnly, for words occurring within the header; the second defines the region textOnly for words occurring within <wtext> or <stext> elements, as indicated by the values supplied for their respective region additional key.
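Seen from the client side, a region amounts to a set of accepted values for the region additional key. The Python sketch below encodes the three BNC regions in that way; the helper names in_region and restrict, and the list of hits, are invented for illustration.

```python
# The three BNC regions, each as the set of accepted values for the
# additional key called "region".
REGIONS = {
    "speechOnly": {"stext"},
    "headerOnly": {"teiHeader"},
    "textOnly": {"stext", "wtext"},
}

def in_region(key_value, region_name, regions=REGIONS):
    """Return True if a form with this region key lies in the region."""
    return key_value in regions[region_name]

def restrict(hits, region_name):
    """Keep only the forms whose region key falls within the region."""
    return [form for form, key in hits if in_region(key, region_name)]

# Hypothetical hits: (form, value of the region additional key).
HITS = [("corpus", "teiHeader"), ("hello", "stext"), ("chapter", "wtext")]
TEXT_HITS = restrict(HITS, "textOnly")
SPEECH_HITS = restrict(HITS, "speechOnly")
```

Since membership depends only on the key value stored with each form, a region can draw on any number of documents without reference to document boundaries.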

11.5 Reference specification

The index maps occurrences of index terms as defined in the previous section to locations in the corpus, which may be identified in a number of ways, additional to the internally-defined location system. This external referencing scheme is used by the system to label the context of occurrences found by the search program. Occurrences themselves are precisely located by the internal location scheme. Although the index contains information about the complete XPath location of occurrences within the corpus, the internal location scheme is highly optimized and cannot be used to support access via arbitrary XPath expressions or XQL queries.

The referencing scheme used to identify contexts has the following components:

a single ‘text’ identifier: this may be derived from a system identifier, or specified by a nominated attribute on the element which contains the text, or it may be calculated by the indexer in terms of the XML structure indexed.
a single ‘scope’ identifier: this may be derived from the value of a specified attribute on any element in the text; calculated by the indexer in terms of the XML structure; or derived from the physical input structure.
optionally additional ‘unit’ labels: these may be derived from the value of a specified attribute on any element in the text; calculated by the indexer in terms of the XML structure; or derived from the physical input line number.

The element from which the text identifier is derived also delimits a single ‘text’ in the corpus. This effectively limits the kinds of value which may be used to identify it: it must be an attribute value or a pseudo value; element content is not permitted.

The referencing specification for a Xaira index is given by a <xairaList type="refSpec">, containing exactly one <xairaItem type="textRef">, followed by one <xairaItem type="scopeRef"> and optionally one or more further <xairaItem type="unitRef"> elements. Each such <xairaItem> element contains a <valSource> element as defined above, to indicate where the value for the reference is to be obtained in the input document. It may also contain a <labelGen> element which further defines the parts of the document to which the reference applies and its format.

<xairaList type="refSpec">
  <xairaItem type="textRef">
    <valSource
      type="attribute"
      ident="id"
      ns="http://www.w3.org/XML/1998/namespace">
      <nameList>
        <gi>bncDoc</gi>
      </nameList>
    </valSource>
  </xairaItem>
  <xairaItem type="scopeRef">
    <valSource type="attribute" ident="n">
      <nameList>
        <gi>s</gi>
      </nameList>
      <labelGen>%1.%2</labelGen>
    </valSource>
  </xairaItem>
</xairaList>

In the BNC, each <bncDoc> begins a new ‘text’, which is identified by the value of its xml:id attribute, and the scope for each query is to be a complete <s> element, identified by its n attribute. The reference is to be formatted with a dot between the two values.

This specification will produce references like ABC.123 for an <s> element with attribute n set to 123, found within a <bncDoc> element whose xml:id attribute has the value ABC.
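The label format can be imitated with a small Python function. The sketch assumes that %1 stands for the text identifier and %2 for the scope value, as the example reference suggests, and that no other placeholder syntax is in play; format_reference is a hypothetical name.

```python
import re

def format_reference(template, values):
    """Replace %1, %2, ... in the template with the matching value."""
    def substitute(match):
        # Placeholders are numbered from 1, Python lists from 0.
        return str(values[int(match.group(1)) - 1])
    return re.sub(r"%(\d+)", substitute, template)

# Text identifier from xml:id on <bncDoc>, scope value from n on <s>.
REFERENCE = format_reference("%1.%2", ["ABC", "123"])
```

Any literal characters in the template, such as the dot here, are copied through unchanged.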

11.6 Indexing Policies

In addition to index terms derived from the lexical content of a corpus, a Xaira index also contains information about the occurrence of XML start- and end-tags within the corpus. This information is used to facilitate a number of search options: searching for non-lexical features, searching for lexical features within a given structural context, scoping co-occurrences of lexical or non-lexical features, etc.

By default an entry is made in the index for each occurrence of each tag, both start and end. This entry may also distinguish start-tag occurrences depending on the values of specified attributes supplied with them. (Note that this is independent of the use of such attribute values in the creation of index terms as described in the previous section).

For example:

<head>The heading</head>

will create index entries for the tags <head> and </head>

<head type="sub">The subheading</head>

will create index entries for the tags <head>, <head type="sub"> and </head>
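The two cases above can be expressed as a short Python function. This is a sketch of the rule as described, not of the indexer itself; tag_entries is a hypothetical name, and producing one entry per attribute when several are present is an assumption, since the text shows only the single-attribute case.

```python
def tag_entries(name, attributes=None):
    """List the index entries created for one element occurrence."""
    entries = [f"<{name}>"]  # the bare start-tag is always indexed
    for key, value in (attributes or {}).items():
        # Assumption: each attribute value yields its own entry.
        entries.append(f'<{name} {key}="{value}">')
    entries.append(f"</{name}>")
    return entries

PLAIN = tag_entries("head")
SUBHEAD = tag_entries("head", {"type": "sub"})
```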

The content of every element found in a corpus is indexed by default, as are all of the tags, and all of their attributes. This behaviour may be modified by specifying explicit indexing policies for elements to which this default policy does not apply. An indexing policy may not be specified for elements or attributes which have been nominated as the sources for an additional key or reference, since these are indexed in a different way. Any indexing policy specified for such elements or attributes will be ignored by the indexer.

The following indexing policies are used in the BNC:

none: No part of the specified element or attribute will be indexed. In the case of an element, this means that none of its start and end tags, its attributes, its child elements, and its character data content will be included in the index. In the case of an attribute, its value will be omitted.
markup: This policy applies only to elements. Only start- and end-tags and attributes for the specified element and for any child elements will be indexed; no content of the element or its children will be indexed.
jointo: This policy applies only to attributes. The specified attribute is available for use as the target of an attribute indexed with the joinfrom policy.
joinfrom: This policy applies only to attributes. The attribute specified has values which correspond with those on an attribute of some other element which has been indexed with the jointo policy, or (if no jointo attribute has been defined) which uses the xml:id attribute.
taxonomy: This policy applies only to attributes. The attribute specified has values which correspond with the xml:id attribute on some <category> element within a TEI-conformant <taxonomy> element.

For every element or attribute to which a non-default indexing policy applies, a <xairaItem type="indexPol"> appears within the <xairaList type="indexSpec"> element. This may contain either an <elementPolicy> or an <attributePolicy> element, depending on whether it relates to elements or attributes.
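The difference between the none and markup policies for elements can be modelled in Python as a tree walk. This is a simplified sketch under stated assumptions: text is emitted as whole strings rather than tokenized forms, attributes are left out for brevity, and the function name index_terms and the sample document are invented.

```python
import xml.etree.ElementTree as ET

def index_terms(element, policies, content=True):
    """Yield the tags and text that would be indexed for an element."""
    policy = policies.get(element.tag)
    if policy == "none":
        return  # no tags, no children, no character data
    if policy == "markup":
        content = False  # holds for this element and all descendants
    yield f"<{element.tag}>"
    if content and element.text and element.text.strip():
        yield element.text.strip()
    for child in element:
        yield from index_terms(child, policies, content)
        if content and child.tail and child.tail.strip():
            yield child.tail.strip()
    yield f"</{element.tag}>"

# Invented sample document and the two element policies discussed here.
SAMPLE_DOC = ET.fromstring(
    "<doc><revisionDesc><change>tidied</change></revisionDesc>"
    "<bibliography><title>A title</title></bibliography>"
    "<p>Some text</p></doc>"
)
POLICIES = {"revisionDesc": "none", "bibliography": "markup"}
TERMS = list(index_terms(SAMPLE_DOC, POLICIES))
```

In the resulting list the <revisionDesc> subtree is absent altogether, while <bibliography> and its child contribute tags but no text.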

11.6.1 Index policies NONE and MARKUP

Within the BNC, an indexing policy of none is applied to the element <revisionDesc>:

The effect of this is that, although <revisionDesc> elements will be visible in search results, they cannot be searched for, and a query for one, or for anything contained by one, will return no hits.

The indexing policy markup is applied to the element <bibliography>. One occurrence of this element, declared in its own namespace, is necessary for a XAIRA system: it holds metadata relating to each text constituting the corpus. In the BNC this bibliographic information is copied from the text headers, which are also indexed in their own right. To avoid duplication of this content, the indexer is instructed to index only the structure of the bibliography but not its content:

11.6.2 Index policies JOINFROM and JOINTO

The purpose of joinfrom and jointo indexing policies is to support join queries. A join query is one in which attributes are effectively transferred from one element to another. For example, in the BNC, each text header contains detailed data about individual speakers within the <person> element, and also uses the attribute who to identify the speaker or speakers of each speech in the transcribed part of the corpus:

<person xml:id="ABC" age="A" soc="B1"> ... </person>
<person xml:id="DEF" age="Z" soc="A1"> ... </person>
...
<u who="ABC"> .... </u>
<u who="DEF"> .... </u>

Since the values for who all correspond with the value of an xml:id attribute on some <person> element, a join query can be effected. The XAIRA client can be configured to support queries in which the attributes age and soc appear to be attributes of the <u> element, their values being transferred from the <person> element whose xml:id value is equal to that given by the who attribute on <u>. The effect is as it would be if the <u> elements above looked like this:

<u who="ABC" age="A" soc="B1"> .... </u>
<u who="DEF" age="Z" soc="A1"> .... </u>

This is accomplished by the following set of indexing policies:

<xairaItem type="indexPol">
  <attributePolicy
    ident="id"
    type="jointo"
    ns="http://www.w3.org/XML/1998/namespace"/>
</xairaItem>
<xairaItem type="indexPol">
  <attributePolicy type="joinfrom" ident="who">
    <nameList>
      <gi>u</gi>
    </nameList>
    <joinTo>
      <gi>person</gi>
    </joinTo>
  </attributePolicy>
</xairaItem>

First, we declare a join-to policy for any xml:id attribute. Next we declare the join-from policy for the who attribute on the <u> element. As well as specifying which attribute carries the value required (who), we need additionally to supply the name of the element on which the corresponding join-to attribute should be found (<person>). Values are transferred when a match is found between the value for the who attribute and that of whichever attribute of the nominated element has been indexed with the join-to policy. Note that only one attribute of a given element may be indexed with the join-to policy and that the values of attributes indexed with the join-to policy must be unique within the specified element and attribute combination. Thus, there may be only one <person> element with the value ABC for its xml:id attribute, though the same value may appear on other attributes. If the value appears on the xml:id attribute of some other element, it will not be found with this join-to policy. Note that, since the globally-available xml:id attribute is used to hold the join-to attribute, its values must be unique across the whole corpus.
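The transfer of attribute values can be imitated in Python as a lookup keyed on the join-to attribute. The sketch is illustrative only: join_attributes is a hypothetical name, the sample elements are reduced to their attributes, and leaving an utterance unchanged when its who value matches no <person> is an assumption about behaviour the text does not describe.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml: prefix to the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

CORPUS = ET.fromstring(
    "<bncDoc>"
    '<person xml:id="ABC" age="A" soc="B1"/>'
    '<person xml:id="DEF" age="Z" soc="A1"/>'
    '<u who="ABC"/><u who="DEF"/><u who="GHI"/>'
    "</bncDoc>"
)

def join_attributes(root, join_from=("u", "who"), join_to="person"):
    """Return the attributes each join-from element appears to carry."""
    targets = {}
    for target in root.iter(join_to):
        key = target.get(XML_ID)
        if key in targets:
            # Join-to values must be unique for the join to be defined.
            raise ValueError(f"duplicate join-to value: {key}")
        targets[key] = target
    element_name, attribute = join_from
    joined = []
    for source in root.iter(element_name):
        target = targets.get(source.get(attribute))
        found = target.attrib if target is not None else {}
        extra = {k: v for k, v in found.items() if k != XML_ID}
        joined.append({**source.attrib, **extra})
    return joined

JOINED = join_attributes(CORPUS)
```

The uniqueness check mirrors the requirement stated above: with two <person> elements sharing an identifier, there would be no way to decide whose attributes to transfer.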

11.6.3 Index policy taxonomy

A taxonomy is a special kind of codebook, the purpose of which is to provide a set of defined codes to classify the texts making up a corpus. The BNC defines several different taxonomies as means of classifying its constituent texts, as further described in 5.2.3 The reference and classification declarations. The element or attribute within a particular text which identifies its classification, by referencing one or more codes within a taxonomy, is called its classifier.

Each distinct taxonomy for a corpus is defined by a TEI <taxonomy> element, within the corpus header. This defines the codes available for use and gives a gloss to them. Where, as is usual, the texts in a corpus are classified along more than one dimension (for example, by text type, by medium of distribution, by audience type etc.), a <taxonomy> must be defined for each dimension, rather than defining a single taxonomy with disjoint sets of children. Note that the classification codes used must be unique across the whole corpus, irrespective of the taxonomy to which they belong. This approach also enables the client to regard each taxonomy as defining a partition of the corpus.

To use a taxonomy defined in this way, the relevant attribute must be defined with the taxonomy indexing policy. In the case of the BNC, classification information is carried by two attributes:

the targets attribute on the <catRef> element in each text header supplies a list of values for all the original selection and descriptive criteria, described in 1 Design of the corpus
the type attribute on <wtext> and <stext> elements carries a broadbrush text-type categorization, derived from the other classification codes, see further 9.1 XML tag usage by text type

The following declarations achieve this effect:

<xairaItem type="indexPol">
  <attributePolicy ident="targets" type="taxonomy">
    <nameList>
      <gi>catRef</gi>
    </nameList>
  </attributePolicy>
</xairaItem>
<xairaItem type="indexPol">
  <attributePolicy ident="type" type="taxonomy">
    <nameList>
      <gi>wtext</gi>
      <gi>stext</gi>
    </nameList>
  </attributePolicy>
</xairaItem>
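The requirement that classification codes be unique across all taxonomies can be checked mechanically, and the same table then resolves a classifier value into one code per dimension. In the Python sketch below, the taxonomy names, codes, and glosses are invented placeholders rather than real BNC values, the helper names are hypothetical, and treating the targets value as a whitespace-separated list of codes is an assumption.

```python
# Invented taxonomies: one per classification dimension.
TAXONOMIES = {
    "medium": {"MED1": "book", "MED2": "periodical"},
    "audience": {"AUD1": "child", "AUD2": "adult"},
}

def build_code_index(taxonomies):
    """Map each code to (taxonomy, gloss), enforcing uniqueness."""
    index = {}
    for taxonomy, categories in taxonomies.items():
        for code, gloss in categories.items():
            if code in index:
                raise ValueError(f"code {code!r} occurs in two taxonomies")
            index[code] = (taxonomy, gloss)
    return index

def classify(targets, index):
    """Resolve a whitespace-separated targets value to {taxonomy: code}."""
    return {index[code][0]: code for code in targets.split()}

CODE_INDEX = build_code_index(TAXONOMIES)
CLASSIFICATION = classify("MED1 AUD2", CODE_INDEX)
```

Because every code belongs to exactly one taxonomy, a classifier value needs no further qualification, and each taxonomy can be treated as a partition of the corpus.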

11.7 Language specification

As a Unicode system, XAIRA is able to handle data in any natural language or writing system. However, it is still necessary to specify the language or languages used in the corpus being indexed. This specification is performed by a <xairaList> of type langSpec. This contains at least one <xairaItem type="defaultLang">, and optionally other <xairaItem type="langRules"> elements.

The BNC uses only standard English and thus contains only a default language specification, which looks like this:


edited by Lou Burnard. Date: January 2007
This page is copyrighted