Software for BNC-baby


7.1. Why XML?

A design goal of the original BNC project was that it should not be delivered in a format which was proprietary or which required the use of any particular piece of software. This, together with the desire to conform to emerging international standards, was a key factor in determining the choice of SGML as the vehicle for the corpus interchange format. Six years after this decision, SGML is still a widely used international standard format for which many public domain and commercial utilities exist. Indeed, in the shape of XML, which is a simplified version of the original standard, SGML now dominates development of the world wide web, and hence of most sectors of the information processing community. New XML software appears almost every week, and it has been adopted by current `major players' from Sun and IBM to Microsoft.

That said, it must be recognised that the requirements of corpus linguists and others wishing to make use of the BNC are often rather specialist, and therefore unlikely to be supported by mainstream commercially produced software. For this and other reasons, the research user of the BNC should expect to have to do some programming. This is another reason behind the choice of SGML or XML as a vehicle for the system: because of the wide take up of these formalisms, there exist many utility libraries and generic programming interfaces which greatly simplify such processes as extracting the tags from a file, selecting portions of the text according to its logical structure, picking out files with certain attributes by searching their headers, and so on.

BNC-baby uses XML in a simple and straightforward way, described in the rest of this manual. Simple programs that treat the corpus files as plain text can readily be written using standard UNIX utilities such as grep or perl. More reliably, programs can be written against application programming interfaces (APIs) such as the W3C's Document Object Model (DOM) or the Simple API for XML (SAX), using libraries available for almost every modern programming language (C, Perl, Python, Tcl etc.). Information about such resources is not provided here, but is readily found on the World Wide Web: currently, one good place to start looking is www.xml.com.
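As an illustration of the API-based approach, the following minimal sketch uses the SAX interface from the Python standard library to count segments and words by word-class tag. The element names <s>, <w> and <c> are those documented in this manual; the sample sentences, the identifier XXX, and the attribute name type are invented for the example and should be checked against the actual corpus files.

```python
import xml.sax
from collections import Counter

# Invented fragment for illustration only. The element names come from this
# manual; the "type" attribute and its values are assumptions.
SAMPLE = (
    '<bncDoc id="XXX"><text>'
    '<s n="1"><w type="AT0">The </w><w type="NN1">cat </w>'
    '<w type="VVD">sat</w><c type="PUN">.</c></s>'
    '<s n="2"><w type="AT0">The </w><w type="NN1">dog </w>'
    '<w type="VVD">barked</w><c type="PUN">.</c></s>'
    '</text></bncDoc>'
)


class WordCounter(xml.sax.ContentHandler):
    """Count <s> elements, and <w> elements by their (assumed) type attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.sentences = 0

    def startElement(self, name, attrs):
        if name == "s":
            self.sentences += 1
        elif name == "w":
            self.counts[attrs.get("type", "?")] += 1


def count_words(xml_text):
    """Parse a string of XML and return the populated handler."""
    handler = WordCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler


result = count_words(SAMPLE)
print(result.sentences, dict(result.counts))
```

Because SAX reports elements as they are read rather than building the whole document in memory, a handler of this kind is one plausible way to process larger files.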

When the BNC was first published, the top of the range personal computer might have as much as 50 or even 100 megabytes of disk storage and 8 Mb of RAM. At the time of writing, 20 or 30 gigabyte hard disks and 128 Mb of RAM are commonplace on entry level machines. It is thus quite likely that software capable of efficiently handling the 1.5 gigabytes of text which make up the BNC will also soon become commonplace. For the moment, however, it has to be recognized that general purpose tools for SGML and XML do not always cope very well with the large size of the whole corpus, although they can still be very useful for processing subsets extracted from it. To handle the whole of the corpus, special purpose indexing software will usually be necessary. Although such systems exist, they are often expensive or difficult to implement. For that reason, the BNC Project also developed its own low-cost alternative, the SARA package, which is documented separately. It should be emphasized, however, that use of the BNC is not synonymous with use of SARA. Most generic tools developed for corpus linguistics and NLP can be used with the BNC, although the tools may vary in the extent to which they can make use of the markup in the corpus.

Whatever software is used, the programmer must have a clear understanding of the various elements tagged in the corpus, the contexts in which they may appear, and their intended semantics. The syntax of an XML document is defined by a schema or a Document Type Definition. For TEI conformant texts, the TEI Header provides additional meta-information. The semantics of the elements encoded in an XML document are given by documentation such as that found elsewhere in this manual.

7.2. BNC-baby delivery format

The corpus is delivered as a collection of 128 individual text files, grouped into four subdirectories, one for each of the text registers making up the corpus.

Each file contains a single BNC document, i.e. a TEI header and its associated spoken or written text, and has the same name as the value of the id attribute on its <bncDoc> element.

Note that the three-character identifiers used (and hence the directory structure) are entirely arbitrary and do not convey any information about the type of text contained. Each text contains a TEI Header which specifies all such meta information, either directly, or by reference to the corpus header, as described in section 5. The header.
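The layout described above lends itself to a simple index built from the file system. The sketch below is one possible approach, not part of the distribution: it assumes that text files carry an .xml extension (as in the driver examples in this section), and the function name index_corpus is hypothetical. The short demonstration builds a throwaway directory tree rather than reading a real installation.

```python
import tempfile
from pathlib import Path


def index_corpus(root):
    """Map each subdirectory name to the sorted text identifiers found in it.

    Only files ending in .xml are considered, so ancillary files (for
    example .dtd or .ent files in the Work directory) are ignored.
    """
    index = {}
    for path in sorted(Path(root).glob("*/*.xml")):
        # The file name without its extension is taken as the identifier.
        index.setdefault(path.parent.name, []).append(path.stem)
    return index


# Demonstration on a temporary tree; the file contents are left empty.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("fic/ABC.xml", "fic/ABD.xml", "Work/bnc.dtd"):
        target = Path(tmp, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("")
    demo_index = index_corpus(tmp)

print(demo_index)
```

Since the identifiers themselves carry no information about text type, any selection by register or other criteria beyond the directory name still has to consult the headers.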

Some ancillary files relating to the encoding and processing of the corpus are included in the standard release in a subdirectory called Work. This contains the following files:

bnc.dtd: An XML document type declaration for the corpus, as a single SGML file. This file is automatically generated from the standard TEI DTD using a pair of `extension files' as further discussed in the TEI Guidelines.
bncMods.dtd and bncMods.ent: The two extension files used to parameterize the TEI for BNC usage
bncDocs.dtd: SGML declarations for all documents making up the BNC
driver1.sgm and driver2.sgm: Example SGML driver files for processing the BNC.

The remainder of this section discusses how these files may be used together as an XML document. This is by no means the only way of processing the corpus, of course, and is intended solely to demonstrate the function of the various files listed above. Some basic understanding of XML is assumed.

To process a single text from the corpus (say, text fic/ABC), a driver file like the following might be used:

<!DOCTYPE bnc SYSTEM  
   "http://www.tei-c.org/Guidelines/DTD/tei2.dtd" [
<!ENTITY % TEI.prose          "INCLUDE">
<!ENTITY % TEI.spoken         "INCLUDE">
<!ENTITY % TEI.general        "INCLUDE">
<!ENTITY % TEI.analysis       "INCLUDE">
<!ENTITY % TEI.corpus         "INCLUDE">
<!ENTITY % TEI.extensions.ent 
           SYSTEM "bncMods.ent">
<!ENTITY % TEI.extensions.dtd 
           SYSTEM "bncMods.dtd">
<!ENTITY corphdr SYSTEM "corphdr">
<!ENTITY text    SYSTEM "fic/ABC.xml">
]>
<bnc>
&corphdr;
&text;
</bnc>

This driver assumes that the standard TEI DTD is available from the URL given (which was true as of the date of this manual), and that the files from the BNC-baby distribution have been installed under /home/BNC-baby. Alternatively, if the driver file is to be used offline, using the `compiled' version of the BNC dtd, it might look like the following:

<!DOCTYPE bnc SYSTEM  "bnc.dtd" [
<!ENTITY BNChdr SYSTEM "corphdr">
<!ENTITY text    SYSTEM "fic/ABC.xml">
]>
<bnc>
&BNChdr;
&text;
</bnc>

To process more than one file from the corpus, a set of declarations like the one given above for the entity text would be necessary, one for each text concerned. For convenience, a file containing such declarations for every text in the corpus is also provided: this file, bncdocs.dtd, consists of declarations like the following:

<!ENTITY ABC SYSTEM "fic/ABC.xml">
<!ENTITY ABD SYSTEM "fic/ABD.xml">

With these declarations in force, it becomes possible to refer to the corpus file ABC simply by means of the entity reference &ABC;, as in the following example:

<!DOCTYPE bnc SYSTEM  "bnc.dtd" [
<!ENTITY % BNCdocs           
           SYSTEM "bncDocs.ent">
%BNCdocs;
<!ENTITY BNChdr SYSTEM "corphdr">
]>
<bnc>
&BNChdr;
&ABC;
&ABD;
</bnc>

The first line declares that what follows is an SGML document and that the DTD describing it is located in the file with the SYSTEM identifier given (bnc.dtd). The next few lines (the portion within square brackets) comprise the internal DTD subset: declarations here are processed before the content of the DTD itself. It comprises two entity declarations and one parameter entity reference.

The first, for BNCdocs, associates that name with the external entity containing declarations for all the documents making up the BNC itself (i.e. the file bncDocs.ent), and then immediately references that entity. The percent sign is a syntactic convention of XML which need not concern us here: the effect is that each file in the corpus can now be referenced using a name such as &ABC;. In the same way, the next declaration associates the name BNChdr with the file containing the corpus header.

Following this, the driver file contains the XML document itself, beginning with the <bnc> start-tag, and ending with the </bnc> end-tag. Between these tags are entity references, one for the corpus header, followed by one for each file to be included in this view of the corpus.
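Driver files of this kind are regular enough to be generated by program. The following sketch assembles one as a string for a given list of text identifiers. The function name make_driver and its parameters are hypothetical; the default file names are simply those used in the example above and may need adjusting for a particular installation.

```python
def make_driver(text_ids, dtd="bnc.dtd",
                docs_entity_file="bncDocs.ent", header_file="corphdr"):
    """Return the text of a driver file covering the given text identifiers.

    Each identifier is assumed to be declared as an entity in the file named
    by docs_entity_file, so that it can be referenced as &ID; in the body.
    """
    lines = [
        f'<!DOCTYPE bnc SYSTEM "{dtd}" [',
        f'<!ENTITY % BNCdocs SYSTEM "{docs_entity_file}">',
        '%BNCdocs;',
        f'<!ENTITY BNChdr SYSTEM "{header_file}">',
        ']>',
        '<bnc>',
        '&BNChdr;',  # the corpus header always comes first
    ]
    # One entity reference per text to be included in this view of the corpus.
    lines += [f'&{text_id};' for text_id in text_ids]
    lines.append('</bnc>')
    return "\n".join(lines) + "\n"


driver = make_driver(["ABC", "ABD"])
print(driver)
```

The sketch only writes out entity references; whether they resolve depends on the declarations actually present in the entity file.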

7.3. The BNC corpus header

As discussed in section 3. Basic structure above, the BNC consists of an overall corpus header, and a large number of distinct BNC documents, each with its own header. The corpus header must be present for an SGML processor to work with any part of the Corpus, because the corpus header contains declarations of elements (such as the classification records) referred to by almost every part of the corpus.

The various elements making up the header and their functions are discussed in section 5. The header. The corpus header itself is included in the file corphdr. Its contents are reproduced below, reformatted for legibility.

<teiHeader 
type="corpus" 
creator="lou" 
status="new" 
date.updated="2003-11-27" 
id="BNC-BABY">

<fileDesc>

File Description

<titleStmt>

BNC-Baby: a sampling of the British National Corpus

Selection, design, distribution: Research Technologies Service, Oxford University

Creation of original BNC : BNC Consortium

</titleStmt>

<editionStmt 
n="1.0">

First Edition

</editionStmt>

<extent>

Approximately four million words

</extent>

<publicationStmt>

Publication

<distributor>

Oxford University Computing Services

</distributor>

<address>

13 Banbury Road, Oxford OX2 6NN U.K.
Telephone: +44 1865 273221
Facsimile: +44 1865 273275
Internet mail: natcorp@oucs.ox.ac.uk

</address>

<idno 
type="BNC">

BNC-B

</idno>

<availability>

BNC-baby is distributed worldwide by Oxford University Computing Services on a not-for-profit basis, and under the terms of a standard license agreement. Each copy of the corpus must include a copy of this corpus header and any redistribution or republishing of the corpus texts (the "BNC Processed Material") is strictly forbidden.

For information, the conditions of the Standard License Agreement are as follows:

The BNC Consortium grants according to the terms and conditions set out herein and in consideration of the payments specified herein a non-exclusive, non-transferable Licence to the Licensee to use the BNC Processed Material for the purposes of linguistic research and/or the development of language products.
Distribution of the BNC Processed Material is restricted to the Licensee or in the event of the Licensee being an organisation, to the Licensee's research group. This group is defined as consisting only of those Licensee's employees whom the Licensee authorises to perform the work using the BNC Processed Material for the purposes described in paragraph (a).
Members of the said research group must not, except as herein provided, copy, publish or otherwise give to any third party access to the whole or any part of the BNC Processed Material. It is the responsibility of the Licensee to ensure that the members of the said research group understand and abide by this restriction, and to supervise their activities with respect to the BNC Processed Material. Neither the Licensee nor members of the Licensee's said research group may assign, transfer, lease, sell, rent, charge or otherwise encumber the BNC Processed Material.
The BNC Processed Material may be installed at the place or places of work of the said research group. The place of work is defined as the computing systems that the members of a research group normally use to conduct their research activities. It can include both work and home computers, and is not restricted to a particular machine or building.
Copies of the BNC Processed Material may be made for backup purposes, or for the purposes of making data available to members of the research group but the Licensee shall ensure that the BNC Consortium's copyright notice is reproduced on all copies or parts thereof of the BNC Processed Material. Any such copies will be deemed to be part of the BNC Processed Material.
There is no restriction on the use of the Licensee's Results except that the Licensee may not publish in print or electronic form or exploit commercially in any form whatsoever any extracts from the BNC Processed Material other than as permitted under the provisions of the relevant copyright laws.
The BNC Consortium does not grant to the Licensee any rights whatsoever to reproduce the BNC Texts or use all or any part of the BNC Texts in commercial products or services in any way other than would be permitted under the fair dealings provision of copyright law.

</availability>

<date 
value="2003-12-1">

1 December 2003

</date>

</publicationStmt>

<sourceDesc>

Like the British National Corpus from which it is derived, BNC-baby has no single source document. For details of the source or sources used in the creation of each electronic text, see the individual text headers. The principles and practices underlying selection and design of the corpus are documented in the BNC User Reference Guide, a copy of which should be supplied with it.

</sourceDesc>

</fileDesc>

<encodingDesc>

Encoding

<projectDesc>

Goals

The British National Corpus (BNC) Consortium was formed in 1990, and started work in 1991 on the three-year task of producing a hundred-million word corpus of modern British English for use in commercial and academic research. The first edition was published in 1994. The World Edition of the BNC was produced between 1998 and 2000. It contains a thorough revision of the part of speech tagging, several corrections to the headers, and some minor revision of the SGML tagging used. BNC Baby is a principled sampling of four million words chosen from the World Edition in order to represent four key registers of the English language: newspapers, fiction, academic prose, and informal conversation.

The Consortium Participants

The BNC is the result of a unique collaboration between three major U.K. dictionary publishers, two universities, and the British Library. The dictionary publishers are Chambers Harrap, Longman, and Oxford University Press; the universities are The Unit for Computer Research into the English Language (UCREL) at Lancaster University; and Oxford University Computing Services (OUCS).

Funding

The development of the BNC was funded by the commercial partners in the consortium with assistance from the U.K. government's Department of Trade and Industry (DTI) and Science and Engineering Research Council (SERC) under the Joint Framework for Information Technology (JFIT).

Design

The British National Corpus is

A large corpus: its hundred million words are made up of ninety million from written and ten million from spoken sources.
A sample corpus: it is composed of text samples, generally of no more than 40,000 words, rather than of complete works.
A synchronic corpus: it includes imaginative texts dating from the 1960s to 1994; informative texts dating from 1975 to 1994; and spoken texts gathered primarily between 1990 and 1994.
A general corpus: it is not specifically restricted to any particular subject field, register or genre. It includes language from all age and social groups and a broad spread of U.K. regions.
A monolingual British English Corpus: text samples are substantially the product of British English speakers. A small proportion of the words in the corpus are in a foreign language or non-British English.
A TEI-conformant Corpus: texts in the corpus are uniformly marked up according to the recommendations of the Text Encoding Initiative (TEI), an international consortium concerned with the mark-up of texts for use in academic research. These recommendations are an application of the Standard Generalized Markup Language (SGML), defined by International Standard ISO 8879:1986.

Uses

Lexicography: The corpus provides a body of new data on word meaning, grammar and usage. It yields empirical data on word frequencies, word classes and spelling preferences, among other things. It also reveals hitherto undocumented evidence about the spoken language, with consequences that go far beyond the immediate impact on dictionary-writing.
Linguistic research: The corpus provides a standard basis for investigating phenomena and testing competing linguistic theories.
Language technology: Statistical techniques, requiring very large samples of text, are increasingly used in machine translation, speech recognition, speech synthesizers, spelling and grammar checkers for word-processing and desk-top publishing, hand-held electronic books and other developments in information technology.
Teaching: The corpus provides a rich source of examples of current usage for English Language Teaching, allowing more frequent patterns of use to be distinguished from less frequent. In addition, the corpus provides a valuable didactic resource for use in many areas of higher education.
As a model: Future TEI-conformant corpora, in English and other languages, may base their designs on the experience gained in the production of the BNC.

</projectDesc>

<samplingDecl>

Different parts of the BNC were constructed using different sampling policies, as further described in the BNC Design Documentation. The policies are summarized below. Note that information about which policy resulted in the selection of a particular text is not available.

SD000: Published: chosen selectively from candidate population
SD001: Published: chosen at random from candidate population
SD002: Unpublished: chosen according to relevant design criteria
SD003: Spoken: obtained from demographic sample of UK population
SD004: Spoken: obtained in context determined by design criteria

</samplingDecl>

<editorialDecl>

The following editorial policies were applied in creating the BNC. The DECLS attribute indicates which policies apply to a given <text> or <div> element; but not all policies are necessarily marked. Policies are identified by ID codes as follows:

correction policies

Errors tagged with <sic> when seen; no normalization
Errors tagged with <sic> if seen; normalisation with <corr>
Normalized to standard British English or control list member
Corrections and normalizations applied silently

hyphenation policies

Smart elision of line-end hyphens; &rehy used for remainder
Dumb elision of line-end hyphens; true hyphens hand-reinstated
Line-end hyphens removed by hand where appropriate
Source material contains no line-end hyphens

quotation policies

Open, close quote normalized to &bquo, &equo
Open and close quote normalized to &quo
Quotation may be represented using <shift>

Segmentation

In this version of the Corpus, all segmentation and word-class marking was carried out in the same way, using CLAWS5.

Transcription methods

Copy-typed from hard-copy into OUP format; transduced to CDIF
Copy-typed from hard-copy into Longman format; transduced to CDIF
Scanned from hard-copy into OUP format; transduced to CDIF
Scanned from hard-copy into Longman format; transduced to CDIF
Transduced from M-R into OUP format; transduced to CDIF
Transduced from M-R into Longman format; transduced to CDIF
Recording transcribed into Longman format; transduced to CDIF

</editorialDecl>

<tagsDecl>

Tags

<align>: Alignment map for synchronizing overlapped speech
<bibl>: Free format bibliographic citation
<bncDoc>: an individual text in the BNC
<body>: The body of a written text
<c>: A single character, typically punctuation
<caption>: Floating caption in written material
<corr>: An editorial correction
<div>: Spoken text division
<div1>: Written text division, level 1
<div2>: Written text division, level 2
<div3>: Written text division, level 3
<div4>: Written text division, level 4
<event>: Non-verbal event in spoken text
<gap>: Point where source material omitted from electronic text
<head>: Header or headline on written text division
<hi>: Written text highlight indicator
<item>: List item
<l>: Poem or verse line
<label>: List item's label
<lb>: Line break indicator
<lg>: Group of verse lines
<list>: A list
<loc>: Anchor indicating synchronization point
<note>: Editorial or original note pertaining to a text
<p>: Written text paragraph
<pause>: Pause indicator in spoken text
<pb>: Written text page break
<poem>: Poetic or verse material
<ptr>: Pointer from one part of a text to another
<quote>: Written text quoted material indicator
<reg>: Regularizes questionable or incorrectly-spelled material
<s>: Text segment
<salute>: A salutation (as in a letter etc.)
<shift>: Indicates a change of register etc. in spoken material
<sic>: Marks questionable spelling or usage
<sp>: Dramatic written material speech marker
<spkr>: Dramatic written material speaker indicator
<stage>: Dramatic written material stage direction
<stext>: Spoken text
<text>: Written text
<trunc>: Indicates truncated word in spoken material
<u>: Spoken text utterance
<unclear>: Indicates untranscribable material in spoken text
<vocal>: Vocalized non-word in spoken material
<w>: CLAWS-defined word

</tagsDecl>

<refsDecl>

Canonical references in the British National Corpus are to text segment (<s>) elements, and are constructed by taking the value of the n attribute of the <bncDoc> element containing the target text, and concatenating a dot separator, followed by the value of the n attribute of the target <s> element.

Segments are numbered sequentially within each text or stext, starting at 1. There may be gaps in the numeric sequence, as a consequence of post-segmentation corrections.

</refsDecl>

<classDecl>

DLee: David Lee's Register classification as documented at http://members.xoom.com/davidlee00/genre_register.zip

COPAC: Keyword classifications as supplied by the UK's COPAC service at http://copac.ac.uk

allava Text availability

allava0 Ownership has not been claimed

allava1 Worldwide rights cleared

allava2 Worldwide rights cleared

allava3 Not available in North America

allava4 Not available in U.S.A.

allava5 Not available outside the European Union

allava6 Not available in U.S.A. & Philippines

allava7 Not available in N America & Philippines

alltyp Text type

alltyp1 Spoken demographic

alltyp2 Spoken context-governed

alltyp3 Written books and periodicals

alltyp4 Written-to-be-spoken

alltyp5 Written miscellaneous

alltim Publication date

alltim1 1960-1974

alltim2 1975-1984

alltim3 1985-1993

alltim0 Unknown

scgdom Domain for context-governed spoken material

scgdom1 Educational/Informative

scgdom2 Business

scgdom3 Public/Institutional

scgdom4 Leisure

sdeage Age band for demographic respondent

sdeage1 0-14

sdeage2 15-24

sdeage3 25-34

sdeage4 35-44

sdeage5 45-59

sdeage6 60+

sdecla Social class for demographic respondent

sdecla0 Unknown

sdecla1 AB

sdecla2 C1

sdecla3 C2

sdecla4 DE

sdesex Sex of demographic respondent

sdesex0 Unknown

sdesex1 Male

sdesex2 Female

spolog Interaction type for spoken text

spolog1 Monologue

spolog2 Dialogue

sporeg Region where spoken text captured

sporeg0 Unknown

sporeg1 South

sporeg2 Midlands

sporeg3 North

wriaag Author age band for written material

wriaag0 Unknown

wriaag1 0-14

wriaag2 15-24

wriaag3 25-34

wriaag4 35-44

wriaag5 45-59

wriaag6 60+

wriad Author domicile

wriad1 UK and Ireland

wriad2 Commonwealth

wriad3 Continental Europe

wriad4 USA

wriad5 Elsewhere

wriad0 Unknown

wriase Written: author sex

wriase0 Unknown

wriase1 Male

wriase2 Female

wriase3 Mixed

wriase4 Unknown

wriaty Written: type of author

wriaty1 Corporate

wriaty0 Unknown

wriaty2 Multiple

wriaty3 Sole

wriaty4 Unknown

wriaud Written: audience age

wriaud0 Unknown

wriaud1 Child

wriaud2 Teenager

wriaud3 Adult

wriaud4 Any

wridom Domain for written corpus texts

wridom1 Imaginative

wridom2 Informative: natural & pure science

wridom3 Informative: applied science

wridom4 Informative: social science

wridom5 Informative: world affairs

wridom6 Informative: commerce & finance

wridom7 Informative: arts

wridom8 Informative: belief & thought

wridom9 Informative: leisure

wrilev Written: perceived level of difficulty

wrilev0 Unknown

wrilev1 Low

wrilev2 Medium

wrilev3 High

wrimed Medium for written corpus texts

wrimed1 Book

wrimed2 Periodical

wrimed3 Miscellaneous: published

wrimed4 Miscellaneous: unpublished

wrimed5 To-be-spoken

wripp Place of publication

wripp0 Unknown

wripp1 UK (unspecific)

wripp2 Ireland

wripp6 United States

wripp3 UK: North (north of Mersey-Humber line)

wripp4 UK: Midlands (north of Bristol Channel-Wash line)

wripp5 UK: South (south of Bristol Channel-Wash line)

wrisam Written text sample type

wrisam0 Unknown

wrisam1 Whole text

wrisam2 Beginning sample

wrisam3 Middle sample

wrisam4 End sample

wrisam5 Composite

wrista Written: estimated circulation size

wrista0 Unknown

wrista1 Low

wrista2 Medium

wrista3 High

writas Written: target audience sex

writas0 Unknown

writas1 Male

writas2 Female

writas3 Mixed

writas4 Unknown

</classDecl>

</encodingDesc>

<profileDesc>

Profile

<creation>

This version of the corpus contains only texts accessioned on or before 1994-11-04.

</creation>

<langUsage>

The language of the British National Corpus is modern British English. Words, fragments, and passages from many other languages, both ancient and modern, occur within the corpus where these may be represented using a Latin alphabet. Long passages in these languages, and material in other languages, are generally silently deleted. In no case is the lang attribute used to indicate the language of a word, phrase or passage, nor are alternate writing system definitions used.

</langUsage>

<particDesc>

<person 
age="X" 
id="PS000" 
n="W0000" 
role="other" 
sex="u" 
soc="UU">

PS000: Unknown speaker

</person>

<person 
age="X" 
id="PS001" 
n="W000M" 
role="other" 
sex="u" 
soc="UU">

PS001: Group of unknown speakers

</person>

</particDesc>

</profileDesc>

<revisionDesc>

Revision History

2003-11-23: 2003-11-23 ed OUCS
Manual changes for BNC-baby

</revisionDesc>

</teiHeader>
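The canonical reference scheme described in the <refsDecl> part of the header above can be sketched in a few lines. The example below uses the ElementTree module from the Python standard library on an invented fragment; the n attributes on <bncDoc> and <s> follow the description given there, while the sample content and the function name canonical_refs are hypothetical. The missing segment number 3 in the sample mirrors the gaps in numbering that the header warns may occur.

```python
import xml.etree.ElementTree as ET

# Invented fragment for illustration; real texts contain much richer markup.
SAMPLE = (
    '<bncDoc n="ABC"><text>'
    '<s n="1">First segment.</s>'
    '<s n="2">Second segment.</s>'
    '<s n="4">Fourth segment.</s>'
    '</text></bncDoc>'
)


def canonical_refs(xml_text):
    """Return one reference per <s> element: document n, a dot, segment n."""
    doc = ET.fromstring(xml_text)
    doc_n = doc.get("n")
    return [f"{doc_n}.{s.get('n')}" for s in doc.iter("s")]


refs = canonical_refs(SAMPLE)
print(refs)
```

Because the segment numbers are read from the n attribute rather than counted, the references stay correct even where the numeric sequence has gaps.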