British National Corpus User Reference Guide

7. Software for BNC-baby
Author: edited by Lou Burnard (revised LB) Date: (revised 19-22 Nov 2003)
A design goal of the original BNC project was that it should not be delivered in a format which was proprietary or which required the use of any particular piece of software. This, together with the desire to conform to emerging international standards, was a key factor in determining the choice of SGML as the vehicle for the corpus interchange format. Six years after this decision, SGML is still a widely used international standard format for which many public domain and commercial utilities exist. Indeed, in the shape of XML, which is a simplified version of the original standard, SGML now dominates development of the world wide web, and hence of most sectors of the information processing community. New XML software appears almost every week, and it has been adopted by current `major players' from Sun and IBM to Microsoft.
That said, it must be recognised that the requirements of corpus linguists and others wishing to make use of the BNC are often rather specialist, and therefore unlikely to be supported by mainstream commercially produced software. For this and other reasons, the research user of the BNC should expect to have to do some programming. This is another reason behind the choice of SGML or XML as a vehicle for the system: because of the wide take up of these formalisms, there exist many utility libraries and generic programming interfaces which greatly simplify such processes as extracting the tags from a file, selecting portions of the text according to its logical structure, picking out files with certain attributes by searching their headers, and so on.
BNC-baby uses XML in a simple and straightforward way described in the rest of this manual; simple programs can be readily written using standard UNIX utilities such as grep or perl to access the corpus just as plain text files. More reliably, programs can be written to application programming interfaces (APIs) such as the W3C's Document Object Model (DOM) or the Simple API for XML (SAX), using application libraries developed for almost every modern programming language (C, Perl, Python, tcl etc.). Information about such resources is not provided here, but is readily found on the World Wide Web: currently, one good place to start looking is www.xml.com.
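As an illustration of the API-based approach, the following minimal Python sketch uses the standard library's SAX interface to count how often each element occurs. The inline sample document is hypothetical and much smaller than a real corpus file; the <w> element in particular is an assumption made for the example, not something documented in this section.

```python
import xml.sax

# Hypothetical miniature document; real BNC-baby files are far larger,
# and the <w> element used here is an assumption for illustration only.
SAMPLE = b"""<bncDoc id="ABC">
  <text>
    <s n="1"><w>Hello</w> <w>world</w></s>
    <s n="2"><w>Goodbye</w></s>
  </text>
</bncDoc>"""


class TagCounter(xml.sax.ContentHandler):
    """Count how often each element name occurs in the stream."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once for every start-tag the parser encounters.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_tags(data):
    """Parse XML bytes with SAX and return a dict of element counts."""
    handler = TagCounter()
    xml.sax.parseString(data, handler)
    return handler.counts


counts = count_tags(SAMPLE)
print(counts)
```

Because SAX streams through the input rather than building a tree in memory, the same pattern should scale to larger files better than a DOM-based approach would.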
When the BNC was first published, the top of the range personal computer might have as much as 50 or even 100 megabytes of disk storage and 8 Mb of RAM. At the time of writing, 20 or 30 gigabyte hard disks and 128 Mb of RAM are commonplace on entry level machines. It is thus quite likely that software capable of efficiently handling the 1.5 gigabytes of text which make up the BNC will also soon become commonplace. For the moment, however, it has to be recognized that general purpose tools for SGML and XML do not always cope very well with the large size of the whole corpus, although they can still be very useful for processing subsets extracted from it. To handle the whole of the corpus, special purpose indexing software will usually be necessary. Although such systems exist, they are often expensive or difficult to implement. For that reason, the BNC Project also developed its own low-cost alternative, the SARA package, which is documented separately. It should be emphasized, however, that use of the BNC is not synonymous with use of SARA. Most generic tools developed for corpus linguistics and NLP can be used with the BNC, although the tools may vary in the extent to which they can make use of the markup in the corpus.
Whatever software is used, the programmer must have a clear understanding of the various elements tagged in the corpus, the contexts in which they may appear, and their intended semantics. The syntax of an XML document is defined by a schema or a Document Type Definition. For TEI conformant texts, the TEI Header provides additional meta-information. The semantics of the elements encoded in an XML document are supplied by documentation such as that found elsewhere in this manual.
The corpus is delivered as a collection of 128 individual text files, grouped into four subdirectories, one for each of the text registers making up the corpus.
Each file contains a single BNC document, i.e. a TEI header and its associated spoken or written text, and has the same name as the value of the id attribute on its <bncDoc> element.
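The naming convention just described can be checked mechanically. The sketch below is a hedged example, not a supplied tool: it walks a corpus directory and reports any file whose name differs from the id attribute on its root element. It assumes that each file parses as a standalone XML document and that files carry an .xml extension, as in the driver examples; the throwaway directory and its contents are invented for the demonstration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def mismatched_files(corpus_root):
    """Return paths whose file stem differs from the root element's id."""
    bad = []
    # One level of subdirectories (one per register), then the text files.
    for path in sorted(Path(corpus_root).glob("*/*.xml")):
        root = ET.parse(path).getroot()
        if root.get("id") != path.stem:
            bad.append(path)
    return bad


# Demonstration on a throwaway directory with hypothetical contents.
with tempfile.TemporaryDirectory() as tmp:
    fic = Path(tmp) / "fic"
    fic.mkdir()
    (fic / "ABC.xml").write_text('<bncDoc id="ABC"/>', encoding="utf-8")
    (fic / "ABD.xml").write_text('<bncDoc id="XYZ"/>', encoding="utf-8")
    problems = [p.name for p in mismatched_files(tmp)]

print(problems)
```

On a real installation, calling mismatched_files with the path of the corpus directory would be expected to return an empty list.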
Note that the three-character identifiers used (and hence the directory structure) are entirely arbitrary and do not convey any information about the type of text contained. Each text contains a TEI Header which specifies all such meta information, either directly, or by reference to the corpus header, as described in section 5. The header.
Some ancillary files relating to the encoding and processing of the corpus are included in the standard release in a subdirectory called Work. This contains the following files:
The remainder of this section discusses how these files may be used together as an XML document. This is by no means the only way of processing the corpus, of course, and is intended solely to demonstrate the function of the various files listed above. Some basic understanding of XML is assumed.
To process a single text from the corpus (say, text fic/ABC), a driver file like the following might be used:
<!DOCTYPE bnc SYSTEM "http://www.tei-c.org/Guidelines/DTD/tei2.dtd" [
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.spoken "INCLUDE">
<!ENTITY % TEI.general "INCLUDE">
<!ENTITY % TEI.analysis "INCLUDE">
<!ENTITY % TEI.corpus "INCLUDE">
<!ENTITY % TEI.extensions.ent SYSTEM "bncMods.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "bncMods.dtd">
<!ENTITY corphdr SYSTEM "corphdr">
<!ENTITY text SYSTEM "fic/ABC.xml">
]>
<bnc>
&corphdr;
&text;
</bnc>
This driver assumes that the standard TEI DTD is available from the URL given (which was true as of the date of this manual), and that the files from the BNC-baby distribution have been installed under /home/BNC-baby. Alternatively, if the driver file is to be used offline, using the `compiled' version of the BNC dtd, it might look like the following:
<!DOCTYPE bnc SYSTEM "bnc.dtd" [
<!ENTITY BNChdr SYSTEM "corphdr">
<!ENTITY text SYSTEM "fic/ABC.xml">
]>
<bnc>
&BNChdr;
&text;
</bnc>
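A driver of this kind only works if every entity referenced between the <bnc> tags has been declared in the DOCTYPE subset. The following Python sketch shows one possible way of checking this before handing the driver to a parser. It is a simple regular-expression check rather than a real XML parser, and it only sees declarations written directly in the internal subset: entities declared in an external file pulled in by a parameter entity would be reported as undeclared. The DRIVER string and the function name are illustrative assumptions.

```python
import re

# Illustrative driver text, using the entity names from the offline example.
DRIVER = """<!DOCTYPE bnc SYSTEM "bnc.dtd" [
<!ENTITY BNChdr SYSTEM "corphdr">
<!ENTITY text SYSTEM "fic/ABC.xml">
]>
<bnc>
&BNChdr;
&text;
</bnc>"""

# Entities that every XML processor knows without a declaration.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}


def undeclared_entities(driver):
    """Return general entity names referenced in the body but never declared."""
    # Everything before "]>" is the DOCTYPE and its internal subset.
    subset, _, body = driver.partition("]>")
    # Parameter entities ("<!ENTITY % name ...") are skipped by this pattern.
    declared = set(re.findall(r"<!ENTITY\s+([\w.-]+)\s", subset))
    referenced = set(re.findall(r"&([\w.-]+);", body))
    return sorted(referenced - declared - PREDEFINED)


print(undeclared_entities(DRIVER))
```

An empty result means that each reference in the body has a matching declaration; any names returned point to a reference that an XML parser would reject.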
To process more than one file from the corpus, a set of declarations like the one given above for the entity text would be necessary, one for each text concerned. For convenience, a file containing such declarations for every text in the corpus is also provided: this file, bncDocs.ent, consists of declarations like the following:
<!ENTITY ABC SYSTEM "fic/ABC.xml">
<!ENTITY ABD SYSTEM "fic/ABD.xml">
With these declarations in force, it becomes possible to refer to the corpus file ABC simply by means of the entity reference &ABC;, as in the following example:
<!DOCTYPE bnc SYSTEM "bnc.dtd" [
<!ENTITY % BNCdocs SYSTEM "bncDocs.ent">
%BNCdocs;
<!ENTITY BNChdr SYSTEM "corphdr">
]>
<bnc>
&BNChdr;
&ABC;
&ABD;
</bnc>
The first line declares that what follows is an XML document of type bnc and that the DTD describing it is located in the file with the SYSTEM identifier given (bnc.dtd). The next few lines (the portion within square brackets) comprise the internal DTD subset: declarations here are processed before those in the external DTD. The subset contains two entity declarations and one parameter entity reference.
The first, for BNCdocs, associates that name with the external entity containing declarations for all the documents making up the BNC itself (i.e. the file bncDocs.ent), and then immediately references that entity. The percent sign is a syntactic convention of XML which need not concern us here: the effect is that each file in the corpus can now be referenced using a name such as &ABC;. In the same way, the next declaration associates the name BNChdr with the file containing the corpus header.
Following this, the driver file contains the XML document itself, beginning with the <bnc> start-tag, and ending with the </bnc> end-tag. Between these tags are entity references, one for the corpus header, followed by one for each file to be included in this view of the corpus.
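Since the driver has such a regular shape, it can be generated rather than written by hand. The function below is a minimal sketch of that idea in Python. The default file names are those used in the example above; the function name and its parameters are illustrative assumptions rather than part of the distribution.

```python
def make_driver(text_ids, dtd="bnc.dtd", docs_ent="bncDocs.ent", header="corphdr"):
    """Build a driver document that pulls in the header and the given texts."""
    # One entity reference per requested text, e.g. "&ABC;".
    refs = "\n".join(f"&{text_id};" for text_id in text_ids)
    return (
        f'<!DOCTYPE bnc SYSTEM "{dtd}" [\n'
        f'<!ENTITY % BNCdocs SYSTEM "{docs_ent}">\n'
        "%BNCdocs;\n"
        f'<!ENTITY BNChdr SYSTEM "{header}">\n'
        "]>\n"
        "<bnc>\n"
        "&BNChdr;\n"
        f"{refs}\n"
        "</bnc>\n"
    )


driver = make_driver(["ABC", "ABD"])
print(driver)
```

The resulting string can be written to a file and passed to any validating XML parser that resolves external entities relative to the driver's location.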
As discussed in section 3. Basic structure above, the BNC consists of an overall corpus header, and a large number of distinct BNC documents, each with its own header. The corpus header must be present for an XML processor to work with any part of the corpus, because the corpus header contains declarations of elements (such as the classification records) referred to by almost every part of the corpus.
The various elements making up the header and their functions are discussed in section 5. The header. The corpus header itself is included in the file corphdr. Its contents are reproduced below, reformatted for legibility.
<teiHeader type="corpus" creator="lou" status="new" date.updated="2003-11-27" id="BNC-BABY">
<fileDesc>
<titleStmt>
BNC-Baby: a sampling of the British National Corpus
Selection, design, distribution: Research Technologies Service, Oxford University
Creation of original BNC : BNC Consortium
</titleStmt>
<editionStmt n="1.0">
First Edition
</editionStmt>
<extent>
Approximately four million words
</extent>
<publicationStmt>
<distributor>
Oxford University Computing Services
</distributor>
<address>
13 Banbury Road, Oxford OX2 6NN U.K.
Telephone: +44 1865 273221
Facsimile: +44 1865 273275
Internet mail: natcorp@oucs.ox.ac.uk
</address>
<idno type="BNC">
BNC-B
</idno>
<availability>
BNC-baby is distributed worldwide by Oxford University Computing Services on a not-for-profit basis, and under the terms of a standard license agreement. Each copy of the corpus must include a copy of this corpus header and any redistribution or republishing of the corpus texts (the "BNC Processed Material") is strictly forbidden.
For information, the conditions of the Standard License Agreement are as follows:
</availability>
<date value="2003-12-1">
1 December 2003
</date>
</publicationStmt>
<sourceDesc>
Like the British National Corpus from which it is derived, BNC-baby has no single source document. For details of the source or sources used in the creation of each electronic text, see the individual text headers. The principles and practices underlying selection and design of the corpus are documented in the BNC User Reference Guide, a copy of which should be supplied with it.
</sourceDesc>
</fileDesc>
<encodingDesc>
<projectDesc>
</projectDesc>
<samplingDecl>
Different parts of the BNC were constructed using different sampling policies, as further described in the BNC Design Documentation. The policies are summarized below. Note that information about which policy resulted in the selection of a particular text is not available.
Published: chosen selectively from candidate population
Published: chosen at random from candidate population
Unpublished: chosen according to relevant design criteria
Spoken: obtained from demographic sample of UK population
Spoken: obtained in context determined by design criteria
</samplingDecl>
<editorialDecl>
The following editorial policies were applied in creating the BNC. The DECLS attribute indicates which policies apply to a given <text> or <div> element; but not all policies are necessarily marked. Policies are identified by ID codes as follows:
</editorialDecl>
<tagsDecl>
</tagsDecl>
<refsDecl>
Canonical references in the British National Corpus are to text segment (<s>) elements, and are constructed by taking the value of the n attribute of the <bncDoc> element containing the target text, and concatenating a dot separator, followed by the value of the n attribute of the target <s> element.
Segments are numbered sequentially within each text or stext, starting at 1. There may be gaps in the numeric sequence, as a consequence of post-segmentation corrections.
</refsDecl>
<classDecl>
DLee: David Lee's Register classification as documented at http://members.xoom.com/davidlee00/genre_register.zip
COPAC: Keyword classifications as supplied by the UK's COPAC service at http://copac.ac.uk
:
allava Text availability
allava0 Ownership has not been claimed
allava1 Worldwide rights cleared
allava2 Worldwide rights cleared
allava3 Not available in North America
allava4 Not available in U.S.A.
allava5 Not available outside the European Union
allava6 Not available in U.S.A. & Philippines
allava7 Not available in N America & Philippines
alltyp Text type
alltyp1 Spoken demographic
alltyp2 Spoken context-governed
alltyp3 Written books and periodicals
alltyp4 Written-to-be-spoken
alltyp5 Written miscellaneous
alltim Publication date
alltim1 1960-1974
alltim2 1975-1984
alltim3 1985-1993
alltim0 Unknown
scgdom Domain for context-governed spoken material
scgdom1 Educational/Informative
scgdom2 Business
scgdom3 Public/Institutional
scgdom4 Leisure
sdeage Age band for demographic respondent
sdeage1 0-14
sdeage2 15-24
sdeage3 25-34
sdeage4 35-44
sdeage5 45-59
sdeage6 60+
sdecla Social class for demographic respondent
sdecla0 Unknown
sdecla1 AB
sdecla2 C1
sdecla3 C2
sdecla4 DE
sdesex Sex of demographic respondent
sdesex0 Unknown
sdesex1 Male
sdesex2 Female
spolog Interaction type for spoken text
spolog1 Monologue
spolog2 Dialogue
sporeg Region where spoken text captured
sporeg0 Unknown
sporeg1 South
sporeg2 Midlands
sporeg3 North
wriaag Author age band for written material
wriaag0 Unknown
wriaag1 0-14
wriaag2 15-24
wriaag3 25-34
wriaag4 35-44
wriaag5 45-59
wriaag6 60+
wriad Author domicile
wriad1 UK and Ireland
wriad2 Commonwealth
wriad3 Continental Europe
wriad4 USA
wriad5 Elsewhere
wriad0 Unknown
wriase Written: author sex
wriase0 Unknown
wriase1 Male
wriase2 Female
wriase3 Mixed
wriase4 Unknown
wriaty Written: type of author
wriaty1 Corporate
wriaty0 Unknown
wriaty2 Multiple
wriaty3 Sole
wriaty4 Unknown
wriaud Written: audience age
wriaud0 Unknown
wriaud1 Child
wriaud2 Teenager
wriaud3 Adult
wriaud4 Any
wridom Domain for written corpus texts
wridom1 Imaginative
wridom2 Informative: natural & pure science
wridom3 Informative: applied science
wridom4 Informative: social science
wridom5 Informative: world affairs
wridom6 Informative: commerce & finance
wridom7 Informative: arts
wridom8 Informative: belief & thought
wridom9 Informative: leisure
wrilev Written: perceived level of difficulty:
wrilev0 Unknown
wrilev1 Low
wrilev2 Medium
wrilev3 High
wrimed Medium for written corpus texts
wrimed1 Book
wrimed2 Periodical
wrimed3 Miscellaneous: published
wrimed4 Miscellaneous: unpublished
wrimed5 To-be-spoken
wripp Place of publication
wripp0 Unknown
wripp1 UK (unspecific)
wripp2 Ireland
wripp6 United States
wripp3 UK: North (north of Mersey-Humber line)
wripp4 UK: Midlands (north of Bristol Channel-Wash line)
wripp5 UK: South (south of Bristol Channel-Wash line)
wrisam Written text sample type
wrisam0 Unknown
wrisam1 Whole text
wrisam2 Beginning sample
wrisam3 Middle sample
wrisam4 End sample
wrisam5 Composite
wrista Written: estimated circulation size
wrista0 Unknown
wrista1 Low
wrista2 Medium
wrista3 High
writas Written: target audience sex
writas0 Unknown
writas1 Male
writas2 Female
writas3 Mixed
writas4 Unknown
</classDecl>
</encodingDesc>
<profileDesc>
<creation>
This version of the corpus contains only texts accessioned on or before 1994-11-04.
</creation>
<langUsage>
The language of the British National Corpus is modern British English. Words, fragments, and passages from many other languages, both ancient and modern, occur within the corpus where these may be represented using a Latin alphabet. Long passages in these languages, and material in other languages, are generally silently deleted. In no case is the lang attribute used to indicate the language of a word, phrase or passage, nor are alternate writing system definitions used.
</langUsage>
<particDesc>
<person age="X" id="PS000" n="W0000" role="other" sex="u" soc="UU">
PS000: Unknown speaker
</person>
<person age="X" id="PS001" n="W000M" role="other" sex="u" soc="UU">
PS001: Group of unknown speakers
</person>
</particDesc>
</profileDesc>
<revisionDesc>
Manual changes for BNC-baby
</revisionDesc>
</teiHeader>
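The canonical reference scheme set out in the <refsDecl> element of the header above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's own implementation: the sample document is invented, and it follows the wording of the <refsDecl> in reading the document identifier from the n attribute of <bncDoc>.

```python
import xml.etree.ElementTree as ET

# Invented sample; note the gap in the numbering, which the header
# says may arise from post-segmentation corrections.
SAMPLE = """<bncDoc n="ABC">
  <text>
    <s n="1">First segment.</s>
    <s n="3">Third segment; number 2 was removed by a correction.</s>
  </text>
</bncDoc>"""


def canonical_refs(xml_text):
    """Map each canonical reference to the text of its <s> element."""
    doc = ET.fromstring(xml_text)
    doc_n = doc.get("n")
    # Reference = document identifier + dot separator + segment number.
    return {f"{doc_n}.{s.get('n')}": "".join(s.itertext()) for s in doc.iter("s")}


refs = canonical_refs(SAMPLE)
print(sorted(refs))
```

Because segment numbers may have gaps, the references are built from the n values actually present rather than from a running counter.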