BNC

British National Corpus User Reference Guide

6. Miscellaneous code tables

  Author: edited by Lou Burnard (revised LB) Date: (revised 19-22 Nov 2003)


This section consists of a series of tables identifying a number of codes used in various aspects of the corpus and its encoding.

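The following code tables are provided: elements defined by the BNC DTD (6.1), voice quality codes (6.2), regional codes (6.3), text and genre classification codes (6.4), and word class codes (6.5).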

6.1. Elements defined by the BNC DTD

The following list gives a brief description of each element defined in the BNC document type definition (DTD). Elements are listed in alphabetical order. Descriptions prefixed by ‘(H)’ are for elements which appear only in the text headers. Counts are given for elements occurring within texts.

<activity>
(H) participants' activity during recording
<address>
(H) postal or other address
<align>
alignment map for synchronizing overlap points (3461)
<analytic>
(H) analytic bibliographic entry
<author>
(H) author in bibliographic entry
<avail>
(H) availability code for file
<bibl>
loosely structured bibliographic reference (1037)
<biblScope>
(H) page range within bibliographic entry
<biblStruct>
(H) structured bibliographic entry
<bnc>
the corpus itself
<bncDoc>
an individual text within the corpus
<body>
the body of a written text in the corpus (3136)
<c>
a punctuation mark (13620069)
<caption>
a floating heading or caption (89935)
<catDesc>
(H) description of a category
<category>
(H) a category-value pair
<catRef>
(H) category codes applicable to a text
<change>
(H) change note
<classDecl>
(H) description of classification scheme
<corr>
editorial correction (8323)
<creation>
(H) information about creation of a text
<date>
(H) a date
<classCode>
(H) externally-defined classification code for a text
<div>
any subdivision of a spoken text (3779)
<div1>
first-level subdivision of a written text (84777)
<div2>
second-level subdivision of a written text (72697)
<div3>
third-level subdivision of a written text (38122)
<div4>
fourth-level subdivision of a written text (12506)
<editorialDecl>
(H) descriptions of editorial policies
<edition>
(H) edition in a bibliographic entry
<editionStmt>
(H) information about a particular edition
<encodingDesc>
(H) encoding description
<event>
non-verbal event within a spoken text (6565)
<extent>
(H) size of a corpus text
<fileDesc>
(H) documentation of an electronic text
<gap>
a spot where part of source text has been omitted (95959)
<head>
any form of heading or title (222876)
<hi>
typographically highlighted phrase (210927)
<idno>
(H) identifying number for a text
<imprint>
(H) imprint within a bibliographic entry
<item>
item within a list (117207)
<keyWords>
(H) descriptive keywords for topics of a text
<l>
line of verse (51559)
<label>
label of a list item (65664)
<langUsage>
(H) description of languages used in a text
<lb>
line break in printed source (169)
<lg>
group of verse lines (1)
<list>
list of items (19758)
<loc>
synchronisation point within an alignment map (244975)
<locale>
(H) description of a place where speech recorded
<monogr>
(H) monographic bibliographic entry
<name>
(H) name of place where speech recorded
<note>
note or comment of any kind (17206)
<p>
paragraph in written text (1515002)
<particDesc>
(H) description of spoken text participants
<pause>
noticeable pause in spoken text (217916)
<pb>
page break in written text (153642)
<person>
(H) information about a speaker
<poem>
group of verse lines in a written text (3048)
<profileDesc>
(H) additional information about a text
<projectDesc>
(H) background information about BNC project
<ptr>
link to a displaced element or to synchronisation point (578248)
<publicationStmt>
(H) publication or distribution information
<pubPlace>
(H) place of publication within bibliographic entry
<quote>
quotation from some other work (15221)
<recording>
(H) information about a single recording
<recordingStmt>
(H) information about the recordings from which a transcript was made
<refsDecl>
(H) description of reference system used
<reg>
an editorial regularization (8363)
<relation>
(H) relationship between participants in a spoken text
<resp>
(H) nature of responsibility
<respStmt>
(H) statement of responsibility in a bibliographic entry
<revisionDesc>
(H) revision description
<s>
sentence-like linguistic segment (6053093)
<salute>
salutation or greeting (444)
<samplingDecl>
(H) description of sampling policy
<settingDesc>
(H) description of the settings in which speech occurs
<setting>
(H) an individual setting in which speech occurs
<shift>
change in voice quality (36216)
<sic>
apparently erroneous transcription (7797)
<sp>
speech in a written text (29858)
<spkr>
speaker of a speech in a written text (23708)
<sourceDesc>
(H) description of the source for a text
<stage>
stage direction in a written text (508)
<stext>
an individual spoken text (918)
<tagsDecl>
(H) list of tags used in a particular text
<tagUsage>
(H) count for a particular tag in a text
<teiHeader>
meta-information describing a corpus text
<term>
(H) individual term in a list of keywords
<text>
an individual written text (3136)
<title>
(H) title within a bibliographic entry
<titleStmt>
(H) title statement for a text
<trans>
(H) declaration of transcription policy
<trunc>
truncated form in a spoken text (52724)
<textClass>
(H) text classification
<u>
utterance in a spoken text (775799)
<unclear>
inaudible or incomprehensible passage in a spoken text (204239)
<vocal>
non-verbal vocalization in a spoken text (44286)
<w>
POS-tagged lexical item (97619934)
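
Counts like those given in the list above could be reproduced for a single text by tallying element names. The following is a minimal sketch, assuming the text is available as well-formed XML; the sample fragment is invented for illustration (it only borrows element names from the list), and real corpus files carry attributes and header material not shown here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented fragment using element names from the list above; not real corpus data.
SAMPLE = """<bncDoc>
  <text>
    <div1>
      <head><s><w>Codes</w></s></head>
      <p><s><w>Hello</w><c>,</c><w>world</w><c>.</c></s></p>
    </div1>
  </text>
</bncDoc>"""


def count_elements(xml_string):
    """Return a Counter mapping each element name to its number of occurrences."""
    root = ET.fromstring(xml_string)
    # root.iter() visits the root element and every descendant element.
    return Counter(element.tag for element in root.iter())


counts = count_elements(SAMPLE)
for name, n in sorted(counts.items()):
    print(f"<{name}> {n}")
```

Summing such per-text tallies over every file would give corpus-wide figures of the kind shown in parentheses above.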

6.2. Voice quality codes

Changes in voice quality in spoken texts are indicated by values of the new attribute on a <shift> element, placed at the point where the speaker's voice changes. The following values are used in BNC-baby, with the frequency of each given in square brackets:

crying [62]
eating [2]
in a boyish voice [1]
laughing [1512]
laughing+reading [2]
mimicking [9]
mimicking American accent [1]
mimicking Birmingham accent [2]
mimicking Jamaican accent [1]
mimicking an upper class person [1]
mimicking baby voice [1]
praying [1]
reading [610]
reading+laughing [6]
reading+shouting [1]
screaming [47]
shouting [274]
sighing [39]
singing [443]
singing + mimicking [1]
singing+shouting [4]
speaking dramatically [1]
spelling [11]
whingeing [3]
whining [7]
whispering [129]
whispering+laughing [1]
yawning [71]
yawning+reading [1]
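
Several of the values above combine two qualities with a plus sign, sometimes with surrounding spaces. A tally of individual qualities might therefore split compound values first. The sketch below assumes XML input; the utterance is invented, and only the <shift> element and its new attribute are taken from this guide. Skipping a <shift> that has no new value is an assumption made for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented utterance; only <shift> and its new attribute come from the guide.
SAMPLE = """<u>
  <shift new="laughing"/><s><w>no</w></s>
  <shift new="reading+laughing"/><s><w>dear</w><w>sir</w></s>
  <shift new="singing + mimicking"/><s><w>la</w></s>
  <shift/><s><w>right</w></s>
</u>"""


def voice_qualities(xml_string):
    """Count each component quality, splitting compound values on '+'."""
    tally = Counter()
    for shift in ET.fromstring(xml_string).iter("shift"):
        value = shift.get("new")
        if value is None:
            # Assumption: a shift without a value names no quality to count.
            continue
        for part in value.split("+"):
            tally[part.strip()] += 1
    return tally


tally = voice_qualities(SAMPLE)
print(tally.most_common())
```

Counting whole values instead (without the split) would match the way the frequencies are reported in the list above.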

6.3. Regional codes

The following codes are used in BNC-baby to mark perceived speaker dialect, which is specified by the dialect attribute on the <person> element in the text header:

CAN
Canada [1]
XEA
accent: East Anglia [8]
XHC
accent: Home Counties [15]
XIR
accent: Irish [4]
XLC
accent: Lancashire [14]
XLO
accent: London [16]
XMC
accent: central Midlands [7]
XMD
accent: Merseyside [2]
XME
accent: north-east Midlands [19]
XMI
accent: Midlands [1]
XMS
accent: south Midlands [4]
XMW
accent: north-west Midlands [2]
XNC
accent: central northern England [2]
XNE
accent: north-east England [8]
XNO
accent: northern England [9]
XOT
accent: other or unidentifiable [71]
XSD
accent: Scottish [16]
XSL
accent: lower south-west England [16]
XUR
accent: European [4]
XWA
accent: Welsh [17]

6.4. Text and genre classification codes

Texts are classified in several different ways in the original BNC, as described in section 5.3.5. Text classification. Each text carries a number of text classification codes, specified as a string of values on the target attribute of its <catRef> element. Possible values for these codes and their significance are listed in the corpus header (see 7.3. The BNC corpus header). These values are intended for use by any indexing system wishing to partition the whole BNC in a particular way or to extract subcorpora (such as BNC-baby) from it. Distribution tables showing the number of texts, words, and sentences classified under most of them are given above in section 2.5. Design of the BNC World Edition.

As well as these classification codes, which were used during the construction of the corpus, an additional classification is provided for each text as the content of a <classCode> element in its text header, as an alternative way of characterising each text. Full details of the analysis scheme used and its rationale are provided in an article by David Lee (‘Genres, registers, text types and styles: clarifying the concepts and navigating a path through the BNC Jungle’, published in Language Learning and Technology, vol 5 no 3, September 2001). The full range of Lee's classification and the number of texts thus classified are documented in the Users Reference Guide for the BNC World Edition. In BNC-baby only the following classifications are used:

Code Texts
S conv 30
W ac humanities arts 7
W ac medicine 2
W ac nat science 6
W ac polit law edu 6
W ac soc science 7
W ac tech engin 2
W fict prose 27
W newsp brdsht nat arts 9
W newsp brdsht nat commerce 7
W newsp brdsht nat editorial 1
W newsp brdsht nat misc 25
W newsp brdsht nat report 3
W newsp brdsht nat science 5
W newsp brdsht nat social 13
W newsp brdsht nat sports 3
W newsp other arts 3
W newsp other commerce 5
W newsp other report 8
W newsp other science 7
W newsp other social 8
W newsp tabloid 1
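
Each code in the table above is a space-separated sequence that runs from the most general category to the most specific, so texts can be grouped by taking a prefix of the code. The sketch below totals the text counts by the first two components; the function names are hypothetical, and the two-component grouping is just one possible choice.

```python
from collections import Counter

# Rows transcribed from the table above: a classification code, then a text count.
TABLE = """\
S conv 30
W ac humanities arts 7
W ac medicine 2
W ac nat science 6
W ac polit law edu 6
W ac soc science 7
W ac tech engin 2
W fict prose 27
W newsp brdsht nat arts 9
W newsp brdsht nat commerce 7
W newsp brdsht nat editorial 1
W newsp brdsht nat misc 25
W newsp brdsht nat report 3
W newsp brdsht nat science 5
W newsp brdsht nat social 13
W newsp brdsht nat sports 3
W newsp other arts 3
W newsp other commerce 5
W newsp other report 8
W newsp other science 7
W newsp other social 8
W newsp tabloid 1
"""


def parse_row(row):
    """Split a table row into (code, number of texts)."""
    *code_parts, texts = row.split()
    return " ".join(code_parts), int(texts)


def group_key(code, depth=2):
    """Keep only the first `depth` components of a classification code."""
    return " ".join(code.split()[:depth])


totals = Counter()
for row in TABLE.splitlines():
    code, texts = parse_row(row)
    totals[group_key(code)] += texts

for key, n in totals.items():
    print(key, n)
```

Calling group_key with depth=3 would instead separate, for example, broadsheet from other newspaper texts.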

6.5. Word class codes

A full discussion of the principles and practice underlying the CLAWS word class annotation scheme used in the BNC is provided by the document Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging, which is distributed with the BNC World Edition in HTML format.

For convenience, a list of the codes used by this scheme extracted from that manual is also provided here.

POS Usage
AJ0 Adjective (general or positive) (e.g. good, old, beautiful)
AJC Comparative adjective (e.g. better, older)
AJS Superlative adjective (e.g. best, oldest)
AT0 Article (e.g. the, a, an, no)
AV0 General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest)
AVP Adverb particle (e.g. up, off, out)
AVQ Wh-adverb (e.g. when, where, how, why, wherever)
CJC Coordinating conjunction (e.g. and, or, but)
CJS Subordinating conjunction (e.g. although, when)
CJT The subordinating conjunction that
CRD Cardinal number (e.g. one, 3, fifty-five, 3609)
DPS Possessive determiner-pronoun (e.g. your, their, his)
DT0 General determiner-pronoun: i.e. a determiner-pronoun which is not a DTQ or an AT0
DTQ Wh-determiner-pronoun (e.g. which, what, whose, whichever)
EX0 Existential there, i.e. there occurring in the there is ... or there are ... construction
ITJ Interjection or other isolate (e.g. oh, yes, mhm, wow)
NN0 Common noun, neutral for number (e.g. aircraft, data, committee)
NN1 Singular common noun (e.g. pencil, goose, time, revelation)
NN2 Plural common noun (e.g. pencils, geese, times, revelations)
NP0 Proper noun (e.g. London, Michael, Mars, IBM)
ORD Ordinal numeral (e.g. first, sixth, 77th, last)
PNI Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody)
PNP Personal pronoun (e.g. I, you, them, ours)
PNQ Wh-pronoun (e.g. who, whoever, whom)
PNX Reflexive pronoun (e.g. myself, yourself, itself, ourselves)
POS The possessive or genitive marker 's or '
PRF The preposition of
PRP Preposition (except for of) (e.g. about, at, in, on, on behalf of, with)
PUL Punctuation: left bracket - i.e. ( or [
PUN Punctuation: general separating mark - i.e. . , ! , : ; - or ?
PUQ Punctuation: quotation mark - i.e. ' or "
PUR Punctuation: right bracket - i.e. ) or ]
TO0 Infinitive marker to
UNC Unclassified items which are not appropriately considered as items of the English lexicon
VBB The present tense forms of the verb BE, except for is, 's: i.e. am, are, 'm, 're and be [subjunctive or imperative]
VBD The past tense forms of the verb BE: was and were
VBG The -ing form of the verb BE: being
VBI The infinitive form of the verb BE: be
VBN The past participle form of the verb BE: been
VBZ The -s form of the verb BE: is, 's
VDB The finite base form of the verb DO: do
VDD The past tense form of the verb DO: did
VDG The -ing form of the verb DO: doing
VDI The infinitive form of the verb DO: do
VDN The past participle form of the verb DO: done
VDZ The -s form of the verb DO: does, 's
VHB The finite base form of the verb HAVE: have, 've
VHD The past tense form of the verb HAVE: had, 'd
VHG The -ing form of the verb HAVE: having
VHI The infinitive form of the verb HAVE: have
VHN The past participle form of the verb HAVE: had
VHZ The -s form of the verb HAVE: has, 's
VM0 Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd)
VVB The finite base form of lexical verbs (e.g. forget, send, live, return) [Including the imperative and present subjunctive]
VVD The past tense form of lexical verbs (e.g. forgot, sent, lived, returned)
VVG The -ing form of lexical verbs (e.g. forgetting, sending, living, returning)
VVI The infinitive form of lexical verbs (e.g. forget, send, live, return)
VVN The past participle form of lexical verbs (e.g. forgotten, sent, lived, returned)
VVZ The -s form of lexical verbs (e.g. forgets, sends, lives, returns)
XX0 The negative particle not or n't
ZZ0 Alphabetical symbols (e.g. A, a, B, b, c, d)
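
The verb codes in the table above follow a regular pattern: the second letter identifies the verb (B for BE, D for DO, H for HAVE, V for lexical verbs) and the third letter identifies the form. The sketch below decodes a tag on that basis. It is an illustration of the pattern only, not part of the CLAWS scheme; the function name and the wording of the labels are assumptions, and the generic label for the B form is looser than the table's own description of VBB.

```python
# Second letter of a verb tag: which verb is involved.
VERBS = {"B": "BE", "D": "DO", "H": "HAVE", "V": "lexical verb"}

# Third letter of a verb tag: which form of that verb.
FORMS = {
    "B": "finite base form",
    "D": "past tense form",
    "G": "-ing form",
    "I": "infinitive form",
    "N": "past participle form",
    "Z": "-s form",
}


def describe_verb_tag(tag):
    """Return a short gloss for a verb tag, or None if the tag is not a verb tag."""
    if tag == "VM0":
        # Modal auxiliaries have a single tag and do not follow the pattern.
        return "modal auxiliary verb"
    if len(tag) != 3 or tag[0] != "V":
        return None
    verb = VERBS.get(tag[1])
    form = FORMS.get(tag[2])
    if verb is None or form is None:
        return None
    return f"{form} of {verb}"


for tag in ["VBZ", "VDD", "VHN", "VVG", "VM0", "NN1"]:
    print(tag, "->", describe_verb_tag(tag))
```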

In addition to the basic 57 codes tabulated above, the BNC World Edition uses thirty ‘portmanteau’ or ‘ambiguity’ tags. These are applied wherever the probabilities assigned by the CLAWS automatic tagger to its first and second choice tags were considered too low for reliable disambiguation. So, for example, the ambiguity tag AJ0-AV0 indicates that the choice between adjective (AJ0) and adverb (AV0) is left open, although the tagger has a preference for an adjective reading. The mirror tag, AV0-AJ0, again shows adjective-adverb ambiguity, but this time the more likely reading is the adverb.

The following table lists the ambiguity codes used in BNC World:

Ambiguity code / Ambiguous between / More probable tag
AJ0-AV0 / AJ0 or AV0 / AJ0
AJ0-NN1 / AJ0 or NN1 / AJ0
AJ0-VVD / AJ0 or VVD / AJ0
AJ0-VVG / AJ0 or VVG / AJ0
AJ0-VVN / AJ0 or VVN / AJ0
AV0-AJ0 / AV0 or AJ0 / AV0
AVP-PRP / AVP or PRP / AVP
AVQ-CJS / AVQ or CJS / AVQ
CJS-AVQ / CJS or AVQ / CJS
CJS-PRP / CJS or PRP / CJS
CJT-DT0 / CJT or DT0 / CJT
CRD-PNI / CRD or PNI / CRD
DT0-CJT / DT0 or CJT / DT0
NN1-AJ0 / NN1 or AJ0 / NN1
NN1-NP0 / NN1 or NP0 / NN1
NN1-VVB / NN1 or VVB / NN1
NN1-VVG / NN1 or VVG / NN1
NN2-VVZ / NN2 or VVZ / NN2
NP0-NN1 / NP0 or NN1 / NP0
PNI-CRD / PNI or CRD / PNI
PRP-AVP / PRP or AVP / PRP
PRP-CJS / PRP or CJS / PRP
VVB-NN1 / VVB or NN1 / VVB
VVD-AJ0 / VVD or AJ0 / VVD
VVD-VVN / VVD or VVN / VVD
VVG-AJ0 / VVG or AJ0 / VVG
VVG-NN1 / VVG or NN1 / VVG
VVN-AJ0 / VVN or AJ0 / VVN
VVN-VVD / VVN or VVD / VVN
VVZ-NN2 / VVZ or NN2 / VVZ
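
Because an ambiguity code is simply the more probable tag, a hyphen, and the less probable tag, software reading the corpus can recover both readings by splitting on the hyphen. A minimal sketch follows; the function names and the returned tuple layout are choices made for this example.

```python
def split_tag(tag):
    """Split a word class code into (more probable tag, alternative tag or None)."""
    preferred, hyphen, alternative = tag.partition("-")
    return preferred, (alternative if hyphen else None)


def mirror(tag):
    """Return the mirror of an ambiguity tag, e.g. AJ0-AV0 for AV0-AJ0."""
    preferred, alternative = split_tag(tag)
    if alternative is None:
        return tag  # a basic code has no mirror
    return f"{alternative}-{preferred}"


def matches(tag, wanted):
    """True if `wanted` is either of the readings recorded in `tag`."""
    return wanted in split_tag(tag)


for tag in ["AJ0-AV0", "AV0-AJ0", "NN1"]:
    print(tag, split_tag(tag), mirror(tag))
```

A search for adjectives, for example, might use matches(tag, "AJ0") to include words where the adjective reading is only the second choice, or compare against the first element of split_tag alone to be stricter.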
