British National Corpus User Reference Guide
6. Miscellaneous code tables
Author: edited by Lou Burnard (revised LB) Date: (revised 19-22 Nov 2003)
This section consists of a series of tables identifying a number of codes used in various aspects of the corpus and its encoding.
The following code tables are provided:
The following list gives a brief description of each element defined in the BNC document type definition (DTD). Elements are listed in alphabetical order. Descriptions prefixed by ‘(H)’ are for elements which appear only in the text headers. Counts are given for elements occurring within texts.
Changes in voice quality in spoken texts are indicated by values of the new attribute on a <shift> element, placed at the point where the speaker's voice changes. The following values are used in BNC-baby, with the frequency of each given in square brackets:
crying [62] | eating [2] |
in a boyish voice [1] | laughing [1512] |
laughing+reading [2] | mimicking [9] |
mimicking American accent [1] | mimicking Birmingham accent [2] |
mimicking Jamaican accent [1] | mimicking an upper class person [1] |
mimicking baby voice [1] | praying [1] |
reading [610] | reading+laughing [6] |
reading+shouting [1] | screaming [47] |
shouting [274] | sighing [39] |
singing [443] | singing + mimicking [1] |
singing+shouting [4] | speaking dramatically [1] |
spelling [11] | whingeing [3] |
whining [7] | whispering [129] |
whispering+laughing [1] | yawning [71] |
yawning+reading [1] |
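A frequency list like the one above can be rebuilt by counting the values of the new attribute on <shift> elements. The following is a minimal sketch, assuming the transcription is available as well-formed XML. The sample fragment, its <u> and <s> wrappers, the speaker identifier, and the function name count_voice_qualities are invented for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: the wrapper elements, speaker id and wording are
# invented; only <shift> and its "new" attribute come from the text above.
SAMPLE = """<u who="PS000">
  <s>well <shift new="laughing"/>I never did</s>
  <s><shift new="reading"/>once upon a time</s>
  <s><shift new="laughing"/>oh dear</s>
</u>"""


def count_voice_qualities(xml_text):
    """Count how often each voice quality value occurs on <shift> elements."""
    root = ET.fromstring(xml_text)
    return Counter(
        shift.get("new")
        for shift in root.iter("shift")
        if shift.get("new") is not None  # skip shifts that carry no value
    )


counts = count_voice_qualities(SAMPLE)
for value, n in counts.most_common():
    # Same "value [frequency]" layout as the table above
    print(f"{value} [{n}]")
```

Combined values such as reading+laughing are counted here as distinct strings, which matches the way they are listed in the table.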
The following codes are used in BNC-baby to mark perceived speaker dialect, which is specified by the dialect attribute on the <person> element in the text header:
Texts are classified in several different ways in the original BNC, as described in section 5.3.5. Text classification. Each text carries a number of text classification codes, specified as a string of values on the target attribute of its <catRefs> element. Possible values for these codes and their significance are listed in the corpus header (see 7.3. The BNC corpus header). These values are intended for use by any indexing system that needs to partition the whole BNC in a particular way or to extract subcorpora (such as BNC-baby) from it. Distribution tables showing the number of texts, words, and sentences classified under most of them are given above in section 2.5. Design of the BNC World Edition.
As well as these classification codes, which were used during the construction of the corpus, an additional classification is provided for each text as the content of a <classCode> element in its text header, as an alternative way of characterising each text. Full details of the analysis scheme used and its rationale are provided in an article by David Lee (‘Genres, registers, text types and styles: clarifying the concepts and navigating a path through the BNC Jungle’, published in Language Learning and Technology, vol 5 no 3, September 2001). The full range of Lee's classification, and the number of texts thus classified, are documented in the Users Reference Guide for the BNC World Edition. In BNC-baby only the following classifications are used:
Code | Texts |
S conv | 30 |
W ac humanities arts | 7 |
W ac medicine | 2 |
W ac nat science | 6 |
W ac polit law edu | 6 |
W ac soc science | 7 |
W ac tech engin | 2 |
W fict prose | 27 |
W newsp brdsht nat arts | 9 |
W newsp brdsht nat commerce | 7 |
W newsp brdsht nat editorial | 1 |
W newsp brdsht nat misc | 25 |
W newsp brdsht nat report | 3 |
W newsp brdsht nat science | 5 |
W newsp brdsht nat social | 13 |
W newsp brdsht nat sports | 3 |
W newsp other arts | 3 |
W newsp other commerce | 5 |
W newsp other report | 8 |
W newsp other science | 7 |
W newsp other social | 8 |
W newsp tabloid | 1 |
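Because each code in the table above is a space-separated sequence running from general to specific, texts can be grouped at a coarser level by truncating the code. The sketch below is illustrative only: it copies the counts from the table, and the names CLASS_CODE_COUNTS and tally_by_prefix are hypothetical.

```python
from collections import defaultdict

# Counts copied from the table above (classification code -> number of texts).
CLASS_CODE_COUNTS = {
    "S conv": 30,
    "W ac humanities arts": 7,
    "W ac medicine": 2,
    "W ac nat science": 6,
    "W ac polit law edu": 6,
    "W ac soc science": 7,
    "W ac tech engin": 2,
    "W fict prose": 27,
    "W newsp brdsht nat arts": 9,
    "W newsp brdsht nat commerce": 7,
    "W newsp brdsht nat editorial": 1,
    "W newsp brdsht nat misc": 25,
    "W newsp brdsht nat report": 3,
    "W newsp brdsht nat science": 5,
    "W newsp brdsht nat social": 13,
    "W newsp brdsht nat sports": 3,
    "W newsp other arts": 3,
    "W newsp other commerce": 5,
    "W newsp other report": 8,
    "W newsp other science": 7,
    "W newsp other social": 8,
    "W newsp tabloid": 1,
}


def tally_by_prefix(counts, depth=2):
    """Sum text counts over the first `depth` components of each code."""
    totals = defaultdict(int)
    for code, n in counts.items():
        prefix = " ".join(code.split()[:depth])
        totals[prefix] += n
    return dict(totals)


totals = tally_by_prefix(CLASS_CODE_COUNTS)
for prefix, n in sorted(totals.items()):
    print(f"{prefix}: {n}")
```

Passing depth=3 separates the newspaper texts into broadsheet, other, and tabloid groups instead.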
A full discussion of the principles and practice underlying the CLAWS word class annotation scheme used in the BNC is provided by the document Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging, which is distributed with the BNC World Edition in HTML format.
For convenience, a list of the codes used by this scheme extracted from that manual is also provided here.
POS | Usage |
AJ0 | Adjective (general or positive) (e.g. good, old, beautiful) |
AJC | Comparative adjective (e.g. better, older) |
AJS | Superlative adjective (e.g. best, oldest) |
AT0 | Article (e.g. the, a, an, no) |
AV0 | General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest) |
AVP | Adverb particle (e.g. up, off, out) |
AVQ | Wh-adverb (e.g. when, where, how, why, wherever) |
CJC | Coordinating conjunction (e.g. and, or, but) |
CJS | Subordinating conjunction (e.g. although, when) |
CJT | The subordinating conjunction that |
CRD | Cardinal number (e.g. one, 3, fifty-five, 3609) |
DPS | Possessive determiner-pronoun (e.g. your, their, his) |
DT0 | General determiner-pronoun: i.e. a determiner-pronoun which is not a DTQ or an AT0 |
DTQ | Wh-determiner-pronoun (e.g. which, what, whose, whichever) |
EX0 | Existential there, i.e. there occurring in the there is ... or there are ... construction |
ITJ | Interjection or other isolate (e.g. oh, yes, mhm, wow) |
NN0 | Common noun, neutral for number (e.g. aircraft, data, committee) |
NN1 | Singular common noun (e.g. pencil, goose, time, revelation) |
NN2 | Plural common noun (e.g. pencils, geese, times, revelations) |
NP0 | Proper noun (e.g. London, Michael, Mars, IBM) |
ORD | Ordinal numeral (e.g. first, sixth, 77th, last) |
PNI | Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody) |
PNP | Personal pronoun (e.g. I, you, them, ours) |
PNQ | Wh-pronoun (e.g. who, whoever, whom) |
PNX | Reflexive pronoun (e.g. myself, yourself, itself, ourselves) |
POS | The possessive or genitive marker 's or ' |
PRF | The preposition of |
PRP | Preposition (except for of) (e.g. about, at, in, on, on behalf of, with) |
PUL | Punctuation: left bracket - i.e. ( or [ |
PUN | Punctuation: general separating mark - i.e. . , ! , : ; - or ? |
PUQ | Punctuation: quotation mark - i.e. ' or " |
PUR | Punctuation: right bracket - i.e. ) or ] |
TO0 | Infinitive marker to |
UNC | Unclassified items which are not appropriately considered as items of the English lexicon |
VBB | The present tense forms of the verb BE, except for is, 's: i.e. am, are, 'm, 're and be [subjunctive or imperative] |
VBD | The past tense forms of the verb BE: was and were |
VBG | The -ing form of the verb BE: being |
VBI | The infinitive form of the verb BE: be |
VBN | The past participle form of the verb BE: been |
VBZ | The -s form of the verb BE: is, 's |
VDB | The finite base form of the verb DO: do |
VDD | The past tense form of the verb DO: did |
VDG | The -ing form of the verb DO: doing |
VDI | The infinitive form of the verb DO: do |
VDN | The past participle form of the verb DO: done |
VDZ | The -s form of the verb DO: does, 's |
VHB | The finite base form of the verb HAVE: have, 've |
VHD | The past tense form of the verb HAVE: had, 'd |
VHG | The -ing form of the verb HAVE: having |
VHI | The infinitive form of the verb HAVE: have |
VHN | The past participle form of the verb HAVE: had |
VHZ | The -s form of the verb HAVE: has, 's |
VM0 | Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd) |
VVB | The finite base form of lexical verbs (e.g. forget, send, live, return) [including the imperative and present subjunctive] |
VVD | The past tense form of lexical verbs (e.g. forgot, sent, lived, returned) |
VVG | The -ing form of lexical verbs (e.g. forgetting, sending, living, returning) |
VVI | The infinitive form of lexical verbs (e.g. forget, send, live, return) |
VVN | The past participle form of lexical verbs (e.g. forgotten, sent, lived, returned) |
VVZ | The -s form of lexical verbs (e.g. forgets, sends, lives, returns) |
XX0 | The negative particle not or n't |
ZZ0 | Alphabetical symbols (e.g. A, a, B, b, c, d) |
In addition to the basic 57 codes tabulated above, the BNC World Edition uses thirty ‘portmanteau’ or ‘ambiguity’ tags. These are applied wherever the probabilities assigned by the CLAWS automatic tagger to its first and second choice tags were considered too low for reliable disambiguation. So, for example, the ambiguity tag AJ0-AV0 indicates that the choice between adjective (AJ0) and adverb (AV0) is left open, although the tagger has a preference for an adjective reading. The mirror tag, AV0-AJ0, again shows adjective-adverb ambiguity, but this time the more likely reading is the adverb.
The following table lists the ambiguity codes used in BNC World:
Ambiguity code | Ambiguous between | More probable tag |
AJ0-AV0 | AJ0 or AV0 | AJ0 |
AJ0-NN1 | AJ0 or NN1 | AJ0 |
AJ0-VVD | AJ0 or VVD | AJ0 |
AJ0-VVG | AJ0 or VVG | AJ0 |
AJ0-VVN | AJ0 or VVN | AJ0 |
AV0-AJ0 | AV0 or AJ0 | AV0 |
AVP-PRP | AVP or PRP | AVP |
AVQ-CJS | AVQ or CJS | AVQ |
CJS-AVQ | CJS or AVQ | CJS |
CJS-PRP | CJS or PRP | CJS |
CJT-DT0 | CJT or DT0 | CJT |
CRD-PNI | CRD or PNI | CRD |
DT0-CJT | DT0 or CJT | DT0 |
NN1-AJ0 | NN1 or AJ0 | NN1 |
NN1-NP0 | NN1 or NP0 | NN1 |
NN1-VVB | NN1 or VVB | NN1 |
NN1-VVG | NN1 or VVG | NN1 |
NN2-VVZ | NN2 or VVZ | NN2 |
NP0-NN1 | NP0 or NN1 | NP0 |
PNI-CRD | PNI or CRD | PNI |
PRP-AVP | PRP or AVP | PRP |
PRP-CJS | PRP or CJS | PRP |
VVB-NN1 | VVB or NN1 | VVB |
VVD-AJ0 | VVD or AJ0 | VVD |
VVD-VVN | VVD or VVN | VVD |
VVG-AJ0 | VVG or AJ0 | VVG |
VVG-NN1 | VVG or NN1 | VVG |
VVN-AJ0 | VVN or AJ0 | VVN |
VVN-VVD | VVN or VVD | VVN |
VVZ-NN2 | VVZ or NN2 | VVZ |
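Since an ambiguity code is simply two basic codes joined by a hyphen, with the more probable reading first, software that reads the tags can treat basic and ambiguity codes uniformly. The following is a minimal sketch of that idea. The function names parse_pos_tag and mirror_tag are hypothetical, and no check is made that the parts are valid codes from the tables above.

```python
def parse_pos_tag(tag):
    """Split a POS code into (more_probable, alternative).

    For a basic code such as "NN1" the alternative is None. For an
    ambiguity code such as "AJ0-AV0" the first part is the reading the
    tagger prefers and the second part is the other candidate.
    """
    first, sep, second = tag.partition("-")
    return (first, second if sep else None)


def mirror_tag(tag):
    """Return the mirror of an ambiguity code, e.g. AJ0-AV0 -> AV0-AJ0.

    Basic codes are returned unchanged.
    """
    preferred, alternative = parse_pos_tag(tag)
    if alternative is None:
        return tag
    return f"{alternative}-{preferred}"


for tag in ("NN1", "AJ0-AV0", "VVD-VVN"):
    preferred, alternative = parse_pos_tag(tag)
    if alternative is None:
        print(f"{tag}: unambiguous")
    else:
        print(f"{tag}: {preferred} preferred over {alternative}")
```

An application that needs a single tag per word can keep only the first element of the returned pair, at the cost of discarding the tagger's uncertainty.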