Miscellaneous code tables
This section consists of a series of tables identifying a number of codes used in various aspects of the corpus and its encoding.
- Elements defined by the BNC DTD lists all SGML elements used in the corpus, with a brief description of each
- Character entities defined by the BNC DTD lists all SGML entities used in the corpus, with a brief description of each
- Division types lists all values actually used in the corpus for the type attribute on division elements (<div1>, <div2> etc.)
- Rendition codes lists all values used in the corpus for the r (rendition) attribute, chiefly on <hi> elements, to indicate typographic rendering of the source
- Voice quality codes lists all values used in the corpus for the new attribute on the <shift> element, to indicate changes in voice quality for spoken texts
- Regional codes lists the codes used to identify regional origins of participants, as specified in the <person> element in the header
- Relationship codes lists the codes used to identify relationships documented between participants, as specified in the <relation> element in the header
- Word class codes lists all part of speech codes in the C5 tagset, used to specify the linguistic category for all <w> and <c> elements
In addition, the list of ‘non-orthographic words’ recognized by the CLAWS system (i.e. multiword items and clitics), which was included in this section in the first edition of this document, is now available in the accompanying Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging by Geoffrey Leech and Nicholas Smith.
The list of text classification codes present in the first edition of this document is now included only as part of the presentation of the corpus header in section ??.
Elements defined by the BNC DTD
In the list below, (H) marks an element used in text or corpus headers; a number in parentheses after a description gives the number of times that element occurs in the current version of the corpus.
- <activity>
- (H) participants' activity during recording
- <address>
- (H) postal or other address
- <align>
- alignment map for synchronizing overlap points (3461)
- <analytic>
- (H) analytic bibliographic entry
- <author>
- (H) author in bibliographic entry
- <avail>
- (H) availability code for file
- <bibl>
- loosely structured bibliographic reference (1037)
- <biblScope>
- (H) page range within bibliographic entry
- <biblStruct>
- (H) structured bibliographic entry
- <bnc>
- the BNC itself
- <bncDoc>
- an individual text in the BNC
- <body>
- the body of a written text in the BNC (3136)
- <c>
- a punctuation mark (13620069)
- <caption>
- a floating heading or caption (89935)
- <catDesc>
- (H) description of a category
- <category>
- (H) a category-value pair
- <catRef>
- (H) category codes applicable to a text
- <change>
- (H) change note
- <classDecl>
- (H) description of classification scheme
- <corr>
- editorial correction (8323)
- <creation>
- (H) information about creation of a text
- <date>
- (H) a date
- <classCode>
- (H) externally-defined classification code for a text
- <div>
- any subdivision of a spoken text (3779)
- <div1>
- first-level subdivision of a written text (84777)
- <div2>
- second-level subdivision of a written text (72697)
- <div3>
- third-level subdivision of a written text (38122)
- <div4>
- fourth-level subdivision of a written text (12506)
- <editorialDecl>
- (H) descriptions of editorial policies
- <edition>
- (H) edition in a bibliographic entry
- <editionStmt>
- (H) information about a particular edition
- <encodingDesc>
- (H) encoding description
- <event>
- non-verbal event within a spoken text (6565)
- <extent>
- (H) size of a corpus text
- <fileDesc>
- (H) documentation of an electronic text
- <gap>
- a spot where part of source text has been omitted (95959)
- <head>
- any form of heading or title (222876)
- <hi>
- typographically highlighted phrase (210927)
- <idno>
- (H) identifying number for a text
- <imprint>
- (H) imprint within a bibliographic entry
- <item>
- item within a list (117207)
- <keyWords>
- (H) descriptive keywords for topics of a text
- <l>
- line of verse (51559)
- <label>
- label of a list item (65664)
- <langUsage>
- (H) description of languages used in a text
- <lb>
- line break in printed source (169)
- <lg>
- group of verse lines (1)
- <list>
- list of items (19758)
- <loc>
- synchronisation point within an alignment map (244975)
- <locale>
- (H) description of a place where speech recorded
- <monogr>
- (H) monographic bibliographic entry
- <name>
- (H) name of place where speech recorded
- <note>
- note or comment of any kind (17206)
- <p>
- paragraph in written text (1515002)
- <particDesc>
- (H) description of spoken text participants
- <pause>
- noticeable pause in spoken text (217916)
- <pb>
- page break in written text (153642)
- <person>
- (H) information about a speaker
- <poem>
- group of verse lines in a written text (3048)
- <profileDesc>
- (H) additional information about a text
- <projectDesc>
- (H) background information about BNC project
- <ptr>
- link to a displaced element or to synchronisation point (578248)
- <publicationStmt>
- (H) publication or distribution information
- <pubPlace>
- (H) place of publication within bibliographic entry
- <quote>
- quotation from some other work (15221)
- <recording>
- (H) information about a single recording
- <recordingStmt>
- (H) information about the recordings from which a transcript was made
- <refsDecl>
- (H) description of reference system used
- <reg>
- an editorial regularization (8363)
- <relation>
- (H) relationship between participants in a spoken text
- <resp>
- (H) nature of responsibility
- <respStmt>
- (H) statement of responsibility in a bibliographic entry
- <revisionDesc>
- (H) revision description
- <s>
- sentence-like linguistic segment (6053093)
- <salute>
- salutation or greeting (444)
- <samplingDecl>
- (H) description of sampling policy
- <settingDesc>
- (H) description of the settings in which speech occurs
- <setting>
- (H) an individual setting in which speech occurs
- <shift>
- change in voice quality (36216)
- <sic>
- apparently erroneous transcription (7797)
- <sp>
- speech in a written text (29858)
- <spkr>
- speaker of a speech in a written text (23708)
- <sourceDesc>
- (H) description of the source for a text
- <stage>
- stage direction in a written text (508)
- <stext>
- an individual spoken text (918)
- <tagsDecl>
- (H) list of tags used in a particular text
- <tagUsage>
- (H) count for a particular tag in a text
- <teiHeader>
- meta-information describing a corpus text
- <term>
- (H) individual term in a list of keywords
- <text>
- an individual written text (3136)
- <title>
- (H) title within a bibliographic entry
- <titleStmt>
- (H) title statement for a text
- <trans>
- (H) declaration of transcription policy
- <trunc>
- truncated form in a spoken text (52724)
- <textClass>
- (H) text classification
- <u>
- utterance in a spoken text (775799)
- <unclear>
- inaudible or incomprehensible passage in a spoken text (204239)
- <vocal>
- non-verbal vocalization in a spoken text (44286)
- <w>
- POS-tagged lexical item (97619934)
Character entities defined by the BNC DTD
The following list gives a brief description of each character entity used within the text of the BNC. Declarations for these entities may be found either in standard entity sets or from the entity definitions supplied as part of the BNC document type definition, in the file BNCents.dtd. In either case, system specific values should be supplied for the characters described below. The number in parentheses indicates the number of times this entity reference appears in the current version of the corpus.
Division types
Rendition codes
The following codes are used to indicate the kind of typographic rendition associated with an element which is typographically distinct in some way. These codes are mostly used as values for the rend attribute of the <hi> element, but may be used on any element bearing this attribute.
More than one value from the above list may occasionally be specified for a single element. In this case, the values are separated by spaces.
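As a minimal sketch of how a multi-valued rendition attribute might be handled, the following Python function splits an attribute value on whitespace. The function name `split_rendition` and the sample values `bo` and `it` are assumptions made for illustration only; consult the rendition code list for the values actually used.

```python
def split_rendition(value):
    """Split a rendition attribute value into its individual codes.

    Codes are separated by spaces, so str.split() with no argument
    (which also tolerates repeated spaces) is sufficient.
    """
    return value.split()


# "bo it" is a placeholder value assumed for this example.
print(split_rendition("bo it"))
```

An element carrying a single code simply yields a one-item list, so callers can treat both cases uniformly.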
Voice quality codes
- cheering
- crying
- eating
- giggling
- humming
- humming the stripper's song
- imitates woman's voice
- imitating a monkey
- imitating a sexy woman's voice
- imitating Chinese voice
- imitating drunken voice
- imitating Italian accent
- imitating man's voice
- imitating posh voice
- imitating woman's voice
- in a boyish voice
- in the distance
- laughing
- laughing+reading
- laughing+shouting
- mimicking
- mimicking American accent
- mimicking American accent from Wayne's World
- mimicking an upper class person
- mimicking baby voice
- mimicking Birmingham accent
- mimicking Chinese speaking
- mimicking Cilla Black's accent
- mimicking crying
- mimicking deep voice
- mimicking Donald Duck
- mimicking finance lady
- mimicking Geordie accent
- mimicking German accent
- mimicking girlie voice
- mimicking Henry Cooper
- mimicking Jamaican accent
- mimicking Manchester accent
- mimicking mentally handicapped
- mimicking northern accent
- mimicking Pakistani accent
- mimicking refined accent
- mimicking Scottish accent
- mimicking stupid man's voice
- mimicking Swedish accent
- mimicking telephone voice
- mimicking the German accent
- mimicking whining
- mimicking witch
- mimicking Yorkshire accent
- mimicking+screaming
- moaning
- mumbling
- muttering
- on telephone
- praying
- quoting
- raising voice
- rapping
- reading
- reading+laughing
- reading+shouting
- reading+whispering
- screaming
- shouting
- shouting+laughing
- shouting+spelling
- sighing
- singing
- singing+laughing
- singing+mimicking
- singing+shouting
- singing+whispering
- singing+yawning
- speaking as if mentally handicapped
- speaking dramatically
- speaking with mouth full
- spelling
- talking with mouth full
- whingeing
- whining
- whispering
- whispering+laughing
- yawning
- yawning+reading
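Several values in the list above join two descriptions with a plus sign, and some pairs occur in both orders (for example laughing+reading and reading+laughing). The following Python sketch assumes that the plus sign separates simultaneous voice qualities and that the order is not significant; neither point is stated in this section, and the function names are hypothetical.

```python
def voice_qualities(value):
    """Return the component descriptions of a voice quality value as a set.

    Assumption: '+' joins simultaneous qualities and order is irrelevant.
    """
    return frozenset(part.strip() for part in value.split("+") if part.strip())


def same_voice_quality(a, b):
    """Compare two values while ignoring the order of their components."""
    return voice_qualities(a) == voice_qualities(b)


print(same_voice_quality("laughing+reading", "reading+laughing"))  # True
```

Under these assumptions, a search for laughing passages could test `"laughing" in voice_qualities(value)` rather than matching the whole string.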
Regional codes
The codes used to mark places of origin, regions, and dialects in the TEI Header are all derived from the same set of ISO 3-letter codes. The codes used are listed here:
- CAN
- Canada
- CHN
- China
- DEU
- Germany
- FRA
- France
- GBR
- United Kingdom
- IND
- India
- IRL
- Ireland
- USA
- United States
- XXX
- Unknown
- ZZG
- Europe
- XDE
- accent: German
- XEA
- accent: East Anglia
- XFR
- accent: French
- XHC
- accent: Home Counties
- XHM
- accent: Humberside
- XIR
- accent: Irish
- XIS
- accent: Indian subcontinent
- XLC
- accent: Lancashire
- XLO
- accent: London
- XMC
- accent: central Midlands
- XMD
- accent: Merseyside
- XME
- accent: north-east Midlands
- XMI
- accent: Midlands
- XMS
- accent: south Midlands
- XMW
- accent: north-west Midlands
- XNC
- accent: central northern England
- XNE
- accent: north-east England
- XNO
- accent: northern England
- XOT
- accent: other or unidentifiable
- XSD
- accent: Scottish
- XSL
- accent: lower south-west England
- XSS
- accent: central south-west England
- XSU
- accent: upper south-west England
- XUR
- accent: European
- XUS
- accent: U.S.A.
- XWA
- accent: Welsh
- XWE
- accent: West Indian
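The table above can be treated as a simple lookup. The Python sketch below copies a handful of entries from it into a dictionary; the names `REGION_CODES`, `describe_region` and `is_accent_code` are hypothetical, and a real application would load the full table.

```python
# A small excerpt of the regional code table (not the complete list).
REGION_CODES = {
    "GBR": "United Kingdom",
    "IRL": "Ireland",
    "XXX": "Unknown",
    "XLO": "accent: London",
    "XSD": "accent: Scottish",
}


def describe_region(code):
    """Return the description for a code, or None if it is not in the excerpt."""
    return REGION_CODES.get(code.upper())


def is_accent_code(code):
    """True when the description identifies an accent rather than a country."""
    description = describe_region(code)
    return description is not None and description.startswith("accent:")


print(describe_region("XLO"))
print(is_accent_code("XXX"))
```

Testing the description rather than the first letter of the code avoids misclassifying XXX (Unknown), which begins with X but is not an accent code.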
Relationship codes
Where relationships between individual participants in spoken texts can be identified, they will be specified by means of the <relation> element within the text header (as discussed in section ??). The type attribute of this element may take any of the values listed below. The number in parentheses indicates the number of times this value appears in the current version of the corpus.
- acquaint
- acquaintance (6)
- audience
- (4)
- aunt
- (8)
- aunt-i-l
- aunt-in-law (1)
- b-friend
- boyfriend (5)
- b-i-l
- brother-in-law (13)
- b-sitter
- baby sitter (2)
- brother
- (53)
- chairman
- (8)
- child
- (2)
- church-m
- church member (1)
- cl-m-i-l
- common law mother-in-law (1)
- client
- (1)
- colleagu
- colleague (123)
- cous-i-l
- cousin-in-law (1)
- cousin
- (7)
- customer
- (3)
- d-i-l
- daughter-in-law (11)
- daughter
- (84)
- doctor
- (77)
- employee
- (4)
- employer
- (9)
- f-i-l
- father-in-law (16)
- father
- (73)
- fiance
- (1)
- fiancee
- (2)
- friend
- (123)
- g-aunt
- great-aunt (1)
- g-daught
- granddaughter (15)
- g-fath
- grandfather (11)
- g-friend
- girlfriend (5)
- g-moth
- grandmother (21)
- g-niece
- great-niece (1)
- g-son
- grandson (17)
- gg-daugh
- great-granddaughter (1)
- gg-moth
- great-grandmother (1)
- hairdres
- hairdresser (1)
- host
- (1)
- housekee
- housekeeper (1)
- husband
- (103)
- intervee
- interviewee (42)
- lecturer
- (4)
- m-i-l
- mother-in-law (21)
- mother
- (117)
- neighbou
- neighbour (13)
- neph-i-l
- nephew-in-law (1)
- nephew
- (7)
- niece
- (9)
- parent
- (5)
- patient
- (76)
- s-daught
- step-daughter (1)
- s-father
- step-father (1)
- secretar
- secretary (3)
- server
- (2)
- sib-i-l
- sibling-in-law (1)
- sibling
- (1)
- sis-i-l
- sister-in-law (12)
- sister
- (48)
- son
- (71)
- son-i-l
- son-in-law (18)
- speaker
- (9)
- stranger
- (13)
- student
- (31)
- teacher
- (26)
- trainee
- (1)
- trainer
- (2)
- tutor
- (4)
- uncle
- (6)
- visitor
- (2)
- wife
- (104)
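In the list above, each code is followed by an optional expansion and a count in parentheses; where no expansion is given, the code itself serves as the description. The following Python sketch reads one such entry. The function name and the regular expression are assumptions about the layout of this list, not part of the corpus encoding.

```python
import re

# Optional gloss, then a count in parentheses at the end of the line.
ENTRY = re.compile(r"^(?P<gloss>.*?)\s*\((?P<count>\d+)\)$")


def parse_relation_entry(code, description):
    """Return (code, gloss, count) for one entry of the relationship list.

    When the description holds only a count, the code doubles as the gloss.
    """
    match = ENTRY.match(description.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {description!r}")
    gloss = match.group("gloss") or code
    return code, gloss, int(match.group("count"))


print(parse_relation_entry("colleagu", "colleague (123)"))
print(parse_relation_entry("aunt", "(8)"))
```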
Text and genre classification codes
Texts are classified in several different ways in the BNC, as described in section ??. Each text carries a number of text classification codes, specified as a string of values on the target attribute of its <catRef> element. Possible values for these codes and their significance are listed in the corpus header (see ??). These values are also used in the BNC indexing files described in section ??, and distribution tables showing the number of texts, words, and sentences classified under most of them are given above in section ??.
One of the codes listed below is also supplied for each text as the content of a <classCode> element in its text header, as an alternative way of characterising each text. Full details of the analysis scheme used and its rationale are provided in an article by David Lee (Genres, registers, text types and styles: clarifying the concepts and navigating a path through the BNC Jungle, to be published in Language Learning and Technology, vol 5 no 3, September 2001) who has also generously agreed to make some of the results of this work available with the current release of the BNC.
Word class codes
A full discussion of the principles and practice underlying the CLAWS word class annotation scheme used in the BNC is provided by the document Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging, which is distributed with the BNC World Edition in HTML format.
Tag | Description |
AJ0 | Adjective (general or positive) (e.g. good, old, beautiful) |
AJC | Comparative adjective (e.g. better, older) |
AJS | Superlative adjective (e.g. best, oldest) |
AT0 | Article (e.g. the, a, an, no) |
AV0 | General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest) |
AVP | Adverb particle (e.g. up, off, out) |
AVQ | Wh-adverb (e.g. when, where, how, why, wherever) |
CJC | Coordinating conjunction (e.g. and, or, but) |
CJS | Subordinating conjunction (e.g. although, when) |
CJT | The subordinating conjunction that |
CRD | Cardinal number (e.g. one, 3, fifty-five, 3609) |
DPS | Possessive determiner-pronoun (e.g. your, their, his) |
DT0 | General determiner-pronoun: i.e. a determiner-pronoun which is not a DTQ or an AT0. |
DTQ | Wh-determiner-pronoun (e.g. which, what, whose, whichever) |
EX0 | Existential there, i.e. there occurring in the there is ... or there are ... construction |
ITJ | Interjection or other isolate (e.g. oh, yes, mhm, wow) |
NN0 | Common noun, neutral for number (e.g. aircraft, data, committee) |
NN1 | Singular common noun (e.g. pencil, goose, time, revelation) |
NN2 | Plural common noun (e.g. pencils, geese, times, revelations) |
NP0 | Proper noun (e.g. London, Michael, Mars, IBM) |
ORD | Ordinal numeral (e.g. first, sixth, 77th, last) |
PNI | Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody) |
PNP | Personal pronoun (e.g. I, you, them, ours) |
PNQ | Wh-pronoun (e.g. who, whoever, whom) |
PNX | Reflexive pronoun (e.g. myself, yourself, itself, ourselves) |
POS | The possessive or genitive marker 's or ' |
PRF | The preposition of |
PRP | Preposition (except for of) (e.g. about, at, in, on, on behalf of, with) |
PUL | Punctuation: left bracket - i.e. ( or [ |
PUN | Punctuation: general separating mark - i.e. . , ! , : ; - or ? |
PUQ | Punctuation: quotation mark - i.e. ' or " |
PUR | Punctuation: right bracket - i.e. ) or ] |
TO0 | Infinitive marker to |
UNC | Unclassified items which are not appropriately considered as items of the English lexicon. |
VBB | The present tense forms of the verb BE, except for is, 's: i.e. am, are, 'm, 're and be [subjunctive or imperative] |
VBD | The past tense forms of the verb BE: was and were |
VBG | The -ing form of the verb BE: being |
VBI | The infinitive form of the verb BE: be |
VBN | The past participle form of the verb BE: been |
VBZ | The -s form of the verb BE: is, 's |
VDB | The finite base form of the verb DO: do |
VDD | The past tense form of the verb DO: did |
VDG | The -ing form of the verb DO: doing |
VDI | The infinitive form of the verb DO: do |
VDN | The past participle form of the verb DO: done |
VDZ | The -s form of the verb DO: does, 's |
VHB | The finite base form of the verb HAVE: have, 've |
VHD | The past tense form of the verb HAVE: had, 'd |
VHG | The -ing form of the verb HAVE: having |
VHI | The infinitive form of the verb HAVE: have |
VHN | The past participle form of the verb HAVE: had |
VHZ | The -s form of the verb HAVE: has, 's |
VM0 | Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd) |
VVB | The finite base form of lexical verbs (e.g. forget, send, live, return) [Including the imperative and present subjunctive] |
VVD | The past tense form of lexical verbs (e.g. forgot, sent, lived, returned) |
VVG | The -ing form of lexical verbs (e.g. forgetting, sending, living, returning) |
VVI | The infinitive form of lexical verbs (e.g. forget, send, live, return) |
VVN | The past participle form of lexical verbs (e.g. forgotten, sent, lived, returned) |
VVZ | The -s form of lexical verbs (e.g. forgets, sends, lives, returns) |
XX0 | The negative particle not or n't |
ZZ0 | Alphabetical symbols (e.g. A, a, B, b, c, d) |
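The verb codes in the table follow a visible pattern: the second character identifies the verb (B for BE, D for DO, H for HAVE, V for lexical verbs) and the third identifies the form. The Python sketch below decodes a verb tag on that basis. It is an observation drawn from the table rather than an official decomposition of the tagset, and the names used are hypothetical.

```python
# Second character of a verb tag: which verb (or class of verb).
VERB_CLASS = {"B": "BE", "D": "DO", "H": "HAVE", "V": "lexical verb"}

# Third character of a verb tag: which form.
VERB_FORM = {
    "B": "finite base (present tense) form",
    "D": "past tense",
    "G": "-ing form",
    "I": "infinitive",
    "N": "past participle",
    "Z": "-s form",
}


def decode_verb_tag(tag):
    """Return (verb, form) for a BE/DO/HAVE/lexical verb tag, else None."""
    if len(tag) != 3 or not tag.startswith("V"):
        return None
    verb = VERB_CLASS.get(tag[1])
    form = VERB_FORM.get(tag[2])
    if verb is None or form is None:
        return None  # e.g. VM0, the modal auxiliary tag
    return verb, form


print(decode_verb_tag("VHD"))
print(decode_verb_tag("VM0"))
```

The modal tag VM0 does not fit the pattern and is deliberately left undecoded.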
In addition to the 57 basic word class codes tabulated above (the table also lists four punctuation codes), the BNC World Edition uses thirty ‘portmanteau’ or ‘ambiguity’ tags. These are applied wherever the probabilities assigned by the CLAWS automatic tagger to its first and second choice tags were considered too low for reliable disambiguation. So, for example, the ambiguity tag AJ0-AV0 indicates that the choice between adjective (AJ0) and adverb (AV0) is left open, although the tagger has a preference for an adjective reading. The mirror tag, AV0-AJ0, again shows adjective-adverb ambiguity, but this time the more likely reading is the adverb.
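The structure of an ambiguity tag can be illustrated with a short Python sketch that separates the preferred reading from the alternative and builds the mirror tag. The function names are hypothetical; the only property relied on is the one described above, namely that the more likely tag comes first and a hyphen separates the two.

```python
def parse_pos_tag(tag):
    """Return (preferred, alternative) for a word class code.

    For a plain tag such as NN1 the alternative is None; for an
    ambiguity tag such as AJ0-AV0 the preferred reading comes first.
    """
    preferred, separator, alternative = tag.partition("-")
    return preferred, (alternative if separator else None)


def mirror_tag(tag):
    """Swap the two readings of an ambiguity tag; plain tags are unchanged."""
    preferred, alternative = parse_pos_tag(tag)
    if alternative is None:
        return tag
    return f"{alternative}-{preferred}"


print(parse_pos_tag("AJ0-AV0"))  # ('AJ0', 'AV0')
print(mirror_tag("AJ0-AV0"))     # AV0-AJ0
```

A search that should match every possible adjective could then accept any tag whose preferred or alternative reading is AJ0.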