BNC User Reference Guide

6 Wordclass Tagging in BNC XML

This section of the User Reference Guide is derived from the Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging originally prepared for the BNC World edition by Geoffrey Leech and Nicholas Smith at the University of Lancaster.

6.1 Introduction

The wordclass tagging has not changed significantly between the BNC World edition (2001) and the BNC XML edition (2006). In particular, no attempt has been made to completely retag the corpus, desirable though this might be. Changes have been made in the treatment of multiword units and some additional annotation has been provided (see Additional annotation in BNC XML), but in most respects the wordclass information provided by the corpus now is identical to that provided with the first release of the BNC in 1994.

The BNC is wordclass-tagged using a set of 57 tags (known as C5) which we refer to as the "BNC Basic Tagset". (There are also 4 punctuation tags, excluded from consideration here.) Each C5 tag represents a grammatical class of words represented by a three character code such as NN1 for "singular common noun". The codes are, in many cases, mnemonic.

The BNC, consisting of c.100 million words, was tagged automatically, using the CLAWS4 automatic tagger developed by Roger Garside at Lancaster, and a second program, known as Template Tagger, developed by Mike Pacey and Steve Fligelstone. Further details are given below, and also in Garside and Leech 1997 chapters 7-9. With such a large corpus, there was no opportunity to undertake post-editing, i.e. disambiguation and correction of tagging errors produced by the automatic tagger, and so the errors (about 1.15 per cent of all words) remain. In addition, the corpus contains ambiguous taggings (c.3.75 per cent of all words), shown in the form of ambiguity tags (also called ‘portmanteau tags’), consisting of two C5 tags linked by a hyphen: e.g. VVD-VVN. These tags indicate that the automatic tagger was unable to determine, with sufficient confidence, which was the correct category, and so left two possibilities for users to disambiguate themselves, if they should wish to do so. For example, in the case of VVD-VVN, the first (more likely) tag, say for a word such as wanted, is VVD: past tense of lexical verb; and the second (less likely) tag is VVN: past participle of lexical verb. On the whole, the likelihood of the first tag of an ambiguity tag being correct is better than 3 to 1; see, however, details of individual tags in Table 23, Estimated ambiguity and error rates for the whole corpus (fine-grained calculation), of the error report document.

After the automatic tagging, some manual tagging was undertaken to correct some particularly blatant errors, mainly foreign or classical words embedded in English text. CLAWS is not very successful at detecting these foreign words and tagging them with their appropriate tag (UNC), except when they form part of established expressions such as ad hoc or nom de plume - in which case they are normally given tags appropriate to their grammatical function, e.g. as nouns or adverbs.

The main purpose of the report on estimated error rates is to document the rather small percentage of ambiguities and errors remaining in the tagged BNC, so that users of the corpus can assess the accuracy of the tagging for their own purposes. Since, not surprisingly, we have been unable to inspect each of the 100 million tags in the BNC, we have had to estimate ambiguity rates and error rates on the basis of a manual post-editing of a corpus sample of 50,000 words. The estimate is based on twenty-four 2,000-word text extracts and two 1,000-word extracts, selected so as to be as far as possible representative of the whole corpus.
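The figures quoted in this introduction can be cross-checked with a little arithmetic. The sketch below is illustrative only: the constant and function names are our own, the corpus size is the approximate "c.100 million" given above, and the rates are expressed per 10,000 words so that the calculation stays in whole numbers.

```python
# Figures quoted in the text above; all names here are illustrative.
CORPUS_WORDS = 100_000_000   # "c.100 million words" (approximate)
ERROR_PER_10K = 115          # about 1.15 per cent of all words
AMBIGUITY_PER_10K = 375      # c.3.75 per cent of all words


def sample_size(extracts):
    """Total words in a sample, given (number_of_extracts, words_each) pairs."""
    return sum(count * words for count, words in extracts)


def expected_count(total_words, rate_per_10k):
    """Approximate number of words affected, using integer arithmetic."""
    return total_words * rate_per_10k // 10_000


# Twenty-four 2,000-word extracts plus two 1,000-word extracts.
sample = sample_size([(24, 2_000), (2, 1_000)])
print(sample)                                           # 50000
print(expected_count(CORPUS_WORDS, ERROR_PER_10K))      # 1150000
print(expected_count(CORPUS_WORDS, AMBIGUITY_PER_10K))  # 3750000
```

On these rough figures, a corpus of this size would contain something over a million mistagged words and several million ambiguity tags, which is why the estimated rates for individual tags matter to users.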

6.2 Tokenization: splitting the text into words

Regarding the segmentation of a text into individual word-tokens (called tokenization), our tagging practice in general follows the default assumption that an orthographic word (separated by spaces from adjacent words, with or without punctuation) is the appropriate unit for wordclass tagging. There are, however, exceptions to this. For example, a single orthographic word may consist of more than one grammatical word: in the case of enclitic verb contractions (as in she’s, they’ll, we’re) and negative contractions (as in don’t, isn’t, won’t), it is appropriate to assign two different wordclass tags to the same orthographic word. A full list of such contracted forms recognized by CLAWS and preserved in the XML markup is given in section 9.7 Contracted forms and multiwords.

Also quite frequent is the opposite circumstance, where two or more orthographic words are given a single wordclass tag: e.g. multiword adverbs such as of course and in short, and multiword prepositions such as instead of and up to are each assigned a single word tag (AV0 for adverbs, PRP for prepositions). Sometimes, whether such orthographic sequences are to be treated as a single word for tagging purposes depends on the context and its interpretation. In short is in some circumstances not an adverb but a sequence of preposition + adjective (e.g. in short sharp bursts). Up to in some contexts needs to be treated as a sequence of two grammatical words: adverbial-particle + preposition-or-infinitive-marker (e.g. We had to phone her up to get the code.).

In the BNC XML edition, these multiword units are marked using an additional XML element (<mw>) which carries the wordclass assigned to the whole sequence. Within the <mw> element, the individual orthographic words are also marked, using the <w> element in the same way as elsewhere. For example, the multiword unit of course is marked up as follows:
<mw c5="AV0"> <w c5="PRF" hw="of" pos="PREP">of </w> <w c5="NN1" hw="course" pos="SUBST">course </w> </mw>
Wordclass tags for the constituent words of multiword units were automatically inserted using the table reproduced in 9.7 Contracted forms and multiwords; there may therefore be residual errors in their usage.
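The nesting of <w> elements inside <mw> can be read with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name read_multiword is our own invention, and the fragment is simply the of course example quoted above.

```python
import xml.etree.ElementTree as ET

# The <mw> fragment quoted above, as a Python string.
fragment = (
    '<mw c5="AV0"> <w c5="PRF" hw="of" pos="PREP">of </w> '
    '<w c5="NN1" hw="course" pos="SUBST">course </w> </mw>'
)


def read_multiword(xml_text):
    """Return (text, tag of the whole unit, [(word, tag), ...]) for one <mw>."""
    mw = ET.fromstring(xml_text)
    # Each <w> keeps its own c5 tag; trailing spaces are part of the element text.
    parts = [(w.text.strip(), w.get("c5")) for w in mw.findall("w")]
    text = " ".join(word for word, _ in parts)
    return text, mw.get("c5"), parts


unit_text, unit_tag, unit_parts = read_multiword(fragment)
print(unit_text, unit_tag)  # of course AV0
print(unit_parts)           # [('of', 'PRF'), ('course', 'NN1')]
```

A user interested only in the grammatical function of the sequence would take the c5 value on <mw> (here AV0); one interested in the individual orthographic words would take the values on the inner <w> elements.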

In one respect, we have allowed the orthographic occurrence of spaces to be criterial. This is in the tagging of compound words such as markup, mark-up and mark up. Since English orthographic practice is often variable in such matters, the same ‘compound’ expression may occur in the corpus tagged as two words (if they are separated by spaces) or as one word (if the sequence is printed solid or with a hyphen). Thus mark up (as a noun) will be tagged NN1 AVP, whereas markup or mark-up will be tagged simply NN1.

6.3 Tagging Guidelines and Borderline Cases

Many detailed decisions have to be made in deciding how to draw the line between the correct and the incorrect assignment of a tag. So that the concept of what is a ‘correct’ or ‘accurate’ annotation can be determined, there have to be detailed guidelines of tagging practice. These constitute the Wordclass Tagging Guidelines.

The Guidelines have to give much attention to borderline phenomena, where the distinction between (say) an adjective and a verb participle in -ing is unclear, and to clarify criteria for differentiating them. To promote consistency of tagging practice, the guidelines may even impose somewhat arbitrary dividing lines between one word class and another. Consider the case of a word such as rising, which may be a present participle form of a verb (VVG), an adjective (AJ0) or a singular common noun (NN1). The difference may be illustrated by the three examples:
  • Oil prices are rising again. (verb, VVG)
  • the rising sun (adjective, AJ0)
  • the attempted rising was put down (noun, NN1)

The assignment of an example of ‘Verb+ing’ to the adjective category relies heavily on a semantic criterion, viz. the ability to paraphrase Verb+ing Noun by ‘Noun + Relative Clause that/which/who be Verb+ing’ or ‘that/which/who Verb(s)’ (e.g. the rising sun = the sun which is/was rising; a working mother = a mother who works). These contrast with a case such as dining table, where the first word dining is judged to be a noun. The reason for this is that the paraphrasable meaning of the expression is not ‘a table which is/was dining or dines’, but rather ‘a table (used) for dining’. Although somewhat arbitrary, this relative clause test is well established in English grammatical literature, and such criteria are useful in enabling a reasonable degree of consistency in tagging practice to be achieved, so that the success rate of corpus tagging can be checked and evaluated. (See further Adjective vs. noun)

It also has to be recognized that some borderline cases may occasionally have to be considered unresolvable. We may conclude, for example, that the word Hatching (occurring as a heading on its own, without any syntactic context) could be equally well analysed VVG or NN1, and in such a case one would be tempted to leave the ambiguity (VVG-NN1) in the corpus, showing uncertainty where any grammarian would be likely to acknowledge it. However, in our calculations of ambiguity, we have adhered to the common assumption that ideally, all tags should be correctly disambiguated. Other examples of unresolvability from the sample texts are:
  • the importance of weaving in the East (verb or noun? - VVG-NN1)
  • Armed with the knowledge (past participle verb or adjective? - VVN-AJ0)
  • the Lord is my shepherd (common noun or proper noun? - NN1-NP0)

In practice, in our post-edited sample, we chose the first tag to be correct in these cases.

6.4 Ambiguity tags, and the principle of asymmetry

As in the first version of the BNC, we have introduced only a limited number of ambiguity tags, to deal with particular cases where the tagger has difficulty in distinguishing two categories, and where incorrect taggings would otherwise result rather frequently. Ambiguity tags involve only the following 18 wordclass labels, and each of the ambiguity tags allows only two labels to be named:
  • AJ0 general adjective (positive)
  • AV0 general adverb
  • AVP adverbial particle
  • AVQ wh- adverb
  • CJS general subordinator
  • CJT subordinator: that
  • CRD cardinal numeral
  • DT0 determiner-pronoun
  • NN1 singular common noun
  • NN2 plural common noun
  • NP0 proper noun
  • PNI indefinite pronoun
  • PRP general preposition
  • VVB lexical verb: finite base form
  • VVD lexical verb: past tense
  • VVG lexical verb: present participle (-ing form)
  • VVN lexical verb: past participle
  • VVZ lexical verb: -s form

The permitted ambiguity tags are listed in the Wordclass tagging guidelines (Ambiguity Tag list).

It will be noted that overall 30 ambiguity tags are recognized. We also observe that each ambiguity tag (eg VVD-VVN) is matched by another ambiguity tag which is its mirror image (eg VVN-VVD). The ordering of tags is significant: it is the first of the two tags which is estimated by the tagger to be the more likely. Hence the interpretation of an ambiguity tag X-Y may be expressed as follows: ‘There is not sufficient confidence to choose between tags X and Y; however, X is considered to be more likely.’
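The ordering convention lends itself to simple mechanical handling. The sketch below is only an illustration, with function names of our own choosing: it splits an ambiguity tag X-Y into its more likely and less likely readings, and builds the mirror-image tag.

```python
def split_ambiguity_tag(tag):
    """Split 'X-Y' into (more_likely, less_likely).

    A plain (unambiguous) tag such as 'NN1' gives (tag, None).
    """
    first, sep, second = tag.partition("-")
    return (first, second) if sep else (first, None)


def mirror(tag):
    """Return the mirror image of an ambiguity tag, e.g. VVD-VVN -> VVN-VVD."""
    first, second = split_ambiguity_tag(tag)
    if second is None:
        raise ValueError(f"not an ambiguity tag: {tag!r}")
    return f"{second}-{first}"


print(split_ambiguity_tag("VVD-VVN"))  # ('VVD', 'VVN')
print(mirror("VVD-VVN"))               # VVN-VVD
print(split_ambiguity_tag("NN1"))      # ('NN1', None)
```

A user who prefers to work without ambiguity tags could, for instance, keep only the first element of each pair, accepting the tagger's more likely reading.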

6.5 Guidelines to the Wordclass Tagging

6.5.1 Preliminaries

The BNC basic tagset
For completeness, we begin by listing the C5 tagset used throughout the BNC, followed by the ambiguity codes used:
Tag Description
AJ0 Adjective (general or positive) (e.g. good, old, beautiful)
AJC Comparative adjective (e.g. better, older)
AJS Superlative adjective (e.g. best, oldest)
AT0 Article (e.g. the, a, an, no)
AV0 General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest)
AVP Adverb particle (e.g. up, off, out)
AVQ Wh-adverb (e.g. when, where, how, why, wherever)
CJC Coordinating conjunction (e.g. and, or, but)
CJS Subordinating conjunction (e.g. although, when)
CJT The subordinating conjunction that
CRD Cardinal number (e.g. one, 3, fifty-five, 3609)
DPS Possessive determiner-pronoun (e.g. your, their, his)
DT0 General determiner-pronoun: i.e. a determiner-pronoun which is not a DTQ or an AT0.
DTQ Wh-determiner-pronoun (e.g. which, what, whose, whichever)
EX0 Existential there, i.e. there occurring in the there is ... or there are ... construction
ITJ Interjection or other isolate (e.g. oh, yes, mhm, wow)
NN0 Common noun, neutral for number (e.g. aircraft, data, committee)
NN1 Singular common noun (e.g. pencil, goose, time, revelation)
NN2 Plural common noun (e.g. pencils, geese, times, revelations)
NP0 Proper noun (e.g. London, Michael, Mars, IBM)
ORD Ordinal numeral (e.g. first, sixth, 77th, last)
PNI Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody)
PNP Personal pronoun (e.g. I, you, them, ours)
PNQ Wh-pronoun (e.g. who, whoever, whom)
PNX Reflexive pronoun (e.g. myself, yourself, itself, ourselves)
POS The possessive or genitive marker 's or '
PRF The preposition of
PRP Preposition (except for of) (e.g. about, at, in, on, on behalf of, with)
PUL Punctuation: left bracket - i.e. ( or [
PUN Punctuation: general separating mark - i.e. . , ! , : ; - or ?
PUQ Punctuation: quotation mark - i.e. ' or "
PUR Punctuation: right bracket - i.e. ) or ]
TO0 Infinitive marker to
UNC Unclassified items which are not appropriately considered as items of the English lexicon.
VBB The present tense forms of the verb BE, except for is, 's: i.e. am, are, 'm, 're and be [subjunctive or imperative]
VBD The past tense forms of the verb BE: was and were
VBG The -ing form of the verb BE: being
VBI The infinitive form of the verb BE: be
VBN The past participle form of the verb BE: been
VBZ The -s form of the verb BE: is, 's
VDB The finite base form of the verb DO: do
VDD The past tense form of the verb DO: did
VDG The -ing form of the verb DO: doing
VDI The infinitive form of the verb DO: do
VDN The past participle form of the verb DO: done
VDZ The -s form of the verb DO: does, 's
VHB The finite base form of the verb HAVE: have, 've
VHD The past tense form of the verb HAVE: had, 'd
VHG The -ing form of the verb HAVE: having
VHI The infinitive form of the verb HAVE: have
VHN The past participle form of the verb HAVE: had
VHZ The -s form of the verb HAVE: has, 's
VM0 Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd)
VVB The finite base form of lexical verbs (e.g. forget, send, live, return) [Including the imperative and present subjunctive]
VVD The past tense form of lexical verbs (e.g. forgot, sent, lived, returned)
VVG The -ing form of lexical verbs (e.g. forgetting, sending, living, returning)
VVI The infinitive form of lexical verbs (e.g. forget, send, live, return)
VVN The past participle form of lexical verbs (e.g. forgotten, sent, lived, returned)
VVZ The -s form of lexical verbs (e.g. forgets, sends, lives, returns)
XX0 The negative particle not or n't
ZZ0 Alphabetical symbols (e.g. A, a, B, b, c, d)

Total number of wordclass tags in the BNC basic tagset = 57, plus 4 punctuation tags

Ambiguity Tag list

In addition, there are 30 "Ambiguity Tags". These are applied wherever the probabilities assigned by the CLAWS automatic tagger to its first and second choice tags were considered too low for reliable disambiguation. So, for example, the ambiguity tag AJ0-AV0 indicates that the choice between adjective (AJ0) and adverb (AV0) is left open, although the tagger has a preference for an adjective reading. The mirror tag, AV0-AJ0, again shows adjective-adverb ambiguity, but this time the more likely reading is the adverb.

Ambiguity tag Ambiguous between More probable tag
AJ0-NN1 AJ0 or NN1 AJ0
AV0-AJ0 AV0 or AJ0 AV0
NN1-AJ0 NN1 or AJ0 NN1
NN1-NP0 NN1 or NP0 NN1
NP0-NN1 NP0 or NN1 NP0

Total number of wordclass tags including punctuation and ambiguity tags = 91.

Appearance of wordclass tags and citations

Throughout this section, we will show text examples in a format which is different from the XML contained in the corpus but which will highlight the particular tag that is being discussed. The XML tagging (for example, paragraph and pause markers) is not generally relevant to the present discussion and is usually invisible when using concordancing software such as Xaira, BNCWeb, or WordSmith.

As noted above, each word in the corpus is marked by an XML <w> element which provides three additional pieces of information: the wordclass, carried by the c5 attribute; a headword or lemma derived from the word, carried by the hw attribute; and a simplified wordclass derived from the c5 value, carried by the pos attribute.

In the XML source therefore, we will see sentences like this:
<w c5="AV0" hw="apparently" pos="ADV">apparently </w> <w c5="PNP" hw="we" pos="PRON">we </w> <w c5="VVB" hw="eat" pos="VERB">eat </w> <w c5="DT0" hw="more" pos="ADJ">more </w> <w c5="NN1" hw="chocolate" pos="SUBST">chocolate </w> <w c5="CJS" hw="than" pos="CONJ">than </w> <w c5="DT0" hw="any" pos="ADJ">any </w> <w c5="AJ0" hw="other" pos="ADJ">other </w> <w c5="NN1" hw="country" pos="SUBST">country</w> <c c5="PUN">.</c>
For simplicity of discussion throughout this section we have chosen not to present examples in this way, but instead to suppress the bulk of the XML markup. We have preserved only the wordclass attribute of the word (or words) in question, and placed it after the word it relates to in the example sentences. Under subordinating conjunctions, for instance, the citation above appears as follows:

...apparently we eat more chocolate than_CJS any other country. [G3U.1000]

This is purely as an aid to reading the present document; in the corpus itself, all wordclass tagging is represented using the XML conventions shown above.
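The relationship between the two presentations can be made concrete with a short script. This is a sketch under stated assumptions rather than the procedure actually used to prepare this document: the function name citation is hypothetical, and the sample sentence has been wrapped in an <s> element so that it parses as a single XML fragment.

```python
import xml.etree.ElementTree as ET

# The sentence shown above, wrapped in <s> so that it has a single root.
sentence = (
    '<s>'
    '<w c5="AV0" hw="apparently" pos="ADV">apparently </w>'
    '<w c5="PNP" hw="we" pos="PRON">we </w>'
    '<w c5="VVB" hw="eat" pos="VERB">eat </w>'
    '<w c5="DT0" hw="more" pos="ADJ">more </w>'
    '<w c5="NN1" hw="chocolate" pos="SUBST">chocolate </w>'
    '<w c5="CJS" hw="than" pos="CONJ">than </w>'
    '<w c5="DT0" hw="any" pos="ADJ">any </w>'
    '<w c5="AJ0" hw="other" pos="ADJ">other </w>'
    '<w c5="NN1" hw="country" pos="SUBST">country</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def citation(xml_text, tags_to_show):
    """Flatten an <s> element to plain text, appending _TAG only to words
    whose c5 value is in tags_to_show."""
    root = ET.fromstring(xml_text)
    pieces = []
    for el in root.iter():
        if el.tag not in ("w", "c"):
            continue  # skip the <s> wrapper itself
        text = el.text or ""
        word = text.rstrip()
        trailing = text[len(word):]  # keep the original spacing
        if el.tag == "w" and el.get("c5") in tags_to_show:
            word = f"{word}_{el.get('c5')}"
        pieces.append(word + trailing)
    return "".join(pieces)


print(citation(sentence, {"CJS"}))
# apparently we eat more chocolate than_CJS any other country.
```

Passing a different set of tags (for example {"NN1"}) highlights a different wordclass in the same sentence.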
As noted above, any example from the BNC can be identified by means of the text identifier (a three character code such as G3U) and the number of the <s> element within it. We use this method throughout the following examples, where they are taken from the BNC. Thus, the example above is taken from s-unit 1000 of text G3U. In sections 6.5.9 Disambiguation Guide and Disambiguation by Word below, we occasionally cite cases where the POS-tagging in the corpus does not match the tag given in the citation, in that it is either an error or an ambiguity tag. This is to give an idea of the contexts in which the resolution of ambiguities has been less reliable. We list the tag found in the corpus next to the file reference with an asterisk, e.g. in well we give the ideal tag as VVB, but the actual tag as AV0:

Tears well_VVB up in my eyes.[BN3.5 *AV0]
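For readers who wish to process these bracketed references automatically, the following sketch shows one way to take them apart. The regular expression and the name parse_reference are our own; the pattern assumes a three-character text identifier, an s-unit number, and an optional asterisked tag as found in the corpus.

```python
import re

# [TEXT.N] or [TEXT.N *TAG]; TAG may be an ambiguity tag such as VVD-VVN.
REFERENCE = re.compile(r"\[([A-Z0-9]{3})\.(\d+)(?:\s+\*([A-Z0-9-]+))?\]")


def parse_reference(ref):
    """Return (text_id, s_unit_number, corpus_tag_or_None)."""
    match = REFERENCE.fullmatch(ref.strip())
    if match is None:
        raise ValueError(f"not a BNC reference: {ref!r}")
    text_id, s_unit, corpus_tag = match.groups()
    return text_id, int(s_unit), corpus_tag


print(parse_reference("[G3U.1000]"))    # ('G3U', 1000, None)
print(parse_reference("[BN3.5 *AV0]"))  # ('BN3', 5, 'AV0')
```

A third value other than None signals that the tag shown in the citation is the ideal tag, and that the corpus itself carries the asterisked tag.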

Note also that we occasionally use invented examples, rather than corpus citations, especially where a contrast between categories is being made.
Appearance and tagging of contracted forms
Contracted forms — including enclitics, eg he's, she'll, negatives eg don't and can't, and 'fused words', eg wanna and gimme — are broken down by the tagger into their component parts, with each part being assigned its own tags. No spaces are introduced in POS-tagged contracted words:

doesn't = does_VDZ n't_XX0
dunno = Du_VDB n_XX0 no_VVI
wanna = wan_VVB na_TO0 or wan_VVB na_AT0
gimme = Gim_VVB me_PNP

This procedure sometimes results in strange-looking word divisions, particularly with the fused words. However, they do provide a ready means of comparison with the full forms, such as want_VVB to_TO0 and give_VVB me_PNP.
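The division of contracted forms can be pictured as a table lookup. The sketch below is purely illustrative and is not a description of how CLAWS itself works: the table holds only three of the forms listed above (wanna is left out because it has two possible analyses), and the name split_contraction is hypothetical. It also shows that the parts, joined without spaces, give back the original word.

```python
# Component parts and tags for a few contracted forms from the list above.
CONTRACTIONS = {
    "doesn't": [("does", "VDZ"), ("n't", "XX0")],
    "dunno": [("du", "VDB"), ("n", "XX0"), ("no", "VVI")],
    "gimme": [("gim", "VVB"), ("me", "PNP")],
}


def split_contraction(word):
    """Return [(part, tag), ...] for a known contraction, else None.

    The parts are sliced from the word as written, so that its original
    capitalisation is preserved.
    """
    parts = CONTRACTIONS.get(word.lower())
    if parts is None:
        return None
    result, position = [], 0
    for part, tag in parts:
        result.append((word[position:position + len(part)], tag))
        position += len(part)
    return result


print(split_contraction("Gimme"))    # [('Gim', 'VVB'), ('me', 'PNP')]
print(split_contraction("doesn't"))  # [('does', 'VDZ'), ("n't", 'XX0')]
```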

Note that in the case of ain't it has been tricky to resolve the tag of the first part ( ai ) satisfactorily. Therefore in all contexts we have tagged this as an unclassified word, followed by the negative particle.

Ai_UNC n't_XX0 got yours yet [KCT.1281]

Appearance and tagging of multiwords

The term ‘multiwords’ denotes multiple-word combinations to which CLAWS assigns a single wordclass tag - for example, a complex preposition, an adverbial, or a foreign expression naturalised into English as a compound noun. In the XML version of the corpus, these sequences are explicitly marked using an XML element (<mw>). The individual orthographic words of which the sequence is composed are also marked, in the same way as other words, using the <w> element.

For example, as noted above, in the XML source of the corpus, the multiword sequence of course is tagged as follows:

<mw c5="AV0"> <w c5="PRF" hw="of" pos="PREP">of </w> <w c5="NN1" hw="course" pos="SUBST">course </w> </mw>

When displaying examples which contain multiwords in this chapter, we display only the wordclass of the outermost <mw> element. Its boundaries are indicated, where possible, by extra highlighting:

Of course_AV0 I can. [H9V.212]

The wordclass tags assigned to constituent parts of multiword items are listed in 9.7 Contracted forms and multiwords. This part of the wordclass tagging was done automatically during the XML conversion process, and has not been checked by CLAWS.

Note that some multiwords can represent different categories according to context, e.g. in between in:

The stage in between_PRP the original negative and the dupe is called an interpositive [FB8.295]
The truth lies somewhere in between_AV0 [ABK.2834]

Moreover, sometimes it is more appropriate to tag a word combination as consisting of ordinary words than as a multiword sequence, as in the case of but for below:

but_CJC for_PRP years now darkness has been growing [F99.2027] cf.
which they would not have done but for_PRP the presence of the police. [H81.766]

Words joined by the slash character
Words which are joined together by a slash ( / ) but no whitespace, such as and/or, are not split up in tagged versions of the text.
  • if they are of the same wordclass they are assigned the same tag;
  • if they are of different wordclasses, the whole sequence is assigned the 'unclassified' tag, UNC.

A title and/or_CJC an author's name [H0S.358]
You should be a graduate in Electrical/Electronic_AJ0 Engineering, Physics , Mathematics , Computing or a related discipline. [CJU.1049]
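The rule for slash-joined sequences can be stated very compactly. In the sketch below the function name is our own, and the wordclass of each component is assumed to be already known (it is passed in as a list); the function only decides which tag the unsplit sequence receives.

```python
def tag_slash_sequence(component_tags):
    """Tag for a slash-joined sequence, given the tag of each component.

    Same wordclass throughout -> that tag; mixed wordclasses -> UNC.
    """
    if not component_tags:
        raise ValueError("at least one component tag is required")
    if len(set(component_tags)) == 1:
        return component_tags[0]
    return "UNC"


print(tag_slash_sequence(["CJC", "CJC"]))  # and/or -> CJC
print(tag_slash_sequence(["AJ0", "AJ0"]))  # Electrical/Electronic -> AJ0
print(tag_slash_sequence(["AJ0", "NN1"]))  # mixed wordclasses -> UNC
```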

6.5.2 Introduction to Word Classes

Nouns
Common nouns
Singular common nouns are tagged NN1, while plurals take NN2:

A child_NN1.
Several children_NN2
An air_NN1 of distinction_NN1
Fifteen miles_NN2 away

Nouns which are morphologically invariant for number or which can take either a singular or plural verb (so-called ‘neutral for number’) are tagged NN0:

Now the government_NN0 is considering new warnings on steroids ... [K24.3057]
... the Government_NN0 are putting people's lives in jeopardy. [A7W.518]
I caught a fish_NN0.[KBW.316]
I had caught four fish_NN0 with hardly any effort[B0P.1387]

We make no special distinction between common nouns that can be mass (or ‘non-count’) nouns (e.g. water, cheese), and other common nouns. All are tagged NN1 when singular and NN2 when plural:

Cheese_NN1 is a protein of high biological value. [ABB.1950]
three cheeses_NN2. [CH6.7834]
A car_NN1 glistens in the distance_NN1. [HH0.1035]
Three cars_NN2, two lorries_NN2 and a motorbike_NN1! [CHR.290]

In general we try to tag abbreviations for common nouns (and other word classes) as if they were written as full forms. Abbreviations for measurement nouns are generally tagged NN0 as they are invariant for number.

Crewe are top of div_NN1 3 by 8 points [J1C.961] (where div = division)
1 km_NN0
400 km_NN0 (km = 'kilometre' or 'kilometres')
1 oz_NN0.
6 oz_NN0 (oz = 'ounce' or 'ounces')

Nouns such as hundred, hundreds, dozens, gross, are all tagged as numbers, CRD, rather than nouns.

Proper nouns
The tag NP0 ideally should denote any kind of proper noun, but in practice the open-endedness of naming expressions makes it difficult to capture all possible types consistently. We have confined its coverage mainly to personal and geographical names, and to names of days of the week or months of the year. Within these, some rather arbitrary borderlines have had to be drawn.

Joe_NP0 Bloggs_NP0
Madame_NP0 Pompadour_NP0
Leonardo_NP0 da_NP0 Vinci_NP0
Lake_NP0 Tanganyika_NP0
New_NP0 York_NP0

Note that the distinction between singular and plural proper nouns is not indicated in the tagset, plural proper nouns being a comparative rarity:

John_NP0 Smith_NP0. All of the Smiths_NP0.

Note also that proper nouns are not processed as multiwords (though there may be good linguistic reasons for doing so). Each word in such a sequence gets its own tag.
A person's initials preceding a surname are tagged NP0, just as the surname itself. The choice whether to use a space and/or full-stop between initials (eg J.F. or J. F. or J F or JF) is determined by the original source text; the tagged version follows the same format.

John F. Kennedy = John_NP0 F._NP0 Kennedy_NP0
J. F. Kennedy = J._NP0 F._NP0 Kennedy_NP0
J.F. Kennedy = J.F._NP0 Kennedy_NP0

In the spoken part of the BNC, however, the components of names — and, in fact, most words — that are spelt aloud as individual letters, such as I B M, and J R in J R Hartley, are not tagged NP0 but ZZ0 (letter of the alphabet). See further Letter

Nouns of style
Preceding a proper noun, or sequence of proper nouns, style (or title) nouns with uppercase initial capitals are tagged NP0:

Pastor_NP0 Tokes_NP0
Chairman_NP0 Mao_NP0
Sub-Lieutenant_NP0 R_NP0 C_NP0 V_NP0 Wynn_NP0
Sister_NP0 Wendy_NP0

Contrast the last example with the following:

You remember your sister_NN1 Wendy_NP0... [HGJ.800]

where Wendy is in apposition to a common noun sister, in lowercase letters.
Geographical names
For names of towns, streets, countries and states, seas, oceans, lakes, rivers, mountains and other geographical placenames, the general rule is to tag as NP0. If the word the precedes, it is tagged AT0:

East_NP0 Timor_NP0
South_NP0 Carolina_NP0
Baker_NP0 Street_NP0
West_NP0 Harbour_NP0 Lane_NP0
the_AT0 United_NP0 Kingdom_NP0
the_AT0 Baltic_NP0
the_AT0 Indian_NP0 Ocean_NP0
Mount_NP0 St_NP0 Helens_NP0
the_AT0 Alps_NP0

Other tags are used for the constituents of more verbose (especially political) descriptions of placenames, or those that are not typically marked on maps:

Latin_AJ0 America_NP0
Western_AJ0 Europe_NP0
the_AT0 Western_AJ0 Region_NN1
the_AT0 People_NN0's_POS Republic_NN1 of_PRF China_NP0
the_AT0 Dominican_AJ0 Republic_NN1
the_AT0 Sultanate_NN1 of_PRF Oman_NP0

The examples show a little arbitrariness in application. For example, contrast

the_AT0 United_NP0 States_NP0
the_AT0 Soviet_AJ0 Union_NN1

Multiword names containing a compass point, ie. those beginning North, South, East, West, North East, South-west etc. nearly always become NP0, whereas those with Northern, Southern, Eastern, Western follow the non-NP0 pattern. Rare exceptions are:

Northern_NP0 Ireland_NP0
Western_NP0 Samoa_NP0

Non-personal and non-geographical names
Where names of organisations, sports teams, commercial products (incl newspapers), shops, restaurants, horses, ships etc. consist of ordinary words (common nouns, adjectives etc.), they receive ordinary tags (NN1, AJ0 etc.). Only if a word used as part of a name is an existing NP0 (typically a personal or geographical name), or a specially-coined word, is it tagged NP0. Some examples follow:
Organisations, sports teams etc.

Cable_NN1 and_CJC Wireless_NN1
Procter_NP0 and_CJC Gamble_NP0
Acorn_NN1 Marketing_NN1 Limited_AJ0
Minolta_NP0; IBM_NP0; NATO_NP0
Wolverhampton_NP0 Wanderers_NN2 ( football_NN1 club_NN1 )
Tottenham_NP0 Hotspur_NP0 (football_NN1 club_NN1 )
The_AT0 Chicago_NP0 Bears_NN2
Spartak_NP0 Moscow_NP0
World_NN1 Health_NN1 Organisation_NN1

There is a slight inconsistency here, in that acronyms of organisation names (WHO, NATO, IBM etc.) take NP0, whereas the expanded forms of these names take regular tags.
Products (including newspapers and magazines)

Windows_NN2 software_NN1
Lancashire_NP0 Evening_NN1 Post_NN1
Mars_NP0 bars_NN2
Time_NN1 Magazine_NN1
The_AT0 Reader_NN1 's_POS Digest_NN1
Perrier_NP0 water_NN1

Company names may sometimes be used to represent product names; in such cases the same tags apply. For example:

John drives a Volkswagen_NP0 Golf_NN1
John drives a Volkswagen_NP0.

Shops, pubs, restaurants, hotels, horses, ships etc.

Body_NN1 Shop_NN1
The_AT0 Grand_AJ0 Theatre_NN1
Sainsburys_NP0 supermarket_NN1
The_AT0 King_NN1 's_POS Arms_NN2
The_AT0 Ritz_NP0
Red_AJ0 Rum_NN1
The_AT0 Bounty_NN1
The_AT0 Titanic_NP0

Here again NP0 is reserved for parts of names that are specially coined, or derived from existing personal/geographical proper nouns.

Verbs
The second character of a verb tag marks the type of verb as follows:
B Forms of be (VBB VBD VBG VBI VBN VBZ)
D Forms of do (VDB VDD VDG VDI VDN VDZ)
H Forms of have ( VHB VHD VHG VHI VHN VHZ)
M Other modal verbs (VM0)
V Lexical verb (VVB VVD VVG VVI VVN VVZ)
The third character of a verb tag marks the verb inflection as follows:
B base form finite
D past tense
Z 3rd person sing present
N past participle
I infinitive
G present participle
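Because verb tags are built up systematically, they can be decoded character by character. The following sketch is illustrative only (the dictionary and function names are our own); the entry for be is taken from the VBB to VBZ tags in the basic tagset, and VM0 is handled separately since modals carry no inflection code.

```python
# Second character: type of verb.
VERB_TYPE = {
    "B": "be",
    "D": "do",
    "H": "have",
    "M": "modal",
    "V": "lexical verb",
}

# Third character: inflection.
INFLECTION = {
    "B": "finite base form",
    "D": "past tense",
    "Z": "3rd person singular present",
    "N": "past participle",
    "I": "infinitive",
    "G": "present participle",
}


def decode_verb_tag(tag):
    """Return (verb type, inflection) for a C5 verb tag; inflection is None for VM0."""
    if len(tag) != 3 or tag[0] != "V":
        raise ValueError(f"not a verb tag: {tag!r}")
    kind, form = tag[1], tag[2]
    if kind == "M":
        if form != "0":
            raise ValueError(f"unknown modal tag: {tag!r}")
        return VERB_TYPE["M"], None
    if kind not in VERB_TYPE or form not in INFLECTION:
        raise ValueError(f"unknown verb tag: {tag!r}")
    return VERB_TYPE[kind], INFLECTION[form]


print(decode_verb_tag("VHZ"))  # ('have', '3rd person singular present')
print(decode_verb_tag("VVG"))  # ('lexical verb', 'present participle')
print(decode_verb_tag("VM0"))  # ('modal', None)
```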
be, have, and do
Auxiliary and main uses of these verbs are not distinguished:

she is_VBZ playing her best tennis for six years. [CH3.1382]
she is_VBZ just a star. [CH3.6939]
John has_VHZ built a set of bookshelves. [C9X.121]
John has_VHZ great courage. [CA9.1869]
We did_VDD n't_XX0 see anybody. [KB2.702]
They do_VDB nice work. [ANY.514]

Note the variant form of have in non-standard English:

they shouldn't of_VHI left it the last minute [KD8.7288]
That could of_VHI been 'bout us [B38.322]

Lexical verbs
Tags beginning VV- apply to all other (lexical) verbs.

She travels_VVZ in every Saturday morning. [KRH.4013]
The young kids want_VVB to dance_VVI and have fun [CHA.1599]
I thought_VVD he looked_VVD a sad sort of a boy. [CDY.2831]
...after running_VVG out of coal, the crew were forced_VVN to burn_VVI timber and resin [HPS.269]

All modals are tagged VM0. We do not differentiate between so-called past and present forms:

We can_VM0 go there.
We could_VM0 go there.
We used_VM0 to_TO0 go there every year.

The form let's is treated as one verb:

Let's_VM0 go_VVI! [A61.1443]

Contracted forms
Contracted forms (can't, won't, gimme, dunno etc) are split into their component parts, which are tagged individually.

Are_VBB n't_XX0 you coming? [A0R.2215]
I du_VDB n_XX0 no_VVI [KR0.23]

Subjunctives and Imperatives
No special tags are used for these:

She suggested that they get_VVB married. [CBC.12107]
Please be_VBB patient. [CHJ.899]
Do_VDB n't_XX0 just stand there watching! [ACB.3470]

Catenative or semi-auxiliary verbs
Again, no special tagging is used for such forms as going to, ought to, or used to + infinitive:

you're not going_VVG to_TO0 get killed [KCE.6550]
you ought_VM0 to_TO0 let them know. [KCT.6115]

Adjectives

Adjectives are given one of the wordclass tags AJ0, AJC, or AJS.

The general tag for adjectives (AJ0) subsumes:
Predicative and attributive uses

The ground was dry_AJ0 and dusty_AJ0 [GWA.118]
The dust from the dry_AJ0 ground [GWA.121]

Quasi-comparatives and quasi-superlatives
Adjectives which have a heightening or downtoning effect rather like that of comparatives and superlatives, but which do not behave syntactically like comparatives or superlatives, are treated as ordinary adjectives. Examples include utter, upper and uppermost:

Events in Eastern Europe were evidently uppermost_AJ0 in Mr Li's mind. [A95.366]
Family contacts were very important in uniting the upper_AJ0 classes [FB6.1495]

Adjectives used catenatively
For example, able and unable:

Will you be able_AJ0 to manage? (catenative)
Your son is very able_AJ0 (non-catenative)

Comparative adjectives receive the tag AJC; superlatives take AJS:

A faster_AJC car.
The best_AJS in its class.

Ambiguities frequently arise between adjectives and other wordclasses, in particular adverbs, nouns and participles.

6.5.3 Adverbs

Adverbs are given one of the tags AV0, AVQ, or AVP

AV0 is the default tag for adverbs. It incorporates a very mixed bag, including:
adverbs of time, manner, place etc.
Eg slowly; here; soon
degree adverbs
Eg very and rather in

very_AV0 tall_AJ0
rather_AV0 painfully_AV0

sentence adverbs
for example:

However_AV0, …
In addition_AV0

postnominal adverbs
for example:

aged between 2 and 11 years inclusive_AV0 [AMD.31]
the buildings thereon_AV0 [J16.813]
during 1986-91 inclusive_AV0 [FT0.1400]
Diamonds galore_AV0 [FPH.900]

discourse markers
such as well, right, like:

you know like_AV0, it's worthwhile opening a cinema at 4 o'clock... [F7A.358]

Note that adverbs, unlike adjectives, are not tagged as positive, comparative, or superlative. This is because of the relative rarity of comparative and superlative adverbs.

Interrogative and relative wh-adverbs (when, where, how, why, wherever) are tagged AVQ whether the word occurs in interrogative or relative use.

"When_AVQ do your courses start?" [A0F.3117]
"...if you let me know when_AVQ the police are called in." [BMU.2291]
Yet why_AVQ is that so? [CR7.3089]

Ordinal-type adverbs (including first, fourth, etc.) are treated separately with the ORD tag.

Prepositional adverbs (also known as adverbial particles) are tagged AVP and are discussed together with prepositions: see Prepositions and prepositional adverbs.

6.5.4 Articles, determiners & pronouns

Articles, definite or indefinite, are tagged AT0. Pronouns which act as determiners of various kinds (all, which, your etc.) are given tags DPS, DT0, or DTQ, and distinguished from pronouns which do not have a determiner function. These are marked using one of the tags PNP, PNI, PNQ, or PNX depending on their function.

All articles are tagged AT0. An article is defined here as a determiner word which typically begins a noun phrase, but which cannot occur as the head of a noun phrase. Examples include a/an, the, no and every:

Have a_AT0 break
Every_AT0 year
There's no_AT0 time

Recognising that there is a high degree of formal and functional overlap between determiners and pronouns, we have conflated under the D-- heading words that are capable of either function. We distinguish three classes of determiner pronouns:
Words such as few, both, another are tagged DT0:

free secondary education for all_DT0 [ECB.1610]
Few_DT0 diseases are incurable [GV1.1129]
for the benefit of the few_DT0 [HHX.10183]

Interrogative determiner-pronoun
The wh- (interrogative) determiner-pronoun is tagged DTQ. Which and what are always tagged DTQ:

Which_DTQ country do you live in? [A7N.979]
And she didn't say which_DTQ? [KCF.351]
What_DTQ time is it? [A0N.406]

Prenominal possessive determiner pronoun
Forms such as my, your, etc are always tagged DPS, for example:

my_DPS hat

Compare this with the nominal use:

That is your way. This is mine_PNP [ASD.726-7]

Tags beginning P-- indicate pronouns which do not share the determiner function, for example I, it, anyone. Pronouns are differentiated according to whether they are:
  • personal (PNP), eg I, him, they, us. Note that it is also included here.
  • reflexive personal (PNX), eg herself, themselves
  • indefinite (PNI), eg anyone, everything, nobody
  • interrogative (PNQ), eg who, whoever
Relative pronouns
Which as a relative (or interrogative) pronoun is grouped with the other determiner-pronouns, and tagged DTQ:

Give 4 details which_DTQ should appear on an order form [HBP.417]

Meanwhile, that as a relative clause complementizer is treated with that as a complement clause complementizer, and tagged CJT:

I got some currants that_CJT are left over [KST.3733]
this girl that_CJT Claire knows [KC7.1101]
He dismissed reports that_CJT his party was divided over tactics [A28.11]
We both knew that_CJT enough was enough. [FEX.268]

Note, however, that that takes the tag DT0 when it functions as a demonstrative pronoun or determiner:

Look at that_DT0 bear! [KP8.1547]
I guess I was sad about that_DT0. [BMM.239]

6.5.5 Prepositions and prepositional adverbs

Most prepositions are tagged PRP, including a large number of multiword items. Examples include:

at_PRP the Pompidou Centre in_PRP Paris [A04.325]
I use humour as_PRP a protection [FBL.356]
Heard about_PRP this have you? [KE6.9556]
According to_PRP ancient tradition, ... [A04.784]
Many disputes are dealt with by bodies other than_PRP courts. [F9B.4]
Nice walls and a big sky to look at_PRP. [A25.122]

The preposition of is assigned a special tag PRF because of its frequency and its almost exclusively postnominal function. Examples:

a couple of_PRF cans of_PRF Coke [AJN.283]
DNA consists of_PRF a string of_PRF four kinds of_PRF bases [AE7.107]

Note that numerous multiwords contain of, eg in front of, in light of, by means of, etc.
Prepositional adverbs/particles
Preposition-type words which have no complement are tagged AVP. Typical uses of AVP are in phrasal verb constructions, or when it functions as a place adjunct:

We gave up_AVP after two hours. [KSV.1029]
there were a lot of horses around_AVP. [HR7.3101]

There are many instances of ambiguity between PRP and AVP.

6.5.6 Conjunctions

Co-ordinating conjunction
Co-ordinators such as and, or, but, nor etc are tagged CJC:

Fish and_CJC chips
James laughed and_CJC spilled wine. [A0N.136]
She was paralysed but_CJC she could still feel the pain. [FLY.529]

Subordinating conjunction
All subordinating conjunctions are tagged CJS and introduce one of:
an adverbial clause (of time, reason, condition etc.)

"When_CJS you 've done it , you should go home," [CRE.949]
I still stayed there after_CJS I heard the shooting [HW8.3263]
As_CJS you may know Scorton will again enter the Best Kept Village competition in 1992 [HPK.768]
Do send me an interim copy as_CJS soon as you can [HD3.69]
If_CJS it's wet just take your time. [KCL.554]

a comparative clause
introduced by than or as, and occurring with or without ellipsis:

It was worse than_CJS she could have imagined. [CH0.1315]
...apparently we eat more chocolate than_CJS any other country. [G3U.1000]
"it's as good as_CJS it's going to get." [K9K.199]
make the transporter as light as_CJS possible. [CA1.1113]

a nominal wh-clause
containing whether or if

Can you tell me whether_CJS ivies do damage trees. [C9C.720]

Complementary clause
The conjunction that is tagged CJT both at the start of a clause introducing reported speech or thought and at the start of a relative clause:

Historians knew that_CJT this was nonsense. [G3C.363]
China announced that_CJT it was ending martial law in the Tibetan capital Lhasa. [KRU.95]
The problem that_CJT he was having was that_CJT she was his legal wife 's sister [HE3.210]

6.5.7 Numerals

Cardinal numbers and similar items are tagged CRD. Ordinal numbers and similar items are tagged ORD.

Numbers and fractions
All cardinal numbers, numeral nouns, fractions and so on take the tag CRD, whether they are written as words or numerals, and whether functioning nominally or prenominally. Examples:

5_CRD out of 10_CRD [CGM.525]
one_CRD striking feature of the years 1929-31_CRD [A6G.134]
his first_ORD innings, when he scored forty-two_CRD, with seven_CRD fours_CRD [KJT.128]
Hundreds_CRD of people audition each year [K1S.2239]
About a dozen_CRD there. [HEU.131]

Ordinal numbers and similar
Ordinal numbers are assigned ORD in all syntactic positions, including adverbial positions, as in

We only came fourth_ORD in the county championship last_ORD year [EDT.1629]

Note that ORD is also assigned to less overtly numeric words like next and last, even in clear adverbial, adjectival or nominal contexts. This is because next and last function like ordinals both syntactically and semantically.
Currency and measurement expressions
Measurement expressions, consisting of numbers and a unit of measurement of some kind (together as one word), are assigned a noun tag, usually NN0 (neutral for number) or NN2 (plural):


Other sequences of numeric and alphabetic characters are assigned UNC (unclassified) tags:

Figure 2b_UNC [FTC.250]
Serial no. S835508_UNC [C9H.2282]
A4_UNC sheet of paper [CN4.296]
Mark drove home along the M1_UNC [AC2.2210]

6.5.8 Miscellaneous other tags

Existential there
The tag EX0 is used for there when it merely states that something exists or existed. It occurs at the beginning of a clause and is usually followed by the verb be and an indefinite noun phrase; for example

There_EX0 was a long pause and then a smile [A4H.416]
Waiter! Waiter! There_EX0's an awful film on my soup! [CHR.657-9]
There_EX0 appears to be little alternative [ECE.2139]

Compare this with there when it has a clear locative meaning ('in/to that place'):

Don't stand there_AV0 grinning like a stuck pig [C85.1553]

The tag ITJ is used for any interjection:

Hello_ITJ, Nell.
Oi_ITJ - come here!
Yes_ITJ , please_AV0 do
No_ITJ not_XX0 yet_AV0

(For the distinction between ITJ and the unclassified tag, UNC, see Interjection vs. unclassified.)
Genitive morpheme
The tag POS is used for the genitive morpheme 's (singular) or ' (plural after an s):

teacher_NN1 's_POS pet
teachers_NN2 '_POS pet

Note the lack of space between the noun and the following POS, as 's is tokenized in the same way whether it represents a genitive or a contracted verb. See further on the tagging of 's in Apostrophe S.
Infinitive marker
The tag TO0 is used for the infinitive marker. This includes elliptical uses.

"Do you want to_TO0 talk about it?" [EFG.1935]
In the summer holidays I can , I can get up early if I want to_TO0 . [KPG.4153]

Note the morphological variation of to in the following colloquial forms:

We got_VVN ta_TO0 go
We wan_VVBna_TO0 stay.

Unclassified words
The tag UNC is used for unclassified (or unclassifiable) words. It is applied in contexts where no other wordclass tag seems appropriate, including
  • "Noise words" and pause fillers in spoken utterances; imitations of animal or machine sounds:

    blah_UNC blah_UNC blah_UNC
    er_UNC I think so

  • Certain fused forms (in written or spoken data) for which no other tag would be appropriate:

    That ai_UNC n't_XX0 right.
    0.5 cm increments/30_UNC seconds [HWT.282]
    Fits with most lap/diagonal_UNC seat belts. [BNX.392]

  • Truncated words in speech. Partial words that are not completed by a speaker, whether through hesitation or an interruption, are also usually marked with the XML element <trunc>; for example the partial word bathr in the following:

    The bathr_UNC data. er you can't beat a white bathroom suite anyway. [KCF.771]

  • Partial repetitions of multiwords in spoken data.
    Occasionally in spoken data, when a multiword sequence is used, it appears to be repeated, but only partially so. In the following example, the orthographic word sort is used twice:

    we're going to sort sort of summarize... [G5X.106]

    We treat the first sort as an incomplete multiword, and tag it UNC (rather like truncated words, above). The complete multiword sort of is tagged AV0, as normally.

    we're going to sort_UNC sort of_AV0 summarize...

See 6.5.10 Features of spoken corpus tagging for further examples; for the distinction between UNC and ITJ see Interjection vs. unclassified.
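
As a rough illustration of how the treatment of sort sort of described above might be read back out of the XML, the following Python sketch walks a sentence and yields one (text, tag) pair per token, treating a multiword as a single token. It assumes that each <w> element carries its C5 tag in an attribute named c5 and that a multiword is wrapped in an <mw> element carrying the tag of the whole unit. The sample markup, the tags shown for words other than sort and sort of, and the function name tagged_tokens are invented for illustration and should be checked against the actual corpus files.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the example above. Only sort (UNC) and sort of (AV0)
# come from the guide; every other tag here is illustrative.
SAMPLE = (
    '<s n="106">'
    '<w c5="PNP">we</w><w c5="VBB">\'re </w><w c5="VVG">going </w>'
    '<w c5="TO0">to </w><w c5="UNC">sort </w>'
    '<mw c5="AV0"><w c5="AV0">sort </w><w c5="PRF">of </w></mw>'
    '<w c5="VVI">summarize</w>'
    '</s>'
)


def tagged_tokens(elem):
    """Yield (text, c5) pairs, collapsing each <mw> into one token."""
    for child in elem:
        if child.tag == "mw":
            # Join the constituent words and use the tag of the whole unit.
            text = "".join(w.text or "" for w in child.iter("w"))
            yield text.strip(), child.get("c5")
        elif child.tag == "w":
            yield (child.text or "").strip(), child.get("c5")
        else:
            # Wrappers such as <trunc>: descend and keep looking for words.
            yield from tagged_tokens(child)


if __name__ == "__main__":
    for text, tag in tagged_tokens(ET.fromstring(SAMPLE)):
        print(f"{text}_{tag}")
```

Collapsing the multiword at the <mw> level keeps the output aligned with the word_TAG notation used in this guide, where sort of carries a single AV0 tag.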
Negative particle
XX0 is the tag for the negative particle not, and also for its contracted or fused form n't:

Brown did_VDDn't_XX0 see it that way. [A6W.338]
no, that is not_XX0 correct. [JK0.257]
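
Because a contracted form such as n't follows its host word without a space, examples written in the inline word_TAG notation of this guide (for instance Do_VDBn't_XX0) cannot be split on whitespace alone. The Python sketch below is one way of pulling the (wordform, tag) pairs out of such an example. It assumes that a tag is always three characters (two capitals followed by a capital or digit), optionally joined to a second tag by a hyphen; the function name tagged_pairs is hypothetical, and untagged words are simply skipped.

```python
import re

# A C5 tag is three characters (e.g. NN1, XX0); an ambiguity tag joins two
# such tags with a hyphen (e.g. AV0-AJ0). This shape is an assumption based
# on the tags cited in this guide.
TAG = r"[A-Z]{2}[A-Z0-9](?:-[A-Z]{2}[A-Z0-9])?"
PAIR = re.compile(rf"([^\s_]+)_({TAG})")


def tagged_pairs(example):
    """Return the (wordform, tag) pairs found in an inline-annotated example."""
    return PAIR.findall(example)


if __name__ == "__main__":
    print(tagged_pairs("Do_VDBn't_XX0 just stand there watching!"))
    print(tagged_pairs("We wan_VVBna_TO0 stay."))
```

Since the tag has a fixed length, whatever follows it directly (n't, na) is read as the start of the next wordform.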

ZZ0 is used for a free-standing letter of the alphabet such as A, X, x, p, r. If, however, the letter clearly represents a separate word, or an abbreviation of a separate word, we have tried to assign the appropriate wordclass tag for the full form of that word, rather than ZZ0. For example,
  • I as personal pronoun is PNP rather than ZZ0.
  • a as indefinite article is tagged AT0
  • F as in John F. Kennedy is tagged NP0
  • v meaning 'versus' is tagged PRP in

    Italy v_PRP New Zealand ... Hungary v_PRP Thailand [A1N.507].

    Although the same should apply to v., the full stop has sometimes incorrectly produced a new sentence break. (See eg CHS.1076, EB2.19, EDL.313.)
  • In spoken texts, words which are spelt out by the speaker are transcribed letter by letter, and each letter is tagged ZZ0.

    I_ZZ0 B_ZZ0 M_ZZ0 compatible [JYM.6]
    children who go to the E_ZZ0 N_ZZ0 T_ZZ0 clinic [KB8.3807]

6.5.9 Disambiguation Guide

The following is a guide to resolution of the most common tagging ambiguities. It states the principles by which we have drawn the line between the "correct" and the "incorrect" assignment of a tag in particular contexts (as applied in the report on tagging error rates.) Note that in the next two sections, we also cite examples where the POS-tagging in the corpus is less reliable and does not match that given for the citation. In such cases we append the actual tag in the corpus to the file reference with an asterisk. Eg. under Adjective vs Adverb (next section), the preferred tag for long is AV0, but the actual tag is ambiguous AV0-AJ0:

You're not supposed to keep medicine that long_AV0. [H8Y.1976 *AV0-AJ0]
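
Readers who want to collect the cases where the corpus tag differs from the preferred one may find it convenient to parse the bracketed references mechanically. The Python sketch below is one possible approach. It assumes, on the basis of the citations in this guide only, that a reference consists of a three-character text identifier, a dot, a number or short range, and optionally an asterisk followed by the tag (or ambiguity tag) actually found in the corpus. The names Citation and parse_citation are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Pattern inferred from the references in this guide, e.g. [H8Y.1976 *AV0-AJ0].
CITATION = re.compile(
    r"\[\s*(?P<text>[A-Z0-9]{3})\.(?P<unit>\d+(?:-\d+)?)"
    r"(?:\s+\*(?P<actual>[A-Z0-9]{3}(?:-[A-Z0-9]{3})?))?\s*\]"
)


class Citation(NamedTuple):
    text_id: str                             # e.g. "H8Y"
    unit: str                                # number or range after the dot
    actual_tags: Optional[Tuple[str, ...]]   # corpus tag(s), if flagged with *


def parse_citation(ref):
    """Parse the first bracketed reference in ref, or return None."""
    m = CITATION.search(ref)
    if m is None:
        return None
    actual = m.group("actual")
    # An ambiguity tag such as AV0-AJ0 is split into its two component tags.
    tags = tuple(actual.split("-")) if actual else None
    return Citation(m.group("text"), m.group("unit"), tags)


if __name__ == "__main__":
    print(parse_citation("that long_AV0. [H8Y.1976 *AV0-AJ0]"))
    print(parse_citation("We arrived tired_AJ0, but safe_AJ0 [CCP.529]"))
```

A reference without an asterisk yields actual_tags of None, which here is taken to mean that the corpus tag matches the one shown in the example.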

Note also that in this section we use a number of invented examples (in addition to corpus citations) to clarify the distinction between categories.

Disambiguation by Tag Pair
Adjective vs. adverb
After a verb or an object, there is sometimes a difficult choice between AJ0 and AV0, or between AJC and AV0. e.g.:

We arrived tired_AJ0, but safe_AJ0 [CCP.529]

Here, both tired and safe are AJ0. The main test is to see whether one can express the relation between these words and their logical subjects using the verb be: We arrived tired but safe implies 'We were tired but safe'. The word tagged AJ0 refers to a property of a noun, rather than to a property of an event or situation. Contrast:

After a little he remembered it and sang out loud_AV0. [A0N.1144]

This sentence does not imply that he was loud, but is more or less equivalent to He sang out loudly. It means that his singing was loud.

It follows that when, in colloquial English, a word which we normally expect to be an adjective is used as an adverb, we should tag it AV0:

You did great_AV0 though. [HH0.3248 *AV0-AJ0]

Here is another pair of examples, where the AJ0/AV0 word follows an object:

everyone below 25 grew their hair too long_AJ0. [ARP.590 *AV0-AJ0]
(i.e. 'their hair was too long'.)
Try not to keep her too long_AV0. [FAB.3620 *AV0-AJ0]
(i.e. NOT 'she will be too long.')

Also note the similar distinction between AJC and AV0:

They'll have to make the taxes higher_AJC. ('the taxes will be higher')
We can make this piece higher_AJC if you want to. [BNG.2268]
You'll have to aim higher_AV0. (NOT 'you will be higher')
You should aim higher_AV0 [ACN.984 *AJC]

Similar considerations arise for the choice between AJS and AV0:

I thought it best_AJS to call. [AT4.3239]
I liked the cartoons best_AV0 [CAM.194]

Adjective vs. noun
There are many words in English which can be tagged either adjective (AJ0) or noun (NN1). Colour words like black, white and red are fairly consistent in allowing the two tags, and may be used to illustrate the difference. In attributive (premodifying) or predicative (complementing) positions without further modification these words are normally adjectives:

a white_AJ0 screen, The screen is white_AJ0.

When the word is the head of a noun phrase, on the other hand, it is a noun:

Red_NN1 is my favourite colour.
They painted the wall a brilliant white_NN1.

Sometimes a word cannot be used predicatively as an adjective, but can occur attributively in a way which suggests adjectival use. For example, past and present are adjectives in

All past_AJ0 and present_AJ0 employees of the branch are invited. [K99.216]

We do not find present or similar words being used as predicative adjectives, however:

*These needs are past, present, and future.

(Note that present can be used as a predicative adjective meaning the opposite of absent; but this meaning is not comparable to the temporal meanings of past, present and future above.)

Contrast K99.216 above with cases where past, present etc. are heads of noun phrases, e.g. following the definite article, and are clearly nouns:

You're living in the past_NN1. [HGS.1045]
I don't even want to think about the future_NN1. [JY4.2864]

The only reason for treating past and present in the example above as adjectives is that they have an institutionalized meaning as modifiers, which is rather different from the meaning they have as nouns. Further examples of this type are words such as model in model behaviour, giant in a giant caterpillar and vintage in vintage cars.

Words ending in -ing are a particular problem: when they premodify a noun, they can be tagged either NN1 (noun) or AJ0 (adjective). Contrast:

new spending_NN1 plans [CEN.5922]
a working_AJ0 mother [ED4.153]
his reading_NN1 ability [CFV.1897]
in the coming_AJ0 weeks [HKU.1333]

The guideline is as follows. If X-ing + Noun is equivalent in meaning to Noun who/which X-es (or X-ed or BE + X-ing), then X-ing is an adjective (AJ0). That is, a word ending in -ing is an adjective when the noun it premodifies is its notional subject. For example:

two smiling_AJ0 children [HTT.743] ('two children who are smiling')

In other cases, X-ing is generally a noun (NN1). In such cases, it is often possible to paraphrase X-ing + Noun by a more explicit phrase in which X-ing is clearly a noun:

new spending_NN1 plans ('new plans for spending')
his reading_NN1 ability ('his ability in reading')

Further examples:

a mating_AJ0 animal [GU8.2142]
the mating_NN1 game [ECG.336 *AJ0-NN1]
a falling_AJ0 rate of unemployment [KR2.2129]
slimming_NN1 tablets. [KCA.941 *NN1-VVG]

Determiner-pronoun vs. adverb
More and less can be assigned to either of the tags DT0 or AV0. The difference between them is that DT0 is for noun-phrase-like (and determiner-like) uses of the word in question, whereas AV0 is for adverbial uses. The two can be hard to distinguish, particularly after a verb:

(a) You should relax more_AV0.
(b) You should spend more_DT0.

Since relax is an intransitive verb in (a), more cannot be a noun phrase following it. Instead, more can be paraphrased roughly as 'to a greater extent' or 'to a greater degree'. On the other hand, spend in (b) is a transitive verb, and so more is a determiner-pronoun form following it. As confirmation of this, note that sentence (b) could be turned into a passive with more as subject: More should be spent.... There are unfortunately some verbs for which the distinction is less clear than in the above examples, e.g.:

You should eat more.
You should read more.
You should smoke less.

In these cases, the verb may be used transitively or intransitively with almost identical meanings, so that the syntactic structures of the immediate and/or surrounding context are the only clues as to which is the case:

Do you smoke? (Intransitive)
How many do you smoke in a week? (Transitive)

Contrast (c) and (d) below:

(c) At the moment we have 23 fixtures per season. Personally, I would rather play more_DT0.
(d) You should work less and play more_AV0.

(In (d) the adverb more has roughly the meaning of 'more often'.)

Note. The automatic disambiguation of determiners and adverbs is not reliable, because transitivity has not been encoded in the tagger. In sentences like (c) and (d), where more follows the verb at the end of a sentence, more is invariably tagged AV0.

Adjective vs. participle

Another area of borderline cases is the tagging of words as adjectives (AJ0) or as participles (VVG or VVN).

One test is to see whether a degree adverb like very can be inserted in front of the word: e.g. in We were very surprised, surprised is an AJ0.

Another test, having the opposite effect, is to see whether there is an agent by-phrase following the word in -ed or -en. If so it is a VVN:

We were surprised_VVN by pirates.

Even where it is not present, the possibility of adding the by-phrase, without changing the meaning of the word, is evidence in favour of VVN. (However, this criterion can clash with the preceding one — since it occasionally happens that an -ed word is both preceded by an adverb like very and followed by a by-phrase: E.g. I was so irritated by his behaviour that I put the phone down. When these do occur, we give preference to AJ0.)
A third test is negative: to see whether the word in question can be placed before a noun. e.g.:

The effect is lasting_AJ0 (compare a lasting_AJ0 effect).
The door is locked_AJ0 (compare the locked_AJ0 door.)

This shows that lasting or locked can easily be (but need not be) an AJ0. If the word could not be placed (with the same meaning) before the noun, this would be evidence that the word is a participle.
Even though an -ing word is normally a VVG after the verb be, it is generally treated as an AJ0 before a noun:

The man was dying_VVG. [HTM.1494 *VVG-AJ0]
the dying_AJ0 man. [FSF.1787]

However, when the -ing or -ed forms part of a premodifying phrase, the VVG or VVN tag is preferred:

an interest_NN1 earning_VVG account
a hypothesis_NN1 driven_VVN approach

In these examples the NN1+VVG/VVN sequence has the character of a premodifying adjective compound. We can therefore imagine the two words bracketed together forming an adjective: an interest-earning_AJ0 account. But within the adjective, the VVG and VVN tags retain their verbal character, with the initial noun acting as object of the verb (cf. the account earns interest).

The same applies when the premodifying compound phrase is noun-like:

a shanty_NN1 singing_VVG competition [K4W.2952]

If the verb be can be replaced by another verb such as seem or become, without changing the meaning of the following AJ0 / VVN word, this is a strong indication that the construction is not properly a passive, and that the word is an AJ0:

The building was infested_AJ0 with cockroaches
(cf.: The building seemed/became infested with cockroaches)

A further distinction which can be used to test with 'event' verbs is that the AJ0 refers to a 'resultant state', whereas the VVN refers to an event:

Bill was married_AJ0. (i.e. he was not single)
Bill was married_VVN to Sarah on the 15th May. (i.e. the actual event)

This is a manifestation of the general semantic character of adjectives (which typically refer to states or qualities) and verbs (which typically refer to events or actions).
However, this criterion is not definitive, as VVG and VVN can also sometimes refer to states, when the meaning of the verb is stative:

She is not disturbed_VVN by that sort of threat.
The tourists were standing_VVG around a map of the city.

Finally, here is a test which clearly identifies an -ing form as a verb. A verb takes following complements such as a noun phrase, an adjective or an adverbial. These cannot follow the same word used as an adjective. E.g.:

Are you expecting_VVG someone? [G01.2610]
The arithmetic is looking_VVG good. [K1M.3611]
Turning_VVG suddenly, she ran for the safety of the car [CK8.297]


Contrast:

His manner was insulting_AJ0.

where insulting could not normally be followed by an object:

* insulting us.
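
The main tests in this section can be summarized as a rough checklist. The Python sketch below is only an aid to reading the guidelines: the inputs are judgements that a human annotator would have to supply, the ordering of the checks is one possible reading of the text (the complement test is described as decisive, and AJ0 is preferred when the very test and the by-phrase test clash), and the seem/become and resultant-state tests are left out. It does not describe how CLAWS4 or Template Tagger operate, and all names in it are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Annotator judgements about one -ing/-ed word (hypothetical inputs)."""
    takes_complement: bool = False        # followed by an object, adjective or adverbial
    in_premodifying_phrase: bool = False  # e.g. "interest earning account"
    accepts_very: bool = False            # a degree adverb like "very" can precede it
    allows_by_phrase: bool = False        # an agent by-phrase is present or could be added


def adjective_or_participle(ev: Evidence, participle: str = "VVN") -> Optional[str]:
    """Return "AJ0", the participle tag, or None when these tests give no answer."""
    if ev.takes_complement or ev.in_premodifying_phrase:
        return participle
    if ev.accepts_very:
        # AJ0 is preferred even when a by-phrase is also present.
        return "AJ0"
    if ev.allows_by_phrase:
        return participle
    return None


if __name__ == "__main__":
    # "We were surprised by pirates."
    print(adjective_or_participle(Evidence(allows_by_phrase=True)))
    # "Are you expecting someone?"
    print(adjective_or_participle(Evidence(takes_complement=True), "VVG"))
```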

Preposition vs. prepositional adverb vs. general adverb
This kind of ambiguity occurs frequently, particularly in spoken texts. Compare:

(a) She ran down_PRP the hill.
(b) She ran down_AVP her best friends.

In (a), down is a preposition, because:
  • An adverb could be inserted before it:

    She ran quickly down the hill.
    (But not: *She ran viciously down her best friends.)

  • It can be moved (somewhat awkwardly) to the front of a wh-word:

    This is the hill down_PRP which she ran.
    Down_PRP which slopes do you like ski-ing?

In (b), down is an adverbial particle because:
  • It can be placed before or after the noun phrase acting as object of the verb:

    She ran her best friends down_AVP.
    (But not: *She ran the hill down.)

  • If the noun phrase is replaced by a pronoun, the pronoun has to be placed in front of the particle:

    She ran them down_AVP. (= her best friends)
    (But not: *She ran down them.)


    The dentist took all my teeth out_AVP. (The dentist took them out)

Notice that the syntactic distinction between (for example) down as an adverbial particle and down as a preposition is independent of the semantic distinction between locative and non-locative interpretations of down.

When the verb is simply followed by down or out, etc., without a following noun phrase, it is normally an AVP:

Income tax is coming down_AVP.
The decorations are put up_AVP on Christmas Eve.

However, it is important to recognize 'stranded' prepositions, which have been deprived of the company of their noun phrase, the prepositional complement, because it has been fronted or omitted through ellipsis (e.g. in relative clauses, with passives, in questions, etc.):

This is the hill (which) she ran down_PRP.
(Cf. This is the hill down which she ran.)
The poor were looked down on_PRP by the rich.
(Here on is the stranded preposition)
Which car did she arrive in_PRP?

The same tests apply to words which are tagged either as prepositions or as general adverbs (AV0), such as across, past and behind.

Note, additionally, the use of about as a degree adverb.
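
The tests for distinguishing PRP from AVP can likewise be written out as a small decision procedure. The Python sketch below is a simplification under stated assumptions: the three inputs are annotator judgements (whether a noun phrase follows the word, whether the word could be moved after that noun phrase, and whether a complement has been fronted or ellipted), and the function name is hypothetical. It is not a description of the automatic tagger.

```python
def preposition_or_particle(has_following_np, can_follow_np=False, stranded=False):
    """Return "PRP" or "AVP" for a preposition-type word after a verb.

    has_following_np: a noun phrase directly follows the word
    can_follow_np:    the word can also be placed after that noun phrase
                      ("She ran her best friends down")
    stranded:         the word's complement has been fronted or omitted
                      ("This is the hill she ran down")
    """
    if has_following_np:
        # Particles can move after the object; prepositions cannot.
        return "AVP" if can_follow_np else "PRP"
    # No noun phrase follows: normally a particle, unless the preposition is stranded.
    return "PRP" if stranded else "AVP"


if __name__ == "__main__":
    print(preposition_or_particle(True))                       # She ran down the hill.
    print(preposition_or_particle(True, can_follow_np=True))   # She ran down her best friends.
    print(preposition_or_particle(False))                      # Income tax is coming down.
    print(preposition_or_particle(False, stranded=True))       # the hill (which) she ran down
```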

Interjection vs. unclassified

The borderline between interjections or exclamatory particles (tagged ITJ) and unclassified 'noise' words (tagged UNC) is drawn as follows:

ITJ is used for 'institutionalized' interjections or discourse particles such as good-bye, oh, no, oops, hallelujah, whoa, wow; however, well, right and like functioning as discourse markers are tagged AV0.

UNC is used in contexts where no other wordclass tag seems appropriate:
  • 'noise' words and pause fillers in spoken utterances; this includes imitations of animal or machine sounds:

    blah_UNC blah_UNC blah_UNC
    er_UNC I think so.
    Erm_UNC nope_ITJ.

  • certain fused forms which cannot easily be broken down into separate word classes:

    ai_UNC n't_XX0

  • constituent <w> elements within multiword expressions for which no unique C5 code can be found

The contraction ain't is a special case: its first half is tagged UNC because it abbreviates so many different verb forms (am not, is not, are not, has not, have not) that no single tag can be applied to it (unless one were to invent a special tag for that purpose).

Disambiguation by Word
In this section we discuss some common words which belong to more than one word class, and are among the most problematic for disambiguation. As in the previous section, if the tag stated in the example differs from the actual tag in the corpus, we append the latter to the file reference number. Eg *AV0 in

Tears well_VVB up in my eyes. [BN3.5 *AV0]

Apostrophe S
In the BNC the two-character sequence 's is generally tagged as a separate wordform, following without a space the immediately preceding word.
Contracted forms
When it represents a shortened form of is, has or (rarely) does, it has the appropriate verb tag. Occasionally, for example with auxiliaries followed by past participles, there are difficulties determining what the full form of the verb should be. Examples:

That_DT0's_VBZ perfect is that one... (= That is...) [KCX.1254]
She_PNP's_VHZ got tickets. (= She has...) [KPV.6479]
well, what_DTQ 's_VDZ he do?, is he a plumber? (= What does...) [KD6.310]

When 's marks the genitive, it is tagged POS:

Britain_NP0's_POS small businesses [HMH.67]
After today_AV0's_POS announcement [K6F.39]
's plural
When 's acts as a marker of the -s plural, or as part of the verb form let's, it is part of a single word, and is not assigned its own tag. E.g.:

success in the three R_ZZ0's [EVY.59]
in the 1980_CRD's [HJ1.22024]
Let_VM0's go_VVI. [A61.1443]

Note that let's is not considered a contraction of let us, but is treated as a single 'verbal particle', tagged VM0, on the grounds that it is closely analogous to modal auxiliaries.
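
Since the verb tag on a contracted 's records which full form it stands for, the tag can be used to expand the contraction. The following Python sketch illustrates the idea on lists of (wordform, tag) pairs; the mapping is taken from the description of contracted forms above, while the function name, the input format and the PNP and VVI tags given to he and do in the demonstration are assumptions made for the example. Genitive 's (POS) is deliberately left untouched.

```python
# Full forms for contracted 's, keyed by its verb tag.
FULL_FORMS = {"VBZ": "is", "VHZ": "has", "VDZ": "does"}


def expand_contracted_s(pairs):
    """Replace verbal 's with its full form; leave everything else as it is."""
    expanded = []
    for word, tag in pairs:
        if word == "'s" and tag in FULL_FORMS:
            expanded.append((FULL_FORMS[tag], tag))
        else:
            expanded.append((word, tag))
    return expanded


if __name__ == "__main__":
    question = [("what", "DTQ"), ("'s", "VDZ"), ("he", "PNP"), ("do", "VVI")]
    print(" ".join(word for word, _ in expand_contracted_s(question)))
```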

About
Degree adverb:
When about has an approximating meaning, typically premodifying a quantifying expression, it is tagged AV0 (not PRP):

was about_AV0 three weeks ago [FAJ.1714]
about_AV0 half the size of a grain of rice [AJ4.33]

Note also the multiword just about, as in:

We're just about_AV0 ready.

Preposition vs. particle:
See further at Preposition vs. prepositional adverb vs. general adverb

my mother was reading a novel about_PRP gypsies... . [ARJ.2068]
How did this transformation come about_AVP? [A11.786]

As
Comparative constructions:
As is a degree adverb (AV0) when it occurs before an adjective, adverb or determiner (and sometimes other words) in phrases of the type as X as Y, or simply as X (where the comparative clause or phrase as Y is omitted but understood):

I go to see them as_AV0 often as I can . [AC7.1189]
and they employ ninety people, twice as_AV0 many as last year. [K1C.3540]
And every bit as_AV0 good. [EEW.1132 *CJS]

In the first and second examples above, the second as introduces a comparative construction which expresses 'equal comparison', as contrasted with the unequal comparison of more X than Y. When as is a word introducing such a comparative construction, it is tagged CJS:

Capitalism is not as_AV0 good as_CJS it claims. [CFT.2042]
Linked together, they can crunch numbers as_AV0 fast as_CJS any mainframe. [CRB.271]
She will deposit as_AV0 many as_CJS a dozen eggs there. [F9F.424]

Notice that as in this comparative use is tagged CJS whether or not it introduces a clause. Often it introduces a noun phrase. In the following example, it introduces an adjective:

always reply as_AV0 quickly as_CJS possible. [C9R.989]

Introducing other clauses:
The tag CJS is also used when introducing other subordinate clauses, such as adverbial clauses of time or reason:

New York called just as_CJS I was leaving. [APU.1543]
As_CJS you've gone to so much trouble , it would seem discourteous to refuse [KY9.2107]

The tag PRP is used for as functioning clearly as a preposition:

Consider it as_PRP a kind of insurance [AD0.1641]
As_PRP head of information, Christina will lead a team of four TEC staff... [BM4.2830]

Usually the meaning is related to the equative meaning of the verb be. However, the guideline restricts PRP to cases where as is followed by a noun phrase or nominal, as is normal for prepositions. Where as is followed by an adjective or a past participle clause, it is tagged CJS, even though it may retain the equative type of meaning:

We regard these results as_CJS encouraging. [B1G.184]
I very much hope that you will in fact support the motion as_CJS originally intended. [KGX.93]

As is part of many multiwords which get tagged with a single tag: e.g. as soon as, such as, in so far as, as long as, as well as. The sequence as well as, for example, is tagged as a preposition (PRP) in such examples as

Sometimes as well as_PRP going this way we actually need to go in this was too. [G5N.31]

Note that this is different from the multiword adverb as well (meaning 'also'); it is also different from the sequence as well as analysed as three separate words, e.g. in:

She's as_AV0 well_AJ0 as_CJS can be expected. [F9X.2095]

But
The coordinating conjunction CJC is overwhelmingly the most common use of but. The following other cases can also be detected:
But is an adverb when its meaning is similar to 'only':

She can spare you but_AV0 a few minutes [CCD.82 *CJC]
There is but_AV0 one penalty. [ALS.185 *CJC]

Subordinating conjunction or preposition:
But is either a conjunction (CJS) or a preposition (PRP) if it has the meaning of 'except (for)', 'other than' or 'apart from'. CJS is used when it introduces a clause, and PRP is used when it introduces a phrase:

...mediocre albums that do nothing but_CJS take up shelf space [C9M.1014]
I couldn't help but_CJS notice. [JY0.5323 *CJC]
I always feel they are open meetings in everything but_PRP name. [HJ3.5520]
No one had guessed she was anything but_PRP a boy. [C85.517]

Coordinating conjunction:
Otherwise but is a coordinating conjunction, tagged CJC, linking units of the same kind (e.g. clauses or adjective/adverb phrases). Its function is to express contrastive or 'adversative' meaning:

God and minds do exist , but_CJC materially so . [ABM.1265]
And that's it for another week but_CJC don't forget the late news at eleven thirty. [J1M.2520]
Hares ( but_CJC not rabbits ) are particularly vulnerable... [B72.892]

Note also multiwords such as but for (PRP):

The fare increases would have been bigger but for_PRP the governments last minute intervention. [K6D.124]

Home
As a locative adverb, home has no determiner or article preceding:

We stayed home_AV0. [FAP.313]
This is my home_NN1. [AMB.1805]

Discoursal function:
In speech, when like has a discoursal function as a 'hedge', we tag it AV0:

well she says like_AV0, I won't be a minute [KCY.1518]
I'm driving along, you know like_AV0 <trunc> wha</trunc> when you're in the car by yourself and everything's turning over in your head [KBU.1096]

Other functions:
Like very frequently occurs as a preposition or as a verb. The noun and adjective uses are fairly rare:

...but I like_VVB Monday best. [FU4.1089]
He didn't look like_PRP a goodie. [H0M.1353]
... fuel, weapons, ground crew and the like_NN1. [JNN.105 *AJ0-NN1]
Churchill and Eden were not of like_AJ0 minds... [ACH.1297]

The meaning of little (AJ0) is the opposite of big:

Bless their dear little_AJ0 faces. [HRB.722]
Little_AJ0 green shoots of recovery are stirring. [CEL.968]

The meaning of little (DT0) is 'not much':

I have little_DT0 to say. [G1Y.1133]
...there was little_DT0 food left. [FSJ.720]

As an adverb (AV0) little also has the meaning 'not much':

I care very little_AV0 about petty-minded, selfish "rules". [B0P.211]

A little
Note that a little can also be a multiword adverb (AV0):

They are all a little_AV0 drunk. [G0F.2117]

However, the quantifier a little meaning 'a small amount' is not tagged as a multiword 4 but as AT0 + DT0

You couldn't let me have a_AT0 little_DT0 milk? [GUM.1656]

[See Determiner-pronoun vs. adverb ]

Much_DT0 of this work has to be done on the spot. [C8R.24]
I've spent too much_DT0 money. [KPV.62659]


Thanks very much_AV0. [A73.5]
I didn't sleep much_AV0 last night [ALH.1495]

See also Determiner-pronoun vs. adverb

See Determiner-pronoun vs. adverb for a fuller discussion. Further examples:

You deserve more_DT0 than a medal. [K97.3705]
More_DT0 haste, less_DT0 speed. [J10.4543]
...this will make him more_AV0 tired than usual [A75.282]
But I couldn't agree more_AV0 [BMD.3]

More than as a multiword premodifier counts as an AV0:

more than_AV0 one in a million [K5N.46]


No_AT0 problem_NN1. [H4H.227]

As a noun, no is usually an abbreviation for number:

quoting Ref_NN1 No_NN1 BCE90_UNC [CJU.673]


but the matter was taken no_AV0 further_AV0. [ARF.183 no: *AT0]
To put it no_AV0 more_AV0 strongly_AV0, it has not been proved beyond doubt that.... [EW7.125]

No is tagged as an interjection (ITJ) where it functions as the opposite of Yes.

"...See how easy my job can be?"
"Frankly, no_ITJ". [HR4.2329]

The clearest cases of CRD are in a quantifying noun phrase, typically allowing the substitution of another numerical expression (e.g. one chip contrasts with two chips) or of the digit 1 (1 chip):

Can I have one_CRD chip, please? [KDB.1416]
So are there criticisms? Just one_CRD. [CG2.1489]
... one_CRD in five sufferers never tells their partners. [CF5.8 *PNI]
Orford Ness is one_CRD of Britain's most unusual coastal features. [CF8.86]

In such noun phrases, one functions like a determiner-pronoun such as some.
Indefinite Pronoun:
The clearest cases of PNI are:
  • As a substitute form, standing for an understood noun or noun phrase:

    The channel was not a broad one_PNI [AEA.1457]

    In this use, one has a plural form ones.
  • As a generic personal pronoun, meaning 'people in general':

    And I think one_PNI might go on to argue that far from saving labour it creates it. [J17.1915]

Note that the reliability of the ambiguity tag PNI-CRD (in which the pronoun is rated more likely) is somewhat low. See 6.6 POS-tagging Error Rates


As both an adverb (AV0) and an adjective (AJ0) right means the opposite of 'wrong' and also the opposite of 'left'. As a noun, it generally means 'entitlements': e.g. I have a right_NN1 to know. The uses of right as a verb are very rare.

Less obvious points:
Discoursal function:
As a discourse marker, right is tagged AV0:

Right_AV0, how you doing there? [KBL.4671]
Right_AV0, er, members, any questions ? [F7V.138]

Degree adverb (intensifier):
In dialectal usage, right can be an intensifier, and is tagged AV0:

it's a ... it's a right_AV0 soft carpet. [KB2.1242-4]

  • In most cases so is tagged as an adverb (AV0):

    So_AV0 this is where you work... [H8M.2964]
    Right, so_AV0 what's fifty three per cent as a decimal? [JP4.357]
    They waited but nothing happened so_AV0 they made a fuss. [FU1.2484]

  • As a pro-form meaning 'thus' or standing for a clause or predicate, so is tagged AV0:

    So_AV0 say I and so_AV0 say the folk. [G11.228]
    "Yes, I think so_AV0." [CCM.151]

  • As a degree adverb or intensifier, so is tagged AV0:

    tough and long lasting - that's why they're so_AV0 popular. [BN4.929]
    There would not be so_AV0 many lonely people in our land [B1Y.1262]

  • Introducing purpose clauses, so is tagged CJS (subordinating conjunction):

    Drink your tea so_CJS they can have your cup. [KB2.1767]

  • Note that so is frequently part of a multiword: so that, so far, so as to, (in) so far as, etc. See the list of multiwords
  • As a demonstrative (pronoun or determiner), that is tagged DT0

    That_DT0's_VBZ my coat yeah. [KBS.1309]
    he's getting hooked on the taste of vaseline, that_DT0 dog. [KCL.197]

  • As a clause-initiating conjunction, that is tagged CJT. This applies to that as a complementizer:

    Many experts claim that_CJT it is good for your growing baby, too. [G2T.1091]

    and also to that as a relativizer (introducing a relative clause):

    A ship that_CJT never enters harbour. [BPA.1326]

    This is different from the more traditional analysis which treats that introducing a relative clause as a relative pronoun.
  • As a degree adverb (intensifier):

    It wasn't all that_AV0 bad. [KPP.321]

  • That occurs commonly in multiwords such as so that, in that, in order that.
In all functions except clear adjectival usage (AJ0, usually following the), then receives the tag AV0:

And then_AV0 she spoke. [H8T.2675]
"Come on, then_AV0." [K8V.1722]
Mr Willi Brandt, the then_AJ0 Mayor of West Berlin. [A87.84]
...the then_AJ0 state governor , who wasn't then_AV0 Bill Clinton [A87.84]

Infinitive marker
When used with an infinitive, to is always tagged TO0. Note elliptical uses of the pre-infinitival to, especially in informal spoken texts:

In the summer holidays, I can, I can get up early if I want to_TO0. [KPG.4153]

Note also the common colloquial spelling of want to, got to, and going to as fused words:

wanna = wan_VVB na_TO0
gotta = got_VVN ta_TO0
gonna = gon_VVG na_TO0
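
The correspondence listed above can be expressed as a simple lookup. The following is a minimal sketch in Python and is not part of the BNC software: the names FUSED_FORMS and split_fused are hypothetical, and only the three fused forms given above are covered.

```python
# Illustrative only: the table holds just the three fused forms listed above.
FUSED_FORMS = {
    "wanna": [("wan", "VVB"), ("na", "TO0")],
    "gotta": [("got", "VVN"), ("ta", "TO0")],
    "gonna": [("gon", "VVG"), ("na", "TO0")],
}


def split_fused(token):
    """Return (word, tag) pairs for a fused form; other tokens come back untagged."""
    return FUSED_FORMS.get(token.lower(), [(token, None)])


print(split_fused("gonna"))  # [('gon', 'VVG'), ('na', 'TO0')]
```

Any token not in the table is returned unchanged with no tag, so the sketch makes no claim about how other words are tagged.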

When used as a preposition, to is always tagged PRP. Prepositions are normally followed by a noun phrase or nominal clause. Where the preposition is 'stranded' (i.e. where the noun phrase associated with the preposition has been moved or elided) it can be confused with an adverbial particle:

That 's the school that Terry goes to_PRP. [KB8.2442]
...what you're entitled to_PRP by law is money back [FUT.360]
"Where to_PRP?""The_PRP moon." [FNW.240-1]

Adverbial particle
The adverbial particle to is rare but does occur, for example in come to meaning 'regain consciousness'.
By far the most common function for well is as an adverb:

She's playing well_AV0

Discoursal function:
When well has the function of a discourse marker, it is treated as an adverb (AV0):

Oh well_AV0! That'll be the finish! [FX6.196-7]
I bet he doesn't get up till about, well_AV0, it's eleven now. [KBL.3808]

Degree adverb:
Well is tagged AV0, too, where it has an intensifying function: e.g.

It was dark outside and well_AV0 past your bedtime. [ASS.898]

Well is tagged as an adjective where it means 'in good health':

You don't look well_AJ0. [HPR.107]

As a verb, well is very rare, but occurs in the phrasal verb well up. NB. This use has not been accurately tagged in the corpus:

Tears well_VVB up in my eyes. [BN3.5 *AV0]

When can introduce three types of clauses: an adverbial clause, a nominal clause, or a relative clause. Where it introduces an adverbial clause, it is tagged CJS. Otherwise it is tagged AVQ. The AVQ tag is also used for when introducing a question. Examples:
Adverbial clause:

When_CJS I got back to my flat, I decided to ring Toby. [CS4.1265]
the crowd left quietly when_CJS the police arrived. [APP.1017] (when = at the time at which)
If you smoke when_CJS you're pregnant... [A0J.1598] (when = whenever)

Note that when is also a subordinating conjunction in abbreviated adverbial clauses which lack a subject and finite verb, such as when in doubt, when ready, when completed.
Nominal clause

I can't remember when_AVQ we last had a frost. [KBF.11728]
"Do you remember when_AVQ we used to go with Daddy in the boat on Saturdays?" [A6N.2022]
You never know when_AVQ the next big story will break. [HJ6.100]

Before an infinitive, when is also tagged AVQ:

Otto knew when_AVQ to change the subject. [FAT.1603]

Also when the rest of the infinitive clause is understood:

Tell me when_AVQ.

Relative clause

in the year when_AVQ I was born (when = in which)
the moment when_AVQ he arrived (when = at which)

Note that when can often be omitted in relative clauses: the moment he arrived.
Direct questions

When_AVQ did you find out?

Where is like when in that it can be a wh- adverb (AVQ) or a subordinating conjunction (CJS). However, with where the CJS tag is much less likely. Examples:
In adverbial clauses:

hit him where_CJS it hurts. [CEN.2816]

In other contexts
  • Nominal clause:

    I don't know where_AVQ she picked them up. [G1D.1163]

  • Relative clauses

    It was the house where_AVQ the poor woodcutter lived with Hansel and Gretel

  • Direct questions:

    Where_AVQ are you going? [KB9.2650]

worth is tagged PRP where it could answer a question such as 'How much is X worth?' or 'What is X worth?'

these pictures are worth_PRP a small fortune. [FNT.1060]
That makes him worth_PRP about $60m. [CT3.479]
'Darling, it's not worth_PRP getting upset. [HH9.2308]

worth also occurs as a 'stranded preposition' in questions used to elicit such responses, and in some other common constructions:

how much d'ya think it's worth_PRP? [KCX.1344]
share prices say nothing about what a company is worth_PRP. [A9U.305 *NN1]
Please go ahead and push Grapevine for all you are worth_PRP. [AP1.575]

worth is tagged NN1 when it is an obvious noun (meaning 'value'). Typically this occurs following expressions of quantity, whether or not the quantity is expressed by a possessive or genitive (e.g. its, 's).

Baker showed his worth_NN1 for Ipswich in the 20th minute [CF9.102]
hundreds of pounds' worth_NN1 of damage. [A0H.15]
2,500 WORTH_NN1 OF PRIZES [ECJ.1147]

6.5.10 Features of spoken corpus tagging

The spoken and written texts of the BNC have been tagged in the same way, except that the following phenomena occur almost entirely in the spoken part of the corpus.
Individual letters
Words spelt out by a speaker as individual letters have been transcribed letter by letter, each being tagged ZZ0.

children who go to the E_ZZ0 N_ZZ0 T_ZZ0 clinic [KB8.3805]
...ten ninety minute tapes! T_ZZ0 D_ZZ0 K_ZZ0 tapes! [KPG.3534-5]

In the written corpus these items would nearly always be written and tagged as whole words (ENT or TDK in the above example).

Truncated words
Words that are left incomplete by the speaker are enclosed within an XML <trunc> element and tagged UNC. Examples include bathr and su in the following:

The <trunc> bathr_UNC </trunc> er you can't beat a white bathroom suite anyway. [KCF.721]
Aye, they only came in the <trunc> su_UNC </trunc> they only came up here in the summer. [GYS.127]

Partial repetition of multiwords
Occasionally in spoken data it happens that only a portion of a multiword sequence is repeated. In this example, the word sort is used twice; in both cases it appears to function not as a separate word but as part of the multiword adverb sort of.

we're going to sort sort of summarize... [G5X.106]

We treat the first sort as an incomplete multiword, and tag it UNC (rather like truncated words, above). The complete multiword sort of is tagged AV0, as normally.

we're going to sort_UNC sort of_AV0 summarize...

Further examples of incomplete multiwords are the as long in as long as (conjunction), of in because of (preposition) and the in in in general (adverb) below:

As_UNC long_UNC As_CJS long as everyone recognizes that for an area of that size... [J9T.258]
because_PRP of the <pause> of_UNC the drought. When we were away it didn't get watered in. [KCH.982]
I know that in_UNC in_UNC in_AV0 general, in in in erm, imperial measure, it is <trunc> f </trunc> five feet eight inches [JK1.480]

The second example shows that when words are repeated, the incomplete portion of a multiword is not necessarily immediately adjacent to the fully formed multiword. In the last example, the three instances of in before erm, imperial measure have not been analysed as part of the multiword in general; they are instead tagged as ordinary words (in this case, ambiguous between preposition and prepositional adverb: PRP-AVP). There are a few cases where the tagger has probably been over-zealous in spotting repeated portions of multiwords:

What happens now_UNC, now_CJS that you are winched down? [HEF.9]

Here, the first instance of now would probably have better been interpreted as a single word adverb (='at this time'), not part of the multiword conjunction now that5.
Er and erm inside multiwords
Generally (in both written and spoken texts) the pause fillers er and erm take the tag UNC. This applies also when they appear within a multiword sequence, as in every er so often. The code assigned to the surrounding <mw> element is identical to that which would have been assigned if the filler were not present.

And your homework was handed in every er so often_AV0, you know [G64.152]

something had gone wrong with the <pause> gas pipes because erm of_PRP <pause> flooding. [KB8.5356]

these kind of books were, er, generally er, at , at er best_AV0 ignored [HUN]

Note that in the last example the word at preceding the multiword at er best is treated as a partial repetition of that multiword, and therefore tagged UNC.
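
The rule that a pause filler does not change the code assigned to the surrounding multiword can be sketched as follows. This is an illustration under stated assumptions, not the tagger's implementation: FILLERS, MULTIWORD_TAGS and multiword_tag are hypothetical names, and the small lexicon contains only the three multiwords exemplified above.

```python
FILLERS = {"er", "erm"}

# Hypothetical mini-lexicon, built only from the examples in this section.
MULTIWORD_TAGS = {
    ("every", "so", "often"): "AV0",
    ("because", "of"): "PRP",
    ("at", "best"): "AV0",
}


def multiword_tag(tokens):
    """Look up the multiword tag after dropping any pause fillers."""
    core = tuple(t.lower() for t in tokens if t.lower() not in FILLERS)
    return MULTIWORD_TAGS.get(core)


print(multiword_tag(["every", "er", "so", "often"]))  # AV0
```

The filler itself would still be tagged UNC as an individual word; the sketch only shows that the multiword lookup ignores it.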

6.6 POS-tagging Error Rates

This section reports on the accuracy of the results of the improved tagging programs.

6.6.1 Levels of estimation

Based on the findings from the 50,000-word test sample, the estimated ambiguity and error rates for the BNC are shown below in three different degrees of detail:
  • First, as a general assessment of accuracy, the estimated rates are given for the whole corpus. (See Table 23. Estimated ambiguity and error rates for the whole corpus (fine-grained calculation) below.)
  • Secondly, separate estimates of ambiguity rates and error rates are given for each of the 57 word tags in the corpus. This will enable users of the corpus to assign appropriate degrees of reliability to each tag. Some tags are always correct; other tags are quite often erroneous. For example, the tag VDD stands for a single form of the verb do: the form did. Since the spelling did is unambiguous, the chances of ambiguity or error, in the use of the tag VDD, are virtually nil. On the other hand, the tag VVB (base finite form of a lexical verb) is not only quite frequent, but also highly prone to ambiguity and error. 15 per cent of the occurrences of VVB are errors - a much higher error rate than any other tag. (See Table 25. Estimated ambiguity rates and error rates by tag below.)
  • Thirdly, separate estimates of ambiguity rates and error rates are given for ‘wrong-tag--right-tag’ pairings XXX, YYY, consisting of (i) the actually-occurring erroneous tag XXX, and (ii) the correct tag YYY which should have occurred in its place. However, because the number of possible tag-pairs is large (57², i.e. 3,249), and most of these tag-pairs have few or no errors, only the more common pairings of erroneous tag and correct tag are separately listed, with their estimated probability of occurrence. This list of tag-pairings will help users further, in enabling them to estimate not merely the reliability of a tag, but, if that tag is incorrect, the likelihood that the correct tag would have been some other particular tag. In this way, the frequency of grammatical word classes, or individual words in those classes, can be estimated more accurately for the whole BNC. (See Table 26. Estimated frequency of selected tag-pairs below.)

6.6.2 Presentation of Ambiguity Rates and Error Rates (fine-grained mode of calculation)

In this section, we examine ambiguities and errors using a ‘fine-grained’ mode of calculation, treating each error as of equal importance to any other error. In Presentation of Ambiguity and Error Rates (coarse-grained calculation) we look at the same data in terms of a ‘coarse-grained’ mode of calculation, ignoring errors and ambiguities involving subcategories of the same part of speech.

Overall estimated ambiguity and error rates: based on the 50,000 word sample

As the following table shows, the ambiguity rate varies considerably between written and spoken texts. (However, note that the calculation for speech is based on a small sample of 5,000 words.)

Table 23. Estimated ambiguity and error rates for the whole corpus (fine-grained calculation)
Sample tag count Ambiguity rate (%) Error rate (%)
Written texts 45,000 3.83% 1.14%
Spoken texts 5,000 3.00% 1.17%
All texts 50,000 3.75% 1.15%

It will be noted that written texts on the whole have a higher ambiguity rate, whereas spoken texts have a slightly greater error rate.

The success of an automatic tagger is sometimes represented in terms of the information-retrieval measures of precision and recall, rather than ambiguity rate and error rate as in Table 23. Estimated ambiguity and error rates for the whole corpus (fine-grained calculation). Precision is the extent to which incorrect tags are successfully discarded from the output. Recall is the extent to which all correct tags are successfully retained in the output of the tagger, allowing, however, for more than one reading to occur for one word (i.e. ambiguous tagging is permitted). According to these measures, the success of the tagging is as follows:

Precision Recall
Written texts 96.17% 98.86%
Spoken texts 97.00% 98.83%
All texts 96.25% 98.85%
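
The figures above are consistent with reading precision as 100 minus the ambiguity rate, and recall as 100 minus the error rate, of Table 23. This relationship is inferred from the numbers and is not stated as a definition in the manual; the short Python sketch below (with a hypothetical function name) simply reproduces the three rows on that assumption.

```python
# (ambiguity rate %, error rate %) as given in Table 23.
RATES = {
    "Written texts": (3.83, 1.14),
    "Spoken texts": (3.00, 1.17),
    "All texts": (3.75, 1.15),
}


def precision_recall(ambiguity_rate, error_rate):
    """Assumed relationship: precision = 100 - ambiguity, recall = 100 - error."""
    return round(100 - ambiguity_rate, 2), round(100 - error_rate, 2)


for label, (ambiguity, error) in RATES.items():
    precision, recall = precision_recall(ambiguity, error)
    print(f"{label}: precision {precision:.2f}%, recall {recall:.2f}%")
```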

However, from now on we will continue to use ‘ambiguity rate’ and ‘error rate’, which appear to us more transparent.

Estimated ambiguity and error rates for each tag (fine-grained mode of calculation)

The estimates for individual tags are again based on the 50,000-word sample, and the ambiguity rate for each tag is based on the number of ambiguity tags which begin with a given tag. The table also specifies the estimated likelihood that a given tag, in the first position of the ambiguity tag, is the correct tag.

In Table 25. Estimated ambiguity rates and error rates by tag, column (b) shows the overall frequency of particular tags (not including ambiguity tags). Column (c) gives the overall occurrence of ambiguity tags, as well as of particular ambiguity tags, beginning with a given tag. (Ambiguity tags marked * are less ‘serious’ in that they apply to two subcategories of the same part of speech, such as past tense and past participle of the verb - see 6.6.3 below.) Column (d) shows which tags are more or less likely to be found as the first part of an ambiguity tag. For example, both NP0 and VVG have an especially high incidence of ambiguity tags. Column (e) tells us, given that we have observed an ambiguity tag, what is the likelihood of the first tag’s being correct? Overall, there is more than a 3-1 chance that the first tag will be correct; but there are some exceptions, where the chances of the first tag’s being correct are much lower: for example, PNI (indefinite pronoun). Note that (f) and (g) exclude errors where the first tag of an ambiguity tag is wrong; contrast Table 28. Estimated error rates for the whole corpus, and Table 29. Estimated error rates (by tag) column (c), below.
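
The derived columns of Table 25 can be recomputed from the counts. The following Python sketch is illustrative only: the function name is hypothetical, and the formulas are taken from the column headings, with column (d) read as c divided by the sum of b and c. The example uses the AJ0 row of the table.

```python
def table25_rates(single_count, ambiguity_count, first_correct, error_count):
    """Recompute columns (d), (e) and (g) of Table 25 from the raw counts.

    single_count    -- column (b)
    ambiguity_count -- column (c), the 'all' figure
    first_correct   -- column (e), count of ambiguity tags whose first tag is correct
    error_count     -- column (f)
    """
    ambiguity_rate = 100 * ambiguity_count / (single_count + ambiguity_count)
    first_correct_pct = 100 * first_correct / ambiguity_count
    error_rate = 100 * error_count / single_count
    return (
        round(ambiguity_rate, 2),
        round(first_correct_pct, 2),
        round(error_rate, 2),
    )


# AJ0 row: 3412 single tags, 338 ambiguity tags, 282 first-tag-correct, 46 errors.
print(table25_rates(3412, 338, 282, 46))  # (9.01, 83.43, 1.35)
```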

Table 25. Estimated ambiguity rates and error rates by tag
(a) Tag (b) Single tag count (out of 50,000 words) (c) Ambiguity tag count (out of 50,000 words) (d) Ambiguity rate (%) (c / (b + c)) (e) 1st tag of ambiguity tag correct (% of all ambiguity tags) (f) Error count (g) Error rate (%) (f / b)
AJ0 3412 all 338 9.01% 282 (83.43%) 46 1.35%
(AJ0-AV0 48)
(AJ0-NN1 209)
(AJ0-VVD 21)
(AJ0-VVG 28)
(AJ0-VVN 32)
AJC 142 0.0% 4 2.82%
AJS 26 0.0% 2 7.69%
AT0 4351 0.0% 2 0.05%
AV0 2450 all 45 1.80% 37 (82.22%) 57 2.33%
(AV0-AJ0 45)
AVP 379 all 44 10.40% 34 (77.27%) 6 1.58%
(AVP-PRP 44)
AVQ 157 all 10 5.99% 10 (100.00%) 9 5.73%
(AVQ-CJS 10)
CJC 1915 0.0% 3 0.16%
CJS 692 all 39 5.34% 30 (76.92%) 18 2.60%
(CJS-AVQ 26)
(CJS-PRP 13)
CJT 236 all 28 10.61% 3 1.27%
(CJT-DT0 28)
CRD 940 all 1 0.11% 0 (0.00%) 0 0.00%
DPS 787 0.0% 0 0.00%
DT0 1180 all 20 1.67% 16 (80.00%) 19 1.61%
(DT0-CJT 20)
DTQ 370 0.0% 0 0.00%
EX0 131 0.0% 1 0.76%
ITJ 214 0.0% 2 0.93%
NN0 270 0.0% 10 3.70%
NN1 7198 all 514 6.66% 395 (76.84%) 86 1.19%
(NN1-AJ0 130)
(NN1-NP0 92)*
(NN1-VVB 243)
(NN1-VVG 49)
NN2 2718 all 55 1.98% 48 (87.27%) 30 1.10%
(NN2-VVZ 55)
NP0 1385 all 264 16.01% 224 (84.84%) 31 2.24%
(NP0-NN1 264)*
ORD 136 0.0% 0 0.00%
PNI 159 all 8 4.79% 3 (37.50%) 5 3.14%
PNP 2646 0.0% 0 0.00%
PNQ 112 0.0% 0 0.00%
PNX 84 0.0% 0 0.00%
POS 217 0.0% 5 2.30%
PRF 1615 0.0% 0 0.00%
PRP 4051 all 166 3.94% 154 (92.77%) 24 0.59%
(PRP-AVP 132)
(PRP-CJS 34)
TO0 819 0.0% 6 0.73%
UNC 158 0.0% 4 2.53%
VBB 328 0.0% 1 0.30%
VBD 663 0.0% 0 0.00%
VBG 37 0.0% 0 0.00%
VBI 374 0.0% 0 0.00%
VBN 133 0.0% 0 0.00%
VBZ 640 0.0% 4 0.63%
VDB 87 0.0% 0 0.00%
VDD 71 0.0% 0 0.00%
VDG 10 0.0% 0 0.00%
VDI 36 0.0% 0 0.00%
VDN 20 0.0% 0 0.00%
VDZ 22 0.0% 0 0.00%
VHB 150 0.0% 1 0.67%
VHD 258 0.0% 0 0.00%
VHG 16 0.0% 0 0.00%
VHI 119 0.0% 0 0.00%
VHN 9 0.0% 0 0.00%
VHZ 116 0.0% 1 0.86%
VM0 782 0.0% 3 0.38%
VVB 560 all 84 13.04% 56 (66.67%) 84 15.00%
(VVB-NN1 84)
VVD 970 all 90 8.49% 62 (58.89%) 50 5.15%
(VVD-AJ0 11)
(VVD-VVN 79)*
VVG 597 all 132 18.11% 112 (84.84%) 9 1.51%
(VVG-AJ0 83)
(VVG-NN1 49)
VVI 1211 0.0% 7 0.58%
VVN 1086 all 158 12.70% 113 (71.52%) 27 2.49%
(VVN-AJ0 50)
(VVN-VVD 108)*
VVZ 295 all 26 8.10% 14 (53.85%) 11 3.73%
(VVZ-NN2 26)
XX0 363 0.0% 0 0.00%
ZZ0 75 0.0% 3 4.00%

Estimated error rates specifying the incorrect tag and the correct tag (fine-grained calculation)

The next table, Table 26. Estimated frequency of selected tag-pairs, gives the frequency, as a percentage, of error-prone tag-pairs where XXX is the incorrect tag and YYY is the correct tag which should have occurred in its place. In the third column, the number of the specified error-type is listed, as a frequency count from the sample of 50,000 words. In the fourth column, this is expressed as a percentage of all the tagging errors of word category XXX (in Table 25. Estimated ambiguity rates and error rates by tag column (f)). The fifth column answers the question: if tag XXX occurs, what is the likelihood that it is an error for tag YYY? Where the number of occurrences of a given error-type is less than 5 (i.e. 1 in 10,000 words), they are ignored. Hence, Table 26. Estimated frequency of selected tag-pairs is not exhaustive: only the more likely error-types are listed. In the second column, we add, where useful, the individual words which trigger these errors.

Table 26. Estimated frequency of selected tag-pairs
(1) Incorrect tag XXX (2) Corrected tag YYY (3) No. of occurrences of this error type (4) % of all incorrect uses of tag(XXX) (5) % of all tags XXX
AJ0 AV0 12 26.1% 0.4%
NN1 12 26.1% 0.4%
NP0 5 10.9% 0.1%
VVN 8 17.4% 0.2%
AV0 AJ0 6 10.5% 0.2%
AJC 8 14.0% 0.3%
DT0 24 42.1% 1.0%
EX0 (there) 5 8.8% 0.2%
PRP 5 8.8% 0.2%
AVQ CJS (when, where) 6 66.7% 3.8%
CJS PRP 10 55.6% 1.4%
DT0 AV0 15 78.9% 1.3%
NN1 AJ0 13 15.1% 0.2%
NN0* 8 9.3% 0.1%
NP0* 22 25.6% 0.3%
UNC 9 10.5% 0.2%
VVI 13 15.1% 0.2%
NN2 NP0* 14 46.7% 0.5%
NP0 NN1* 10 32.3% 0.7%
NN0* 5 16.1% 0.4%
PRP AV0 7 29.2% 0.2%
AVP 5 20.8% 0.1%
TO0 PRP (to) 6 100.0% 0.7%
VVB AJ0 7 8.3% 1.3%
NN1 7 8.3% 1.3%
VVI* 55 65.5% 9.8%
VVD AJ0 6 12.0% 0.6%
VVN* 44 88.0% 4.5%
VVG NN1 9 100.0% 1.5%
VVI NN1 5 71.4% 0.4%
VVN AJ0 7 25.9% 0.6%
VVD* 17 63.0% 1.6%
VVZ NN2 8 72.7% 2.7%

As before, the asterisk * indicates a ‘less serious’ error, in which the erroneous and correct tags belong to the same major category or part of speech. As the table shows, the most frequent specific error types are within the verb category: VVB → VVI (55, or 9.8% of all VVB tags) and VVD → VVN (44, or 4.5% of all VVD tags).
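
The two percentage columns of Table 26 can be checked against Table 25. In the Python sketch below (hypothetical function name), column (4) divides the pair count by the error count for tag XXX in Table 25 column (f), as described above. Dividing by the single tag count in Table 25 column (b) for column (5) is an assumption inferred from the figures.

```python
def pair_percentages(pair_errors, tag_errors, tag_count):
    """Return columns (4) and (5) of Table 26 for one XXX/YYY pair.

    pair_errors -- column (3): occurrences of this error type
    tag_errors  -- Table 25 column (f) for tag XXX
    tag_count   -- Table 25 column (b) for tag XXX (assumed basis of column 5)
    """
    share_of_errors = 100 * pair_errors / tag_errors
    share_of_tags = 100 * pair_errors / tag_count
    return round(share_of_errors, 1), round(share_of_tags, 1)


# VVB mistaken for VVI: 55 cases, 84 VVB errors, 560 VVB single tags.
print(pair_percentages(55, 84, 560))  # (65.5, 9.8)
# VVD mistaken for VVN: 44 cases, 50 VVD errors, 970 VVD single tags.
print(pair_percentages(44, 50, 970))  # (88.0, 4.5)
```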

6.6.3 A further mode of calculation: ignoring subcategories of the same part of speech

Presentation of Ambiguity and Error Rates (coarse-grained calculation)

Yet a further way of looking at the ambiguities and errors in the corpus is to make a coarse-grained calculation in counting these phenomena. In a fine-grained measurement, which is the one assumed up to now, each tag is considered to define its own word class which is different from all other word classes. Using the coarse-grained calculation, on the other hand, we consider words to belong to different word classes (parts of speech) only when the major category is different. If we consider the pair NN1 (singular common noun) and NP0 (proper noun), the coarse-grained calculation says that the ambiguity tag NN1-NP0 or NP0-NN1 does not show tagging uncertainty, since both the proposed tags agree in categorizing the word as the same part of speech (a noun). So this does not add to the ambiguity rate. Similarly, the coarse-grained point of view on error is that, if a word is tagged as NN1 when it should be NP0, or vice versa, then this is not an error, because both tags are within the noun category. To summarize: in the fine-grained calculation, minor differences of wordclass count towards the ambiguity and error rates; in the coarse-grained calculation, they do not.
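
The coarse-grained treatment of ambiguity tags can be sketched as below. This is a hedged illustration in Python, not the procedure actually used for the error report. The grouping into major categories is an assumption made for the example: all tags beginning with N are treated as nouns and all tags beginning with V as verbs, and for other tags the first two characters are used. The function names are hypothetical.

```python
def major_category(tag):
    """Assumed grouping: N* = noun, V* = verb, otherwise the first two characters."""
    if tag[0] in ("N", "V"):
        return tag[0]
    return tag[:2]


def counts_as_ambiguous(ambiguity_tag):
    """True if the two halves of an ambiguity tag differ in major category."""
    first, second = ambiguity_tag.split("-")
    return major_category(first) != major_category(second)


for tag in ("NN1-NP0", "VVD-VVN", "NN1-VVB", "AJ0-NN1"):
    print(tag, counts_as_ambiguous(tag))
```

Under this assumption NN1-NP0 and VVD-VVN are not counted towards the coarse-grained ambiguity rate, while NN1-VVB and AJ0-NN1 still are.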

In this section, the same calculations are made as in 6.6.2, except that errors and ambiguities which are confined within a major category (noun, verb, etc.) are ignored. In practice, most of the errors and ambiguities of this kind come from the difficulty the tagger finds in recognizing the difference between NN1 (singular common noun) and NP0 (proper noun), between VVD (past tense lexical verb) and VVN (past participle lexical verb), and between VVB (finite present tense base form, lexical verb) and VVI (infinitive lexical verb). Thus the ambiguity tags NN1-NP0, VVD-VVN and their mirror images do not occur in the relevant table (Table 28. Estimated error rates for the whole corpus) below. However, since there are no ambiguity tags for VVB and VVI, the problem of distinguishing these two shows up only in the error calculation.

The three tables in this section correspond with the three tables in the preceding section.

Table 27. Estimated ambiguity and error rates for the whole corpus
Sample tag count Ambiguity rate (%) Error rate (%)
Written texts 45,000 2.78% 0.69%
Spoken texts 5,000 2.67% 0.87%
All texts 50,000 2.77% 0.71%

It will be noted from Table 27. Estimated ambiguity and error rates for the whole corpus that this method of calculation reduces the overall ambiguity rate by c.1 per cent, and the overall error rate by c.0.5 per cent. We will not present coarse-grained tables corresponding to Table 25. Estimated ambiguity rates and error rates by tag and Table 26. Estimated frequency of selected tag-pairs above: these tables would be unchanged from the fine-grained calculation, except that the rows marked with an asterisk (*) would be deleted, and the other calculations changed as necessary.

Different modes of calculation: eliminating ambiguities

Given that the elimination of errors was beyond our capability within the time frame and budget we had available, the corpus in its present form, containing ambiguity tags as well as a small proportion of errors, is designed for what we believe will be the most common type of user, who will find it easier to tolerate ambiguity than error. However, other users may prefer a corpus which does not contain ambiguities, even though its error rate is higher. For this latter type of user, the present corpus is easy to interpret as a corpus free of ambiguities, simply by deleting or ignoring the second tag of any ambiguity tag, and accepting the first tag as the only one. In what follows, we therefore allow two modes of calculation: in addition to the "safer" mode, in which ambiguities are allowed and consequently errors are relatively low, we allow a "riskier" mode in which ambiguities are abolished, and errors are more frequent. In fact, if ambiguity tags are eliminated, the overall error rate rises to almost 2 per cent.
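
The "riskier" reading described above amounts to keeping only the part of each tag before the hyphen. A minimal Python sketch follows; the function name is hypothetical, and the (word, tag) pair representation is assumed for the example rather than taken from the corpus format.

```python
def first_tag_only(tag):
    """Reduce an ambiguity tag such as 'VVD-VVN' to its first tag; leave others alone."""
    return tag.split("-", 1)[0]


# Assumed representation: a list of (word, tag) pairs.
tagged = [("she", "PNP"), ("wanted", "VVD-VVN"), ("tea", "NN1")]
print([(word, first_tag_only(tag)) for word, tag in tagged])
# [('she', 'PNP'), ('wanted', 'VVD'), ('tea', 'NN1')]
```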

Table 28. Estimated error rates for the whole corpus
Sample tag count Error rate (%)
Written texts 45,000 2.01%
Spoken texts 5,000 1.92%
All texts 50,000 2.00%

The following table gives an error count (c) for each tag: i.e. the number of errors in the 50,000 word sample where that tag was the erroneous tag. [Cf. the "safer" error count in Table 25. Estimated ambiguity rates and error rates by tag, column (f).] In addition, each tag has a correction count (d): i.e. the number of erroneous tags for which that tag was the correct tag. If we subtract the Error count (c) from the Tag count (b), and add the Correction count (d) to the result, we arrive at the "Real tag count" (e) representing the number of occurrences of that tag in the corrected sample corpus. Not included in the table is the small number of ‘multiword’ errors which resulted in two tags being replaced by one (error count), or one tag being replaced by two (correction count), due to the incorrect non-use or use of multiword tags. The last column divides the error count by the tag count to provide the error rate (as a percentage).
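
The arithmetic just described can be written out as a short Python sketch. The function name is hypothetical; the formulas are those given in the column headings of Table 29, and the example values are the VVB and AJ0 rows of that table.

```python
def table29_row(tag_count, error_count, correction_count):
    """Return (real tag count, error rate %) from columns (b), (c) and (d)."""
    real_tag_count = tag_count - error_count + correction_count  # (e) = b - c + d
    error_rate = round(100 * error_count / tag_count, 2)         # (f) = (c / b) x 100
    return real_tag_count, error_rate


print(table29_row(644, 112, 13))    # VVB: (545, 17.39)
print(table29_row(3750, 102, 132))  # AJ0: (3780, 2.72)
```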

Table 29. Estimated error rates (by tag)
(a) Tag (b) Tag count (c) Error count (d) Correction count (e) Real tag count (b - c + d) (f) Error rate (%) (c / b) × 100
AJ0 3750 102 (132) 3780 2.72%
AJC 142 4 (12) 150 2.82%
AJS 26 2 (0) 24 7.69%
AT0 4351 2 (3) 4352 0.05%
AV0 2495 65 (67) 2497 2.61%
AVP 423 16 (17) 424 3.78%
AVQ 167 9 (6) 164 5.39%
CJC 1915 3 (1) 1913 0.16%
CJS 731 27 (5) 709 3.69%
CJT 264 3 (15) 276 1.14%
CRD 940 1 (11) 950 0.11%
DPS 787 0 (0) 787 0.00%
DT0 1200 23 (29) 1206 1.92%
DTQ 370 0 (0) 370 0.00%
EX0 131 1 (5) 135 0.76%
ITJ 214 2 (2) 214 0.93%
NN0 270 10 (16) 276 3.70%
NN1 7712 205 (152) 7659 2.66%
NN2 2773 37 (29) 2765 1.33%
ORD 136 0 (2) 138 0.00%
NP0 1649 71 (102) 1680 4.31%
PNI 167 10 (1) 158 5.99%
PNP 2646 0 (1) 2647 0.00%
PNQ 112 0 (0) 112 0.00%
PNX 84 0 (1) 85 0.00%
POS 217 5 (6) 218 2.30%
PRF 1615 0 (0) 1615 0.00%
PRP 4217 36 (45) 4226 0.85%
TO0 819 6 (1) 814 0.73%
UNC 158 4 (29) 183 2.53%
VBB 328 1 (0) 327 0.30%
VBD 663 0 (0) 663 0.00%
VBG 37 0 (0) 37 0.00%
VBI 374 0 (0) 374 0.00%
VBN 133 0 (0) 133 0.00%
VBZ 640 4 (5) 641 0.63%
VDB 87 0 (0) 87 0.00%
VDD 71 0 (0) 71 0.00%
VDG 10 0 (0) 10 0.00%
VDI 36 0 (0) 36 0.00%
VDN 20 0 (0) 20 0.00%
VDZ 22 0 (0) 22 0.00%
VHB 150 1 (0) 151 0.67%
VHD 258 0 (0) 258 0.00%
VHG 16 0 (0) 16 0.00%
VHI 119 0 (1) 120 0.00%
VHN 9 0 (0) 9 0.00%
VHZ 116 1 (0) 115 0.86%
VM0 782 3 (0) 779 0.38%
VVB 644 112 (13) 545 17.39%
VVD 1060 78 (60) 1042 7.36%
VVG 729 29 (29) 729 3.98%
VVI 1211 7 (73) 1277 0.57%
VVN 1244 72 (87) 1259 5.79%
VVZ 321 23 (12) 310 7.17%
XX0 363 0 (0) 363 0.00%
ZZ0 75 3 (4) 76 4.00%

It is clear from this table that the amount of error in the tagging of the corpus varies greatly from one tag to another. The most error-prone tag, by a large margin, is VVB, with more than 17 per cent error, while many of the tags are associated with no errors at all, and well over half the tags have less than 1 per cent error. The final table gives figures for the third level of detail, where we itemise individual tag pairs XXX, YYY, where XXX is the incorrect tag, and YYY is the correct one which should have appeared but did not. Only those pairings which account for 5 or more errors are listed. This table differs from Table 26. Estimated frequency of selected tag-pairs in that here the second tags of ambiguity tags are not taken into account ("riskier mode"). It will be seen that the errors which occur tend to fall into a relatively small number of major categories.

The percentages in columns 4 and 5 of this table are calculated with respect to the figures given in Table 25. Estimated ambiguity rates and error rates by tag.
Table 30. Estimated frequency of selected tag-pairs
Incorrect tag XXX Correct tag YYY No. of occurrences of this error type % of all incorrect uses of tag XXX % of all tags XXX
AJ0 AV0 22 21.57% 0.59%
AJ0 NN1 41 40.19% 1.09%
AJ0 NP0 5 4.90% 0.13%
AJ0 VVG 14 13.73% 0.37%
AJ0 VVN 14 13.73% 0.37%
AV0 AJ0 9 13.85% 0.36%
AV0 AJC 8 12.31% 0.32%
AV0 DT0 26 40.00% 1.04%
AV0 EX0 (there) 5 7.69% 0.20%
AV0 PRP 6 9.23% 0.24%
AVP CJT 6 94.12% 1.42%
AVQ CJS (when, where) 6 66.67% 3.59%
CJS PRP 15 55.56% 2.05%
DT0 AV0 (much, more, etc) 15 65.22% 1.25%
NN1 AJ0 63 30.73% 0.82%
NN1 NN0 8 3.90% 0.10%
NN1 NP0 74 36.10% 0.96%
NN1 UNC 9 4.39% 0.12%
NN1 VVB 9 4.39% 0.12%
NN1 VVG 13 6.34% 0.17%
NN1 VVI 13 6.34% 0.17%
NN2 NP0 14 37.84% 0.50%
NN2 UNC 9 24.32% 0.32%
NN2 VVZ 10 27.02% 0.36%
NN0 UNC 7 70.00% 2.59%
NP0 NN1 50 70.42% 3.03%
NP0 NN2 5 7.04% 0.30%
PNI CRD (one) 9 90.00% 5.39%
PRP AV0 8 22.22% 0.19%
TO0 PRP (to) 6 100.00% 0.73%
VVB AJ0 7 6.25% 1.09%
VVB NN1 35 31.25% 5.43%
VVB VVI 55 49.11% 8.54%
VVB VVN 5 4.46% 0.85%
VVD AJ0 14 17.95% 1.32%
VVD VVN 64 82.05% 6.04%
VVG AJ0 11 37.93% 1.51%
VVG NN1 18 62.07% 2.47%
VVI NN1 5 71.43% 0.41%
VVZ NN2 20 86.96% 6.23%
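The two percentage columns can be reproduced from the per-tag figures. The sketch below assumes that, in the per-tag table, the first number after each tag is the total number of times the tag was assigned in the sample and the second is the number of those assignments that were wrong; this reading is inferred from the figures rather than stated here, and the function name is hypothetical.

```python
def tag_pair_shares(pair_errors: int, tag_errors: int, tag_total: int) -> tuple[float, float]:
    """Return (% of all incorrect uses of the tag, % of all uses of the tag)."""
    share_of_errors = round(100 * pair_errors / tag_errors, 2)
    share_of_tags = round(100 * pair_errors / tag_total, 2)
    return share_of_errors, share_of_tags

# NN1 assigned where AJ0 was correct: 63 cases,
# out of 205 NN1 errors and 7712 NN1 tags in the sample.
print(tag_pair_shares(63, 205, 7712))   # (30.73, 0.82)

# VVD assigned where VVN was correct: 64 cases,
# out of 78 VVD errors and 1060 VVD tags in the sample.
print(tag_pair_shares(64, 78, 1060))    # (82.05, 6.04)
```

Both results agree with the corresponding rows of the table.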

Some of the error types above are associated with one or two particular words, and where these occur they are listed. For example, the AV0 - EX0 type of error occurs invariably with the one word there.

Finally, we list here the text samples used to constitute the manually-conducted 50,000-word error analysis. Each sample consisted of 2,000 words taken from the BNC texts listed below, except that two samples, one of written and one of spoken English, consisted of 1,000 words only. These samples are marked "*" in the list below. The reason for using half-length samples in two cases was to maintain the proportion of written and spoken data as 90% - 10%, so as to keep the proportions of the sample the same as the proportions in the BNC as a whole. The BNC text files are cited by the three-character code used in the BNC Users Reference Guide.
Written imaginative writing
Written informative writing
Natural Science
Applied Science
Social Science
CLH, EE8, *A6Y
World Affairs
A4J, CMT, EE2, EB7
Commerce and finance
HGP, B27
C9U, G1N
Belief and thought
Spoken demographic
Spoken context-governed

6.6.4 POS-Tagging Workflow

The first four phases were carried out automatically, using CLAWS4, an automatic tagger which developed out of the CLAWS1 automatic tagger (authors: Roger Garside and Ian Marshall 1983) used to tag the LOB Corpus. The advanced version CLAWS4 is principally the work of Roger Garside, although many other researchers at Lancaster have contributed to its performance in one way or another. Further information about CLAWS4 can be obtained from Leech, Garside and Bryant 1994 and Garside and Smith 1997. CLAWS4 is a hybrid tagger, employing a mixture of probabilistic and non-probabilistic techniques. The fifth and sixth phases used other systems, described in the appropriate section below.

A. Tokenization

The first major step in automatic tagging is to divide up the text or corpus to be tagged into individual (1) word tokens and (2) orthographic sentences. These are the segments usually demarcated by (1) spaces and (2) sentence boundaries (i.e. sentence final punctuation followed by a capital letter). This procedure is not so straightforward as it might seem, particularly because of the ambiguity of full stops (which can be abbreviation marks as well as sentence-demarcators) and of capital letters (which can signal a naming expression, as well as the beginning of a sentence). Faults in tokenization occasionally occur, but rarely cause tagging errors.

In tokenization, an orthographic word boundary (normally a space, with or without accompanying punctuation) is the default test for identifying the beginning and end of word-tokens. (See, however, the next paragraph and D. Idiom-Tagging below.) Hyphens are counted as word-internal, so that a hyphenated word such as key-ring is given just one tag (NN1). Because of the different ways of writing compound words, the same compound may occur in three forms: as a single word written ‘solid’ (markup), as a hyphenated word (mark-up) or as a sequence of two words (mark up). In the first two cases, CLAWS4 will give the compound a single tag, whereas in the third case, it will receive two tags: one for mark and the other for up.
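The treatment of hyphens and spaces just described can be illustrated with a minimal sketch. It is not the CLAWS4 tokenizer: the regular expression and the function name are assumptions chosen only to show that hyphenated forms stay whole while a spaced compound becomes two tokens.

```python
import re

# Hyphens count as word-internal; any other punctuation mark is split off
# as a token of its own. (Illustrative pattern, not the CLAWS4 rule set.)
TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize(text: str) -> list[str]:
    """Split text into word tokens on orthographic boundaries."""
    return TOKEN.findall(text)

print(tokenize("The mark-up on a key-ring was a mark up."))
# ['The', 'mark-up', 'on', 'a', 'key-ring', 'was', 'a', 'mark', 'up', '.']
```

Here mark-up and key-ring would each receive one tag, while mark up would receive two.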

A set of special cases dealt with by tokenization is the set of enclitic verb and negative contractions such as 's, 're, 'll and n't, which are orthographically attached to the preceding word. These will be given a tag of their own, so that (for example) the orthographic forms It's, they're, and can't are given two tags in sequence: pronoun + verb, verb + negative, etc. There are also 'merged' forms such as won't and dunno, which are decomposed into more than one word for tagging purposes. For example, dunno actually ends up with the three tags for do + n't + know (for a list of these contracted forms, see 9.7 Contracted forms and multiwords).

B. Initial assignment of tags

The second stage of CLAWS POS-tagging is to assign to each word token one or more tags. Many word tokens are unambiguous, and so will be assigned just one tag: e.g. various AJ0 (adjective). Other word tokens are ambiguous, taking from two to seven potential tags. For example, the token paint can be tagged NN1, VVB, VVI, i.e. as a noun or as a verb; the token broadcast can be tagged as VVB, VVI, VVD, VVN (verb which is either present tense, infinitive, past tense, or past participle). In addition, it can be a noun (NN1) or an adjective (AJ0), as in a broadcast concert.

To find the list of potential tags associated with a word, CLAWS first looks up the word in a lexicon of c.50,000 word entries. This lexicon look-up accounts for a large proportion of the word tokens in a text. However, many rarer words or names will not be found in the lexicon, and are tagged by other test procedures. Some of the other procedures are:
  • Look for the ending of a word: e.g. words in -ness will normally be nouns.
  • Look for an initial capital letter (especially when the word is not sentence-initial). Rare names which are not in the lexicon and do not match other procedures will normally be recognized as proper nouns on the basis of the initial capital.
  • Look for a final -(e)s. This is stripped off, to see if the word otherwise matches a noun or verb; if it does, the word in -s is tagged as a plural noun or a singular present-tense verb.
  • Numbers and formulae (e.g. 271, *K9, +) are tagged by special rules.
  • If all else fails, a word is tagged ambiguously as either a noun, an adjective or a lexical verb.
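The look-up and fall-back procedures listed above can be sketched as follows. This is only an illustration under stated assumptions: the three-entry lexicon, the order in which the tests are tried, the use of CRD for numbers and the choice of NN1/AJ0/VVB for the final fall-back are all hypothetical, not taken from CLAWS4.

```python
import re

# Toy lexicon standing in for the c.50,000-entry lexicon described above.
LEXICON = {
    "various": ["AJ0"],
    "paint": ["NN1", "VVB", "VVI"],
    "for": ["PRP", "CJS"],
}

def candidate_tags(word: str, sentence_initial: bool = False) -> list[str]:
    """Return the potential tags for a non-empty word token."""
    if word.lower() in LEXICON:
        return LEXICON[word.lower()]
    if re.fullmatch(r"\d[\d.,]*", word):
        return ["CRD"]                      # numbers: special rules
    if word.endswith("ness"):
        return ["NN1"]                      # word ending suggests a noun
    if word[0].isupper() and not sentence_initial:
        return ["NP0"]                      # unknown capitalised word
    if word.endswith("s"):
        # Strip a final -(e)s and see whether the stem is a known noun or verb.
        stems = [word[:-1]] + ([word[:-2]] if word.endswith("es") else [])
        for stem in stems:
            stem_tags = LEXICON.get(stem.lower(), [])
            tags = []
            if "NN1" in stem_tags:
                tags.append("NN2")          # plural noun
            if "VVB" in stem_tags or "VVI" in stem_tags:
                tags.append("VVZ")          # -s form of lexical verb
            if tags:
                return tags
    return ["NN1", "AJ0", "VVB"]            # last resort: noun, adjective or verb

for w in ["various", "kindness", "paints", "Zebedee", "271", "blorf"]:
    print(w, candidate_tags(w))
```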

When a word is associated with more than one tag, information is given by the lexicon look-up or other procedures on the relative probability of each tag. For example, the word for can be a preposition or a conjunction, but is much more likely to be a preposition. This information is provided by the lexicon, either in numerical form, or where numerical data available are insufficient, by a simple distinction between 'unmarked', 'rare' and 'very rare' tags.

Some adjustment of probability is made according to the position of the word in the sentence. If a word begins with a capital, the likelihood of various tags depends partly on whether the word occurs at the beginning of a sentence. For instance, the word Brown at the beginning of a sentence is less likely to be a proper noun than an adjective or a common noun (normally written brown). Hence the likelihood of a proper noun tag being assigned is reduced at the beginning of a sentence.

C. Tag selection (or disambiguation)

The next stage, logically, is to choose the most probable tag from any ambiguous set of tags associated with a word token by tag assignment (but see D. Idiom-Tagging below). This is another probabilistic procedure, this time making use of the context in which a word occurs. A method known as Viterbi alignment uses the probabilistic estimates available, both in terms of the tag-word associations and the sequential tag-tag likelihoods, to calculate the most likely path through the sequence of tag ambiguities. (The model employed is largely equivalent to a hidden Markov model.) After tag selection, a single 'winning tag' is selected for each word token in a text. (The less likely tags are not obliterated: they follow the winning tag in descending probability order.) However, the winning tag is not necessarily the right answer. If the CLAWS tagging stopped at this point, only c.95-96% of the word-tokens would be correctly tagged. This is the main reason for including an additional stage (or rather a set of stages) termed 'idiom-tagging'.

D. Idiom-Tagging
Idiom-tagging is a stage of CLAWS4's operation in which sequences of words and tags are matched against a template. Depending on the match, the tags may be disambiguated or corrected. In practice, there are two main reasons for idiom-tagging:
  • The correct tag can only be selected if CLAWS looks at a word+tag sequence as a whole. In tag selection, this was not done, since the program merely used 'bigrams' consisting of two tags in sequence. In other words, idiom-tagging is more powerful than the Viterbi disambiguation algorithm because it is able to operate on a 'window' of several word tokens at once.
  • There are many cases in English where a sequence of orthographic words is best assigned a single tag. Such cases include because of (a preposition), so long as (a conjunction), and of course (an adverb). These so-called multiwords are the opposite of the contracted forms such as don't and there's, where one orthographic word is assigned more than one tag. Thus idiom-tagging here plays the role of adjusting tokenization to larger units.
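To make the first point concrete, a bigram-based Viterbi selection of the kind Stage C is described as using can be sketched as follows. This is a minimal illustration and not the CLAWS4 implementation: all probabilities are invented, the names are hypothetical, and using P(tag | word) as the lexical score is an assumption.

```python
def viterbi(words, lexical, transition, start="<s>", floor=1e-6):
    """Return the most likely tag sequence under a bigram model.

    lexical[word]        -> {tag: P(tag | word)}  (toy values, assumed)
    transition[(t1, t2)] -> P(t2 | t1)            (toy values, assumed)
    """
    best = {start: (1.0, [])}   # tag -> (score, best path ending in that tag)
    for word in words:
        step = {}
        for tag, p_lex in lexical[word].items():
            # Keep only the best-scoring way of reaching this tag.
            score, path = max(
                ((s * transition.get((prev, tag), floor) * p_lex, p)
                 for prev, (s, p) in best.items()),
                key=lambda pair: pair[0],
            )
            step[tag] = (score, path + [tag])
        best = step
    return max(best.values(), key=lambda pair: pair[0])[1]

LEXICAL = {
    "to": {"TO0": 0.6, "PRP": 0.4},
    "the": {"AT0": 1.0},
    "paint": {"NN1": 0.6, "VVI": 0.2, "VVB": 0.2},
}
TRANSITION = {
    ("<s>", "TO0"): 0.5, ("<s>", "PRP"): 0.5, ("<s>", "AT0"): 0.5,
    ("TO0", "VVI"): 0.9, ("TO0", "NN1"): 0.01, ("TO0", "VVB"): 0.01,
    ("PRP", "NN1"): 0.3, ("PRP", "VVI"): 0.01, ("PRP", "VVB"): 0.01,
    ("AT0", "NN1"): 0.6,
}

print(viterbi(["to", "paint"], LEXICAL, TRANSITION))   # ['TO0', 'VVI']
print(viterbi(["the", "paint"], LEXICAL, TRANSITION))  # ['AT0', 'NN1']
```

Because each step looks only at the immediately preceding tag, a dependency that spans several words is invisible to this search, which is the gap that a windowed matching procedure is meant to fill. A practical version would also work with log probabilities to avoid underflow on long sentences.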
Idiom-tagging is a matching procedure which operates on lists of rules which might loosely be termed ‘idioms’. Among these are:
  • a list of multiwords (just described) such as because of, so long as and of course.
  • a list of place name expressions (e.g. Mount X , where X is some word beginning with a capital).
  • a list of personal name expressions (e.g. Dr. (X) Y, where X and Y are words beginning with a cap.; the word X may or may not appear in the matching word sequence).
  • a list of foreign or classical language expressions used in English (e.g. de jure, hoi polloi)
  • a list of grammatical sequences where there are typically 'slots' in the sequence which may or may not be filled: e.g. Modal + (adverb/negative) + (adverb/negative) + Infinitive. This matches a sequence such as would not necessarily like. The recognition that the word token like here is an infinitive verb (rather than, say, a present-tense verb or a preposition) could not be trusted if the tagger was not equipped with an idiom-tagging component, but had to rely simply on tag-pair probabilities.

The idiom-tagging component of CLAWS is quite powerful in matching 'template' expressions in which there are wild-card symbols, Boolean operators and gaps of up to n words. They are much more variable than ‘idioms’ in the ordinary sense, and resemble finite-state networks.
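A template with optional slots, such as Modal + (adverb/negative) + (adverb/negative) + Infinitive, can be sketched as a small backtracking matcher. The data layout, function names and the way the match resolves the final word to VVI are assumptions for illustration; the actual Idiomlist notation is not reproduced here.

```python
# Each slot is (set of acceptable tags, optional?).
MODAL_INFINITIVE = [
    ({"VM0"}, False),
    ({"AV0", "XX0"}, True),
    ({"AV0", "XX0"}, True),
    ({"VVI"}, False),
]

def match_template(tokens, start, template):
    """Return the end index of a match beginning at `start`, or None.

    tokens is a list of (word, set_of_candidate_tags).
    """
    def go(i, j):
        if j == len(template):
            return i
        allowed, optional = template[j]
        if i < len(tokens) and tokens[i][1] & allowed:
            end = go(i + 1, j + 1)
            if end is not None:
                return end
        return go(i, j + 1) if optional else None
    return go(start, 0)

def resolve_infinitives(tokens):
    """Where the template matches, reduce the final word's tags to VVI."""
    out = list(tokens)
    for start in range(len(out)):
        end = match_template(out, start, MODAL_INFINITIVE)
        if end is not None:
            word, _ = out[end - 1]
            out[end - 1] = (word, {"VVI"})
    return out

sentence = [
    ("would", {"VM0"}),
    ("not", {"XX0"}),
    ("necessarily", {"AV0"}),
    ("like", {"VVB", "VVI", "PRP"}),
]
print(resolve_infinitives(sentence)[-1])   # ('like', {'VVI'})
```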

Another important point about idiom-tagging is that it is split up into two main phases which operate at different points in the tagging system. One part of the idiom-tagging takes place at the end of Stage C., in effect retrospectively correcting some of the errors which would otherwise occur in CLAWS output. Another part, however, actually takes place between Stages B. and C. This means it can utilise ambiguous input and also produce ambiguous output, perhaps adjusting the likelihood of one tag relative to another. As an example, consider the case of so long as, which can be a single grammatical item - a conditional conjunction meaning 'provided that'. The difficulty is that so long as can also be a sequence of three separate grammatical items: degree adverb + adjective/adverb + conjunction/preposition. In this case, the tagging ambiguity belongs to a whole word sequence rather than a single word, and the output of the idiom-tagging has to be passed on to the probabilistic tag selection stage. Hence, although we have called idiom-tagging ‘Stage D’, it is actually split between two stages, one preceding C. and one following C.

When the text emerges from Stages C. and D., each word has an associated set of one or more tags associated with it, and each tag itself is associated with a probability represented as a percentage. An example is:

entering VVG 86% NN1 14% AJ0 0%

Clearly VVG (-ing participle of the verb enter) is judged by CLAWS4 to be the most likely tag in this case.

E. After CLAWS: the Template Tagger

The error rate with CLAWS4 averages around 3%.6 For the BNC Tagging Enhancement project, we decided to concentrate our efforts on the rule-based part of the system, where most of the inroads in error reduction had been made. This involved (a) developing software with more powerful pattern-matching capabilities than the CLAWS Idiomlist, and (b) carrying out a more systematic analysis of errors, to identify appropriate error-correcting rules.

The next program, known as Template Tagger, supplements rather than supplants CLAWS. It takes a CLAWS output file as its input, and "patches"7 any erroneous tags it finds by using hand-written template rules. Figure 1 above shows where Template Tagger fits in the overall tagging scheme. Effectively, it is an elaborate 'search and replace' tool, capable of matching longer-distance and more variable dependencies than is possible with the Idiomlist:
  • it can refer to information at the level of the word, or tag, or by user-defined categories grouping lexical, grammatical, semantic or other related features
  • it can handle a wide and variable context window, incorporating
    • repetition of the value in (a) a specified number of times, or indefinitely up to the left or right sentence boundary (or other delimiter) from any given word or tag; and
    • different levels of optionality: necessarily present, optional, and necessarily excluded.

These features can best be understood by an example. In BNC1 there were quite a number of errors disambiguating prepositions from subordinating conjunctions, in connection with words like after, before, since and so on. The following rule corrects many such cases from subordinating conjunction (CJS) to prepositions (PRP) tags. It applies a basic grammatical principle that subordinating conjunctions mark the start of clauses and generally require a finite verb somewhere later in the sentence.

#AFTER [CJS^PRP] PRP, ([!#FINITE_VB/VVN])16, #PUNC1

The two commas divide the rule into three units, each containing a word or tag or word+tag combination. Square brackets contain tag patterns, and a tag following square brackets is the replacement tag (ie the action part of the rule). #AFTER refers to a list of words like after, before and since, that have similar grammatical properties. These words are defined in a separate file; not all conjunction-preposition words are listed - as, for instance, can be used elliptically, without the requirement for a following verb. (See Tagging Guidelines under as). The definition for #FINITE_VB contains a list of possible POS-tags (rather than word values), eg VVZ/VV0/VM0. Finally #PUNC1 is a 'hard' punctuation boundary (one of . : ; ? and ! ). The patching rule can be interpreted as: 'If a sequence of the following kind occurs:
  • a word like after, before or since, which CLAWS has identified as most likely being a subordinating conjunction, and less likely a preposition;
  • an interval of up to 16 words, none of which has been tagged as a finite verb or past participle 8 (NB [! … ] negates the tag pattern.);
  • a 'hard' punctuation boundary
then change the conjunction tag to preposition.'
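The logic of this rule can be paraphrased in a short sketch. It is not Template Tagger code: the token layout, the function name and the membership of the FINITE_VB set are assumptions (the text gives only examples of the finite verb tags), and swapping the two tags after patching is a design choice made here for illustration.

```python
AFTER_WORDS = {"after", "before", "since"}           # stands in for #AFTER
FINITE_VB = {"VVZ", "VVD", "VVB", "VM0",              # hypothetical contents
             "VBZ", "VBD", "VBB", "VHZ", "VHD", "VHB",
             "VDZ", "VDD", "VDB"}
HARD_PUNCT = {".", ":", ";", "?", "!"}                # stands in for #PUNC1
MAX_GAP = 16

def patch_cjs_to_prp(tokens):
    """tokens: list of (word, first_choice_tag, second_choice_tag)."""
    out = list(tokens)
    for i, (word, tag, alt) in enumerate(tokens):
        if word.lower() not in AFTER_WORDS or (tag, alt) != ("CJS", "PRP"):
            continue
        # Up to 16 words with no finite verb or VVN, then hard punctuation.
        for j in range(i + 1, min(i + 2 + MAX_GAP, len(tokens))):
            next_word, next_tag, _ = tokens[j]
            if next_word in HARD_PUNCT:
                out[i] = (word, "PRP", "CJS")
                break
            if next_tag in FINITE_VB or next_tag == "VVN":
                break
    return out

phrase = [("since", "CJS", "PRP"), ("the", "AT0", None), ("balance", "NN1", None),
          ("sheet", "NN1", None), ("date", "NN1", None), (".", "PUN", None)]
clause = [("since", "CJS", "PRP"), ("he", "PNP", None),
          ("left", "VVD", None), (".", "PUN", None)]

print(patch_cjs_to_prp(phrase)[0])   # ('since', 'PRP', 'CJS')
print(patch_cjs_to_prp(clause)[0])   # ('since', 'CJS', 'PRP')
```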

The rule doesn't always work accurately, and doesn't cater for all preposition-conjunction errors. (i) It relies to a large extent on CLAWS having correctly identified finite verb tags in the right context of the preposition-conjunction; sometimes, however, a past participle is confused with a past tense form. (We therefore added VVN, ie past participle, as a possible alternative to #FINITE_VB in the second part of the pattern. The downside of this was that Template Tagger ignored some conjunction-preposition errors containing genuine use of VVN in the right context). (ii) The scope of the rule doesn't cover long sentences where more than 16 non-finite-verb words occur after the conjunction-preposition. A separate rule had to be written to handle such cases. (iii) Adverb uses of after, before and since etc. need to be fixed by additional rules.

Targeting and writing the Template rules

The Templates are targeted at the most error-prone categories introduced (or rather, left unresolved) by CLAWS. As with the preposition-conjunction example just shown, many disambiguation errors congregate around pairs of tags, for example adjective and adverb, or noun and verb. Sometimes a triple is involved, eg a past tense verb (VVD), past participle (VVN) and adjective (AJ0) in the case of surprised.

A small team of researchers sought out patterns in the errors by concordancing a training corpus that contained two parallel versions of the tagging: the automatic version produced by CLAWS and a hand-corrected version, which served as a benchmark. A concordance query of the form "tag A | tag B", would retrieve lines where the former version assigned an incorrect tag A and the latter a correct tag B. An example is shown below, in which A is a subordinating conjunction and B a preposition.

 the company which have occurred | since CJS [PRP] | the balance sheet date .
nd-green shirt with epaulettes . | Before CJS [PRP] | the show , the uniforms were approved by
rt towards the library catalogue | since CJS [PRP] | the advent of online systems . The overall
ales . There have been no events | since CJS [PRP] | the balance sheet date which materially af
n in demand , adding 13p to 173p | since CJS [PRP] | the end of October . Printing group Linx h
 Hugh Candidus of Peterborough . | After CJS [PRP] | the appointment of Henry of Poitou , a sel
boys would be in the Ravenna mud | until CJS [PRP] | the spring . Our landlady obviously liked
ution in treatment brought about | since CJS [PRP] | the arrival of penicillin and antibiotics
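A query of this form over a parallel-tagged corpus can be sketched as below. The token layout (word, automatic tag, hand-corrected tag), the function name and the word-based context width are assumptions for illustration, not a description of the concordancer actually used.

```python
def concordance(tokens, wrong, right, width=3):
    """Return concordance lines where the automatic tag is `wrong`
    and the hand-corrected tag is `right`.

    tokens is a list of (word, auto_tag, gold_tag).
    """
    lines = []
    for i, (word, auto, gold) in enumerate(tokens):
        if auto == wrong and gold == right:
            left = " ".join(w for w, _, _ in tokens[max(0, i - width):i])
            rgt = " ".join(w for w, _, _ in tokens[i + 1:i + 1 + width])
            lines.append(f"{left} | {word} {auto} [{gold}] | {rgt}")
    return lines

sample = [
    ("no", "AT0", "AT0"), ("events", "NN2", "NN2"),
    ("since", "CJS", "PRP"),
    ("the", "AT0", "AT0"), ("balance", "NN1", "NN1"), ("sheet", "NN1", "NN1"),
]
print(concordance(sample, "CJS", "PRP", width=2))
# ['no events | since CJS [PRP] | the balance']
```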

By working interactively with the parallel concordance, sorting on the tags of the immediate context, testing for significant collocates to the left and right, and generally applying his/her linguistic knowledge, the researcher can often detect sufficient commonality between the tagging errors to formulate a patching rule (or a set of rules) such as that shown above. It took several iterations of training and testing to refine the rules to a point where they could be applied by Template Tagger to the full corpus.9

It should be said that some categories of error were easier to write rules for than others. Finding productive rules for noun-verb correction was especially difficult, because of the many types of ambiguity between nouns, verbs and other categories, and the widely differing contexts in which they appear. The errors and ambiguity tags associated with NN1-VVB and NN2-VVZ in BNC2 in the error report testify to this problem. Here a more sophisticated lexicon, detailing the selectional restrictions of individual verbs and nouns (and other categories) would have undoubtedly been useful.

Ordering the rules

In some instances the ordering of rules was important. When two rules in the same ruleset compete, the longer match applies. Clashes arise in the case of the multiply ambiguous word as, for instance. Besides the clear grammatical choices between a preposition and a complementiser introducing an adverbial clause, there are many "interfering" idiomatic uses (as well as, as regards etc) and elliptical uses (The TGV goes as fast as the Bullet train [sc. goes]). To avoid interference between the rules, we found it preferable to let an earlier pass of the rules handle more idiomatic (or exceptional) structures, and let a later pass deal with the more regular grammatical dependencies.

In many rule sets, however, we found that ordering did not affect the overall result, as we tried to ensure each rule was 'true' in all cases. Since, however, more than one rule sometimes carried out the same tag change to a particular word, the system was not optimised for speed and efficiency.

Besides the ordering of rules within rulesets, it is worth considering the placement of Template Tagger within the tagging schema (Figure 1). Ideally, it would be sensible to exploit the full pattern-matching functionality of the Template Tagger earlier in the schema, using it in place of the CLAWS Idiomlist not just after statistical disambiguation, where it is undoubtedly necessary, but also before it. In this way Template Tagger could have precluded much unnecessary ambiguity passing to Stage C. above. The reason we did not do this was pragmatic: TT was in fact developed as a general-purpose annotation tool (see Fligelstone, Pacey and Rayson 1997), and not exclusively for the POS-tagging of BNC2. In future versions of the tagging software we hope to integrate Template Tagger more fully with CLAWS.

F. Postprocessing, including Ambiguity tagging
The post-processing phase has the task of producing output in the form in which the user is going to find it most usable.
  • The text is produced in a horizontal format, so that it can be read from left to right across the page or across the screen.
  • The tags are enclosed in angle-brackets as follows: <NN1> according to the standard TEI-based CDIF mark-up of the British National Corpus.
  • Normally the word will be output with a single tag - the one which CLAWS4 calculates to be most probable.
  • "Ambiguity tags" (such as <NN1-AJ0>) are output if the difference between the probability of the first tag and of the second fails to reach a pre-decided threshold.
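The choice between a single tag and an ambiguity tag can be sketched as follows, using the entering example given earlier. The threshold values and the function name are hypothetical; only the comparison of the probability difference against a per-pair threshold follows the description above.

```python
def output_tag(ranked, thresholds, default_threshold=0.0):
    """Format the output tag for one word.

    ranked     -- list of (tag, probability in per cent), most likely first
    thresholds -- {(first_tag, second_tag): minimum difference required
                   to output the first tag alone}
    """
    first, p1 = ranked[0]
    if len(ranked) > 1:
        second, p2 = ranked[1]
        if p1 - p2 < thresholds.get((first, second), default_threshold):
            return f"<{first}-{second}>"
    return f"<{first}>"

# Invented thresholds, for illustration only.
THRESHOLDS = {("VVG", "NN1"): 50, ("NN1", "AJ0"): 30}

print(output_tag([("VVG", 86), ("NN1", 14), ("AJ0", 0)], THRESHOLDS))  # <VVG>
print(output_tag([("NN1", 55), ("AJ0", 45)], THRESHOLDS))              # <NN1-AJ0>
```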

The final phase, "ambiguity tagging", merits a little further discussion. The requirement for such tags is clear when one observes that even using Template Tagger on top of CLAWS, there remains a residuum of error, around 2%, in the corpus. By permitting ambiguity tags we are effectively able to "hedge" in many instances that might otherwise have counted as errors - improving the chances of retrieving a particular tag, but at the cost of retrieving other tags as well. We considered that a reasonable goal would be to employ sufficient ambiguity tags to achieve an overall error rate for the corpus of 1%.

Because CLAWS's reliability in statistical disambiguation varies according to the POS-tags involved, we calculated the thresholds for application of ambiguity tags separately for each relevant tag-pair A-B (where A is CLAWS's first-choice and B its second-choice tag). First, the tag-pairs were chosen according to their error frequencies in a training corpus of 100,000 words. The proportion of A-B errors to the total number of errors indicated how many errors of that type would be allowed in order to achieve the 1% error rate overall; we will refer to this figure as "the target number of errors" for A-B. We then
  1. collected each instance of A-B error, noting the difference in probability score between A and B.
  2. plotted each error against the probability difference
  3. found the threshold on the difference axis that would yield the target number of errors. Below this threshold each instance of A-B would be converted to an ambiguity tag.
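The three steps above amount to a one-dimensional search, which can be sketched as below. The probability differences are invented, the function name is hypothetical, and treating the threshold as the smallest value that leaves no more than the target number of errors is an assumption about how step 3 was operationalised.

```python
def find_threshold(error_diffs, target_errors):
    """Smallest threshold T such that at most `target_errors` of the
    observed A-B errors have a probability difference >= T.

    Instances whose difference falls below T would be output as
    ambiguity tags and so would no longer count as errors.
    """
    if len(error_diffs) <= target_errors:
        return 0                      # no ambiguity tags needed for this pair
    for t in sorted(set(error_diffs)) + [max(error_diffs) + 1]:
        if sum(d >= t for d in error_diffs) <= target_errors:
            return t

# Invented probability differences (in percentage points) for one tag-pair.
diffs = [2, 5, 9, 14, 30, 41, 60, 72]
print(find_threshold(diffs, target_errors=3))   # 41
```

With a threshold of 41, the five errors with smaller differences would become ambiguity tags and three errors would remain. The search ignores the correctly tagged instances that also fall below the threshold, which is the cost that motivated the hand adjustment of some thresholds.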

As we report under Error rates, the BNC in fact contains a higher error rate than 1%. This is because some thresholds applied at the 1% rate incurred a very high frequency of potential ambiguity tags: we hand-adjusted such thresholds if permitting a slight rise in errors led to a substantial reduction in the number of ambiguities. Further comments on stages E. and F. can be found in Smith 1997.

Additional annotation in BNC XML
As noted above, the linguistic annotation of the corpus was enhanced in the BNC XML edition in three respects:
  • multiwords and their constituent items are explicitly tagged using the <mw> and <w> XML elements
  • an additional wordclass scheme, using a much simplified version of the C5 tagset was deployed
  • lemmatization of each word was carried out automatically on the basis of manually-defined rules.
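As a rough illustration of the first point, the sketch below reads a hand-made fragment in which a multiword is wrapped in <mw> around its constituent <w> elements. The fragment is invented, and the attribute name c5 for the wordclass code is an assumption here; the element reference should be consulted for the actual attributes.

```python
import xml.etree.ElementTree as ET

# Invented fragment; the attribute name c5 is assumed for illustration.
FRAGMENT = (
    '<s><mw c5="AV0"><w c5="PRF">of </w><w c5="NN1">course</w></mw>'
    '<w c5="PNP">it </w><w c5="VVZ">works</w></s>'
)

def units(sentence_xml):
    """Return (text, tag, constituents) for each top-level unit of a sentence."""
    out = []
    for el in ET.fromstring(sentence_xml):
        if el.tag == "mw":
            parts = [((w.text or "").strip(), w.get("c5")) for w in el.findall("w")]
            out.append((" ".join(text for text, _ in parts), el.get("c5"), parts))
        elif el.tag == "w":
            out.append(((el.text or "").strip(), el.get("c5"), []))
    return out

for unit in units(FRAGMENT):
    print(unit)
```

The multiword is reported once with its own tag, and its constituent words remain available with theirs.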

The simplified wordclass scheme used for the second of these enhancements is listed in 9.8 Simplified Wordclass Tags of the manual, where the mapping between these values and the C5 tags from which they are derived is also specified.

The lemmatization procedure adopted derives ultimately from work reported in Beale 1987, as subsequently refined by others at Lancaster, and applied in a range of projects including the JAWS program (Fligelstone et al 1996) and the book Word Frequencies in Written and Spoken English (Leech et al 2001). The basic approach is to apply a number of morphological rules, combining simple POS-sensitive suffix stripping rules with a word list of common exceptions.
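The general shape of such a procedure, POS-sensitive suffix stripping backed by an exception list, can be sketched as follows. The rules and exceptions shown are invented and far cruder than any real rule set; they are not the rules files used for the corpus.

```python
# Invented exception list: (word form, C5 tag) -> lemma.
EXCEPTIONS = {
    ("children", "NN2"): "child",
    ("went", "VVD"): "go",
    ("better", "AJC"): "good",
}

# Invented suffix rules per tag: (suffix to strip, replacement).
RULES = {
    "NN2": [("ies", "y"), ("s", "")],
    "VVZ": [("ies", "y"), ("s", "")],
    "VVG": [("ing", "")],
    "VVD": [("ed", "")],
    "VVN": [("ed", "")],
}

def lemmatize(word: str, tag: str) -> str:
    """Return a lemma for `word` given its wordclass tag."""
    form = word.lower()
    if (form, tag) in EXCEPTIONS:
        return EXCEPTIONS[(form, tag)]
    for suffix, replacement in RULES.get(tag, []):
        if form.endswith(suffix) and len(form) > len(suffix):
            return form[: -len(suffix)] + replacement
    return form

print(lemmatize("cities", "NN2"))    # city
print(lemmatize("children", "NN2"))  # child
print(lemmatize("walking", "VVG"))   # walk
print(lemmatize("thing", "NN1"))     # thing
```

The last example shows why the rules need to be POS-sensitive: -ing is stripped from a participle but not from a singular noun. Forms such as boxes or tagging would need further rules that this sketch omits.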

This process was carried out during the XML conversion, using code and a set of rules files kindly supplied by Paul Rayson.


edited by Lou Burnard. Date: January 2007
This page is copyrighted