BNC2 POS-tagging Manual

Automatic POS-Tagging of the Corpus

[ Related documents: Introduction to the Manual | Guidelines to Wordclass Tagging | Error rates | Acknowledgments ]


Introduction

This document describes the overall process of POS-tagging texts for version 2 of the British National Corpus. Figure 1 below shows the main stages involved: stages A-D are handled by the CLAWS4 tagger; stage E, the Template Tagger, is a corrective phase applied to the CLAWS output; and the main part of stage F, Ambiguity Tagging, converts some of the less reliable tags into ambiguity tags combining more than one part of speech.

A. Tokenization

B. Initial tag assignment

C. Tag selection (disambiguation)

D. Idiomtagging

E. Template Tagger

F. Postprocessing: including Ambiguity tagging

Figure 1. Wordclass Tagging schema for BNC2.


CLAWS POS-tagging (Stages A-D)


The BNC2 was automatically tagged using CLAWS4, an automatic tagger which developed out of the CLAWS1 tagger (written by Roger Garside and Ian Marshall; see Marshall 1983) used to tag the LOB Corpus. The current version, CLAWS4, is principally the work of Roger Garside, although many other researchers at Lancaster have contributed to its performance in one way or another. Further information about CLAWS4 can be found in Leech, Garside and Bryant 1994 and in Garside and Smith 1997.

CLAWS4 is a hybrid tagger, employing a mixture of probabilistic and non-probabilistic techniques. It assigns a tag (or sometimes two tags) to a word as a result of four main processes:

  1. Tokenization

    The first major step in automatic tagging is to divide up the text or corpus to be tagged into individual (1) word tokens and (2) orthographic sentences. These are the segments usually demarcated by (1) spaces and (2) sentence boundaries (i.e. sentence final punctuation followed by a capital letter). This procedure is not so straightforward as it might seem, particularly because of the ambiguity of full stops (which can be abbreviation marks as well as sentence-demarcators) and of capital letters (which can signal a naming expression, as well as the beginning of a sentence). Faults in tokenization occasionally occur, but rarely cause tagging errors.

    In tokenization, an orthographic word boundary (normally a space, with or without accompanying punctuation) is the default test for identifying the beginning and end of word-tokens. (See, however, the next paragraph and D below.) Hyphens are counted as word-internal, so that a hyphenated word such as key-ring is given just one tag (NN1). Because of the different ways of writing compound words, the same compound may occur in three forms: as a single word written ‘solid’ (markup), as a hyphenated word (mark-up) or as a sequence of two words (mark up). In the first two cases, CLAWS4 will give the compound a single tag, whereas in the third case, it will receive two tags: one for mark and the other for up.

    A set of special cases dealt with by tokenization is the set of enclitic verb and negative contractions such as 's, 're, 'll and n't, which are orthographically attached to the preceding word. These are given a tag of their own, so that (for example) the orthographic forms It's, they're and can't each receive two tags in sequence: pronoun + verb, pronoun + verb and verb + negative respectively. There are also some 'merged' forms such as won't and dunno, which are decomposed into more than one word for tagging purposes. For example, dunno ends up with the three tags for do + n't + know. [ View the list of contracted forms ]
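    To make this behaviour concrete, the sketch below (a toy in Python, not the CLAWS4 tokenizer) splits a sentence into word tokens along the lines described above: hyphenated words are kept whole, enclitic contractions are split off, and a few 'merged' forms are decomposed. The contraction list and punctuation handling here are deliberately small, assumed subsets.

import re

# Illustrative tokenizer sketch only; not the CLAWS4 implementation.
ENCLITICS = re.compile(r"^(.+?)('s|'re|'ll|'ve|'d|'m|n't)$", re.IGNORECASE)
MERGED = {                      # 'merged' forms decomposed for tagging purposes
    "won't": ["will", "n't"],
    "dunno": ["do", "n't", "know"],
}

def tokenize(sentence):
    tokens = []
    for orth in sentence.split():          # orthographic words: space-delimited
        orth = orth.strip('.,;:?!"()')     # crude punctuation stripping
        if not orth:
            continue
        if orth.lower() in MERGED:
            tokens.extend(MERGED[orth.lower()])      # e.g. dunno -> do + n't + know
            continue
        m = ENCLITICS.match(orth)
        if m:
            tokens.extend([m.group(1), m.group(2)])  # e.g. It's -> It + 's
        else:
            tokens.append(orth)            # hyphenated words (key-ring) stay whole
    return tokens

print(tokenize("It's a key-ring, but they can't see it."))
# ['It', "'s", 'a', 'key-ring', 'but', 'they', 'ca', "n't", 'see', 'it']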

  2. Initial assignment of tags

    The second stage of CLAWS POS-tagging is to assign to each word token one or more tags. Many word tokens are unambiguous, and so will be assigned just one tag: e.g. various AJ0 (adjective). Other word tokens are ambiguous, taking from two to seven potential tags. For example, the token paint can be tagged NN1, VVB, VVI, i.e. as a noun or as a verb; the token broadcast can be tagged as VVB, VVI, VVD, VVN (verb which is either present tense, infinitive, past tense, or past participle). In addition, it can be a noun (NN1) or an adjective (AJ0), as in a broadcast concert.

    To find the list of potential tags associated with a word, CLAWS first looks up the word in a lexicon of c.50,000 word entries. This lexicon look-up accounts for a large proportion of the word tokens in a text. However, many rarer words or names will not be found in the lexicon, and are tagged by other test procedures. Some of the other procedures are:

    • Look for the ending of a word: e.g. words in -ness will normally be nouns.
    • Look for an initial capital letter (especially when the word is not sentence-initial). Rare names which are not in the lexicon and do not match other procedures will normally be recognized as proper nouns on the basis of the initial capital.
    • Look for a final -(e)s. This is stripped off, to see if the word otherwise matches a noun or verb; if it does, the word in -s is tagged as a plural noun or a singular present-tense verb.
    • Numbers and formulae (e.g. 271, *K9, ß+) are tagged by special rules.
    • If all else fails, a word is tagged ambiguously as a noun, an adjective or a lexical verb.

    When a word is associated with more than one tag, the lexicon look-up or other procedures also supply information on the relative probability of each tag. For example, the word for can be a preposition or a conjunction, but is much more likely to be a preposition. This information is provided by the lexicon, either in numerical form or, where the available numerical data are insufficient, by a simple distinction between 'unmarked', 'rare' and 'very rare' tags.

    Some adjustment of probability is made according to the position of the word in the sentence. If a word begins with a capital, the likelihood of various tags depends partly on whether the word occurs at the beginning of a sentence. For instance, the word Brown at the beginning of a sentence is less likely to be a proper noun than an adjective or a common noun (normally written brown). Hence the likelihood of a proper noun tag being assigned is reduced at the beginning of a sentence.
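    The sketch below illustrates, in simplified form, the kind of decision procedure described in this stage: a lexicon look-up followed by fallback tests on endings, -(e)s stripping, capitalisation and digits, with each candidate tag carrying a rough likelihood mark. The lexicon entries, the marks and the ordering of the tests are invented for the example and do not reproduce the actual CLAWS4 lexicon or rules.

# Hypothetical mini-lexicon: word -> {candidate tag: likelihood mark}
LEXICON = {
    "paint":   {"NN1": "unmarked", "VVB": "unmarked", "VVI": "unmarked"},
    "for":     {"PRP": "unmarked", "CJS": "rare"},
    "various": {"AJ0": "unmarked"},
}

def candidate_tags(word, sentence_initial=False):
    """Return candidate tags with rough likelihood marks (illustrative only)."""
    if word.lower() in LEXICON:
        return dict(LEXICON[word.lower()])                # straight lexicon look-up
    if word.endswith("ness"):
        return {"NN1": "unmarked"}                        # ending test
    if word.endswith(("es", "s")):                        # strip -(e)s and retry
        stem = word[:-2] if word.endswith("es") else word[:-1]
        if stem.lower() in LEXICON:
            return {"NN2": "unmarked", "VVZ": "unmarked"} # plural noun or -s verb
    if word[:1].isupper() and not sentence_initial:
        return {"NP0": "unmarked"}                        # initial capital: proper noun
    if word[:1].isdigit():
        return {"CRD": "unmarked"}                        # numbers by special rule
    # last resort: ambiguous between noun, adjective and lexical verb
    return {"NN1": "unmarked", "AJ0": "unmarked", "VVB": "unmarked"}

print(candidate_tags("paints"))                        # stripped -s matches 'paint'
print(candidate_tags("Brown", sentence_initial=True))  # proper-noun rule withheld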

  3. Tag selection (or disambiguation)

    The next stage, logically, is to choose the most probable tag from any ambiguous set of tags associated with a word token by tag assignment (but see D below). This is another probabilistic procedure, this time making use of the context in which a word occurs. A method known as Viterbi alignment uses the probabilistic estimates available, both in terms of the tag-word associations and the sequential tag-tag likelihoods, to calculate the most likely path through the sequence of tag ambiguities. (The model employed is largely equivalent to a hidden Markov model.) After tag selection, a single 'winning tag' is selected for each word token in a text. (The less likely tags are not obliterated: they follow the winning tag in descending probability order.) However, the winning tag is not necessarily the right answer. If the CLAWS tagging stopped at this point, only c.95-96% of the word-tokens would be correctly tagged. This is the main reason for including an additional stage (or rather a set of stages) termed 'idiomtagging'.
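    As an illustration of the kind of probabilistic calculation involved (though not of CLAWS4's actual model or estimates; the tagset, transition and emission probabilities below are all invented), the toy Viterbi decoder below picks the most likely tag path through a three-word sentence:

def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][t] = probability of the best tag path ending in tag t at word i
    best = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(((p, best[i - 1][p] * trans_p[p][t]) for p in tags),
                              key=lambda x: x[1])
            best[i][t] = score * emit_p[t].get(words[i], 1e-6)
            back[i][t] = prev
    # trace the winning path back from the most probable final tag
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.insert(0, back[i][path[0]])
    return path

tags = ["PNP", "NN1", "NN2", "VVB"]
start_p = {"PNP": 0.4, "NN1": 0.3, "NN2": 0.2, "VVB": 0.1}
trans_p = {"PNP": {"PNP": 0.05, "NN1": 0.20, "NN2": 0.15, "VVB": 0.60},
           "NN1": {"PNP": 0.20, "NN1": 0.30, "NN2": 0.20, "VVB": 0.30},
           "NN2": {"PNP": 0.20, "NN1": 0.20, "NN2": 0.20, "VVB": 0.40},
           "VVB": {"PNP": 0.20, "NN1": 0.30, "NN2": 0.40, "VVB": 0.10}}
emit_p = {"PNP": {"they": 0.9}, "NN1": {"paint": 0.5},
          "NN2": {"houses": 0.5}, "VVB": {"paint": 0.4}}

print(viterbi(["they", "paint", "houses"], tags, start_p, trans_p, emit_p))
# ['PNP', 'VVB', 'NN2']: the pronoun context resolves 'paint' as a verb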

  4. Idiomtagging

    Idiomtagging is a stage of CLAWS4's operation in which sequences of words and tags are matched against a 'template'. Depending on the match, the tags may be disambiguated or corrected. In practice, there are two main reasons for idiomtagging:

    Idiomtagging is a matching procedure which operates on lists of rules which might loosely be termed 'idioms'. Among these are:

    The idiomtagging component of CLAWS is quite powerful in matching 'template' expressions in which there are wild-card symbols, Boolean operators and gaps of up to n words. They are much more variable than 'idioms' in the ordinary sense, and resemble finite-state networks.

    Another important point about idiomtagging is that it is split up into two main phases which operate at different points in the tagging system. One part of the idiomtagging takes place at the end of Stage C., in effect retrospectively correcting some of the errors which would otherwise occur in CLAWS output. Another part, however, actually takes place between Stages B. and C. This means it can utilise ambiguous input and also produce ambiguous output, perhaps adjusting the likelihood of one tag relative to another. As an example, consider the case of so long as, which can be a single grammatical item - a conditional conjunction meaning 'provided that'. The difficulty is that so long as can also be a sequence of three separate grammatical items: degree adverb + adjective/adverb + conjunction/preposition. In this case, the tagging ambiguity belongs to a whole word sequence rather than a single word, and the output of the idiomtagging has to be passed on to the probabilistic tag selection stage. Hence, although we have called idiomtagging 'Stage D.', it is actually split between two stages, one preceding C. and one following C.

    When the text emerges from Stages C. and D., each word has an associated set of one or more tags, and each tag is itself associated with a probability represented as a percentage. An example is:

    entering VVG 86% NN1 14% AJ0 0%

    Clearly VVG (-ing participle of the verb enter) is judged by CLAWS4 to be the most likely tag in this case.
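    For readers who want to process such output programmatically, the short sketch below reads a line of this kind into a structured, probability-ordered form. The whitespace-separated "word TAG pct% ..." layout is assumed from the example above; real CLAWS output formats may differ.

def parse_tagged(line):
    """Parse a 'word TAG pct% TAG pct% ...' line (layout assumed from the example)."""
    fields = line.split()
    word, rest = fields[0], fields[1:]
    # pair up tag and percentage fields, e.g. ('VVG', 86.0)
    candidates = [(rest[i], float(rest[i + 1].rstrip('%')))
                  for i in range(0, len(rest), 2)]
    return word, sorted(candidates, key=lambda c: -c[1])

print(parse_tagged("entering VVG 86% NN1 14% AJ0 0%"))
# ('entering', [('VVG', 86.0), ('NN1', 14.0), ('AJ0', 0.0)])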


E. After CLAWS: the Template Tagger

The error rate with CLAWS4 averages around 3%.1 For the BNC Tagging Enhancement project, we decided to concentrate our efforts on the rule-based part of the system, where most of the inroads into error reduction had been made. This involved (a) developing software with more powerful pattern-matching capabilities than the CLAWS Idiomlist, and (b) carrying out a more systematic analysis of errors, to identify appropriate error-correcting rules.

The new program, known as Template Tagger, supplements rather than supplants CLAWS. It takes a CLAWS output file as its input, and "patches"2 any erroneous tags it finds by using hand-written template rules. Figure 1 above shows where Template Tagger fits in the overall tagging scheme. Effectively, it is an elaborate 'search and replace' tool, capable of matching longer-distance and more variable dependencies than is possible with the Idiomlist:

  1. it can refer to information at the level of the word or the tag, or to user-defined categories grouping lexical, grammatical, semantic or other related features

  2. it can handle a wide and variable context window, incorporating gaps or repetitions of unspecified words (as in the example rule below, which allows an interval of up to 16 intervening words)

These features are best understood through an example. In BNC1 there were quite a number of errors in disambiguating prepositions from subordinating conjunctions, in connection with words like after, before, since and so on. The following rule corrects many such cases, changing the subordinating conjunction tag (CJS) to the preposition tag (PRP). It applies a basic grammatical principle: subordinating conjunctions mark the start of clauses and therefore generally require a finite verb somewhere later in the sentence.

#AFTER [CJS^PRP] PRP, ([!#FINITE_VB/VVN])16, #PUNC1

The two commas divide the rule into three units, each containing a word, a tag or a word+tag combination. Square brackets contain tag patterns, and a tag following square brackets is the replacement tag (i.e. the action part of the rule). #AFTER refers to a list of words like after, before and since that have similar grammatical properties. These words are defined in a separate file; not all conjunction-preposition words are listed, since as, for instance, can be used elliptically, without the requirement for a following verb (see the Tagging Guidelines under as). The definition for #FINITE_VB contains a list of possible POS-tags (rather than word values), e.g. VVZ/VV0/VM0. Finally, #PUNC1 is a 'hard' punctuation boundary (one of . : ; ? and !). The patching rule can be interpreted as:

'If a sequence of the following kind occurs:

  1. a word like after, before or since, which CLAWS has identified as most likely being a subordinating conjunction, and less likely a preposition
  2. an interval of up to 16 words, none of which has been tagged as a finite verb or past participle 3 (NB [! … ] negates the tag pattern.)
  3. a 'hard' punctuation boundary

then change the conjunction tag to preposition.'
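The sketch below re-implements the spirit of this rule over a simple list of (word, tag) pairs. It is an illustrative approximation, not Template Tagger's own formalism: the #AFTER, #FINITE_VB and punctuation lists are small assumed subsets, and the [CJS^PRP] condition on CLAWS's second-choice tag is simplified to a check on the first-choice tag alone.

AFTER_WORDS = {"after", "before", "since", "until"}              # subset of #AFTER
FINITE_VB = {"VVZ", "VVD", "VVB", "VV0", "VBZ", "VBD", "VM0"}    # subset of #FINITE_VB
HARD_PUNC = {".", ":", ";", "?", "!"}                            # #PUNC1

def patch_cjs_to_prp(tagged, window=16):
    """Change CJS to PRP on an #AFTER word if no finite verb (or VVN)
    occurs within `window` words before a hard punctuation boundary."""
    out = list(tagged)
    for i, (word, tag) in enumerate(out):
        if word.lower() not in AFTER_WORDS or tag != "CJS":
            continue
        for w, t in out[i + 1:i + 1 + window]:
            if t in FINITE_VB or t == "VVN":
                break                          # a clause follows: keep CJS
            if w in HARD_PUNC:
                out[i] = (word, "PRP")         # no finite verb before the boundary
                break
    return out

sent = [("Nothing", "PNI"), ("has", "VHZ"), ("happened", "VVN"),
        ("since", "CJS"), ("the", "AT0"), ("balance", "NN1"),
        ("sheet", "NN1"), ("date", "NN1"), (".", "PUN")]
print(patch_cjs_to_prp(sent))
# 'since' is re-tagged PRP: no finite verb between it and the full stop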

The rule doesn't always work accurately, and doesn't cater for all preposition-conjunction errors. (i) It relies to a large extent on CLAWS having correctly identified finite verb tags in the right-hand context of the preposition-conjunction; sometimes, however, a past participle is confused with a past tense form. (We therefore added VVN, i.e. past participle, as a possible alternative to #FINITE_VB in the second part of the pattern. The downside of this was that Template Tagger ignored some conjunction-preposition errors containing a genuine use of VVN in the right-hand context.) (ii) The scope of the rule doesn't cover long sentences where more than 16 non-finite-verb words occur after the conjunction-preposition; a separate rule had to be written to handle such cases. (iii) Adverbial uses of after, before, since etc. need to be fixed by additional rules.

Targeting and writing the Template rules

The Templates are targeted at the most error-prone categories introduced (or rather, left unresolved) by CLAWS. As with the preposition-conjunction example just shown, many disambiguation errors congregate around pairs of tags, for example adjective and adverb, or noun and verb. Sometimes a triple is involved, e.g. past tense verb (VVD), past participle (VVN) and adjective (AJ0) in the case of surprised.

A small team of researchers sought out patterns in the errors by concordancing a training corpus that contained two parallel versions of the tagging: the automatic version produced by CLAWS and a hand-corrected version, which served as a benchmark. A concordance query of the form "tag A | tag B" would retrieve lines where the former version had assigned an incorrect tag A and the latter the correct tag B. An example is shown in Figure 2 below, in which A is a subordinating conjunction and B a preposition.

 the company which have occurred | since CJS [PRP] | the balance sheet date . **11;7898;ptr **1
nd-green shirt with epaulettes . | Before CJS [PRP] | the show , the uniforms were approved by 
rt towards the library catalogue | since CJS [PRP] | the advent of online systems . The overall
ales . There have been no events | since CJS [PRP] | the balance sheet date which materially af
n in demand , adding 13p to 173p | since CJS [PRP] | the end of October . Printing group Linx h
 Hugh Candidus of Peterborough . | After CJS [PRP] | the appointment of Henry of Poitou , a sel
boys would be in the Ravenna mud | until CJS [PRP] | the spring . Our landlady obviously liked 
ution in treatment brought about | since CJS [PRP] | the arrival of penicillin and antibiotics 

Figure 2. Parallel concordance showing conjunction-preposition tagging errors

By working interactively with the parallel concordance, sorting on the tags of the immediate context, testing for significant collocates to the left and right, and generally applying their linguistic knowledge, the researcher could often detect sufficient commonality among the tagging errors to formulate a patching rule (or a set of rules) such as the one shown above. It took several iterations of training and testing to refine the rules to a point where they could be applied by Template Tagger to the full corpus.4
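A minimal sketch of how such a "tag A | tag B" query might be answered over two parallel tagged versions of a text is given below. The data structures and the concordance layout are assumptions made for illustration; this is not the concordancer actually used in the project.

def find_errors(claws, benchmark, tag_a, tag_b, context=5):
    """Yield concordance-style snippets where CLAWS chose tag_a but the
    hand-corrected benchmark has tag_b for the same token."""
    for i, ((word, auto_tag), (_, gold_tag)) in enumerate(zip(claws, benchmark)):
        if auto_tag == tag_a and gold_tag == tag_b:
            left = " ".join(w for w, _ in claws[max(0, i - context):i])
            right = " ".join(w for w, _ in claws[i + 1:i + 1 + context])
            yield f"{left} | {word} {tag_a} [{tag_b}] | {right}"

claws = [("no", "AT0"), ("events", "NN2"), ("since", "CJS"),
         ("the", "AT0"), ("balance", "NN1"), ("sheet", "NN1"), ("date", "NN1")]
gold  = [("no", "AT0"), ("events", "NN2"), ("since", "PRP"),
         ("the", "AT0"), ("balance", "NN1"), ("sheet", "NN1"), ("date", "NN1")]

for line in find_errors(claws, gold, "CJS", "PRP"):
    print(line)     # no events | since CJS [PRP] | the balance sheet date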

It should be said that some categories of error were easier to write rules for than others. Finding productive rules for noun-verb correction was especially difficult, because of the many types of ambiguity between nouns, verbs and other categories, and the widely differing contexts in which they appear. The errors and ambiguity tags associated with NN1-VVB and NN2-VVZ in the BNC2 error report testify to this problem. Here a more sophisticated lexicon, detailing the selectional restrictions of individual verbs and nouns (and other categories), would undoubtedly have been useful.

Ordering the rules

In some instances the ordering of rules was important. When two rules in the same ruleset compete, the longer match applies. Clashes arise, for instance, in the case of the multiply ambiguous word as. Besides the clear grammatical choice between a preposition and a complementiser introducing an adverbial clause, there are many "interfering" idiomatic uses (as well as, as regards etc.) and elliptical uses (The TGV goes as fast as the Bullet train [sc. goes]). To avoid interference between the rules, we found it preferable to let an earlier pass of the rules handle the more idiomatic (or exceptional) structures, and a later pass deal with the more regular grammatical dependencies.

In many rulesets, however, we found that ordering did not affect the overall result, as we tried to ensure each rule was 'true' in all cases. Since more than one rule sometimes carried out the same tag change on a particular word, the system was not optimised for speed and efficiency.
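The sketch below illustrates the two ordering principles just described, under assumed data structures: rules are grouped into passes (idiomatic or exceptional rules in an earlier pass, general rules in a later one), and when two rules in the same pass match at the same position the longer match is the one applied. The rule patterns and replacement tags are invented for the example.

def apply_pass(tagged, rules):
    """Apply one pass of (word-pattern, replacement-tags) rules, longest match first."""
    out, i = list(tagged), 0
    while i < len(out):
        matches = [(pattern, new_tags) for pattern, new_tags in rules
                   if [w.lower() for w, _ in out[i:i + len(pattern)]] == pattern]
        if matches:
            pattern, new_tags = max(matches, key=lambda m: len(m[0]))  # longest wins
            for j, tag in enumerate(new_tags):
                out[i + j] = (out[i + j][0], tag)
            i += len(pattern)
        else:
            i += 1
    return out

# Pass 1: idiomatic uses of 'as'; a later pass would handle the regular cases.
idiom_rules = [(["as", "well", "as"], ["CJC", "CJC", "CJC"]),
               (["as", "regards"], ["PRP", "PRP"])]
sent = [("cheap", "AJ0"), ("as", "CJS"), ("well", "AV0"),
        ("as", "CJS"), ("cheerful", "AJ0")]
print(apply_pass(sent, idiom_rules))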

Besides the ordering of rules within rulesets, it is worth considering the placement of Template Tagger within the tagging schema (Figure 1). Ideally, the full pattern-matching functionality of Template Tagger would be exploited earlier in the schema, using it in place of the CLAWS Idiomlist not just after statistical disambiguation, where it is undoubtedly necessary, but also before it. In this way Template Tagger could have prevented much unnecessary ambiguity from passing to Stage C. above. The reason we did not do this was pragmatic: Template Tagger was in fact developed as a general-purpose annotation tool (see Fligelstone, Pacey and Rayson 1997), and not exclusively for the POS-tagging of BNC2. In future versions of the tagging software we hope to integrate Template Tagger more fully with CLAWS.


F. Postprocessing, including Ambiguity tagging

The post-processing phase has the task of producing output in the form in which the user is going to find it most usable.

The final phase, "ambiguity tagging", merits a little further discussion. The requirement for such tags is clear when one observes that, even using Template Tagger on top of CLAWS, there remains a residuum of error in the corpus of around 2%. By permitting ambiguity tags we are effectively able to "hedge" in many instances that might otherwise have counted as errors: we improve the chances of retrieving a particular tag, but at the cost of retrieving other tags as well. We considered that a reasonable goal would be to employ sufficient ambiguity tags to achieve an overall error rate for the corpus of 1%.

Because CLAWS's reliability in statistical disambiguation varies according to the POS-tags involved, we calculated the thresholds for application of ambiguity tags separately for each relevant tag-pair A-B (where A is CLAWS's first-choice and B its second-choice tag). First, the tag-pairs were chosen according to their error frequencies in a training corpus of 100,000 words. The proportion of A-B errors to the total number of errors indicated how many errors of that type would be allowed in order to achieve the 1% error rate overall; we will refer to this figure as "the target number of errors" for A-B. We then

  1. collected each instance of an A-B error, noting the difference in probability score between A and B;

  2. plotted each error against that probability difference;

  3. found the threshold on the difference axis that would yield the target number of errors. Below this threshold, each instance of A-B would be converted to an ambiguity tag.
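Under these assumptions, step 3 can be sketched as follows. The inputs are those described in steps 1 and 2 (the probability differences observed at each A-B error in the training corpus, and the target number of errors for that tag-pair); the function itself is an illustrative reconstruction rather than the procedure actually used.

def ambiguity_threshold(error_prob_diffs, target_errors):
    """Return a threshold such that converting every A-B decision whose
    probability difference is at or below it into an ambiguity tag leaves
    at most `target_errors` residual errors."""
    diffs = sorted(error_prob_diffs, reverse=True)
    if target_errors >= len(diffs):
        return 0.0                 # already within target; nothing to convert
    # keep the `target_errors` most confidently wrong cases as residual errors;
    # every instance with a smaller difference is hedged with an ambiguity tag
    return diffs[target_errors]

diffs = [92, 80, 75, 60, 55, 40, 30, 22, 10, 5]   # e.g. 10 observed A-B errors
t = ambiguity_threshold(diffs, target_errors=3)
print(t)                                 # 60
print(sum(1 for d in diffs if d > t))    # 3 errors remain above the threshold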

As we report under Error rates, the BNC2 in fact contains a higher error rate than 1%. This is because some thresholds applied at the 1% rate incurred a very high frequency of potential ambiguity tags: we hand-adjusted such thresholds if permitting a slight rise in errors led to a substantial reduction in the number of ambiguities.

The form and frequency of ambiguity tags are explained in the documents Guidelines to Wordclass Tagging and Error rates respectively. Further comments on stages E. and F. can be found in Smith 1997.


Notes

1. That is, the error rate based on CLAWS's first choice tag only.

2. We borrow the term "patching" from Brill (1992), although for his tagging program the patches are discovered by an automatic procedure.

3. The repetition value of up to 16 words was arrived at by trial and error; an occurrence of a finite verb beyond that range was rarely in the same clause as the #AFTER-type word.

4. Training and testing were mostly carried out on the BNC Sampler corpus of 2 million words. For less frequent phenomena we needed to use sections from the full BNC. None of the texts used for the tagging error report are contained in the Sampler.

References

Brill, E. (1992) 'A simple rule-based part-of-speech tagger'. Proceedings of the 3rd conference on Applied Natural Language Processing. Italy: Trento.

Fligelstone S., Pacey M., and Rayson P. (1997) 'How to Generalize the Task of Annotation'. In Garside et al. (1997)

Garside R., Leech G. and McEnery A. (eds.) (1997) Corpus Annotation. London: Longman.

Garside R., and Smith N. (1997) 'A hybrid grammatical tagger: CLAWS4'. In Garside et al. (1997)

Leech, G., Garside, R., and Bryant, M. (1994). CLAWS4: The tagging of the British National Corpus. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94). Japan: Kyoto. (pp.622-628.)

Marshall, I. (1983). 'Choice of Grammatical Word-class without Global Syntactic Analysis: Tagging Words in the LOB Corpus'. Computers and the Humanities 17, 139-50.

Smith, N. (1997) 'Improving a Tagger'. In Garside et al. (1997)



Date: 17 March 2000