2 General limitations

2.1 Speed and reliability

We have been connecting from Forlì to the experimental server at OUCS in Oxford, over a congested network where connections are neither fast nor robust. As a general policy it has proved highly inadvisable to attempt to download more than 50 solutions to any query, which at the best of times takes 2 or 3 minutes. Speed would be improved were we to have the BNC installed on a local server, but we currently have neither the disk space nor the necessary UNIX competence to do this. However, insofar as this situation seems likely to reflect that of many potential ELT users --- both teaching establishments and individual learners --- we feel that our experience may be relatively typical in these respects.

Speed is also drastically reduced where queries include very common words or mark-up features. If you want to find cases of `to be or not to be', you should go and have lunch after sending the query. Even though this query has under 20 solutions, to find them SARA first has to look through all the occurrences of its component words, each of which occurs many hundreds of thousands of times. This makes SARA an inappropriate tool for studying many grammatical features --- uses of `if' clauses, say, or personal pronoun use in different text-types.
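The cost described above can be sketched in a few lines. This is not SARA's actual implementation, merely an illustration of why a phrase made of very common words is expensive: a positional word index must walk the occurrence list of every component word, however frequent, before it can confirm even a handful of matches.

```python
# Toy sketch (not SARA's code) of phrase search over a positional word
# index: each word maps to the list of corpus positions where it occurs,
# and a phrase query intersects those lists position by position.

def phrase_positions(index, phrase):
    """Return start positions where the words of `phrase` occur in sequence."""
    words = phrase.split()
    # Start from every occurrence of the first word ...
    candidates = index.get(words[0], [])
    # ... and keep only those where each later word follows in turn.
    # Note that the FULL occurrence list of every word is touched.
    for offset, word in enumerate(words[1:], start=1):
        following = set(index.get(word, []))
        candidates = [p for p in candidates if p + offset in following]
    return candidates

# Build a positional index over a toy corpus.
corpus = "to be or not to be that is the question".split()
index = {}
for pos, word in enumerate(corpus):
    index.setdefault(word, []).append(pos)

print(phrase_positions(index, "to be or not to be"))  # [0]
```

With hundreds of thousands of positions per word, the lists being intersected here are exactly what makes the `to be or not to be' query a matter of minutes rather than seconds, despite its handful of solutions.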

2.2 Pattern matching

The usefulness of SARA as a tool for grammatical investigations is also limited by the index structure. The word index is alphabetical, and lookup involves regular expression pattern-matching. This means that it is possible to find all the words beginning with a specified string, but not all the words which end in one: you cannot list only `-ing' forms, for example. Nor can this be done by reference to the part-of-speech tagging. As it is only possible to specify part-of-speech for a specific word, you can look for occurrences of `start' as a verb, but not occurrences where `start' is followed by a verb. Taken together with the speed problem for common words, this means that grammatical information can only generally be obtained in relation to specific lexically-defined environments. While you can find out the frequencies of `start screaming' and `start to scream', there is no satisfactory way to investigate complementation patterns of `start' in general.
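The asymmetry between prefix and suffix searches follows directly from the alphabetical ordering, as a small sketch (with an invented miniature word list, not the real index) makes clear: all words sharing a prefix sit in one contiguous run that binary search can jump to, whereas words sharing a suffix are scattered throughout, so nothing short of a full scan will find them.

```python
# Why an alphabetical word index supports prefix but not suffix lookup.
import bisect

word_index = sorted(["start", "started", "starting", "starts",
                     "scream", "screaming", "sing", "singing"])

def by_prefix(prefix):
    """Binary-search the sorted index for the contiguous run of words
    beginning with `prefix` -- fast, no full scan needed."""
    lo = bisect.bisect_left(word_index, prefix)
    hi = bisect.bisect_left(word_index, prefix + "\uffff")
    return word_index[lo:hi]

def by_suffix(suffix):
    """The sort order gives no help here: every entry must be examined."""
    return [w for w in word_index if w.endswith(suffix)]

print(by_prefix("start"))  # ['start', 'started', 'starting', 'starts']
print(by_suffix("ing"))    # ['screaming', 'sing', 'singing', 'starting']
```

On an index the size of the BNC's, the difference between the two strategies is the difference between an instant answer and an exhaustive trawl.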

2.3 Errors and inconsistencies

The BNC contains its fair share of misprints, variant spellings, grammatical errors, performance errors in speech, transcription errors, and errors and inconsistencies of encoding --- which means a very large number indeed in absolute terms. Any of these can lead users --- particularly language learners --- to misinterpret results, particularly numerical ones, as a consequence of not realising their effects on precision and recall. The solutions found may not all be of the phenomenon in question; other, variant forms of the phenomenon in question may exist in the corpus.

The corpus encoding poses particularly unpleasant traps for the unwary. Not all features have been encoded in all the texts, so that looking for quotations or keyword categories, to cite two instances, provides only partial recall.

The most delicate areas relate to the part-of-speech tagging: words in the corpus are defined by the presence of a POS tag rather than by orthographic spaces, and CLAWS4's notion of a word is frequently both unintuitive and erratic. Many common or foreign phrases, such as `in spite of' and `hoi polloi', are treated as single words; conversely, contracted forms such as `won't', `gonna', and `I'd've' are treated as more than one word. Thus the word index lists `as' (517595 occurrences), `as well' (11739), and `as well as' (16461) separately, not to mention the fascinating words `wo' (16267) and `ve'. While lists of these multi-word and compound words are provided in the manual (Burnard 1995), severe problems arise where their tagging is inconsistent. As well as `wo' and `n't', the word index also includes `won't', which turns out to have a frequency of 3 --- all cases where CLAWS has treated it as a single finite form of a main verb. Or again, the word index lists `annus mirabilis' and `annus horribilis' as single words, but each of these expressions also occurs tagged as two separate words.
SARA does its best to cope with these inconsistencies by allowing for different word parsings when looking for solutions to queries, but users need to know not to rely on the index in such cases, and not to panic when a query for `annus horribilis' finds more solutions than the number listed in the word index.
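The divergence between what the index lists and what actually appears on the page can be made concrete with the `as' figures quoted above. The calculation below uses only those quoted per-token frequencies, treating them as the complete set of tokens containing `as' for the sake of illustration: because the string also occurs inside the multi-word tokens `as well' and `as well as' (the latter containing it twice), its orthographic total comfortably exceeds the count the index entry for `as' shows.

```python
# Illustration of why index counts and orthographic counts diverge when
# some strings are indexed as multi-word units: per-token frequencies
# for `as', as quoted in the text above.

index_counts = {
    "as": 517595,
    "as well": 11739,
    "as well as": 16461,   # contains the orthographic word "as" twice
}

# Total orthographic occurrences of "as" across all three token types.
orthographic_as = sum(freq * token.split().count("as")
                      for token, freq in index_counts.items())

print(orthographic_as)  # 562256 -- well above the 517595 listed for `as'
```

A learner comparing the index entry for `as' with a count of the word on the page would find the two figures disagreeing by over forty thousand, for reasons that have nothing to do with the language itself.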

The part-of-speech information attached to each word is, from this point of view, a similarly limited blessing. When so-called "portmanteau tags", which state two alternative categories for the word in question, are counted as correct, CLAWS vaunts an overall accuracy of 97%. However, the figure plummets for ambiguous word-forms, particularly for the less common of the possible parts-of-speech associated with them. Thus while tagging of `can' as a modal verb is highly reliable, its tagging as a main verb (as in `can tomatoes') is very much less so --- every one of a random selection of 50 purported main verb occurrences turned out on inspection to be modals.
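The collapse in reliability for the rarer tag is a simple base-rate effect, which a back-of-the-envelope calculation makes vivid. The figures below are illustrative assumptions, not measured CLAWS statistics: suppose 99.5% of occurrences of `can' are modal, and the tagger is 97% accurate on each class.

```python
# Base-rate sketch (assumed figures, not measured CLAWS data): even a 97%
# accurate tagger yields mostly-wrong answers for the rarer tag of a
# heavily skewed ambiguous form like `can'.

total      = 100_000            # assumed occurrences of `can'
true_modal = int(total * 0.995)
true_main  = total - true_modal
acc        = 0.97               # assumed per-class tagging accuracy

tagged_main_correct = true_main * acc         # real main verbs, tagged right
tagged_main_wrong   = true_modal * (1 - acc)  # modals mis-tagged as main verb

precision = tagged_main_correct / (tagged_main_correct + tagged_main_wrong)
print(f"{precision:.0%}")  # roughly 14%: most `main verb' tags are errors
```

Under these assumptions, fewer than one in six words tagged as main-verb `can' would actually be one --- which makes the all-modal sample of 50 reported above rather less surprising.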

2.4 Limits for the learner

However annoying at times, none of these limitations seems wholly negative for the language learner. Few learners would seem likely to want --- or indeed be able --- to inspect and analyse more than 50 solutions in any detail at a time. As far as common words and grammatical features are concerned, 50 solutions would in any case generally be inadequate, given the tendency for common features to have a wide range of uses. And the potential for inaccuracies and inconsistencies may encourage learners to think carefully when formulating queries, and to read solutions closely before leaping to conclusions --- in particular to treat SARA's frequency counts as indicative heuristics rather than as prefabricated truths about the language.

