3 Preparing the learner

All the limitations just described obviously have to be explained to would-be users, leading to the question of the preparation that learners need to use the BNC independently. The experimental group in Forli already had experience of small corpora, having regularly used MicroConcord (Scott and Johns 1993) with its 1-million-word collections of newspaper and academic texts. They were consequently used to carrying out searches using word patterns, alternatives, and collocates; to sorting and thinning solutions; and to identifying collocational and colligational patterns, semantic prosodies, and associations with particular text-types and topics in concordance displays (Aston 1996). The experimental group took part in a 10-session weekly seminar in which the following points were focussed on:

3.1 General information

3.1.1 About the BNC

To formulate appropriate queries and interpret solutions from a corpus of this size presupposes considerable knowledge of the composition of the corpus itself. Obviously, no user can be familiar with all 4,100 texts; but users can be helped by knowing what a text is, how the texts were selected, and what they look like.

3.1.1.1 What is a text?

Texts in the BNC may be written, written-to-be-spoken, or spoken. In the latter case they take the form of transcripts which cannot be assumed to be fully reliable, and which omit or idealise many aspects of the original. It is obviously inappropriate, for example, to investigate punctuation or spelling on the basis of the entire corpus. But texts in the BNC do not only differ in their modes of production. Just as a word in the corpus may be less or more than an orthographic word, a text may be less or more than what we would generally call a text in everyday usage. Many texts in the BNC are incomplete samples from their source. Others consist of a series of smaller texts, such as articles from the same periodical, or conversations recorded by the same respondent. This makes it unwise, say, to look for regularities in the first or last sentences of texts, or to assume that recurrence within the same text should be interpreted as lexical cohesion. And because of the sampling procedure employed, in some cases the same article or short text can be found repeated in two different texts in the corpus.

3.1.1.2 What kinds of texts are there in the corpus?

A major attraction of the BNC is that one can compare different text-types. To do so, however, users need to be aware of the categories of texts, and the numbers and proportions in each category. Thus as well as the basic written/spoken mode distinction, they need to know the various domains and mediums represented among written texts (one imaginative and six informative domains; books, periodicals, unpublished, and written-to-be-spoken mediums); the division of spoken texts into a "context-governed" component of public speech (dialogue and monologue) and a "demographic" one of private conversation, with the various domains represented in the former (educational, institutional, business, leisure), and the demographic factors taken into account in the latter (age/sex/social class/region). They need to realise that even in a corpus this size, few combinations of categories will be adequately represented (the spoken demographic component contains only a handful of AB class female speakers from the North aged 60+), and that descriptive categories other than those used to design the corpus will not generally be reliably sampled.

3.1.2 Encoding

Learners found the basic principles of SGML, with its use of start- and end-tags, and of element attributes and values, relatively easy to grasp. Fortunately, only a few element- and attribute-types had to be explained, the majority being rarely used or of little interest to learners, such as those indicating typographical features of the source text.

3.1.2.1 Text structure

Most other frequent structural elements, such as <note> in written texts, and <pause> and <gap> in spoken ones, are fairly transparent. We also used the client's custom configuration option to create customised formats providing conventionalised displays of these elements.

3.1.2.2 Text description

All the main information about the nature of the text is given in a single element of the text header, <catRef>, whose attributes specify such features as mode, medium, domain, author-type and audience-type. One of the less obvious aspects of the encoding was the distinction between author-type in written texts, whose values are accessed as <catRef> attributes (i.e. in the text header), and speaker-type in spoken texts, normally accessed via the values of "who" attributes on <u> (utterance) elements (i.e. in the text itself).
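The distinction can be illustrated in miniature. The following Python sketch works on an invented BNC-style fragment (real headers are far richer, and SARA users would never parse texts by hand); it simply shows where the two kinds of information live:

    import re

    # An invented BNC-style fragment: descriptive features in the header's
    # <catRef> attributes, speaker identities in "who" attributes on <u>.
    fragment = '''
    <bncDoc>
    <teiHeader><catRef mode="spoken" medium="conversation"></teiHeader>
    <stext>
    <u who="PS001">kill two birds with one stone</u>
    <u who="PS002">as it were</u>
    </stext>
    </bncDoc>
    '''

    # Header-level description: attributes of the single <catRef> element.
    catref = re.search(r'<catRef\s+([^>]*)>', fragment)
    print(dict(re.findall(r'(\w+)="([^"]*)"', catref.group(1))))
    # {'mode': 'spoken', 'medium': 'conversation'}

    # Text-level description: one "who" value per utterance.
    print(re.findall(r'<u who="([^"]*)"', fragment))
    # ['PS001', 'PS002']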

3.1.2.3 Part-of-speech annotation (<w> elements)

The only logical way to explain the part-of-speech annotation seemed to be to explain how the tagging had been carried out (Garside 1996). Learners were given lists of the CLAWS4 POS categories, and shown how these differed from the classifications of the dictionaries and grammars they were familiar with.

3.1.3 SARA

Here it was necessary to explain the client-server operation (not least so that periods of waiting and error messages would be interpretable), the organisation of the word index, and the on-line help system. The various query types (Word, Phrase, Part-of-speech, Pattern, and SGML) were demonstrated, and above all the QueryBuilder, which allows the user to combine different word, phrase, pattern, and/or SGML queries in and/or relationships, restricting the entire query to the scope of a particular SGML element or a maximum number of words. Learners were also shown the principal options for displaying, sorting, thinning, saving and printing solutions; how to find out collocate frequencies; and how to find out bibliographic details and view source texts.

3.2 Formulating queries

SARA calls for much more forethought in formulating queries than these learners, used to MicroConcord and smaller, more homogeneous corpora, were expecting. Lengthy lookups and large numbers of solutions needed to be avoided as far as possible. Precision also mattered more: the proportionally greater number of solutions makes it more tedious to weed out spurious ones by inspection, and unpredicted spurious solutions are more likely. The more detailed markup of the corpus and the variety of query-types available, however, mean that queries can be more finely tuned to match this need. A fairly typical case comes from one area where the BNC proved particularly useful. The variability of idiomatic expressions is a matter on which dictionaries offer the learner very little help. It is not, however, straightforward to design queries which will recall possible variants in tense, voice, number, and indeed lexis without a dramatic loss in precision.

3.2.1 Planning

We took the saying `kill two birds with one stone'. Learners first discussed possible lexical substitutions, variations relating to number, tense, aspect and voice, pre- and post-modification, and changes in word class (such as nominalisation), and their effects on the form of the phrase. Given the difficulty of including common words in a query, such as `with', `one', and `two', the best option seemed to be to look for co-occurrences of two of the lemmas `kill', `bird', and `stone', thereby allowing for lexical variation of the third. Having just encountered the punning `Beware the Vibes of Marx' in another concordance, they realised that even with three queries, each containing two of the lemmas, perfect recall could not be guaranteed. They then had to decide how to allow for the possibility of (a) different forms of the selected lemmas, either by specifying a list of these forms (e.g. `bird'|`birds') or a pattern (kill.*); (b) different orders of their occurrence (as in `one stone killed two birds'); (c) different distances between them, given the possibility of modification (`two birds with only one stone'). They then needed to minimise unrelated co-occurrences of these lemmas by selecting an appropriate SGML element or span as the scope within which they had to co-occur (e.g. within the same sentence, or within n words), and to avoid unwanted homographs (e.g. the verb `stone'). The discussion led to formulation of the following queries:

 ({kill.*}#(bird|birds))/5
 ({kill.*}#("stone"=NN1|"stones"=NN2))/10
 ((bird|birds)#("stone"=NN1|"stones"=NN2))/10
where NN1/NN2 represent singular and plural common nouns respectively, | either/or, # in either order, and / the scope (number of words).
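For readers who find code clearer than notation, the first of these queries can be approximated in Python. This is our sketch of the logic, not SARA's implementation (SARA resolves queries against its server-side word index rather than by scanning text):

    import re

    def cooccur(sentence, pat_a, pat_b, span):
        """True if a word matching pat_a and one matching pat_b
        occur within `span' words of each other, in either order."""
        words = re.findall(r"[\w-]+", sentence.lower())
        hits_a = [i for i, w in enumerate(words) if re.fullmatch(pat_a, w)]
        hits_b = [i for i, w in enumerate(words) if re.fullmatch(pat_b, w)]
        # "#" in the query means either order; "/5" limits the distance.
        return any(abs(a - b) <= span for a in hits_a for b in hits_b)

    print(cooccur("One stone killed two birds.", r"kill\w*", r"birds?", 5))  # True
    print(cooccur("No relevant words here.", r"kill\w*", r"birds?", 5))      # False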

3.2.2 Revision

Regardless of the amount of forethought, initial formulations of queries are frequently inappropriate. SARA, however, offers a number of heuristics for assessing queries without actually downloading all their solutions. By exploiting these, learners can not only make queries more effective; they may also acquire a considerable amount of information about the language.

a) The word index can be used to find out which forms will match a pattern (thus the proposed pattern kill.* was rejected as having too many spurious matches --- `killarney', `kill-joy', etc. --- and replaced by killi?n?g?e?d?r?s?). Many of these spurious matches involved unknown words, and learners carried out additional queries on some of them out of curiosity.
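The effect of the revision can be checked mechanically. A Python sketch, using a hypothetical slice of the word index:

    import re

    # A hypothetical slice of the word index around `kill'.
    index_sample = ["kill", "kills", "killed", "killing", "killer", "killers",
                    "killie", "killies", "killarney", "kill-joy"]

    for pattern in [r"kill.*", r"killi?n?g?e?d?r?s?"]:
        matches = [w for w in index_sample if re.fullmatch(pattern, w)]
        print(pattern, "->", matches)

    # kill.* matches everything above, including the spurious forms;
    # killi?n?g?e?d?r?s? keeps only kill/kills/killed/killing/killer(s)/killie(s).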

b) The frequency information provided by the word index can be used to assess the probable speed of the query. All the queries in this example seemed relatively fast, insofar as each of the lemmas `kill', `bird', and `stone' had total frequencies for their various forms of under 20,000. These indications also provide a means of estimating the incidence of particular forms on recall and precision: for instance, it mattered little that the revised pattern for `kill' also matched `killie' and `killies', since these forms only occur three and five times respectively in the entire corpus. For the learner, the frequency information also provides a rough indication of the importance of a form in the language: as a rule-of-thumb, we suggested that unknown word-forms with positive z-scores (i.e. occurring more than 172 times in the BNC) might be particularly worth investigating to expand their vocabulary.
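The z-score rule-of-thumb amounts to comparing a form's frequency with the mean and standard deviation of frequencies across the word index. A sketch with placeholder figures (the real threshold of 172 reflects the BNC index's own distribution):

    from statistics import mean, stdev

    def zscore(freq, index_freqs):
        return (freq - mean(index_freqs)) / stdev(index_freqs)

    index_freqs = [3, 5, 12, 40, 90, 700]   # hypothetical index frequencies
    print(zscore(172, index_freqs) > 0)     # True: above the average form
    print(zscore(3, index_freqs) > 0)       # False: a rarity like `killie'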

c) Where the number of solutions exceeds a specified number, SARA first reports the number of hits found on the server in a "Too many solutions" dialogue box. Where this is excessively large, the user can either abort the query and edit it to increase precision, or, if uncertain how to do so, download a random set of solutions from which to identify patterns which might be excluded. Once a few solutions have been downloaded, it is also possible to find out how many of the hits found on the server include particular collocates. For instance, in the initial version of the `kill stone' query, which had 72 solutions, the frequency of `birds' as a collocate of `stone' within a span of 5 words was 35, suggesting a reasonable trade-off between recall and precision in this case. The "Too many solutions" display can also be used as a measure of the effect of revising a query --- to see, for instance, the extent to which adding an alternative or altering its scope changes the number of solutions. Thus increasing the span of the `kill stone' query from 10 to 15 words only changed the number of hits from 72 to 81 --- suggesting that little was to be gained by increasing the scope in this case; whereas reducing it to 5 reduced the number of hits to 23, 14 of which had `dead' as an adjacent collocate of `stone'.
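The collocate check described here can again be sketched in a few lines of Python, with invented one-line contexts standing in for downloaded solutions:

    import re

    solutions = [
        "trying to kill two birds with one stone as usual",
        "the stone killed several birds at once",
        "kill or be killed near the old stone bridge",
    ]

    def has_collocate(line, node, collocate, span):
        words = re.findall(r"[\w']+", line.lower())
        nodes = [i for i, w in enumerate(words) if w == node]
        colls = [i for i, w in enumerate(words) if w == collocate]
        return any(abs(n - c) <= span for n in nodes for c in colls)

    # How many solutions have `birds' within 5 words of `stone'?
    count = sum(has_collocate(s, "stone", "birds", 5) for s in solutions)
    print(count, "of", len(solutions))   # 2 of 3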

3.2.3 Learning by formulating and revising queries

The need to plan and revise queries seems one of the main ways in which using the BNC may aid language learning. Learners can learn, that is, not only from examining solutions which match their queries, but also from the process of designing queries which offer adequate recall and precision. Getting a query reasonably "right" is a matter of learning from one's mistakes, be these mistaken hypotheses about the language, the corpus, or the software. What has been striking about our experience with SARA has been learners' willingness to tolerate the quirks of the corpus and software and concentrate on the language: the process of query design stimulated awareness of a wide range of language patternings, and led to the incidental discovery of a wide assortment of curiosities.

3.3 Interpreting solutions

In their final form, none of the three queries relating to the `kill bird stone' combinations produced more than 60 solutions. Once spurious and duplicated solutions had been thinned, there were some 40 occurrences of forms of the saying. Of these, only half used the form (kill.* two birds with one stone). The others omitted `kill' or replaced it with `catch'; involved seven birds, several birds, many birds, a variety of birds, troublesome birds, or proverbial ones; used one big stone, the same stone, the one stone --- even a seminar stone. All were however in the `kill --- bird --- stone' order: there were no passives, and no instrumental subjects. Learners also noted a recurrent use of the hedge `as it were' (`kill, as it were, two birds with one stone'; `kill two birds with one stone, as it were').

3.3.1 Reading

The most important point to emerge in dealing with solutions was the need to read them carefully, and preferably not just as one-line KWIC contexts. Careful reading was necessary to decide which hits were valid prior to thinning, and to identify repeated formal patterns which might suggest sorting criteria; it was also frequently essential for understanding discourse function and register. The only top-down information as to the source of a particular solution in a SARA display or printout is an arbitrary text-identifier code and a sentence number: bibliographic data is only available on-line, and even this will not always clarify the domain of a particular text. For this reason, learners were given a printout of Kilgarriff's Bibliographic database to the BNC (ftp://itri.bton.ac.uk/pub/bnc), which gives the mode, domain and medium values for each text-identifier code. Reading through the hits also brought curiosities to light, with ideas for other queries.

3.3.2 Interpreting numbers

Rather than proposing that learners should apply formal statistical tests, we stressed the role of reading, common sense, and follow-up queries in assessing frequency, underlining the need to identify clearly recurrent patterns and distributions to justify conclusions. There were three main reasons for this, one statistical and two pedagogical. First, statistical tests will rarely be applicable to limited numbers of solutions. Second, we saw learners as involved in acquiring partial --- and only partially accurate --- knowledge of patterns in the language, rather than in rivalling professional descriptive linguists. Third, we were concerned that learners might treat statistical findings as linguistic truths: hence the stress on careful reading and follow-up queries in evaluating numerical results.

Association with particular text-types was a clear case of the need to treat numbers with care: the fact that all but two of the occurrences of `kill two birds with one stone' in the BNC come from written texts does not permit the inference that it is more common in writing than in speech, since the overall composition of the corpus (90% written) means that we would not expect to find more than three or four spoken occurrences in any case. In contrast, one case where it seemed reasonable to infer a primarily written use involved the unknown word `erstwhile'. The evidence lay not only in the fact that only one solution came from speech, but also in the fact that, in that one example, its use was greeted with laughter from its hearers. While a statistical approach would have led to this solution being ignored, a reading approach led to its being analysed in detail, providing not only confirmation of the regularity, but a memorable instance of the interactional effects of inappropriate register.
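The expectation invoked here is simple proportional reasoning, made explicit below (the figures are those given above):

    # If the saying were equally frequent in speech and writing, its ~40
    # occurrences would split in proportion to the corpus itself:
    total_occurrences = 40
    spoken_share = 0.10          # the BNC is roughly 90% written
    print(total_occurrences * spoken_share)   # 4.0: two spoken hits is
                                              # no evidence of rarity in speech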

A further issue was the dispersion of occurrences across texts: learners needed to pay attention not only to the number of texts from which hits were taken, but also to possible clusters of particular patterns within particular source texts. A preliminary study of the words `price' and `prices' in texts from the commerce and finance domain showed that `dirty' and `clean' were quite frequent collocates. However, a subsequent, more specific query revealed that all these instances came from a single text.
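Such a dispersion check amounts to grouping hits by their text-identifier code. A sketch, with invented identifiers and contexts:

    from collections import Counter

    hits = [("J9C", "the dirty price of the bond"),
            ("J9C", "quoted as a clean price"),
            ("J9C", "dirty prices include accrued interest"),
            ("K2F", "the market price rose")]

    by_text = Counter(text_id for text_id, context in hits)
    print(by_text)   # Counter({'J9C': 3, 'K2F': 1}): the `dirty/clean'
                     # pattern comes almost entirely from one text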

3.4 Methodological implications

However banal it may sound, the clearest principle to emerge overall was that interpretations can be no better than the solutions, which in turn can be no better than the query which retrieved them. The pitfalls in interpretation seem less insidious than those in query design: inadequate recall, generally due to the failure to include valid alternatives, and inadequate precision, due to the failure to realise that a query will match a host of unwanted occurrences. The main way we tried to increase learners' awareness of these difficulties, and to help them develop appropriate query strategies, was by getting them to work on the same problem in several groups, and subsequently holding a joint discussion of their queries, as well as of their treatment and interpretations of the solutions obtained.

