GEOFF LEECH'S COMMENTS ON THE OUP CORPUS DESIGN DOCUMENT AND ON THE OUCS RESPONSE
3 JUNE 1991
---------------------------------------------------------------

1. Having recently read both the above documents, I was worried that the "corpus design" issue might degenerate into the pursuit of various red herrings. I am conscious of the need to make progress on corpus design as soon as possible, viz. at the meeting on 5 June 1991, and at the same time I am aware that corpus design can easily become a contentious area.

2. Bearing in mind the above, I feel it is important that we have a "working basis" on which corpus collection can proceed as soon as possible, even if all the details have not yet been finally decided.

3. The "working basis" has to take account of the competing needs of (a) JUSTIFIABILITY, in terms of some general reasoned account or rationale about what purposes the corpus should fulfil, and how corpus design should best fulfil them, and (b) PRACTICALITY - e.g. we need to make use of electronic text already collected by Longman, etc.

4. Under "JUSTIFIABILITY" we may also wish to pay attention to "FACE VALIDITY" - in so far as the BNC is potentially a high-profile project, we want to be able to give an understandable and coherent account of our activities to the media, to conferences, etc. However, I do not think we should be too much influenced by papers given by Sinclair and Renouf at a recent conference in Leeds: they have to justify their "Bank of English" alongside the potentially more prestigious BNC, and so have to argue that no corpus can be representative, etc. etc. We have to argue, on the other hand, that the tradition of corpus-compilation on systematic principles (following Brown, Survey of English Usage, LOB, etc.) is a reasonable and proper one to be following. For this purpose, I find Doug Biber's Pisa paper "Representativeness in Corpus Design" crucial reading. (Jem and Lou will have received copies of this paper last Jan.
- I think Della received a copy - Michael is bringing a copy to the June 5 meeting for anyone who needs it.)

5. Why is Biber's paper important? Because, although a lot of people have tried their hand at corpus design, this paper is unique, I think, in approaching the issue from a cogent, theoretically and statistically informed point of view. I would like to suggest that Biber's paper should be regarded as the theoretical underpinning of what we are doing - as contrasted with the "monitor corpus" idea underlying the Sinclair et al. "Bank of English". (For practical reasons, we will not be able to follow Biber's thinking in all respects - but that is another issue.) That is, we should be following the "spirit" of Biber's arguments, without in any way making his proposals a straitjacket. In fact, in practice, I think we can follow most of the recommendations of the OUP Corpus Design document, without having to worry about the implications of Biber in detail.

6. Examples of what Biber says: (I can't do more than pick out a few things)

(a) "Representativeness refers to the extent to which a sample includes the full range of variability in a population". How nice to have representativeness defined!

(b) "Definition of the target population has at least two aspects: 1) the boundaries of the population...; and 2) ...what text categories are included in the population".

(c) "A sampling frame is an operational definition of the population" - here Biber praises the Brown and LOB Corpora for having clearly defined sampling frames - viz. all 1961 publications as documented in national bibliographies.

(d) Stratified sampling is needed: viz. "sub-groups are identified within the target population, and then each of those 'strata' are sampled using random techniques".

(e) Text typologies are either FUNCTIONAL (related to social dimensions of variation, etc.) or LINGUISTIC (defined by linguistic criteria, e.g.
in the manner of Biber and Finegan's publications using factor analysis and cluster analysis). The functional criteria are a priori criteria "external" to the corpus - and are what people generally use when they identify parameters of variation. The linguistic criteria are "internal", and can lead to a text typology only by laborious analysis of existing corpora.

(f) Biber talks (p.9) about organizing the sampling of a corpus with respect to (i) text production (the conditions under which people produce texts), (ii) text reception (the conditions under which texts are read or listened to), and (iii) texts as products. The spoken English part of the BNC will be demographically sampled, and therefore of type (i). The written part of the corpus will presumably be based on text typology/parameters, and therefore of type (iii). It is easy to define the population for a sample of the written (printed) language in terms of texts as products, because that is what can be provided by the British National Bibliography, Willings Press Guide, etc. But it is impossible to arrive at a similar definition of the population for spoken (conversational) language - there is no publications list of conversations engaged in by UK citizens - hence the use of demographically defined sampling. It seems to me that there is no problem in justifying this need for different sampling practices for the two kinds of material - written (printed) and spoken (conversation). (What I didn't find in the OUP document, however, was a rationale for stratified sampling of the intermediate material which is neither printed/published nor conversational speech: viz. scripted speech, monologue public speech, etc. Some kind of stratified sampling of broadcast materials, for example, would seem to be necessary.)
(g) Regarding sample size, Biber illustrates how for most linguistic purposes short samples of c.1,000 words are adequate - this provides a counterweight to the OUCS suggestion that samples should consist of whole texts - of whatever length they might be. The minimum useful size of a sample is therefore quite low. The maximum size of a sample, in contrast, would be primarily determined by the need for the corpus to be as representative as possible of the full range of variability in the spoken and written language. I am inclined to agree with the OUP suggestion of a norm of c.40,000 words. Unlike OUCS, I do not believe that for most applications complete texts are needed. Perhaps one should distinguish a sample from a text: a sample may be (i) a collection of shortish texts, (ii) a segment of a long text, or (iii) a complete text. Seeking a norm for sample-length would be independent, on this basis, of the length of texts.

(h) On p.33, Biber presents his thesis that corpus design proceeds in a cyclical fashion, viz. 1) "pilot empirical investigation/theoretical analysis" -- 2) "corpus design" -- 3) "compile portion of corpus" -- 4) "empirical investigation" -- 2) -- 3) -- 4) -- 2) -- 3) -- 4) etc. The argument is that only when you have compiled a corpus can you do the research which will show how far that corpus is representative of the population. Hence the need to revise the contents of a corpus progressively, at each stage evaluating its design parameters and modifying as necessary.

6. Although Biber's cyclic corpus design thesis cannot be faulted on theoretical grounds, for practical reasons (viz. the DTI/SERC didn't give us as much money as we would like) we cannot follow it. But we can nevertheless argue that the BNC project is the first stage of such a cyclic progression. After the corpus is compiled, an evaluation of its design, using statistical techniques, will indicate in what areas it is underrepresentative or overrepresentative.
In an ideal world, this would lead to a "2nd generation BNC", etc. And perhaps that will indeed happen. As a next best thing, however, it would be possible, after undertaking the above empirical evaluation of the BNC, to carve out of the 100 million words of the BNC a sizeable subcorpus in which selection of material for inclusion or exclusion would be determined by the goal of equally satisfactory representation of the different parameters of the corpus. This slimmed-down BNC would be suitable for particular purposes such as derived statistics for probabilistic NLP.

7. My conclusion, then, is that by following the *spirit* of Biber, we can arm ourselves against counterarguments, justify what we are doing in terms of a "theory of corpus design", and at the same time give reasons why (for practical economic reasons) we cannot follow the theory to its full conclusions.

8. Speaking more cynically, I may suggest that Biber allows one to muddle along in corpus design, using ad hoc commonsense categories much in the way people have in previous corpus compilations, but at the same time to argue that this "muddling along" has a theoretical rationale behind it!

9. Against all this background, it will probably be clear why I do not intend to get terribly worked up about the specifics of corpus design (e.g. what percentage of the corpus will be of this, that, or the other type of text?). In the main I do not disagree with the categories and percentages proposed in the OUP draft document. I think we should use whatever instruments of commonsense rationality we can find to arrive at a reasonable carve-up of the 100 m. words. This includes stratified random sampling, and where appropriate using percentages proportionate to percentages in the whole population if determinable (e.g. by reference to national bibliographies). We should also take what measures we can to make sure the 100 m.
words cover as wide a range of linguistic variability as possible, given the assumed 30-year time-span of the corpus.

10. By the way, in reference to this 30-year corpus time-span: I do not think the corpus can reasonably be regarded as diachronic, for reasons mentioned in the OUCS commentary. We would not be able to ensure, in a 100 m. corpus, the stratified sampling of the corpus for (say) 5-year slices between 1960 and 1990. So the corpus would indeed have to be seen as a synchronic corpus having a 30-year-wide span: a "fuzzy snapshot", as OUCS describe it. Given this 30-year span, there could be unfortunate chronological bias in the corpus unless we safeguard against it. E.g. it would be unfortunate - to take an artificially extreme case - if all the novels were selected from the 1960s and all the newspaper material from the 1990s. One way to safeguard against this kind of bias would be to aim for equal and proportionate distribution of data across the thirty years - in so far as this was manageable. Another way to safeguard would be to blatantly avoid equal distribution across the thirty years - e.g. we could try to compose as much of the corpus as possible from the years 1990-1. This would strengthen the claim to "synchronicity", and would incidentally tie in with the Longman spoken material, which would presumably also be from a short time period. However, it would then be much less easy to justify the spreading of the rest of the material across 30 years - the argument of expediency would perhaps have to bear too much weight here. I do not know which of these two policies to adopt - but I think we should not leave the spread of corpus data across the 30 years to pure chance/expediency.

11. I have pencilled quite a lot of comments on the OUP and OUCS documents, but I will leave Michael to verbalise these as necessary at the corpus design meeting on 5 June.
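
For anyone who wants to see the arithmetic behind paragraphs 9 and 10, the following is a minimal sketch (in Python) of stratified random sampling with quotas proportionate to the size of each stratum in the population. It is an illustration only, not a proposal for the BNC procedure. The strata (text type by decade), the title counts and the function names `proportional_allocation` and `stratified_sample` are all invented for the example; a real sampling frame would come from a source such as a national bibliography. Crossing text type with decade is one possible way of guarding against the chronological bias discussed in paragraph 10.

```python
import random

def proportional_allocation(population_counts, total_samples):
    """Split total_samples across strata in proportion to population size.

    Uses largest-remainder rounding so the quotas sum exactly to total_samples.
    """
    total = sum(population_counts.values())
    exact = {s: total_samples * n / total for s, n in population_counts.items()}
    quotas = {s: int(x) for s, x in exact.items()}  # round down first
    leftover = total_samples - sum(quotas.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda s: exact[s] - quotas[s], reverse=True)
    for s in by_remainder[:leftover]:
        quotas[s] += 1
    return quotas

def stratified_sample(frame, quotas, seed=0):
    """Draw a simple random sample (without replacement) within each stratum."""
    rng = random.Random(seed)
    return {s: rng.sample(titles, min(quotas[s], len(titles)))
            for s, titles in frame.items()}

# Invented sampling frame: (text type, decade) -> list of title identifiers.
frame = {
    ("fiction", "1960s"): [f"fic60-{i}" for i in range(300)],
    ("fiction", "1980s"): [f"fic80-{i}" for i in range(500)],
    ("newspaper", "1960s"): [f"news60-{i}" for i in range(400)],
    ("newspaper", "1980s"): [f"news80-{i}" for i in range(800)],
}
counts = {s: len(titles) for s, titles in frame.items()}
quotas = proportional_allocation(counts, total_samples=100)
sample = stratified_sample(frame, quotas)
for stratum in frame:
    print(stratum, quotas[stratum], len(sample[stratum]))
```

Where the population percentages are not determinable, the same `stratified_sample` step could be run with quotas fixed by hand instead of by `proportional_allocation`.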