Corpus Research: Sharing Interpretations

University of Birmingham, Friday September 20

Claire Warwick, Oxford University Computing Services


Introduction

A seminar on Corpus Research: Sharing Interpretations was held at the University of Birmingham on Friday September 20. It was intended to be the first in a series of such seminars, designed to enable those doing research into various aspects of corpus linguistics to find out about and discuss work in progress. Each speaker talked about their work for about ten minutes, and this was followed by a discussion period of a further twenty minutes. This format reflected the main purpose of the seminar, which was discussion of current research rather than formal presentation of finished work.

Papers Presented

Susan Jones (Centre for Interactive Systems Research at City University) discussed the application of word co-occurrence data in probabilistic document retrieval. She described her work on the OKAPI information retrieval system. Her research is concerned with the use of collocations to improve the performance of information retrieval systems: if words commonly occur near each other, this should make the task of finding expressions within texts easier.
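
As a rough illustration of what word co-occurrence data looks like, the following Python sketch counts pairs of words that appear within a few tokens of each other. It is not the method used in OKAPI or in the research described above; the function name, the window size and the sample sentence are all assumptions made for the example.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=4):
    """Count unordered word pairs occurring within `window` tokens of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only at the tokens that follow, so each pair is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


# Hypothetical sample text, invented for this illustration.
text = "information retrieval systems rank documents and retrieval systems index documents"
pairs = cooccurrence_counts(text.lower().split(), window=2)
print(pairs.most_common(3))
```

Pairs with high counts are candidate collocations; a real system would normally also weigh these counts against the individual word frequencies.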

Tim Johns (University of Birmingham) described how he has used corpora in EFL teaching. He has used the 1994 Guardian and Observer (0.5 million words), New Scientist (2.25 million words) and Nature (0.5 million words). He uses Microconcord to search these corpora during individual sessions with advanced language learners, to teach them about what might otherwise be subtle and difficult features of language use. A particularly popular session has been one about internet neologisms like 'surf', 'hack', 'spam', and 'flame'.

Gill Francis and Susan Hunston of COBUILD discussed their work with the Bank of English on discovering the patterns of nouns. They have identified distinct patterns which they will use to illustrate the functions of nouns in their forthcoming book in the Collins Cobuild Grammar series.

Rosamund Moon (COBUILD) talked about her research on the use of proverbs and idioms, using the Bank of English and some spoken material which was later included in the BNC. She discussed the problems she had faced in doing this, and especially her discovery that idioms seemed to occur at a much lower frequency than might be expected. Psycholinguistic studies indicate that native speakers think that they use idioms relatively frequently in conversation. However, Dr. Moon's corpus analysis showed that this was not the case, and that idioms were in fact used most frequently in written journalism.

After lunch Mike Hoey (Liverpool) spoke about his research into the Firthian concept of colligation (the relationship of words to grammatical classes). He illustrated the concept with what he described as the 'drinking problem hypothesis': 'a drinking problem' tends to connote alcoholism, whereas 'a problem drinking' merely indicates a mechanical difficulty. Thus meaning is affected by the colligations of the participle 'drinking'. It was unclear which corpus he had used, but the research he described had been into the colligates of the word 'reason'. He was also interested in finding out whether colligations change over time, a question that would have to be investigated using diachronic corpora.

In order to further his research he expressed a need for software which would be able to search for groups of words and their relationship to each other. He also wanted to be able to search for paragraph breaks. Tim Johns and I assured him that both Microconcord and SARA were already able to do these things.
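
The kind of search described here, for a group of words together with the context around them, can be sketched in a few lines of Python. This is only an illustration and makes no claim about how Microconcord or SARA work internally; the function name kwic, the width parameter and the sample sentence are all invented for the example.

```python
def kwic(tokens, phrase, width=3):
    """Return (left context, match, right context) for each occurrence of `phrase`."""
    n = len(phrase)
    hits = []
    for i in range(len(tokens) - n + 1):
        # Compare a slice of the text with the word sequence being searched for.
        if tokens[i : i + n] == phrase:
            left = " ".join(tokens[max(0, i - width) : i])
            right = " ".join(tokens[i + n : i + n + width])
            hits.append((left, " ".join(phrase), right))
    return hits


# Hypothetical sample sentence, invented for this illustration.
sample = "he has a drinking problem but she has a problem drinking hot tea".split()
drinking_problem = kwic(sample, ["drinking", "problem"])
problem_drinking = kwic(sample, ["problem", "drinking"])
print(drinking_problem)
print(problem_drinking)
```

Searching for the two word orders separately keeps 'a drinking problem' and 'a problem drinking' apart, so that the contexts of each can be compared side by side.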

Simon Botley (UCREL, Lancaster) discussed a corpus-based study of anaphora. He used corpora from the Associated Press, the American Printing House for the Blind (prose fiction) and Canadian Hansard to investigate different types of anaphora. He then created his own tag set to distinguish these, and to determine the relative frequencies of their occurrences.

Chris Gledhill (Aston) talked about his research into the language of cancer research articles, using a corpus of 0.5 million words. He has been looking especially at the collocations and colligations of grammatical words like 'but', which is most often used in abstracts, and 'who', which is used in the main text of articles. He has discovered that this reveals important features of the discourse and culture of scientific research. For example, subjects are never passively experimented upon or given drugs, but actively participate in trials or, as in one instance, mice voluntarily 'took part in' an experiment.

Heok-Seung Kwon (Birmingham) discussed his work on negative prefixation. He has used a huge range of corpora, both synchronic and diachronic, including the BNC and English dictionaries dating from the sixteenth century. He has been investigating the differing usage of negative prefixes over time. He suggested that the use of some forms like un- and in-, which reflect English and Latinate usages respectively, may differ according to the growth of awareness of English nationalism. This led to an interesting discussion of whether it might be possible to investigate whether different communities, for example the opposing sides in the English Civil War, had favoured one usage over the other.

Conclusion

This was an excellent opportunity for some of those involved in all aspects of corpus research to meet and discuss their work. Those present, both speakers and observers, did indeed openly share interpretations and make constructive comments on the work that had been presented. The plan to hold further seminars was warmly welcomed by all those present.