Register and Corpus Dynamics

Second Aston Corpus Seminar

Aston University, April 11, 1997

Claire Warwick, OUCS


Introduction

This was the second in a series of small, specialist seminars which are aimed at sharing the results of work in progress with others doing research in the field. It was organised by Chris Gledhill and was attended by thirty academic researchers, some of them university teachers and others graduate students, working on various aspects of corpus linguistics. Roughly half of them came from British universities, and the others from continental Europe.

Papers Presented

Wolfgang Teubert (Mannheim) gave a paper on Corpora for Multilingual Applications in which he discussed the need, within a multilingual Europe, for a translation platform for the social and cultural register. Bilingual dictionaries can be used by humans, but this model of translation does not work for machine translation. Using several passages of translation from French into German and English, he demonstrated how the production of an idiomatic translation relies upon the translator's ability to work with phrases and multi-word units. In order to develop a machine-translation system which will use similar procedures, there is a need to use large parallel corpora, and to work with collocations found within them.

Ylva Berglund (Uppsala) discussed The compilation of a subcorpus: working with the BNC. She is interested in expressions of the future. Since these are so numerous in the entire corpus, she has decided to compile a subcorpus of the BNC. She explained the way in which she had selected her subcorpus and the problems inherent in the selection of representative material. She was particularly interested in the use of the phrase 'to be going to' as opposed to 'gonna' in spoken language. Despite the fact that people always deny using 'gonna', she has found it to be very common, especially in teenage language. She is therefore working on the way that these expressions are used by people of different ages, social groups and sexes.

Laura Gavioli (Bologna) presented a paper on Learning genre conventions through small corpora concordancing: an experiment in a language course for interpreters and translators. Groups of advanced language learners had collected their own small, specialized corpora, using, for example, articles on hepatitis or on socio-linguistics. They then used Microconcord to find common patterns of language use which they could identify as particular to that genre of language. They could then check the results obtained against larger, more general corpora.
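The kind of concordance line that learners inspect in such an exercise can be illustrated with a minimal keyword-in-context sketch in Python. This is not Microconcord's implementation; the function name `kwic`, the `width` parameter and the sample sentence are all assumptions made purely for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    Hypothetical helper: matching is case-insensitive and the
    context is simply `width` tokens on either side.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text standing in for a small specialised corpus.
sample = ("acute hepatitis is usually self-limiting "
          "but chronic hepatitis may progress").split()
hits = kwic(sample, "hepatitis", width=2)

for left, key, right in hits:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>15}  {key}  {right}")
```

Aligning the keyword in a central column is what makes recurrent patterns to its left and right easy to spot by eye.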

Christina Su-Hsun Tsai (London) discussed A comparative Analysis of Computer mediated Communication (CMC) versus Non CMC texts Along the dimension of Abstract vs non-Abstract Information. She stressed that the language of CMC texts has caused linguists to rethink the theories of what differentiates written from spoken language. Her research draws heavily on the work done by Biber on the abstract content of written and spoken language. She described in detail the preparation of her corpus of articles from English language teaching email lists such as TESL-L.

Marco Rocha (Sussex) presented A description of an annotation scheme to analyse anaphora in dialogues. He is working with the London-Lund Corpus, which, in common with other large corpora, contains POS markup, but does not contain any markup to aid discourse analysis. He demonstrated the methods he is using to develop a system of markup which will allow him to note and categorize anaphora in the corpus.

Hans Lindquist (Vaxjo) discussed The genitive versus the of-construction in different text types. He is keen to test the theory posited by certain grammarians such as Sapir (1921) that the genitive construction (as in 'the car's performance') is dying out, and being replaced by the of-construction (as in 'the performance of the car'). He has tested this assertion using Cobuild Direct. He has discovered that this rule is broadly true, but there are certain notable exceptions, particularly in the case of journalistic language, and cases in which the noun is personified, especially when it is the name of a country. This has led to difficult questions of how genres of language in which the genitive is still used may be defined.

Magnus Ljung (Stockholm) presented a paper with the shortest title of the day: Sources. In it he discussed the way in which British and American newspapers use different strategies to avoid taking direct responsibility for statements which they make. He began his investigation with the use of phrases such as 'sources close to the Minister', which help to distance the journalist from the claim being made. He also investigated usages such as 'it is _ that', where _ stands for 'claimed', 'thought', 'reported', etc. His preliminary work with a relatively small number of British and American broadsheet dailies suggests that British papers tend to use this formulation more than American ones, but that there are significant differences in style from one newspaper to another. In order to determine whether this can be explained by house style or national legal requirements, he aims to increase the number and variety of newspapers included in his corpus.
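A search for the 'it is _ that' frame can be sketched with a regular expression in Python. This is only an illustration of the pattern described, not the method used in the paper; the names `PATTERN` and `count_impersonal_reports` and the sample sentences are assumptions, and a real study would need to filter out non-reporting fillers such as 'it is true that'.

```python
import re
from collections import Counter

# 'it is <one word> that', case-insensitive; the single word in the
# slot is captured so that it can be counted.
PATTERN = re.compile(r"\bit\s+is\s+(\w+)\s+that\b", re.IGNORECASE)


def count_impersonal_reports(text):
    """Count the words filling the slot in 'it is _ that'."""
    return Counter(m.group(1).lower() for m in PATTERN.finditer(text))


# Invented sample sentences, not taken from any newspaper.
sample = ("It is claimed that the talks have stalled. "
          "Officials said it is thought that a deal is close. "
          "It is claimed that no decision was taken.")
counts = count_impersonal_reports(sample)

for word, n in counts.most_common():
    print(word, n)
```

Running the same count separately over each newspaper's text would give the per-title frequencies needed to compare house styles.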

Marc Weeber (Groningen) discussed Information on side effects in Medical corpora: Differences between side effect words. He has used abstracts of articles in the medical database Medline to construct a corpus. This has been analysed to extract data about side effects, and the type of terms used to describe them. They can then be divided into three types: drug-specific, class-of-drug-specific and general side effects. These words have then been referred to human experts to test the accuracy of the machine-generated results.

Rochdi Oueslati (Strasbourg) gave a paper on Terms and Patterns analysis in sublanguages, in which he described the development of a tool called ManTex, which is used for analyzing technical texts. The tool searches for clusters of repeated word sequences. These can then be examined by a terminologist, to determine whether they are in fact recognisable terms.
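The general idea of collecting repeated word sequences as term candidates can be sketched in Python as follows. This is not ManTex itself, whose internals are not described here; the function `repeated_sequences`, its parameters and the sample text are assumptions, and the sketch simplifies by letting sequences run across sentence boundaries.

```python
import re
from collections import Counter


def repeated_sequences(text, n=2, min_count=2):
    """Return word n-grams occurring at least `min_count` times.

    Hypothetical helper: tokens are lower-cased runs of letters and
    digits, optionally joined by internal hyphens.
    """
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(ngrams)
    return {" ".join(g): c for g, c in counts.items() if c >= min_count}


# Invented sample standing in for a technical text.
sample = ("The heat exchanger feeds the pressure valve. "
          "When the pressure valve opens, the heat exchanger cools.")
candidates = repeated_sequences(sample, n=2)

for phrase, count in sorted(candidates.items()):
    print(count, phrase)
```

The output mixes plausible terms such as 'heat exchanger' with sequences such as 'the heat', which is why the candidates still need to be reviewed by a terminologist.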

Conclusion

The seminar was a very successful day which allowed a wide variety of areas of interest to be discussed within the common theme of corpora and terminology. It was particularly useful to be able to listen to results of work in progress, and to hear the presenters' views on how these projects might develop in future.