TGA14: comments from OUCS

1. The representativeness issue

We are not convinced by the argument in 2.3 concerning the statistical validity of the sampling procedure proposed. We see no reason to believe that `demographic representativeness' (whatever that is) is transferable from the recruits to the participants: on the contrary, our intuition is that most people habitually converse with people just like themselves, so that counting all participants (as opposed to recruits) as part of the sampling vector actually decreases the variety of speech sampled and thus its `representativeness'. We share the Advisory Council's concern that the sampling procedure used should not be regarded as any more significant than it actually is -- a random sampling with no claim to anything but maximal variety.

2. Regional issues

We would like to have more information about the regional distribution within both the `demographic' and the `context-governed' samples. Are the same sampling areas used in each? How are they defined? Does 'Great Britain' include 'Northern Ireland', to take the obvious example?

We note that so far all transcription has been done on samples from one region, and moreover a region presumably fairly easy for Longmans' transcribers to understand. The throughput on samples from other regions is likely to be much worse, unless local transcribers are found (in which case there will be an additional cost in recruiting and training them).

We do not understand why trades union activities should not be sampled regionally in the same way as other political speeches, business meetings or other local organisations.

3. Categories in the `context-governed' samples

We are a bit puzzled by some of the categories listed on pp 9-11, some of which seem to overlap considerably (e.g. parliamentary proceedings come under two distinct headings; `talks to WI, clubs etc' are hard to distinguish from 'club meetings'). We think there should be a clear distinction between 'setting' (i.e.
where the language is produced) and 'type' (i.e. what kind of language it is), and strongly favour basing the sampling explicitly on the former. We propose a more balanced consideration of situational parameters: the current list seems to have been determined entirely by what kinds of institutions are likely to allow material to be gathered from them. A (very preliminary) alternative approach is sketched out in section 9 below.

4. Definition of conversation

We are concerned at the notion of a `conversation' as an independent unit. It's not clear whether the sequence in which `conversations' occur will be recoverable from the corpus. This seems essential when, as implied in 3.4 stage 2, pieces of language have been chopped up during post-editing: when this happens, we think some indication of the length of omitted chunks of silence should be included. Likewise, we would like to see some indication of whether the recording is coterminous with the `conversation', e.g. whether the recorder was switched on after the conversation began, whether the tape ran out while it was in full swing etc.

5. Participant information

We think it's important to provide more information about the participants in a conversation. We appreciate the difficulty of obtaining full details analogous to those provided by the recruits about themselves, but it should at the very least be possible to know that the person labelled <2> in conversation 234 is the same as the person labelled <3> in conversation 233. We also think it should be possible to know when participants enter and leave a conversation.

6. Information provided by recruits

We still think that some sort of free-form description of the conversation provided by the recruit would be very useful. This could be elicited during the post-recording interview mentioned on p.18, using the log itself as a prompt.
It seems to us at least as useful to know what the recruits think is important for understanding the conversations they have recorded as to know what their reading habits or lifestyle variables might be. We would suggest recording and transcribing this interview as well.

We don't think recruits are likely to characterise dialects very helpfully (especially if they are judged incapable of deciding race). First language is enough. In general, the questions seem of more use for market research purposes than for linguistic research.

7. Telephone conversation

Is this excluded from the `demographic' corpus on purely pragmatic grounds? It seems a serious omission if `representativeness' is a serious concern. Even if only half of the conversation is transcribable, that's still better than nothing. Why not issue recruits with a telephone answering machine (of the kind that can record both sides of a conversation) as well as a tape recorder? Telephone talk should certainly be included in the `context-governed' corpus.

8. Transcription and encoding

There are several points of considerable controversy which need to be discussed at more length. These include:

- notions of 'turn' and `sentence' need better definition
- use of punctuation to represent intonational units (the current proposal falls straight between two stools)
- use of controlled vocabulary for nonstandard orthography (we think this is a real minefield which will need a lot more discussion)
- treatment of truncation, false starts and repetition (the hyphen on its own is not enough)
- treatment of back-channels (are they regarded as turns or not?)
- paralinguistic features: (need to distinguish events that happen during an utterance from those that happen between utterances; need to show duration for some of these -- a `laugh' is not the same as 'laughing'; need to distinguish events produced by participants within their own utterance from events happening externally --e.g.
-- or produced by other participants: these may be overlapped separately.)
- voice quality, performance etc (these need to be thought about more carefully and we should at least provide a set of suggestions for specific things worth noting e.g. pitch, volume, tempo, rather than relying on the sort of 'stage direction' approach implied here)

We think these and related points should be taken up in more detail either in a meeting of TGC or less formally between OUCS and Longmans as soon as possible.

9. Situational parameters for spoken texts

The following list of situational parameters is a subset of that proposed at the TEI work group on corpora chaired by Douglas Biber during its recent meeting in Stockholm. Clearly, balanced sampling across all these parameters is neither desirable nor possible: however it should be possible to characterise each conversation along most of these dimensions. We leave it to TGA to determine which are the most important from the point of view of sampling.

Mode: (spoken, written-to-be-spoken and spoken-to-be-written)

Channel: (radio, telephone, face to face etc.)

Language: (languages, dialects, sublanguages etc used)

Participant: This parameter is subdivided as follows:
- role (addressor, addressee, both)
- number (single, multiple, corporate)
- demographic characteristics (name, age, sex, place of residence, education, occupation, socio-economic status, parent-tongue, ethnic group, dialect etc.)
- awareness of recording
- relation to other participants

Setting: (i.e. physical location)

Factuality: (imaginative, non-imaginative (that is, meant to be taken as factual))

Preparedness: (apparent degree of revision or polishing. Possible values: high, medium, low)

Domain of use: This is probably the single parameter of most help in sampling.
We think it is of great importance that language produced in each of the following situations should be equally well represented in the corpus:
- religious
- art/entertainment
- business/workplace
- education
- domestic
- government
- other public
- mass media (i.e. TV, radio, cinema, press)
Note that, with the exception of the last, which could be combined with any of the others, these are more or less independent of each other.

Purpose: Three axes seem possible, as follows:
- persuade or sell
- transfer information
- entertain

Topic: (subject matter)

Perceived value: (a culture-dependent subjective assessment, with values such as important, beautiful, unmarked, high, medium, low, artistic trash, popular, highbrow, ephemeral)

Originality: (Possible values: translation, adaptation, condensation, abridgment, transcription)
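
As a purely illustrative sketch, and not part of the proposal itself, a per-conversation record covering a few of the parameters above might be written as follows in Python. The class, field and function names, the choice of which parameters to include, and the 0.5 tolerance are all assumptions made for the example. The helper simply counts conversations per domain of use, which is one crude way of checking whether the domains are anywhere near equally represented.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    """Domains of use from the list above (mass media is kept as a flag)."""
    RELIGIOUS = "religious"
    ART_ENTERTAINMENT = "art/entertainment"
    BUSINESS_WORKPLACE = "business/workplace"
    EDUCATION = "education"
    DOMESTIC = "domestic"
    GOVERNMENT = "government"
    OTHER_PUBLIC = "other public"


@dataclass(frozen=True)
class SituationalProfile:
    """Hypothetical per-conversation record; field names are illustrative."""
    conversation_id: int
    mode: str           # e.g. "spoken", "written-to-be-spoken"
    channel: str        # e.g. "face to face", "telephone", "radio"
    setting: str        # physical location
    domain: Domain
    mass_media: bool    # may combine with any domain, per the note above
    preparedness: str   # "high", "medium" or "low"


def domain_counts(profiles):
    """Count conversations per domain, including domains with no samples."""
    counts = Counter(p.domain for p in profiles)
    return {d: counts.get(d, 0) for d in Domain}


def under_represented(profiles, tolerance=0.5):
    """Return domains whose count is below tolerance * the mean count."""
    counts = domain_counts(profiles)
    mean = sum(counts.values()) / len(counts)
    return [d for d, n in counts.items() if n < tolerance * mean]
```

Treating mass media as a separate flag rather than an eighth domain follows the observation that it can be combined with any of the others; a real scheme would need to decide this, and the remaining parameters, on linguistic rather than programming grounds.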