Minutes of BNC Task Group A Meeting, 29th August 1991
Longman Dictionaries, Longman House, Harlow.

Jeremy Clear
3 September 1991

PRESENT: Michael Bryant, Gavin Burnage, Jeremy Clear, Steve Crowdy, Dominic Dunlop.

1. Minutes of the Last Meeting and Matters Arising

The minutes were agreed to be an accurate record, and there were no matters arising.

2. Written Corpus Design

2.1 Finalization of Design Specification Document

It was agreed that TGAW04 "Corpus Design Specification" should be split into two documents, one for the written and one for the spoken component of the BNC. TGAW04 should be revised to incorporate the following alterations:

- those documented in the Minutes of the Project Committee meeting, 8th July 1991, section 3.2 b)
- insert a note to the effect that text classification should be carried out at the time of data capture (if possible), such that it will be OUP's responsibility to classify the written texts and Longman's to classify the spoken.
- insert a note to the effect that not all classification fields need necessarily be filled: in certain cases, especially for pre-existing corpus data, not all classification data can be entered.

The revised document can then be signed off as the "Written Corpus Design Specification".

2.2 TEI Header

After some discussion about the amount of classification information which is to appear at the head of each individual text sample, it was agreed that OUCS will make recommendations for the SGML encoding of the classification information and for possible mechanisms for reference from the text sample header to classification and bibliographic data stored externally.

2.3 Definition of "text" Unit for Periodicals

In accordance with the Minutes of the Project Committee of 8 July 1991, sect 3.2 b), the task group discussed the matter of treating whole magazine or newspaper issues as single text samples. It was agreed that OUP will adopt a policy of separating the component articles within magazines and newspapers, where there is some clear variety of domain, and of classifying these individually. The effect on cost and time will be monitored, and it may be necessary to review this policy if it appears that data capture targets are slipping.

3. Spoken Corpus

It was agreed that the spoken material to be collected by recording of selected subjects over a period of days should be termed the "demographic" component of the BNC, and that material gathered from other sources in accordance with targeted categories of speech event should be termed "selective".

3.1 Presentation of Demographic Spoken Corpus Pilot

SC gave a presentation of the details of the British Market Research Bureau's methods in selecting people for recording. He gave an overview of the types of problems that had emerged in the pilot; for example, getting consent from the general public to participate in the recording, possible sampling bias introduced by the willingness (or reluctance) of particular types of people to participate, completion of recording logs, etc. The meeting adjourned for a demonstration of the audio recording, conversion to and from analogue and digital tape, and sample transcriptions.

3.2 Spoken Corpus Design Specification

With reference to the Minutes of the Project Committee, 8 July 1991, 3.2 b) action point e., SC said that he had almost completed a paper on the Spoken Corpus Design Specification which would be presented to the consortium in time for the next Project Committee meeting on 20 September 1991. This paper should constitute a replacement for section 6 of TGAW04 "Corpus Design Specification, DRAFT" (1 July 1991).

3.3 "Queries on Spoken Texts" (TGAW09) 15 August 1991

Lou Burnard had submitted written queries concerning the collection of spoken corpus data and these were discussed in detail. The meeting's responses to LB's notes were as follows:

1. Situational classification: SC's paper will include a revision of the working taxonomy of text types for the spoken corpus.
2. Respondents do not now have to select from a "menu" of speech situations; it was agreed that a fixed list would be too constricting.
3. Monologue and dialogue will be distinguished.
4. The "market research" style of the collection questionnaire was noted, but felt not to be inappropriate or detrimental to the quality or value of the data. Impressionistic categories for the sociological parameters of participants in speech recordings are recognised to be less than ideal but practically necessary.
5. Each speaker is uniquely identified in transcription.
6. Pauses will be encoded simply as short, medium or long.
7. Overlapping speech: it was agreed that Longman's current system is acceptable and should continue.
8. Unidentifiable speakers: Longman's current method of indicating an unidentifiable speaker is quite acceptable.
9. Privacy/anonymity: it was agreed that surnames, telephone numbers and addresses should be replaced by general codes. Some other proper names may also be encoded at the transcriber's discretion if privacy or anonymity is felt to be an issue.
10. It will not be possible to mark which participants were aware of the recording and which were not.
11. Inconsistency in transcription encoding: Longman give clear instructions to transcribers, and spot checks will be carried out as part of post-editing of transcriptions. Some errors will occur.
12. TEI AI2 draft work paper: LB will be invited to review the check list of spoken features contained in this TEI document and report back.

4. Any Other Business

4.1 DD distributed the following documents relating to Task Group C (Encoding and Storage):

TGCW10  Prototype CDIF DTD produced at TEI Workshop in July
TGCW11  Spoken text sample transduced manually from Longman transcription format into prototype CDIF
TGCW12  The beginning and end of "The Wimbledon Poisoner" automatically transduced from OUP pilot corpus markup to prototype CDIF
TGCW13  A Unix manual page for 'vm2', a markup verifier built on the SGML public-domain parser
TGCN01  Notes on a meeting to discuss the processing of scanned text, held between OUP and OUCS, 23rd August 1991
TGCW14  "Transcription design principles for spoken discourse research", John W DuBois, University of California, Santa Barbara, 5th March 1991

4.2 The Task Managers recommend that all papers presented to the Project Committee should normally be copied to the Task Managers.

5. Date of Next Meeting

No further meeting for Task Group A was set.