Report on BNC task group A (corpus design) meeting of 10th April, 1991

TGA01

Location: Longman, 5 Bentinck Street, London
Time:     14:00, 10th April, 1991

Present:
    Lou Burnard       OUCS
    Michael Bryant    Lancaster
    Jeremy Clear      OUP (Chair)
    Steve Crowdy      Longman
    Dominic Dunlop    OUCS

Agenda:

The following agenda was used, but only fitfully observed. The report attempts to group business under the headings suggested by the agenda, even where discussion took place throughout the meeting.

1. Task groups
   1.1 Terms of reference
   1.2 Frequency of meetings
   1.3 Protocols & procedures
2. Review of current position
3. Action points towards a design specification
4. Milestones & timetable
5. Any other business
6. Date of next meeting

Report

0. Opening remarks

Lou introduced Dominic Dunlop, who has recently joined OUCS as their project manager for the BNC. Though not officially a member of the task group, Lou will attend future meetings as required.

1. Task groups

1.1 Terms of reference

Those present agreed that the terms of reference and division of responsibilities for the Task Groups proposed in Jeremy Clear's memo of 27th February were reasonable. It was also considered reasonable that text capture and storage, notionally separate concerns, should be handled by a single Task Group.

1.2 Frequency of meetings

A monthly meeting of each task group was judged to be too frequent -- even where, because of common membership, more than one group met in the same place on the same day. It was agreed that each group should hold at least four meetings per year, with additional meetings as required. Hosting and chairing of meetings will rotate between the participants.

1.3 Protocols & procedures

There was a discussion of the function of the Task Group. An important task is to make sure that nothing falls through the cracks between the four participating organizations. Another is to report any cost over-runs or project slippages to the Project Committee for resolution as soon as they are identified.

2. Review of current position

JC mentioned that Chambers was anxious to get involved with the BNC project and that efforts were under way to apply formally to the DTI for the additional participant. Chambers would not be expected to take an active role in the task groups, but to make a financial contribution which would offset some costs which OUP had provisionally undertaken to bear.

SC summarized the state of play on copyright. Three organizations, the Publishers' Association, the Society of Authors and the Association of Authors' Agents, were being approached in order to secure their support for the BNC. Permission requests are likely to be in the form of an informal letter describing the aims of the BNC project, rather than a formal contract. The Publishers' Association seems to be supportive; individual authors are very helpful but the agents are expected to be more difficult. No consideration has yet been given to use of the Corpus outside the UK. The DTI would be likely to have a position on this.

The aim of the Copyright and Permissions Task Group is that all texts in the archive shall be cleared for all reasonable uses. If this proves not to be possible, the number of classes of exclusions will be minimised.

3. Action points towards a design specification

The group reviewed the minutes of the BNC design meeting of 15th March, and agreed that ``design'' referred to description of the data, not to its structure.

The sources for BNC texts were estimated as follows:

    20 million words from existing corpora held by Longman/Lancaster, by OUP (Oxford Pilot Corpus) and possibly by OUCS (Oxford Text Archive).

    30 million words converted from existing machine-readable texts.

    50 million words of newly-captured texts. OUP is overseeing this work, provisionally costed at one pound per thousand keystrokes, and representing somewhere between 300 and 500 million keystrokes, depending on the level of mark-up. Better cost estimates are expected after a future pilot run.

There was prolonged discussion of Longman's and OUP's contribution to the BNC from their existing materials, despite protestations from LB that to assign too much weight to the needs of only 20% of the corpus would be to allow the tail to wag the dog. Nevertheless, there is considerable disagreement between Longman and OUP about what constitutes modern English: Longman's existing corpus, of which it wishes (not least because of the notional financial value) to contribute as much as possible to the project, includes materials from the whole of the twentieth century; OUP, on the other hand, would prefer that all material in the BNC was no older than perhaps 15 years. Clearly, a compromise must be reached. It transpires that something over 50% of Longman's material dates from after 1975, and that the texts used in the pioneering Brown & LOB corpus-based work date from no later than 1961. A suitable cut-off date may lie between these two points. The issue is to be discussed between Longman and OUP next week.

The matter of the breakdown of the corpus into particular types of text (imaginative, scientific etc.) is also still open to question. It was agreed that any breakdown chosen must be easily defensible, but no conclusion was reached, other than that the 40% of imaginative works in the Longman/Lancaster corpus was on the high side, and that the ten ``superfields'' described in a paper to be circulated by JC and DS by 6th May are broadly acceptable.

JC will, by 16th April, deliver a first draft paper describing envisaged uses of the corpus. This will be an aid to its classification and to other design issues.

There was also considerable discussion of the means of obtaining, and of the contents of, a spoken corpus. SC proposed an ambitious scheme in which the conversations of a large number of representative people chosen by a market research organization would be recorded and subsequently transcribed.
The group agreed that this was an excellent idea, and that JC should write a note to the Project Committee recommending a pilot project. The mark-up of spoken material may or may not conform to the same DTD as written material -- this issue has yet to be decided. On behalf of OUCS, LB agreed that the DAT cartridges holding the recorded conversations would be archived by OUCS, but not duplicated or distributed by them. The question of duplication and distribution will be addressed in the future, should it become necessary.

There was some concern that Della Summers (Longman) appeared to oppose the inclusion of ``semi-scripted'' or prepared broadcast material in the corpus. The group agreed that such an omission was difficult to justify, and that the means of obtaining such materials (and the necessary clearances etc.) should be investigated. (No action was placed on anybody to do the investigation.)

4. Milestones & timetable

It was agreed that the production of a design specification was an urgent priority. JC undertook to produce by 3rd May a draft design document which would synthesise the design features proposed by Longman and OUP.

LB reported that he was working on a markup/encoding specification, and presented a very early draft document. He stated that he was working towards the design of a single DTD which would encompass all the document types in the Corpus. Should this prove not to be feasible, the number of DTDs would be kept to a minimum. There was some discussion of the level of abstraction of a single DTD: it would be likely to be rather abstract, and might require augmentation by individual users in order to satisfy particular needs, or better to reflect conventional names for entities in particular types of texts. This could perhaps be achieved by using concurrent DTDs. LB also confirmed that word-level markup, appropriate for encoding the word-class tags, would be a part of the markup specification.

5. Any other business

None.

6. Date of next meeting

A meeting will be held at OUCS during the last week of May. Since no-one except JC had a diary to hand, JC is to finalize a date.