Matters for discussion at BNC Task Group A (Design) meeting of 29th August, 1991

Dominic Dunlop, OUCS
28th August, 1991

INTRODUCTION

This is something of a hodge-podge of issues, most of which touch on the provinces of several task groups, and several of which are arguably not primarily the responsibility of task group A. I'm sure other group members will tell me if they think I'm totally out of line!

MATERIALS FOR DISTRIBUTION

1. Lou Burnard's comments on TGCW09 -- materials on spoken corpus classification and mark-up circulated by Steve Crowdy at the TEI workshop.

2. Prototype CDIF DTD derived from the ``toy2'' DTD used at the TEI workshop.

3. Spoken text sample transduced by hand from Longman transcription format to prototype CDIF.

4. Beginning and end of ``The Wimbledon Poisoner'' automatically transduced from OUP pilot corpus mark-up to prototype CDIF.

5. UNIX man page for vm2, a mark-up verifier built on the SGML Users Group's public domain parser.

6. Notes on a meeting to discuss the processing of scanned text, held between OUCS and OUP on 23rd August.

7. ``Transcription Design Principles for Spoken Discourse Research'', John W. Du Bois, UCSB, 5th March, 1991.

These documents should be numbered. I suspect that they are all task group C material.

OUCS STATUS SUMMARY

We have a prototype DTD for CDIF derived from the ``toy2'' subset of TEI P1. It works, but needs more work: to cut out P1 features that are not a part of CDIF; to add some attributes peculiar to CDIF; to accommodate spoken materials properly (see below); and to match the header content to that which we wish to record in the Corpus (see below).

Using a UNIX adaptation of the public domain SGML parser recently distributed by the SGML Users Group, we have successfully parsed a hand-edited version of sample spoken material from Longman, and an automatically transduced ``Wimbledon Poisoner'' -- a novel provided by OUP with Pilot Corpus mark-up. The parser zips along at 100,000 lines per minute on a Sun SPARCstation 2. The MS-DOS version is no slouch either. The problem in either case is that the capacity of the parser (in ``capacity points'') is low: it cannot handle complex DTDs. At the moment, this does not appear to be a problem for the BNC project. We can provide MS-DOS or SunOS versions of the parser, with source code, to anybody who wants them.

We also have a prototype database, on which more work will be done in September. To some extent, further development awaits decisions on the content and format of the header materials for Corpus texts. (See below.)

Talks between OUCS and OUP have resulted in agreement on the processing and disposition of texts scanned for the BNC project on the OUCS KDEM.

MARK-UP OF SPOKEN CORPUS MATERIAL

The TEI workgroup on spoken texts is meeting on 2nd October in Oxford, immediately after the ``Using Corpora'' conference (29th September - 1st October). As a representative of the BNC, Steve Crowdy would be most welcome to attend.

The workgroup is well advanced in the development of a TEI mark-up for spoken material, but does not wish to go public yet. A hint of its thinking can be seen in the current support for spoken material in the prototype CDIF DTD. Because a TEI mark-up specification is imminent, OUCS has not yet specified extensions to CDIF for this purpose. (Milestone III/2, due in July, required this specification.)
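Purely to give a flavour of the kind of thing involved -- the element and attribute names below are invented for this note, and are not taken from the workgroup's drafts or from the prototype DTD -- spoken material might be carried by SGML declarations of roughly this shape:

  <!ELEMENT u      - O  (#PCDATA | pause | vocal)*  -- an utterance: one speaker turn  -- >
  <!ATTLIST u      who   CDATA  #REQUIRED           -- identifies the speaker          -- >
  <!ELEMENT pause  - O  EMPTY                       -- silence within or between turns -- >
  <!ELEMENT vocal  - O  EMPTY                       -- non-verbal noise: laugh, cough  -- >
  <!ATTLIST vocal  desc  CDATA  #IMPLIED            -- optional description            -- >

An instance would then record each speaker turn as a u element; the awkward part, as ever, is overlapping speech, on which more below.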
Particularly in view of the recent progress on the specification and processing of CDIF for written texts, OUCS feels that it would be better to delay a decision on the mark-up for spoken materials until further input has been obtained from the TEI workgroup.

The workgroup and others have also been considering mark-up which can easily be used during the transcription of spoken materials, and which can subsequently easily be transduced into a TEI-conformant form. As usual, minds are exercised by the issue of overlapping speech. Current thinking is that the words of three speakers, all speaking more or less together, can be represented thus by a transcriber:

  Spk 1: Where did I put [1that pen down? [2I thought1] I...2]
  Spk 2: [1Your biro? The red one?1]
  Spk 3: [2Search me.2] Oh, here it is.

and later transduced to some collection of points, references and all that good SGML stuff. This mechanism is similar to that in use at Longman, except that it allows an arbitrary number of overlaps to be ``open'' at once, while the Longman scheme runs out of brackets after just two.

I'm sorry to say that sample materials recently received from Longman are not in a state from which they can automatically be transduced into CDIF: there are too many inconsistencies and departures from the Draft Spoken Corpus Transcription Scheme of TGCW09. (The sample also raises an interesting copyright issue: it starts with a person reading a Postman Pat story to children.)

WHAT DO WE PUT IN THE HEADER, AND HOW DO WE FIND IT OUT?

Sample texts received to date have had close to no header material: we have little or nothing on provenance, author background, copyright holder(s), means of capture, processing prior to receipt, category, sampling procedure if applied, features included or omitted... and so on. Clearly, this cannot continue when we enter the data capture phase of the project. (See also Lou's comments on header information for spoken material.)

We have discussed this before, but do not yet have a firm list of the facts that we want to record about each text (where relevant and accessible), or the format in which we want to record them (arbitrary text; one choice from a constrained list; multiple choices from a constrained list; and so on). Can we decide on such a list, so that OUCS can cast it into some TEI header-conformant mould, and can tell the database how to prompt for the information when texts are introduced to the system, and how subsequently to generate headers for documents in the system? (For example, how much of the information on the title page and verso of a book do we wish to record? What do we need in addition?)

Having decided on the information that we want to capture, the providers of texts (OUP and Longman) can reach an agreement with the consumer (OUCS) as to how the information is to be provided for each text.

WHO CATEGORIZES TEXTS?

Key information in the header of a text determines the category to which it is assigned. This information will later be used to check that we are maintaining balance in the Corpus according to the criteria that we set ourselves. OUCS has a couple of questions about the process of categorization:

1. When does it happen?

   -- Before approval to incorporate (a sample of) the text into the Corpus is sought from the copyright holders?

   -- After copyright approval has been obtained (explicitly or by default) but before data capture?

   -- After data capture but before submission of the text to OUCS?

   -- After the text has been logged by OUCS?
   -- One or other of the above (which?), depending upon circumstances (what circumstances?)

If categorization takes place before OUCS sees the text, do we need placeholders in our database to ``reserve a spot'' for the text when it arrives? And how do we receive the information to be stored in the placeholder? If texts arrive at OUCS uncategorized, we will have to put mechanisms in place to help us nudge the categorizers, and we need to know if this is expected of us. Which brings us to...

2. Who does the categorizing?

SHOULD SEGMENTS NEST?

This is a nitty-gritty encoding issue. OUCS' transduction work to date has assumed that segments (sentences) do not nest, because this makes parsing and checking easier. It can be argued that segments really do nest, and that to pretend that they do not is an unacceptable simplification. Does anybody have any views on the matter?
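To make the choice concrete -- the element name s below is invented for this note, not taken from the prototype DTD -- the two positions correspond to SGML content models of roughly this shape:

  <!ELEMENT s  - O  (#PCDATA)       -- segments may not nest -- >
  <!ELEMENT s  - O  (#PCDATA | s)*  -- segments may nest     -- >

Under the first model, a sentence quoted inside another sentence (in direct speech, say) has to be sliced off as a sibling segment or left unmarked; under the second, it can simply be enclosed, at the cost of the more complicated parsing and checking mentioned above.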