Matters for discussion at BNC Task Group A (Design) meeting of 29th August, 1991

Dominic Dunlop, OUCS
28th August, 1991

INTRODUCTION

This is something of a hodge-podge of issues, most of which touch on the provinces of several task groups, and several of which are arguably not primarily the responsibility of task group A. I'm sure other group members will tell me if they think I'm totally out of line!

MATERIALS FOR DISTRIBUTION

1. Lou Burnard's comments on TGCW09 -- materials on spoken corpus classification and mark-up circulated by Steve Crowdy at the TEI workshop.

2. Prototype CDIF DTD derived from the ``toy2'' DTD used at the TEI workshop.

3. Spoken text sample transduced by hand from Longman transcription format to prototype CDIF.

4. Beginning and end of ``The Wimbledon Poisoner'' automatically transduced from OUP pilot corpus mark-up to prototype CDIF.

5. UNIX man page for vm2, a mark-up verifier built on the SGML Users Group's public domain parser.

6. Notes on a meeting to discuss the processing of scanned text, held between OUCS and OUP on 23rd August.

7. ``Transcription Design Principles for Spoken Discourse Research'', John W. Du Bois, UCSB, 5th March, 1991.

These documents should be numbered. I suspect that they are all task group C material.

OUCS STATUS SUMMARY

We have a prototype DTD for CDIF derived from the ``toy2'' subset of TEI P1. It works, but needs more work: to cut out P1 features that are not a part of CDIF; to add some attributes peculiar to CDIF; to accommodate spoken materials properly (see below); and to match the header content to that which we wish to record in the Corpus (see below).

Using a UNIX adaptation of the public domain SGML parser recently distributed by the SGML Users Group, we have successfully parsed a hand-edited version of sample spoken material from Longman, and an automatically transduced ``Wimbledon Poisoner'' -- a novel provided by OUP with Pilot Corpus mark-up. The parser zips along at 100,000 lines per minute on a Sun SPARCstation 2. The MS-DOS version is no slouch either. The problem in either case is that the capacity of the parser (in ``capacity points'') is low: it cannot handle complex DTDs. At the moment, this does not appear to be a problem for the BNC project. We can provide MS-DOS or SunOS versions of the parser, with source code, to anybody who wants them.

We also have a prototype database, on which more work will be done in September. To some extent, further development awaits decisions on the content and format of the header materials for Corpus texts. (See below.)

Talks between OUCS and OUP have resulted in agreement on the processing and disposition of texts scanned for the BNC project on the OUCS KDEM.

MARK-UP OF SPOKEN CORPUS MATERIAL

The TEI workgroup on spoken texts is meeting on 2nd October in Oxford, immediately after the ``Using Corpora'' conference (29th September - 1st October). As a representative of the BNC, Steve Crowdy would be most welcome to attend.

The workgroup is well advanced in the development of a TEI mark-up for spoken material, but does not wish to go public yet. A hint of its thinking can be seen in the current support for spoken material in the prototype CDIF DTD. Because a TEI mark-up specification is imminent, OUCS has not yet specified extensions to CDIF for this purpose. (Milestone III/2, due in July, required this specification.)
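Purely to give a flavour of the kind of thing involved -- the element and attribute names below are invented for this note, and are not taken from the workgroup's drafts or from the prototype DTD -- spoken material might be carried by SGML declarations of roughly this shape:

  <!ELEMENT u      - O  (#PCDATA | pause | vocal)*  -- an utterance: one speaker turn  -- >
  <!ATTLIST u      who   CDATA  #REQUIRED           -- identifies the speaker          -- >
  <!ELEMENT pause  - O  EMPTY                       -- silence within or between turns -- >
  <!ELEMENT vocal  - O  EMPTY                       -- non-verbal noise: laugh, cough  -- >
  <!ATTLIST vocal  desc  CDATA  #IMPLIED            -- optional description            -- >

An instance would then record each speaker turn as a u element; the awkward part, as ever, is overlapping speech, on which more below.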
Particularly in view of the recent progress on the specification and processing of CDIF for written texts, OUCS feels that it would be better to delay a decision on the mark-up for spoken materials until further input has been obtained from the TEI workgroup.

The workgroup and others have also been considering mark-up which can easily be used during the transcription of spoken materials, and which can subsequently easily be transduced into a TEI-conformant form. As usual, minds are exercised by the issue of overlapping speech. Current thinking is that the words of three speakers, all speaking more or less together, can be represented thus by a transcriber:

  Spk 1: Where did I put [1that pen down? [2I thought1] I...2]
  Spk 2: [1Your biro? The red one?1]
  Spk 3: [2Search me.2] Oh, here it is.

and later transduced to some collection of points, references and all that good SGML stuff. This mechanism is similar to that in use at Longman, except that it allows an arbitrary number of overlaps to be ``open'' at once, while the Longman scheme runs out of brackets after just two.

I'm sorry to say that sample materials recently received from Longman are not in a state from which they can automatically be transduced into CDIF: there are too many inconsistencies and departures from the Draft Spoken Corpus Transcription Scheme of TGCW09. (The sample also raises an interesting copyright issue: it starts with a person reading a Postman Pat story to children.)

WHAT DO WE PUT IN THE HEADER, AND HOW DO WE FIND IT OUT?

Sample texts received to date have had close to no header material: we have little or nothing on provenance, author background, copyright holder(s), means of capture, processing prior to receipt, category, sampling procedure if applied, features included or omitted... and so on. Clearly, this cannot continue when we enter the data capture phase of the project. (See also Lou's comments on header information for spoken material.)

We have discussed this before, but do not yet have a firm list of the facts that we want to record about each text (where relevant and accessible), or the format in which we want to record them (arbitrary text; one choice from a constrained list; multiple choices from a constrained list; and so on). Can we decide on such a list, so that OUCS can cast it into some TEI header-conformant mould, and can tell the database how to prompt for the information when texts are introduced to the system, and how subsequently to generate headers for documents in the system? (For example, how much of the information on the title page and verso of a book do we wish to record? What do we need in addition?)

Having decided on the information that we want to capture, the providers of texts (OUP and Longman) can reach an agreement with the consumer (OUCS) as to how the information is to be provided for each text.

WHO CATEGORIZES TEXTS?

Key information in the header of a text determines the category to which it is assigned. This information will later be used to check that we are maintaining balance in the Corpus according to the criteria that we set ourselves. OUCS has a couple of questions about the process of categorization:

1. When does it happen?

   -- Before approval to incorporate (a sample of) the text into the Corpus is sought from the copyright holders?

   -- After copyright approval has been obtained (explicitly or by default) but before data capture?

   -- After data capture but before submission of the text to OUCS?

   -- After the text has been logged by OUCS?
   -- One or other of the above (which?), depending upon circumstances (what circumstances?)

If categorization takes place before OUCS sees the text, do we need placeholders in our database to ``reserve a spot'' for the text when it arrives? And how do we receive the information to be stored in the placeholder? If texts arrive at OUCS uncategorized, we will have to put mechanisms in place to help us nudge the categorizers, and we need to know if this is expected of us. Which brings us to...

2. Who does the categorizing?

SHOULD SEGMENTS NEST?

This is a nitty-gritty encoding issue. OUCS' transduction work to date has assumed that segments (sentences) do not nest, because this makes parsing and checking easier. It can be argued that segments really do nest, and that to pretend that they do not is an unacceptable simplification. Does anybody have any views on the matter?
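To make the choice concrete -- the element name s below is invented for this note, not taken from the prototype DTD -- the two positions correspond to SGML content models of roughly this shape:

  <!ELEMENT s  - O  (#PCDATA)       -- segments may not nest -- >
  <!ELEMENT s  - O  (#PCDATA | s)*  -- segments may nest     -- >

Under the first model, a sentence quoted inside another sentence (in direct speech, say) has to be sliced off as a sibling segment or left unmarked; under the second, it can simply be enclosed, at the cost of the more complicated parsing and checking mentioned above.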