From:	MX%"dominic@natcorp.ox.ac.uk"  5-JAN-1993 17:12:02.32
To:	LOU
CC:
Subj:	What happened in the last quarter

Return-Path:
Received: from natcorp.ox.ac.uk (onions.natcorp) by vax.ox.ac.uk (MX V3.1C) with SMTP; Tue, 05 Jan 1993 15:52:37 +0000
From: Dominic Dunlop
Date: Tue, 5 Jan 93 15:52:08 GMT
Message-ID: <27434.9301051552@onions.natcorp.ox.ac.uk>
To: lou@natcorp.ox.ac.uk
Subject: What happened in the last quarter
CC: users@natcorp.ox.ac.uk

If anybody feels this should be added to or amended, let Lou know.

Computer facilities:
Added a 2.3 GB disk drive, considerably increasing the storage available for corpus texts. It is anticipated that one further drive of this capacity will need to be added before the end of the project. One of the older drives failed and was replaced, providing an opportunity to demonstrate that the back-up scheme works!

Personnel:
No change.

Text accession:
See summary sheet. A weekly status report is now sent automatically to participants with e-mail service.

Text encoding:
Apart from minor tweaks to address problems as they arise, no more work is required on the CDIF text encoding for the bulk of the corpus. (See also Text enrichment below.)

OUCS has been involved with, and to some extent has done work on behalf of, Lancaster, Longman and OUP during the past quarter. This work is intended to ensure that each partner delivers text in a form which causes a minimum of problems for the partner receiving it.

A TEI-conformant content model for text headers has been agreed, but has yet to be fully implemented and applied.

Text enrichment:
Word-class tags and a segmentation strategy for the bulk of the corpus have been agreed and implemented. Work continues on the characteristics and representation of the text enrichment proposed by Lancaster for the ``core corpus'': an extended set of word-class tags has been agreed, and a set of segment types has been proposed. Neither has yet been integrated into CDIF.

Text dissemination:
Accounts on the BNC system are being provided for participants as required.

It had been intended to distribute a ``sample tape'' at the end of 1992. At the time of writing, this has not been done. While there is plenty of written material from OUP which could be redistributed to participants, only small quantities of test data have been received from Longman (and, of this, only the spoken material can be redistributed), and similarly small quantities of word-class tagged material have been received from Lancaster. Participants' views are being sought on whether compiling a sample would be useful under these circumstances.

Documentation:
    Encoding the British National Corpus (paper for ICAME proceedings) (BNCX27)
    Problems in producing a large text corpus (Abstract for ALLC/ACH '93) (BNCX29)
    The New BNC Database (TGCW36)
    weeklyReport -- report on corpus throughput (TGCW39)
    weeklySummary -- summarize corpus throughput (TGCW40)
    minimize -- minimize CDIF tagging... (TGCW41)
    Report on 13th ICAME conference (published in Computers & Texts, October 1992)

Presentations:
(None by us in the basement in the last quarter.)

--- Dominic