BNC: Progress Report : 1991, fourth quarter <date>10 January 1992 <author>Lou Burnard <body> <ol> <li><emph>Task group and other related meetings</emph> OUCS staff attended meetings of Task Groups A (21 Oct), C (12 Nov and 10 Dec) and D (5 Sept). There was also some discussion with OUP about revision of project milestones in view of the considerable slippage resulting from delays in agreement on encoding formats and selection criteria. <li><emph>Computer facilities</emph> <p>No changes in the hardware during this quarter. Direct connexion to the international INTERNET became a real possibility at the end of this reporting period. We are now considering the security implications. <li><emph>Software</emph> The public domain SGML parser continues to be of considerable usefulness. In December we also took delivery of the XTRAN software system produced by Exoterica Corporation, made available to the project at a substantial discount under special licensing arrangements. This software provides powerful facilities for converting to and from SGML, together with an excellent SGML parser. <li><emph>Database</emph> ??? <li><emph>Text Accession</emph> A sample body of texts totalling a million words in prototype CDIF format had been received from OUP by the beginning of December. A detailed evaluation of this material revealed a variety of discrepancies, documented in TGC???, which are in the process of being resolved. About half of the texts have now been fully verified. <p> Following signature of an agreement between Longmans and OUCS, we received several samples from the Longman/Lancaster Corpus, for eventual inclusion in the BNC and release to other participants, when appropriate permissions have been obtained. Some sample written texts have also been received. No progress on converting the written materials to CDIF has yet been made, as this is not a time-critical task. 
A specification for software to convert automatically from the Longman spoken text format to CDIF has been drafted. <li><emph>Text Encoding</emph> At the start of this reporting period, a preliminary version of the CDIF dtd was tested against a small number of written texts, leading to some revision of the dtd. A new dtd, providing all and only the tags agreed to by Task Group C, has now been drafted and is being tested. <p> A first attempt was also made at defining a TEI-conformant header structure for the corpus; this, however, is in need of substantial revision, both because of changes within the TEI recommendations and to incorporate extensions for spoken texts. <p> Following agreement of the encoding scheme for spoken texts by Task Group C, CDIF was expanded to include TEI-conformant versions of the encoding proposed for spoken texts. A new document providing a definitive version of the whole set of CDIF tags is in active preparation. <li><emph>Text Enrichment</emph> A set of codes for linguistic annotation was proposed by Lancaster in September, for which a preliminary set of equivalent TEI feature-structure declarations was drafted. The provision of a full TEI FSD for the corpus is pending revision of the Lancaster tagset, which is subject to discussion with other members of the SALT community. <li><emph>Documentation</emph> Aside from minutes and internal notes, OUCS produced working papers on ???????????????????? <li><emph>Visits and presentations</emph> OUCS staff attended the <q>Using Corpora</q> conference in Oxford in September and the <q>Computers and Teaching in the Humanities</q> conference in Durham in December. During October, presentations about the BNC were given by LB at <q>SGML '91</q> in Providence, and by GB at <q>Text Retrieval '91</q> in London. GB also visited a number of sites in Ireland on behalf of the project, giving lectures and soliciting material, as reported in BNCR12. </ol> </body></gdoc>