[This is a lightly revised version of a paper presented at the last BNC Project Committee meeting, which Prof. Leech refers to in his paper "Need to Improve the State of the Corpus". Most attendees at the meeting on the 19th will have already seen it, which is why I did not think to send it earlier.]

PCW51 : Non-commercial access to the BNC

1. OUCS Commitments

Workpackage 4 of the Level 2 plan describes OUCS' responsibilities for providing archival access to the Corpus as follows:

"At the completion of the Project, individuals and institutions not associated with the Project will also be entitled to request copies of parts or the whole of the National Corpus in accordance with conditions yet to be fully formulated. OUCS will provide such copies on magnetic tape or other media, as appropriate, and will recover the cost of doing so by means of an agreed charge. The National Corpus will be distributed with accompanying electronic data derived from the catalogue database, which will provide full information about the sources of the samples and their classification. Similarly, the SGML encoding scheme and input search and retrieval tools will be documented and distributed as part of the National Corpus. A record will be kept of all such external distributions of the National Corpus. Considerations of scale, both of the National Corpus itself, and in the likely demand for it from researchers, imply that some extension of existing OUCS facilities will be needed to support the above requirements."

This document explores
- what conditions of use we might wish to impose on the corpus
- what kind of facilities we might provide for access to it
- what organizational infrastructure would best suit its continued support
- what kind of financial support we think it will need

2. Conditions of Use

The terms under which texts have been acquired for inclusion in the corpus, while generous, appear to require a continued degree of control.
If we are to distribute copies of the whole corpus, or of whole texts, we must be satisfied that the recipients are engaged in linguistic research or teaching activities only. Only limited rights in the majority of texts in the corpus have been granted. We will need to prepare some kind of simple licence spelling out the conditions under which copies of the whole corpus, or of groups of complete texts, are made available, and to whom.

This is less clearly the case where only short extracts from the corpus are distributed, which we believe should be regarded as "fair use". This implies that we would be able to offer an online retrieval service enabling researchers to recover all sentences containing a given phrase or word, for example, without formality. We might wish to restrict the amount of data which could be recovered from the corpus in this way for other reasons, of course.

The User Declaration currently used by the Oxford Text Archive might serve as a useful model for the kind of agreement we would require of potential recipients of the corpus (or parts thereof), but may need some modification in light of the existing agreements with rights holders. Obviously, we must respect the specific details of existing permissions agreements. The crucial elements seem to be:
- a named individual must accept responsibility for controlling access to the copy
- no further redistribution of copies in printed or electronic form (except fair use)
- research applications only
- any commercial application to be referred back to the BNC Consortium

The first provision would allow us to build up a useful database of corpus users. We might additionally request bibliographic information for work done using the corpus.

Electronic redistribution is a potential problem area.
It should be possible for an institution to put a copy of the corpus on its network so that researchers and students can access it, provided someone will take responsibility for ensuring that only "fair use" is made of it. This may be difficult or impossible to enforce in some contexts. Presumably we would have to vet the credibility of both the institution and the individual in this respect.

We might wish to share the task of distributing the corpus itself, e.g. with ICAME for Europe or (eventually) with any of a number of sites in the US or Japan. This, however, is probably precluded by the current agreements with rights holders.

3. Access Facilities

We think we should aim to provide a full range of access facilities, from online browsing to discrete copies on tape or CD. The availability of the Corpus Search Toolkit (CST) currently being produced at OUCS with the aid of a British Library grant gives us considerable flexibility in this respect. We will offer researchers at least the following options:
- A copy of the whole corpus, plus search tools, on tape or CD or downloaded over the network
- Identification of a subset of the corpus, defined by information held in the header or elsewhere (e.g. "female discourse", "newspaper texts containing references to Islam", "texts published outside London", "spoken texts"), followed by transfer of the texts so selected
- Browsing and selection of specific results from the corpus (or a subset of it) using the CST, optionally followed by transfer of those result sets (a result set is a collection of segments not forming a continuous text).

The purpose of the second option above is to allow researchers to build their own subcorpus; they would then be able to index this and manipulate it locally, using either the CST or their own SGML-aware corpus manipulation software.

Electronic forms of the corpus, and subsets or result sets from it, will be distributed only in CDIF.

4. Infrastructure and facilities

We think a good case can be made to SERC and to other interested funding bodies (notably ESRC, British Academy, JISC) for the provision of continued support for the Corpus at OUCS. As well as the computing support needed for continued access to the Corpus via network, an infrastructure is needed which can provide for publicity about access to the corpus, distribution of information about its usage etc. With University support, it might form the focus of a small research unit for visiting scholars (rather like the Waterloo NOED Centre). Such a centre would be appropriately located within OUCS, where other related facilities are already well established: for this we would hope also to get some degree of support from the University.

Access to the original source materials (on paper and on digital audio tape) needs to be considered. The former will soon need specialist conservation, which might perhaps better be done within the BL. The latter are already promised to the NSA. There is an outstanding and unresolved issue relating to anonymization of audio materials.

A clear procedure for collating and fixing errors found in the corpus needs to be identified, both at the level of individual quick fixes, and at the level of substantial revisions affecting the tagging of large numbers of texts. Our current scheme allows for incremental versions of individual texts: some method of informing licensees about the availability of updates needs to be defined and implemented. Similar considerations apply to the provision of enhanced or enriched versions of corpus texts, in which additional encoding has been provided, for example.

One obvious and highly desirable extension would be provision of a version of parts of the spoken corpus in multimedia format with both audio and transcription linked; clearly additional permissions would be needed for this to be done.
By April, the CST will provide a basic set of components for indexing and retrieval and an MS-Windows based client only. Versions of the client for other environments, and the design and development of easy-to-use PC-based analysis software built on the toolkit, will also be of great importance.

We think the BNC will be highly influential in the development of corpus linguistics as a whole and particularly within Europe. As such, it is important that we are ready to participate in new comparable ventures. A centre to act as a focus for such activities and to ensure continuity of expertise in this area is correspondingly important.

5. Financial Support Needed

In the short to medium term (the next three years) we think we will need two to three full-time staff to work on BNC maintenance and distribution. One of these will be a full-time UNIX administrator with responsibility for all aspects of technical support and systems development. Another will be responsible for non-technical management of the centre, designing and implementing documentation and publicity, monitoring usage and distribution etc. Assistants will be needed to work on error correction, secretarial support etc.

The present hardware will need some enhancement to support the anticipated interactive load. We are working on a specification for this. We currently have no equipment for editing or playing digital or analogue audio material; this might well be needed in the future. We may also wish to make a case for purchase of CD-ROM production facilities.

A finger-in-the-breeze figure for the running costs needed to maintain distribution alone over the next three years would be 70-80 K per annum.

LB, 7 Oct 93, revised 12 Jan 94