Software for Searching Large SGML Textbases

Proposal for a research post to be funded by the British Library

Submitted by OUCS and UCREL on behalf of the BNC Consortium

CONTEXT

[One paragraph explaining what the BNC is, to be drafted]

GOALS

1. Report on existing public domain software tools for the analysis and indexation of large (multi megabyte) text databases in SGML.
2. Design and implement a package of text-searching tools, using existing public domain utilities as far as possible. Basic functionality of this package to be defined by needs of BNC Consortium. Package to be implemented in a machine-independent manner, initially for UNIX workstations running under Motif.
3. Define and document interfaces to the package which will permit users to extend its capabilities for other environments or applications.
4. Assess performance of the resulting package with large text corpora using the BNC as a testbed.
5. Construct distribution version of package, complete with documentation and test examples.

SUGGESTED SET OF FUNCTIONS (to be refined by the TC)

- KWIC concordance generation
- string-based (exact and fuzzy) retrieval and browsing of lexical and grammatical items, alone or in combination
- browsing etc. delimited in terms of SGML structure
- arbitrary user-defined subsetting of corpus
- statistical output, e.g. frequency lists, collocation lists etc.

Functionality to be no less than that proposed as the IS&RP deliverable in the Consortium Agreement. Performance should be comparable when operating on the whole corpus or on a subset of it.

METHOD

1. A Software Engineer will be appointed as soon as possible to join existing team at OUCS, but working closely with design and development team at UCREL. Person should be in post by Aug/Sept at latest, and preferably earlier.
   A recent computing graduate with interest and experience in Unix text processing and Internet resources might be appropriate; alternatively, a more experienced person might be sought, to work for a correspondingly shorter period.
2. Survey existing public domain or suitably licensable software libraries for relevant tools. These to include, among others: SGML parsers, text indexing software, text browsers, text editors.
3. In close consultation with OUCS and UCREL, design modular system architecture for initial package. Identify and implement relevant public domain tools. Create and document any necessary additional components.
4. Implement and test the whole package. Validate it against the BNC. Prepare corpus-specific application testbed.
5. Amend testbed in light of experience during (4). Fix bugs. Document scope and interfaces of package. Release to selected test sites together with completion of BNC project.
6. Continue with bug fixes and minor enhancements to system, based on feedback from first release. Prepare first public release.
7. Report on results of survey; assess usability of existing software as opposed to writing from scratch; identify any particular problems in the approach taken to searching large text bases.
8. On completion, the basic software tools developed for the project will be placed in the public domain. Any BNC-specific applications of them will be distributed together with the corpus itself.

BUDGET

1 yr RA1 + overheads (or 6 months consultant)   20-22K
Estimated additional hardware                     4-5K
Travel etc.                                         2K
Total not to exceed                                30K
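
Illustration of two items from the suggested set of functions (KWIC concordance generation and frequency lists), given as a minimal Python sketch only. It assumes the text fits in memory and strips markup with a crude regular expression rather than an SGML parser. The function names tokens, kwic and frequency_list and the sample text are hypothetical, and nothing here reflects the eventual design of the package or the actual markup of the BNC.

```python
import re
from collections import Counter

# Assumption: a crude pattern for anything tag-like; a real system
# would use a proper SGML parser instead.
TAG = re.compile(r"<[^>]+>")
WORD = re.compile(r"[A-Za-z']+")


def tokens(sgml_text):
    """Strip markup and return lower-cased word tokens."""
    return [w.lower() for w in WORD.findall(TAG.sub(" ", sgml_text))]


def kwic(words, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, w in enumerate(words):
        if w == keyword:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, w, right))
    return hits


def frequency_list(words, n=10):
    """Return the n most frequent words with their counts."""
    return Counter(words).most_common(n)


# Hypothetical sample text with sentence-level tags.
sample = "<s>The cat sat on the mat.</s> <s>The dog chased the cat.</s>"
words = tokens(sample)

# Print each hit with the keyword aligned in a central column.
for left, key, right in kwic(words, "cat"):
    print(f"{left:>20} [{key}] {right}")

print(frequency_list(words, 2))
```

A linear scan of this kind would not scale to a corpus of BNC size; the indexing tools to be surveyed under METHOD (2) are what would make such searches practical.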