Software for Searching Large SGML Textbases
  Proposal for a research post to be funded by the British Library
    Submitted by OUCS and UCREL on behalf of the BNC Consortium

CONTEXT

[One paragraph explaining what the BNC is, to be drafted]

GOALS

1. Report on existing public domain software tools for the analysis and
indexing of large (multi-megabyte) text databases in SGML.

2. Design and implement a package of text-searching tools, using
existing public domain utilities as far as possible. Basic functionality 
of this package to be defined by needs of BNC Consortium. Package to be
implemented in a machine-independent manner, initially for UNIX
workstations running under Motif. 

3. Define and document interfaces to the package which will permit users
to extend its capabilities for other environments or applications.

4. Assess performance of the resulting package with large text corpora
using the BNC as a testbed. 

5. Construct distribution version of package, complete with
documentation and test examples.

SUGGESTED SET OF FUNCTIONS (to be refined by the TC)

- KWIC concordance generation
- string-based (exact and fuzzy) retrieval and browsing of lexical and
  grammatical items, alone or in combination
- browsing etc. delimited in terms of SGML structure
- arbitrary user-defined subsetting of corpus
- statistical output, e.g. frequency lists and collocation lists
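
As an illustration of the first function listed above, KWIC (keyword in
context) concordance generation can be sketched in a few lines. This is
a minimal sketch only, not the proposed package: it assumes plain text
with any SGML markup already stripped, and the function name kwic and
its width parameter are hypothetical.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword.

    Hypothetical helper for illustration; assumes markup-free text.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


if __name__ == "__main__":
    sample = "The cat sat on the mat. The cat ran."
    for line in kwic(sample, "cat", width=10):
        print(line)
```

A production tool would work from an index rather than scanning the raw
text, since a linear scan is unlikely to be acceptable on a corpus of
the BNC's size.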

Functionality to be no less than that proposed as the IS&RP
deliverable in the Consortium Agreement.

Performance should be comparable when operating on the whole corpus or
on a subset of it. 
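
One possible way to meet the subset requirement above is an inverted
index whose results are intersected with the user-defined subset, so
that lookup cost is driven by the number of matching texts rather than
by the size of the corpus. The sketch below is an assumption about how
this might be done, not a design decision; the names build_index and
search, and the text identifiers, are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased word to the set of text ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, word, subset=None):
    """Return ids of texts containing word, optionally limited to subset."""
    hits = index.get(word.lower(), set())
    # Restricting to a subset is a set intersection on the hit list.
    return hits if subset is None else hits & subset


if __name__ == "__main__":
    docs = {"A01": "the cat sat", "A02": "the dog ran", "B01": "a cat ran"}
    index = build_index(docs)
    print(sorted(search(index, "cat")))
    print(sorted(search(index, "cat", subset={"A01", "A02"})))
```

The same index serves both the whole corpus and any subset of it; only
the final intersection step differs.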


METHOD

1.  A Software Engineer will be appointed as soon as possible to join
the existing team at OUCS, working closely with the design and
development team at UCREL. The person appointed should be in post by
August/September at the latest, and preferably earlier. A recent
computing graduate with interest and experience in Unix text processing
and Internet resources might be appropriate; alternatively, a more
experienced person might be sought, to work for a correspondingly
shorter period.

2.  Survey existing public domain or suitably licensable software
libraries for relevant tools. These to include among others: SGML
parsers, text indexing software, text browsers, text editors. 

3. In close consultation with OUCS and UCREL, design a modular system
architecture for the initial package. Identify and integrate relevant
public domain tools. Create and document any necessary additional
components.

4. Implement and test the whole package. Validate it against the BNC.
Prepare corpus-specific application testbed. 

5. Amend testbed in light of experience during (4). Fix bugs. Document
scope and interfaces of package. Release to selected test sites
to coincide with completion of the BNC project.

6. Continue with bug fixes and minor enhancements to system, based on
feedback from first release. Prepare first public release.

7. Report on results of survey; assess usability of existing software as
opposed to writing from scratch; identify any particular problems in the
approach taken to searching large textbases.

8. On completion, the basic software tools developed for the project
will be placed in the public domain. Any BNC-specific applications of
them will be distributed together with the corpus itself. 

BUDGET

1 yr RA1 + overheads (or 6 months consultant)   20-22K
Estimated additional hardware                     4-5K
Travel etc.                                         2K

Total not to exceed                                30K