BNC Acceptance Procedures: Draft OUCS Proposals
Lou Burnard
15 January 1992, revised 6 March
Introduction
Discussion in Task Group C and at the Project Committee
indicates some urgency in reaching
formal agreement between BNC participants
concerning acceptance procedures and
conformance thresholds for texts entering the Corpus 'sausage
machine'. This note sets out OUCS proposals in this respect, as
revised following discussion with Task Group C. It represents the
consensus reached by that group, and is being presented for
approval by the Project Committee.
The job of reaching an agreement seems to have the following
components:
1. agreement as to the content and structure of CDIF, and in
particular which of the textual features it distinguishes are
mandatory for acceptance purposes;
2. definition of the format or formats in which texts will be
supplied to OUCS by participants, and the relationship between
that format and CDIF;
3. agreement as to the procedure OUCS should follow in validating
the materials once they have been converted, and the thresholds
at which materials should be excluded from the corpus.
This document is mostly concerned with the third of these, but
the following general remarks on the other two may be useful to
place the rest in context.
CDIF
This is the target format for the whole corpus, spoken and written.
A working paper (TGCW25) documenting it in detail is currently being
produced. Its basic content has already been largely agreed to by
all participants, and is summarised in section 4 below. This list has
been expanded beyond that previously agreed by TGC to include tags
needed for spoken text, but is otherwise largely unchanged.
Delivery formats
Participants may, of course, elect to supply material in CDIF
directly, as OUP has done, but this is not essential, provided that
definition of an automatic conversion procedure is feasible and
cost-effective, as (for example) with the Longman spoken material.
It seems likely that we will need to provide a number of such
procedures for texts coming to us from different routes, as not all
sources of corpus material may wish to convert to CDIF themselves. This
is obviously the case with material in pre-existing electronic form. OUCS
is willing and able to do this, provided that we receive a full and
accurate definition of the format used, together with a sufficiently
large sample to test our conversion procedures on. If, in our
estimation, any set of materials provided is so heterogeneous that no single
procedure or set of procedures could achieve the target rates of
throughput and accuracy (see below), we will propose that the material
(or some other exemplars satisfying the same selection criteria) be
re-keyed.
The sausage machine
The following processing steps are envisaged for each text received
at OUCS:
1. Non-CDIF texts are first automatically converted to CDIF. Texts
for which no converter exists are not accepted into the machine.
2. Syntactic accuracy of the markup is checked, using an
industrial-strength SGML parser.
3. Semantic accuracy of the markup is checked at regular intervals
throughout the text, using techniques outlined below.
4. Texts which do not pass both sets of checks will be referred to
the supplier with a suggestion that the material (or some other
exemplars satisfying the same selection criteria) be re-keyed.
5. Texts which do pass both sets of checks will be batched up and
transmitted to Lancaster for enrichment.
6. Texts received from Lancaster will be automatically checked
for syntactic accuracy a second time. We do not anticipate a
need for further semantic checking.
7. Inclusion of texts into the project database, generation of the
standard header, etc. This process is independent of the others
and thus can be carried out in parallel with them.
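The gating logic of these steps can be sketched as follows. This is purely illustrative (none of the function names are project code), and the tag-balance check is only a toy stand-in for an industrial-strength SGML parser, which must also handle attributes, empty elements, tag minimisation and full DTD validation:

```python
import re

def parses_as_sgml(text):
    """Toy stand-in for an SGML parser: checks only that start and end
    tags balance; a real parser validates against the DTD as well."""
    stack = []
    for m in re.finditer(r"<(/?)([A-Za-z][\w.-]*)[^>]*>", text):
        closing, name = m.groups()
        if closing:
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack

def passes_spot_checks(text):
    """Placeholder for the visual semantic spot-checks; in practice
    these are carried out by a person against the original source."""
    return True

def accept_text(text, fmt, converters):
    """Run one text through the 'sausage machine'; return the CDIF
    version if accepted, or None if it must be referred back."""
    if fmt != "CDIF":
        converter = converters.get(fmt)
        if converter is None:
            return None                   # no converter: not accepted
        text = converter(text)
    if not parses_as_sgml(text):          # syntactic check
        return None                       # refer back for re-keying
    if not passes_spot_checks(text):      # semantic check
        return None
    return text                           # batch up for Lancaster
```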
Proposed Thresholds
Converters
In general, we will develop customised software only for formats in
which large amounts (i.e. more than 1 million words) of text are
anticipated and for which we have been provided with a full and
accurate description. An exception already agreed is the Lancaster
model corpus. If, in our opinion, provision of such software would
require investing more than 6 person-hours per 20,000 words, we will
not undertake it.
Conversion of small quantities of text using combinations of ad hoc tools
will be undertaken on a best-endeavours basis, and only where it does
not lead to our overall throughput of material falling below the target
levels implied by project milestones.
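As arithmetic, the conversion rule above amounts to the following check. The function and its parameters are illustrative only, not part of the proposal:

```python
def worth_building_converter(words_anticipated, estimated_person_hours):
    """Apply the two conditions stated above: more than 1 million words
    of text anticipated, and total effort within 6 person-hours per
    20,000 words."""
    if words_anticipated <= 1_000_000:
        return False                          # too little text to justify it
    budget = 6 * words_anticipated / 20_000   # allowed person-hours
    return estimated_person_hours <= budget
```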
Syntactic checking
All texts in the corpus must parse correctly. We will do our best to
fix any systematic or sporadic tagging errors causing SGML parser
error messages, provided that doing so does not take up more than
1 person-hour per 20,000 words.
Semantic checking
Only syntactically valid texts will be checked for semantic accuracy.
For this purpose, we will need copies of the original source material
against which to carry out spot-checks of the encoded text. The checks
will be carried out visually against the start and end of each text
and at several randomly chosen places within the
text. Between 5 and 10 percent by bulk of the texts should be checked
against the original in this way in order to determine whether, in our
opinion, all the textual features which
should have been tagged (i.e. those mandated by CDIF) have in fact been tagged.
Any obvious typos, missing words, etc. will also be noted.
We will do our best to fix all such errors (throughout the text, not just
in the pages checked), but only where we are confident that doing so will
not take up more than 6 person-hours per 20,000 words. While achieving
100% accuracy is recognised as impossible, our intention is to correctly
identify and tag over 90% of occurrences of features distinguished by the
markup scheme.
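The sampling procedure might be sketched like this, treating a text as a sequence of pages. Both the function and the page abstraction are illustrative only, since the checks themselves are carried out visually against the printed source:

```python
import random

def choose_spot_checks(n_pages, fraction=0.05, seed=None):
    """Pick pages to check against the original: always the start and
    end of the text, plus randomly chosen interior pages, until the
    stated fraction (5 to 10 percent by bulk) is covered."""
    rng = random.Random(seed)
    target = min(n_pages, max(2, round(n_pages * fraction)))
    chosen = {0, n_pages - 1}               # start and end of each text
    while len(chosen) < target:
        chosen.add(rng.randrange(n_pages))  # randomly chosen places within
    return sorted(chosen)
```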
Overall limits
The thresholds quoted above are subject to the further constraint that
we do not have the resources to spend more than 40 person-hours a week
on correcting
syntactic or semantic errors in texts submitted for inclusion in the corpus.
CDIF summary
Required
The following textual features are mandated by CDIF. If retained in
the text as captured, they must be marked up, using the tags shown.
Recommended
Distinguishing the following textual features is regarded as highly
desirable. If
present in a text, they should be marked up using the tags shown,
provided that this can be done throughout a text in a reliable and
consistent manner. Some indication should be provided as to which
tags in this category have been supplied in a given text, and whether
coverage is intended to be complete or partial.
add: An editorial addition, supplying for example a word missed
out unintentionally during transcription (SW)
caption: (1) A heading, title etc. attached to a picture or diagram,
usually with deictic content; (2) a 'pull quote' or other text about
or extracted from a text and superimposed upon it to draw attention
to it (may be simply deleted from source) (W)
del: An editorial deletion; in spoken texts, particularly where
words identifying persons or places have been removed in transcription (SW)
div2: A further subdivision of a written text, entirely contained
within a div1, e.g. section (W)
div3: A further subdivision of a written text, entirely contained
within a div2, e.g. subsection (W)
head: A title or heading prefixed to a div2 or div3 (W)
hi: A passage of written text which is typographically highlighted,
for example by italics or bold, where the reason for this cannot be
expressed by other tags (may be simply deleted from source) (W)
label: An enumerator or other label attached to a list item, or
appearing freely within a text (may be simply deleted from source) (SW)
list: A collection of distinct items flagged as such by special
layout in written texts, often functioning as a single syntactic unit
(may be simply deleted from source) (W)
pause: A marked pause during or between utterances in a spoken text (S)
pb: Marks the start of a new page in the original source
(may be simply deleted from source) (W)
poem: A poem or extract from one, embedded or quoted within a text
(may be simply deleted from source) (SW)
reg: Any editorial regularisation, whether to correct a word or
phrase mis-transcribed or mis-spelled, or to normalise variant
spellings (SW)
shift: A marked change in voice quality (S)
sic: A word or phrase which has not been regularised, but which is
in doubt, for example a spoken word which the transcribers cannot
recognise (SW)
trunc: A word or phrase which has been truncated during speech (S)
unclear: A point in a spoken text at which it is unclear what is
happening, e.g. who is speaking or what is being said (S)
vocal: A non-linguistic but communicative noise made by one of the
participants in a spoken text (S)
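By way of illustration, a written-text fragment marked up with some of the recommended tags might look as follows. This sketch is hypothetical: the paragraph tag (p) is assumed to belong to the required set, and content models, attributes and tag minimisation are defined by TGCW25, not here.

```sgml
<div2>
<head>Chapter the Second</head>
<p>The meeting opened at <reg>ten o'clock</reg>; the chairman
appeared <hi>in person</hi> and the minutes were read
<sic>allowed</sic>.</p>
<pb>
</div2>
```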
Optional
Distinguishing the following textual features is entirely optional at
the corpus acquisition stage. These tags are provided for the
convenience of participants wishing to preserve features already
encoded in texts.
abbrev: Any acronym or abbreviation. In spoken texts acronyms
are spelled out as they are pronounced, but need not be tagged as such (SW)
back: Matter not forming part of a text but appended to it in an
appendix or similar (may be simply deleted from source) (W)
citn: A bibliographic citation, containing possibly an author,
title, page reference etc. (W)
date: A calendar date in any format (SW)
epigraph: A quotation or dedication prefixed to some division
of a written text (may be simply deleted from source) (W)
event: A non-communicative event (e.g. a door slamming) occurring
during a conversation and regarded as worthy of note (S)
front: Material prefixed to but not forming part of a written text
(may be simply deleted from source) (W)
marked: A word or phrase regarded as marked, for example as
non-English, technical, archaic, regional etc. (may be simply deleted
from source) (SW)
propname: Proper name of a person, place or institution (SW)
q: A quotation, either embedded or displayed; also, any representation
within a written text of spoken language (e.g. dialogue) (W)
salute: A formulaic greeting, appearing at the start or end of
some unit of a text (W)
title: The title of a book, song or similar bibliographic entity,
either within a citn or cited elsewhere in a written or spoken text (SW)
Generated
Markup of the following features will be automatically generated during
the corpus enrichment process at Lancaster.
s: A segment of text corresponding to the CLAWS segmentation
scheme (SW)
word class codes: These will be converted to pointers linked
to a TEI-conformant feature structure declaration. Further
details to be supplied (SW)