Lou Burnard Minutes of BNC Task Group C Meeting, 5 June 1991 <body> <present> DD (chair), GB, JHC, LB, SC (in part)</FRONT> <DIV1><HEAD>Procedural matters <p id="p1">DD's agenda was agreed. This being the first meeting, there were no outstanding items. The document list for the Task Group was approved. <p id="p2">It was noted that TGCW02 was the most recently received version of the tags used in the Longmans texts to be included in the corpus. (SC later confirmed this.) <DIV1><HEAD>Discussion of TGCW01 <p id="p3">The main item was the proposed CDIF tagset, as defined in document TGCW01. LB apologised for not having revised the draft in light of comments received to date. There had been an internal discussion within OUCS, and two sets of comments had been received from Lancaster. <p id="p4">The alphabetical list of tags in TGCW01 was gone through in some detail. An attempt was made to reach consensus as to whether distinguishing each feature in CDIF was Essential, Desirable, Nice or Undesirable, and to predict whether making such distinctions automatically would be Easy, Tricky or Impossible. The results are listed below, and will also be incorporated in the next revision of TGCW01. <p id="p5"> A number of general points arose during the discussion: <ul> <li>CDIF defined an interchange format, which need not necessarily be the same as either a data capture format, appropriate to the needs of those entering or editing texts, or a data processing format, appropriate to local software. LB said that while making the processing format identical to CDIF would obviously simplify matters at any site, there was certainly no need for data to be captured in CDIF. OUCS would attempt to convert data capture formats to CDIF where this could be done automatically, and would refer any problematic material to OUP for consideration. 
<li>Textual features which could not be identified automatically with any degree of reliability, other than those which by definition had to be entered manually (such as the editorial tags <tag>corr</tag>, <tag>add</tag> etc.) could not be mandated for CDIF, for obvious practical reasons. There was also considerable dispute about textual features thought to be ill-defined or inherently controversial, notably lists. <li>The emphasis on descriptive, as opposed to presentational, markup was likely to cause most problems in converting from existing material, such as the Oxford Pilot Corpus and the Longman material. </ul> <DIV2> <HEAD>Alphabetical list of CDIF features <p>In the following list, tags are listed alphabetically, with a cross reference to the section in the TEI Guidelines where the corresponding feature is defined. <gl><gt>abbr<gd>abbrev[5.3.7]. Agreed to be too tricky to identify and of very little use. <gt>add<gd>add[5.4]. Agreed to be desirable and (by definition) feasible. <gt>address<gd>address[7.5.3]. Somewhere between Nice and Desirable, but tricky to identify. JHC noted that existing texts in the pilot corpus contained many addresses. <gt>back<gd>back[5.2.5]. Agreed to be Essential and Easy. <gt>body<gd>body[5.2.4]. Agreed to be Essential and Easy. <gt>citn<gd>citn[5.5]. There was some discussion of the distinction between this and <tag>q</tag>. It was agreed that embedded citations and references should be marked using the tag, but that internal components (such as author, title etc.) would not be distinguished. Felt to be tricky. <gt>corr<gd>corr[5.4]. Agreed to be desirable and (by definition) feasible. <gt>date<gd>date[5.3.11]. Agreed to be Nice but Tricky. <gt>div1<gd>div1[5.2.4]. Agreed to be Essential and Easy. <gt>emph<gd>emph[5.3.2]. Agreed to be highly problematic to identify and of little use. <gt>enum<gd>enum[5.3.8]. 
Agreed that it was Desirable to mark these whether they appeared within a formal list or as a floating reference to a list item. Much controversy as to whether other parts of a list structure should (or could) be formally identified. <gt>figure<gd>figure[5.9]. Felt to be Essential. No consensus as to feasibility. <gt>foreign<gd>foreign[5.3.4]. No consensus as to importance. Generally felt to be impossible to identify. <gt>front<gd>front[5.2.3]. Agreed to be Essential and Easy. <gt>head<gd>head[5.2.4]. Agreed to be Essential; some doubt as to feasibility. <gt>header<gd>tei.header[4]. Agreed to be Essential and Easy. <gt>hi<gd>highlighted[5.3.2]. Easy to tag, if typographically marked. No consensus as to usefulness. Much discussion as to need for both this and <tag>q.mark</tag>, qv. <gt>in.quot<gd>[5.3.3]. Agreed to be of little use and Impossible to mark automatically. <gt>item<gd>item[5.3.8]. No consensus. LB felt that it might be possible to tag automatically by the presence of <tag>enum</tag>s; JHC disagreed, and felt that the concept was too ill-defined to be of use. <gt>l<gd>l[7.3.1]. Agreed to be Essential and Easy. <gt>list<gd>list[5.3.8]. Agreed (with some reservations from JHC) to be Essential. No consensus as to feasibility. <gt>name<gd>propname[5.3.6]. Agreed to be Desirable but Tricky. <gt>note<gd>note[5.3.9]. Agreed to be Desirable/Essential, and Easy to mark. <gt>number<gd>num[5.3.11]. Some doubt as to utility, but Easy. In spoken texts, felt to be essential, as the intention was to normalise. <gt>p<gd>p[5.3.1]. Agreed to be Desirable/Essential and probably Easy to detect. <gt>point<gd>milestone[5.6.4]. Agreed to be Desirable/Nice, only feasible if reference points were already marked in source material. <gt>q<gd>q[5.3.3]. Agreed to be Desirable, but felt to be Tricky. <gt>q.mark<gd>q.mark[5.3.3]. Disagreement as to the need for this feature as well as <tag>hi</tag>. 
Agreed that it was Easy, though differentiating from <tag>q</tag> was probably very Tricky. <gt>s<gd>s[5.8]. Agreed to be Essential; no consensus as to feasibility. It was suggested that segmentation might be better left to the Lancaster Parser. <gt>text<gd>text/tei.1[2.4]. Agreed to be Essential and Easy. LB noted that more attention was needed to the implications of using samples rather than whole texts in this context: sample boundaries leading to textual discontinuity would need to be marked in some way. <gt>turn<gd>No equivalent. Agreed to be Essential and Easy. <gt>w<gd>No equivalent. Agreed to be Essential at some level, but tricky in that the Lancaster Parser might tokenise the text differently from the input. </gl> <p id="p7">LB reported that Michael Sperberg-McQueen had been assigned to the National Corpus project as TEI Consultant and would be reviewing CDIF and advising on TEI-compatibility at the forthcoming TEI Workshop (1-5 July). <p id="p8">Time precluded discussion of the other documents tabled at the meeting. No date was fixed for the next meeting. </body></ldoc>