BNC Data Capture and Processing

Summary of meeting held at OUCS, 23rd August, 1991

Dominic Dunlop, 27th August, 1991

[This report appears initially in plain text because of time constraints. Later, I may put together a LaTeX version showing the characters named by the entities used in the Corpus -- DD]

PRESENT

Gavin Burnage    OUCS
Lou Burnard      OUCS
Jeremy Clear     OUP
Dominic Dunlop   OUCS

BACKGROUND

As a result of concern in OUCS at the disparity between the mark-up proposed by OUP for use during data capture and the CDIF (Common Document Interchange Format) to be used within the British National Corpus project, a meeting had been arranged between LB and JC to discuss the mark-up and processing of texts captured by OUP for the BNC project using the OUCS KDEM (Kurzweil Data Entry Machine). The major headings which follow correspond to the four topics discussed.

VOLUME OF DATA CAPTURED

A quick calculation showed that, at the current KDEM utilization of 47 hours per week (35 by OUP staff; 12 by OUCS) scanning 12 pages per hour, about 250,000 words per week would be captured. This adds up to about 20 million words over the 84 working weeks of the data capture phase of the BNC project. Allowing for 10 million words each from existing corpora at OUP and Longman, and a further 10 million from the Longman spoken corpus project, this leaves a shortfall of 50 million words relative to the 100 million that is the Corpus target.

Three possibilities for increasing the volume of scanned data captured were discussed:

-- Purchase of a scanner by OUP. While Caere's OmniPage for the Macintosh looks like good value relative to the current KDEM model, the KDEM has a higher throughput. DD referred JC to a comparative review in the April 1991 edition of Byte.

-- OUP buys additional KDEM time from OUCS. At the projected productivity, the KDEM would need to work almost continuously on BNC material if the whole of the projected shortfall were to be made up.
This is probably not practical.

-- OUP buys additional scanning services from commercial suppliers. This would be expensive, and would require the production of good specifications for mark-up, error rates and so on. (No bad thing.)

Other sources of material were also discussed:

-- Existing texts available as computer data (typesetting tapes etc.). JC wanted to avoid this source of material if possible: his experience was that rekeying or scanning almost always turned out to be cheaper and more effective than transduction. OUCS may have the resources for this work, provided that not too great a variety of formats is encountered, but cannot estimate its difficulty without a test text or two.

-- Rekeying in the Far East. LB said that, provided that clear mark-up instructions were given, this could well be cost-effective, given the volume of data that we wished to process. He would make enquiries through contacts on other projects using such services, and report back to JC.

No conclusions were reached.

MARK UP OF CAPTURED DATA

There was a discussion of the mapping between the structural and non-structural marks given in OUP's document ``Corpus Markup -- Codes for Freelancers'', as passed to OUCS KDEM staff by JC, and CDIF mark-up. Documents produced by DD setting out work by GB and LB showed the correspondence, and were updated as a result of the discussion. They appear as appendix A (structural mark-up) and appendix B (non-structural) to this document.

JC introduced another document, ``Corpus Markup'' (TGCW04), which is similar in content to ``Codes for Freelancers'', but which covers more ground, particularly in the area of mark-up for material from periodicals. As TGCW04 is not in the possession of the OUCS KDEM staff, it would probably be better to present any modifications to data entry markup as relative to ``Codes for Freelancers'', rather than to TGCW04.
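
For reference, the capture-volume estimate given under VOLUME OF DATA CAPTURED can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the figures are those quoted in the report, while the function name and dictionary keys are illustrative. The report gives no words-per-page figure, so the one computed here is merely what the quoted numbers imply.

```python
# Figures quoted in the report.
KDEM_HOURS_PER_WEEK = 35 + 12    # OUP staff hours + OUCS hours
PAGES_PER_HOUR = 12
WORDS_PER_WEEK = 250_000         # the report's rounded weekly estimate
CAPTURE_WEEKS = 84
CORPUS_TARGET = 100_000_000

# Words expected from sources other than scanning.
OTHER_SOURCES = {
    "OUP existing corpus": 10_000_000,
    "Longman existing corpus": 10_000_000,
    "Longman spoken corpus": 10_000_000,
}


def capture_estimate():
    """Return the scanning volume and the shortfall against the Corpus target."""
    pages_per_week = KDEM_HOURS_PER_WEEK * PAGES_PER_HOUR
    scanned_words = WORDS_PER_WEEK * CAPTURE_WEEKS
    shortfall = CORPUS_TARGET - sum(OTHER_SOURCES.values()) - scanned_words
    return {
        "pages_per_week": pages_per_week,
        # Not stated in the report; implied by the two quoted rates.
        "implied_words_per_page": WORDS_PER_WEEK / pages_per_week,
        "scanned_words": scanned_words,
        "shortfall": shortfall,
    }


if __name__ == "__main__":
    for key, value in capture_estimate().items():
        print(f"{key}: {value:,.0f}")
```

Unrounded, the scanned total comes to 21 million words and the shortfall to 49 million; the report rounds these to about 20 million and 50 million respectively.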
On the matter of paragraph marking, JC was concerned at the cost of doing this at data capture time, but all agreed that heuristics applied after data capture could not produce wholly accurate results. JC agreed to have paragraphs marked during data capture for a trial period, after which the impact on productivity would be assessed and the matter reviewed.

OUCS queried the use of the WordPerfect spell checker subsequent to data capture, as it is not targeted at the type of error made by OCR, and could result in untagged normalizations of variant spellings in texts. JC contended that it was the most agreeable spelling checker that he had encountered, and that he had had good results from it in similar applications. OUP is supplying a commercial copy to OUCS for use on the BNC project. (Subsequent enquiries show that OUCS has academic copies for other KDEM projects, but that these cannot be used for the commercial BNC contract.)

SUBSEQUENT PROCESSING OF CAPTURED DATA

The upshot of the discussion on mark-up was that scanned data will initially be marked up using marks very similar to those in ``Codes for Freelancers'', and subsequently transduced by OUP into CDIF according to the mappings given in appendices A and B. On receiving CDIF texts, OUCS will check that they parse correctly, then apply further processing to enrich the mark-up -- for example, by attempting to infer the reason for the use of highlighting. This done, the text will be passed to Lancaster for the tagging of parts of speech. (OUCS and Lancaster have yet to discuss who is responsible for the marking of segments.) Texts returned from Lancaster will be converted back to CDIF if necessary (again, a topic for future discussion), and added to the Corpus.
Discussing this topic after the meeting, LB and DD realised that, as it is the intention of the BNC project to publish the criteria by which texts in the Corpus have been marked up, it will be necessary to document the transformations applied by OUP software. Further discussion is required.

DISPOSITION OF CAPTURED DATA

OUP had been planning that captured texts would be written to diskette and mailed to OUP for subsequent processing before being returned to OUCS on some medium or another for incorporation into the Corpus. GB, LB and DD felt that there must be some more efficient way of doing things, particularly since KDEM staff could deposit data directly on the disks of the BNC Suns by using PC-NFS over the Ethernet.

It was agreed that checked texts from the KDEM will be written to the BNC Suns' disks. OUCS BNC staff (DD, that is) will send accumulated texts to OUP periodically (probably weekly) on QIC-150 cartridge for further processing. OUP will carry out the processing, and pass the CDIF-conforming results back to OUCS, again on QIC-150 cartridge. For security, OUCS will archive both checked KDEM output and CDIF text on the VAX, but it should not normally be necessary to access these archives.

The disposition of scanned data is to be as follows. (This table includes processing steps involving Lancaster, which were discussed only very briefly.)

KDEM staff        Scan text
                  Clean up
                  Spell check using WP 5.1
                  Mark up using ``Codes for Freelancers''
                  Write to BNC Sun disks

OUCS BNC staff    Archive data
                  Consolidate on to QIC-150 cartridge
                  Ship to OUP

OUP staff         Convert data to CDIF
                  Consolidate on to QIC-150 cartridge
                  Ship to OUCS

OUCS BNC staff    Parse text
                  Archive text
                  Enrich text
                  Archive text
                  Ship to Lancaster (method t.b.d.)

Lancaster         Mark parts of speech in text
                  Ship to OUCS (method t.b.d.)

OUCS BNC staff    Transduce text to CDIF (if necessary)
                  Archive text
                  Incorporate into Corpus

APPENDIX A

Correspondence between features in ``Corpus Markup -- Codes for Freelancers'' (May 1991) and CDIF markup.

Features identified in ``Codes for Freelancers''

Freelancers   CDIF   Notes
-----------   ----   -----
... ... ...
{...} ... 1 ... ... 2 3 3 ... ... 3 ... ... 3 ... ... ... ... ... bloody bloody

Features allowed by CDIF, but not identified in ``Codes for Freelancers''. This list excludes marks for phrases -- date, name, number etc. -- as it is not likely to be practical to capture these during data entry. It also excludes segments ( ... ) for the same reason. OUP has agreed that, with the exception of (see note) and possibly endnotes, all these features will be identified at data capture.

Freelancers   CDIF   Notes
-----------   ----   -----
4 4 4 5 6 7
4 4 ... 8
9

Notes
-----

1. The mark-up ``{ ... }'' may be used to distinguish any distinct typeface or type style used for emphasis and which can easily be recognized during data entry. (``... Freelancers'' states that it should be used only for italics.)

2. The (caption) tag needs to be added to CDIF. It should have an attribute which allows the function of the caption to be stated.

3. The , ... data entry tags should not be used. Instead, footnotes should be interpolated into text at the point where they are referenced (as in TeX, troff etc.). This requires block move operations with an editor subsequent to data capture.

4. Data entry tagging instructions will be updated so that these features can be identified and marked.

5. At the meeting, nobody could remember the function of this tag, which is to flag the leading and/or trailing deleted material in a sampled text. Since this matter must be identified in the Corpus, the issue of how it should be identified during data capture is still open.

6. The tag (or anything corresponding to it) should not be used during data capture. It may subsequently be used during enrichment of the mark-up by OUCS.

7. Although not listed in ``... Freelancers'', can be used if necessary during data entry, and should subsequently be transduced into ... . This presents no problems. The use during data capture of , ... is currently tacitly allowed, but may be problematic should it happen in practice: TEI Guidelines, and hence CDIF, do not permit nesting to this level.

8. There was some discussion of the interpolation of endnotes into text in the same manner as had been agreed for footnotes, but no conclusion was reached.

9. For a trial period, paragraphs are to be marked during data capture. See report above.

APPENDIX B

Correspondence between non-structural marks in ``Corpus Markup -- Codes for Freelancers'' (May 1991) and ISO 8879 (SGML) Public Entities.
Marks defined in ``Codes for Freelancers''

Freelancers   Public Entity   Name                                  Notes
-----------   -------------   ----                                  -----
"             “               double quotation mark, left           1
"             ‘               single quotation mark, left           1
#             £               pound sign
$             $               dollar sign
&1/3;         ⅓               fraction one-third
&1/2;         ½               fraction one-half
&1/4;         ¼               fraction one-quarter
&Agr;         &Agr;           capital Alpha, Greek
&agr;         &agr;           small alpha, Greek
∧             &               ampersand
β             &bgr;           small beta, Greek
&Bgr;         &Bgr;           capital Beta, Greek
¢             ¢               cent sign
©             ©               copyright sign
&degree;      °               degree sign
Δ             &Dgr;           capital Delta, Greek
δ             &dgr;           small delta, Greek
÷             ÷               divide sign
&Egr;         &Egr;           capital Epsilon, Greek
ε             &egr;           small epsilon, Greek
=             =               equals sign
&ft;                          feet (single quote)                   3
Γ             &Ggr;           capital Gamma, Greek
γ             &ggr;           small gamma, Greek
>             >               greater-than sign
∞             ∞               infinity
&ins;                         inches (double quote)                 3
<             <               less-than sign
π             &pgr;           small pi, Greek
&sub1;                        subscript one
&sub2;                        subscript two
&sub3;                        subscript three
¹             ¹               superscript one
²             ²               superscript two
³             ³               superscript three
×             ×               multiply sign
¥             ¥               yen sign
'             '               apostrophe                            1
*             •               round bullet, filled                  1
*             ⁃               rectangle, filled (hyphen bullet)     1
*             ▪               sq bullet, filled                     1
==            …               ellipsis (horizontal)
\1            ´               acute accent
\2            `               grave accent
\3            ¨               umlaut mark
\4            ˆ               circumflex accent
\5            ¸               cedilla
\6            ˜               tilde
\7                            hacek                                 2
\8                            oblique                               2
\9            ˚               ring
\10           ¯               macron
\11           ˛               ogonek
\12           ˙               dot above
\13                           crossbar                              2
_             —               em dash
`             ”               double quotation mark, right          1
`             ’               single quotation mark, right          1
~             –               en dash

Notes
-----

1. Some entities which are distinguished in SGML public entity sets are not distinct in ``... Freelancers''. This is not a problem.

2. BNC needs to define entities for these accents if, on further investigation, they cannot be found in a public entity set.

3. BNC needs to define entities (probably identical to the data entry mark-up entities) for these marks.

SGML Public Entities (excluding Technical Use entities) for which no code is given in ``Codes for Freelancers''. Should it prove necessary to enter these characters, the following entity names should be used.

Freelancers   Public Entity   Name
-----------   -------------   ----
              Æ               capital AE diphthong (ligature)
              &Aacgr;         capital Alpha, accent, Greek
              Á               capital A, acute accent
              Ă               capital A, breve
              Â               capital A, circumflex accent
              А               capital A, Cyrillic
              À               capital A, grave accent
              Ā               capital A, macron
              Ą               capital A, ogonek
              Å               capital A, ring
              Ã               capital A, tilde
              Ä               capital A, dieresis or umlaut mark
              Б               capital BE, Cyrillic
              Ч               capital CHE, Cyrillic
              Ć               capital C, acute accent
              Č               capital C, caron
              Ç               capital C, cedilla
              Ĉ               capital C, circumflex accent
              Ċ               capital C, dot above
              Ђ               capital DJE, Serbian
              &DScy;          capital DSE, Macedonian
              &DZcy;          capital dze, Serbian
              ‡               double dagger
              Ď               capital D, caron
              Д               capital DE, Cyrillic
              Đ               capital D, stroke
              &EEacgr;        capital Eta, accent, Greek
              &EEgr;          capital Eta, Greek
              Ŋ               capital ENG, Lapp
              Ð               capital Eth, Icelandic
              &Eacgr;         capital Epsilon, accent, Greek
              É               capital E, acute accent
              Ě               capital E, caron
              Ê               capital E, circumflex accent
              Э               capital E, Cyrillic
              Ė               capital E, dot above
              È               capital E, grave accent
              Ē               capital E, macron
              Ę               capital E, ogonek
              Ë               capital E, dieresis or umlaut mark
              Ф               capital EF, Cyrillic
              Ѓ               capital GJE, Macedonian
              Ğ               capital G, breve
              Ģ               capital G, cedilla
              Ĝ               capital G, circumflex accent
              Г               capital GHE, Cyrillic
              Ġ               capital G, dot above
              Ъ               capital HARD sign, Cyrillic
              Ĥ               capital H, circumflex accent
              Ħ               capital H, stroke
              Е               capital IE, Cyrillic
              IJ              capital IJ ligature
              Ё               capital IO, Russian
              &Iacgr;         capital Iota, accent, Greek
              Í               capital I, acute accent
              Î               capital I, circumflex accent
              И               capital I, Cyrillic
              &Idigr;         capital Iota, dieresis, Greek
              İ               capital I, dot above
              &Igr;           capital Iota, Greek
              Ì               capital I, grave accent
              Ī               capital I, macron
              Į               capital I, ogonek
              Ĩ               capital I, tilde
              І               capital I, Ukrainian
              Ï               capital I, dieresis or umlaut mark
              Ĵ               capital J, circumflex accent
              Й               capital short I, Cyrillic
              Ј               capital JE, Serbian
              Є               capital JE, Ukrainian
              Х               capital HA, Cyrillic
              &KHgr;          capital Chi, Greek
              Ќ               capital KJE, Macedonian
              Ķ               capital K, cedilla
              К               capital KA, Cyrillic
              &Kgr;           capital Kappa, Greek
              Љ               capital LJE, Serbian
              Ĺ               capital L, acute accent
              Ľ               capital L, caron
              Ļ               capital L, cedilla
              Л               capital EL, Cyrillic
              &Lgr;           capital Lambda, Greek
              Ŀ               capital L, middle dot
              Ł               capital L, stroke
              М               capital EM, Cyrillic
              &Mgr;           capital Mu, Greek
              Њ               capital NJE, Serbian
              Ń               capital N, acute accent
              Ň               capital N, caron
              Ņ               capital N, cedilla
              Н               capital EN, Cyrillic
              &Ngr;           capital Nu, Greek
              Ñ               capital N, tilde
              Œ               capital OE ligature
              &OHacgr;        capital Omega, accent, Greek
              &OHgr;          capital Omega, Greek
              &Oacgr;         capital Omicron, accent, Greek
              Ó               capital O, acute accent
              Ô               capital O, circumflex accent
              О               capital O, Cyrillic
              Ő               capital O, double acute accent
              &Ogr;           capital Omicron, Greek
              Ò               capital O, grave accent
              Ō               capital O, macron
              Ø               capital O, slash
              Õ               capital O, tilde
              Ö               capital O, dieresis or umlaut mark
              &PHgr;          capital Phi, Greek
              &PSgr;          capital Psi, Greek
              П               capital PE, Cyrillic
              &Pgr;           capital Pi, Greek
              Ŕ               capital R, acute accent
              Ř               capital R, caron
              Ŗ               capital R, cedilla
              Р               capital ER, Cyrillic
              &Rgr;           capital Rho, Greek
              Щ               capital SHCHA, Cyrillic
              Ш               capital SHA, Cyrillic
              Ь               capital SOFT sign, Cyrillic
              Ś               capital S, acute accent
              Š               capital S, caron
              Ş               capital S, cedilla
              Ŝ               capital S, circumflex accent
              С               capital ES, Cyrillic
              &Sgr;           capital Sigma, Greek
              Þ               capital THORN, Icelandic
              &THgr;          capital Theta, Greek
              Ћ               capital TSHE, Serbian
              Ц               capital TSE, Cyrillic
              Ť               capital T, caron
              Ţ               capital T, cedilla
              Т               capital TE, Cyrillic
              &Tgr;           capital Tau, Greek
              Ŧ               capital T, stroke
              &Uacgr;         capital Upsilon, accent, Greek
              Ú               capital U, acute accent
              Ў               capital U, Byelorussian
              Ŭ               capital U, breve
              Û               capital U, circumflex accent
              У               capital U, Cyrillic
              Ű               capital U, double acute accent
              &Udigr;         capital Upsilon, dieresis, Greek
              &Ugr;           capital Upsilon, Greek
              Ù               capital U, grave accent
              Ū               capital U, macron
              Ų               capital U, ogonek
              Ů               capital U, ring
              Ũ               capital U, tilde
              Ü               capital U, dieresis or umlaut mark
              В               capital VE, Cyrillic
              Ŵ               capital W, circumflex accent
              &Xgr;           capital Xi, Greek
              Я               capital YA, Cyrillic
              Ї               capital YI, Ukrainian
              Ю               capital YU, Cyrillic
              Ý               capital Y, acute accent
              Ŷ               capital Y, circumflex accent
              Ы               capital YERU, Cyrillic
              Ÿ               capital Y, dieresis or umlaut mark
              Ж               capital ZHE, Cyrillic
              Ź               capital Z, acute accent
              Ž               capital Z, caron
              З               capital ZE, Cyrillic
              Ż               capital Z, dot above
              &Zgr;           capital Zeta, Greek
              &aacgr;         small alpha, accent, Greek
              á               small a, acute accent
              ă               small a, breve
              &acirc;         small a, circumflex accent
              а               small a, Cyrillic
              æ               small ae diphthong (ligature)
              à               small a, grave accent
              ā               small a, macron
              ą               small a, ogonek
              å               small a, ring
              *               asterisk
              ã               small a, tilde
              ä               small a, dieresis or umlaut mark
              б               small be, Cyrillic
              ␣               significant blank symbol
              ▒               50% shaded block
              ░               25% shaded block
              ▓               75% shaded block
              █               full block
              ˘               breve
              ¦               broken (vertical) bar
              \               reverse solidus
              ć               small c, acute accent
              ⁁               caret (insertion mark)
              ˇ               caron
              č               small c, caron
              ç               small c, cedilla
              ĉ               small c, circumflex accent
              ċ               small c, dot above
              ч               small che, Cyrillic
              ✓               tick, check mark
              ○               circle, open
              ♣               club suit symbol
              :               colon
              ,               comma
              @               commercial at
              ℗               sound recording copyright sign
              ✗               ballot cross
              ¤               general currency sign
              †               dagger
              ↓               downward arrow
              ‐               hyphen (true graphic)
              ˝               double acute accent
              ď               small d, caron
              д               small de, Cyrillic
              ♦               diamond suit symbol
              ¨               dieresis
              ђ               small dje, Serbian
              ⌍               downward left crop mark
              ⌌               downward right crop mark
              ѕ               small dse, Macedonian
              đ               small d, stroke
              ▿               down triangle, open
              ▾               dn tri, filled
              џ               small dze, Serbian
              &eacgr;         small epsilon, accent, Greek
              é               small e, acute accent
              ě               small e, caron
              ê               small e, circumflex accent
              э               small e, Cyrillic
              ė               small e, dot above
              &eeacgr;        small eta, accent, Greek
              &eegr;          small eta, Greek
              è               small e, grave accent
              ē               small e, macron
               		              1/3-em space
               		              1/4-em space
               		              em space
              ŋ               small eng, Lapp
               		              en space (1/2-em)
              ę               small e, ogonek
              ð               small eth, Icelandic
              ë               small e, dieresis or umlaut mark
              !               exclamation mark
              ф               small ef, Cyrillic
              ♀               female symbol
              ffi             small ffi ligature
              ff              small ff ligature
              ffl             small ffl ligature
              fi              small fi ligature
              fj              small fj ligature
              ♭               musical flat
              fl              small fl ligature
              ⅕               fraction one-fifth
              ⅙               fraction one-sixth
              ⅛               fraction one-eighth
              ⅔               fraction two-thirds
              ⅖               fraction two-fifths
              ¾               fraction three-quarters
              ⅗               fraction three-fifths
              ⅜               fraction three-eighths
              ⅘               fraction four-fifths
              ⅚               fraction five-sixths
              ⅞               fraction seven-eighths
              &frac58;        fraction five-eighths
              ǵ               small g, acute accent
              ğ               small g, breve
              ĝ               small g, circumflex accent
              г               small ghe, Cyrillic
              ġ               small g, dot above
              ѓ               small gje, Macedonian
               		              hair space
              ½               fraction one-half
              ъ               small hard sign, Cyrillic
              ĥ               small h, circumflex accent
              ♥               heart suit symbol
              ―               horizontal bar
              ħ               small h, stroke
              ‐               hyphen
              &iacgr;         small iota, accent, Greek
              í               small i, acute accent
              î               small i, circumflex accent
              и               small i, Cyrillic
              &idiagr;        small iota, dieresis, accent, Greek
              &idigr;         small iota, dieresis, Greek
              е               small ie, Cyrillic
              ¡               inverted exclamation mark
              &igr;           small iota, Greek
              ì               small i, grave accent
              ij              small ij ligature
              ī               small i, macron
              ℅               in-care-of symbol
              ı               small i without dot
              ё               small io, Russian
              į               small i, ogonek
              ¿               inverted question mark
              ĩ               small i, tilde
              і               small i, Ukrainian
              ï               small i, dieresis or umlaut mark
              ĵ               small j, circumflex accent
              й               small short i, Cyrillic
              ј               small je, Serbian
              є               small je, Ukrainian
              ķ               small k, cedilla
              к               small ka, Cyrillic
              &kgr;           small kappa, Greek
              ĸ               small k, Greenlandic
              х               small ha, Cyrillic
              &khgr;          small chi, Greek
              ќ               small kje, Macedonian
              ĺ               small l, acute accent
              «               angle quotation mark, left
              ←               leftward arrow
              ľ               small l, caron
              ļ               small l, cedilla
              {               left curly bracket
              л               small el, Cyrillic
              „               rising dbl quote, left (low)
              &lgr;           small lambda, Greek
              ▄               lower half block
              љ               small lje, Serbian
              ŀ               small l, middle dot
              _               low line
              ◊               lozenge or total mark
              ⧫               lozenge, filled
              (               left parenthesis
              [               left square bracket
              ‚               rising single quote, left (low)
              ł               small l, stroke
              ◃               l triangle, open
              ◂               l tri, filled
              ♂               male symbol
              ✠               maltese cross
              ▮               histogram marker
              м               small em, Cyrillic
              &mgr;           small mu, Greek
              µ               micro sign
              ·               middle dot
              …               em leader
              ń               small n, acute accent
              ʼn              small n, apostrophe
              ♮               music natural
               		              no break (required) space
              ň               small n, caron
              ņ               small n, cedilla
              н               small en, Cyrillic
              &ngr;           small nu, Greek
              њ               small nje, Serbian
              ‥               double baseline dot (en leader)
              ¬               not sign
              ñ               small n, tilde
              #               number sign
              №               numero sign
               		              digit space (width of a number)
              &oacgr;         small omicron, accent, Greek
              ó               small o, acute accent
              ô               small o, circumflex accent
              о               small o, Cyrillic
              ő               small o, double acute accent
              œ               small oe ligature
              &ogr;           small omicron, Greek
              ò               small o, grave accent
              &ohacgr;        small omega, accent, Greek
              &ohgr;          small omega, Greek
              Ω               ohm sign
              ō               small o, macron
              ª               ordinal indicator, feminine
              º               ordinal indicator, masculine
              ø               small o, slash
              õ               small o, tilde
              ö               small o, dieresis or umlaut mark
              ¶               pilcrow (paragraph sign)
              п               small pe, Cyrillic
              %               percent sign
              &period;        full stop, period
              &phgr;          small phi, Greek
              ☎               telephone symbol
              +               plus sign
              ±               plus-or-minus sign
              &psgr;          small psi, Greek
               		              punctuation space (width of comma)
              ?               question mark
              "               quotation mark
              ŕ               small r, acute accent
              »               angle quotation mark, right
              →               rightward arrow
              ř               small r, caron
              ŗ               small r, cedilla
              }               right curly bracket
              р               small er, Cyrillic
              ”               rising dbl quote, right (high)
              ▭               rectangle, open
              ®               registered sign
              &rgr;           small rho, Greek
              )               right parenthesis
              ]               right square bracket
              ’               rising single quote, right (high)
              ▹               r triangle, open
              ▸               r tri, filled
              ℞               pharmaceutical prescription (Rx)
              ś               small s, acute accent
              š               small s, caron
              ş               small s, cedilla
              ŝ               small s, circumflex accent
              с               small es, Cyrillic
              §               section sign
              ;               semicolon
              ✶               sextile (6-pointed star)
              &sfgr;          final small sigma, Greek
              &sgr;           small sigma, Greek
              ♯               musical sharp
              щ               small shcha, Cyrillic
              ш               small sha, Cyrillic
              ­               soft hyphen
              ь               small soft sign, Cyrillic
              /               solidus
              ♠               spades suit symbol
              □               square, open
              ☆               star, open
              ★               star, filled
              ♪               music note (sung text sign)
              ß               small sharp s, German (sz ligature)
              ⌖               register mark or target
              ť               small t, caron
              ţ               small t, cedilla
              т               small te, Cyrillic
              ⌕               telephone recorder symbol
              &tgr;           small tau, Greek
              &thgr;          small theta, Greek
               		              thin space (1/6-em)
              þ               small thorn, Icelandic
              ™               trade mark sign
              ц               small tse, Cyrillic
              ћ               small tshe, Serbian
              ŧ               small t, stroke
              &uacgr;         small upsilon, accent, Greek
              ú               small u, acute accent
              ↑               upward arrow
              ў               small u, Byelorussian
              ŭ               small u, breve
              û               small u, circumflex accent
              у               small u, Cyrillic
              ű               small u, double acute accent
              &udiagr;        small upsilon, dieresis, accent, Greek
              &udigr;         small upsilon, dieresis, Greek
              &ugr;           small upsilon, Greek
              ù               small u, grave accent
              ▀               upper half block
              ⌏               upward left crop mark
              ū               small u, macron
              ų               small u, ogonek
              ⌎               upward right crop mark
              ů               small u, ring
              ũ               small u, tilde
              ▵               up triangle, open
              ▴               up tri, filled
              ü               small u, dieresis or umlaut mark
              в               small ve, Cyrillic
              ⋮               vertical ellipsis
              |               vertical bar
              ŵ               small w, circumflex accent
              &xgr;           small xi, Greek
              ý               small y, acute accent
              я               small ya, Cyrillic
              ŷ               small y, circumflex accent
              ы               small yeru, Cyrillic
              ї               small yi, Ukrainian
              ю               small yu, Cyrillic
              ÿ               small y, dieresis or umlaut mark
              ź               small z, acute accent
              ž               small z, caron
              з               small ze, Cyrillic
              ż               small z, dot above
              &zgr;           small zeta, Greek
              ж               small zhe, Cyrillic
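
As an illustration of how the non-structural marks in this appendix might be transduced into SGML entity references, here is a minimal Python sketch. It is not a description of the OUP software, which this report does not specify. The table covers only four unambiguous marks from ``Codes for Freelancers'' (pound sign, ellipsis, em dash, en dash), paired with the standard ISO 8879 names for those characters; the names MARK_TO_ENTITY and transduce_marks are hypothetical. Marks that map to more than one entity (see note 1) would need context-sensitive handling and are left out.

```python
import re

# Illustrative subset of the Appendix B table:
# data-entry mark -> SGML public entity reference.
MARK_TO_ENTITY = {
    "==": "&hellip;",  # ellipsis (horizontal)
    "#": "&pound;",    # pound sign
    "_": "&mdash;",    # em dash
    "~": "&ndash;",    # en dash
}

# Try longer marks first so that a multi-character mark such as "=="
# is matched in preference to any shorter mark sharing its first character.
_MARK_PATTERN = re.compile(
    "|".join(
        re.escape(mark)
        for mark in sorted(MARK_TO_ENTITY, key=len, reverse=True)
    )
)


def transduce_marks(text: str) -> str:
    """Replace each known data-entry mark with its entity reference.

    The substitution is a single pass over the input, so text that has
    just been inserted (for example "&pound;") is never rescanned.
    """
    return _MARK_PATTERN.sub(lambda m: MARK_TO_ENTITY[m.group(0)], text)


if __name__ == "__main__":
    print(transduce_marks("It cost #5 ~ or more=="))
```

A single regular-expression pass is used here rather than a chain of string replacements so that the order of entries in the table cannot cause one substitution to corrupt another.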