OUCS ALLC Foils 920508 [BNCP20]

1. Oxford University Computing Services -- Its role in the British National Corpus Consortium
Dominic Dunlop -- Task manager

2. Topics
2.1. CDIF -- Corpus Document Interchange Format
2.1.1. Conforms to recommendations of the Text Encoding Initiative
2.1.2. An application of Standard Generalized Markup Language
2.2. The "Sausage machine"
2.2.1. Processing of texts overall
2.2.2. Processing of texts within OUCS

3. CDIF -- TEI conformance on a tight budget
A three-year project with a fixed budget needs to keep a rein on costs and time taken for:
3.1. Data capture
3.2. Conversion of existing electronic texts
3.3. Validation of mark-up
3.4. Entry of bibliographic and sampling data

4. CDIF and the TEI Recommendations (1)
Separating the babies from the bath-water...
CDIF is a small subset of the TEI Recommendations in most areas:
4.1. Characters and Character sets
4.1.1. ISO 646 International reference Version
4.1.2. More common ISO public entity sets (ISOnum, ISOpub...)
4.1.3. A few entities of our own (&ft, &ins...)
4.1.4. Word class "&PUQ;tags&NN2"&PUQ
4.1.5. No language indication (supposedly all current British English)

5. CDIF and the TEI Recommendations (2)
5.1. Bibliographic control...
5.1.1. Not particularly important in classifying texts by title or by author
5.1.2. Very important in classifying texts by characteristics

6. CDIF and the TEI Recommendations (3)
6.1. "Features common to many text types"
Considerably simplified relative to potential richness available from TEI recommendations. And we have levels of simplicity...
6.1.1. Required (, ...)
6.1.2. Recommended (, ...)
6.1.3. Optional (, ...)

7. CDIF and the TEI Recommendations (4)
7.1. Analytic and interpretive information
7.1.1. Relatively simple, but all-encompassing, word class tagging
7.1.2. "Blow-up" minimized through use of entities and IDREFs
7.1.3. More in next presentation...

8. CDIF and the TEI Recommendations (5)
8.1. "Specific text types"
8.1.1. Corpus recommendations very important -- and initially inadequate
8.1.2. Simple but innovative support for spoken texts
8.1.3. Minimal support for poetry (silent omission allowed)
8.1.4. No specific support for drama, office documents
8.1.5. Dictionaries specifically excluded from Corpus

9. An example
1. A sample text

<div>s contain paragraphs; they may also contain
a) Lists
b) Poems -- such as "It's only words, and words are all I have..."
c) Lower-level <div>s

1.1 A sub-section
Contents of the sub-section*.
* One further level is supported

10. Required CDIF markup
A sample text containing only mandatory mark-up
A sample text

<div>s contain paragraphs; they may also contain
a) Lists
b) Poems -- such as "It's only words, and words are all I have..."
c) Lower-level <div>s

A sub-section

Contents of the sub-section.

11. Recommended CDIF markup

A sample text containing required and recommended mark-up
A sample text

<div>s contain paragraphs; they may also contain
Lists
Poems -- such as "It's only words, and words are all I have..."
Lower-level <div>s

A sub-section

Contents of the sub-section
One more level is allowed.

12. Optional CDIF markup

A sample text containing required, recommended and optional mark-up
A sample text

<div>s contain paragraphs; they may also contain
Lists
Poems -- such as
It's only words, and words are all I have...
Lower-level <div>s

A sub-section

Contents of the sub-section
One more level is allowed.

13. A Spoken Example
Tom: I used to go to more conferences --
Dick: (interrupting) More conferences?
Tom: (at the same time) than I do now. But I never had to arrange them.

A sample spoken text
I used to go to more conferences than I do now but I never had to arrange them
More conferences?

14. The Sausage machine
[diagram omitted]

15. OUCS tasks -- in 90 minutes per text...
15.1. Process incoming texts
15.2. Syntactic check
15.3. Semantic check
15.4. Create text headers
15.5. Send to Lancaster
15.6. Receive from Lancaster
15.7. Accession to Corpus
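
The ordering of the OUCS tasks can be sketched as a simple staged pipeline. This is an illustration only, not the project's software: the stage names are taken from the task list, while `CorpusText`, `placeholder`, `run_pipeline` and the identifier `sample-001` are hypothetical names introduced here. The rule that a text stops at the first failed stage is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Stage names come from the task list; everything else is hypothetical.
STAGE_NAMES = [
    "Process incoming texts",
    "Syntactic check",
    "Semantic check",
    "Create text headers",
    "Send to Lancaster",
    "Receive from Lancaster",
    "Accession to Corpus",
]


@dataclass
class CorpusText:
    """Hypothetical record for one text moving through the pipeline."""
    ident: str
    completed: List[str] = field(default_factory=list)


Stage = Tuple[str, Callable[[CorpusText], bool]]


def placeholder(text: CorpusText) -> bool:
    """Stand-in for the real work of a stage; always reports success."""
    return True


def run_pipeline(text: CorpusText, stages: List[Stage]) -> CorpusText:
    """Run the stages in order, stopping at the first one that fails
    (assumed behaviour: a failed check holds the text back)."""
    for name, step in stages:
        if not step(text):
            break
        text.completed.append(name)
    return text


stages = [(name, placeholder) for name in STAGE_NAMES]
result = run_pipeline(CorpusText("sample-001"), stages)
print(result.completed[-1])
```

Replacing a `placeholder` with a function that returns `False` models a text that fails a check: it then never reaches the later stages.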