From: OXVAX::LOU "Lou Burnard" 29-JUL-1991 18:08:57.27
To: NATCORP
CC:
Subj: Geoff Leech on TGC working papers

From: CBS%UK.AC.LANCASTER.CENTRAL1::EIA014 29-JUL-1991 16:46:52.58
To: lou,smbowie
CC:
Subj: Markup

Via: UK.AC.LANCASTER.CENTRAL1; Mon, 29 Jul 91 16:46 BST
From: "Prof G N Leech"
Date: Mon, 29 Jul 91 16:48:03 +0100
Message-Id: <1509.9107291548@central1.lancaster.ac.uk>
To: JHCLEAR@uk.ac.ox.vax, bryant@uk.ac.lancs.comp, garside@com.ibm.watson, lou@uk.ac.oxford.vax, smbowie@uk.ac.oxford.vax
Subject: Markup

From: Geoff Leech
To: Lou Burnard, Jeremy Clear, Simon Murison-Bowie, Terry Cannon, Della Summers & Steve Crowdy
Date: 29 July 1991

COMMENTS ON THE MINUTES OF TASK GROUP C (TGCM02) 7 June 1991
DEALING WITH THE CDIF MARK-UP, AS DISCUSSED IN TGCW01

From the Lancaster point of view, i.e. from the point of view of grammatical tagging, the following points are worth making. I will relate my comments to the alphabetical list of CDIF features, e.g. "abbr", "citn", etc.

"abbr"  If the markup can't handle abbreviation, this is not disastrous for grammatical tagging, although our error rate will go up a little.

"address"  Similarly - if the markup doesn't include addresses, we will manage without.

"citn"  If this were marked, it would prevent us from making some errors, but we could live without "citations" being marked.

"date"  Again, we could live without this.

"enum"  Yes, we need this, please, so that we can recognize not only "1.", "2.", etc. but also "(a)", "(b)", etc. in lists as somehow being extraneous to the text.

"emph"  Yes, forget this: we will rely on "hi" (for "hilight") instead.

"foreign"  From our point of view, it's important to be able to recognize a sequence of words (or even a single word) as belonging to a foreign language. This would only be feasible if the sequence were highlighted typographically, or enclosed in quotes. If this information is not supplied, our grammatical tagger CLAWS goes churning through the French, Latin, etc.
fondly imagining it to be English, with relatively disastrous results. We understand this is difficult for the markers-uppers to manage. But at least some degree of help would be better than none: e.g. if the text suddenly changes into Cyrillic, it's likely to be a foreign language! More seriously, if it contains accents or umlauts, etc., this could be a clue enabling the text to be marked up as foreign ab initio.

"head"  In spite of the comment 'some doubt as to feasibility', we believe it to be important to mark headings, although hierarchical relations between headings are not necessary.

"hi"  We need indications of typographical shifts, which presumably would be indicated by this tag. It would also be helpful to know, for example, whether the highlighting is bold-face, italics, underlining, etc. - as this would be useful information, on occasions, for grammatical tagging.

"item"  (i.e. for items in a list). I am happy to have this omitted, so long as "enum" is included.

"list"  Yes, it would be useful to mark lists. But it is not absolutely essential for us.

"name"  Yes, we would love to have names already tagged before we get to do the grammatical tagging. Does OUP have a nice little machine-readable dictionary of names that it could run over the text beforehand? Seriously, though, there is the problem of ambiguity (SMITH and BROWN are ambiguous, but ROBINSON is not), which would prevent a 100% result, even if your dictionary included the names of all the villages in the world. So, realistically, I doubt whether mark-up will be able to do more than part of the job. One consequence of this, I believe, is that we will have to leave a lot of the words in the BNC ambiguous in the output as either "proper noun" or "common noun".

"note"  It would help us if the notes were collected somewhere at the end of the text/chapter, instead of having footnotes (albeit demarcated by tags) interrupting sentences in the main text.

"number"  I'm glad marking numbers is deemed to be 'easy'. We have problems recognizing roman numerals from the point of view of grammatical tagging. So if someone else can do it, that's great!

"p" and "point"  We're not worried about these.

"q"  We would like this marked if possible, but would understand if it could not be. Would the occurrence of quotation marks ", ', etc. be retrievable from the marked-up text? They ought to be, I submit.

"q.mark"  We're not particularly bothered about this one. "hi" is more useful.

"s"  'It was suggested that segmentation [into "sentences"] might be better left to the Lancaster Parser.' -- How nice of you to think of us! Yes, we could segment the written texts into operational sentences, but it is not exactly straightforward, and some small error rate may be expected. Where we could NOT cope is in the segmentation of the Longman spoken corpus. In fact, a "sentence" would not be an appropriate unit in conversational data. So I would suggest that the "segmentation" should be done by Longman before it comes to us for tagging. Some discussion is no doubt necessary to decide on what principle (e.g. pauses, turns) segmentation should be done. Or could one argue that "s" is unnecessary in the conversational part of the corpus?

"w"  (Presumably this means "word"?) We don't really see the need for this for Lancaster's purposes: it would mean inserting ... a hundred million times, and possibly doubling the length of the corpus in terms of characters. (Adding the grammatical tags would entail further verbosity.)

CLAWS works on the default assumption that a space character demarcates one word from another (peripheral punctuation symbols generally having to be excluded from the spelling of each word). Thus each word can be assigned a grammatical tag unambiguously by adding the tag symbol (say NN1) to the end of the word, separated by (say) the underline symbol: e.g. "tree_NN1", "taken_VVN", etc.
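
The word_TAG convention just described can be illustrated with a minimal sketch. It is written in Python purely for illustration; it is not CLAWS code, and the function names (attach_tag, split_tagged_token, parse_tagged_line) are hypothetical. It assumes only what is stated above: words are separated by spaces, and the tag is joined to the end of the word by an underline.

```python
from typing import List, Tuple


def attach_tag(word: str, tag: str, sep: str = "_") -> str:
    """Join a word and its grammatical tag, e.g. ('tree', 'NN1') -> 'tree_NN1'."""
    return f"{word}{sep}{tag}"


def split_tagged_token(token: str, sep: str = "_") -> Tuple[str, str]:
    """Split 'tree_NN1' into ('tree', 'NN1').

    The split is made at the LAST separator, so a word that itself
    contains the separator keeps it. A token with no separator is
    returned with an empty tag.
    """
    word, found, tag = token.rpartition(sep)
    if not found:
        return token, ""
    return word, tag


def parse_tagged_line(line: str, sep: str = "_") -> List[Tuple[str, str]]:
    """Parse a line of space-delimited word_TAG tokens into (word, tag) pairs."""
    return [split_tagged_token(tok, sep) for tok in line.split()]


if __name__ == "__main__":
    for word, tag in parse_tagged_line("tree_NN1 taken_VVN"):
        print(word, tag)
```

Splitting at the last underline is a design choice of this sketch, made so that an underline inside a word does not get mistaken for the tag separator.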

The only major problem we get with this setup is dealing with cases which require two grammatical tags for one orthographic word (e.g. "He'll", "there's") or one grammatical tag for two or more orthographic words (e.g. "instead of"). We handle the latter by adding digits after the grammatical tag name: "instead_II21 of_II22". The first digit means "This is an n-word idiom"; the second digit means "This is the nth word of the idiom". "II" (the grammatical tag for "preposition") is attached to both parts of the idiom.

As you can see, the style of CLAWS is to deal with all things in a linear-segment fashion. It is a matter of convenience for us to continue like this as far as possible. We would prefer the TEI conversion of grammatical tags to be done at OUCS after we have done the automatic tagging. We believe that this conversion would be unproblematic in that it would be a relatively simple one-to-one mapping.

TWO ADDITIONAL POINTS

There are a couple of potentially tricky things omitted from TGCM02, and these ought to be considered in relation to mark-up.

(a) OCCURRENCE OF "SOFT HYPHENS" AT THE END OF LINES OF TEXT, SIGNALLING THAT THE SAME WORD CONTINUES ON THE NEXT LINE.

The problem, of course, is that soft hyphens in this position cannot be distinguished from hard hyphens. The solution which (I understand) Lou proposes is that a special entity symbol indicates "end-of-line hyphen". This is reasonable, although it doesn't deal with the problem of ambiguity. We would have to build into the grammatical tagger an algorithm to identify (as far as possible) whether the hyphen is hard or not.

(b) OCCURRENCE OF APOSTROPHES AT THE ENDS OF WORDS.

This again is usually ambiguous. Can we assume that such apostrophes are distinguished from end-single-quotes at the mark-up stage? (One would be associated with a , whereas the other wouldn't.)

Geoff Leech