TGCW31

A Program for Summarizing CDIF Tag Usage

Dominic Dunlop
30th April, 1992

Tagsum is a program which examines text files and reports on the
following:

- Word count
- Number of line-end hyphens
- Usage of non-CDIF tags
- Tags used correctly, used incorrectly, and not used in each of the
  classes required, recommended and optional
- Characters which should be encoded as entities, but which are not
- Usage of non-CDIF entities
- Usage of entities known to CDIF but which are not allowed
- Usage of entities whose status in CDIF is questionable
- Correctly used entities

The program, which is available on the BNC Suns, fills two needs:

- It provides a quick overview of the correctness and extent of the
  markup in incoming files or files which have been passed through
  preliminary automatic processing; and
- It generates statistical information for use in text headers, which
  may also be used in possible future refinement of the CDIF
  specification.

For further details, see the attached UNIX-style manual page, which is
also available on-line on the BNC Suns. Please let me know if you have
any suggestions on ways in which the program might be improved. The
source, which requires perl, patchlevel 19, in order to run, can be
supplied on request.

TAGSUM(1)                        TGCW31                        TAGSUM(1)

NAME
     tagsum - summarize CDIF tag and entity usage

SYNOPSIS
     tagsum [ -dfhltvw ] [ filename... ]

DESCRIPTION
     tagsum sends to standard output summary information about the
     number of words, and the Corpus Document Interchange Format
     (CDIF) tagging and entity usage, in each file named on its
     command line, or in its standard input if no files are named.
     The filename `-' may appear anywhere in the list of filenames,
     and is interpreted to mean standard input.

     The program does not incorporate an SGML parser. Instead,
     information is obtained by counting start tags, end tags, words,
     entities, characters which should be encoded as entities but are
     not, and line-end hyphens.
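The counting approach described above can be illustrated with a minimal
sketch. It is written in Python rather than perl and is not taken from
the tagsum source. The regular expression, the function names, the
folding of tag names to lower case, and the sample tag names are all
assumptions made for illustration; they have not been checked against
the CDIF specification.

```python
import re
from collections import Counter

# Assumed tag shape: "<", an optional "/", a name, optional attributes, ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^<>]*>")

def count_tags(text):
    """Return (start_counts, end_counts), keyed by lower-cased tag name."""
    starts, ends = Counter(), Counter()
    for slash, name in TAG_RE.findall(text):
        (ends if slash else starts)[name.lower()] += 1
    return starts, ends

def unbalanced(starts, ends):
    """Return the sorted names whose start and end tag counts differ."""
    return sorted(n for n in set(starts) | set(ends) if starts[n] != ends[n])

# Made-up sample: the second <p> is opened but the first is never closed.
sample = "<p>One <hi>two</hi> three.<p>Four.</p>"
starts, ends = count_tags(sample)
print(unbalanced(starts, ends))  # ['p']
```

Because only counts are compared, a sketch like this says nothing about
whether the tags are nested or ordered correctly.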
     This approach has the following advantages:

     - Useful results can be obtained even for texts with mark-up
       which is far from correct - for example, where no tag is
       supplied.

     - Summary information about tags which are unused as well as
       those which are used can be obtained.

     - A general impression of the correctness of the mark-up - for
       example, whether there are unmatched tags - can be obtained;
       if using a parser, it may be necessary to fix a number of more
       gross errors before information of this type comes to light.

     Of course, there are also disadvantages:

     - The program cannot recognize incorrect nesting and tag
       ordering: for example, tag usage will be reported as
       apparently correct so long as the number of start tags is
       equal to that of end tags, even if the placement of the tags
       is invalid.

     The program considers the usage of valid CDIF tags to be
     incorrect in the following circumstances:

     - A required tag is never used. (This is actually not always an
       error - see BUGS below.)

     - A tag is used, even though a tag to which it must be
       subordinate is never used. For example, the appearance of a
       tag in a file containing no tags is clearly an error.

     - A tag expected to appear zero or one times appears more than
       once.

     - A tag expected to appear once appears some other number of
       times (including zero).

     - A tag expected to appear one or more times does not appear.

     - An end tag is found for a tag which should be empty.

     - The number of end tags does not equal the number of start tags
       for a type of tag which must be ended.

     - The number of end tags exceeds the number of start tags for a
       type of tag which may be ended.

BNC                     Last change: 29 April 1992

OPTIONS
     -d   Debugging information, detailing expected and actual usage
          of each tag and entity appearing in the text, is output
          ahead of the program's normal output for each file.

     -f   Filter output format is selected: each tag and entity name
          appears on a separate output line for ease of parsing by
          subsequent programs in a pipeline. This format is the
          default if standard output appears not to be connected to a
          terminal. The -f flag overrides -t - see below.

     -h   The program sends a help message to standard error, then
          exits.

     -l   Long-format output is produced: by default, the program
          reports only on tag and entity usage which is likely to be
          erroneous; long-format output also includes information on
          tags and entities which appear to be used correctly, and on
          recommended and optional tags which have not been used. (No
          information is given about entities which have not been
          used.)

     -t   Terminal-format output is selected: multiple tag and entity
          names appear on each output line. This mode is the default
          if standard output appears to be connected to a terminal.

     -v   Verbose class descriptions are output; normal descriptions
          are terse, and intended to be easily parsed by subsequent
          programs in a pipeline.

     -w   Warn if the tag and entity description data embedded in the
          program itself appears to be inconsistent. This option,
          which increases run-time slightly, is intended for use
          during program development.

OUTPUT FORMAT
     The output in filter format (see -f option above) is as follows:

          filename words words
          filename hyphens hyphens
          filename tags unknown ...
          filename tags required incorrect ...
          filename tags required unused ...
          filename tags required correct ...
          (sections repeated for recommended and optional tags)
          filename entities missing ...
          (section repeated for bad, illegal, questionable and valid
          entities)

     Any section with no content is omitted from the output, and,
     unless the -l option is in force, information about a lack of
     line-end hyphenation, correctly-used tags of any class, unused
     recommended and optional tags, and valid entities, is also
     omitted.

     The output in terminal format is similar, except that multiple
     tagname or entityname items appear on each line; verbose output
     format affects only the header lines.

     Debugging information appears in a tabular form, and is
     paginated: a new page is started for each input file.

DIAGNOSTICS
     Complains and exits if given bad command-line options; complains
     but continues on encountering unreadable files.

EXIT STATUS
     The program exits with a return value of zero unless the
     program's internal tag description or help data appears to be
     corrupted.

SEE ALSO
     BNC document TGCW25, Markup for non-ISO 646 invariant part
     characters..., Dominic Dunlop, 25 March, 1992

     BNC document TGCW27, BNC acceptance procedures: Draft OUCS
     proposals, Lou Burnard, 15 January 1992, revised 6 March

     BNC document TGCW30, Corpus Document Interchange Format v. 1.0,
     Lou Burnard, 12 March 1992

     BNC document TGDW08, Revised proposal for basic grammatical
     tagset, Geoff Leech, 1 April, 1992

BUGS
     The program currently knows only about tags used in written
     texts.

     The output suggests that the omission of any tag in the required
     class is a bad thing. This is not the case: and should be omitted
     if and respectively do not appear, and and should not appear
     unless the corresponding structure exists in the text.

     The terminal width and page length are hard-wired.

     The word-counting algorithm in effect discards all tagging and
     the contents of the CDIF
     , then counts sequences of one or more non-whitespace characters
     separated by sequences of one or more whitespace characters.
     This means that free-standing entities, such as &mdash;, are
     counted as if they are words, whether or not this is a valid
     description. The resulting error in word count is very small in
     typical texts.

     The program does not report on errors involving end tags unless
     at least one start tag of the same name is found.
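
The word-counting behaviour noted under BUGS can be illustrated with a
minimal sketch. It is written in Python rather than perl and is not
taken from the tagsum source. The tag pattern and the function name are
assumptions, as is the choice to delete tags outright rather than
replace them with a space. The sketch does not model the additional
element whose contents the program is said to discard.

```python
import re

# Assumed tag shape: anything between "<" and the next ">".
TAG_RE = re.compile(r"<[^<>]*>")

def count_words(text):
    """Delete tags, then count runs of non-whitespace characters."""
    stripped = TAG_RE.sub("", text)
    # str.split() with no argument splits on runs of whitespace.
    return len(stripped.split())

# The free-standing entity &mdash; is counted as a word.
print(count_words("<p>Wait &mdash; what?</p>"))  # 3
```

Under these assumptions the sample yields three words, one of which is
the entity, which is the over-count that the BUGS entry describes.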