TGCW31

	       A Program for Summarizing CDIF Tag Usage

			    Dominic Dunlop
			   30th April, 1992

Tagsum is a program which examines text files, and reports on the
following:

  -  Word count

  -  Number of line-end hyphens

  -  Usage of non-CDIF tags

  -  Tags used correctly, used incorrectly, and not used in each of the
     classes required, recommended and optional

  -  Characters which should be encoded as entities, but which are not

  -  Usage of non-CDIF entities

  -  Usage of entities known to CDIF but which are not allowed

  -  Usage of entities whose status in CDIF is questionable

  -  Correctly used entities.

The program, which is available on the BNC Suns, fills two needs:

  -  It provides a quick overview of the correctness and extent of the
     markup in incoming files or files which have been passed through
     preliminary automatic processing; and

  -  It generates statistical information which can be used in text
     headers, and which may inform future refinement of the CDIF
     specification.

For further details, see the attached UNIX-style manual page, which is
also available on-line on the BNC Suns.

Please let me know if you have any suggestions on ways in which the
program might be improved.  The source, which requires perl,
patchlevel 19, in order to run, can be supplied on request.


TAGSUM(1)		     TGCW31			TAGSUM(1)


NAME
     tagsum - summarize	CDIF tag and entity usage

SYNOPSIS
     tagsum [ -dfhltvw ] [ filename...	]

DESCRIPTION
     tagsum sends to standard output  summary  information  about
     the  number  of  words,  and the Corpus Document Interchange
     Format (CDIF) tagging  and  entity  usage,  in  each  file
     named  on	its  command line, or in its standard input if no
     files are named.  The filename `-'	may  appear  anywhere  in
     the  list	of filenames, and is interpreted to mean standard
     input.

     The program does not incorporate an SGML  parser.	 Instead,
     information  is  obtained	by counting start tags,	end tags,
     words, entities, characters which should be encoded as enti-
     ties  but	are not, and line-end hyphens.	This approach has
     the following advantages:

      -	  Useful results can be	 obtained  even	 for  texts  with
	  mark-up  which is far	from correct - for example, where
	  no <text> tag	is supplied.

      -	  Summary information about tags which are unused as well
	  as those which are used can be obtained.

      -	  A general impression of the correctness of the  mark-up
	  -  for example, whether there	are unmatched <hi> tags	-
	  can be obtained; if using a parser, it may be	necessary
	  to fix a number of more gross	errors before information
	  of this type comes to	light.

     Of	course,	there are also disadvantages:

      -	  The program cannot recognize incorrect nesting and  tag
	  ordering:  for example, <hi> tag usage will be reported
	  as apparently	correct	so long	as  the	 number	 of  <hi>
	  start	 tags  is  equal to that of end	tags, even if the
	  placement of the tags	is invalid.
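
     The counting approach described above can be illustrated
     with a short Python sketch.  It is not taken from tagsum
     itself (which is written in perl): the regular expressions,
     the function name count_markup and the sample strings are
     all assumptions made for illustration.  The two samples
     differ only in the placement of the <hi> tags.

```python
import re
from collections import Counter

# Assumed patterns: a start tag is "<name ...>", an end tag is
# "</name>", and an entity reference is "&name;".
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*>")
END_TAG = re.compile(r"</([A-Za-z][A-Za-z0-9]*)\s*>")
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
LINE_END_HYPHEN = re.compile(r"-[ \t]*$", re.MULTILINE)


def count_markup(text):
    """Count start tags, end tags, entities and line-end hyphens."""
    return {
        "starts": Counter(name.lower() for name in START_TAG.findall(text)),
        "ends": Counter(name.lower() for name in END_TAG.findall(text)),
        "entities": Counter(ENTITY.findall(text)),
        "hyphens": len(LINE_END_HYPHEN.findall(text)),
    }


# Same tags in both samples; only the order of the <hi> tags differs.
valid = "<p>some <hi>bold</hi> text &mdash; with a hy-\nphen</p>"
invalid = "<p>some </hi>bold<hi> text &mdash; with a hy-\nphen</p>"

good = count_markup(valid)
bad = count_markup(invalid)
print(good == bad)
```

     Both samples give identical counts, so a check based on
     counts alone cannot tell that the second sample ends <hi>
     before starting it.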

     The program considers the usage of	valid  CDIF  tags  to  be
     incorrect in the following	circumstances:

      -   A required tag is never used.  (This is not always an
          error - see BUGS below.)

      -	  A tag	is used, even though a tag to which  it	 must  be
	  subordinate is never used.  For example, the appearance
	  of a <div2> tag in a file containing no <div1> tags  is
	  clearly an error.


BNC		   Last	change:	29 April 1992			1


      -   A tag expected to appear zero or one times appears
          more than once.

      -	  A tag	expected to appear once	appears	some other number
	  of times (including zero).

      -	  A tag	expected to appear one or  more	 times	does  not
	  appear.

      -	  An end tag is	found for a tag	which should be	empty.

      -	  The number of	end tags does not  equal  the  number  of
	  start	tags for a type	of tag which must be ended.

      -	  The number of	end tags exceeds the number of start tags
	  for a	type of	tag which may be ended.
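
     The rules above might be expressed as a single check per
     tag, as in the following Python sketch.  The TagSpec fields
     and the usage_errors function are hypothetical names
     invented here; the tag description data embedded in tagsum
     is not documented in this page and may be organised quite
     differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TagSpec:
    """Hypothetical description of one tag (not tagsum's own table)."""
    name: str
    min_count: int = 0                # fewest start tags expected
    max_count: Optional[int] = None   # most start tags expected; None = no limit
    end: str = "required"             # "required", "optional" or "empty"
    parent: Optional[str] = None      # tag to which this one must be subordinate
    required: bool = False            # True if the tag is in the required class


def usage_errors(spec, starts, ends, parent_starts=0):
    """Return a list of messages for the rules broken by these counts."""
    errors = []
    if spec.required and starts == 0:
        errors.append("required tag never used")
    if starts > 0 and spec.parent is not None and parent_starts == 0:
        errors.append("used without <%s>" % spec.parent)
    if starts < spec.min_count:
        errors.append("appears too few times")
    if spec.max_count is not None and starts > spec.max_count:
        errors.append("appears too many times")
    if spec.end == "empty" and ends > 0:
        errors.append("end tag found for empty tag")
    elif spec.end == "required" and ends != starts:
        errors.append("start and end tag counts differ")
    elif spec.end == "optional" and ends > starts:
        errors.append("more end tags than start tags")
    return errors


div2 = TagSpec("div2", parent="div1")
hi = TagSpec("hi", end="required")

print(usage_errors(div2, starts=3, ends=3, parent_starts=0))
print(usage_errors(hi, starts=4, ends=3))
```

     The three occurrence rules collapse into a minimum and a
     maximum: zero or one is (0, 1), exactly once is (1, 1), and
     one or more is (1, no limit).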

OPTIONS
     -d	  Debugging information, detailing  expected  and  actual
	  usage	 of each tag and entity	appearing in the text, is
	  output ahead of the program's	normal	output	for  each
	  file.

     -f	  Filter output	format is selected: each tag  and  entity
	  name	appears	 on  a	separate  output line for ease of
	  parsing by subsequent	programs  in  a	 pipeline.   This
	  format is the	default	if standard output appears not to
	  be connected to a terminal.  The -f flag over-rides  -t
	  - see	below.

     -h	  The program sends a help  message  to	 standard  error,
	  then exits.

     -l	  Long-format output is	produced: by default, the program
	  reports only on tag and entity usage which is	likely to
	  be erroneous;	long-format output also	includes informa-
	  tion	on  tags  and  entities	 which	appear to be used
	  correctly, and on recommended	and optional  tags  which
	  have	not  been  used.   (No information is given about
	  entities which have not been used.)

     -t	  Terminal-format output is selected:  multiple	 tag  and
	  entity  names	appear on each output line.  This mode is
	  the default if standard output appears to be	connected
	  to a terminal.

     -v	  Verbose class	descriptions are output; normal	 descrip-
	  tions	 are  terse,  and intended to be easily	parsed by
	  subsequent programs in a pipeline.

     -w   Warn if the tag and entity description data embedded in
          the program itself appears to be inconsistent.  This
          option, which increases run-time slightly, is intended
          for use during program development.
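
     The way in which -f, -t and the default interact can be
     summarised in a few lines of Python.  This is a sketch of
     the behaviour described above, not tagsum's own code;
     choose_format and its parameters are hypothetical names.

```python
import sys


def choose_format(opt_f=False, opt_t=False, stdout_is_tty=None):
    """Return "filter" or "terminal" following the -f and -t rules."""
    if stdout_is_tty is None:
        # Default: ask whether standard output looks like a terminal.
        stdout_is_tty = sys.stdout.isatty()
    if opt_f:
        return "filter"        # -f over-rides -t
    if opt_t:
        return "terminal"
    return "terminal" if stdout_is_tty else "filter"


print(choose_format(opt_f=True, opt_t=True, stdout_is_tty=True))
print(choose_format(stdout_is_tty=False))
```

     The first call selects filter format because -f wins over
     -t; the second selects filter format because output is not
     going to a terminal.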

OUTPUT FORMAT
     The output	in filter format (see -f option	above) is as fol-
     lows:

	  filename  words     words

	  filename  hyphens   hyphens

	  filename  tags unknown
	  <tagname>
	  ...

	  filename  tags required  incorrect
	  <tagname>
	  ...

	  filename  tags required  unused
	  <tagname>
	  ...

	  filename  tags required  correct
	  <tagname>
	  ...

	  (sections repeated for recommended and optional tags)

	  filename  entities  missing
	  <entityname>
	  ...

	  (section repeated for	bad,  illegal,	questionable  and
	  valid	entities)

     Any section with no content is omitted from the output, and,
     unless  the  -l option is in force, information about a lack
     of	line-end hyphenation, correctly-used tags of  any  class,
     unused recommended	and optional tags, and valid entities, is
     also omitted.
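
     The omission rules in the preceding paragraph can be
     sketched in Python as follows.  The section keys are
     modelled on the header lines shown above, but the data
     layout, the function name visible_sections and the sample
     tag names and classes are assumptions made for
     illustration only.

```python
# Sections which, according to the text, appear only under -l.
LONG_ONLY = {
    ("tags", "required", "correct"),
    ("tags", "recommended", "correct"),
    ("tags", "optional", "correct"),
    ("tags", "recommended", "unused"),
    ("tags", "optional", "unused"),
    ("entities", "valid"),
}


def visible_sections(sections, hyphens, long_format=False):
    """Return the (key, content) pairs that would be output."""
    shown = []
    if hyphens > 0 or long_format:
        # A zero hyphen count is reported only in long format.
        shown.append((("hyphens",), hyphens))
    for key, names in sections.items():
        if not names:
            continue              # sections with no content are always omitted
        if key in LONG_ONLY and not long_format:
            continue              # shown only when -l is in force
        shown.append((key, names))
    return shown


sections = {
    ("tags", "unknown"): ["<foo>"],            # hypothetical non-CDIF tag
    ("tags", "required", "correct"): ["<text>"],
    ("entities", "missing"): [],
}

print(visible_sections(sections, hyphens=0))
print(visible_sections(sections, hyphens=0, long_format=True))
```

     Without -l only the unknown tag is listed; with -l the zero
     hyphen count and the correctly used tag are listed as well,
     while the empty section is dropped in both cases.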

     The output	in terminal format is similar, except that multi-
     ple tagname or entityname items appear on each line; verbose
     output format affects only	the header lines.

     Debugging information appears in  a  tabular  form,  and  is
     paginated:	a new page is started for each input file.

DIAGNOSTICS
     Complains and exits if given bad command-line options;  com-
     plains but	continues on encountering unreadable files.


EXIT STATUS
     The program exits with a return value  of	zero  unless  the
     program's	internal  tag description or help data appears to
     be	corrupted.

SEE ALSO
     BNC document TGCW25, Markup for non-ISO 646  invariant  part
     characters..., Dominic Dunlop, 25 March, 1992

     BNC document TGCW27, BNC acceptance procedures:  Draft  OUCS
     proposals,	Lou Burnard, 15	January	1992, revised 6	March

     BNC document TGCW30, Corpus Document Interchange  Format  v.
     1.0, Lou Burnard, 12 March	1992

     BNC document TGDW08, Revised proposal for basic  grammatical
     tagset, Geoff Leech, 1 April, 1992

BUGS
     The program currently knows only about tags used in  written
     texts.

     The output	suggests that the omission  of	any  tag  in  the
     required  class  is  a bad	thing.	This is	not the	case: <l>
     and <item>	should be omitted if <poem>  and  <list>  respec-
     tively  do	 not  appear,  and  <div0>  and	<note> should not
     appear unless the	corresponding  structure  exists  in  the
     text.

     The terminal width	and page length	are hard-wired.

     The word-counting algorithm in effect discards  all  tagging
     and the contents of the CDIF <header>, then counts	sequences
     of	 one  or  more	non-whitespace	characters  separated  by
     sequences	of one or more whitespace characters.  This means
     that free-standing	entities, such as &mdash;, are counted as
     if	 they  are words, whether or not this is a valid descrip-
     tion.  The	resulting error	in word	count is  very	small  in
     typical texts.
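
     A minimal Python sketch of the word-counting rule described
     above follows.  It is an approximation written for this
     page, not the algorithm as implemented in tagsum: the
     regular expressions and the name count_words are
     assumptions, as is the choice to delete tags outright
     rather than replace them with spaces.

```python
import re

# Assumed patterns for the CDIF <header> element and for any other tag.
HEADER = re.compile(r"<header\b.*?</header\s*>", re.DOTALL | re.IGNORECASE)
TAG = re.compile(r"<[^<>]*>")


def count_words(text):
    """Count whitespace-separated tokens outside the header and tags."""
    text = HEADER.sub("", text)   # discard the header and its contents
    text = TAG.sub("", text)      # discard all remaining tagging
    return len(text.split())      # runs of non-whitespace characters


sample = (
    "<header>ignored header words</header>\n"
    "<text><p>Wait &mdash; what?</p></text>"
)
print(count_words(sample))
```

     The sample yields a count of three, because the
     free-standing entity &mdash; is counted as a word alongside
     "Wait" and "what?".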

     The program does not report on  errors  involving	end  tags
     unless at least one start tag of the same name is found.

