LongmanCONV(1l)	  MISC.	REFERENCE MANUAL PAGES	  LongmanCONV(1l)


NAME
     longmanconv - convert Longman transfer format to approximate
     CDIF

SYNOPSIS
     longmanconv [-cvz] Longman-format_file...

DESCRIPTION
     longmanconv takes as its input one	 or  more  Longman-format
     (dot)  files  and	produces corresponding A_ files.  Option-
     ally,  corresponding  OUP-style   classification	(classif)
     information  and  documentation  (Z_) files may also be pro-
     duced.  To	convert	all the	dot files in the  current  direc-
     tory, producing A_	and Z_ files, use the command line

	  longmanconv -vz .??????

     To	do the same, and in addition generate OUP-style	classifi-
     cation  data for subsequent automatic insertion into the BNC
     database, use

	  longmanconv -cvz .?????? > ../classif

     The program operates on each input	file in	 three	phases	-
     transduction, examination,	and documentation.  The	transduc-
     tion and examination phases are controlled	by tables appear-
     ing at the	end of the program source file - see FILES below.

     In	the transduction phase,	 the  Longman  file  prologue  is
     transformed  into a CDIF prologue,	including a dummy header.
     If	the -c option (see below) is in	force,	information  from
     the  Longman  file	 prologue  is written to standard output.
     The format	of the data is close to	that of	 the  classifica-
     tion  records  provided by	OUP for	written	texts, and can be
     processed in the same manner - see	NOTES below.

     A sequence	of regular expression-based substitutions is then
     made on the remainder of the file.	 These cover

     -	  differences between Longman transfer format, as defined
	  in TGCW03 and	TGCW56,	and CDIF;

     -	  differences between Longman  data  capture  format,  as
	  defined in TGCW03, and CDIF (which should be handled by
	  Longman prior	to transfer, but which appear  from  time
	  to time in received files); and

     -	  a number of heuristic	 fixes	for  commonly-encountered
	  errors in Longman transfer format files.
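
     As a rough illustration of table-driven substitution, the
     Python sketch below applies an ordered list of (pattern,
     replacement) pairs.  It is not the program's Perl source.  The
     table contents, the name transduce and the &pound; entity name
     are assumptions, modelled on the handling of non-ASCII
     characters described under ^@ below.

```python
import re

# Hypothetical table: the real one sits at the end of the program source.
# Both entries are guesses modelled on the ^@ check described in this page.
SUBSTITUTIONS = [
    # IBM PC (code page 437) pound sterling sign; &pound; is an assumed name
    (re.compile(rb"\x9c"), b"&pound;"),
    # any remaining run of control/non-ASCII bytes becomes a single null
    (re.compile(rb"[^\t\n\r\x20-\x7e]+"), b"\x00"),
]

def transduce(body):
    """Apply each substitution, in table order, to the text after the prologue."""
    for pattern, replacement in SUBSTITUTIONS:
        body = pattern.sub(replacement, body)
    return body
```

     Order matters in such a table: the known character must be
     rewritten before the catch-all entry reduces it to a null.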

     In	the examination	phase, an attempt is made to  match  each
     of	 a  sequence of	regular	expressions against the	output of


Sun Release 4.Last change: TGCW57: 2 September,	1993		1


     the transduction phase.  A	match  indicates  a  possible  or
     definite problem, or an area requiring attention.	A warning
     is	produced for each regular expression which matches.  Mul-
     tiple   matches  of  a  given  regular  expression	 are  not
     attempted:	the warnings only state	that a particular  condi-
     tion  exists; they	do not state how often it occurs, or give
     the location at which the match was found.
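
     The warn-once behaviour can be sketched in Python as follows.
     This is an illustration only: the two table entries, the name
     examine and the exact message wording are assumptions, not
     extracts from the program.

```python
import re

# Hypothetical two-entry table; the real tables live at the end of the
# longmanconv source and are not reproduced here.
CHECKS = [
    (re.compile(r"&[gl]t;"), "Text contains &lt;/&gt;: badly constructed tags?"),
    (re.compile(r"<div>"), "Text contains un-numbered <div>s."),
]

def examine(text):
    """Return one warning for each check that matches anywhere in text."""
    warnings = []
    for pattern, message in CHECKS:
        # search() stops at the first match, so no count or position is kept
        if pattern.search(text):
            warnings.append(message)
    return warnings
```

     A text containing twenty un-numbered <div>s therefore draws
     the same single warning as a text containing one.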

     The result of the transduction phase is then directed to
     a file named by replacing the leading dot of the trailing
     pathname component of the input filename with A_.  If a file
     of	 this  name already exists, it is overwritten.	Note that
     no	line wrapping takes place: output lineation is	close  to
     that of the input file.

     The documentation phase writes the	warnings produced in  the
     examination  phase	 either	 to  standard error or,	if the -z
     option is in force, to a file named by replacing the
     leading dot of the trailing pathname component of the input
     filename with Z_.	The information	written	includes  a  word
     count.
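
     The naming rule for A_ and Z_ files can be sketched in Python
     as below.  The function name output_name is hypothetical, and
     raising an error for a name without a leading dot is an
     assumption (DIAGNOSTICS says only that badly-named input files
     are skipped).

```python
import posixpath

def output_name(input_path, prefix):
    """Replace the leading dot of the last path component with prefix."""
    directory, name = posixpath.split(input_path)
    if not name.startswith("."):
        # Assumption: such names count as badly named and are skipped
        raise ValueError("not a dot file: " + input_path)
    return posixpath.join(directory, prefix + name[1:])
```

     Only the trailing component is examined, so a dot elsewhere
     in the path is left alone.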

     emacs versions of the regular expressions applied during the
     examination  phase	 are  as  follows  (space  represented by
     underscore	(_), line-feed by control-J (^J)):

     ^@	       Text contains control/non-ASCII	characters.  Some
	       Longman	texts contain patches of digital garbage;
	       some   contain	single	  non-ASCII    characters
	       representing,  for example, currency signs.  long-
	       manconv first rewrites known non-ASCII characters,
	       such  as	 the  IBM  PC  pound  sterling sign, then
	       attempts	to reduce what	remains	 to  single  null
	       characters.  These intentionally	cause SGML errors
	       when the	syntax of the file is checked:	the  text
	       around  each such error should be checked for dam-
	       age.  If	a text contains	 many  nulls,  indicating
	       many patches of garbage,	have no	compunction about
	       bouncing	it.  If	there are only a few,  or  if  it
	       seems  that  a clean file can be	created	simply by
	       deleting the nulls or by changing them to some
	       other  character	 using a query-replace operation,
	       consider	accepting the text.  Please  let  Dominic
	       know  if	 you  suspect  that longmanconv	should be
	       rewriting some non-ASCII	character - say, that for
	       -,  but is not.  (^@ is ASCII null - use C-q C-@ in
	       emacs to	generate this character.)

     &[gl]t;   Text contains &lt;/&gt;:	badly  constructed  tags?
	       Longman seems inclined occasionally to mistype the
	       starting	 and  ending   angle-brackets	enclosing
	       mark-up.  longmanconv attempts to fix the most
	       common cases, but cannot	fix everything.	 If prob-
	       lems  remain,  a	 syntax	check will probably fail.
	       The  test  gives	 false	 positives   on	  genuine
	       occurrences of &lt;/&gt;.

     [0-9]/[-0-9]
	       Unmarked	fractions  or  #sd  amounts?  (Grrr.   No
	       British pound sign in this troff.) With the excep-
	       tion of the occasional &frac12;,	Longman	does  not
	       specifically  mark  fractions or	pounds,	shillings
	       and pence amounts.  If time is available	(which it
	       probably	 is  not),  give  consideration	 to using
	       &frac14;	etc. and  &shilling;  where  appropriate.
	       Things  like musical time signatures and	reference
	       numbers should be left alone, however.

     &bquo;&[be]quo;
	       Problems with nested quote marks? &bquo;&bquo; may
	       be  legal if a quotation	starts with a nested quo-
	       tation; &bquo;&equo; is always an error.

     [,!?],    End quote after [,!?] rendered as comma?	This is	a
	       relatively frequent scanner error.

     &bquo;[^&]+\.,
	       End quote after . rendered as comma? This is basi-
	       cally  the  same	error as the previous one, but is
	       harder to detect, as a simple search for	., yields
	       false  positives	because	of the full stops used in
	       abbreviations.  The regular expression  used  does
	       not  detect errors in quotations	containing entity
	       references, and may give	false positives	for  quo-
	       tations containing abbreviations.

     [_^J]'[^']
	       Apostrophe used as quote	mark? The regular expres-
	       sion  detects  an apostrophe at the beginning of	a
	       word.  There is no check	for apostrophes	 used  to
	       end  quotations,	as these are difficult to distin-
	       guish from legal	uses of	apostrophes.   Illegally-
	       used  apostrophes  should  be replaced with &bquo;
	       and &equo;, or with &quot;.  (See  also	next  two
	       items and NOTES.)

     [_^JxX-]\([0-9]+.\)?[0-9]+\(&equo;\|'\)
	       &ft;  or	 &ins;	rendered  as  &equo;?  Prime   or
	       double-prime following digits probably represent	a
	       unit of measurement, rather than	a  closing  quote
	       mark.  The test gives false positives where quoted
	       material	ends with a year or model number.

     [_^JxX-]\([0-9]+.\)?[0-9]+\(&quot;\|'\)
	       &ft; or &ins; rendered as &quot;? This is the same
	       as  the	previous  test,	except that it applies to
	       texts in	 which	Longman	 has  not  differentiated
	       start and end quote marks.

     ^-[^0-9]  Line-start hyphens. Line-start hyphens  are  legal
	       only  where  they  introduce  a	negative  number.
	       Other occurrences should be eliminated.  A high
	       level  of  illegal  hyphens  is	justification for
	       bouncing	a text.	 (See also next	item.)

     -_*$      Line-end	hyphens. These should be legal where con-
	       structions  such	as half- and full-board	are split
	       across lines after the first hyphen.  However,  as
	       checking	  for  such  constructions  is	far  from
	       trivial,	such splitting is considered illegal: the
	       text  should  be	amended	so that	the split appears
	       elsewhere.

     [0-9]-[0-9]
	       Hyphens instead of &ndash;? Conventionally, dashes
	       separating  digits are en dashes, but Longman sel-
	       dom, if ever, renders them as  such.   longmanconv
	       does not	attempt	to enforce the convention because
	       of the risk of mangling formulae such as A=7-2.
	       Consider	 fixing	 things	 up  using  emacs' query-
	       replace-regexp or similar.

     \.(<[^>]+>)?_l
	       Initial I rendered as l?	This is	a common  scanner
	       error, as are the next five cases.

     [a-z]L    l rendered as L?

     [a-z]P    p rendered as P?

     [0-9][oOl][0-9]
	       Internal	0, 1 rendered as letters?

     \<[oOl][0-9]
	       Initial 0, 1 rendered as	letters?

     [0-9][oOl]\>
	       Terminal	0, 1 rendered as letters?

     </head>[_^J]*<head>
	       Multiple	typeless <head>s per <div>. The	semantics
	       of  CDIF	require	that all <head>s except	the first
	       have a type attribute  with  a  value  of  sub  or
	       byline.	  Longman  is  also  prone  to	splitting
	       <head>s in two for no good reason.


     <div>     Text contains un-numbered  <div>s.  Longman  occa-
	       sionally	 uses  un-numbered  <div>s, which are not
	       legal in	 written  CDIF	texts.	 They  should  be
	       replaced	with appropriately-numbered <divn>s.

     <div0>    Text contains <div0>s: consider incrementing. Some
	       Longman	texts  use  <div0>,  the  use of which is
	       discouraged in CDIF.  (We're holding  it	 back  in
	       case  of	 some  future requirement that we haven't
	       anticipated.) If	a single <div0>	encloses a  whole
	       text,  it  can  be eliminated; otherwise, consider
	       bumping	  all	 <divn>s    up	  to	<divn+1>.
	       (C-x h C-u M-| bumpdiv does this to the contents of an
	       emacs buffer - provided that  ~natcorp/bin  is  on
	       your search path.)

     <div\(.\)>[_^J]*<div\1>
	       Spurious	<divn> tags? This is a	search	for  com-
	       pletely empty <div>s.

     <div1>\([_^J]*<head[^<]+</head>[_^J]*\)+<div1>
	       Parts, chapters	both  <div1>?  Where  a	 book  is
	       divided	into  parts,  Longman  may  mark both the
	       parts and the chapters as <div1>s.  Chapters,  and
	       all  subordinate	divisions, should be knocked down
	       a level.	 (You can use bumpdiv, as in the previous
	       case, then knock	back the few <div2>s which should
	       be <div1>s by hand.)

     <p>\(\(<pb[^>]*>\)\|[_^J]\)*\(<p\|<div[0-4]\)\W
	       Empty <p>s. The regular	expression  detects  <p>s
	       immediately  followed by	another	<p> or by a <div>
	       tag, possibly with intervening <pb> elements.

     <p>[_^J]*[a-z]
	       Spurious	<p>s? This is a	simple	check  for  para-
	       graphs  starting	 with a	lower-case letter, a cir-
	       cumstance which may be legal in some cases.
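
     For readers working outside emacs, a few of the expressions
     above might be written in Python roughly as follows.  These
     are translations of the emacs approximations, not the
     program's own perl expressions; the short labels and the name
     matched are hypothetical.

```python
import re

# Python renderings of five of the emacs expressions listed above.
# Underscore becomes a space and ^J becomes \n; \< becomes \b.
PATTERNS = [
    ("quote-comma", r"&bquo;[^&]+\.,"),
    ("measure", r"[ \nxX-]([0-9]+.)?[0-9]+(&equo;|')"),
    ("ocr-initial", r"\b[oOl][0-9]"),
    ("empty-div", r"<div(.)>[ \n]*<div\1>"),
    # the final \W stops <pb ...> from being mistaken for <p>
    ("empty-p", r"<p>(?:<pb[^>]*>|[ \n])*(?:<p|<div[0-4])\W"),
]

def matched(text):
    """Return the labels of the patterns that match somewhere in text."""
    return [label for label, pattern in PATTERNS if re.search(pattern, text)]
```

     The documented false positives carry over: a quotation ending
     in a year, such as &bquo;built in 1066&equo;, still triggers
     the measurement test.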

     A number of warnings such as Check	date  have  been  omitted
     from  the	list.	These  suggest actions concerned with the
     clean-up of information in	the dummy header.
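
     Two of the manual fix-ups suggested in the list above, raising
     <divn> levels and replacing hyphens between digits, could be
     scripted along these lines.  This is a sketch under
     assumptions: the real bumpdiv script is not shown in this
     page, the handling of </divn> end tags is a guess, and all
     function names are hypothetical.

```python
import re

def bump_divs(text):
    """Renumber every <divN> (and any </divN>) tag to N+1 in one pass."""
    def bump(match):
        return match.group(1) + str(int(match.group(2)) + 1)
    return re.sub(r"(</?div)([0-9])(?=[\s>])", bump, text)

# Lookarounds leave the digits unconsumed, so 1-2-3 yields two candidates
DIGIT_HYPHEN = re.compile(r"(?<=[0-9])-(?=[0-9])")

def ndash_candidates(text):
    """Return the offset of each hyphen that sits between two digits."""
    return [m.start() for m in DIGIT_HYPHEN.finditer(text)]

def replace_hyphens(text, approved):
    """Turn only the hyphens at the approved offsets into &ndash;."""
    approved = set(approved)
    return DIGIT_HYPHEN.sub(
        lambda m: "&ndash;" if m.start() in approved else m.group(0), text)
```

     Keeping candidate listing separate from replacement mirrors a
     query-replace: each hyphen is approved individually, so a
     formula such as A=7-2 can be left untouched.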

OPTIONS
     -c	  Write	information in the format found	 in  OUP  classif
	  files	to standard output.  See NOTES below.

     -v	  The -v option	causes each file name to  be  printed  on
	  standard  error  immediately	before	the  file is pro-
	  cessed.

     -z	  The -z option results in the documentation (Z_) file
	  corresponding	 to  each  input file being created.  The
	  first	line gives the text name; the  second  is  blank;
	  and  the  third  gives  the date, the	login name of the
	  user running longmanconv, and	the  command  line  used.
	  The fourth and subsequent lines carry	the warnings pro-
	  duced	by the examination phase.

	  If the file exists, it is overwritten.

	  In the absence of the	-z  option,  warnings  about  the
	  contents of the file are sent	to standard error.
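
     A script reading a Z_ file back might split it according to
     the layout just described.  The Python sketch below is
     illustrative; the name parse_z_file and the dictionary keys
     are hypothetical, and the third line is kept whole because
     its internal field layout is not specified here.

```python
def parse_z_file(lines):
    """Split the lines of a Z_ file into its documented parts."""
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) < 3 or lines[1] != "":
        raise ValueError("not in the documented Z_ layout")
    return {
        "text_name": lines[0],
        # Date, login name and command line; the layout within this
        # line is not documented here, so it is kept unparsed.
        "run_info": lines[2],
        "warnings": lines[3:],
    }
```

     It accepts any iterable of lines, so an open file object can
     be passed directly.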

DIAGNOSTICS
     Various  error  conditions	 associated  with   non-existent,
     badly-named,  or  unreadable  input  files, with input files
     which are not valid Longman transfer format  documents,  and
     with  output files	which cannot be	created, result	in ``file
     skipped'' messages	on standard error.

     Under the -z option, a warning is given on standard error
     if	 the documentation (Z_)	file cannot be created.	 The main
     output file is still written under	these circumstances.

     More serious error	conditions result in  immediate	 termina-
     tion with a diagnostic message.

     The return	status of the program is zero  only  if	 no  file
     warning  or  error	 conditions  were  encountered.	 Warnings
     about file	contents generated during the  examination  phase
     do	not affect output status.

FILES
     /home/natcorp/bin/longmanconv
	  The program itself.  It contains both	this manual  page
	  (use	nroff -man /home/natcorp/bin/longmanconv or simi-
	  lar to print it), and	the tables  used  to  define  the
	  various transductions	and tests applied to the text.

AUTHOR
     Dominic Dunlop

SEE ALSO
     perl(1), TGCW03: CPH Appendix A; TGCW25: Markup for  non-ISO
     646  invariant  part  characters;	TGCW30:	 Corpus	 Document
     Interchange Format, v1.2; TGCW35:	Corpus	text  processing:
     directory	structure  and filenames; TGCW50 overnight - per-
     form overnight housekeeping for corpus files; TGCW55 oup2bnc
     -	update	BNC  database  with  information derived from OUP
     database; TGCW56: Guide to	Longman/Lancaster header codes.

NOTES
     Under the -c option, OUP-style classification information is
     sent  to  standard	 output.  In order that	this is	processed
     automatically  by	the  overnight	program,   classification
     information  for  all  files  in a	given directory	should be
     redirected	to a file named	classif	in  the	 directory  named
     for  the date on which the	files were received from Longman.
     This is generally the parent directory of the directory con-
     taining  the  files.   Thus, the second command line example
     shown under DESCRIPTION above does the trick.

     There are two styles of quote mark	usage in  Longman  files:
     in	 a given text, start and end quote may be differentiated,
     or	they may not.  In the first case, the output of	 longman-
     conv will contain both &bquo; and &equo;; in the second, all
     quote marks should	appear as &quot;.

BUGS
     The transductions and subsequent examinations applied to the
     input  may	 not  be correct in every case,	nor do they cover
     every possibility.	 Let Dominic know if you encounter a cir-
     cumstance applying	to more	than one input file which is han-
     dled incorrectly, or not handled at all.

     It would be nice if the regular expressions used in the
     examination phase were easily available for use in editors,
     in order that they could subsequently be used to locate the
     source of a problem.  Sadly, they are perl-format regular
     expressions, which differ subtly but annoyingly from emacs-
     and vi-format regular expressions, making manual conversion
     necessary.  The emacs versions shown above under DESCRIPTION
     are untested approximations.
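
     Until then, a small script can stand in for the editor when
     hunting for the source of a warning.  The Python sketch below
     reports the line number of every matching line; the name
     locate is hypothetical, and Python's regular expression
     dialect is close to perl's but not identical, so expressions
     may still need adjusting.

```python
import re

def locate(pattern, text):
    """Return (line_number, line) for every line containing a match."""
    regex = re.compile(pattern)
    return [(number, line)
            for number, line in enumerate(text.splitlines(), 1)
            if regex.search(line)]
```

     Because the search is made line by line, it suits tests such
     as the line-start and line-end hyphen checks, but not
     expressions that span a line-feed.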

