OUPCONV(1l)       MISC. REFERENCE MANUAL PAGES        OUPCONV(1l)


NAME
     oupconv - convert OUP transfer format to approximate CDIF

SYNOPSIS
     oupconv [-vz] _O_U_P-_f_o_r_m_a_t__f_i_l_e...

DESCRIPTION
     oupconv takes as its input  one  or  more  OUP-format  (dot)
     files  and  produces  corresponding  A_  files.  Optionally,
     corresponding documentation (Z_) files may also be produced.
     To  convert all the dot files in the current directory, pro-
     ducing both A_ and Z_ files, use the command line

          oupconv -vz .??????

     The program operates on each input file  in  four  phases  -
     transduction,  examination,  reformatting and documentation.
     The transduction and examination phases  are  controlled  by
     tables appearing at the end of the program source file - see
     FILES below.

     In  the  transduction  phase,  the  OUP  file  prologue   is
     transformed  into a CDIF prologue, including a dummy header.
     A sequence of regular expression-based substitutions is then
     made on the remainder of the file.  These cover

     -    differences between OUP transfer format, as defined  in
          TGCW33, and CDIF;

     -    differences between OUP data capture format, as defined
          in TGCW04 and TGCW52, and CDIF (which should be handled
          by OUP prior to transfer, but which appear from time to
          time in received files);

     -    a number of heuristic fixes for commonly encountered
          errors in OUP transfer format files; and

     -    the conversion of <sic>errer</sic> to <reg sic=errer
          ed=OUCS>errer</reg> in order that corrected forms of
          spelling errors identified by OUP may be added.
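
     The last substitution in the list above can be sketched  as
     follows.   This  is an illustration in Python, not the perl
     used by oupconv; the function name sic_to_reg and the  pat-
     tern  are  assumptions,  and the sketch assumes that <sic>
     elements contain no nested markup.

```python
import re

# Assumed tag shape: <sic>word</sic> with no nested markup inside.
SIC = re.compile(r"<sic>([^<]*)</sic>")


def sic_to_reg(text, editor="OUCS"):
    """Rewrite <sic>X</sic> as <reg sic=X ed=EDITOR>X</reg>.

    The content is kept unchanged so that a corrected form can be
    substituted by hand later.
    """
    def replace(match):
        word = match.group(1)
        return f"<reg sic={word} ed={editor}>{word}</reg>"

    return SIC.sub(replace, text)
```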

     In the examination phase, an attempt is made to  match  each
     of  a  sequence of regular expressions against the output of
     the transduction phase.  A match  indicates  a  possible  or
     definite problem, or an area requiring attention.  A warning
     is produced for each regular expression which matches.  Mul-
     tiple   matches  of  a  given  regular  expression  are  not
     attempted: the warnings only state that a particular  condi-
     tion  exists; they do not state how often it occurs, or give
     the location at which the match was found.
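
     The one-warning-per-expression behaviour described above can
     be  sketched  as  a  table  scan.   The table CHECKS and the
     function examine are hypothetical names for illustration; the
     real  tables  are  those  in  the  program  source, and this
     Python sketch is not the program's own code.

```python
import re

# Hypothetical two-entry table of (pattern, warning) pairs.
CHECKS = [
    (re.compile(r"[0-9]-[0-9]"), "Hyphens instead of &ndash;?"),
    (re.compile(r"''"), "Double apostrophe used as quote mark?"),
]


def examine(text, checks=CHECKS):
    """Return one warning for each pattern that matches anywhere.

    search() stops at the first match, so neither the number of
    matches nor their location is reported.
    """
    return [warning for pattern, warning in checks if pattern.search(text)]
```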


Sun Release 4.1       Last change: TGCW54: 10 August, 1993


     The reformatting phase directs the output of  the  transduc-
     tion phase through a line-wrapping program which attempts to
     add line-feeds to its input so as to  limit  the  length  of
     output lines to 74 characters while not breaking tags across
     lines.  The output of the wrapping program is directed to  a
     file  named  by substituting the leading dot of the trailing
     pathname component of the input filename with A_.  If a file
     of this name already exists, it is overwritten.
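
     A minimal sketch of the reformatting phase follows.  It is an
     approximation in Python, not the wrapping  program  actually
     used:  the  names  output_name  and wrap are assumptions, the
     sketch discards the original line breaks, and a single  chunk
     longer than the limit is left on a line of its own.

```python
import os
import re

# A chunk is a run of non-space characters in which a complete tag
# (including any spaces inside it) counts as unbreakable.
CHUNK = re.compile(r"(?:<[^>]*>|\S)+")


def output_name(path, prefix="A_"):
    """Replace the leading dot of the last path component with prefix."""
    head, tail = os.path.split(path)
    if not tail.startswith("."):
        raise ValueError(f"not a dot file: {path}")
    return os.path.join(head, prefix + tail[1:])


def wrap(text, width=74):
    """Greedily fill lines up to width without splitting a tag."""
    lines, current = [], ""
    for chunk in CHUNK.findall(text):
        if current and len(current) + 1 + len(chunk) > width:
            lines.append(current)
            current = chunk
        else:
            current = f"{current} {chunk}" if current else chunk
    if current:
        lines.append(current)
    return "\n".join(lines)
```

     Passing prefix="Z_" to output_name gives the name used for the
     documentation file.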

     The documentation phase writes the warnings produced in  the
     examination  phase  either  to  standard error or, if the -z
     option is in force, to a  file  named  by  substituting  the
     leading  dot of the trailing pathname component of the input
     filename with Z_.  The information written includes  a  word
     count.

     emacs versions of the regular expressions applied during the
     examination  phase  are  as  follows  (space  represented by
     underscore (_), line-feed by control-J (^J)):

     &amp;[a-z]
               _N_o _s_p_a_c_e _a_f_t_e_r &_a_m_p;.   _B_a_d_l_y  _r_e_n_d_e_r_e_d  _e_n_t_i_t_i_e_s?
               Some  OUP texts suffer from the leading ampersands
               of entity references having themselves been turned
               into  entity  references.   oupconv fixes this for
               entity references representing  fractions;  others
               must be fixed by hand.  Test gives false positives
               on such constructions as r&amp;r.

     [0-9]/[-0-9]
               _U_n_m_a_r_k_e_d _f_r_a_c_t_i_o_n_s  _o_r  #_s_d  _a_m_o_u_n_t_s?  (Grrr.   No
               British pound sign in this troff.) Give considera-
               tion to using &frac12; etc. and  &shilling;  where
               appropriate.   Things like musical time signatures
               and reference numbers should be left  alone,  how-
               ever.

     <del desc=formula>
               _C_h_e_c_k   _u_s_a_g_e   _o_f   <_d_e_l   _d_e_s_c=_f_o_r_m_u_l_a>;.   <del
               desc=formula>  is a cop-out for anything formulaic
               and reckoned to be too complicated to represent in
               CDIF.   In  particular, it replaces anything which
               OUP has marked up as a fraction,  and  which  does
               not  have  a specific entity reference of its own.
               (&frac12;, &frac78; etc.).   OUP  sometimes  marks
               such  things  as  time  signatures  and  reference
               numbers as fractions: where <del desc=formula>  is
               used unexpectedly, check with the source text.

     &bquo;&[be]quo;
               _P_r_o_b_l_e_m_s _w_i_t_h _n_e_s_t_e_d _q_u_o_t_e _m_a_r_k_s? &bquo;bquo;  may
               be  legal  if  a  quotation  starts  with a nested


Sun Release 4.1Last change: TGCW54: 10 August, 1993             2


OUPCONV(1l)       MISC. REFERENCE MANUAL PAGES        OUPCONV(1l)


               quotation; &bquo;&equo; is always an error.

     [,!?],    End quote after [,!?] rendered as comma? This is a
               relatively frequent scanner error.

     &bquo;[^&]+\.,
               _E_n_d _q_u_o_t_e _a_f_t_e_r . _r_e_n_d_e_r_e_d _a_s _c_o_m_m_a? This is basi-
               cally  the  same error as the previous one, but is
               harder to detect, as a simple search for ., yields
               false  positives because of the full stops used in
               abbreviations.  The regular expression  used  does
               not  detect errors in quotations containing entity
               references, and may give false positives for  quo-
               tations containing abbreviations.

     [_^J]'[^']
               _A_p_o_s_t_r_o_p_h_e _u_s_e_d _a_s _q_u_o_t_e _m_a_r_k? The regular expres-
               sion  detects  an apostrophe at the beginning of a
               word.  There is no check for apostrophes  used  to
               end  quotations, as these are difficult to distin-
               guish from legal uses of apostrophes.   Illegally-
               used  apostrophes  should  be replaced with &bquo;
               and &equo;.  (See also next item.)

     ''        Double apostrophe used as quote mark?

     ^-[^0-9]  Line-start hyphens. (These are legal if followed
               by  digits  -  the hyphen presumably being a minus
               sign.)

     [_^JxX-]\([0-9]+.\)?[0-9]+\(&equo;\|'\)
               &_f_t;  _o_r  &_i_n_s;  _r_e_n_d_e_r_e_d  _a_s  &_e_q_u_o;?  Prime   or
               double-prime following digits probably represent a
               unit of measurement, rather than a  closing  quote
               mark.    If   measurement   occurs  within  quoted
               material, check following  &bquo;  and  &equo;  as
               quote normalization algorithm may have become con-
               fused.   The  test  gives  false  positives  where
               quoted material ends with a year or model number.

     &frac..;\(&equo;\|'\)
               &_f_t; _o_r &_i_n_s; _a_f_t_e_r _f_r_a_c_t_i_o_n _r_e_n_d_e_r_e_d  _a_s  &_e_q_u_o;?
               Same as previous case.

     -_*$      Line-end hyphens. These should be legal where con-
               structions such as half- and full-board are split
               across lines after the first hyphen.  However,  as
               checking   for  such  constructions  is  far  from
               trivial, such splitting is considered illegal: the
               text  should  be amended so that the split appears
               elsewhere.


     [0-9]-[0-9]
               _H_y_p_h_e_n_s _i_n_s_t_e_a_d _o_f &_n_d_a_s_h;? Conventionally, dashes
               separating  digits  are  en  dashes.  Conceivably,
               this warning could also be triggered  by  formulae
               such as A=7-2\.

     _-_       Hyphens instead of &mdash;? It can happen that
               em-dashes  are  rendered  as  hyphens without sur-
               rounding white space.  This circumstance can  gen-
               erally  only  be  detected by close proof-reading,
               and so is unlikely to come to light.

     \.(<[^>]+>)?_l
               _I_n_i_t_i_a_l _I _r_e_n_d_e_r_e_d _a_s _l? This is a common  scanner
               error, as are the next six cases.

     [a-z]L    _l _r_e_n_d_e_r_e_d _a_s _L?

     [a-z]P    _p _r_e_n_d_e_r_e_d _a_s _P?

     [0-9][oOl][0-9]
               _I_n_t_e_r_n_a_l _0, _1 _r_e_n_d_e_r_e_d _a_s _l_e_t_t_e_r_s?

     \<[oOl][0-9]
               _I_n_i_t_i_a_l _0, _1 _r_e_n_d_e_r_e_d _a_s _l_e_t_t_e_r_s?

     [0-9][oOl]\>
               _T_e_r_m_i_n_a_l _0, _1 _r_e_n_d_e_r_e_d _a_s _l_e_t_t_e_r_s?

     <hi_+r=hi>[Oo0]</hi>
               8o9 _r_e_n_d_e_r_e_d _a_s _s_u_p_e_r_s_c_r_i_p_t _0?

     \.[_^J]*\.
               _R_u_n_s  _o_f  _f_u_l_l  _s_t_o_p_s.  This  generally  indicates
               ellipses  which  have  not been converted to &hel-
               lip;.

     [^.][_^J]\.[_^J][^.]
               _S_p_u_r_i_o_u_s _f_u_l_l _s_t_o_p_s? The search is for a full stop
               with  space  on either side, and which is not part
               of a run of full stops.  Generally, the problem is
               the interpretation by a scanner of a spurious mark
               on a page.  Ideally, one would like to search  for
               spurious full stops not surrounded by space.  How-
               ever, such a search gives many false positives  on
               abbreviations.

     </head>[_^J]*<head>
               _M_u_l_t_i_p_l_e _t_y_p_e_l_e_s_s <_h_e_a_d>_s _p_e_r <_d_i_v>. The semantics
               of  CDIF require that all <head>s except the first
               have a type attribute  with  a  value  of  sub  or
               byline.


     <div\(.\)>[_^J]*<div\1>
               _S_p_u_r_i_o_u_s <_d_i_v_n> _t_a_g_s? This is a  search  for  com-
               pletely empty <div>s.

     <div1>\([_^J]*<head[^<]+</head>[_^J]*\)+<div1>
               _P_a_r_t_s, _c_h_a_p_t_e_r_s  _b_o_t_h  <_d_i_v_1>?  Where  a  book  is
               divided  into  parts, OUP almost always marks both
               the parts and the chapters as <div1>s.   Chapters,
               and  all  subordinate divisions, should be knocked
               down a level.

     <table>   <_t_a_b_l_e> _t_a_g_s _p_r_e_s_e_n_t. OUP uses this tag  occasion-
               ally  to  represent tabular material, which may or
               may  not  have  been  deleted.   In  general,  the
               <table>  element, and its contents (if any) should
               be replaced with <del desc=table  ed=OUCS>  unless
               it is exceptionally easy to transform the contents
               into a <list>.

     <list>[^<]*<p>
               _L_i_s_t_s  _m_a_y  _n_e_e_d  <_h_e_a_d>_s.  A  spurious  paragraph
               between  a  <list>  tag  and  the  first <item> or
               <label> often turns out to be a list heading.   If
               the  heading is a sequence of column headings, the
               empty <lb> element may be used as a (rather  unsa-
               tisfactory)  way  of  separating  them  inside the
               <head>.

     <chp>     <_c_h_p> _t_a_g_s _r_e_m_a_i_n. The  translation  phase  should
               handle  these  unless they appear in an unexpected
               context, in which case this warning is  generated.
               The tags should be deleted if they are followed by
               a <div> tag, or replaced with a <div> tag if  they
               are not.

     <p>\(\(<pb[^>]*>\)\|[_^J]\)*\(<p\|<div[0-4]\)\W
               _E_m_p_t_y <_p>_s. The regular  expression  detects  <p>s
               immediately  followed by another <p> or by a <div>
               tag, possibly with intervening <pb> elements.

     <p>[_^J]*[a-z]
               _S_p_u_r_i_o_u_s <_p>_s? This is a simple  check  for  para-
               graphs  starting  with a lower-case letter, a cir-
               cumstance which may be legal in some cases.

     </quote>[_^J]*[^<]
               <_p>_s _m_i_s_s_i_n_g _a_f_t_e_r </_q_u_o_t_e>? <quote>  is  used  in
               CDIF  only  for  block quotations.  As such, a new
               paragraph  may  generally  be  expected  to  start
               immediately a <quote> has finished.  This need not
               always be true, however, and must be checked.


     <[^>]+_r=\w\w\w
               _U_n_r_e_c_o_g_n_i_z_e_d _r_e_n_d_i_t_i_o_n (_r=)  _v_a_l_u_e_s  _p_r_e_s_e_n_t.  The
               test  fails  to  catch  rendition  values with bad
               two-character names.

     -<reg\W   <reg>s preceded by hyphens present. As a point of
               style,   <reg   sic=check-oyt>check-out</reg>   is
               preferable to check-<reg sic=oyt>out</reg> as,  in
               the  latter,  the  content of the <reg> element is
               something that would generally be  regarded  as  a
               whole  word,  whereas,  in  the former, it is not.
               However, fix-ups of this type  are  generally  too
               time-consuming to be worthwhile.

     <note_+type=ed
               _C_h_e_c_k _e_d_i_t_o_r_i_a_l _n_o_t_e_s. Most  OUP  editorial  notes
               are  eliminated,  or  transformed  into <del> ele-
               ments, as part of the transduction process.  Those
               which escape this net should be examined to see if
               they too should be replaced with <del>s.

     <reg\W    _C_h_e_c_k <_r_e_g>_s. This warning appears  for  any  text
               which,  after  transduction,  contains  <reg> ele-
               ments.  Some of these may be spurious, in that the
               enclosed word represents a valid British spelling;
               and  some  may  be  candidates  for  rewriting  as
               <sic>s,  where,  for  example, some intentionally-
               used variant form of a word appears in the  origi-
               nal text.

     A number of warnings such as ``Check date'' have been omitted
     from the list - see CAVEATS below.

OPTIONS
     -v   The -v option causes each file name to  be  printed  on
          standard  error  immediately  before  the  file is pro-
          cessed.

     -z   The -z option results in the  documentation  (Z_)  file
          corresponding  to  each  input file being created.  The
          first line gives the text name; the  second  is  blank;
          and  the  third  gives  the date, the login name of the
          user running oupconv, and the command line  used.   The
          fourth and subsequent lines carry the warnings produced
          by the examination phase.

          If the file exists, it is overwritten.

          In the absence of the -z  option,  warnings  about  the
          contents of the file are sent to standard error.
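
     The line layout of the Z_ file described above can be sketched
     as  follows.   This  is  a  Python illustration only: the name
     documentation_lines, the ISO date format, and the use of  sin-
     gle  spaces  to separate the fields of the third line are all
     assumptions, since this page does not give the exact format.

```python
import datetime


def documentation_lines(text_name, login, command_line, warnings, today=None):
    """Build the lines of a Z_ file in the order described above.

    Line 1: text name.  Line 2: blank.  Line 3: date, login name and
    command line (field format assumed).  Line 4 onwards: warnings.
    """
    today = today or datetime.date.today()
    header = f"{today.isoformat()} {login} {command_line}"
    return [text_name, "", header, *warnings]
```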


DIAGNOSTICS
     Various  error  conditions  associated  with   non-existent,
     badly-named,  or  unreadable  input  files, with input files
     which are not valid OUP transfer format documents, and  with
     output  files  which  cannot  be  created,  result in ``file
     skipped'' messages on standard error.

     Under the -z option, a warning is given on standard error if
     the  documentation  (Z_)  file  cannot be created.  The main
     output file is still written under these circumstances.

     More serious error conditions result in  immediate  termina-
     tion with a diagnostic message.

     The return status of the program is zero only if no file
     warning or error conditions were encountered.  Warnings
     about file contents generated during the examination phase
     do not affect the return status.

CAVEATS
     The program guesses that a file represents a book if its BNC
     name is identical to its OUP name.  If the names differ only
     in their last character, the file is presumed to represent a
     periodical  or other non-book material.  If the names differ
     in more characters than  the  last,  the  program  does  not
     express  an opinion on the type of the source material.  The
     guesses determine the contents of the dummy CDIF header, the
     values  of  the  attributes of the <text> tag, and whether a
     word count greater  than  40,000  elicits  a  comment.   The
     guesses are not always correct, and should be checked.
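
     The name comparison described above amounts to  the  following
     rule.   The  function  name guess_kind and its return values
     are hypothetical; this is a Python sketch of the rule as stated
     here, not the program's own code.

```python
def guess_kind(bnc_name, oup_name):
    """Guess the kind of source material from the two file names.

    Returns "book" if the names are identical, "non-book" if they
    differ only in their last character, and None (no opinion) if
    they differ anywhere else.
    """
    if bnc_name == oup_name:
        return "book"
    same_length = len(bnc_name) == len(oup_name)
    if same_length and bnc_name[:-1] == oup_name[:-1]:
        return "non-book"
    return None
```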

     The program is unable to pick issue dates  or  author  names
     out  of  OUP prologues because of the inconsistent manner in
     which this information is presented.

FILES
     /home/natcorp/bin/oupconv
          The program itself.  It contains both this manual  page
          (use nroff -man /home/natcorp/bin/oupconv or similar to
          print it), and the tables used to  define  the  various
          transductions and tests applied to the text.

AUTHOR
     Dominic Dunlop

SEE ALSO
     perl(1), TGCW04: Encoding and markup for the Oxford Pilot
     Corpus; TGCW25: Markup for non-ISO 646 invariant part
     characters; TGCW30: Corpus Document Interchange Format, v1.2;
     TGCW33: BNC data capture: OUP format definition for text
     handover to OUCS; TGCW35: Corpus text processing: directory
     structure and filenames.


BUGS
     The transductions and subsequent examinations applied to the
     input  may  not  be correct in every case, nor do they cover
     every possibility.  Let Dominic know if you encounter a cir-
     cumstance applying to more than one input file which is han-
     dled incorrectly, or not handled at all.

     It would be nice if the regular expressions used in the
     examination phase were easily available for use in editors,
     in order that they could subsequently be used to locate the
     source of a problem.  Sadly, they are perl-format regular
     expressions, which differ subtly but annoyingly from emacs-
     and vi-format regular expressions, making manual conversion
     necessary.  The emacs versions shown above under DESCRIPTION
     are untested approximations.
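
     As an illustration of the differences involved, the following
     Python sketch translates the notation used under  DESCRIPTION
     (underscore  for  space,  ^J  for line-feed, emacs-style \(,
     \), \|, \< and \>) into  Python  re  syntax.   The  function
     emacs_to_python  is  a  hypothetical helper, it handles only
     the constructs that appear on this page, and  it  would  mis-
     read a character class that genuinely begins with ^J.

```python
import re


def emacs_to_python(pattern):
    """Translate this page's emacs-style notation to Python re syntax."""
    out = []
    in_class = False  # True while inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
        if ch == "^" and nxt == "J":      # ^J stands for line-feed
            out.append("\\n")
            i += 2
        elif ch == "_":                   # underscore stands for space
            out.append(" ")
            i += 1
        elif in_class:                    # copy class members unchanged
            out.append(ch)
            in_class = ch != "]"
            i += 1
        elif ch == "[":
            out.append(ch)
            in_class = True
            i += 1
        elif ch == "\\" and nxt in {"(", ")", "|"}:
            out.append(nxt)               # emacs \( \) \| are bare in Python
            i += 2
        elif ch == "\\" and nxt in {"<", ">"}:
            out.append("\\b")             # word start/end become \b
            i += 2
        elif ch == "\\":
            out.append(ch + nxt)          # \. \w \W \1 pass through
            i += 2
        elif ch in "()|{}":
            out.append("\\" + ch)         # literal in emacs, special in Python
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    runs = re.compile(emacs_to_python(r"\.[_^J]*\."))
    print(bool(runs.search("wait. . . now")))
```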

     If confronted with very large individual files (several hun-
     dred  thousand  words), the program may grow very large, and
     ultimately run out of memory.  In such cases, it  will  fail
     with  a message stating that it is out of memory, or that it
     is unable to fork.  If the failure occurs after oupconv  has
     processed  several  files, it may be possible to overcome it
     by processing the offending file on its own; if this  fails,
     the  only  solution  is  to  break the file into a number of
     files with  valid  OUP  prologues  by  hand,  process  these
     separately, then reconstitute.  oupconv could circumvent the
     problem by breaking up the file itself, but the  problem  is
     that there is no safe place to break.  For example, breaking
     ahead of <div1>s may prevent  spurious  <div1>s  from  being
     detected.  So the work is left to a human.

     Even in the normal course of events,  oupconv  grows  rather
     large:  it seems that perl's regular expression evaluator is
     memory-hungry.

