OUPCONV(1l)          MISC. REFERENCE MANUAL PAGES          OUPCONV(1l)

NAME
     oupconv - convert OUP transfer format to approximate CDIF

SYNOPSIS
     oupconv [-vz] OUP-format_file...

DESCRIPTION
     oupconv takes as its input one or more OUP-format (dot) files and
     produces corresponding A_ files. Optionally, corresponding
     documentation (Z_) files may also be produced.

     To convert all the dot files in the current directory, producing
     both A_ and Z_ files, use the command line

          oupconv -vz .??????

     The program operates on each input file in four phases -
     transduction, examination, reformatting and documentation. The
     transduction and examination phases are controlled by tables
     appearing at the end of the program source file - see FILES
     below.

     In the transduction phase, the OUP file prologue is transformed
     into a CDIF prologue, including a dummy header. A sequence of
     regular expression-based substitutions is then made on the
     remainder of the file. These cover

     -  differences between OUP transfer format, as defined in TGCW33,
        and CDIF;

     -  differences between OUP data capture format, as defined in
        TGCW04 and TGCW52, and CDIF (which should be handled by OUP
        prior to transfer, but which appear from time to time in
        received files);

     -  a number of heuristic fixes for commonly-encountered errors in
        OUP transfer format files; and

     -  the conversion of errer to errer in order that corrected forms
        of spelling errors identified by OUP may be added.

     In the examination phase, an attempt is made to match each of a
     sequence of regular expressions against the output of the
     transduction phase. A match indicates a possible or definite
     problem, or an area requiring attention. A warning is produced
     for each regular expression which matches. Multiple matches of a
     given regular expression are not attempted: the warnings only
     state that a particular condition exists; they do not state how
     often it occurs, or give the location at which the match was
     found.
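The examination phase described above can be sketched roughly as follows. This is a minimal illustration in Python, not the program's own perl code: the CHECKS table is a hypothetical four-entry subset of the patterns listed under DESCRIPTION, and the function name examine is an assumption made for the example.

```python
import re

# Hypothetical subset of the examination table. The real tables live at
# the end of the oupconv source and are perl regular expressions; these
# four patterns happen to read the same in Python.
CHECKS = [
    (r"[0-9]/[-0-9]", "Unmarked fractions or #sd amounts?"),
    (r"&bquo;&[be]quo;", "Problems with nested quote marks?"),
    (r"''", "Double apostrophe used as quote mark?"),
    (r"[0-9]-[0-9]", "Hyphens instead of &ndash;?"),
]


def examine(text, checks=CHECKS):
    """Return one warning for each pattern that matches anywhere in text.

    Each pattern is tried once only, so a warning says that a condition
    exists, not how often it occurs or where.
    """
    warnings = []
    for pattern, message in checks:
        if re.search(pattern, text):
            warnings.append(message)
    return warnings


if __name__ == "__main__":
    for warning in examine("He paid 3/6 for it in 1914-18 and 1939-45."):
        print(warning)
```

Because re.search stops at the first hit, a text with two digit-hyphen-digit sequences still yields a single warning, which mirrors the behaviour the manual page describes.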
Sun Release 4.1      Last change: TGCW54: 10 August, 1993

     The reformatting phase directs the output of the transduction
     phase through a line-wrapping program which attempts to add
     line-feeds to its input so as to limit the length of output
     lines to 74 characters while not breaking tags across lines. The
     output of the wrapping program is directed to a file named by
     replacing the leading dot of the trailing pathname component of
     the input filename with A_. If a file of this name already
     exists, it is overwritten.

     The documentation phase writes the warnings produced in the
     examination phase either to standard error or, if the -z option
     is in force, to a file named by replacing the leading dot of the
     trailing pathname component of the input filename with Z_. The
     information written includes a word count.

     emacs versions of the regular expressions applied during the
     examination phase are as follows (space represented by
     underscore (_), line-feed by control-J (^J)):

     &amp;[a-z]
          No space after &amp;. Badly rendered entities? Some OUP
          texts suffer from the leading ampersands of entity
          references having themselves been turned into entity
          references. oupconv fixes this for entity references
          representing fractions; others must be fixed by hand. Test
          gives false positives on such constructions as r&amp;r.

     [0-9]/[-0-9]
          Unmarked fractions or #sd amounts? (Grrr. No British pound
          sign in this troff.) Give consideration to using &frac12;
          etc. and &shilling; where appropriate. Things like musical
          time signatures and reference numbers should be left alone,
          however.

          Check usage of <del desc=formula>. <del desc=formula> is a
          cop-out for anything formulaic and reckoned to be too
          complicated to represent in CDIF. In particular, it
          replaces anything which OUP has marked up as a fraction,
          and which does not have a specific entity reference of its
          own (&frac12;, &frac78; etc.).
          OUP sometimes marks such things as time signatures and
          reference numbers as fractions: where <del desc=formula> is
          used unexpectedly, check with the source text.

     &bquo;&[be]quo;
          Problems with nested quote marks? &bquo;&bquo; may be legal
          if a quotation starts with a nested quotation; &bquo;&equo;
          is always an error.

     [,!?],
          End quote after [,!?] rendered as comma? This is a
          relatively frequent scanner error.

     &bquo;[^&]+\.,
          End quote after . rendered as comma? This is basically the
          same error as the previous one, but is harder to detect, as
          a simple search for ., yields false positives because of
          the full stops used in abbreviations. The regular
          expression used does not detect errors in quotations
          containing entity references, and may give false positives
          for quotations containing abbreviations.

     [_^J]'[^']
          Apostrophe used as quote mark? The regular expression
          detects an apostrophe at the beginning of a word. There is
          no check for apostrophes used to end quotations, as these
          are difficult to distinguish from legal uses of
          apostrophes. Illegally-used apostrophes should be replaced
          with &bquo; and &equo;. (See also next item.)

     ''
          Double apostrophe used as quote mark?

     ^-[^0-9]
          Line-start hyphens. (These are legal if followed by digits
          - the hyphen presumably being a minus sign.)

     [_^JxX-]\([0-9]+.\)?[0-9]+\(&equo;\|'\)
          &ft; or &ins; rendered as &equo;? A prime or double-prime
          following digits probably represents a unit of measurement,
          rather than a closing quote mark. If the measurement occurs
          within quoted material, check the following &bquo; and
          &equo;, as the quote normalization algorithm may have
          become confused. The test gives false positives where
          quoted material ends with a year or model number.
     &frac..;\(&equo;\|'\)
          &ft; or &ins; after fraction rendered as &equo;? Same as
          previous case.

     -_*$
          Line-end hyphens. These should be legal where constructions
          such as half- and full-board are split across lines after
          the first hyphen. However, as checking for such
          constructions is far from trivial, such splitting is
          considered illegal: the text should be amended so that the
          split appears elsewhere.

     [0-9]-[0-9]
          Hyphens instead of &ndash;? Conventionally, dashes
          separating digits are en dashes. Conceivably, this warning
          could also be triggered by formulae such as A=7-2.

     _-_
          Hyphens instead of &mdash;? It can happen that em-dashes
          are rendered as hyphens without surrounding white space.
          This circumstance can generally only be detected by close
          proof-reading, and so is unlikely to come to light.

     \.(<[^>]+>)?_l
          Initial I rendered as l? This is a common scanner error, as
          are the next six cases.

     [a-z]L
          l rendered as L?

     [a-z]P
          p rendered as P?

     [0-9][oOl][0-9]
          Internal 0, 1 rendered as letters?

     \<[oOl][0-9]
          Initial 0, 1 rendered as letters?

     [0-9][oOl]\>
          Terminal 0, 1 rendered as letters?

     [Oo0]
          8o9 rendered as superscript 0?

     \.[_^J]*\.
          Runs of full stops. This generally indicates ellipses which
          have not been converted to &hellip;.

     [^.][_^J]\.[_^J][^.]
          Spurious full stops? The search is for a full stop with
          space on either side, and which is not part of a run of
          full stops. Generally, the problem is the interpretation by
          a scanner of a spurious mark on a page. Ideally, one would
          like to search for spurious full stops not surrounded by
          space.
          However, such a search gives many false positives on
          abbreviations.

     [_^J]*
          Multiple typeless <head>s per <div>. The semantics of CDIF
          require that all <head>s except the first have a type
          attribute with a value of sub or byline.

     [_^J]*
          Spurious <divn> tags? This is a search for completely empty
          <divn>s.

     \([_^J]*[_^J]*\)+
          Parts, chapters both <div1>? Where a book is divided into
          parts, OUP almost always marks both the parts and the
          chapters as <div1>s. Chapters, and all subordinate
          divisions, should be knocked down a level.

          <table> tags present. OUP uses this tag occasionally to
          represent tabular material, which may or may not have been
          deleted. In general, the <table> element, and its contents
          (if any) should be replaced with unless it is exceptionally
          easy to transform the contents into a .

     [^<]*

          Lists may need <head>s. A spurious paragraph between a tag and the first or

tag, or replaced with a
tag if they are not.

\(\(]*>\)\|[_^J]\)*\(_s. The regular expression detects

s immediately followed by another

or by a

tag, possibly with intervening elements.

     [_^J]*[a-z]
          Spurious <p>s? This is a simple check for paragraphs
          starting with a lower-case letter, a circumstance which may
          be legal in some cases.

     [_^J]*[^<]
          <p>s missing after ? is used in CDIF only for block
          quotations. As such, a new paragraph may generally be
          expected to start immediately a has finished. This need not
          always be true, however, and must be checked.

     <[^>]+_r=\w\w\w
          Unrecognized rendition (r=) values present. The test fails
          to catch rendition values with bad two-character names.

     -
          s preceded by hyphens present. As a point of style,
          check-out is preferable to check-out as, in the latter, the
          content of the element is something that would generally be
          regarded as a whole word, whereas, in the former, it is
          not. However, fix-ups of this type are generally too
          time-consuming to be worthwhile.

          elements, as part of the transduction process. Those which
          escape this net should be examined to see if they too
          should be replaced with s.

          s. This warning appears for any text which, after
          transduction, contains elements. Some of these may be
          spurious, in that the enclosed word represents a valid
          British spelling; and some may be candidates for rewriting
          as s, where, for example, some intentionally-used variant
          form of a word appears in the original text.

     A number of warnings such as Check date have been omitted from
     the list - see CAVEATS below.

OPTIONS
     -v   The -v option causes each file name to be printed on
          standard error immediately before the file is processed.

     -z   The -z option results in the documentation (Z_) file
          corresponding to each input file being created. The first
          line gives the text name; the second is blank; and the
          third gives the date, the login name of the user running
          oupconv, and the command line used.
          The fourth and subsequent lines carry the warnings produced
          by the examination phase. If the file exists, it is
          overwritten. In the absence of the -z option, warnings
          about the contents of the file are sent to standard error.

DIAGNOSTICS
     Various error conditions associated with non-existent,
     badly-named, or unreadable input files, with input files which
     are not valid OUP transfer format documents, and with output
     files which cannot be created, result in ``file skipped''
     messages on standard error.

     Under the -z option, a warning is given on standard error if the
     documentation (Z_) file cannot be created. The main output file
     is still written under these circumstances.

     More serious error conditions result in immediate termination
     with a diagnostic message.

     The return status of the program is zero only if no file warning
     or error conditions were encountered. Warnings about file
     contents generated during the examination phase do not affect
     the return status.

CAVEATS
     The program guesses that a file represents a book if its BNC
     name is identical to its OUP name. If the names differ only in
     their last character, the file is presumed to represent a
     periodical or other non-book material. If the names differ in
     more characters than the last, the program does not express an
     opinion on the type of the source material. The guesses
     determine the contents of the dummy CDIF header, the values of
     the attributes of the tag, and whether a word count greater than
     40,000 elicits a comment. The guesses are not always correct,
     and should be checked.

     The program is unable to pick issue dates or author names out of
     OUP prologues because of the inconsistent manner in which this
     information is presented.

FILES
     /home/natcorp/bin/oupconv
          The program itself.
          It contains both this manual page (use nroff -man
          /home/natcorp/bin/oupconv or similar to print it), and the
          tables used to define the various transductions and tests
          applied to the text.

AUTHOR
     Dominic Dunlop

SEE ALSO
     perl(1), TGCW04: Encoding and markup for the Oxford Pilot
     Corpus; TGCW25: Markup for non-ISO 646 invariant part
     characters; TGCW30: Corpus Document Interchange Format, v1.2;
     TGCW33: BNC data capture: OUP format definition for text
     handover to OUCS; TGCW35: Corpus text processing: directory
     structure and filenames.

BUGS
     The transductions and subsequent examinations applied to the
     input may not be correct in every case, nor do they cover every
     possibility. Let Dominic know if you encounter a circumstance
     applying to more than one input file which is handled
     incorrectly, or not handled at all.

     It would be nice if the regular expressions used in the
     examination phase were easily available for use in editors, in
     order that they could subsequently be used to locate the source
     of a problem. Sadly, they are perl-format regular expressions,
     which differ subtly but annoyingly from emacs- and vi-format
     regular expressions, making manual conversion necessary. The
     emacs versions shown above under DESCRIPTION are untested
     approximations.

     If confronted with very large individual files (several hundred
     thousand words), the program may grow very large, and ultimately
     run out of memory. In such cases, it will fail with a message
     stating that it is out of memory, or that it is unable to fork.
     If the failure occurs after oupconv has processed several files,
     it may be possible to overcome it by processing the offending
     file on its own; if this fails, the only solution is to break
     the file by hand into a number of files with valid OUP
     prologues, process these separately, then reconstitute the
     result. oupconv could circumvent the problem by breaking up the
     file itself, but there is no safe place to break. For example,
     breaking ahead of s may prevent spurious s from being detected.
     So the work is left to a human.

     Even in the normal course of events, oupconv grows rather large:
     it seems that perl's regular expression evaluator is
     memory-hungry.
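The output naming rule given under DESCRIPTION (replace the leading dot of the trailing pathname component with A_ or Z_) can be sketched as below. This is an illustrative Python fragment, not code taken from oupconv; the function name output_name and the sample file name .abc123 are assumptions made for the example.

```python
import posixpath


def output_name(input_path, prefix):
    """Replace the leading dot of the last path component with prefix.

    prefix is expected to be "A_" (converted text) or "Z_"
    (documentation file). POSIX-style paths are assumed.
    """
    directory, name = posixpath.split(input_path)
    if not name.startswith("."):
        # The manual page says badly-named input files are skipped;
        # raising an error here is a simplification for the sketch.
        raise ValueError("not a dot file: %r" % input_path)
    return posixpath.join(directory, prefix + name[1:])


if __name__ == "__main__":
    print(output_name("texts/.abc123", "A_"))
    print(output_name("texts/.abc123", "Z_"))
```

Only the trailing component is rewritten, so any dots in directory names earlier in the path are left alone.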