LongmanCONV(1l)     MISC. REFERENCE MANUAL PAGES     LongmanCONV(1l)

NAME
     longmanconv - convert Longman transfer format to approximate
     CDIF

SYNOPSIS
     longmanconv [-cvz] Longman-format_file...

DESCRIPTION
     longmanconv takes as its input one or more Longman-format
     (dot) files and produces corresponding A_ files. Optionally,
     corresponding OUP-style classification (classif) information
     and documentation (Z_) files may also be produced.

     To convert all the dot files in the current directory,
     producing A_ and Z_ files, use the command line

          longmanconv -vz .??????

     To do the same, and in addition generate OUP-style
     classification data for subsequent automatic insertion into
     the BNC database, use

          longmanconv -cvz .?????? > ../classif

     The program operates on each input file in three phases -
     transduction, examination, and documentation. The
     transduction and examination phases are controlled by tables
     appearing at the end of the program source file - see FILES
     below.

     In the transduction phase, the Longman file prologue is
     transformed into a CDIF prologue, including a dummy header.
     If the -c option (see OPTIONS below) is in force, information
     from the Longman file prologue is written to standard output.
     The format of the data is close to that of the classification
     records provided by OUP for written texts, and can be
     processed in the same manner - see NOTES below.

     A sequence of regular expression-based substitutions is then
     made on the remainder of the file. These cover

     - differences between Longman transfer format, as defined in
       TGCW03 and TGCW56, and CDIF;

     - differences between Longman data capture format, as
       defined in TGCW03, and CDIF (which should be handled by
       Longman prior to transfer, but which appear from time to
       time in received files); and

     - a number of heuristic fixes for commonly-encountered
       errors in Longman transfer format files.
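     The table-driven substitution pass described above can be
     pictured with a minimal Python sketch. It is an illustration
     only: the real tables are Perl and live in the program source,
     and both rules below (including the byte value and the
     &pound; replacement) are assumptions, not rules taken from
     longmanconv.

```python
import re

# Hypothetical substitution table: (pattern, replacement) pairs, applied in order.
# Neither rule is taken from longmanconv; both are assumptions for illustration.
RULES = [
    # Assumed: the IBM PC (code page 437) pound sterling character, 0x9C.
    (re.compile("\x9c"), "&pound;"),
    # Assumed: strip trailing blanks at the end of each line.
    (re.compile(r"[ \t]+$", re.MULTILINE), ""),
]


def transduce(text, rules=RULES):
    """Apply every substitution in the table to the whole text, in table order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text
```

     Keeping the rules in a single ordered table means that a new
     fix is added by appending one pair, and that the order in
     which fixes apply is visible in one place.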
     In the examination phase, an attempt is made to match each of
     a sequence of regular expressions against the output of the
     transduction phase. A match indicates a possible or definite
     problem, or an area requiring attention. A warning is
     produced for each regular expression which matches. Multiple
     matches of a given regular expression are not attempted: the
     warnings state only that a particular condition exists; they
     do not state how often it occurs, or give the location at
     which the match was found.

     The result of the transduction phase is then directed to a
     file named by replacing the leading dot of the trailing
     pathname component of the input filename with A_. If a file
     of this name already exists, it is overwritten. Note that no
     line wrapping takes place: output lineation is close to that
     of the input file.

     The documentation phase writes the warnings produced in the
     examination phase either to standard error or, if the -z
     option is in force, to a file named by replacing the leading
     dot of the trailing pathname component of the input filename
     with Z_. The information written includes a word count.

     emacs versions of the regular expressions applied during the
     examination phase are as follows (space represented by
     underscore (_), line-feed by control-J (^J)):

     ^@
          Text contains control/non-ASCII characters. Some Longman
          texts contain patches of digital garbage; some contain
          single non-ASCII characters representing, for example,
          currency signs. longmanconv first rewrites known
          non-ASCII characters, such as the IBM PC pound sterling
          sign, then attempts to reduce what remains to single
          null characters. These intentionally cause SGML errors
          when the syntax of the file is checked: the text around
          each such error should be checked for damage.
          If a text contains many nulls, indicating many patches
          of garbage, have no compunction about bouncing it. If
          there are only a few, or if it seems that a clean file
          can be created simply by deleting the nulls or by
          changing them to some other character using a
          query-replace operation, consider accepting the text.
          Please let Dominic know if you suspect that longmanconv
          should be rewriting some non-ASCII character - say, that
          for -, but is not. (^@ is ASCII null - use C-q C-@ in
          emacs to generate this character.)

     &[gl]t;
          Text contains &lt;/&gt;: badly constructed tags? Longman
          seems inclined occasionally to mistype the starting and
          ending angle-brackets enclosing mark-up. longmanconv
          attempts to fix the most common cases, but cannot fix
          everything. If problems remain, a syntax check will
          probably fail. The test gives false positives on genuine
          occurrences of &lt;/&gt;.

     [0-9]/[-0-9]
          Unmarked fractions or #sd amounts? (Grrr. No British
          pound sign in this troff.) With the exception of the
          occasional &frac12;, Longman does not specifically mark
          fractions or pounds, shillings and pence amounts. If
          time is available (which it probably is not), give
          consideration to using &frac14; etc. and &shilling;
          where appropriate. Things like musical time signatures
          and reference numbers should be left alone, however.

     &bquo;&[be]quo;
          Problems with nested quote marks? &bquo;&bquo; may be
          legal if a quotation starts with a nested quotation;
          &bquo;&equo; is always an error.

     [,!?],
          End quote after [,!?] rendered as comma? This is a
          relatively frequent scanner error.

     &bquo;[^&]+\.,
          End quote after . rendered as comma? This is basically
          the same error as the previous one, but is harder to
          detect, as a simple search for ., yields false positives
          because of the full stops used in abbreviations.
          The regular expression used does not detect errors in
          quotations containing entity references, and may give
          false positives for quotations containing abbreviations.

     [_^J]'[^']
          Apostrophe used as quote mark? The regular expression
          detects an apostrophe at the beginning of a word. There
          is no check for apostrophes used to end quotations, as
          these are difficult to distinguish from legal uses of
          apostrophes. Illegally-used apostrophes should be
          replaced with &bquo; and &equo;, or with ". (See also
          next two items and NOTES.)

     [_^JxX-]\([0-9]+.\)?[0-9]+\(&equo;\|'\)
          &ft; or &ins; rendered as &equo;? A prime or double
          prime following digits probably represents a unit of
          measurement, rather than a closing quote mark. The test
          gives false positives where quoted material ends with a
          year or model number.

     [_^JxX-]\([0-9]+.\)?[0-9]+\("\|'\)
          &ft; or &ins; rendered as "? This is the same as the
          previous test, except that it applies to texts in which
          Longman has not differentiated start and end quote
          marks.

     ^-[^0-9]
          Line-start hyphens. Line-start hyphens are legal only
          where they introduce a negative number. Other
          occurrences should be eliminated. A high level of
          illegal hyphens is justification for bouncing a text.
          (See also next item.)

     -_*$
          Line-end hyphens. These should be legal where
          constructions such as half- and full-board are split
          across lines after the first hyphen. However, as
          checking for such constructions is far from trivial,
          such splitting is considered illegal: the text should be
          amended so that the split appears elsewhere.

     [0-9]-[0-9]
          Hyphens instead of &ndash;? Conventionally, dashes
          separating digits are en dashes, but Longman seldom, if
          ever, renders them as such. longmanconv does not attempt
          to enforce the convention because of the risk of
          mangling formulae such as A=7-2. Consider fixing things
          up using emacs' query-replace-regexp or similar.
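     As an illustration of how the examination phase reports at
     most one warning per check, the three hyphen tests above can
     be rendered in Python roughly as follows. This is a sketch
     under assumptions, not the program's own code: the patterns
     are hand translations of the emacs forms (underscore written
     as a literal space), the warning texts are paraphrased, and
     the names CHECKS and examine are hypothetical.

```python
import re

# Hand-translated Python versions of three checks from the list above.
# Warning texts are paraphrases; the real ones come from longmanconv's tables.
CHECKS = [
    (re.compile(r"^-[^0-9]", re.MULTILINE), "Line-start hyphens"),
    (re.compile(r"- *$", re.MULTILINE), "Line-end hyphens"),
    (re.compile(r"[0-9]-[0-9]"), "Hyphens between digits"),
]


def examine(text, checks=CHECKS):
    """Return one warning for each check that matches at least once.

    As in the manual's description, no count or location is reported:
    the first match is enough to raise the warning.
    """
    return [message for pattern, message in checks if pattern.search(text)]
```

     For example, examine("pages 10-12\n-and so on\nhalf-\n")
     raises all three warnings, while a line starting with a
     negative number such as "-5 degrees" raises none.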
     \.(<[^>]+>)?_l
          Initial I rendered as l? This is a common scanner error,
          as are the next five cases.

     [a-z]L
          l rendered as L?

     [a-z]P
          p rendered as P?

     [0-9][oOl][0-9]
          Internal 0, 1 rendered as letters?

     \<[oOl][0-9]
          Initial 0, 1 rendered as letters?

     [0-9][oOl]\>
          Terminal 0, 1 rendered as letters?

     [_^J]*
          Multiple typeless s per
. The semantics of CDIF require that all s except the first have a type attribute with a value of sub or byline. Longman is also prone to splitting s in two for no good reason.
          Text contains un-numbered s. Longman occasionally uses
          un-numbered s, which are not legal in written CDIF
          texts. They should be replaced with
          appropriately-numbered s.

          Text contains s: consider incrementing. Some Longman
          texts use , the use of which is discouraged in CDIF.
          (We're holding it back in case of some future
          requirement that we haven't anticipated.) If a single
          encloses a whole text, it can be eliminated; otherwise,
          consider bumping all s up to . (C-x h C-u M-| bumpdiv
          does this to the contents of an emacs buffer - provided
          that ~natcorp/bin is on your search path.)

     [_^J]*
          Spurious tags? This is a search for completely empty s.

     \([_^J]*[_^J]*\)+
          Parts, chapters both ? Where a book is divided into
          parts, Longman may mark both the parts and the chapters
          as s. Chapters, and all subordinate divisions, should be
          knocked down a level. (You can use bumpdiv, as in the
          previous case, then knock back the few s which should be
          s by hand.)

\(\(]*>\)\|[_^J]\)*\(s. The regular expression detects

s immediately followed by another

or by a

tag, possibly with intervening elements.

[_^J]*[a-z] Spurious

s? This is a simple check for paragraphs starting with a lower-case letter, a circumstance which may be legal in some cases.

     A number of warnings such as Check date have been omitted
     from the list. These suggest actions concerned with the
     clean-up of information in the dummy header.

OPTIONS
     -c   Write information in the format found in OUP classif
          files to standard output. See NOTES below.

     -v   Print each file name on standard error immediately
          before the file is processed.

     -z   Create the documentation (Z_) file corresponding to each
          input file. The first line gives the text name; the
          second is blank; and the third gives the date, the login
          name of the user running longmanconv, and the command
          line used. The fourth and subsequent lines carry the
          warnings produced by the examination phase. If the file
          exists, it is overwritten. In the absence of the -z
          option, warnings about the contents of the file are sent
          to standard error.

DIAGNOSTICS
     Various error conditions associated with non-existent,
     badly-named, or unreadable input files, with input files
     which are not valid Longman transfer format documents, and
     with output files which cannot be created, result in ``file
     skipped'' messages on standard error.

     Under the -z option, a warning is given on standard error if
     the documentation (Z_) file cannot be created. The main
     output file is still written under these circumstances.

     More serious error conditions result in immediate termination
     with a diagnostic message.

     The return status of the program is zero only if no file
     warning or error conditions were encountered. Warnings about
     file contents generated during the examination phase do not
     affect the return status.

FILES
     /home/natcorp/bin/longmanconv
          The program itself.
          It contains both this manual page (use nroff -man
          /home/natcorp/bin/longmanconv or similar to print it),
          and the tables used to define the various transductions
          and tests applied to the text.

AUTHOR
     Dominic Dunlop

SEE ALSO
     perl(1); TGCW03: CPH Appendix A; TGCW25: Markup for non-ISO
     646 invariant part characters; TGCW30: Corpus Document
     Interchange Format, v1.2; TGCW35: Corpus text processing:
     directory structure and filenames; TGCW50 overnight - perform
     overnight housekeeping for corpus files; TGCW55 oup2bnc -
     update BNC database with information derived from OUP
     database; TGCW56: Guide to Longman/Lancaster header codes.

NOTES
     Under the -c option, OUP-style classification information is
     sent to standard output. For this to be processed
     automatically by the overnight program, classification
     information for all files in a given directory should be
     redirected to a file named classif in the directory named for
     the date on which the files were received from Longman. This
     is generally the parent directory of the directory containing
     the files. Thus, the second command line example shown under
     DESCRIPTION above does the trick.

     There are two styles of quote mark usage in Longman files: in
     a given text, start and end quote marks may be
     differentiated, or they may not. In the first case, the
     output of longmanconv will contain both &bquo; and &equo;; in
     the second, all quote marks should appear as ".

BUGS
     The transductions and subsequent examinations applied to the
     input may not be correct in every case, nor do they cover
     every possibility. Let Dominic know if you encounter a
     circumstance applying to more than one input file which is
     handled incorrectly, or not handled at all.
     It would be nice if the regular expressions used in the
     examination phase were easily available for use in editors,
     so that they could be used to locate the source of a
     problem. Sadly, they are perl-format regular expressions,
     which differ subtly but annoyingly from emacs- and vi-format
     regular expressions, making manual conversion necessary. The
     emacs versions shown above under DESCRIPTION are untested
     approximations.

Sun Release 4.          Last change: TGCW57: 2 September, 1993