LongmanCONV(1l) MISC. REFERENCE MANUAL PAGES LongmanCONV(1l)
NAME
longmanconv - convert Longman transfer format to approximate
CDIF
SYNOPSIS
longmanconv [-cvz] Longman-format_file...
DESCRIPTION
longmanconv takes as its input one or more Longman-format
(dot) files and produces corresponding A_ files. Optionally,
corresponding OUP-style classification (classif) information
and documentation (Z_) files may also be produced.
To convert all the dot files in the current directory,
producing A_ and Z_ files, use the command line
     longmanconv -vz .??????
To do the same, and in addition generate OUP-style
classification data for subsequent automatic insertion into
the BNC database, use
     longmanconv -cvz .?????? > ../classif
The program operates on each input file in three phases:
transduction, examination, and documentation. The
transduction and examination phases are controlled by tables
appearing at the end of the program source file (see FILES
below).
In the transduction phase, the Longman file prologue is
transformed into a CDIF prologue, including a dummy header.
If the -c option (see OPTIONS below) is in force, information
from the Longman file prologue is written to standard output.
The format of the data is close to that of the classification
records provided by OUP for written texts, and can be
processed in the same manner (see NOTES below).
A sequence of regular expression-based substitutions is then
made on the remainder of the file. These cover
- differences between Longman transfer format, as defined in
  TGCW03 and TGCW56, and CDIF;
- differences between Longman data capture format, as defined
  in TGCW03, and CDIF (which should be handled by Longman
  prior to transfer, but which appear from time to time in
  received files); and
- a number of heuristic fixes for commonly-encountered errors
  in Longman transfer format files.
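As a rough illustration of a table-driven substitution pass of the kind described above, the following Python sketch applies an ordered list of rules to the body of a file. It is not taken from longmanconv (which is a perl program whose tables are not reproduced on this page). The two rules mirror the non-ASCII handling described under the ^@ warning; the byte value 0x9C (the pound sign in IBM PC code page 437), the &pound; entity name, and the names RULES and transduce are assumptions made for the example.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement), applied in order.
# Rule 1 rewrites one "known" non-ASCII byte (assumed: CP437 pound sign).
# Rule 2 reduces each remaining run of control/non-ASCII bytes to one NUL.
RULES = [
    (re.compile(rb"\x9c"), b"&pound;"),
    (re.compile(rb"[^\x09\x0a\x0d\x20-\x7e]+"), b"\x00"),
]


def transduce(body: bytes) -> bytes:
    """Apply every rule in table order; later rules see earlier results."""
    for pattern, replacement in RULES:
        body = pattern.sub(replacement, body)
    return body
```

Order matters in such a table: the specific rewrite has to run before the catch-all rule, or the known character would be reduced to a null along with the garbage.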
In the examination phase, an attempt is made to match each
of a sequence of regular expressions against the output of
the transduction phase. A match indicates a possible or
definite problem, or an area requiring attention. A warning
is produced for each regular expression which matches.
Multiple matches of a given regular expression are not
attempted: the warnings only state that a particular
condition exists; they do not state how often it occurs, or
give the location at which the match was found.
The result of the transduction phase is then directed to a
file named by replacing the leading dot of the trailing
pathname component of the input filename with A_. If a file
of this name already exists, it is overwritten. Note that no
line wrapping takes place: output lineation is close to that
of the input file.
The documentation phase writes the warnings produced in the
examination phase either to standard error or, if the -z
option is in force, to a file named by replacing the leading
dot of the trailing pathname component of the input filename
with Z_. The information written includes a word count.
emacs versions of the regular expressions applied during the
examination phase are as follows (space represented by
underscore (_), line-feed by control-J (^J)):
^@
     Text contains control/non-ASCII characters. Some Longman
     texts contain patches of digital garbage; some contain
     single non-ASCII characters representing, for example,
     currency signs. longmanconv first rewrites known
     non-ASCII characters, such as the IBM PC pound sterling
     sign, then attempts to reduce what remains to single
     null characters. These intentionally cause SGML errors
     when the syntax of the file is checked: the text around
     each such error should be checked for damage.
     If a text contains many nulls, indicating many patches
     of garbage, have no compunction about bouncing it. If
     there are only a few, or if it seems that a clean file
     can be created simply by deleting the nulls or by
     changing them to some other character using a
     query-replace operation, consider accepting the text.
     Please let Dominic know if you suspect that longmanconv
     should be rewriting some non-ASCII character (say, that
     for -) but is not. (^@ is ASCII null: use C-q C-@ in
     emacs to generate this character.)
&[gl]t;
     Text contains &lt;/&gt;: badly constructed tags? Longman
     seems inclined occasionally to mistype the starting and
     ending angle-brackets enclosing mark-up. longmanconv
     attempts to fix the most common cases, but cannot fix
     everything. If problems remain, a syntax check will
     probably fail. The test gives false positives on genuine
     occurrences of &lt;/&gt;.
[0-9]/[-0-9]
     Unmarked fractions or #sd amounts? (Grrr. No British
     pound sign in this troff.) With the exception of the
     occasional &frac12;, Longman does not specifically mark
     fractions or pounds, shillings and pence amounts. If
     time is available (which it probably is not), give
     consideration to using &frac14; etc. and &shilling;
     where appropriate. Things like musical time signatures
     and reference numbers should be left alone, however.
&bquo;&[be]quo;
     Problems with nested quote marks? &bquo;&bquo; may be
     legal if a quotation starts with a nested quotation;
     &bquo;&equo; is always an error.
[,!?],
     End quote after [,!?] rendered as comma? This is a
     relatively frequent scanner error.
&bquo;[^&]+\.,
     End quote after . rendered as comma? This is basically
     the same error as the previous one, but is harder to
     detect, as a simple search for ., yields false positives
     because of the full stops used in abbreviations.
     The regular expression used does not detect errors in
     quotations containing entity references, and may give
     false positives for quotations containing abbreviations.
[_^J]'[^']
     Apostrophe used as quote mark? The regular expression
     detects an apostrophe at the beginning of a word. There
     is no check for apostrophes used to end quotations, as
     these are difficult to distinguish from legal uses of
     apostrophes. Illegally-used apostrophes should be
     replaced with &bquo; and &equo;, or with ". (See also
     next two items and NOTES.)
[_^JxX-]\([0-9]+.\)?[0-9]+\(&equo;\|'\)
     &ft; or &ins; rendered as &equo;? A prime or double
     prime following digits probably represents a unit of
     measurement, rather than a closing quote mark. The test
     gives false positives where quoted material ends with a
     year or model number.
[_^JxX-]\([0-9]+.\)?[0-9]+\("\|'\)
     &ft; or &ins; rendered as "? This is the same as the
     previous test, except that it applies to texts in which
     Longman has not differentiated start and end quote
     marks.
^-[^0-9]
     Line-start hyphens. Line-start hyphens are legal only
     where they introduce a negative number. Other
     occurrences should be eliminated. A high level of
     illegal hyphens is justification for bouncing a text.
     (See also next item.)
-_*$
     Line-end hyphens. These should be legal where
     constructions such as half- and full-board are split
     across lines after the first hyphen. However, as
     checking for such constructions is far from trivial,
     such splitting is considered illegal: the text should be
     amended so that the split appears elsewhere.
[0-9]-[0-9]
     Hyphens instead of &ndash;? Conventionally, dashes
     separating digits are en dashes, but Longman seldom, if
     ever, renders them as such. longmanconv does not attempt
     to enforce the convention because of the risk of
     mangling formulae such as A=7-2. Consider fixing things
     up using emacs' query-replace-regexp or similar.
\.(<[^>]+>)?_l
     Initial I rendered as l? This is a common scanner error,
     as are the next five cases.
[a-z]L
     l rendered as L?
[a-z]P
     p rendered as P?
[0-9][oOl][0-9]
     Internal 0, 1 rendered as letters?
\<[oOl][0-9]
     Initial 0, 1 rendered as letters?
[0-9][oOl]\>
     Terminal 0, 1 rendered as letters?
[_^J]*
Multiple typeless s per\(\( s. The regular expression detects s
immediately followed by another or by a [_^J]*[a-z]
Spurious s? This is a simple check for para-
graphs starting with a lower-case letter, a cir-
cumstance which may be legal in some cases.
A number of warnings such as Check date have been omitted
from the list. These suggest actions concerned with the
clean-up of information in the dummy header.
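The "one warning per matching expression" behaviour of the examination phase can be sketched in Python as below. This is an illustration only, not the program's perl tables: the five patterns are hand translations of emacs-style entries listed above (underscore written as a space), and the names CHECKS and examine are invented for the example.

```python
import re

# A few of the checks listed above, hand-translated to Python syntax.
# Each entry is (pattern, warning text); table order fixes warning order.
CHECKS = [
    (r"&bquo;&[be]quo;", "Problems with nested quote marks?"),
    (r"^-[^0-9]", "Line-start hyphens."),
    (r"- *$", "Line-end hyphens."),
    (r"[a-z]L", "l rendered as L?"),
    (r"[a-z]P", "p rendered as P?"),
]


def examine(text):
    """Return one warning per pattern that matches anywhere in text.

    Only the first match is sought, so no count or location is reported.
    """
    warnings = []
    for pattern, warning in CHECKS:
        # MULTILINE makes ^ and $ apply at every line boundary.
        if re.search(pattern, text, flags=re.MULTILINE):
            warnings.append(warning)
    return warnings
```

For instance, a text with two line-end hyphens still yields a single "Line-end hyphens." warning, which is why the warnings have to be followed up by searching the A_ file in an editor.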
OPTIONS
-c Write information in the format found in OUP classif
files to standard output. See NOTES below.
-v The -v option causes each file name to be printed on
standard error immediately before the file is pro-
cessed.
-z The -z option results in the documentation (Z_) file
corresponding to each input file being created. The
first line gives the text name; the second is blank;
and the third gives the date, the login name of the
user running longmanconv, and the command line used.
The fourth and subsequent lines carry the warnings pro-
duced by the examination phase.
If the file exists, it is overwritten.
In the absence of the -z option, warnings about the
contents of the file are sent to standard error.
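The output naming rule given under DESCRIPTION and the Z_ file layout described for -z can be sketched as follows. This is a hedged illustration, not the program's code: the function names are invented, treating a name without a leading dot as an error is an assumption, and the single-space separator used on the third line is a guess, since this page does not specify one.

```python
import posixpath


def derived_name(input_path, prefix):
    """Replace the leading dot of the last pathname component with prefix."""
    directory, name = posixpath.split(input_path)
    if not name.startswith("."):
        # Assumption: such a name counts as "badly-named" input.
        raise ValueError("input file name must start with a dot: " + name)
    return posixpath.join(directory, prefix + name[1:])


def documentation_lines(text_name, date, login, command_line, warnings):
    """Lay out a Z_ file: name, blank line, run details, then warnings."""
    run_details = " ".join([date, login, command_line])  # separator assumed
    return [text_name, "", run_details] + list(warnings)
```

With these definitions, derived_name("incoming/.ABC123", "A_") gives "incoming/A_ABC123", and the same call with "Z_" gives the name of the documentation file.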
DIAGNOSTICS
Various error conditions associated with non-existent,
badly-named, or unreadable input files, with input files
which are not valid Longman transfer format documents, and
with output files which cannot be created, result in ``file
skipped'' messages on standard error.
Under the -z option, a warning is given on standard error
if the documentation (Z_) file cannot be created. The main
output file is still written under these circumstances.
More serious error conditions result in immediate termina-
tion with a diagnostic message.
The return status of the program is zero only if no file
warning or error conditions were encountered. Warnings
about file contents generated during the examination phase
do not affect the return status.
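The skip-and-continue behaviour described above might be structured as in the following Python sketch. Only the "file skipped" wording and the rule that the status is zero when no file problem occurred come from this page; the convert callback, the exact message format, the use of OSError and ValueError to signal a skippable file, and the non-zero value 1 are assumptions made for illustration.

```python
def run(paths, convert):
    """Convert each path, skipping files that fail.

    Returns (status, messages). status is 0 only if no file was skipped.
    Content warnings are assumed to be handled inside convert() and so
    never influence the status.
    """
    messages = []
    for path in paths:
        try:
            convert(path)
        except (OSError, ValueError) as exc:
            # A per-file problem: report it and carry on with the next file.
            messages.append("%s: file skipped (%s)" % (path, exc))
    status = 1 if messages else 0
    return status, messages
```

A caller would print the messages on standard error and pass the status to sys.exit().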
FILES
/home/natcorp/bin/longmanconv
The program itself. It contains both this manual page
(use nroff -man /home/natcorp/bin/longmanconv or simi-
lar to print it), and the tables used to define the
various transductions and tests applied to the text.
AUTHOR
Dominic Dunlop
SEE ALSO
perl(1), TGCW03: CPH Appendix A; TGCW25: Markup for non-ISO
646 invariant part characters; TGCW30: Corpus Document
Interchange Format, v1.2; TGCW35: Corpus text processing:
directory structure and filenames; TGCW50 overnight - per-
form overnight housekeeping for corpus files; TGCW55 oup2bnc
- update BNC database with information derived from OUP
database; TGCW56: Guide to Longman/Lancaster header codes.
NOTES
Under the -c option, OUP-style classification information is
sent to standard output. In order that this is processed
automatically by the overnight program, classification
information for all files in a given directory should be
redirected to a file named classif in the directory named
for the date on which the files were received from Longman.
This is generally the parent directory of the directory con-
taining the files. Thus, the second command line example
shown under DESCRIPTION above does the trick.
There are two styles of quote mark usage in Longman files:
in a given text, start and end quote may be differentiated,
or they may not. In the first case, the output of longman-
conv will contain both &bquo; and &equo;; in the second, all
quote marks should appear as ".
BUGS
The transductions and subsequent examinations applied to the
input may not be correct in every case, nor do they cover
every possibility. Let Dominic know if you encounter a cir-
cumstance applying to more than one input file which is han-
dled incorrectly, or not handled at all.
It would be nice if the regular expressions used in the
examination phase were easily available for use in editors,
so that they could be used to locate the source of a
problem. Sadly, they are perl-format regular expressions,
which differ subtly but annoyingly from emacs- and vi-format
regular expressions, making manual conversion necessary. The
emacs versions shown above under DESCRIPTION are untested
approximations.
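To give an idea of what the manual conversion involves, here is a small Python sketch that turns the notation used under DESCRIPTION (underscore for space, ^J for line-feed, ^@ for null, emacs-style \( \| \) grouping and \< \> word boundaries) into a pattern for Python's re module, whose syntax is close to perl's for these constructs. It is an assumption-laden helper written for this page, not part of longmanconv: it handles only the constructs that appear in the list above, and would misread an expression such as [^J-Z].

```python
import re


def emacs_to_python(pattern):
    """Translate this page's emacs-style notation into Python re syntax."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in "()|":
                out.append(nxt)        # \( \| \) are grouping operators
            elif nxt in "<>":
                out.append(r"\b")      # word start/end become \b
            else:
                out.append(ch + nxt)   # other escapes pass through
            i += 2
        elif pattern.startswith("^J", i):
            out.append(r"\n")          # control-J stands for line-feed
            i += 2
        elif pattern.startswith("^@", i):
            out.append(r"\x00")        # control-@ stands for ASCII null
            i += 2
        elif ch == "_":
            out.append(" ")            # underscore stands for a space
            i += 1
        elif ch in "()|":
            out.append("\\" + ch)      # bare ( | ) are literals in emacs
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def search_emacs(pattern, text):
    """Search text using a pattern written in this page's notation."""
    return re.search(emacs_to_python(pattern), text, flags=re.MULTILINE)
```

Because the listed expressions are themselves untested approximations, anything found this way should be checked by eye before the text is changed.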
Sun Release 4.Last change: TGCW57: 2 September, 1993 7