Divisions of written texts
Written texts exhibit a bewildering variety and richness of
different structural forms. Some have very little organization at levels
higher than the paragraph; others may have a complex hierarchy of
parts, sections, chapters etc. Novels are divided into chapters,
newspapers into sections, reference works into articles and so forth.
The following elements are used to represent all such textual divisions:
- <div1>
- major subdivision of a written text, e.g. chapter.
- <div2>
- further subdivision of a written text, entirely contained within
a <div1>, e.g. section.
- <div3>
- further subdivision of a written text, entirely contained within
a <div2>, e.g. subsection.
- <div4>
- smallest possible subdivision of a written text, entirely
contained within a <div3>, e.g. sub-subsection.
Most written texts, of whatever kind, are hierarchically subdivided
using these elements. Structural subdivisions smaller than level 4 (but
above paragraph level) are all tagged <div4>. In all texts,
structural subdivisions at the highest level (<div1>) are always
identified; lower levels of subdivision (i.e. <div2>, <div3>
or <div4>) may also be supplied where appropriate, but are not
required.
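The nesting rule described above (each <div2> inside a <div1>, each <div3> inside a <div2>, and so on) can be checked mechanically. The following Python sketch is illustrative only: the function name check_div_nesting is hypothetical, and it assumes that every division carries an explicit end-tag, which may not hold for every text.

```python
import re

# Hypothetical helper, not part of any corpus tool. Assumes explicit
# </divN> end-tags are present in the markup being checked.
DIV_TAG = re.compile(r"<(/?)div([1-4])\b[^>]*>", re.IGNORECASE)

def check_div_nesting(text):
    """Return a list of error messages; an empty list means valid nesting."""
    errors = []
    stack = []  # levels of the divisions currently open
    for match in DIV_TAG.finditer(text):
        closing = match.group(1) == "/"
        level = int(match.group(2))
        if closing:
            if stack and stack[-1] == level:
                stack.pop()
            else:
                errors.append(f"unexpected </div{level}>")
        else:
            parent = stack[-1] if stack else 0
            # A <divN> must sit directly inside a <div(N-1)>.
            if parent != level - 1:
                errors.append(f"<div{level}> opened inside level {parent}")
            stack.append(level)
    errors.extend(f"<div{level}> never closed" for level in stack)
    return errors

sample = '<div1 type="chapter"><div2><div3></div3></div2></div1>'
print(check_div_nesting(sample))  # prints []
```

A <div3> placed directly inside a <div1> would be reported as "<div3> opened inside level 1".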
These elements have the following attributes in common, in addition
to the global attributes id, n, and r:
- type
- categorizes the division in some respect, e.g. as a chapter,
section etc.
- org
- specifies how the content of the division is organized. Legal
values are:
- compo
- composite content: i.e. no claim is made about the sequence in
which elements inferior to this one are to be processed, or their
interrelationships
- seq
- sequential content: i.e. elements inferior to this are regarded
as forming a logical unit, to be processed in the sequence given
- complete
- specifies whether or not this division is complete or a sample.
Legal values are:
- Y
- the full text of the original has been used
- N
- a sample of the original text has been used
The n attribute is sometimes used to supply an identifying name or
number used within the text for a given division, for example, a
chapter number, as in the following example:
<div1 type="chapter" n="three" org="seq" complete="y">
More often, however, chapter names or numbers will appear within the text,
tagged using the <head> element discussed in section Headings and captions
below.
The value of the attribute type is used to
characterise the function of the textual division, according to an
informal taxonomy. The values used are listed in ??. If a value is supplied for one division at a
given level, it may be assumed to apply to all subsequent divisions at
the same level until the end of the enclosing element.
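The defaulting rule for type can be sketched as follows. This is a minimal illustration, not the behaviour of any particular software: the function name inherit_types is hypothetical, and the input is assumed to be the type values (or None where absent) of the divisions at one level within a single enclosing element, in document order.

```python
def inherit_types(sibling_types):
    """Fill in missing type values from the most recent explicit one.

    sibling_types: type values (None where the attribute is absent) for
    divisions at the same level inside one enclosing element.
    """
    current = None
    resolved = []
    for value in sibling_types:
        if value is not None:
            current = value  # an explicit value replaces the running default
        resolved.append(current)
    return resolved

print(inherit_types(["chapter", None, None, "appendix", None]))
# prints ['chapter', 'chapter', 'chapter', 'appendix', 'appendix']
```

The running default is reset for each enclosing element simply by calling the function once per group of sibling divisions.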
A sequence of paragraph-level elements of arbitrary length may
precede the first structural subdivision at any level. A text may have
no structural divisions within it at all. Note that any prefatory or
appended matter not forming part of a text will not generally be
captured: the TEI elements <front> and
<back> are not used.
Paragraph-level elements and chunks
Written texts may be organized into structural units containing more
than one <s> element and smaller than any of the divisions
discussed in section Divisions of written texts above. The most commonly
found such element is the <p> (paragraph), but there are several
others. Their common identifying feature is that they may appear
directly within divisions (that is, directly within <div1>, <div2>
etc., or within <text> elements, not nested within some other
element such as a paragraph).
An alphabetically ordered list of these elements follows:
- <bibl>
- a loosely structured bibliographic citation appearing within a
corpus text (see Notes and citations).
- <caption>
- (1) a heading, title etc. attached to a picture or diagram,
usually with deictic content (2) a `pull quote' or other text about or
extracted from a text and superimposed upon it to draw attention to it
(see Headings and captions). Attributes include:
- type
- categorizes the caption. Legal values are:
- byline
- caption containing authorship or provenance of an article in a
newspaper or periodical
- display
- extra-textual caption such as a pull quote or displayed box
- attached
- caption describing a non-transcribed item such as a figure or
photograph
- unspec
- not specified or unknown
- <head>
- a title or heading prefixed to some division of a written text
or to a poem (see Headings and captions). Attributes include:
- type
- characterises the heading in some respect. Legal values are:
- byline
- heading containing authorship or provenance of an article in a
periodical
- main
- a main heading (only one allowed per div)
- sub
- a secondary heading (may be zero or more per div)
- unspec
- not specified or unknown
- <list>
- a collection of distinct items flagged as such by special layout
in written texts, often functioning as a single syntactic unit (see
Lists).
- <note>
- any form of note, additional comment or gloss within a written
or spoken text (see Notes and citations).
- <p>
- a paragraph in a written text.
- <poem>
- a poem, or an extract from one, embedded or quoted within a
spoken or written text (see Poems).
- <quote>
- a quotation from some author other than that of the surrounding
text, usually either embedded or displayed (see Quotations).
- <sp>
- a spoken paragraph, i.e. material marked as ‘written to be
spoken’, usually by the presence of a speaker prefix, for example
in a play script or printed interview (see Spoken paragraphs).
Examples for each of these (except <p>) are discussed in more
detail in the following subsections.
Headings and captions
Headings and captions serve a variety of functions in written texts.
The BNC scheme currently distinguishes between <head> elements, which
can appear only at the start of a text division and are logically
associated with it (for example, chapter titles, newspaper headlines
etc.) and <caption> elements which are logically independent of
the position they may have within a textual division (for example,
captions attached to pictures or figures,
‘pull-quotes’ embedded within the text, ‘by-lines’
identifying authorship and provenance of a newspaper or periodical
article).
One or more <head> elements may appear in sequence at the start of
any <div1>, <div2>, <div3> or <div4> element, or at the start of a
<list> or <poem>, as in the following example:
<div1 type="u" n=1>
<head type=MAIN>
<s n="1"> <w NN1>AGEISM
</head>
<head type=SUB>
<s n="2"> <w AT0>THE <w NN1>FOUNDATION <w PRF>OF
<w NN1>AGE <w NN1>DISCRIMINATION
</head>
<head type=BYLINE>
<s n="3"> <w NP0>STEVE <w NP0-NN1>SCRUTTON
</head>
In the following example, the <head> element is followed by a
number of <caption> elements introducing particular parts of a
magazine story:
<div1 complete=y org=seq>
<head>
<s n=00040>
<w NN2>TROUSERS <w VVB>SUIT
</head>
<caption>
<s n=00041>
<w EX0>There <w VBZ>is <w PNI>nothing <w AJ0>masculine
<w PRP>about <w DT0>these <w AJ0>new <w NN1>trouser
<w NN2-VVZ>suits <w PRP>in <w NN1>summer<w POS>'s
<w AJ0>soft <w NN2>pastels<c PUN>.
<s n=00042>
<w NP0>Smart <w CJC>and <w AJ0>acceptable
<w PRP>for <w NN1>city <w NN1-VVB>wear <w CJC>but
<w AJ0>soft <w AV0>enough <w PRP>for
<w AJ0>relaxed <w NN2>days
</caption>
The type attribute may be used to distinguish more exactly the
function of the caption or heading, as indicated below.
<div1 complete=y org=seq>
<head type=main>
<s n=0223>
<w PNP>They<w VBB>'re <w VDG>doing <w AJ0>fine
</head>
<head type=sub>
<s n=0224>
<w NP0>Dominic <w VVZ>sees <w AJ0-NN1>double
</head>
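Headings and their types can be pulled out of markup like the above with a few regular expressions. The sketch below is an assumption-laden illustration: the name headings is hypothetical, it handles only <head> tags with no attribute or a single type attribute, and it recovers the text simply by deleting every tag.

```python
import re

# Matches <head> or <head type=...>; other attributes are not handled here.
HEAD = re.compile(r'<head(?:\s+type="?(\w+)"?)?\s*>(.*?)</head>',
                  re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")

def headings(text):
    """Return (type, plain text) pairs; type is None when not supplied."""
    result = []
    for match in HEAD.finditer(text):
        head_type = match.group(1).lower() if match.group(1) else None
        # Deleting the tags keeps the spacing of the transcribed words.
        words = " ".join(ANY_TAG.sub("", match.group(2)).split())
        result.append((head_type, words))
    return result

sample = """<head type=main>
<s n=0223>
<w PNP>They<w VBB>'re <w VDG>doing <w AJ0>fine
</head>
<head type=sub>
<s n=0224>
<w NP0>Dominic <w VVZ>sees <w AJ0-NN1>double
</head>"""
print(headings(sample))
# prints [('main', "They're doing fine"), ('sub', 'Dominic sees double')]
```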
Where captions would interrupt the normal flow, pointers are used
as discussed in section ??.
Quotations
A quotation is an extract from some work other than the text itself,
embedded within it, for example as an epigraph or illustration.
It is marked up using the <quote> element. This may contain any
combination of other chunks (for example paragraphs, poems, lists) but
may not directly contain phrase-level elements. Any reference for the
citation should also be contained within it.
For example:
<quote>
<p>
<s n=2080>
<w DT0>This <w NN1>way <w PRP>for <w AT0>the <w AJ0>sorrowful <w NN1>city<c PUN>.
<s n=2081>
<w DT0>This <w NN1>way <w PRP>for <w AJ0>eternal <w NN1>suffering<c PUN>.
<s n=2082>
<w DT0>This <w NN1>way <w TO0>to <w VVI>join <w AT0>the <w AJ0>lost
<w NN0>people<c PUN>&hellip
<s n=2083>
<w VVB>Abandon <w DT0>all <w NN1>hope<c PUN>, <w PNP>you <w PNQ>who
<w VVB>enter<c PUN>&hellip
<bibl><s n=2084>
<w NP0>Dante </bibl>
</p>
</quote>
Spoken paragraphs
As noted above, the <sp> element is used to mark parts of a
written text which were or are intended to be spoken, for example the
speeches in a dramatic text or a published interview. Such parts are
generally readily identifiable by the use of such conventions as speaker
prefixes (the label supplying the name of the speaker) and stage
directions, for which the following specific tags are defined:
- <spkr>
- contains the speech prefix used in the original source to
identify the speaker of a passage written to be spoken.
- <stage>
- contains any kind of stage direction within a dramatic text.
The <sp> element is used only for speaker turns identified
as such in a written text, by contrast with the element <u>
discussed in section Utterances, which is used only
for speaker turns identified in a spoken text, i.e. one which has been
transcribed from audio tape.
If present, a <spkr> element should appear at the start of
the <sp> element, followed by one or more <p> elements
containing the actual speech. Any <stage> element present will
usually be relocated to the end of the paragraph in which it occurs and
replaced by a <ptr> element, as discussed in section ??.
For example:
<sp>
<spkr>
<s n=00156>
<w CRD>M
</spkr>
<p>
<s n=00157>
<ptr target=HHWST01C><w VVB>Give <w DPS>her <w NN1>medicine<c PUN>.
<s n=00158>
<w PNP>I<w VM0>'ll <w VVI>kill <w PNX>myself
<w CJS>if <w PNP>she <w VVZ>dies<c PUN>.
</p>
<stage id=HHWST01C type=u>
<s n=00159>
<c PUL>(<w NP0>Sinking <w PRP>to <w NN2>knees
<w CJC>and <w AJ0-VVG>banging <w NN1>head
<w PRP>on <w NN1>floor<c PUR>)
</stage>
</sp>
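The structure of a spoken paragraph (speaker prefix, speech paragraphs, relocated stage directions) can be unpacked as in the following sketch. The function names parse_sp and plain are hypothetical, and the approach assumes explicit end-tags and no nesting of <sp> elements.

```python
import re

ANY_TAG = re.compile(r"<[^>]+>")
FLAGS = re.IGNORECASE | re.DOTALL

def plain(fragment):
    """Delete all tags and normalise whitespace."""
    return " ".join(ANY_TAG.sub("", fragment).split())

def parse_sp(sp):
    """Split one <sp> element into speaker, speech and stage directions."""
    spkr = re.search(r"<spkr>(.*?)</spkr>", sp, FLAGS)
    paragraphs = re.findall(r"<p>(.*?)</p>", sp, FLAGS)
    stages = re.findall(r"<stage\b[^>]*>(.*?)</stage>", sp, FLAGS)
    return {
        "speaker": plain(spkr.group(1)) if spkr else None,
        "speech": [plain(p) for p in paragraphs],
        "stage": [plain(s) for s in stages],
    }

sample = """<sp>
<spkr>
<s n=00156>
<w CRD>M
</spkr>
<p>
<s n=00157>
<ptr target=HHWST01C><w VVB>Give <w DPS>her <w NN1>medicine<c PUN>.
</p>
<stage id=HHWST01C type=u>
<s n=00159>
<c PUL>(<w NP0>Sinking <w PRP>to <w NN2>knees<c PUR>)
</stage>
</sp>"""
print(parse_sp(sample))
```

For the abbreviated sample above, the speaker is "M", the speech is "Give her medicine." and the single stage direction is "(Sinking to knees)".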
Poems
Poems or fragments of verse or song may appear both within and
between paragraphs. The <l> (line) element is used to mark each
metrical line, and any titles or headings present are marked with
<head> elements. Each such group of lines is marked as a <poem>
element, with no indication of its completeness.
No provision is made for marking units of verse such as stanzas,
verse paragraphs etc. A part attribute is defined for
the <l> which allows incomplete lines to be indicated, but in
the current version of the corpus this always takes the value ‘u’
(for unknown).
For example:
<poem>
<l part=u>
<s n=0900>
<w PNP>I <w VVB>send <w DPS>my <w NN1>soul <w PRP>through
<w NN1>time <w CJC>and <w NN1>space <w TO0>to <w VVI>greet
<w PNP>you<c PUN>.
</l>
<l part=u>
<s n=0901>
<w PNP>You <w VBD>were <w AT0>a <w NN1>poet<c PUN>.
<s n=0902>
<w PNP>You <w VM0>will <w VVI>understand<c PUN>.
</l>
</poem>
Note that the <l> element is not used to mark typographic
lineation; on the few occasions where this has been recorded, it is
marked with the <lb> tag discussed in section Miscellaneous phrase-level elements
below.
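A poem can be read back as a sequence of metrical lines, each a list of tokens. The sketch below assumes, as the examples in this section suggest, that each <w> or <c> tag carries a part-of-speech code and is followed by the token text; the name poem_lines is hypothetical.

```python
import re

# <l ...> ... </l>; the \b prevents matching <lb> or <list>.
LINE = re.compile(r"<l\b[^>]*>(.*?)</l>", re.IGNORECASE | re.DOTALL)
# <w CODE>text or <c CODE>text, up to the next tag.
TOKEN = re.compile(r"<([wc])\s+([A-Z0-9-]+)>([^<]*)")

def poem_lines(poem):
    """Return one list of (token, code) pairs per <l> element."""
    lines = []
    for body in LINE.findall(poem):
        tokens = [(text.strip(), code) for _, code, text in TOKEN.findall(body)]
        lines.append(tokens)
    return lines

sample = """<poem>
<l part=u>
<s n=0901>
<w PNP>You <w VBD>were <w AT0>a <w NN1>poet<c PUN>.
</l>
</poem>"""
print(poem_lines(sample))
# prints [[('You', 'PNP'), ('were', 'VBD'), ('a', 'AT0'), ('poet', 'NN1'), ('.', 'PUN')]]
```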
Lists
A list is a collection of distinct items flagged as such by special
layout in written texts, often functioning as a single syntactic unit.
Lists may appear within or between paragraphs. Where marked, lists are
tagged with the <list> element.
A <list> element consists of an optional <head>
element, followed by one or more <item> elements, each of which
may optionally be preceded by a <label> element. The <label> holds
the identifier or tag sometimes attached to a list item, for example ‘(a)’,
or a word or phrase used for a similar purpose.
The <item> element may appear only inside lists. It contains
the same mixture of elements as a paragraph, and may thus contain one
or more nested lists. It may also contain a series of paragraphs, each
marked with a <p> element.
Here is an example of a simple list:
<list>
<item>
<s n=0087>
<w VBZ>Is <w DPS>your <w NN1>nylon <hi r=it> <w NN1>nightie </hi>
<w AJ0>fireproof<c PUN>?
</item>
<item>
<s n=0088>
<w AT0>The <w NN1>hurricane <w VBD>was <hi r=it> <w AJ0-AV0>mighty </hi>
<w AJ0>fierce<c PUN>. <pb n=78>
</item>
<item>
<s n=0089>
<w VM0>Will <w PNP>you <hi r=it> <w VVI>mow </hi> <w AT0>the <w NN1>lawn<c PUN>?
</item>
<item>
<s n=0090>
<w VDD>Did <w PNP>you <hi r=it> <w VVI>know </hi> <w AT0>the <w NN1>time<c PUN>?
</item>
</list>
Here is an example of a labelled list:
<list>
<label>
<s n=0423>
<w CRD>1<c PUN>. </label>
<item>
<s n=0424>
<w NN1-NP0>Surya <c PUN>&mdash <w NN1>Sun <c PUN>&mdash <w AJ0>Creative
<w NN1>agent
</item>
<label>
<s n=0425>
<w CRD>2<c PUN>. </label>
<item>
<p>
<s n=0426>
<w NN1-NP0>Vayu <c PUN>&mdash <w NN1>Air <c PUN>&mdash <w NP0>Preserving
<w NN1>agent <pb n=43>
</p>
</item>
<label>
<s n=0427>
<w CRD>3<c PUN>. </label>
<item>
<p>
<s n=0428>
<w NN2>Agni <c PUN>&mdash <w NN1>Fire <c PUN>&mdash <w AJ0>Destructive
<w NN1>agent
</p>
</item>
</list>
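The pairing of each optional <label> with the <item> that follows it can be sketched as below. The name list_entries is hypothetical, and the sketch assumes a flat list: nested lists inside an item would need a real parser rather than regular expressions.

```python
import re

ANY_TAG = re.compile(r"<[^>]+>")
PART = re.compile(r"<(label|item)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)

def list_entries(list_markup):
    """Return (label, item text) pairs; label is None for unlabelled items."""
    entries = []
    pending_label = None
    for kind, body in PART.findall(list_markup):
        text = " ".join(ANY_TAG.sub("", body).split())
        if kind.lower() == "label":
            pending_label = text  # held until the next <item> arrives
        else:
            entries.append((pending_label, text))
            pending_label = None
    return entries

sample = """<list>
<label>
<s n=0423>
<w CRD>1<c PUN>. </label>
<item>
<s n=0424>
<w NN1-NP0>Surya <c PUN>&mdash <w NN1>Sun
</item>
<item>
<s n=0089>
<w VM0>Will <w PNP>you <w VVI>mow <w AT0>the <w NN1>lawn<c PUN>?
</item>
</list>"""
print(list_entries(sample))
# prints [('1.', 'Surya &mdash Sun'), (None, 'Will you mow the lawn?')]
```

Entity references such as &mdash are left untouched here; expanding them is a separate step.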
Notes and citations
Annotations occurring in written texts, and bibliographic citations
or references, have been marked up in some texts, using the <note>
element. This element has the following additional attributes:
- type
- identifies the provenance of the note, i.e. editorial or
authorial. Legal values are:
- ed
- note supplied by BNC transcriber or encoder
- orig
- note present in the original source text
- ed
- code for the person or organization responsible for BNC-supplied
note. Legal values are:
- lancs
- Note supplied by UCREL grammarians
- longm
- Note supplied by Longman transcribers
- oucs
- Note supplied by OUCS staff
- oup
- Note supplied by OUP transcribers
- undef
- Provenance of note unknown or unspecified
- place
- specifies the location of an original note in the source text.
Legal values are:
- foot
- foot of page
- end
- end of current division or text
- side
- left or right margin
- unspec
- unknown or unspecified.
Notes within headers are tagged using a distinct <bibNote>
element, which is a departure from TEI-recommended practice, as is the
use of the <note> element for both original and supplied
annotation. The two usages are distinguished by the type
attribute.
Here for example is a typical transcriber's note:
<note type=ed>
<s n=0001>
<w NN1-NP0>Page <w NN2>numbers <w XX0>not <w AJ0>available
</note>
Original notes may contain any mixture of other chunks, and may also
contain paragraphs: they may appear in written texts only. They will
normally be relocated to the end of the section in which they appear,
and their original position marked by a <ptr> element, as
discussed in section ??.
For example:
<s n=053>
<w CJS-PRP>As <w AT0>the <w NP0>UK<w POS>'s <w AJ0>main <w AJ0>independent
<w NN1>AIDS <w AV0>home <w NN1-VVB>care <w NN1>provider<c PUN>,
<w PNP>we <w VVD>cared <w AVP-PRP>for <w PRP>around <w NN0>25%
<w PRF>of <w DT0>all <w DT0>those <w PNQ>who <w VVD>died
<w PRF>of <w NN1>AIDS <w ORD>last <w NN1>year <ptr target=A02NT001><c PUN>.
<s n=054>
<w PRP>In <w NP0>London<c PUN>, <w NN1-VVB>demand <w PRP>for <w DPS>our
<w NN1>Home <w NN1-VVB>Care <w NN2>services <w VVD-VVN>doubled <w AVP-PRP>over <w AT0>the
<w ORD>last <w CRD>twelve <w NN2>months<c PUN>.
<!-- ... -->
<s n=056>
<w PNP>I <w VVB>expect <w NN1-VVB>demand <w PRP>for <w DT0>this
<w NN1>service <w TO0>to <w VVI>continue <w TO0>to <w VVI>grow
<w AVP-PRP>over <w AT0>the <w AJ0>coming <w NN1>year<c PUN>.
</p>
<note id=a02nt001 n=2 type=orig>
<s n=057>
<w NN1>AIDS <w NN2>deaths<c PUN>: <w NP0>April <w CRD>1990 <c PUN>&mdash
<w NP0>March <w CRD>1991<c PUN>, <w NP0>UK <w NN1>total <c PUL>(<w NN1-NP0>CDSC
<w NN2>figures <c PUN>&mdash <w CRD>584 <w NP0>April <w CRD>1991<c PUN>.<c PUR>)
<s n=058>
<w DPS>Our <w NN1>Home <w NN1-VVB>Care <w NN2>teams <w VVD>saw <w CRD>141
<w NN0>AIDS <w AJ0-VVD>related <w NN2>deaths <w ORD>last <w NN1>year
</note>
Note the use of the n attribute to carry the original footnote
number in the above example.
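The link between a <ptr> element and the relocated note (or stage direction) it points to can be followed by matching target values against id values. The sketch below is illustrative: resolve_pointers is a hypothetical name, and identifiers are compared without regard to case, on the assumption suggested by the example above, where the target A02NT001 refers to the id a02nt001.

```python
import re

PTR = re.compile(r'<ptr\s+target="?([\w.-]+)"?\s*>', re.IGNORECASE)
# Any start-tag carrying an id attribute: captures element name and id value.
ID_ATTR = re.compile(r'<(\w+)\b[^>]*?\bid="?([\w.-]+)"?', re.IGNORECASE)

def resolve_pointers(text):
    """Map each pointer target to the name of the element bearing that id.

    Targets with no matching id map to None. Matching ignores case
    (an assumption, see the lead-in).
    """
    ids = {m.group(2).lower(): m.group(1).lower()
           for m in ID_ATTR.finditer(text)}
    return {m.group(1): ids.get(m.group(1).lower())
            for m in PTR.finditer(text)}

sample = ("<s n=053><w ORD>last <w NN1>year <ptr target=A02NT001><c PUN>.</p>"
          "<note id=a02nt001 n=2 type=orig><s n=057><w NN1>AIDS</note>")
print(resolve_pointers(sample))  # prints {'A02NT001': 'note'}
```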
Bibliographic citations or references within running texts may also
be marked, using the <bibl> element; in the present version of the
corpus this is done in some texts only.
For example:
<bibl>
<s n=1379>
<w NP0>Mordechai <w NP0>Chaim <w NP0>Rumkowski<c PUN>,
<w AJS>Eldest <w PRF>of <w AT0>the <w NN2>Jews <w PRP>in
<w AT0>the <w NN1-NP0>Lodz <w NN1>ghetto<c PUN>,
<w VVG>speaking <w PRP>in <w CRD>1942 </bibl>
Phrase-level elements
Phrase-level elements are elements which cannot appear directly
within a textual division, but must be contained by some other element.
In practice, this means they will be contained within an <s>
element.
Highlighted phrases
Typographic highlighting in the original may not be marked in the
transcript at all. Alternatively, highlighted phrases, and the kind of
highlighting used, may be recorded in one of two ways:
- using the global rend (rendition) attribute
- using the <hi> (highlighted) element
The former is used where the function of the highlighting is
clear, for example to mark a heading, and where the boundaries of the
highlighted phrase therefore coincide with the boundaries of some other
CDIF element. The latter is used where the function is not
clear, where the DTD does not provide a tag to identify the
feature concerned or where the highlighted phrase is not coterminous
with some other element.
When the <hi> element is used, its rend
attribute must be supplied. On all other CDIF elements, the
rend attribute is optional. Its value indicates the nature
of the highlighting used, e.g. italic font, quoted, small caps etc. A
list of the values used for this attribute is given in section
?? below.
It should be noted that the purpose of the rend
attribute is not to provide information adequate to the
needs of a typesetter, but simply to record some qualitative information
about the original. In particular, the present version of the corpus
includes no indication of size of type or style of writing.
Like all other phrase-level elements, each <hi> element must
be entirely contained by an <s> element. This implies that
where, for example, a bolded passage contains more than one sentence, or
an italicised phrase begins in one verse line and ends in another, the
<hi> element must be closed at the end of the enclosing element,
and then re-opened within the next.
For example, in the following four lines of verse, the first three
are rendered in italics, and the rend attribute is therefore specified
for each <l> element. In the fourth line, only the first few words are
in italics: a <hi> element must be used within the <l> to carry this
information.
<l part=u rend=it>
<s n=394><w PNP>It <w VBD>was <w CRD>one <w PRF>of <w AT0>a <w NN0>pair<c PUN>.
<s n=395><w DPS>Its <w AJ0>precious <w NN1>twin
</l>
<l part=u rend=it>
<s n=396><w VBD>was <w VVN>stolen <w PRP>by <w AT0>the <w NN2>soldiers<c PUN>.
<s n=397><w DT0>All <w AT0>the <w NN1>time
</l>
<l part=u r=it>
<s n=398><w DPS>her <w NN1>uncle <w VVD>stood <w AV0>there
<w VVG>clutching <w DT0>this <w CRD-PNI>one <w PRP>in
</l>
<l part=u>
<s n=399><hi rend=it> <w DPS>his <w AJ0>big <w NN1>fist </hi> <c PUN>&mdash <w AV0>so<c PUN>!
<s n=400><w PNP>She <w VDZ>does <w DT0>a little <w NN1>mime<c PUN>.
</l>
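The requirement that a <hi> element be closed at the end of its enclosing <s> and re-opened in the next can be sketched as follows. This is a simplified illustration: highlight_by_sentence is a hypothetical name, sentences are given as plain lists of words without <w> tags, and the highlighted span is given as word offsets counted across all sentences (start inclusive, end exclusive).

```python
def highlight_by_sentence(sentences, start, end, rend="it"):
    """Wrap words start..end-1 in <hi>, never crossing a sentence boundary."""
    marked = []
    offset = 0  # index of the current sentence's first word in the whole run
    for words in sentences:
        lo = max(start - offset, 0)
        hi = min(end - offset, len(words))
        if lo < hi:
            # Part of the span falls in this sentence: open and close <hi> here.
            parts = (words[:lo] + [f"<hi rend={rend}>"] + words[lo:hi]
                     + ["</hi>"] + words[hi:])
        else:
            parts = list(words)
        marked.append(" ".join(parts))
        offset += len(words)
    return marked

sentences = [["All", "the", "time"], ["her", "uncle", "stood"]]
for line in highlight_by_sentence(sentences, 1, 5):
    print(line)
# prints:
# All <hi rend=it> the time </hi>
# <hi rend=it> her uncle </hi> stood
```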
Miscellaneous phrase-level elements
The following miscellaneous phrase-level elements also appear within
<s> elements in written texts:
- <pb>
- marks the start of a new page in the original source; used to
indicate where e.g. articles in periodicals are split across several
pages.
- <lb>
- marks the start of a new (printed) line in the original source.
- <name>
- proper name of a person, place or institution.
- <salute>
- a formulaic greeting or form of address appearing at the start or
the end of a spoken or a written text.
In this example, the presence of a page break between two verse
lines is indicated by the <pb> element:
<l part=u>
<s n=1403>
<c PUN>&mdash <w CJC>and <w NN2>creditors <w VVB>grow <w AJ0>cruel<c PUN>,
</l>
<l part=u>
<pb n=75>
</l>
<l part=u>
<s n=1404>
<w AV0>so <w PNP>he <w VVZ>bows <w CJC>and <w NN2-VVZ>scrapes<c PUN>,
</l>
In the following example, the <lb> element has been used to mark
the position of line breaks in the source text, since they seem to be
taking the place of conventional punctuation:
<caption>
<s n=1503>
<w NN1>Man <w PRF>of <w AT0>the <w NN1>Year <lb> <w NN1>Design <lb> <w NN1>Design
<w NN1>Concept <lb> <w AJ0>Technical <w NN1>Innovation <w NN1>Safety
<w NN1-NP0>Achievement <lb> <w NP0>Environmental <w NN1-NP0>Contribution
<lb> <w NN1-VVG>Marketing <w NN1>Initiative <lb> <w NN1>Manufacturer <w PRP>in
<w NN1-NP0>Motorsport <lb> <w AJ0-NN1>Specialist <w NN1>Manufacturer
<w NN1>Mid-size <w NN1-NP0>Manufacturer <lb> <w AJ0>Large <w NN1>Manufacturer
</caption>
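Where <lb> has been recorded in this way, the original printed lines can be recovered by splitting at each <lb>. The following sketch assumes the simple tag-deletion approach to recovering text; the name split_at_lb is hypothetical.

```python
import re

ANY_TAG = re.compile(r"<[^>]+>")

def split_at_lb(markup):
    """Return the plain text of each printed line, splitting at <lb>."""
    segments = re.split(r"<lb\s*>", markup, flags=re.IGNORECASE)
    return [" ".join(ANY_TAG.sub("", segment).split()) for segment in segments]

sample = """<w NN1>Man <w PRF>of <w AT0>the <w NN1>Year <lb> <w NN1>Design <lb> <w NN1>Design
<w NN1>Concept"""
print(split_at_lb(sample))
# prints ['Man of the Year', 'Design', 'Design Concept']
```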
In the following example, the <salute> element has been used to
separate the addressee of a letter from the rest of the text:
<s n=0343>
<w VVB>Ask <w TO0>to <w VVI>see <w NN2>examples <w PRF>of <w DPS>their <w NN1>work
<w CJC>and <w NN1-VVB>contact <w NN2>references <w PRP>from <w DPS>their
<w AJ0-NN1>past <w NN2>projects<c PUN>.
<s n=0344>
<salute> <w NP0>JOHN <w NN0>DIBBLE <w NN1>Partner<c PUN>, <w NP0>Atlam
<w NN1>Design <w NN1-NP0>Partnership<c PUN>, <w NP0>Portland <w NP0>House<c PUN>,
<w NP0>Portland <w NP0>Street<c PUN>, <w NP0>Leamington <w NP0>Spa<c PUN>,
<w NP0>Warwickshire<c PUN>. </salute>