2 The SARA protocol

The SARA protocol was designed for use with TCP, though any other network could be used. The only assumption made about the network is that it is capable of delivering null-terminated strings in the order they were sent.

All strings used as messages are variable-length ASCII strings. All message strings must be terminated by a null character (hex 00).

All transactions consist of a message sent from the client to the server followed by a reply from the server to the client. Client messages begin with a keyword and may contain other data, depending on the keyword. Server responses begin with either OK or NO, followed by additional data depending on the message keyword.
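The framing and reply rules above can be sketched as follows. This is a minimal illustration rather than the SARA client's actual code: the helper names encode_message, split_messages and parse_reply are hypothetical, and it assumes that the keyword and its data are separated by single spaces.

```python
def encode_message(keyword, *args):
    """Build a null-terminated ASCII message from a keyword and its data."""
    text = " ".join([keyword, *map(str, args)])
    return text.encode("ascii") + b"\x00"


def split_messages(buffer):
    """Split received bytes into complete messages plus any unfinished tail."""
    *complete, tail = buffer.split(b"\x00")
    return [m.decode("ascii") for m in complete], tail


def parse_reply(reply):
    """Return (ok, data): ok is True for an OK reply, False for a NO reply."""
    status, _, data = reply.partition(" ")
    if status not in ("OK", "NO"):
        raise ValueError(f"unexpected reply: {reply!r}")
    return status == "OK", data
```

Keeping the unfinished tail lets a caller append the next chunk read from the socket and split again, since a reply may arrive in several pieces.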

There is one exception to this rule. Certain transactions are classified as interruptable. If a data package containing the string INT is available from the client socket during an interruptable transaction, then the transaction is halted, and the server writes the string NO ABORT to the socket. In this case, there is one more client message than server replies. The MSG_OOB flag must be specified when sending the string. The behaviour of the server on receiving any data other than the string INT after a read and before a write is undefined.

A SARA session consists of these phases:

1. The client connects to the server, and the server accepts the call.
2. The server tries to create a process to accept data packages from the client. If it cannot do this (say because memory is short on the server), then the server closes the socket.
3. The user logs on.
4. Set-up messages are exchanged.
5. A client session takes place.
6. The user logs off.
7. The server closes the socket.

Set-up messages are messages that are used in phase 4 of this process.

Once a connexion has been established, the server must receive packages regularly. If the time-out period elapses without any package being received, then the server will close the connexion. To keep the connexion alive, the client should send the command TIMER, though any package which is not a legal server command may be used. The client should not send "keep-alive" packages between a client read and a subsequent write, since these may be interpreted as interrupts, as noted above.

Note that the time-out operates only while the server is waiting to receive packages. It does not prevent the server from spending a long time in a calculation.
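One way a client might schedule its keep-alive packages is sketched below. The class name KeepAlive, the safety_factor margin and the awaiting_reply flag are all assumptions made for illustration; the protocol only requires that some package arrives within the time-out period. The sketch also takes the restriction on keep-alive packages to mean that none should be sent while a request is outstanding.

```python
import time


class KeepAlive:
    """Tracks when the next TIMER message should be sent."""

    def __init__(self, timeout_seconds, safety_factor=0.5, clock=time.monotonic):
        # safety_factor is an assumed margin; the protocol does not prescribe one.
        self.interval = timeout_seconds * safety_factor
        self.clock = clock
        self.last_sent = clock()
        # Set to True after sending a request, False once its reply is read.
        self.awaiting_reply = False

    def note_sent(self):
        """Record that a package of any kind has just been sent."""
        self.last_sent = self.clock()

    def should_send_timer(self):
        """True if a TIMER is due and no transaction is in progress."""
        if self.awaiting_reply:
            return False  # a package sent now could be taken for an interrupt
        return self.clock() - self.last_sent >= self.interval
```

The time-out value itself would come from the num field of the INFO reply.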

The rest of this section documents all legal messages.

2.1 `BIB`

Any enquiry for bibliographic data starts with this message. The form is BIB textid, where textid is the three-character identifier of the text for which data is required.

If data is available, the reply returned is in the form OK type num, where type indicates the type of data available, and num is the number of items available. Currently, the assigned types are 0 for written texts (two items, a title and a description), and 1 for spoken texts (multiple items, a title, followed by descriptions of speakers).

If no bibliographic data is available for the text indicated, then the reply will be NO BIB.
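A client might interpret the BIB reply along these lines. The function name parse_bib_reply and the BIB_TYPES table are hypothetical, and the sketch assumes the reply fields are separated by blanks.

```python
BIB_TYPES = {0: "written", 1: "spoken"}  # the two types currently assigned


def parse_bib_reply(reply):
    """Parse 'OK type num'; return None when the reply is 'NO BIB'."""
    if reply == "NO BIB":
        return None
    status, type_code, count = reply.split()
    if status != "OK":
        raise ValueError(f"unexpected reply: {reply!r}")
    kind = BIB_TYPES.get(int(type_code), "unknown")
    return kind, int(count)
```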

2.2 `BIBITEM`

This message returns a bibliographic string item. It has the form BIBITEM textid num, where textid is the three character identifier of the text and num the number of the bibliographic item required.

The reply is OK str, where str is the bibliographic data required.

2.3 `CSCORE`

Obtain a collocation score. It has the form CSCORE str num query where str is a search term, num is a number, and query is a CQL query expression.

The server responds NO SYNTAX if the CQL query cannot be parsed. Otherwise it establishes the number of occurrences of the word str within num words of a solution to query. The reply is OK len where len is this number.

2.4 `DMATCH`

This call must follow a LOOKUP and has the form DMATCH num. The numth member of the wordlist created by LOOKUP is found, together with its frequency. The form of the reply is OK freq str1 (str2), where freq is the frequency of the matching word, str1 is the matching word in a form suitable for display (i.e. with character entity references replaced), and str2 the matching word in a form suitable for any subsequent queries using the same word (i.e. with character entity references unchanged).
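The reply form OK freq str1 (str2) could be taken apart as in the following sketch. The function name parse_dmatch_reply is hypothetical, and the sample reply used in the usage note is invented for illustration.

```python
import re

# freq, then the display form, then the query form in parentheses
_DMATCH_REPLY = re.compile(r"OK (\d+) (.*) \((.*)\)")


def parse_dmatch_reply(reply):
    """Return (frequency, display_form, query_form) from a DMATCH reply."""
    match = _DMATCH_REPLY.fullmatch(reply)
    if match is None:
        raise ValueError(f"unexpected reply: {reply!r}")
    freq, display_form, query_form = match.groups()
    return int(freq), display_form, query_form
```

For a reply such as OK 12 café (caf&eacute;), the display form would be shown to the user while the query form would be kept for any later SOLVE call.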

2.5 `DOWNLOAD`

This is used to request a file from the server. Version 0.930 of the server supports three possible filenames: the specific files elements.txt and header.txt, or the corpus description file. The first two are obsolescent and may be withdrawn in later releases. The usage and format of these files are described in section 2.39 below.

The form of the message is DOWNLOAD file where file is either the literal header or an explicit filename. In the former case, a file with the corpus name and the extension .dsc will be downloaded. In the latter case, a file with the name given and the extension .txt will be downloaded.

The server replies OK if the file is available and NO FILE if it is not. Subsequent LINE messages retrieve the file block by block.

Note that downloadable files are stored on the server in Unix text file format, with records indicated by newline characters. When downloaded, they are automatically converted to PC text format, with record ends indicated by return/newline pairs. If files are to be moved between client and server machines by any other means (such as FTP), a similar conversion is required; for example by specifying ASCII file transfer rather than binary.
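The record-end conversion described above amounts to the following. This is only a sketch of the transformation, not the server's code, and the function names are hypothetical.

```python
def unix_to_pc(data: bytes) -> bytes:
    """Convert newline-terminated records to return/newline pairs."""
    # Normalise first so that existing return/newline pairs are not doubled.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def pc_to_unix(data: bytes) -> bytes:
    """Convert return/newline pairs back to single newlines."""
    return data.replace(b"\r\n", b"\n")
```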

2.6 `FILTER`

Assign a filter to a query. The call has the form FILTER query name, where query is a query number and name the name of a filter, chosen from the following list. Filters are used to process individual solutions before returning them to the client.

The following filters are currently available:

ADJPOS Trim solution so that no partial POS codes are transmitted.
ADJSGML Trim solution so that no partial SGML tags are transmitted.
CMAP Map characters as defined by CHAR set-up messages
NOPOSX Delete all old-style POS entities.
NORMCR0 Turn single linefeed characters into carriage-return linefeed sequences.
NORMSPACE Normalise all white space so that sequences of white space characters become single spaces.
NOSGMLX Remove all SGML markup.

2.7 `GET`

This call gets a single solution for a query. It has the form GET query num scope where query identifies the query, using the identifier supplied by a previous call of QNAME. The solution is solution number num in sequence. scope is either the name of one or more SGML elements to be used to bound the solution or an integer indicating the number of words of context required.

The reply is OK text num offset len pos str where:

text is the text identifier
num is the number of the s-unit containing the solution
offset is the offset of the solution in the returned text
len is the length of the solution
pos is the part-of-speech code of the solution (obsolete)
str is the solution text

The argument scope will usually be a single SGML element name. However, a series of names may be supplied separated by commas. In this case the element whose last start-tag before the hit is latest in the file will be used to bound the solution. If scope is an integer, then the amount downloaded will be the smallest collection of elements containing the hit such that at least scope words precede the hit.

If the text file is not available, the response OK will still be generated along with a solution text stating that the text is unavailable. The text identifier returned will be valid and num will be set to -1.

The reply NO SOL is returned if the solution is not available for any other reason.
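The GET reply might be unpacked as sketched below. The names Solution, parse_get_reply and hit_text are hypothetical. The sketch assumes that the fields are separated by single blanks, that only the final field may itself contain blanks, and that offset counts characters from the start of str beginning at zero; none of these details is stated in this section.

```python
from typing import NamedTuple, Optional


class Solution(NamedTuple):
    text: str      # text identifier
    s_unit: int    # number of the s-unit, or -1 if the text is unavailable
    offset: int    # offset of the solution in content
    length: int    # length of the solution
    pos: str       # part-of-speech code (obsolete)
    content: str   # the solution text


def parse_get_reply(reply: str) -> Optional[Solution]:
    """Parse 'OK text num offset len pos str'; return None for 'NO SOL'."""
    if reply == "NO SOL":
        return None
    fields = reply.split(" ", 6)
    if len(fields) != 7 or fields[0] != "OK":
        raise ValueError(f"unexpected reply: {reply!r}")
    _, text, num, offset, length, pos, content = fields
    return Solution(text, int(num), int(offset), int(length), pos, content)


def hit_text(solution: Solution) -> str:
    """Extract the hit itself from the returned context."""
    return solution.content[solution.offset:solution.offset + solution.length]
```

A caller would check s_unit for -1 before displaying the result, since in that case content holds a notice that the text is unavailable.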

2.8 `GET1SOL`

This call has the form GET1SOL text gi att where text is a short text name, gi is an SGML element name and att the name of an attribute. The server looks for the first gi element in the document text that has an attribute called att, and returns the value of this attribute.

If the attribute name is set to - then the content of the first element gi in document text is returned.

The form of the return is OK str where str is the value.

2.9 `GETHEAD`

GETHEAD is used to extract data from a given position for browsing. The format is GETHEAD textid offset num where:

textid is the text id
offset is the offset of the desired information
num initialises the depth count

The return string is OK newpos jump newd bTag str, unless the text cannot be found, in which case the reply is NO TEXT.

If the server finds content at the specified offset, it reads all the content into str, sets bTag false, reads and discards any end-tags following the content, adjusting the depth count accordingly but never allowing it to become negative. newpos is set to the new offset and newd to the new depth.

If the server finds an SGML start tag, at the specified offset, it reads the tag itself into str. It sets variable bTag true and sets newd one greater than the current depth count. It sets newpos to the offset of the end of the tag. Finally it sets jump to be the offset at which the element being opened ends.

Empty elements are treated as content. w-tags and s-tags are treated as content. str is trimmed of leading and trailing blanks; its spacing is normalised and its characters mapped.

2.10 `GETHEAD2`

This call is used to locate a string whose file position is known in a string returned from GETHEAD. The location cannot be deduced without such a call because of the tidying GETHEAD performs on solutions.

The format is GETHEAD2 txt offset i0 i1 where txt and offset were the values used to extract the solution in GETHEAD and i0 and i1 are the coordinates of the solution returned by LOC. The server calculates the offset and length of the solution in the string and replies OK offset length.

2.11 `GETPOS`

The argument is a string str, an l-word for which the server is to find all possible parts of speech. The reply takes the form OK num s1 ... sn where num is the number of solutions and s1 ... sn are the different POS codes found.

2.12 `GETSC`

The arguments are a string str, which must be the corpus name, and a number num, which is the absolute number of a corpus text. The return is OK str avail, where str is the name of the text with this number and avail is 1 if the text is available or 0 if it is not. If the corpus name is wrong, the reply is NOP.

2.13 `INFO`

This message allows the client and server to exchange information. It is the only message that can legally be sent before the user logs on. The form of the message is INFO num, where num is the number of the code page that should be used to translate character references. The following code pages may be specified:

850 Windows ANSI

The response is OK num dv sv cv name where

num is the server time-out value in seconds;
dv is the version number of the corpus description file. (The DOWNLOAD message may be used to obtain files whose versions have changed)
sv is the version number of the server
cv is the smallest acceptable client version number
name is the corpus name

2.14 `LINE`

This message gets the next block from a file requested by the DOWNLOAD message. The reply is OK str where str is the next block of the file; its first three bytes should be discarded. When there are no more blocks, the reply is NO MORE.

2.15 `LOC`

This message finds the location of a solution (unlike GET, which gets the text). It has the form LOC query num, where query is the query name and num the number of the desired solution. The form of the return is OK nt nc nw where nt is the text number, nc is the character offset of the solution and nw is the word number.

LOC and GETHEAD2 were implemented specifically to allow the text of a hit to be marked while examining a text in the tree-browse window. The normal way to recover solutions is to call GET repeatedly.

2.16 `LOG`

Used to log on. Before a successful LOG the system replies NO LOGIN to any message.

The two string arguments are the user's name and password.

The response to a correct login is OK followed by a copyright message. The response to a bad login is NO BADLOG. In response to failure of the last allowed login attempt, the server may close the connection without a reply.

2.17 `LOGOUT`

Used to log off. There is no reply.

2.18 `LOOKUP`

Look up entries in the dictionary. The sole argument is a string pattern and the reply is OK num where num is the number of words in the dictionary that begin with the string pattern. DMATCH can be used to retrieve the words.

2.19 `MAXLENGTH`

Tells the server the maximum length of a solution that may be returned. The desired limit is the argument, and the server responds OK num where num is the limit actually set, which may be smaller than requested.

2.20 `MOTD`

Gets the Message of the Day from the server. The response is OK str where str is the message.

2.21 `OPEN`

Open a saved query. The argument is the query name. The reply is OK num if the operation is successful and the file contains num solutions; it is NO if the file cannot be opened.

2.22 `PWD`

Change password. The two arguments are the old and the new password. The response is OK if the change is allowed.

2.23 `QNAME`

Allocate a query name. There are no arguments. The response is OK query where query is the name.

2.24 `RGET`

This behaves just like DMATCH but recovers a word from the last regular expression word lookup set up by RLOOKUP (qv).

2.25 `RLOOKUP`

This message finds all words matching a given regular expression. The call is RLOOKUP regexp where regexp is the regular expression. The reply is OK num if there are num solutions. RGET may be used to recover individual solutions.

The RLOOKUP call stores its result in an internal buffer of limited size. If there are more solutions than can be stored, the string returned will be NO TOOMANY.

The following one-character regular expressions match a single character:

char An ordinary character (not one of the special characters discussed below) is a one-character regular expression that matches that character.
escape-char A backslash followed by any special character is a one-character regular expression that matches the special character itself. The special characters are + . * [ \ (plus, period, asterisk, left square bracket, and backslash, respectively), which are always special, except when they appear within square brackets.
period A period is a one-character regular expression that matches any character.
bracketed string A non-empty string of characters enclosed in square brackets is a one-character regular expression that matches any one character in that string. If, however, the first character of the string is a circumflex, the one-character regular expression matches any character other than the remaining characters in the string. The circumflex has this special meaning only if it occurs first in the string. The minus-sign (-) may be used to indicate a range of consecutive ASCII characters; for example, [0-9] is equivalent to [0123456789]. The minus-sign loses this special meaning if it occurs first (or following an initial circumflex) or last in the string. The right square bracket does not terminate such a string if it occurs first (or following an initial circumflex); that is, []a-f] matches either a single right square bracket or one of the letters a, b, c, d, e or f.

The following rules may be used to construct regular expressions:

star * A regular expression followed by a star is a regular expression that matches zero or more occurrences of the one-character regular expression.
plus + A regular expression followed by a plus is a regular expression that matches one or more occurrences of the one-character regular expression.
query ? A regular expression followed by a question mark is a regular expression that matches zero or one occurrences of the one-character regular expression.
concatenation The concatenation of two or more regular expressions is a regular expression that matches the concatenation of the strings matched by each component of the regular expression.
parens ( ) A regular expression enclosed in parentheses matches whatever the enclosed regular expression matches.
alternation Two regular expressions separated by the alternation or disjunction symbol (vertical bar or pipe) form a regular expression that matches anything that matches either of the expressions.

The order of precedence of operators at the same parenthesis level is [ ] (character classes), then * + ? (closures), then concatenation, then | (alternation).

The regular expression evaluator can detect multiple pattern matches. Thus tea? will match both te and tea.
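The operators described above behave much like their counterparts in Python's re module, which is used here purely as a stand-in for the server's evaluator. The two syntaxes are not guaranteed to agree in every detail, and the word list and the function name matching_words are invented for illustration. The sketch assumes that a pattern must match a whole dictionary word.

```python
import re


def matching_words(pattern, wordlist):
    """Return the words that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [word for word in wordlist if compiled.fullmatch(word)]


words = ["te", "tea", "teas", "]", "c", "g", "ab", "cdd"]

# Optional final character: both te and tea match.
print(matching_words("tea?", words))

# A right square bracket placed first in a bracketed string is literal.
print(matching_words("[]a-f]", words))

# Closure binds tighter than concatenation, which binds tighter than |.
print(matching_words("ab|cd*", words))
```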

2.26 `SAVE`

This message changes the saved state of a named solution set. The format is SAVE flag query where flag is 0 to turn off saving and 1 to turn it on, and query is the query name.

2.27 `SOLVE`

This is the call used to solve a CQL query. The form of the call is SOLVE query str where query is the query name and str is the query expression. The client must use a query name allocated by QNAME. The format of CQL queries is documented in 2.32 below. The reply is one of the following:

OK num ntxt There are num solutions occurring in ntxt texts.
NO 0 There are no solutions.
NO SYNTAX The query expression could not be parsed.
NO SPACE The server cannot save the solution because its disk is full.
NO STREAMS There are not enough free streams to solve the query.
NO FILES The system has more open queries than it can handle.

The SOLVE message is an interruptable transaction.

Individual solutions may be retrieved using GET.
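A client could classify the replies to SOLVE as in the sketch below. SolveError and parse_solve_reply are hypothetical names. The ABORT entry reflects the NO ABORT reply that an interrupted transaction produces.

```python
class SolveError(Exception):
    """Raised when the server could not solve the query."""


SOLVE_ERRORS = {
    "SYNTAX": "the query expression could not be parsed",
    "SPACE": "the server's disk is full",
    "STREAMS": "not enough free streams to solve the query",
    "FILES": "the system has more open queries than it can handle",
    "ABORT": "the transaction was interrupted by the client",
}


def parse_solve_reply(reply):
    """Return (num_solutions, num_texts), raising SolveError on failure."""
    status, _, rest = reply.partition(" ")
    if status == "OK":
        num, ntxt = rest.split()
        return int(num), int(ntxt)
    if status == "NO" and rest == "0":
        return 0, 0  # no solutions is not treated as an error here
    raise SolveError(SOLVE_ERRORS.get(rest, rest or "unspecified failure"))
```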

2.28 `SQTABLE`

Used to thin a solution set to a specified subset. The format is SQTABLE q1 q2 num1 ... numx where q1 is the initial query, q2 is the thinned result, and num1 to numx are the indices of the solutions to be retained.

The reply is OK num ntxt, where the new set of solutions has num members in ntxt texts.

2.29 `SUBCORPUS`

Gets details of a subcorpus. SUBCORPUS str is the required form, but str must be the corpus name in the current implementation. The reply is OK num m z where num is the number of words in the dictionary, m the mean frequency, and z the standard deviation of frequency.

2.30 `THIN`

Cuts a solution set down using one of a number of criteria. The form of the message is THIN name newname method size seed where name is the query name, newname is a name for the thinned query set, method is an integer specifying how the thinning is to be performed, size is the desired number of solutions, and seed is a seed for random selection. The following methods are defined:

0 truncate the solution set to the required length
1 randomly select the required number of solutions, using seed to seed the random number generator
2 select one solution from each text; in this case, the size parameter is ignored.

The reply is OK num ntxt where num is the number of solutions after thinning, and ntxt is the number of texts represented in the new solution set. If the new query file cannot be created, the reply is NO FILES.
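The three thinning methods can be modelled on the client side as below. This is a sketch of the behaviour described, not the server's algorithm. The function name thin is hypothetical, Python's random module stands in for whatever generator the server uses, and method 2 is assumed to keep the first solution found in each text.

```python
import random


def thin(solutions, method, size, seed=0):
    """Thin a list of (text_id, solution) pairs; return (kept, ntxt)."""
    if method == 0:
        # Truncate the solution set to the required length.
        kept = solutions[:size]
    elif method == 1:
        # Random selection, reproducible for a given seed.
        rng = random.Random(seed)
        count = min(size, len(solutions))
        chosen = sorted(rng.sample(range(len(solutions)), count))
        kept = [solutions[i] for i in chosen]
    elif method == 2:
        # One solution per text; size is ignored.
        seen = set()
        kept = []
        for text_id, solution in solutions:
            if text_id not in seen:
                seen.add(text_id)
                kept.append((text_id, solution))
    else:
        raise ValueError(f"unknown thinning method: {method}")
    ntxt = len({text_id for text_id, _ in kept})
    return kept, ntxt
```

The pair returned corresponds to the num and ntxt fields of the reply, with num being the length of the kept list.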

2.31 `WEB`

If the server advertises a web site, this message will return the reply OK url, where url is the URL of the web site. If it does not, then the reply will be NO.

2.32 CQL query structure

This section defines the syntax and semantics of queries expressed in the SARA Corpus Query Language (CQL).

2.33 Atomic queries

Atomic queries are not made up of smaller queries, though they may have components. The following items are regarded as atomic queries:

Words An uninterrupted sequence of alphanumeric characters or a single punctuation symbol is recognized as a single word by the server. Its solution set is the set of all occurrences of the word. Any sequence of characters will, however, be recognized as a word if it is enclosed in double quotes.
L-words An L-word is a sequence w=p where w is a word and p a POS code. The solutions are the set of occurrences of the word with the appropriate code.
Phrase queries A phrase query is one or more words in single quotation marks. A single quotation mark may be included by escaping it with a preceding backslash. A phrase query is analysed into L-words before being evaluated. Since there may be several L-words to a word or several words to an L-word, this analysis is necessary if certain combinations are to be found. For example, the phrase query 'in addition to' is equivalent to "in addition to" whereas 'in homage to' is "in" "homage" "to". The query 'can\'t' is equivalent to "ca" "n't".
File queries An existing solution set named set can be included in a query using the term :set.
Regular expressions A regular expression may be included in a query using the syntax {regexp} where regexp is the expression. The solutions are all solutions to words that match the expression. The syntax of regular expressions was given in 2.25.

If a regular expression is likely to have fewer than 10 matches in the dictionary, it is more efficient to specify all matches explicitly in a disjunction. However, there is a limit (currently 20) on the number of query streams permitted per process, which makes large disjunctions impossible. A regular expression query tests successive terms from the dictionary, merging in solutions for those which match. Although slower than a disjunctive query, this process uses only one query stream.
Bracketed expressions Any CQL query placed in parentheses will be treated as an atomic query.

2.34 Unary operators

The following unary operators may be applied to query terms:

Case-sensitivity If a word is preceded by a dollar sign, matching is case-sensitive. Normally word-matching ignores case. Because the index is case-insensitive, case-sensitive searches require the server to look in the corpus texts. Hence they are fairly costly.
Header searching If a query term is preceded by a commercial at sign, matching is carried out within headers as well as within the bodies of texts.

2.35 SGML queries

An SGML query looks superficially like SGML markup. Thus it has the form

 <[/]element attributes>

where attributes is a sequence of terms of the form name=value, name being the name of an attribute and value its value (see 2.38).

Note that the server does not perform an exact text match on SGML markup. It will find any start tag in which the gi is the same as the element specified and the attributes have the stated values. The attributes may appear in a different order and may be mixed with other attributes not used in the query. The solutions are the offsets of the start symbol of the matching SGML tags.

Variables may be used as attribute values. Variables take the form of an integer preceded by an underline character, for example

_12

2.36 Combining queries

2.36.1 Concatenation

Two queries written in sequence match occasions where a solution to the first query is directly followed by a solution to the second.

In a concatenation of queries the following extra notation may be used:

a query preceded by a question mark matches zero or one solutions to the query
a query preceded by an exclamation mark matches anything that is not a solution to the query
a single underline character matches anything.

2.36.2 Disjunction

The term query1|query2 matches anything that is a solution to either query1 or query2.

2.37 Scoping queries

In a scoped query, all components of the solution must fall within a given scope. The scope may be specified as an SGML element or as a number of words.

An SGML scoped query is written q/qs where q is a query join and qs is an SGML query as specified in 2.35.

A numerically scoped query is written query/num where num is the number of words. When counting words, punctuation marks are included, but SGML markup is excluded.

A query join is a sequence q1 * ... * qn where each qi is a query. If a query join is encountered outside a scope, the largest possible SGML scope (equal to a single text) is assumed.

A join made with operator * must occur with elements in the order specified. If the operator # is used the order is ignored.

2.38 How attributes are indexed

The general approach to indexing attributes is to index the value of each attribute as a text string. There are a few cases, however, where special treatment is required. These special cases are explained in this section. The treatment of attribute values in SARA emerged in an extremely ad hoc fashion. Although it is not difficult to see in retrospect how the whole apparatus could be simplified, this has not been done in the present version.

Certain attributes are declared as plural. When an attribute is plural, the indexer treats its value as a string of values, decomposes the string into a list of values and indexes each of these separately.

Individual attributes are treated differently, depending on their declared values, as follows:

CDATA The attribute value is indexed just as it is.
CAT The attribute value is indexed in upper case.
NUMBER The attribute value is indexed in upper case.
NAME The attribute value is indexed in upper case.
ID The attribute root is indexed. The value is stored as position data.
REFID The attribute root is indexed. The value is stored as position data.
NULL Suppresses indexing altogether.
MULTID The attribute root is indexed. The value is stored as position data.
MULTIDREF The attribute root is indexed. The value is stored as position data.
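The simpler cases in the list above can be expressed as a small function. This is a sketch only: the name index_terms is hypothetical, plural values are assumed to be separated by white space (the separator is not stated here), and the ID-like types are left unimplemented because their root and position handling is not specified in this section.

```python
UPPER_CASE_TYPES = {"CAT", "NUMBER", "NAME"}
POSITION_TYPES = {"ID", "REFID", "MULTID", "MULTIDREF"}


def index_terms(value, declared_type, plural=False):
    """Return the strings that would be indexed for one attribute value."""
    if declared_type == "NULL":
        return []  # indexing is suppressed altogether
    if declared_type in POSITION_TYPES:
        raise NotImplementedError("root and position handling is not covered")
    # A plural attribute is decomposed into a list of separate values.
    values = value.split() if plural else [value]
    if declared_type in UPPER_CASE_TYPES:
        return [v.upper() for v in values]
    if declared_type == "CDATA":
        return values  # indexed just as it is
    raise ValueError(f"unknown attribute type: {declared_type}")
```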

The terminology is somewhat misleading. An ID attribute can have any name. A REFID attribute can have any name but must refer to an attribute of another element called ID.

The values of ID and similar attributes are composed from a text identifier, the root and an integer using a shifting set of rules, radices etc.

This is not the place to explain the reason for storing ID values as position data.

2.39 File formats

As of version 0.930 of the client/server software, all elements of the SARA package (that is, the indexer, the server and the client) use one common file to access information about a BNC-style corpus. The files cdif.txt, elements.txt and header.txt are obsolescent and may be deleted, as may various parameter files only used by the indexer.

This new file is called the corpus description file. Its name must be the corpus name and its extension must be dsc. The following corpus names are in use:

bnc1 The main corpus
bncsam1 The old C6 sampler
bncsam2 The new C6 sampler

The server tells the client the name of the corpus it uses as part of the INFO exchange (see 2.13).

The description file consists of a number of lines each with a keyword and an argument string. Arguments are separated by blanks. Lines beginning with the character # are treated as comments.

The rest of this section lists the keywords supported.

2.40 `VER n`

n is the version number multiplied by 100. This must be the first line of the file and may not be preceded by a comment. With software version 0.930 all header versions have been set to 1.00.
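A reader for the description file might start as sketched below. The function name parse_description is hypothetical. Keywords are folded to upper case because the examples in this section show them in both cases, and blank lines are skipped; both choices are assumptions rather than documented behaviour.

```python
def parse_description(text):
    """Return (version, statements) from the text of a description file."""
    lines = text.splitlines()
    first = lines[0].split() if lines else []
    if len(first) != 2 or first[0].upper() != "VER":
        raise ValueError("the first line must be a VER statement")
    version = int(first[1]) / 100  # VER holds the version number times 100
    statements = []
    for line in lines[1:]:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines (an assumption) and comments
        keyword, _, argument = line.strip().partition(" ")
        statements.append((keyword.upper(), argument.strip()))
    return version, statements
```

Each statement is returned as a keyword and its raw argument string, leaving the keyword-specific fields to be split by whatever code handles that keyword.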

2.41 `ATT s1 s2 n d desc`

s1 is an attribute of s2. n must be one of the following strings:

Type                Value    Use of desc field
CDATA                0        Unused
CAT                  1        Alternatives as in ATTRIB declaration 
NUMBER               2        Unused
NAME                 3        Unused
REFID                4        Referenced element
ID                   5        Root
NULL                 6        Unused
MULTID               7        Root
MULTIDREFS           8        Root

These types are described in 2.38.

Example:

att default CAT YES|NO Whether a default is available

2.42 `CHAR name n`

Declares a character entity. For example:

char yacute 253

If the number n is present then it must denote the character value of the character on any system on which SARA will run. In this case the indexer will replace the entity by the character value and the entity will never be encountered.

If the number is not present, as in

char yacute

not

2.43 `ELT gi n h desc`

gi is an element of type n. The values for n are:

0 empty element
e non-empty element
s s-type element

By an s-type element is meant an element with omitted end tags where the end tag is always found immediately before the next start tag of the same kind, or at the end of the document if there are no more such start tags.

h is b if the tag appears in text bodies, h if it only appears in headers.

desc is a brief description of the element.

Example:

elt locale e h description of a place where speech recorded

2.44 `ENT name`

Declares a special entity and its printed form. For example:

ent alien [alien]

Special entities are always displayed in their printed form when non-SGML output is required.

2.45 `ITEM`

see MENU

2.46 `MENU k shortname`

List a menu to be used to solicit the value of a parameter. All menus must begin with a line such as

MENU 0 spoken_class

What follows depends on the value of the parameter k. If it is 0, then a number of ITEM statements follow, such as

ITEM 2 C1
ITEM 3 C2
ITEM 4 DE

If k is non-zero then it must refer to an enumeration declared in a TYPE statement. Note however that the value k=1 is reserved.

Note: as of software version 0.930 any attribute may have a menu. Declaring an empty menu for an attribute suppresses its display.

A MENU statement must directly follow the attribute it serves.

2.47 `POS n desc`

A POS statement lists a part-of-speech code. For example:

POS VBD past form of the verb "BE" , i.e. WAS, WERE

2.48 `PUN n desc`

A PUN statement lists a part-of-speech code that is classed as punctuation. For example:

PUN PUL left bracket (i.e. ( or [ )

2.49 `TYPE n`

n must be an integer greater than 1. The items that follow the statement are just the same as after a MENU statement. The only point of the type statement is that it saves having to list the same enumeration of items (eg ISO country codes) in several places. The integer n is used in a MENU statement to show that the items from type n should be used.

[Note that the break table in the client is still hard-coded]