2 The SARA protocol


The SARA protocol was designed for use with TCP, though any other network could be used. The only assumption made about the network is that it is capable of delivering null-terminated strings in the order they were sent.

All strings used as messages are variable-length ASCII strings. All message strings must be terminated by a null character (hex 00).
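The framing described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical and not part of SARA, and the sketch assumes that a single read from the network may deliver several messages, or only part of one.

```python
def encode_message(text):
    """Encode one message as ASCII bytes terminated by a null character."""
    if "\x00" in text:
        raise ValueError("a message may not contain a null character")
    return text.encode("ascii") + b"\x00"


def split_messages(buffer):
    """Split received bytes into complete messages and an unterminated tail.

    The tail should be kept and prepended to the next chunk received.
    """
    *complete, tail = buffer.split(b"\x00")
    return [part.decode("ascii") for part in complete], tail
```

A caller would append each chunk read from the socket to the saved tail and pass the result to split_messages again.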

All transactions consist of a message sent from the client to the server followed by a reply from the server to the client. Client messages begin with a keyword and may contain other data, depending on the keyword. Server responses begin with either OK or NO, followed by additional data depending on the message keyword.

There is one exception to this rule. Certain transactions are classified as interruptable. If a data package containing the string INT is available from the client socket during an interruptable transaction, then the transaction is halted, and the server writes the string NO ABORT to the socket. The client must specify the MSG_OOB flag when sending the INT string. In this case, there is one more client message than server replies. The behaviour of the server on receiving any data other than the string INT after a read and before a write is undefined.
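A minimal Python sketch of the client side of an interrupt is given below. The function names are hypothetical. The sketch assumes that the INT string is null-terminated like any other message, which this section does not state explicitly for out-of-band data.

```python
import socket


def send_interrupt(sock):
    """Send the INT string with the MSG_OOB flag set.

    Assumption: INT carries the usual null terminator.
    """
    return sock.send(b"INT\x00", socket.MSG_OOB)


def was_aborted(reply):
    """True if the server reports that the transaction was halted."""
    return reply == "NO ABORT"
```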

A SARA session consists of these phases:

  1. The client connects to the server, and the server accepts the call.
  2. The server tries to create a process to accept data packages from the client. If it cannot do this (say because memory is short on the server), then the server closes the socket.
  3. The user logs on.
  4. Set-up messages are exchanged.
  5. A client session takes place.
  6. The user logs off.
  7. The server closes the socket.

    Set-up messages are the messages used in phase 4 of this process.

    Once a connexion has been established, the server must receive packages regularly. If the time-out period elapses without any package being received, then the server will close the connexion. To keep the connexion alive, the client should send the command TIMER, though any package which is not a legal server command may be used. The client should not send "keep-alive" packages between a client read and a subsequent write, since these may be interpreted as interrupts, as noted above.

    Note that the time-out operates only while the server is waiting to receive packages. It does not prevent the server from spending a long time in a calculation.
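One way a client might schedule keep-alive packages is sketched below in Python. The class and its parameters are hypothetical: the time-out period is assumed to be known to the client, and margin is an arbitrary safety factor. The sketch suppresses keep-alives while a reply is pending, so that they cannot be taken for interrupts.

```python
class KeepAlive:
    """Decide when a keep-alive package such as TIMER should be sent."""

    def __init__(self, timeout_seconds, margin=0.5):
        # Send well before the server's time-out period elapses.
        self.interval = timeout_seconds * margin
        self.last_sent = 0.0
        self.in_transaction = False

    def message_sent(self, now):
        """Record that a message was sent and its reply is pending."""
        self.last_sent = now
        self.in_transaction = True

    def reply_received(self):
        """Record that the pending transaction has completed."""
        self.in_transaction = False

    def timer_due(self, now):
        """True if the connexion has been idle long enough."""
        if self.in_transaction:
            # Data sent now could be interpreted as an interrupt.
            return False
        return now - self.last_sent >= self.interval
```

In use, now would typically come from a monotonic clock such as time.monotonic().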

    The rest of this section documents all legal messages.

    2.1 BIB

    Any enquiry for bibliographic data starts with this message. The form is BIB textid, where textid is the three-character identifier of the text for which data is required.

    If data is available, the reply returned is in the form OK type num, where type indicates the type of data available, and num is the number of items available. Currently, the assigned types are 0 for written texts (two items, a title and a description), and 1 for spoken texts (multiple items, a title, followed by descriptions of speakers).

    If no bibliographic data is available for the text indicated, then the reply will be NO BIB.
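The BIB replies described above could be interpreted as in the following Python sketch. The function name is hypothetical, and the fields are assumed to be separated by blanks.

```python
BIB_TYPES = {0: "written", 1: "spoken"}


def parse_bib_reply(reply):
    """Return (kind, item_count), or None if the reply is NO BIB."""
    fields = reply.split()
    if fields == ["NO", "BIB"]:
        return None
    if len(fields) != 3 or fields[0] != "OK":
        raise ValueError(f"unexpected BIB reply: {reply!r}")
    data_type, count = int(fields[1]), int(fields[2])
    # Types other than 0 and 1 are not assigned in this section.
    return BIB_TYPES.get(data_type, "unknown"), count
```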

    2.2 BIBITEM

    This message returns a bibliographic string item. It has the form BIBITEM textid num, where textid is the three-character identifier of the text and num the number of the bibliographic item required.

    The reply is OK str, where str is the bibliographic data required.

    2.3 CSCORE

    Obtain a collocation score. It has the form CSCORE str num query where str is a search term, num is a number, and query is a CQL query expression.

    The server responds NO SYNTAX if the CQL query cannot be parsed. Otherwise it establishes the number of occurrences of the word str within num words of a solution to query. The reply is OK len where len is this number.

    2.4 DMATCH

    This call must follow a LOOKUP and has the form DMATCH num. The numth member of the wordlist created by LOOKUP is found, together with its frequency. The form of the reply is OK freq str1 (str2), where freq is the frequency of the matching word, str1 is the matching word in a form suitable for display (i.e. with character entity references replaced), and str2 the matching word in a form suitable for any subsequent queries using the same word (i.e. with character entity references unchanged).
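A DMATCH reply might be taken apart as in this Python sketch. It assumes that the parentheses around str2 are literal characters in the reply and that the fields are separated by single blanks; the function name is hypothetical.

```python
import re

# OK freq str1 (str2)
_DMATCH_REPLY = re.compile(r"OK (\d+) (.*) \((.*)\)")


def parse_dmatch_reply(reply):
    """Return (frequency, display_form, query_form)."""
    match = _DMATCH_REPLY.fullmatch(reply)
    if match is None:
        raise ValueError(f"unexpected DMATCH reply: {reply!r}")
    return int(match.group(1)), match.group(2), match.group(3)
```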

    2.5 DOWNLOAD

    This is used to request a file from the server. Version 0.930 of the server supports three possible filenames: the specific files elements.txt and header.txt, or the corpus description file. The first two are obsolescent and may be withdrawn in later releases. The usage and format of these files are described in section 2.39 below.

    The form of the message is DOWNLOAD file where file is either the literal header or an explicit filename. In the former case, a file with the corpus name and the extension .dsc will be downloaded. In the latter case, a file with the name given and the extension .txt will be downloaded.

    The server replies OK if the file is available and NO FILE if it is not. Subsequent LINE messages retrieve the file block by block.

    Note that downloadable files are stored on the server in Unix text file format, with records indicated by newline characters. When downloaded, they are automatically converted to PC text format, with record ends indicated by return/newline pairs. If files are to be moved between client and server machines by any other means (such as FTP), a similar conversion is required; for example by specifying ASCII file transfer rather than binary.

    2.6 FILTER

    Assign a filter to a query. The call has the form FILTER query name, where query is a query number and name the name of a filter, chosen from the following list. Filters are used to process individual solutions before returning them to the client.

    The following filters are currently available:

    2.7 GET

    This call gets a single solution for a query. It has the form GET query num scope where query identifies the query, using the identifier supplied by a previous call of QNAME. The solution is solution number num in sequence. scope is either the name of one or more SGML elements to be used to bound the solution or an integer indicating the number of words of context required.

    The reply is OK text num offset len pos str where:

    The argument scope will usually be a single SGML element name. However, a series of names may be supplied separated by commas. In this case the element whose last start-tag before the hit is latest in the file will be used to bound the solution. If scope is an integer, then the amount downloaded will be the smallest collection of elements containing the hit such that at least scope words precede the hit.

    If the text file is not available, the response OK will still be generated along with a solution text stating that the text is unavailable. The text identifier returned will be valid and num will be set to -1.

    The reply NO SOL is returned if the solution is not available for any other reason.
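The two kinds of scope argument can be illustrated with a small Python helper that builds a GET message. The helper is hypothetical, and it assumes that a series of element names is joined by commas without surrounding blanks. The element names in the usage note are examples only.

```python
def build_get(query, num, scope):
    """Build a GET message.

    scope is an integer (words of context), a single SGML element name,
    or a sequence of element names.
    """
    if isinstance(scope, int):
        scope_text = str(scope)
    else:
        names = [scope] if isinstance(scope, str) else list(scope)
        scope_text = ",".join(names)
    return f"GET {query} {num} {scope_text}"
```

For instance, build_get("q1", 3, ["s", "p"]) yields the message GET q1 3 s,p, where q1 stands for a name previously returned by QNAME.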

    2.8 GET1SOL

    This call has the form GET1SOL text gi att where text is a short text name, gi is an SGML element name and att the name of an attribute. The server looks for the first gi element in the document text that has an attribute called att, and returns the value of this attribute.

    If the attribute name is set to - then the content of the first element gi in document text is returned.

    The form of the return is OK str where str is the value.

    2.9 GETHEAD

    GETHEAD is used to extract data from a given position for browsing. The format is GETHEAD textid offset num where:

    The return string is OK newpos jump newd bTag str, unless the text cannot be found, in which case the reply is NO TEXT.

    If the server finds content at the specified offset, it reads all the content into str, sets bTag false, reads and discards any end-tags following the content, adjusting the depth count accordingly but never allowing it to become negative. newpos is set to the new offset and newd to the new depth.

    If the server finds an SGML start tag, at the specified offset, it reads the tag itself into str. It sets variable bTag true and sets newd one greater than the current depth count. It sets newpos to the offset of the end of the tag. Finally it sets jump to be the offset at which the element being opened ends.

    Empty elements are treated as content. w-tags and s-tags are treated as content. str is trimmed of leading and trailing blanks; its spacing is normalised and its characters mapped.

    2.10 GETHEAD2

    This call is used to locate a string whose file position is known in a string returned from GETHEAD. The location cannot be deduced without such a call because of the tidying GETHEAD performs on solutions.

    The format is GETHEAD2 txt offset i0 i1 where txt and offset were the values used to extract the solution in GETHEAD and i0 and i1 are the coordinates of the solution returned by LOC. The server calculates the offset and length of the solution in the string and replies OK offset length.

    2.11 GETPOS

    The argument is a string str, an l-word for which the server is to find all possible parts of speech. The reply takes the form OK num s1 ... sn where num is the number of solutions and s1 ... sn are the different POS codes found.

    2.12 GETSC

    The arguments are a string str, which must be the corpus name, and a number num, which is the absolute number of a corpus text. The return is OK str avail, where str is the name of the text with this number and avail is 1 if the text is available or 0 if it is not. If the corpus name is wrong, the reply is NOP.

    2.13 INFO

    This message allows the client and server to exchange information. It is the only message that can legally be sent before the user logs on. The form of the message is INFO num, where num is the number of the code page that should be used to translate character references. The following code pages may be specified:

    The response is OK num dv sv cv name where

    2.14 LINE

    This message gets the next block from a file requested by the DOWNLOAD message. The reply is OK str where str is the next block of the file; its first three bytes should be discarded. When there are no more blocks, the reply is NO MORE.
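The DOWNLOAD and LINE exchange might be driven as in the Python sketch below. Here transact is a hypothetical callable that sends one message and returns the server's reply as a string without its null terminator. The sketch assumes that the block follows the prefix "OK " directly.

```python
def download(transact, name):
    """Fetch a file block by block and return its contents as one string."""
    reply = transact(f"DOWNLOAD {name}")
    if not reply.startswith("OK"):
        raise FileNotFoundError(reply)
    blocks = []
    while True:
        reply = transact("LINE")
        if reply == "NO MORE":
            break
        if not reply.startswith("OK "):
            raise ValueError(f"unexpected LINE reply: {reply!r}")
        block = reply[len("OK "):]
        # The first three bytes of each block are discarded.
        blocks.append(block[3:])
    return "".join(blocks)
```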

    2.15 LOC

    This message finds the location of a solution (unlike GET, which gets the text). It has the form LOC query num, where query is the query name and num the number of the desired solution. The form of the return is OK nt nc nw where nt is the text number, nc is the character offset of the solution and nw is the word number.

    LOC and GETHEAD2 were implemented specifically to allow the text of a hit to be marked while examining a text in the tree-browse window. The normal way to recover solutions is to call GET repeatedly.

    2.16 LOG

    Used to log on. Before a successful LOG the system replies NO LOGIN to any message.

    The two string arguments are the user's name and password.

    The response to a correct login is OK followed by a copyright message. The response to a bad login is NO BADLOG. In response to failure of the last allowed login attempt, the server may close the connection without a reply.

    2.17 LOGOUT

    Used to log off. There is no reply.

    2.18 LOOKUP

    Look up entries in the dictionary. The sole argument is a string pattern and the reply is OK num where num is the number of words in the dictionary that begin with the string pattern. DMATCH can be used to retrieve the words.

    2.19 MAXLENGTH

    Tells the server the maximum length of a solution that may be returned. The desired limit is the argument, and the server responds OK num where num is the limit actually set, which may be smaller than requested.

    2.20 MOTD

    Gets the Message of the Day from the server. The response is OK str where str is the message.

    2.21 OPEN

    Open a saved query. The argument is the query name. The reply is OK num if the operation is successful and the file contains num solutions; it is NO if the file cannot be opened.

    2.22 PWD

    Change password. The two arguments are the old and the new password. The response is OK if the change is allowed.

    2.23 QNAME

    Allocate a query name. There are no arguments. The response is OK query where query is the name.

    2.24 RGET

    This behaves just like DMATCH but recovers a word from the last regular expression word lookup set up by RLOOKUP (qv).

    2.25 RLOOKUP

    This message finds all words matching a given regular expression. The call is RLOOKUP regexp where regexp is the regular expression. The reply is OK num if there are num solutions. RGET may be used to recover individual solutions.

    The RLOOKUP call stores its result in an internal buffer of limited size. If there are more solutions than can be stored, the string returned will be NO TOOMANY.

    The following one-character regular expressions match a single character:

    The following rules may be used to construct regular expressions:

    The order of precedence of operators at the same parenthesis level is [ ] (character classes), then * + ? (closures), then concatenation, then | (alternation).

    The regular expression evaluator can detect multiple pattern matches. Thus tea? will match both te and tea.
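The precedence rules and the tea? example can be tried out with Python's re module, used here purely as a stand-in. It is not the SARA evaluator and its syntax may differ in other respects, but it applies the same order of precedence to closures, concatenation and alternation.

```python
import re


def matches(pattern, word):
    """True if the whole word matches the pattern (Python re semantics)."""
    return re.fullmatch(pattern, word) is not None


# The closure ? applies only to the preceding character a.
print(matches("tea?", "te"), matches("tea?", "tea"))
# Alternation binds most loosely: ab|cd* means (ab)|(c(d*)).
print(matches("ab|cd*", "cdd"), matches("ab|cd*", "abd"))
```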

    2.26 SAVE

    This message changes the saved state of a named solution set. The format is SAVE flag query where flag is 0 to turn off saving and 1 to turn it on, and query is the query name.

    2.27 SOLVE

    This is the call used to solve a CQL query. The form of the call is SOLVE query str where query is the query name and str is the query expression. The client must use a query name allocated by QNAME. The format of CQL queries is documented in 2.32 below. The reply is one of the following: OK num ntxt if there are num solutions occurring in ntxt texts; NO 0 if there are no solutions; NO SYNTAX if the query expression could not be parsed; NO SPACE if the server cannot save the solution because its disk is full; NO STREAMS if there are not enough free streams to solve the query; or NO FILES if the system has more open queries than it can handle.

    The SOLVE message is an interruptable transaction.

    Individual solutions may be retrieved using GET.
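The possible SOLVE replies could be handled as in the following Python sketch. The names are hypothetical. NO ABORT is included because SOLVE is interruptable.

```python
SOLVE_FAILURES = {
    "SYNTAX": "the query expression could not be parsed",
    "SPACE": "the server cannot save the solution; its disk is full",
    "STREAMS": "not enough free streams to solve the query",
    "FILES": "the system has more open queries than it can handle",
    "ABORT": "the transaction was interrupted by the client",
}


class SolveError(Exception):
    """Raised when the server reports that a query was not solved."""


def parse_solve_reply(reply):
    """Return (solutions, texts); NO 0 is reported as (0, 0)."""
    fields = reply.split()
    if fields == ["NO", "0"]:
        return 0, 0
    if len(fields) == 3 and fields[0] == "OK":
        return int(fields[1]), int(fields[2])
    if len(fields) == 2 and fields[0] == "NO" and fields[1] in SOLVE_FAILURES:
        raise SolveError(SOLVE_FAILURES[fields[1]])
    raise ValueError(f"unexpected SOLVE reply: {reply!r}")
```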

    2.28 SQTABLE

    Used to thin a solution set to a specified subset. The format is SQTABLE q1 q2 num1 ... numx where q1 is the initial query, q2 is the thinned result, and num1 to numx are the indices of the solutions to be retained.

    The reply is OK num ntxt, where the new set of solutions has num members in ntxt texts.

    2.29 SUBCORPUS

    Gets details of a subcorpus. SUBCORPUS str is the required form, but str must be the corpus name in the current implementation. The reply is OK num m z where num is the number of words in the dictionary, m the mean frequency, and z the standard deviation of frequency.

    2.30 THIN

    Cuts a solution set down using one of a number of criteria. The form of the message is THIN name newname method size seed where name is the query name, newname is a name for the thinned query set, method is an integer specifying how the thinning is to be performed, and size is the desired number of solutions. The following methods are defined:

    The reply is OK num ntxt where num is the number of solutions after thinning, and ntxt is the number of texts represented in the new solution set. If the new query file cannot be created, the reply is NO FILES.

    2.31 WEB

    If the server advertises a web site, this message will return the reply OK url, where url is the URL of the web site. If it does not, then the reply will be NO.

    2.32 CQL query structure

    This section defines the syntax and semantics of queries expressed in the SARA Corpus Query Language (CQL).

    2.33 Atomic queries

    Atomic queries are not made up of smaller queries, though they may have components. The following items are regarded as atomic queries:

    2.34 Unary operators

    The following unary operators may be applied to query terms:

    2.35 SGML queries

    An SGML query looks superficially like SGML markup. Thus it has the form

     <[/]element attributes>

    If the / character is present, end-tags are matched; otherwise start-tags are matched. attributes must be a list of terms of the form name=value where name is the name of the attribute and value is its value. See Section 2.38 for details of how attributes are indexed.

    Note that the server does not perform an exact text match on SGML markup. It will find any start tag in which the gi is the same as the element specified and the attributes have the stated values. The attributes may appear in a different order and may be mixed with other attributes not used in the query. The solutions are the offsets of the start symbol of the matching SGML tags.

    Variables may be used as attribute values. Variables take the form of an integer preceded by an underline character, for example:

     _12

    When variables are used in this way they have the effect of restricting solutions to those where the value replacing each occurrence of the same variable is the same.

    2.36 Combining queries

    2.36.1 Concatenation

    Two queries written in sequence match occasions where a solution to the first query is directly followed by a solution to the second.

    In a concatenation of queries the following extra notation may be used:

    2.36.2 Disjunction

    The term query1|query2 matches anything that is a solution to either query1 or query2.

    2.37 Scoping queries

    In a scoped query, all components of the solution must fall within a given scope. The scope may be specified as an SGML element or as a number of words.

    An SGML scoped query is written q/qs where q is a query join and qs is an SGML query as specified in 2.35.

    A numerically scoped query is written query/num where num is the number of words. When counting words, punctuation marks are included, but SGML markup is excluded.

    A query join is a sequence q1 * ... * qn where each qi is a query. If a query join is encountered outside a scope, the largest possible SGML scope (equal to a single text) is assumed.

    A join made with operator * must occur with elements in the order specified. If the operator # is used the order is ignored.
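The notation for joins and scopes can be illustrated by composing query strings in Python. The helper names are hypothetical, q1 and q2 are placeholders for real queries, and the spacing around the operators simply follows the way the forms are written in this section.

```python
def join(queries, ordered=True):
    """Join queries with * (order significant) or # (order ignored)."""
    operator = " * " if ordered else " # "
    return operator.join(queries)


def scope(query_join, bound):
    """Scope a join by an SGML query string or by a number of words."""
    return f"{query_join}/{bound}"


print(scope(join(["q1", "q2"]), "<s>"))               # ordered, SGML scope
print(scope(join(["q1", "q2"], ordered=False), 5))    # unordered, 5 words
```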

    2.38 How attributes are indexed

    The general approach to indexing attributes is to index the value of each attribute as a text string. There are a few cases, however, where special treatment is required. These special cases are explained in this section. The treatment of attribute values in SARA emerged in an extremely ad hoc fashion. Although it is not difficult to see in retrospect how the whole apparatus could be simplified, this has not been done in the present version.

    Certain attributes are declared as plural. When an attribute is plural, the indexer treats its value as a string of values, decomposes the string into a list of values and indexes each of these separately.

    Individual attributes are treated differently, depending on their declared values, as follows:

    The terminology is somewhat misleading. An ID attribute can have any name. A REFID attribute can have any name but must refer to an attribute of another element called ID.

    The values of ID and similar attributes are composed from a text identifier, the root and an integer using a shifting set of rules, radices etc.

    This is not the place to explain the reason for storing ID values as position data.

    2.39 File formats

    As of version 0.930 of the client/server software, all elements of the SARA package (that is, the indexer, the server and the client) use one common file to access information about a BNC-style corpus. The files cdif.txt, elements.txt and header.txt are obsolescent and may be deleted, as may various parameter files only used by the indexer.

    This new file is called the corpus description file. Its name must be the corpus name and its extension must be dsc. The following corpus names are in use:

    The server tells the client the name of the corpus it uses as part of the INFO exchange (see 2.13).

    The description file consists of a number of lines each with a keyword and an argument string. Arguments are separated by blanks. Lines beginning with the character # are treated as comments.

    The rest of this section lists the keywords supported.

    2.40 VER n

    n is the version number multiplied by 100. This must be the first line of the file and may not be preceded by a comment. With software version 0.930 all header versions have been set to 1.00.
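A reader for the general layout of a description file might look like the Python sketch below. It is not the parser used by SARA. It assumes that keywords are case-insensitive, since the examples in this section use both cases, and it returns each argument string unsplit.

```python
def parse_description(text):
    """Return (version, entries) for the text of a corpus description file.

    entries is a list of (KEYWORD, argument_string) pairs.
    """
    lines = text.splitlines()
    first = lines[0].split() if lines else []
    if len(first) != 2 or first[0].upper() != "VER":
        raise ValueError("the first line must be a VER statement")
    # VER carries the version number multiplied by 100.
    version = int(first[1]) / 100
    entries = []
    for line in lines[1:]:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        keyword, _, argument = line.strip().partition(" ")
        entries.append((keyword.upper(), argument.strip()))
    return version, entries
```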

    2.41 ATT s1 s2 n d desc

    s1 is an attribute of s2. n must be one of the following strings:

    Type                Value    Use of desc field
    CDATA                0        Unused
    CAT                  1        Alternatives as in ATTRIB declaration 
    NUMBER               2        Unused
    NAME                 3        Unused
    REFID                4        Referenced element
    ID                   5        Root
    NULL                 6        Unused
    MULTID               7        Root
    MULTIDREFS           8        Root
    The terminology of this table is explained in Section 2.38.

    Example:

    att default CAT YES|NO Whether a default is available
    Note that attributes must be listed after the element to which they belong. Attributes listed before the first ELT statement are treated as global, that is, they may belong to any element.

    2.42 CHAR name n

    Declares a character entity. For example:

    char yacute 253

    If the number n is present then it must denote the character value of the character on any system on which SARA will run. In this case the indexer will replace the entity by the character value and the entity will never be encountered.

    If the number is not present, as in

    char yacute 
    then the treatment of the character depends on which code page is chosen by the server. If the chosen code page maps the entity then it will be mapped to the given character in server replies; it must not be mapped in client messages. If the entity is completely unmapped then it will appear in SGML entity notation.
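The difference between the two forms of CHAR statement can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the character value can be passed to chr, which holds for values such as 253 that coincide with Unicode code points. Entities declared without a number, or not declared at all, are left in SGML entity notation.

```python
import re


def parse_char(line):
    """Return (entity_name, character_value_or_None) for a CHAR statement."""
    fields = line.split()
    if len(fields) not in (2, 3) or fields[0].lower() != "char":
        raise ValueError(f"not a CHAR statement: {line!r}")
    value = int(fields[2]) if len(fields) == 3 else None
    return fields[1], value


def replace_entities(text, table):
    """Replace entities that have a character value; leave the rest alone."""
    def substitute(match):
        value = table.get(match.group(1))
        return match.group(0) if value is None else chr(value)
    return re.sub(r"&(\w+);", substitute, text)
```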

    2.43 ELT gi n h desc

    gi is an element of type n. The values for n are:

    By an s-type element is meant an element with omitted end tags where the end tag is always found immediately before the next start tag of the same kind, or at the end of the document if there are no more such start tags.

    h is b if the tag appears in text bodies, h if it only appears in headers.

    desc is a brief description of the element.

    Example:

    elt locale e h description of a place where speech recorded

    2.44 ENT name

    Declares a special entity and its printed form. For example:

    ent alien [alien]

    Special entities are always displayed in their printed form when non-SGML output is required.

    2.45 ITEM

    see MENU

    2.46 MENU k shortname

    List a menu to be used to solicit the value of a parameter. All menus must begin with a line such as

    MENU 0 spoken_class

    What follows depends on the value of the parameter k. If it is 0 then a number of item statements such as

    ITEM 2 C1
    ITEM 3 C2
    ITEM 4 DE
    give the actual parameter values and their menu equivalents.

    If k is non-zero then it must refer to an enumeration declared in a TYPE statement. Note however that the value k=1 is reserved.

    Note: as of software version 0.930 any attribute may have a menu. Declaring an empty menu for an attribute suppresses its display.

    A MENU statement must directly follow the attribute it serves.

    2.47 POS n desc

    A POS statement lists a part-of-speech code. For example:

    POS VBD past form of the verb "BE" , i.e. WAS, WERE

    2.48 PUN n desc

    A PUN statement lists a part-of-speech code that is classed as punctuation. For example:

    PUN PUL left bracket (i.e. ( or [ )

    2.49 TYPE n

    n must be an integer greater than 1. The items that follow the statement are just the same as after a MENU statement. The only point of the TYPE statement is that it saves having to list the same enumeration of items (e.g. ISO country codes) in several places. The integer n is used in a MENU statement to show that the items from type n should be used.

    [Note that the break table in the client is still hard-coded]

