A Note on Anonymization

Lou Burnard

The following remarks attempt to summarize a brief discussion at OUCS of Chambers' notes on anonymization. They relate only to proper names of living individuals and (in some circumstances) organizations. In general, the only reason we could envisage for not retaining proper names of organizations or places is that they might help to identify living people; we thought that the chances of this are remote if course (2) or (3) below is adopted, and irrelevant if course (1) is adopted.

We identified the following three possible courses of action with respect to proper names:

1. Replace the general promise of complete confidentiality offered to contributors by a request for them to censor material in advance
2. Agree on some minimal level of information to be extracted from existing proper names and encode it consistently
3. Omit all proper names.

Taking these in reverse order:

(3) Requires no expansion to CDIF (each name would be represented by an entity reference) but some care by data capturers, who will have to decide whether or not to replace a proper name by an entity reference, assuming that names of historical or public figures are not to be anonymized. This may not always be obvious: if a letter refers to "Maggie", only knowledge of the context will indicate whether the recipient's aunty or an ex-prime minister is intended. Anonymizing ALL proper names seems to us to be overkill, and it also seriously compromises the usefulness of the resulting data.

(2) Consider the following variations:

    Mrs M. Turnpike Maggie Turnpike Margaret Maggie Aunty Mags

Leaving aside anaphoric references such as "she" or (arguably) "your aunt", which we assume can be left intact, we ask what residue of information we could extract from these while preserving anonymity. To some extent, linguistic features such as titles or honorifics (Mr, Mrs etc.) can be identified automatically.
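As a rough illustration of how titles might be picked out automatically, the following is a minimal sketch in Python. It is not part of any agreed procedure: the list of titles and the function name `split_honorific` are assumptions made for the example, and a real list would need to be settled in the guidelines.

```python
import re

# Hypothetical, non-exhaustive list of English titles (an assumption for this
# sketch; the note itself only mentions "Mr, Mrs etc.").
HONORIFICS = ("Mr", "Mrs", "Miss", "Ms", "Dr", "Prof", "Sir", "Lady", "Lord", "Rev")

# A known title at the start of the name, an optional full stop, then whitespace
# and the remainder of the name.
_TITLE_RE = re.compile(
    r"^(?P<title>" + "|".join(HONORIFICS) + r")\.?\s+(?P<rest>.+)$"
)


def split_honorific(name):
    """Return (title, rest); title is None when no listed honorific leads the name."""
    cleaned = name.strip()
    match = _TITLE_RE.match(cleaned)
    if match is None:
        return None, cleaned
    return match.group("title"), match.group("rest")


if __name__ == "__main__":
    for example in ("Mrs M. Turnpike", "Maggie Turnpike", "Aunty Mags"):
        print(example, "->", split_honorific(example))
```

Kinship terms such as "Aunty" are deliberately not treated as titles here; whether they should be is one of the judgments the guidelines would have to make.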
A tagging scheme that distinguishes forenames from last names, or even male, female and androgynous names, is also not hard to imagine. There seem to be three possible categories of information:

- identity of the person (i.e. the fact that all of the above refer to the same individual, if indeed they do)
- demographic characteristics of the person (e.g. gender, age, class)
- description of the way the person is referred to (e.g. register, level of informality etc.)

All of these are encodable as attributes of a element, e.g. we might have within the text: XXXX XXXX etc. with elsewhere (presumably in the header)

This would require a great deal of work on the part of the encoders, but would be relatively easy to check automatically. One approach would be simply to mark all potential proper names in a first pass, and then attempt to group together automatically all variants of the same name in a second editorial pass. Clearly several detailed value judgments would be needed, and a clear set of guidelines (e.g. on how to characterize the modes of naming) must be defined.

I've assumed here that the actual content of the name would be simply removed or replaced by XXXX. Another possibility might be to use a cipher such that e.g. 'Maggie' would always generate 'Xzygh1' in whatever situation the string appeared, so that some limited matching could still be carried out. It's debatable how useful this would be. Another alternative would be to have a dictionary in which comparable forms were comparably mapped (e.g. 'Margaret' is translated into 'Susan', 'Jones' into 'Brown' etc.). I don't even want to think about whether this would work for non-English proper names.

(1) Finally, we consider the first possibility. Chambers' note seems to suggest that even their current wording is not felt to be persuasive by some people, and it's clear that there will always be some who are unwilling to believe any assurances we offer.
We need to identify a course of action which will be both workable and demonstrably fair to people's desires to safeguard their privacy. On the other hand, if people are seriously concerned about invasion of privacy, they need not give the material to us, or can censor it in advance themselves. Self-censorship will give inconsistent results, but will at the same time provide very interesting data on which areas people are genuinely sensitive about.

Organizations donating large amounts of data (if there are any) could simply specify a desired anonymization policy rather than physically blot out the offending words, provided that this is implementable (e.g. 'remove all references to members of the Board' is not acceptable, whereas 'remove the following names wherever they appear: Blenkinsop, Burns, Thatcher...' is). Human nature being what it is, it's more likely that even they will prefer to apply the blue pencil themselves, which makes the job of data capture that much easier. As noted above, encoding material that has been suppressed for some reason is already possible within CDIF, and does not represent a great deal of additional encoding effort.

On balance, therefore, we incline to recommend this course. We suggest that potential donors should be made aware of the importance (to us) of retaining as much information about the form etc. of names as possible, and presented with a series of options that they may elect to adopt to protect their own privacy concerns:

1) they may leave information untouched (but, clearly, will not give us any material they know to be sensitive)
2) they may instruct us simply to remove all proper names other than those of historical or public figures
3) they may instruct us to remove specific proper names
4) they may themselves pre-edit the material, indicating any portions they wish to be suppressed
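To show why a policy of the form 'remove the following names wherever they appear' counts as implementable, here is a minimal sketch in Python of option 3. The function names `suppress_names` and `keyed_token`, the XXXX default, and the example key are all assumptions made for illustration. The keyed token stands in for the consistent cipher mentioned under course (2); it is a keyed hash, not the cipher itself, and it illustrates only the "same name, same output" property.

```python
import hashlib
import hmac
import re


def suppress_names(text, names, replacement=lambda name: "XXXX"):
    """Replace each whole-word occurrence of a listed name in text.

    `replacement` receives the matched name and returns its substitute.
    Assumes names begin and end with letters or digits, so that the word
    boundary test (\\b) behaves as intended.
    """
    if not names:
        return text
    # Longest names first, so a longer listed name is preferred over a
    # shorter listed name that it happens to contain.
    alternatives = sorted((re.escape(n) for n in names), key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")
    return pattern.sub(lambda m: replacement(m.group(0)), text)


def keyed_token(name, key=b"hypothetical-secret-key"):
    """Map a name to a stable opaque token: the same name and key always
    give the same token, so limited matching remains possible."""
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return "N" + digest[:6]


if __name__ == "__main__":
    donor_list = ["Blenkinsop", "Burns", "Thatcher"]
    sample = "Mr Burns wrote to Thatcher about Burnside."
    print(suppress_names(sample, donor_list))
    print(suppress_names(sample, donor_list, replacement=keyed_token))
```

With the default replacement, the sample sentence becomes "Mr XXXX wrote to XXXX about Burnside."; "Burnside" survives because only whole words are matched. A policy such as 'remove all references to members of the Board' cannot be reduced to a list like `donor_list` without human judgment, which is the distinction drawn above.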