Documentation

Search & index

Development

Tutorial¶

deduce is a rule-based de-identification method for clinical text written in Dutch, which finds and removes information in one or more categories of interest (e.g. person names, names of institutions, locations). In principle, deduce can work ‘out of the box’, however, based on both scientific research and personal experience, deduce is unlikely to remove all sensitive information when no effort goes into some customization. This tutorial should help you reach that goal. Along with basic steps to get started and highlights of some features, further in this tutorial, we describe how to tailor deduce to your specific data.

It’s useful to note that from version 2.0.0, deduce is built using docdeid(docs, GitHub), a small framework that helps build de-identifiers. Before you start customizing deduce, checking the docdeid docs will probably make it easier still.

In case you get stuck with applying or modifying deduce, its always possible to ask for help, by creating an issue in our issue tracker!

Installation¶

pip install deduce

Getting started¶

The basic way to use deduce, is to pass text to the deidentify method of a Deduce object:

from deduce import Deduce

deduce = Deduce()

text = (
    "betreft: Jan Jansen, bsn 111222333, patnr 000334433. De patient J. Jansen is 64 jaar oud en woonachtig in "
    "Utrecht. Hij werd op 10 oktober 2018 door arts Peter de Visser ontslagen van de kliniek van het UMCU. "
    "Voor nazorg kan hij worden bereikt via j.JNSEN.123@gmail.com of (06)12345678."
)

doc = deduce.deidentify(text)

The output is available in the Document object:

from pprint import pprint

pprint(doc.annotations)

AnnotationSet({
    Annotation(text="(06)12345678", start_char=272, end_char=284, tag="telefoonnummer"),
    Annotation(text="111222333", start_char=25, end_char=34, tag="bsn"),
    Annotation(text="Peter de Visser", start_char=153, end_char=168, tag="persoon"),
    Annotation(text="j.JNSEN.123@gmail.com", start_char=247, end_char=268, tag="email"),
    Annotation(text="patient J. Jansen", start_char=56, end_char=73, tag="patient"),
    Annotation(text="Jan Jansen", start_char=9, end_char=19, tag="patient"),
    Annotation(text="10 oktober 2018", start_char=127, end_char=142, tag="datum"),
    Annotation(text="64", start_char=77, end_char=79, tag="leeftijd"),
    Annotation(text="000334433", start_char=42, end_char=51, tag="id"),
    Annotation(text="Utrecht", start_char=106, end_char=113, tag="locatie"),
    Annotation(text="UMCU", start_char=202, end_char=206, tag="instelling"),
})

print(doc.deidentified_text)

"""betreft: [PERSOON-1], bsn [BSN-1], patnr [ID-1]. De [PERSOON-1] is [LEEFTIJD-1] jaar oud en woonachtig in 
[LOCATIE-1]. Hij werd op [DATUM-1] door arts [PERSOON-2] ontslagen van de kliniek van het [INSTELLING-1]. 
Voor nazorg kan hij worden bereikt via [EMAIL-1] of [TELEFOONNUMMER-1]."""

Additionally, if the names of the patient are known, they may be added as metadata, where they will be picked up by deduce:

from deduce.person import Person

patient = Person(first_names=["Jan"], initials="JJ", surname="Jansen")
doc = deduce.deidentify(text, metadata={'patient': patient})

print (doc.deidentified_text)

"""betreft: [PATIENT], bsn [BSN-1], patnr [ID-1]. De [PATIENT] is [LEEFTIJD-1] jaar oud en woonachtig in 
[LOCATIE-1]. Hij werd op [DATUM-1] door arts [PERSOON-2] ontslagen van de kliniek van het [INSTELLING-1]. 
Voor nazorg kan hij worden bereikt via [EMAIL-1] of [TELEFOONNUMMER-1]."""

As you can see, adding known names keeps references to [PATIENT] in text. It also increases recall, as not all known names are contained in the lookup lists.

Included components¶

A docdeid de-identifier is made up of document processors, such as annotators, annotation processors, and redactors, that are applied sequentially in a pipeline. The most important components that make up deduce are described below.

Annotators¶

The Annotator is responsible for tagging pieces of information in the text as sensitive information that needs to be removed. deduce includes various annotators, described below:

Group	Annotator Name	Annotator Type	Explanation
names	prefix_with_initial	`deduce.annotator.TokenPatternAnnotator`	Matches a prefix followed by initial(s)
	prefix_with_interfix	`deduce.annotator.TokenPatternAnnotator`	Matches a prefix followed by an interfix and something that resembles a name
	prefix_with_name	`deduce.annotator.TokenPatternAnnotator`	Matches a prefix followed by something that resembles a name
	interfix_with_name	`deduce.annotator.TokenPatternAnnotator`	Matches an interfix followed by something that resembles a name
	initial_with_name	`deduce.annotator.TokenPatternAnnotator`	Matches an initial followed by something that resembles a name
	initial_interfix	`deduce.annotator.TokenPatternAnnotator`	Matches an initial followed by an interfix and something that resembles a name
	first_name_lookup	`docdeid.process.MultiTokenLookupAnnotator`	Lookup based on first names from Voornamenbank (Meertens Instituut)
	surname_lookup	`docdeid.process.MultiTokenLookupAnnotator`	Lookup based on surnames from Familienamenbank (Meertens Instituut)
	patient_name	`deduce.annotator.PatientNameAnnotator`	Custom logic to match patient name, if supplied in document metadata
	name_context	`deduce.annotator.ContextAnnotator`	Matches names based on annotations found above, with the following context patterns: `interfix_right`: An interfix and something that resembles a name, when preceded by a detected initial or name `initial_left`: An initial, when followed by a detected initial, name or interfix `naam_left`: Something that resembles a name, when followed by a name `naam_right`: Something that resembles a name, when preceded by a name `prefix_left`: A prefix, when followed by a prefix, initial, name or interfix
	eponymous_disease	`docdeid.process.MultiTokenLookupAnnotator`	Lookup based on eponymous diseases, which will be tagged with `pseudo_name` and removed later (along with any overlap)
locations	placename	`docdeid.process.MultiTokenLookupAnnotator`	Lookup based on a compiled list of regions, provinces, municipalities and residences
	street_pattern	`docdeid.process.RegexpAnnotator`	Matches streetnames based on a pattern (ending in straat, plein, dam, etc.)
	street_lookup	`docdeid.process.MultiTokenLookupAnnotator`	Lookup based on a list of streetnames from Basisadministratie Gemeenten
	housenumber	`deduce.annotator.ContextAnnotator`	Matches housenumber and housenumberletters, based on the following context patterns: `housenumber_right`: a 1-4 digit number, preceded by a streetname `housenumber_housenumberletter_right`: a 1-4 digit number and a single letter, preceded by a streetname `housenumberletter_right`: a single letter, preceded by a housenumber
	postal_code	`docdeid.process.RegexpAnnotator`	Matches Dutch postal codes, i.e. four digits followed by two letters
	postbus	`docdeid.process.RegexpAnnotator`	Matches postbus, i.e. ‘Postbus’ followed by a 1-5 digit number, optionally with periods between them.
institution	hospital	`docdeid.process.MultiTokenLookupAnnotator`	Lookup based on a list of hospitals.
	institution	`docdeid.process.MultiTokenLookupAnnotator`	Lookup based on a list of healthcare institutions, based on Zorgkaart Nederland.
dates	date_dmy_1	`docdeid.process.RegexpAnnotator`	Matches dates in dmy format, e.g. 01-01-2012
	date_dmy_2	`docdeid.process.RegexpAnnotator`	Matches dates in dmy format, e.g. 01 jan 2012
	date_ymd_1	`docdeid.process.RegexpAnnotator`	Matches dates in ymd format, e.g. 2012-01-01
	date_ymd_2	`docdeid.process.RegexpAnnotator`	Matches dates in ymd format, e.g. 2012 jan 01
ages	age	`deduce.annotator.RegexpPseudoAnnotator`	Matches ages based on a number of digit patterns followed by jaar/jaar oud. Excludes matches that are preceded/followed by one of the `pre_pseudo` / `post_pseudo` words, e.g. ‘sinds 10 jaar`
identifiers	bsn	`deduce.annotator.BsnAnnotator`	Matches Dutch social security numbers (BSN), based on a 9-digit pattern that also passes the ‘elfproef’
	identifier	`docdeid.process.RegexpAnnotator`	Matches any 7+ digit number as identifier
phone_numbers	phone	`deduce.annotator.PhoneNumberAnnotator`	Matches phone numbers, based on regular expression pattern, optionally with a digit too few or a digit too much (common typos)
email_addresses	email	`docdeid.process.RegexpAnnotator`	Matches e-mail addresses, based on regular expression pattern
urls	url	`docdeid.process.RegexpAnnotator`	Matches urls, based on regular expression pattern

It’s possible to add, remove, apply subsets, or to implement custom annotators, those options are described further down under customizing deduce.

Other processors¶

In addition to annotators, a docdeid de-identifier contains annotation processors, which do some operation to the set of annotations generated previously, and redactors, which take the annotation and replace them in the text. Other processors included in deduce are listed below:

Name	Group	Description
person_annotation_converter	names	Maps name tags to either PERSON or PATIENT, and removes overlap with ‘pseudo_name’.
remove_street_tags	locations	Removes any matched street names that are not followed by a housenumber
clean_street_tags	locations	Cleans up street tags, e.g. straat+huisnummer -> locatie
overlap_resolver	post_processing	Makes sure overlap among annotations is resolved.
merge_adjacent_annotations	post_processing	If there are any adjacent annotations with the same tag, they are merged into a single annotation.
redactor	post_processing	Takes care of replacing the annotated PHIs with `[TAG]` (e.g. `[LOCATION-1]`, `[DATE-2]`)

Lookup sets¶

In order to match tokens to known identifiable words or concepts, deduce has the following builtin lookup sets:

Name	Size	Examples
prefix	45	bc., dhr., mijnheer
initial	54	Q, I, U
interfix	44	van de, von, v/d
first_name	14690	Martin, Alco, Wieke
interfix_surname	2384	Rijke, Butter, Agtmaal
surname	10346	Kosters, Hilderink, Kogelman
hospital	9283	Oude en Nieuwe Gasthuis, sint Jans zkh., Dijklander
hospital_abbr	21	UMCG, WKZ, PMC
healthcare_institution	244342	Gezondheidscentrum Wesselerbrink, Fysiotherapie Heer, Ergotherapie Tilburg-Waalwyk eo.
placename	12049	De Plaats, Diefdijk (U), Het Haantje (DR)
street	769569	Ds. Van Diemenstraat, Jac. v den Eyndestr, Matenstr
eponymous_disease	22512	tumor van Brucellosis, Lobomycosis reactie, syndroom van Alagille
common_word	1008	al, tuin, brengen
medical_term	6939	bevattingsvermogen, iliacaal, oor
stop_word	101	kan, heb, dat

Customizing deduce¶

We highly recommend making some effort to customize deduce, as even some basic effort will almost surely increase accuracy. Below are outlined some ways to achieve this, including: making changes to the config, adding/removing custom pipeline components, and modifying the builtin lookup sets.

Adding a custom config¶

A default base_config.json (source on GitHub) file is packaged with deduce. Among with some basic settings, it defines all annotators (also listed above). Override settings, by providing an additional user config to Deduce, either as a file or as a dict:

from deduce import Deduce

deduce = Deduce(config='my_own_config.json')
deduce = Deduce(config={'redactor_open_char': '**', 'redactor_close_char': '**'})

This will only override settings that are explicitly set in the user config, all other settings are kept as is. If you want to add or delete annotators (e.g. changing regular expressions), it’s easiest to make a copy of base_config.json, and load it as follows:

from deduce import Deduce

deduce = Deduce(load_base_config=False, config='my_own_config.json')

Note that you will now miss out on any updates to the base config that are packaged with new versions of Deduce. For that reason, a better way to add/remove processors is to interact with Deduce.processors directly after creating the model.

Using `disabled` keyword to disable components¶

It’s possible to disable specific (groups of) annotators or processors when deidentifying a text. For example, to apply all annotators, except those in the identifiers group:

from deduce import Deduce

deduce = Deduce()
deduce.deidentify(text, disabled={'identifiers'})

Or, to disable one specific date annotator in the dates group, but keeping the other date patterns:

from deduce import Deduce

deduce = Deduce()
deduce.deidentify("text", disabled={'date_dmy_1'})

Using `enabled` keyword¶

Although it’s also possible to enable only some processors, this is only useful in a limited amount of cases. You must manually specify the groups, individual annotators, and postprocessors to have a sensible output. For example, to de-identify only e-mail addresses, use:

from deduce import Deduce

deduce = Deduce()
deduce.deidentify("text", enabled={
    'email-addresses', # annotator group, with annotators:
    'email', 
    'post_processing', # post processing group, with processors:
    'overlap_resolver',
    'merge_adjacent_annotations',
    'redactor'
})

The following example however will apply no annotators, as the email annotator is enabled, but its’ group email-addresses is not:

from deduce import Deduce

deduce = Deduce()
deduce.deidentify("text", enabled={'email'})

Implementing custom components¶

It’s possible to implement the following custom components, Annotator, AnnotationProcessor, Redactor and Tokenizer. This is done by implementing the abstract classes defined in the docdeid package, which is described here: docdeid docs - docdeid components.

In our case, we can add or remove custom document processors by interacting with the deduce.processors attribute directly:

from deduce import Deduce

deduce = Deduce()

# remove date annotators
del deduce.processors['dates']

# add another annotator
deduce.processors.add_processor( 
    'some_new_category', 
    MyCustomAnnotator(), 
    position=0
) 

Note that by default, processors are applied in the order they are added to the pipeline. To prevent a new annotator being added after the post_processing group (meaning the annotations would not be redacted in the text), use the position keyword of the add_processor method, as in the example above.

Changing tokenizer¶

There might be a case where you want to add a custom annotator to deduce that requires its own tokenizing logic. Replacing the builtin tokenizer is not recommended, as builtin annotators depend on it, but it’s possible to add more tokenizers as follows:

from deduce import Deduce

deduce = Deduce()
deduce.tokenizers['my_custom_tokenizer'] = MyCustomTokenizer() # make sure this implements abstract docdeid.tokenize.Tokenizer

Then annotators can use:

import docdeid as dd

def annotate(doc: dd.Document):
    tokens = doc.get_tokens("my_custom_tokenizer")

Tailoring lookup structures¶

Updating the builtin lookup sets and tries is a very useful and straightforward way to tailor deduce. Changes can be made directly from the Deduce.lookup_structs attribute, as such:

from deduce import Deduce

deduce = Deduce()

# sets
deduce.lookup_structs['first_names'].add_items_from_iterable(["naam", "andere_naam"])
deduce.lookup_structs['whitelist'].add_items_from_iterable(["woord", "ander_woord"])

# tries
deduce.lookup_structs['residences'].add_items(["kleine", "plaats", "in", "de", "regio"])
deduce.lookup_structs['institutions'].add_items_from_iterable(["verzorgingstehuis", "hier", "om", "de", "hoek"])

Full documentation on sets and tries, and how to modify them, is available in the docdeid API.

Larger changes may also be made by copying the source files and modifying them directly, by pointing deduce to the directory with modified sources:

from deduce import Deduce

deduce = Deduce(lookup_data_path="/my/path")

It’s important to copy the directory, or your changes will be overwritten with the next deduce update. Currently, there is no additional documentation available on how to structure and transform the lookup items in the directory, other than inspecting the pre-packaged files. Also remember that any updates to lookup values in next releases of Deduce will not be applied if deduce loads items from a copy, differences need to be tracked manually with each release.