deduce

Migrating to version 3.0.0

Version 3.0.0 of deduce includes many optimizations that allow more accurate de-identification, some of which were already included in versions 2.1.0 through 2.5.0. It also includes some structural optimizations. Version 3.0.0 should be backwards compatible, but some functionality is scheduled for removal in 3.1.0. Those changes are listed below.

Custom config

A custom config can now be passed either as a dict or as the filename of a JSON file. Both are passed to Deduce through the config keyword, e.g.:

deduce = Deduce(config='my_own_config.json')
deduce = Deduce(config={'redactor_open_char': '**', 'redactor_close_char': '**'})

The config_file keyword is no longer used; please use config instead.
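
If existing code still builds its keyword arguments with config_file, the rename can be applied in one place. The helper below is only a sketch: migrate_config_kwargs is a hypothetical name and not part of deduce, and it only rewrites a plain dict of keyword arguments.

```python
def migrate_config_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with the old config_file keyword renamed to config.

    Hypothetical helper for illustration, not part of deduce.
    """
    migrated = dict(kwargs)
    if "config_file" in migrated:
        if "config" in migrated:
            raise ValueError("Pass either config or config_file, not both.")
        migrated["config"] = migrated.pop("config_file")
    return migrated


# A filename and a dict both end up under the same keyword.
print(migrate_config_kwargs({"config_file": "my_own_config.json"}))
print(migrate_config_kwargs({"config": {"redactor_open_char": "**"}}))
```

The result can then be unpacked into the constructor, e.g. Deduce(**migrate_config_kwargs(old_kwargs)).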

Lookup structure names

For consistency, lookup structure names are now all in singular form:

| Old name | New name |
| --- | --- |
| prefixes | prefix |
| first_names | first_name |
| interfixes | interfix |
| interfix_surnames | interfix_surname |
| surnames | surname |
| streets | street |
| placenames | placename |
| hospitals | hospital |
| healthcare_institutions | healthcare_institution |
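
For code that refers to lookup structures by name, the renames can be captured in a small mapping. This is a sketch only: LOOKUP_RENAMES, rename_lookup and migrate_lookup_keys are hypothetical names, and the interfixes entry assumes the singular-form rule applies to it as well, so check it against your installed version.

```python
# Old lookup structure name -> new (singular) name.
LOOKUP_RENAMES = {
    "prefixes": "prefix",
    "first_names": "first_name",
    "interfixes": "interfix",  # assumption: follows the singular-form rule
    "interfix_surnames": "interfix_surname",
    "surnames": "surname",
    "streets": "street",
    "placenames": "placename",
    "hospitals": "hospital",
    "healthcare_institutions": "healthcare_institution",
}


def rename_lookup(name: str) -> str:
    """Return the new name for a lookup structure, or the name itself if unchanged."""
    return LOOKUP_RENAMES.get(name, name)


def migrate_lookup_keys(structs: dict) -> dict:
    """Return a copy of a name -> structure dict with old names replaced."""
    return {rename_lookup(name): value for name, value in structs.items()}


print(rename_lookup("first_names"))
print(migrate_lookup_keys({"streets": {"Dorpsstraat"}}))
```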

Additionally, the first_name_exceptions and surname_exceptions lists are removed. The exception items are now removed from the original lists in a more structured way, so patterns and similar components no longer need to filter exceptions explicitly.
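
Conceptually, removing the exceptions up front amounts to a set difference. The snippet below is a sketch with made-up example values, and apply_exceptions is a hypothetical name; how deduce builds its lists internally is not shown here.

```python
def apply_exceptions(items: set, exceptions: set) -> set:
    """Return the items with all exception entries removed."""
    return items - exceptions


# Made-up example values, for illustration only.
raw_first_names = {"Jan", "Piet", "Roos"}
exceptions = {"Roos"}

first_name = apply_exceptions(raw_first_names, exceptions)
print(sorted(first_name))
```

With the exceptions already gone from the list, a pattern that looks up first_name does not need a separate exception check.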

The annotator_type field in config

In a config, each annotator should specify annotator_type, so Deduce knows which annotator to load. In 3.0.0 we simplified this a bit. In most cases, the annotator_type field should be set to the module.Class of the annotator that should be loaded, and Deduce will handle the rest (sometimes with a little bit of magic, so all arguments are presented with the right type). You should make the following changes:

| annotator_type | Change |
| --- | --- |
| multi_token | docdeid.process.MultiTokenLookupAnnotator |
| dd_token_pattern | This used to load docdeid.process.TokenPatternAnnotator, but this is now replaced by deduce.annotator.TokenPatternAnnotator. The latter is more powerful, but needs a different pattern. A docdeid.process.TokenPatternAnnotator can no longer be loaded through config, although adding it manually to Deduce.processors is always possible. |
| token_pattern | deduce.annotator.TokenPatternAnnotator |
| annotation_context | deduce.annotator.ContextAnnotator |
| custom | Use module.Class directly, where module and class fields used to be specified in args. They should be removed there. |
| regexp | docdeid.process.RegexpAnnotator |
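
The changes above can be applied mechanically to a single annotator entry. The sketch below assumes each entry is a dict with an annotator_type key and an optional args dict; migrate_annotator and the example module and class names are hypothetical. dd_token_pattern is deliberately not converted, since its pattern has to be rewritten by hand.

```python
import copy

# Direct renames taken from the list of changes above.
ANNOTATOR_TYPE_RENAMES = {
    "multi_token": "docdeid.process.MultiTokenLookupAnnotator",
    "token_pattern": "deduce.annotator.TokenPatternAnnotator",
    "annotation_context": "deduce.annotator.ContextAnnotator",
    "regexp": "docdeid.process.RegexpAnnotator",
}


def migrate_annotator(annotator: dict) -> dict:
    """Return a migrated copy of one annotator config entry (assumed layout)."""
    migrated = copy.deepcopy(annotator)
    old_type = migrated.get("annotator_type")

    if old_type in ANNOTATOR_TYPE_RENAMES:
        migrated["annotator_type"] = ANNOTATOR_TYPE_RENAMES[old_type]
    elif old_type == "custom":
        # module and class move out of args and into annotator_type.
        args = migrated["args"]
        migrated["annotator_type"] = f"{args.pop('module')}.{args.pop('class')}"
    elif old_type == "dd_token_pattern":
        raise ValueError("dd_token_pattern needs a manually rewritten pattern")

    return migrated


old_entry = {
    "annotator_type": "custom",
    "args": {"module": "my_package.annotators", "class": "MyAnnotator", "tag": "my_tag"},
}
print(migrate_annotator(old_entry))
```

Entries that already use a module.Class value pass through unchanged, so the function can be run over a whole config more than once.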

Migrating to version 2.0.0

Version 2.0.0 of deduce sees a major refactor that enables speedup, configuration, customization, and more. With it, the interface to apply deduce to text changes slightly. Updating your code to the new interface should not take more than a few minutes. The details are outlined below.

Calling deduce

deduce is now called through Deduce.deidentify, which replaces the annotate_text and deidentify_annotations functions. Those functions emit a DeprecationWarning from version 2.0.0 onwards, and will no longer be supported from version 2.1.0.

Deprecated:

from deduce import annotate_text, deidentify_annotations

text = "Jan Jansen"

annotated_text = annotate_text(text)
deidentified_text = deidentify_annotations(annotated_text)

New:

from deduce import Deduce

text = "Jan Jansen"

deduce = Deduce()
doc = deduce.deidentify(text)
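
Because the new interface is object-based, one Deduce instance can be reused for many texts. The helper below is a sketch: deidentify_all is a hypothetical name, and it only assumes that the object passed in has a deidentify method whose result exposes deidentified_text.

```python
from typing import Iterable, List


def deidentify_all(deduce, texts: Iterable[str]) -> List[str]:
    """Run one Deduce-like object over many texts and collect the deidentified text."""
    return [deduce.deidentify(text).deidentified_text for text in texts]


# Usage, assuming deduce is installed:
#
#     from deduce import Deduce
#
#     deduce = Deduce()
#     print(deidentify_all(deduce, ["Jan Jansen", "Piet Pietersen"]))
```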

Accessing output

The annotations and deidentified text are now available in the Document object. In-text annotations can still be useful for comparisons; they can be obtained by passing the document to a util function from the docdeid library (note that the format has changed).

Deprecated:

print(annotated_text)
'<PERSOON Jan Jansen>'

print(deidentified_text)
'<PERSOON-1>'

New:

import docdeid as dd

print(dd.utils.annotate_intext(doc))
'<PERSOON>Jan Jansen</PERSOON>'

print(doc.annotations)
AnnotationSet({
    Annotation(
        text="Jan Jansen",
        start_char=0,
        end_char=10,
        tag="persoon",
        length="10"
    )
})

print(doc.deidentified_text)
'<PERSOON-1>'
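
For comparisons between runs, it can be easier to work with plain dicts than with annotation objects. The sketch below relies only on the text, start_char, end_char and tag attributes shown in the printed output above; annotations_to_rows is a hypothetical name, not a docdeid function.

```python
def annotations_to_rows(annotations) -> list:
    """Convert annotation-like objects to dicts, ordered by position in the text."""
    rows = [
        {
            "text": annotation.text,
            "start_char": annotation.start_char,
            "end_char": annotation.end_char,
            "tag": annotation.tag,
        }
        for annotation in annotations
    ]
    return sorted(rows, key=lambda row: (row["start_char"], row["end_char"]))


# Usage, assuming doc is the result of Deduce.deidentify:
#
#     for row in annotations_to_rows(doc.annotations):
#         print(row)
```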

Adding patient names

The patient_first_names, patient_initials, patient_surname and patient_given_name keywords of annotate_text are replaced with a structured way to enter this information: the Person class. An instance of this class can be passed to deidentify() as metadata. The use of a given name is deprecated; it can instead be added as a separate first name. The behaviour is still the same.

Deprecated:

from deduce import annotate_text, deidentify_annotations

text = "Jan Jansen"

annotated_text = annotate_text(
    text,
    patient_first_names="Jan Hendrik",
    patient_initials="JH",
    patient_surname="Jansen",
    patient_given_name="Joop"
)
deidentified_text = deidentify_annotations(annotated_text)

New:

from deduce import Deduce
from deduce.person import Person

text = "Jan Jansen"
patient = Person(
    first_names=['Jan', 'Hendrik', 'Joop'],
    initials="JH",
    surname="Jansen"
)

deduce = Deduce()
doc = deduce.deidentify(text, metadata={'patient': patient})
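
Where the old keyword values are stored in existing data, they can be translated into Person arguments before calling deidentify(). The helper below is a sketch: person_kwargs_from_legacy is a hypothetical name, and it assumes that patient_first_names was a whitespace-separated string, as in the deprecated example above.

```python
def person_kwargs_from_legacy(
    patient_first_names=None,
    patient_initials=None,
    patient_surname=None,
    patient_given_name=None,
) -> dict:
    """Translate the old annotate_text keywords into keyword arguments for Person."""
    # Assumption: first names were passed as one whitespace-separated string.
    first_names = patient_first_names.split() if patient_first_names else []
    if patient_given_name:
        # The given name is now just another first name.
        first_names.append(patient_given_name)

    kwargs = {}
    if first_names:
        kwargs["first_names"] = first_names
    if patient_initials:
        kwargs["initials"] = patient_initials
    if patient_surname:
        kwargs["surname"] = patient_surname
    return kwargs


print(person_kwargs_from_legacy("Jan Hendrik", "JH", "Jansen", "Joop"))
```

The result can be unpacked into the class, e.g. Person(**person_kwargs_from_legacy(...)), and passed as metadata={'patient': patient}.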

Enabling/disabling specific categories

Previously, the annotate_text function allowed disabling specific categories through keywords such as dates, ages and names. The same behaviour can now be achieved by setting the disabled argument of the Deduce.deidentify method. Note that the identification logic of Deduce is now further split up into Annotator classes, which allows enabling or disabling specific components. You can read more about the specific annotators and other components in the tutorial, and more information on enabling, disabling, replacing or modifying specific components is available in the documentation.

Deprecated:

from deduce import annotate_text, deidentify_annotations

text = "Jan Jansen"

annotated_text = annotate_text(
    text,
    dates=False,
    ages=False
)
deidentified_text = deidentify_annotations(annotated_text)

New:

from deduce import Deduce

text = "Jan Jansen"

deduce = Deduce()
doc = deduce.deidentify(
    text,
    disabled={'dates', 'ages'}
)
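
Old boolean flags can be turned into a disabled set with a one-line comprehension. This sketch assumes that each old keyword name is also the name to pass in disabled, as is the case for dates and ages in the example above; disabled_from_legacy_flags is a hypothetical name.

```python
def disabled_from_legacy_flags(**flags: bool) -> set:
    """Collect the names of all categories whose old flag was set to False."""
    return {name for name, enabled in flags.items() if not enabled}


disabled = disabled_from_legacy_flags(dates=False, ages=False, names=True)
print(sorted(disabled))
```

The resulting set can be passed directly, e.g. deduce.deidentify(text, disabled=disabled).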