Migrating to version `3.0.0`

Version 3.0.0 of deduce includes many optimizations that allow more accurate de-identification, some of which were already included in versions 2.1.0 through 2.5.0. It also includes some structural optimizations. Version 3.0.0 should be backwards compatible, but some functionality is scheduled for removal in 3.1.0. Those changes are listed below.

Custom config

A custom config can now be passed either as a dict or as the filename of a JSON file. In both cases it is passed to Deduce with the config keyword, e.g.:

from deduce import Deduce

deduce = Deduce(config='my_own_config.json')  # config as a JSON filename
deduce = Deduce(config={'redactor_open_char': '**', 'redactor_close_char': '**'})  # config as a dict

The config_file keyword is no longer used. Please use config instead.
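When Deduce is constructed from stored keyword arguments, the rename can be applied in one place. The following is a minimal sketch, not part of deduce: `migrate_config_kwarg` is a hypothetical helper that only touches a plain dict of keyword arguments.

```python
def migrate_config_kwarg(kwargs: dict) -> dict:
    """Return a copy of `kwargs` with `config_file` renamed to `config`.

    Hypothetical helper for illustration; it does not import or call deduce.
    """
    migrated = dict(kwargs)
    if "config_file" in migrated:
        if "config" in migrated:
            # Both keywords present: refuse to guess which one should win.
            raise ValueError("Pass either 'config_file' or 'config', not both.")
        migrated["config"] = migrated.pop("config_file")
    return migrated


old_kwargs = {"config_file": "my_own_config.json"}
print(migrate_config_kwarg(old_kwargs))  # {'config': 'my_own_config.json'}
```

The result can then be unpacked into the constructor, as in `Deduce(**migrate_config_kwarg(old_kwargs))`.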

Lookup structure names

For consistency, lookup structure names are now all in singular form:

Old name	New name
prefixes	prefix
first_names	first_name
interfixes	interfix
interfix_surnames	interfix_surname
surnames	surname
streets	street
placenames	placename
hospitals	hospital
healthcare_institutions	healthcare_institution

Additionally, the first_name_exceptions and surname_exceptions lists are removed. The exception items are now removed from the original lists in a more structured way, so there is no need to explicitly filter exceptions in patterns, etc.
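For a custom config that still refers to the old names, the renames can be applied mechanically. The sketch below makes two assumptions: that lookup names appear as plain string values somewhere in a nested dict/list config (the `{"lookup": ...}` pattern shape is only illustrative), and that `interfixes` becomes `interfix` in line with the singular-form rule. `rename_lookups` is a hypothetical helper, not part of deduce.

```python
# Old -> new lookup structure names, following the table above.
# Assumption: "interfixes" -> "interfix", in line with the singular-form rule.
LOOKUP_RENAMES = {
    "prefixes": "prefix",
    "first_names": "first_name",
    "interfixes": "interfix",
    "interfix_surnames": "interfix_surname",
    "surnames": "surname",
    "streets": "street",
    "placenames": "placename",
    "hospitals": "hospital",
    "healthcare_institutions": "healthcare_institution",
}


def rename_lookups(node):
    """Recursively replace string values that match an old lookup name."""
    if isinstance(node, dict):
        return {key: rename_lookups(value) for key, value in node.items()}
    if isinstance(node, list):
        return [rename_lookups(item) for item in node]
    if isinstance(node, str):
        return LOOKUP_RENAMES.get(node, node)
    return node


# Illustrative pattern shape; real configs may be structured differently.
old_pattern = [{"lookup": "first_names"}, {"lookup": "surnames"}]
print(rename_lookups(old_pattern))
# [{'lookup': 'first_name'}, {'lookup': 'surname'}]
```

Because every string value equal to an old name is replaced, it is worth reviewing the result for unrelated values that happen to share a name.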

The `annotator_type` field in config

In a config, each annotator should specify annotator_type, so Deduce knows which annotator to load. In 3.0.0 we simplified this a bit. In most cases, the annotator_type field should be set to the module.Class of the annotator that should be loaded, and Deduce will handle the rest (sometimes with a little bit of magic, so that all arguments are presented with the right type). You should make the following changes:

annotator_type	Change
multi_token	`docdeid.process.MultiTokenLookupAnnotator`
dd_token_pattern	This used to load `docdeid.process.TokenPatternAnnotator`, but this is now replaced by `deduce.annotator.TokenPatternAnnotator`. The latter is more powerful, but needs a different pattern. A `docdeid.process.TokenPatternAnnotator` can no longer be loaded through config, although adding it manually to `Deduce.processors` is always possible.
token_pattern	`deduce.annotator.TokenPatternAnnotator`
annotation_context	`deduce.annotator.ContextAnnotator`
custom	Use `module.Class` directly. The `module` and `class` fields that used to be specified in `args` should be removed from there.
regexp	`docdeid.process.RegexpAnnotator`
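The table above can be applied to a single annotator entry roughly as follows. This is a sketch under the assumption that an entry is a dict with an `annotator_type` key and an optional `args` dict; `migrate_annotator` and the example module and class names are hypothetical. A `dd_token_pattern` entry is rejected rather than converted, since its pattern has to be rewritten by hand.

```python
import copy

# Direct renames from the table above.
ANNOTATOR_TYPE_RENAMES = {
    "multi_token": "docdeid.process.MultiTokenLookupAnnotator",
    "token_pattern": "deduce.annotator.TokenPatternAnnotator",
    "annotation_context": "deduce.annotator.ContextAnnotator",
    "regexp": "docdeid.process.RegexpAnnotator",
}


def migrate_annotator(annotator: dict) -> dict:
    """Return a migrated copy of one annotator entry (assumed dict layout)."""
    migrated = copy.deepcopy(annotator)
    old_type = migrated.get("annotator_type")

    if old_type == "dd_token_pattern":
        # Needs a different pattern, so it cannot be converted automatically.
        raise ValueError("dd_token_pattern entries must be rewritten by hand")

    if old_type == "custom":
        # Move `module` and `class` out of `args` into `annotator_type`.
        args = migrated["args"]
        migrated["annotator_type"] = f"{args.pop('module')}.{args.pop('class')}"
    elif old_type in ANNOTATOR_TYPE_RENAMES:
        migrated["annotator_type"] = ANNOTATOR_TYPE_RENAMES[old_type]

    return migrated


# Hypothetical custom annotator entry.
old_entry = {
    "annotator_type": "custom",
    "args": {"module": "my_package.annotators", "class": "MyAnnotator", "threshold": 3},
}
print(migrate_annotator(old_entry))
# {'annotator_type': 'my_package.annotators.MyAnnotator', 'args': {'threshold': 3}}
```

Entries whose `annotator_type` is already in `module.Class` form pass through unchanged.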

Migrating to version `2.0.0`

Version 2.0.0 of deduce sees a major refactor that enables speedup, configuration, customization, and more. With it, the interface to apply deduce to text changes slightly. Updating your code to the new interface should not take more than a few minutes. The details are outlined below.

Calling `deduce`

deduce is now called from Deduce.deidentify, which replaces the annotate_text and deidentify_annotations functions. Those functions give a DeprecationWarning from version 2.0.0, and are scheduled for removal from version 2.1.0.

deprecated:

from deduce import annotate_text, deidentify_annotations

text = "Jan Jansen"

annotated_text = annotate_text(text)
deidentified_text = deidentify_annotations(annotated_text)

new:

from deduce import Deduce

text = "Jan Jansen"

deduce = Deduce()
doc = deduce.deidentify(text)
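Code that only needs the de-identified string can hide the new document object behind a small wrapper. The sketch below is not part of deduce: `deidentify_text` is a hypothetical helper that relies only on `Deduce.deidentify` and on the `deidentified_text` attribute of the returned document, both as described in this guide.

```python
def deidentify_text(deduce, text: str) -> str:
    """Run the single-step call and return only the de-identified string.

    `deduce` is expected to be a Deduce instance; any object with a
    compatible `deidentify` method works as well.
    """
    doc = deduce.deidentify(text)
    return doc.deidentified_text


# Usage with the real library (requires deduce to be installed):
# from deduce import Deduce
# print(deidentify_text(Deduce(), "Jan Jansen"))
```

Creating the `Deduce` instance once and reusing it for every text avoids repeating the setup on each call.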

Accessing output

The annotations and deidentified text are now available in the Document object. Intext annotations can still be useful for comparisons. They can be obtained by passing the document to a util function from the docdeid library (note that the format has changed).

deprecated:

print(annotated_text)
# '<PERSOON Jan Jansen>'

print(deidentified_text)
# '<PERSOON-1>'

new:

import docdeid as dd

print(dd.utils.annotate_intext(doc))
# '<PERSOON>Jan Jansen</PERSOON>'

print(doc.annotations)
# AnnotationSet({
#     Annotation(
#         text="Jan Jansen",
#         start_char=0,
#         end_char=10,
#         tag="persoon",
#         length="10"
#     )
# })

print(doc.deidentified_text)
# '<PERSOON-1>'
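For comparisons or logging, the annotations can be flattened into plain tuples. The sketch below assumes that `doc.annotations` can be iterated and that each annotation exposes the `text`, `start_char`, `end_char` and `tag` attributes shown in the output above. `annotations_to_rows` is a hypothetical helper, and the example uses a stand-in object instead of a real annotation.

```python
from types import SimpleNamespace


def annotations_to_rows(annotations):
    """Return sorted (start_char, end_char, tag, text) tuples."""
    rows = [(a.start_char, a.end_char, a.tag, a.text) for a in annotations]
    return sorted(rows)


# Stand-in for a real annotation, with the attributes listed above.
example = SimpleNamespace(text="Jan Jansen", start_char=0, end_char=10, tag="persoon")
print(annotations_to_rows([example]))
# [(0, 10, 'persoon', 'Jan Jansen')]
```

With a real document this would be called as `annotations_to_rows(doc.annotations)`.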

Adding patient names

The patient_first_names, patient_initials, patient_surname and patient_given_name keywords of annotate_text are replaced with a structured way to enter this information, in the Person class. An instance of this class can be passed to deidentify() as metadata. The use of a given name is deprecated. It can instead be added as a separate first name. The behaviour is still the same.

deprecated:

from deduce import annotate_text, deidentify_annotations

text = "Jan Jansen"

annotated_text = annotate_text(
    text,
    patient_first_names="Jan Hendrik",
    patient_initials="JH",
    patient_surname="Jansen",
    patient_given_name="Joop"
)
deidentified_text = deidentify_annotations(annotated_text)

new:

from deduce import Deduce
from deduce.person import Person

text = "Jan Jansen"
patient = Person(
    first_names=['Jan', 'Hendrik', 'Joop'],
    initials="JH",
    surname="Jansen"
)

deduce = Deduce()
doc = deduce.deidentify(text, metadata={'patient': patient})
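Where patient names are still stored in the old keyword form, the conversion shown above can be sketched as a small function. It assumes that the old `patient_first_names` string is split on whitespace, as the example suggests, and that `Person` accepts the `first_names`, `initials` and `surname` keywords with any of them left out. `person_kwargs` is a hypothetical helper that returns plain keyword arguments and does not import deduce.

```python
def person_kwargs(
    patient_first_names=None,
    patient_initials=None,
    patient_surname=None,
    patient_given_name=None,
):
    """Map the old annotate_text keywords to keyword arguments for Person."""
    # Assumption: multiple first names were passed as one space-separated string.
    first_names = patient_first_names.split() if patient_first_names else []
    if patient_given_name:
        # The given name is added as a separate first name.
        first_names.append(patient_given_name)

    kwargs = {
        "first_names": first_names,
        "initials": patient_initials,
        "surname": patient_surname,
    }
    # Leave out anything that was not provided.
    return {key: value for key, value in kwargs.items() if value}


print(person_kwargs("Jan Hendrik", "JH", "Jansen", "Joop"))
# {'first_names': ['Jan', 'Hendrik', 'Joop'], 'initials': 'JH', 'surname': 'Jansen'}
```

The result can be unpacked into the class, as in `Person(**person_kwargs(...))`.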

Enabling/disabling specific categories

Previously, the annotate_text function offered disabling specific categories through keywords such as dates, ages and names. The same behaviour can now be achieved by setting the disabled argument of the Deduce.deidentify method. Note that the identification logic of Deduce is now further split up into Annotator classes, which allows enabling or disabling specific components. The tutorial describes the specific annotators and other components, and the documentation gives more information on enabling, disabling, replacing or modifying specific components.

deprecated:

from deduce import annotate_text, deidentify_annotations

text = "Jan Jansen"

annotated_text = annotate_text(
    text,
    dates=False,
    ages=False
)
deidentified_text = deidentify_annotations(annotated_text)

new:

from deduce import Deduce

text = "Jan Jansen"

deduce = Deduce()
doc = deduce.deidentify(
    text,
    disabled={'dates', 'ages'}
)
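Old boolean flags can be collected into a `disabled` set with a few lines. The sketch below assumes that the old keyword names are also valid names for `disabled`, as is the case for `dates` and `ages` in the example above. For other categories this should be checked against the annotator names in the config. `disabled_from_flags` is a hypothetical helper.

```python
def disabled_from_flags(**flags) -> set:
    """Collect the names of all categories switched off with `name=False`."""
    return {name for name, enabled in flags.items() if enabled is False}


disabled = disabled_from_flags(dates=False, ages=False, names=True)
print(sorted(disabled))  # ['ages', 'dates']
```

The resulting set can be passed on as `deduce.deidentify(text, disabled=disabled)`.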
