Migrating to version 3.0.0
Version 3.0.0 of deduce includes many optimizations that allow more accurate de-identification, some already included in 2.1.0 - 2.5.0. It also includes some structural optimizations. Version 3.0.0 should be backwards compatible, but some functionality is scheduled for removal in 3.1.0. Those changes are listed below.
Custom config
A custom config can now be supplied either as a dict or as the filename of a JSON file. In both cases, pass it to deduce with the config keyword, e.g.:
deduce = Deduce(config='my_own_config.json')
deduce = Deduce(config={'redactor_open_char': '**', 'redactor_close_char': '**'})
The config_file keyword is no longer used. Please use config instead.
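For code that still passes the old keyword around, a small shim can do the renaming before the arguments reach Deduce. The sketch below is only an illustration: `migrate_deduce_kwargs` is a hypothetical helper name, not part of deduce, and it does nothing beyond renaming the `config_file` key to `config`.

```python
def migrate_deduce_kwargs(kwargs):
    """Return a copy of kwargs with the legacy 'config_file' key renamed to 'config'.

    Hypothetical helper, not part of deduce.
    """
    migrated = dict(kwargs)
    if "config_file" in migrated:
        if "config" in migrated:
            raise ValueError("Pass either 'config' or the legacy 'config_file', not both.")
        # The value itself (a filename) is still valid for the new keyword.
        migrated["config"] = migrated.pop("config_file")
    return migrated


old_kwargs = {"config_file": "my_own_config.json"}
new_kwargs = migrate_deduce_kwargs(old_kwargs)
```

The result could then be passed on as `Deduce(**new_kwargs)`.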
Lookup structure names
For consistency, lookup structure names are now all in singular form:
| Old name | New name |
|---|---|
| prefixes | prefix |
| first_names | first_name |
| interfixes | interfix |
| interfix_surnames | interfix_surname |
| surnames | surname |
| streets | street |
| placenames | placename |
| hospitals | hospital |
| healthcare_institutions | healthcare_institution |
Additionally, the first_name_exceptions and surname_exceptions lists are removed. The exception items are now removed from the original lists in a more structured way, so there is no need to explicitly filter exceptions in patterns, etc.
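If a custom config refers to lookup structures by name, the renames can be applied mechanically. The following is a rough sketch, not a deduce utility: `rename_lookups` and `LOOKUP_RENAMES` are hypothetical names, and the assumption that names appear as string values under a `lookup` key should be checked against your own config. The `interfixes` entry assumes the singular `interfix`, in line with the singular-form rule.

```python
# Old -> new lookup structure names, following the table above.
LOOKUP_RENAMES = {
    "prefixes": "prefix",
    "first_names": "first_name",
    "interfixes": "interfix",  # assumption: singular, like the other names
    "interfix_surnames": "interfix_surname",
    "surnames": "surname",
    "streets": "street",
    "placenames": "placename",
    "hospitals": "hospital",
    "healthcare_institutions": "healthcare_institution",
}


def rename_lookups(node, keys=("lookup",)):
    """Recursively copy a nested config, renaming string values stored under `keys`.

    `keys` is an assumption about where lookup names live in the config.
    """
    if isinstance(node, dict):
        return {
            key: LOOKUP_RENAMES.get(value, value)
            if key in keys and isinstance(value, str)
            else rename_lookups(value, keys)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rename_lookups(item, keys) for item in node]
    return node


old_pattern = {"pattern": [{"lookup": "first_names"}, {"lookup": "surnames"}]}
new_pattern = rename_lookups(old_pattern)
```

Only values under the listed keys are touched, so other strings that happen to equal an old name are left alone.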
The annotator_type field in config
In a config, each annotator should specify annotator_type, so Deduce knows which annotator to load. In 3.0.0 we simplified this a bit. In most cases, the annotator_type field should be set to the module.Class of the annotator that should be loaded, and Deduce will handle the rest (sometimes with a little bit of magic, so all arguments are presented with the right type). You should make the following changes:
| annotator_type | Change |
|---|---|
| multi_token | |
| dd_token_pattern | This used to load |
| token_pattern | |
| annotation_context | |
| custom | Use |
| regexp | |
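To illustrate what a module.Class string amounts to, here is a minimal sketch of resolving such a dotted path to a class with the standard library. It is not Deduce's actual loading code. `load_class` is a hypothetical helper, and the example resolves a standard-library class rather than a real annotator.

```python
import importlib


def load_class(dotted_path):
    """Resolve a 'module.Class' string to the class object it names."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.Class', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in example using a standard-library class instead of an annotator.
cls = load_class("collections.OrderedDict")
```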
Migrating to version 2.0.0
Version 2.0.0 of deduce sees a major refactor that enables speedup, configuration, customization, and more. With it, the interface to apply deduce to text changes slightly. Updating your code to the new interface should not take more than a few minutes. The details are outlined below.
Calling deduce
deduce is now called through Deduce.deidentify, which replaces the annotate_text and deidentify_annotations functions. Those functions give a DeprecationWarning from version 2.0.0, and will be removed from version 2.1.0.
Deprecated:
from deduce import annotate_text, deidentify_annotations
text = "Jan Jansen"
annotated_text = annotate_text(text)
deidentified_text = deidentify_annotations(annotated_text)
New:
from deduce import Deduce
text = "Jan Jansen"
deduce = Deduce()
doc = deduce.deidentify(text)
Accessing output
The annotations and deidentified text are now available in the Document object. In-text annotations can still be useful for comparisons. They can be obtained by passing the document to a util function from the docdeid library (note that the format has changed).
Deprecated:
print(annotated_text)
'<PERSOON Jan Jansen>'
print(deidentified_text)
'<PERSOON-1>'
New:
import docdeid as dd
print(dd.utils.annotate_intext(doc))
'<PERSOON>Jan Jansen</PERSOON>'
print(doc.annotations)
AnnotationSet({
    Annotation(
        text="Jan Jansen",
        start_char=0,
        end_char=10,
        tag="persoon",
        length="10"
    )
})
print(doc.deidentified_text)
'<PERSOON-1>'
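When comparing against in-text annotations saved with the old interface, it may help to bring them into the new format first. The sketch below assumes the two formats look exactly like the examples above, with an uppercase tag and no nested annotations. `legacy_to_new_intext` is a hypothetical helper, not part of deduce or docdeid.

```python
import re

# Matches the old format '<TAG some text>': an uppercase tag, one space, then the text.
# Deidentified placeholders such as '<PERSOON-1>' contain no space and are not matched.
_LEGACY_TAG = re.compile(r"<([A-Z_]+) ([^<>]+)>")


def legacy_to_new_intext(text):
    """Rewrite '<TAG text>' annotations as '<TAG>text</TAG>' (non-nested only)."""
    return _LEGACY_TAG.sub(r"<\1>\2</\1>", text)


converted = legacy_to_new_intext("<PERSOON Jan Jansen>")
```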
Adding patient names
The patient_first_names, patient_initials, patient_surname and patient_given_name keywords of annotate_text are replaced with a structured way to enter this information: the Person class. An instance of this class can be passed to deidentify() as metadata. The use of a given name is deprecated. It can instead be added as a separate first name, and the behaviour is still the same.
Deprecated:
from deduce import annotate_text, deidentify_annotations
text = "Jan Jansen"
annotated_text = annotate_text(
    text,
    patient_first_names="Jan Hendrik",
    patient_initials="JH",
    patient_surname="Jansen",
    patient_given_name="Joop"
)
deidentified_text = deidentify_annotations(annotated_text)
New:
from deduce import Deduce
from deduce.person import Person
text = "Jan Jansen"
patient = Person(
    first_names=['Jan', 'Hendrik', 'Joop'],
    initials="JH",
    surname="Jansen"
)
deduce = Deduce()
doc = deduce.deidentify(text, metadata={'patient': patient})
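The keyword mapping shown above can be wrapped in a small conversion function when many call sites need updating. This is a sketch under assumptions: `legacy_patient_kwargs` is a hypothetical helper, it returns a plain dict rather than a Person so that it runs without deduce installed, and it assumes empty values may be passed to Person as None.

```python
def legacy_patient_kwargs(
    patient_first_names="",
    patient_initials="",
    patient_surname="",
    patient_given_name="",
):
    """Map the old annotate_text patient keywords to keyword arguments for Person.

    Hypothetical helper, not part of deduce.
    """
    # The old interface took first names as one space-separated string.
    first_names = patient_first_names.split()
    # The given name becomes just another first name.
    if patient_given_name and patient_given_name not in first_names:
        first_names.append(patient_given_name)
    return {
        "first_names": first_names or None,
        "initials": patient_initials or None,
        "surname": patient_surname or None,
    }


person_kwargs = legacy_patient_kwargs(
    patient_first_names="Jan Hendrik",
    patient_initials="JH",
    patient_surname="Jansen",
    patient_given_name="Joop",
)
```

With deduce installed, the result could be used as `Person(**person_kwargs)`.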
Enabling/disabling specific categories
Previously, the annotate_text function allowed disabling specific categories through keywords such as dates, ages and names. The same behaviour can now be achieved by setting the disabled argument of the Deduce.deidentify method. Note that the identification logic of Deduce is now further split up into Annotator classes, which allows enabling or disabling specific components. You can read more about the specific annotators and other components in the tutorial, and more information on enabling, disabling, replacing or modifying specific components is available in the documentation.
Deprecated:
from deduce import annotate_text, deidentify_annotations
text = "Jan Jansen"
annotated_text = annotate_text(
    text,
    dates=False,
    ages=False
)
deidentified_text = deidentify_annotations(annotated_text)
New:
from deduce import Deduce
text = "Jan Jansen"
deduce = Deduce()
doc = deduce.deidentify(
    text,
    disabled={'dates', 'ages'}
)
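Existing calls that pass boolean category flags can be translated into a disabled set along these lines. The sketch assumes each old flag name matches the name used in disabled, which the example above shows only for dates and ages. `legacy_flags_to_disabled` is a hypothetical helper, not part of deduce.

```python
def legacy_flags_to_disabled(**flags):
    """Collect the names of all category flags that were explicitly set to False."""
    return {name for name, enabled in flags.items() if enabled is False}


disabled = legacy_flags_to_disabled(dates=False, ages=False, names=True)
```

The resulting set could then be passed as `deduce.deidentify(text, disabled=disabled)`.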