Migrating to version 3.0.0¶
Version 3.0.0 of deduce includes many optimizations that allow more accurate de-identification, some already included in 2.1.0 - 2.5.0. It also includes some structural optimizations. Version 3.0.0 should be backwards compatible, but some functionality is scheduled for removal in 3.1.0. Those changes are listed below.
Custom config¶
A custom config can now be passed either as a dict or as the filename of a JSON file. Both should be passed to deduce with the config keyword, e.g.:
deduce = Deduce(config='my_own_config.json')
deduce = Deduce(config={'redactor_open_char': '**', 'redactor_close_char': '**'})
The config_file keyword is no longer used; please use config instead.
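For code that still passes config_file, the rename can be handled in one place. The sketch below uses a hypothetical helper, migrate_deduce_kwargs, which is not part of deduce. It only renames the keyword and relies on nothing beyond what is stated above, namely that config accepts either a dict or a JSON filename.

```python
from typing import Any, Dict


def migrate_deduce_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with the old `config_file` key renamed to `config`.

    Hypothetical helper for illustration; it is not part of deduce.
    """
    migrated = dict(kwargs)  # leave the caller's dict untouched
    if "config_file" in migrated:
        if "config" in migrated:
            raise ValueError("Pass either `config_file` or `config`, not both.")
        migrated["config"] = migrated.pop("config_file")
    return migrated


old_style = {"config_file": "my_own_config.json"}
new_style = migrate_deduce_kwargs(old_style)
print(new_style)  # {'config': 'my_own_config.json'}
```

The result could then be passed on as Deduce(**new_style).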
Lookup structure names¶
For consistency, lookup structure names are now all in singular form:
Old name | New name |
---|---|
prefixes | prefix |
first_names | first_name |
interfixes | interfix |
interfix_surnames | interfix_surname |
surnames | surname |
streets | street |
placenames | placename |
hospitals | hospital |
healthcare_institutions | healthcare_institution |
Additionally, the first_name_exceptions and surname_exceptions lists are removed. The exception items are now simply removed from the original list in a more structured way, so there is no need to explicitly filter exceptions in patterns, etc.
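Custom patterns or configs that refer to lookup structures by name need the same renames. The following is a minimal sketch of applying them to a nested config-like structure. The helper rename_lookups and the example pattern layout (a list of dicts with a "lookup" key) are illustrative assumptions, not deduce's API, and the interfixes entry assumes the singular interfix, following the rule stated above.

```python
from typing import Any

# Old plural names mapped to the new singular names.
LOOKUP_RENAMES = {
    "prefixes": "prefix",
    "first_names": "first_name",
    "interfixes": "interfix",  # assumed singular, per the "all singular" rule
    "interfix_surnames": "interfix_surname",
    "surnames": "surname",
    "streets": "street",
    "placenames": "placename",
    "hospitals": "hospital",
    "healthcare_institutions": "healthcare_institution",
}


def rename_lookups(value: Any) -> Any:
    """Recursively replace old lookup names that appear as string values.

    Note: this replaces ANY string value equal to an old name, so review
    the result if such strings can occur in other roles in your config.
    """
    if isinstance(value, str):
        return LOOKUP_RENAMES.get(value, value)
    if isinstance(value, list):
        return [rename_lookups(item) for item in value]
    if isinstance(value, dict):
        return {key: rename_lookups(item) for key, item in value.items()}
    return value


# Illustrative pattern layout, not taken from the deduce documentation.
old_pattern = [{"lookup": "prefixes"}, {"lookup": "surnames"}]
new_pattern = rename_lookups(old_pattern)
print(new_pattern)  # [{'lookup': 'prefix'}, {'lookup': 'surname'}]
```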
The annotator_type field in config¶
In a config, each annotator should specify annotator_type, so Deduce knows what annotator to load. In 3.0.0 we simplified this a bit. In most cases, the annotator_type field should be set to the module.Class of the annotator that should be loaded, and Deduce will handle the rest (sometimes with a little bit of magic, so all arguments are presented with the right type). You should make the following changes:
annotator_type | Change |
---|---|
multi_token | |
dd_token_pattern | This used to load |
token_pattern | |
annotation_context | |
custom | Use |
regexp | |
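To make the module.Class convention concrete, the sketch below shows one common way a dotted path can be resolved to a class in Python. It is a generic illustration using importlib, not Deduce's actual loader, and load_class is a hypothetical name. A standard-library class stands in for an annotator so the example runs without deduce installed.

```python
import importlib


def load_class(dotted_path: str) -> type:
    """Resolve a 'module.Class' string to the class object it names.

    Generic illustration of the convention; not Deduce's implementation.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.Class', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# A standard-library class is used here as a stand-in for an annotator class.
cls = load_class("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```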
Migrating to version 2.0.0¶
Version 2.0.0 of deduce sees a major refactor that enables speedup, configuration, customization, and more. With it, the interface to apply deduce to text changes slightly. Updating your code to the new interface should not take more than a few minutes. The details are outlined below.
Calling deduce¶
deduce is now called from Deduce.deidentify, which replaces the annotate_text and deidentify_annotations functions. Those functions will give a DeprecationWarning from version 2.0.0, and will be deprecated from version 2.1.0.
Deprecated:
from deduce import annotate_text, deidentify_annotations
text = "Jan Jansen"
annotated_text = annotate_text(text)
deidentified_text = deidentify_annotations(annotated_text)
New:
from deduce import Deduce
text = "Jan Jansen"
deduce = Deduce()
doc = deduce.deidentify(text)
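When migrating a larger codebase, it can help to locate remaining calls to the deprecated functions by capturing the DeprecationWarning they emit. The sketch below shows the technique with Python's warnings module. legacy_annotate_text is a local stand-in that only mimics a deprecated function, and find_deprecation_messages is a hypothetical helper; neither is part of deduce.

```python
import warnings
from typing import Any, Callable, List


def legacy_annotate_text(text: str) -> str:
    """Stand-in for a deprecated function; NOT deduce's annotate_text."""
    warnings.warn("annotate_text is deprecated", DeprecationWarning, stacklevel=2)
    return text


def find_deprecation_messages(func: Callable[..., Any], *args: Any, **kwargs: Any) -> List[str]:
    """Call func and return the messages of any DeprecationWarning it emits."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure DeprecationWarnings are recorded rather than suppressed.
        warnings.simplefilter("always", DeprecationWarning)
        func(*args, **kwargs)
    return [
        str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)
    ]


messages = find_deprecation_messages(legacy_annotate_text, "Jan Jansen")
print(messages)  # ['annotate_text is deprecated']
```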
Accessing output¶
The annotations and deidentified text are now available in the Document object. In-text annotations can still be useful for comparisons; they can be obtained by passing the document to a util function from the docdeid library (note that the format has changed).
Deprecated:
print(annotated_text)
'<PERSOON Jan Jansen>'
print(deidentified_text)
'<PERSOON-1>'
New:
import docdeid as dd
print(dd.utils.annotate_intext(doc))
'<PERSOON>Jan Jansen</PERSOON>'
print(doc.annotations)
AnnotationSet({
    Annotation(
        text="Jan Jansen",
        start_char=0,
        end_char=10,
        tag="persoon",
        length="10"
    )
})
print(doc.deidentified_text)
'<PERSOON-1>'
Adding patient names¶
The patient_first_names, patient_initials, patient_surname and patient_given_name keywords of annotate_text are replaced with a structured way to enter this information: the Person class. An instance of this class can be passed to deidentify() as metadata. The use of a given name is deprecated; it can instead be added as a separate first name. The behaviour is still the same.
Deprecated:
from deduce import annotate_text, deidentify_annotations
text = "Jan Jansen"
annotated_text = annotate_text(
    text,
    patient_first_names="Jan Hendrik",
    patient_initials="JH",
    patient_surname="Jansen",
    patient_given_name="Joop"
)
deidentified_text = deidentify_annotations(annotated_text)
New:
from deduce import Deduce
from deduce.person import Person
text = "Jan Jansen"
patient = Person(
    first_names=['Jan', 'Hendrik', 'Joop'],
    initials="JH",
    surname="Jansen"
)
deduce = Deduce()
doc = deduce.deidentify(text, metadata={'patient': patient})
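If patient data is still stored under the old keyword names, the conversion shown above can be written as a small function. This is a minimal sketch: legacy_patient_kwargs is a hypothetical helper, not part of deduce, and it assumes that the old patient_first_names value separates names with whitespace, as in the example above. It returns plain keyword arguments so it runs without deduce installed.

```python
from typing import Dict, List, Optional, Union


def legacy_patient_kwargs(
    patient_first_names: Optional[str] = None,
    patient_initials: Optional[str] = None,
    patient_surname: Optional[str] = None,
    patient_given_name: Optional[str] = None,
) -> Dict[str, Union[str, List[str]]]:
    """Map the old annotate_text keywords to keyword arguments for Person.

    Hypothetical helper; assumes whitespace-separated first names.
    """
    first_names = patient_first_names.split() if patient_first_names else []
    if patient_given_name:
        # The given name is no longer separate: add it as an extra first name.
        first_names.append(patient_given_name)

    kwargs: Dict[str, Union[str, List[str]]] = {}
    if first_names:
        kwargs["first_names"] = first_names
    if patient_initials:
        kwargs["initials"] = patient_initials
    if patient_surname:
        kwargs["surname"] = patient_surname
    return kwargs


person_kwargs = legacy_patient_kwargs(
    patient_first_names="Jan Hendrik",
    patient_initials="JH",
    patient_surname="Jansen",
    patient_given_name="Joop",
)
print(person_kwargs)
# {'first_names': ['Jan', 'Hendrik', 'Joop'], 'initials': 'JH', 'surname': 'Jansen'}
```

The result could then be unpacked as Person(**person_kwargs).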
Enabling/disabling specific categories¶
Previously, the annotate_text function allowed disabling specific categories through keywords such as dates, ages, names, etc. The same behaviour can now be achieved by setting the disabled argument of the Deduce.deidentify method. Note that the identification logic of Deduce is now further split up into Annotator classes, which allows enabling or disabling specific components. The tutorial describes the specific annotators and other components, and the documentation has more information on enabling, disabling, replacing or modifying specific components.
Deprecated:
from deduce import annotate_text, deidentify_annotations
text = "Jan Jansen"
annotated_text = annotate_text(
    text,
    dates=False,
    ages=False
)
deidentified_text = deidentify_annotations(annotated_text)
New:
from deduce import Deduce
text = "Jan Jansen"
deduce = Deduce()
doc = deduce.deidentify(
    text,
    disabled={'dates', 'ages'}
)
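Existing call sites that pass the old boolean keywords can be translated mechanically. The sketch below uses a hypothetical helper, legacy_flags_to_disabled, which is not part of deduce. It assumes that the names accepted by disabled match the old keyword names, which the example above shows only for dates and ages; check other names against the documentation.

```python
from typing import Set


def legacy_flags_to_disabled(**flags: bool) -> Set[str]:
    """Collect the old category keywords that were set to False.

    Hypothetical helper; assumes `disabled` uses the old keyword names.
    """
    return {name for name, enabled in flags.items() if not enabled}


disabled = legacy_flags_to_disabled(dates=False, ages=False, names=True)
print(sorted(disabled))  # ['ages', 'dates']
```

The resulting set could then be passed as deduce.deidentify(text, disabled=disabled).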