deduce

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

(unreleased)

Removed

  • the config_file keyword, now replaced by config which accepts both filenames and dicts

  • old lookup list names, e.g. prefixes now replaced by prefix

  • annotator types custom, regexp, token_pattern, dd_token_pattern and annotation_context, all replaced by setting class directly as annotator_type

  • everything in deduce.pattern, patient patterns now replaced by PatientNameAnnotator

3.0.2 (2024-02-15)

Changed

  • recognize 4+ spaces as a token, blocking annotations
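
As a rough illustration of what "blocking" means here (a sketch, not deduce's actual tokenizer): if a run of four or more spaces becomes a token of its own, a pattern that looks at directly neighbouring tokens no longer matches across the gap. The `tokenize` and `consecutive_capitalized` helpers and the regular expression are assumptions made for this example.

```python
import re

# Assumed token pattern: a run of 4+ spaces, a word, a newline,
# or a single special character. Shorter whitespace runs are skipped.
TOKEN_RE = re.compile(r" {4,}|\w+|\n|[^\w\s]")


def tokenize(text):
    """Return the list of token strings found in text."""
    return TOKEN_RE.findall(text)


def consecutive_capitalized(tokens):
    """Return pairs of directly neighbouring capitalized word tokens."""
    return [
        (left, right)
        for left, right in zip(tokens, tokens[1:])
        if left.istitle() and right.istitle()
    ]


close_together = consecutive_capitalized(tokenize("Jan Jansen"))
far_apart = consecutive_capitalized(tokenize("Jan    Jansen"))
```

With a single space the two words are neighbouring tokens and form a pair; with four spaces the gap token sits between them, so no pair is found.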

3.0.1 (2023-12-20)

Fixed

  • a bug with packaging base_config.json

3.0.0 (2023-12-20)

Added

  • speed optimizations, ~250%

  • pseudo-annotating eponymous diseases (e.g. Creutzfeldt-Jakob)

  • PatientNameAnnotator, which replaces deduce.pattern

  • a structured way for loading and building lookup structures (lists and tries), including caching

  • pre_match_words for some regexp annotators, speeding up the annotating

  • option to present a user config as dict (using config keyword)

Changed

  • speedup for TokenPatternAnnotator

  • some internals of ContextPatternAnnotator

  • initials now detected by lookup list, rather than pattern

  • redactor open and close chars from < > to [ ], as previous chars caused issues in html (so deidentified text now shows [PATIENT], [LOCATIE], etc.)

  • names of lookup structures to singular (prefix, rather than prefixes)

  • INSTELLING tag to ZIEKENHUIS and ZORGINSTELLING

  • refactored and simplified annotator loading, specifically the annotator_type config keyword now accepts references to classes (e.g. deduce.annotator.TokenPatternAnnotator)

  • renamed interfix_with_capital annotator to interfix_with_name
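
To make the class-reference idea concrete, the sketch below shows one common way a dotted path in a config entry can be turned into a class and instantiated. It is an illustration only: `resolve_class`, the `args` key and the config layout are hypothetical, a standard-library class stands in for an annotator class, and deduce's own loading code may work differently.

```python
import importlib


def resolve_class(dotted_path: str) -> type:
    """Import and return the class named by 'package.module.ClassName'."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Hypothetical config entry; a stdlib class stands in for an annotator class
# such as deduce.annotator.TokenPatternAnnotator.
annotator_config = {
    "annotator_type": "collections.OrderedDict",
    "args": {"first": 1, "second": 2},
}

annotator_cls = resolve_class(annotator_config["annotator_type"])
annotator = annotator_cls(**annotator_config["args"])
```

The appeal of this design is that a config can point at any importable class, including user-defined ones, without a registry of short type names.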

Deprecated

  • the config_file keyword, now replaced by config which accepts both filenames and dicts

  • old lookup list names, e.g. prefixes now replaced by prefix

  • annotator types custom, regexp, token_pattern, dd_token_pattern and annotation_context, all replaced by setting class directly as annotator_type

  • everything in deduce.pattern, patient patterns now replaced by PatientNameAnnotator

Removed

  • automated coverage reporting on coveralls.io

  • options lowercase_lookup, lowercase_neg_lookup for token patterns

  • utils.any_in_text

Fixed

  • some small additions/removals for specific lookup lists

  • smaller bugs related to overlapping matches

2.5.0 (2023-11-28)

Added

  • the RegexpPseudoAnnotator component for filtering regexp matches based on preceding/following words

  • a prefix_with_interfix pattern for names, detecting e.g. Dr. van Loon

Changed

  • the age detection component, with improved logic and pseudo patterns

  • annotations are no longer counted adjacent when separated by a comma

  • streets are prioritized over names when merging overlapping annotations

  • removed some false positives for postal codes ending in gr or ie

  • extended the postbus pattern for xx.xxx format (old notation)

  • some smaller optimizations and exceptions for institution, hospital, placename, residence, medical term, first name, and last name lookup lists

Fixed

  • a bug with BsnAnnotator with non-digit characters in regexp

2.4.4 (2023-11-22)

Changed

  • multi-token lookup for first and last names, so multi-token names are now detected

  • some small lookup list additions

2.4.3 (2023-11-22)

Changed

  • extended list of medical terms

2.4.2 (2023-11-21)

Changed

  • name lookup list contents, extending names and adding more exceptions

2.4.1 (2023-11-15)

Added

  • detection of initials Ch., Chr., Ph. and Th.

2.4.0 (2023-11-15)

Added

  • logic for detecting hospitals, with added whitelist and separate annotator

Changed

  • logic for detecting (non-hospital) institutions, with extended lookup list

Removed

  • the separate Altrecht annotator, now included in the lookup list

2.3.1 (2023-11-01)

Fixed

  • include data files recursively in package

2.3.0 (2023-10-25)

Added

  • lookup lists (and logic) for Dutch provinces, regions, municipalities and streets

Changed

  • name of residences annotator to placenames, now includes provinces, regions and municipalities

  • lookup lists (and logic) for residences

  • logic for streets, housenumber and housenumber letters

2.2.0 (2023-09-28)

Changed

  • tokenizer logic:

    • a token is now a sequence of alphanumeric characters, a single newline, or a single special character

    • whitespaces are no longer considered tokens

  • moved token pattern logic to config, using a new TokenPatternAnnotator

  • moved context pattern logic to config, using a new ContextAnnotator

  • many updates to name detection logic

    • lookup list optimizations

    • added, removed and simplified patterns
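
The tokenizer rules listed above can be approximated with a single regular expression. This is a minimal sketch under the stated rules, not the tokenizer deduce ships; the pattern and the `tokenize` name are assumptions for illustration.

```python
import re

# Assumed pattern, one alternative per rule:
#   [^\W_]+   a sequence of alphanumeric characters
#   \n        a single newline
#   [^\w\s]|_ a single special character
# Other whitespace matches nothing and therefore yields no token.
TOKEN_RE = re.compile(r"[^\W_]+|\n|[^\w\s]|_")


def tokenize(text: str) -> list[str]:
    """Split text into tokens according to the rules above."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Dr. van Loon\n12-03")
```

Here `tokens` contains the words, the period, the newline and the hyphen as separate items, and no entries for the spaces.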

2.1.0 (2023-08-07)

Added

  • a component for deidentifying BSN-nummers

Changed

  • updated dependencies

  • by default, deduce now recognizes and tags BSN-nummers

  • by default, deduce now recognizes all other 7+ digit numbers as identifiers

  • improved regular expressions for e-mail address and url matching, with separate tags

  • logic for detecting phone numbers (improvements for hyphens, whitespaces, false positive identifiers)

  • improved regular expression for age matching

  • date detection logic:

    • now only recognizes combinations of day, month and year (day/month combinations caused many false positives)

    • detects year-month-day format in addition to day-month-year

  • loading a custom config now only replaces the config options that are explicitly set, using defaults for those not included in the custom config
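
The config behaviour described in the last item can be pictured as a recursive merge of the custom config into the defaults. The sketch below is an assumption about the general technique, not deduce's implementation: `merge_config`, `DEFAULT_CONFIG` and all keys in it are hypothetical.

```python
import copy

# Hypothetical defaults, not deduce's real base config.
DEFAULT_CONFIG = {
    "redactor_open_char": "<",
    "redactor_close_char": ">",
    "annotators": {
        "phone": {"enabled": True},
        "date": {"enabled": True},
    },
}


def merge_config(defaults: dict, custom: dict) -> dict:
    """Return a copy of defaults in which only keys set in custom are replaced."""
    merged = copy.deepcopy(defaults)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections so sibling options keep their defaults.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


custom_config = {"annotators": {"date": {"enabled": False}}}
config = merge_config(DEFAULT_CONFIG, custom_config)
```

Only the date option changes in `config`; the phone option and the redactor characters keep their default values, and `DEFAULT_CONFIG` itself is left untouched.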

Deprecated

  • backwards compatibility, which was temporarily added to transition from v1 to v2

Removed

  • a separate patient identifier tag, now superseded by a generic tag

  • detection of day/month combinations for dates, as this caused many false positives (e.g. lab values, numeric scores)

Fixed

  • annotations can no longer be counted as adjacent when separated by newline or tab (and will thus not be merged)
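
A simplified picture of this fix: two annotations count as adjacent only when the text between them is empty or a single space, so a newline or tab between them prevents merging. The `is_adjacent` helper and the exact rule for the gap are assumptions for this sketch; the rule deduce applies may allow other separators.

```python
def is_adjacent(text: str, left_end: int, right_start: int) -> bool:
    """True if the gap between two spans is empty or one space (assumed rule)."""
    gap = text[left_end:right_start]
    return gap in ("", " ")


text = "Jan Jansen\nUtrecht"
# Character spans: "Jan" = 0..3, "Jansen" = 4..10, "Utrecht" = 11..18
same_line = is_adjacent(text, 3, 4)      # separated by a space
across_lines = is_adjacent(text, 10, 11)  # separated by a newline
```

In this example `same_line` is true and `across_lines` is false, so only the first two spans would be candidates for merging.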

2.0.3 (2023-04-06)

Fixed

  • removed ‘decibutus’ from list of institutions as it caused many false positives

2.0.2 (2023-03-28)

Changed

  • upgraded dependencies, including markdown-it-py which had a vulnerability

2.0.1 (2022-12-09)

Changed

  • upgraded dependencies

2.0.0 (2022-12-05)

Added

  • introduced new interface for deidentification, using Deduce() class

  • a separate documentation page, with tutorial and migration guide

  • support for python 3.10 and 3.11

Changed

  • major refactor that touches pretty much every line of code

  • use docdeid package for logic

  • speedups: now 973% faster

  • use lookup sets instead of lookup lists

  • refactor tokenizer

  • refactor annotators into separate classes, using structured annotations

  • guidelines for contributing

Removed

  • the annotate_text and deidentify_annotations functions

  • all in-text annotation (under the hood) and associated functions

  • support for given names; a given name can instead be added as another first name in the Person class

  • support for python 3.7 and 3.8

Fixed

  • < and > are no longer replaced by ( and ) respectively

  • deduce does not strip text (whitespaces, tabs at beginning/end of text) anymore

1.0.8 (2021-11-29)

Added

  • warn if there are any structured annotations whose annotated text does not match the original text in the span denoted by the structured annotation

Fixed

  • various modifications related to adding or subtracting spaces in annotated texts

  • remove the lowercasing of institutions’ names

  • as a result, all structured annotations now have text matching the original text in the same span
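
The consistency check behind the warning added in this release can be sketched as follows. The `Annotation` dataclass and `check_annotations` function are hypothetical stand-ins written for this example, not the structures deduce used at the time.

```python
import warnings
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical structured annotation: a text, its span and a tag."""
    text: str
    start: int
    end: int
    tag: str


def check_annotations(original: str, annotations: list[Annotation]) -> list[Annotation]:
    """Warn about, and return, annotations whose text differs from their span."""
    mismatched = [a for a in annotations if original[a.start:a.end] != a.text]
    for a in mismatched:
        warnings.warn(
            f"Annotation text {a.text!r} does not match original text "
            f"{original[a.start:a.end]!r} at span {a.start}:{a.end}"
        )
    return mismatched
```

A lowercased annotation text such as "jansen" for the span containing "Jansen" is the kind of mismatch this check would report.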

1.0.7 (2021-11-03)

Changed

  • Internal code formatting improvements

Added

  • Contributing guidelines

1.0.6 (2021-10-06)

Fixed

  • Bug with multiple 4-digit mg dosages in one text

1.0.5 (2021-10-05)

Fixed

  • Minor bug where tag flattening had no effect

1.0.4 (2021-10-05)

Added

  • Changelog

  • Additional unit tests for whitespace/punctuation

Fixed

  • Various whitespace/punctuation bugs

  • Bug with nested tags not related to person names

  • Bug with adjacent tags not being merged

1.0.3 (2021-07-07)

Added

  • Structured annotations

  • Some unit tests

Fixed

  • Error with outdated unicode package

  • Bug with context

1.0.2

Release to PyPI

1.0.1

Small bugfix for None as input

1.0.0

Initial version