deduce.annotation_processor

Contains components for processing AnnotationSet.

class deduce.annotation_processor.DeduceMergeAdjacentAnnotations(slack_regexp: Optional[str] = None, check_overlap: bool = True)

Bases: MergeAdjacentAnnotations

Merge adjacent annotations, according to deduce logic: adjacent annotations with mixed patient/person tags are replaced with a patient annotation; in other cases, only annotations with equal tags are considered adjacent.

class deduce.annotation_processor.PersonAnnotationConverter

Bases: AnnotationProcessor

Responsible for processing the annotations produced by all name annotators (regular and context-based).

Any annotations overlapping with annotations that contain “pseudo” in their tag are removed, as are the “pseudo” annotations themselves. Then resolves overlap between the remaining annotations, and maps the tags to either “patient” or “persoon”, based on whether “patient” is in the tag (e.g. voornaam_patient => patient, achternaam_onbekend => persoon).

process_annotations(annotations: AnnotationSet, text: str) AnnotationSet

Process an AnnotationSet.

Parameters:
  • annotations – The input AnnotationSet.

  • text – The corresponding text.

Returns:

An AnnotationSet that is processed according to the class logic.
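
The tag mapping described above can be sketched as follows (a simplified, hypothetical illustration, not the actual implementation):

```python
def map_person_tag(tag: str) -> str:
    """Map a name-annotator tag to "patient" or "persoon".

    Simplified sketch: the tag becomes "patient" when "patient"
    occurs anywhere in it, and "persoon" otherwise.
    """
    return "patient" if "patient" in tag else "persoon"


print(map_person_tag("voornaam_patient"))     # patient
print(map_person_tag("achternaam_onbekend"))  # persoon
```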

class deduce.annotation_processor.RemoveAnnotations(tags: list[str])

Bases: AnnotationProcessor

Removes all annotations with corresponding tags.

process_annotations(annotations: AnnotationSet, text: str) AnnotationSet

Process an AnnotationSet.

Parameters:
  • annotations – The input AnnotationSet.

  • text – The corresponding text.

Returns:

An AnnotationSet that is processed according to the class logic.

class deduce.annotation_processor.CleanAnnotationTag(tag_map: dict[str, str])

Bases: AnnotationProcessor

Cleans annotation tags based on the corresponding mapping.

process_annotations(annotations: AnnotationSet, text: str) AnnotationSet

Process an AnnotationSet.

Parameters:
  • annotations – The input AnnotationSet.

  • text – The corresponding text.

Returns:

An AnnotationSet that is processed according to the class logic.

deduce.annotator

Contains components for annotating.

class deduce.annotator.TokenPatternAnnotator(pattern: list[dict], *args, ds: Optional[DsCollection] = None, skip: Optional[list[str]] = None, **kwargs)

Bases: Annotator

Annotates based on token patterns, which should be provided as a list of dicts. Each position in the list denotes a token position, e.g.: [{‘is_initial’: True}, {‘like_name’: True}] matches sequences of two tokens, where the first one is an initial, and the second one is like a name.

Parameters:
  • pattern – The pattern

  • ds – Any data structures that can be used for lookup or other logic

  • skip – Any string values that should be skipped in matching (e.g. periods)

annotate(doc: Document) list[docdeid.annotation.Annotation]

Annotate the document, by matching the pattern against all tokens.

Parameters:

doc – The document being processed.

Returns:

A list of Annotation.
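
A minimal sketch of this pattern matching, with hypothetical is_initial and like_name predicates over plain string tokens (the real annotator operates on docdeid tokens and supports more attributes):

```python
def token_matches(token: str, spec: dict) -> bool:
    # Illustrative predicates only; the actual attribute set differs.
    checks = {
        "is_initial": lambda t: len(t) == 1 and t.isupper(),
        "like_name": lambda t: t.istitle() and len(t) > 1,
    }
    return all(checks[attr](token) == value for attr, value in spec.items())


def match_pattern(tokens: list, pattern: list) -> list:
    """Return (start, end) token-index spans where the pattern matches."""
    n = len(pattern)
    return [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if all(token_matches(tokens[i + j], pattern[j]) for j in range(n))
    ]


pattern = [{"is_initial": True}, {"like_name": True}]
print(match_pattern(["De", "patient", "J", "Jansen"], pattern))  # [(2, 4)]
```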

class deduce.annotator.ContextAnnotator(*args, ds: Optional[DsCollection] = None, iterative: bool = True, **kwargs)

Bases: TokenPatternAnnotator

Extends existing annotations to the left or right, based on specified patterns.

Parameters:
  • ds – Any data structures that can be used for lookup or other logic

  • iterative – Whether the extension process should repeat, or stop after one iteration.

annotate(doc: Document) list[docdeid.annotation.Annotation]

Wrapper for annotating.

Parameters:

doc – The document to process.

Returns:

An empty list, as annotations are modified and not added.

class deduce.annotator.PatientNameAnnotator(tokenizer: Tokenizer, *args, **kwargs)

Bases: Annotator

Annotates patient names, based on information present in document metadata. This class implements logic for detecting first name(s), initials and surnames.

Parameters:

tokenizer – A tokenizer, used for breaking up the patient surname into multiple tokens.

next_with_skip(token: Token) Optional[Token]

Find the next token, while skipping certain punctuation.

annotate(doc: Document) list[docdeid.annotation.Annotation]

Annotates the document, based on the patient metadata.

Parameters:

doc – The input document.

Returns: A list of any relevant Annotations.

class deduce.annotator.RegexpPseudoAnnotator(*args, pre_pseudo: Optional[list[str]] = None, post_pseudo: Optional[list[str]] = None, lowercase: bool = True, **kwargs)

Bases: RegexpAnnotator

Regexp annotator that filters out matches preceded or followed by certain terms. Currently matches on sequential alpha characters preceding or following the match. This annotator does not depend on any tokenizer.

Parameters:
  • pre_pseudo – A list of strings that invalidate a match when preceding it

  • post_pseudo – A list of strings that invalidate a match when following it

  • lowercase – Whether to lowercase the text before matching

class deduce.annotator.BsnAnnotator(bsn_regexp: str, *args, capture_group: int = 0, **kwargs)

Bases: Annotator

Annotates Burgerservicenummer (BSN), according to the elfproef logic. See also: https://nl.wikipedia.org/wiki/Burgerservicenummer

Parameters:
  • bsn_regexp – A regexp to match potential BSN numbers. The simplest form could be 9-digit numbers, but matches with periods or other punctuation can also be accepted. Any non-digit characters are removed from the match before the elfproef is applied.

  • capture_group – The regexp capture group to consider.

annotate(doc: Document) list[docdeid.annotation.Annotation]

Generate annotations for a document.

Parameters:

doc – The document that should be annotated.

Returns:

A list of annotations.
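
The elfproef itself can be sketched in a few lines (a simplified illustration; 111222333 is a number that happens to pass the check, not a real BSN):

```python
import re


def elfproef(bsn_match: str) -> bool:
    """Check the Dutch 'elfproef' (eleven-test) on a potential BSN match.

    Non-digit characters are stripped first, mirroring the behaviour
    described above.
    """
    digits = [int(c) for c in re.sub(r"\D", "", bsn_match)]
    if len(digits) != 9:
        return False
    # Weights 9..2 for the first eight digits, -1 for the last one;
    # the weighted sum must be divisible by 11.
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    return sum(d * w for d, w in zip(digits, weights)) % 11 == 0


print(elfproef("111222333"))    # True
print(elfproef("111.222.334"))  # False
```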

class deduce.annotator.PhoneNumberAnnotator(phone_regexp: str, *args, min_digits: int = 9, max_digits: int = 11, **kwargs)

Bases: Annotator

Annotates phone numbers, based on a regexp and a min and max number of digits. Additionally applies some logic, such as detecting parentheses and hyphens.

Parameters:
  • phone_regexp – The regexp to detect phone numbers.

  • min_digits – The minimum number of digits that need to be present.

  • max_digits – The maximum number of digits that need to be present.

annotate(doc: Document) list[docdeid.annotation.Annotation]

Generate annotations for a document.

Parameters:

doc – The document that should be annotated.

Returns:

A list of annotations.
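
The digit-count filter can be sketched as follows (a simplified illustration, separate from the regexp matching itself):

```python
import re


def phone_digit_check(match: str, min_digits: int = 9,
                      max_digits: int = 11) -> bool:
    """Count the digits in a potential phone number match and accept
    it only when the count lies within the configured bounds."""
    n = len(re.findall(r"\d", match))
    return min_digits <= n <= max_digits


print(phone_digit_check("020-123 45 67"))  # True (10 digits)
print(phone_digit_check("12345"))          # False (too few digits)
```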

deduce.deduce

Loads Deduce and all its components.

class deduce.deduce.Deduce(load_base_config: bool = True, config: Optional[Union[str, dict]] = None, lookup_data_path: Union[str, Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/deduce/checkouts/latest/deduce/data/lookup'), build_lookup_structs: bool = False)

Bases: DocDeid

Main class for de-identification.

Inherits from docdeid.DocDeid, and as such, most information on deidentifying text with a Deduce object is available there.

Parameters:
  • load_base_config – Whether or not to load the base config that is packaged with deduce. This loads some sensible defaults, although further customization is always recommended.

  • config – A specific user config, either as a dict, or pointing to a json file. When load_base_config is set to True, only settings defined in config are overwritten, and other defaults are kept. When load_base_config is set to False, no defaults are loaded and only configuration from config is applied.

  • lookup_data_path – The path to look for lookup data, by default included in the package. If you want to make changes to the source files, it’s recommended to copy the source data and point deduce to this folder with this argument.

  • build_lookup_structs – Will always reload and rebuild lookup structs rather than using the cache when this is set to True.

deduce.lookup_struct_loader

Some functions for creating lookup structures from raw items.

deduce.lookup_struct_loader.load_common_word_lookup(raw_itemsets: dict[str, set[str]]) LookupSet

Load common_word LookupSet.

deduce.lookup_struct_loader.load_whitelist_lookup(raw_itemsets: dict[str, set[str]]) LookupSet

Load whitelist LookupSet.

Composed of medical terms, top 1000 frequent words (except surnames), and stopwords.

deduce.lookup_struct_loader.load_eponymous_disease_lookup(raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer) LookupTrie

Loads eponymous disease LookupTrie (e.g. Henoch-Schonlein).

deduce.lookup_struct_loader.load_prefix_lookup(raw_itemsets: dict[str, set[str]]) LookupSet

Load prefix LookupSet (e.g. ‘dr’, ‘mw’).

deduce.lookup_struct_loader.load_first_name_lookup(raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer) LookupTrie

Load first_name LookupTrie.

deduce.lookup_struct_loader.load_interfix_lookup(raw_itemsets: dict[str, set[str]]) LookupSet

Load interfix LookupSet (‘van der’, etc.).

deduce.lookup_struct_loader.load_surname_lookup(raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer) LookupTrie

Load surname LookupTrie.

deduce.lookup_struct_loader.load_street_lookup(raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer) LookupTrie

Load street LookupTrie.

deduce.lookup_struct_loader.load_placename_lookup(raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer) LookupTrie

Load placename LookupTrie.

deduce.lookup_struct_loader.load_hospital_lookup(raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer) LookupTrie

Load hospital LookupTrie.

deduce.lookup_struct_loader.load_institution_lookup(raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer) LookupTrie

Load institution LookupTrie.

deduce.lookup_structs

Responsible for loading, building and caching all lookup structures.

deduce.lookup_structs.load_raw_itemset(path: Path) set[str]

Load the raw items from a lookup list. This works by loading the data in items.txt, removing the data in exceptions.txt (if any), and then applying the transformations in transform_config.json (if any). If there are nested lookup lists, they will be loaded and treated as if they were in items.txt.

Parameters:

path – The path.

Returns:

The raw items, as a set of strings.
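
A sketch of these loading steps, using in-memory sets instead of the actual items.txt / exceptions.txt / transform_config.json files:

```python
def build_raw_itemset(items, exceptions=None, transform=None):
    """Sketch: start from the items, subtract the exceptions, then
    apply a transformation that maps each item to one or more items."""
    result = set(items)
    if exceptions:
        result -= set(exceptions)
    if transform:
        result = {t for item in result for t in transform(item)}
    return result


print(build_raw_itemset({"Jansen", "Testnaam"},
                        exceptions={"Testnaam"},
                        transform=lambda s: {s, s.lower()}))
# {'Jansen', 'jansen'} (in some order)
```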

deduce.lookup_structs.load_raw_itemsets(base_path: Path, subdirs: list[str]) dict[str, set[str]]

Loads one or more raw itemsets. The name of each itemset is parsed automatically from its folder name.

Parameters:
  • base_path – The base path containing the lists.

  • subdirs – The lists to load.

Returns:

The raw itemsets, represented as a dictionary mapping the name of the lookup list to a set of strings.

deduce.lookup_structs.validate_lookup_struct_cache(cache: dict, base_path: Path, deduce_version: str) bool

Validates lookup structure data loaded from cache. Invalidates when changes in source are detected, or when deduce version doesn’t match.

Parameters:
  • cache – The data loaded from the pickled cache.

  • base_path – The base path to check for changed files.

  • deduce_version – The current deduce version.

Returns:

True when the lookup structure data is valid, False otherwise.

deduce.lookup_structs.load_lookup_structs_from_cache(base_path: Path, deduce_version: str) Optional[DsCollection]

Loads lookup struct data from cache. Returns None when no cache is present, or when it’s invalid.

Parameters:
  • base_path – The base path where to look for the cache.

  • deduce_version – The current deduce version, used to validate.

Returns:

A DsCollection if present and valid, None otherwise.

deduce.lookup_structs.cache_lookup_structs(lookup_structs: DsCollection, base_path: Path, deduce_version: str) None

Saves lookup structs to cache, along with some metadata.

Parameters:
  • lookup_structs – The lookup structures to cache.

  • base_path – The base path for lookup structures.

  • deduce_version – The current deduce version.

deduce.lookup_structs.get_lookup_structs(lookup_path: Path, tokenizer: Tokenizer, deduce_version: str, build: bool = False, save_cache: bool = True) DsCollection

Loads all lookup structures, and handles caching.

Parameters:
  • lookup_path – The base path for lookup sets.

  • tokenizer – The tokenizer, used to create sequences for LookupTrie.

  • deduce_version – The current deduce version, used to validate cache.

  • build – Whether to do a full build, even when cache is present and valid.

  • save_cache – Whether to save to cache. Only used after building.

Returns:

The lookup structures.

deduce.person

class deduce.person.Person(first_names: Optional[list[str]] = None, initials: Optional[str] = None, surname: Optional[str] = None)

Bases: object

Contains information on a person.

Usable in document metadata, where annotators can access it for annotation.

first_names: Optional[list[str]] = None
initials: Optional[str] = None
surname: Optional[str] = None
classmethod from_keywords(patient_first_names: str = '', patient_initials: str = '', patient_surname: str = '', patient_given_name: str = '') Person

Get a Person from keywords. Mainly used for compatibility with the keywords as used in deduce<=1.0.8.

Parameters:
  • patient_first_names – The patient first names, separated by whitespace.

  • patient_initials – The patient initials.

  • patient_surname – The patient surname.

  • patient_given_name – The patient given name.

Returns:

A Person object containing the patient information.
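
A sketch of how such keywords could map onto the Person fields (illustrative only; in particular, the handling of patient_given_name is not shown):

```python
def person_from_keywords(patient_first_names="", patient_initials="",
                         patient_surname=""):
    """Sketch: first names are split on whitespace, and empty strings
    become None, matching the Person defaults."""
    return {
        "first_names": patient_first_names.split() or None,
        "initials": patient_initials or None,
        "surname": patient_surname or None,
    }


print(person_from_keywords(patient_first_names="Jan Willem",
                           patient_surname="Jansen"))
# {'first_names': ['Jan', 'Willem'], 'initials': None, 'surname': 'Jansen'}
```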

__eq__(other)

Return self==value.

deduce.redactor

class deduce.redactor.DeduceRedactor(open_char: str = '[', close_char: str = ']', check_overlap: bool = True)

Bases: SimpleRedactor

Implements the redacting logic of Deduce:

  • All annotations with “patient” tag are replaced with <PATIENT>

  • All other annotations are replaced with <TAG-n>, with n identifying a group of annotations with a similar text (edit_distance <= 1).

redact(text: str, annotations: AnnotationSet) str

Redact the text.

Parameters:
  • text – The input text.

  • annotations – The annotations that are produced by previous document processors.

Returns:

The redacted text.

deduce.tokenizer

class deduce.tokenizer.DeduceTokenizer(merge_terms: Optional[Iterable] = None)

Bases: Tokenizer

Tokenizes text, where a token is any sequence of alphanumeric characters (case insensitive), a single newline/tab character, or a single special character. It does not include whitespaces as tokens.

Parameters:
  • merge_terms – An iterable of strings that should not be split (i.e. always returned as tokens).

deduce.utils

deduce.utils.str_match(str_1: str, str_2: str, max_edit_distance: Optional[int] = None) bool

Match two strings, potentially in a fuzzy way.

Parameters:
  • str_1 – The first string.

  • str_2 – The second string.

  • max_edit_distance – The max edit distance between the two strings. Exact matching will be used if this argument is not provided.

Returns:

True if the strings match, False otherwise.
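
A sketch of this behaviour, using a straightforward dynamic-programming edit distance (the actual implementation may compute the distance differently):

```python
def str_match(str_1, str_2, max_edit_distance=None):
    """Exact comparison unless a maximum edit distance is given, in
    which case a Levenshtein distance within that bound matches."""
    if max_edit_distance is None:
        return str_1 == str_2
    # Row-by-row Levenshtein distance.
    prev = list(range(len(str_2) + 1))
    for i, c1 in enumerate(str_1, 1):
        cur = [i]
        for j, c2 in enumerate(str_2, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (c1 != c2)))  # substitution
        prev = cur
    return prev[-1] <= max_edit_distance


print(str_match("Jansen", "Janssen"))     # False (exact matching)
print(str_match("Jansen", "Janssen", 1))  # True (distance 1)
```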

deduce.utils.class_for_name(module_name: str, class_name: str) type

Will import and return the class by name.

Parameters:
  • module_name – The module where the class can be found.

  • class_name – The class name.

Returns:

The class.

deduce.utils.initialize_class(cls: type, args: dict, extras: dict) object

Initialize a class. Any arguments in args are passed to the class initializer. Any items in extras are passed to the class initializer if they are present.

Parameters:
  • cls – The class to initialize.

  • args – The arguments to pass to the initializer.

  • extras – A superset of arguments that should be passed to the initializer. Will be checked against the class.

Returns:

An instance of the class, initialized with the relevant arguments and extras.

deduce.utils.overwrite_dict(base: dict, add: dict) dict

Overwrites the items of the first dict with those of the second.

Accepts nested dictionaries.
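
A sketch of such a recursive overwrite (the config keys in the example are made up for illustration):

```python
def overwrite_dict(base: dict, add: dict) -> dict:
    """Recursively overwrite base with add: nested dicts are merged,
    any other value in add replaces the one in base."""
    for key, value in add.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            overwrite_dict(base[key], value)
        else:
            base[key] = value
    return base


base = {"redact": {"open_char": "[", "close_char": "]"}, "iterative": True}
print(overwrite_dict(base, {"redact": {"open_char": "<"}}))
# {'redact': {'open_char': '<', 'close_char': ']'}, 'iterative': True}
```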

deduce.utils.has_overlap(intervals: list[tuple]) bool

Checks if there is any overlap in a list of tuples. Assumes the interval ranges from the first to the second element of the tuple. Any other elements are ignored.

Parameters:

intervals – The intervals, as a list of tuples

Returns:

True if there is any overlap between tuples, False otherwise.
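
A possible sketch; whether touching intervals such as (0, 5) and (5, 8) count as overlapping is an assumption of this illustration:

```python
def has_overlap(intervals: list) -> bool:
    """Sort by start position, then check whether any interval starts
    before the previous one ends. Extra tuple elements are ignored."""
    ordered = sorted(intervals, key=lambda iv: iv[0])
    return any(cur[0] < prev[1] for prev, cur in zip(ordered, ordered[1:]))


print(has_overlap([(0, 5), (5, 8)]))  # False (touching, not overlapping)
print(has_overlap([(0, 5), (3, 8)]))  # True
```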

deduce.utils.repl_segments(s: str, matches: list[tuple]) list[list[str]]

Segment a string into consecutive substrings, with one or more options for each substring.

Parameters:
  • s – The input string.

  • matches – A list of matches, consisting of a tuple with start- and end char, followed by a list of options for that substring, e.g. (5, 8, [“Mr.”, “Meester”]).

Returns:

A list of options that together segment the entire string, e.g. [[“Prof.”, “Professor”], [” “], [“Meester”, “Mr.”], [” Lievenslaan”]].

deduce.utils.str_variations(s: str, repl: dict[str, list[str]]) list[str]

Gets all possible textual variations of a string, by combining any subset of replacements defined in the repl dictionary. E.g.: the input string ‘Prof. Mr. Lievenslaan’ combined with the mapping {‘Prof.’: [‘Prof.’, ‘Professor’], ‘Mr.’: [‘Mr.’, ‘Meester’]} will result in the following variations: [‘Prof. Mr. Lievenslaan’, ‘Professor Mr. Lievenslaan’, ‘Prof. Meester Lievenslaan’, ‘Professor Meester Lievenslaan’].

Parameters:
  • s – The input string

  • repl – A mapping of substrings to one or multiple replacements, e.g. {‘Professor’: [‘Professor’, ‘Prof.’, ‘prof.’]}. The key will be matched using re.finditer, so both literal phrases and regular expressions can be used.

Returns:

A list containing all possible textual variations.
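
A sketch of the combination logic using itertools.product; it assumes every occurrence of a pattern receives the same replacement, which may differ from the actual implementation:

```python
import re
from itertools import product


def str_variations(s: str, repl: dict) -> list:
    """For each pattern, pick one replacement option and apply it to
    all matches of that pattern; emit every combination of choices."""
    patterns = list(repl)
    results = []
    for choice in product(*(repl[p] for p in patterns)):
        variation = s
        for pattern, option in zip(patterns, choice):
            variation = re.sub(pattern, option, variation)
        results.append(variation)
    return results


print(str_variations("Prof. Mr. Lievenslaan",
                     {r"Prof\.": ["Prof.", "Professor"],
                      r"Mr\.": ["Mr.", "Meester"]}))
```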

deduce.utils.apply_transform(items: set[str], transform_config: dict) set[str]

Applies a transformation to a set of items.

Parameters:
  • items – The input items.

  • transform_config – The transformation, including configuration (see transform.json for examples).

Returns: The transformed items.

deduce.utils.optional_load_items(path: Path) Optional[set[str]]

Load items (lines) from a textfile, returning None if file does not exist.

Parameters:

path – The full path to the file.

Returns: The lines of the file as a set if the file exists, None otherwise.

deduce.utils.optional_load_json(path: Path) Optional[dict]

Load json, returning None if file does not exist.

Parameters:

path – The full path to the file.

Returns: The json data as a dict if the file exists, None otherwise.

deduce.utils.lookup_set_to_trie(lookup_set: LookupSet, tokenizer: Tokenizer) LookupTrie

Converts a LookupSet into an equivalent LookupTrie.

Parameters:
  • lookup_set – The input LookupSet

  • tokenizer – The tokenizer used to create sequences

Returns: A LookupTrie with the same items and matching pipeline as the input LookupSet.