
Co-reference Resolution

When a document refers to the same person in multiple ways, Xybern Redact ensures every variant gets the same pseudonym. A legal contract that mentions "Michael Chen", "Mr. Chen", "Dr. Chen", and "M. Chen" will have all four replaced consistently - the LLM never sees any form of the real name.


What Gets Matched

After the primary named-entity pass detects a full name, a second pass (the variant pass) searches the remaining text for the following forms:

Variant type            Example
Title and last name     Mr. Chen, Ms. Chen, Dr. Chen, Mrs. Chen, Prof. Chen
Initial and last name   M. Chen
Bare last name          Chen (when unambiguous - see below)

Titles are preserved in front of the pseudonym so the anonymized text reads naturally. "Mr. Chen agreed" becomes "Mr. Morgan Ross agreed".
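
The title and initial rules above can be sketched with two regular expressions. This is a minimal illustration, not Xybern's actual implementation: the function name replace_variants, the TITLES tuple, and the assumption that the full name has already been replaced by the primary pass are all hypothetical.

```python
import re

# Titles taken from the table above; the real list may differ (assumption).
TITLES = ("Mr", "Ms", "Mrs", "Dr", "Prof")


def replace_variants(text: str, full_name: str, pseudonym: str) -> str:
    """Replace title+last and initial+last variants of full_name (hypothetical helper)."""
    parts = full_name.split()
    if len(parts) < 2:
        # Single-word names have no first/last split, so no variants are generated.
        return text
    first, last = parts[0], parts[-1]

    # "Dr. Chen" -> "Dr. Morgan Ross": keep the title as written, swap the name.
    title_alt = "|".join(TITLES)
    title_re = re.compile(rf"\b((?:{title_alt})\.?)\s+{re.escape(last)}\b")
    text = title_re.sub(lambda m: f"{m.group(1)} {pseudonym}", text)

    # "M. Chen" -> "Morgan Ross": the initial is dropped with the real name.
    initial_re = re.compile(rf"\b{re.escape(first[0])}\.\s+{re.escape(last)}\b")
    return initial_re.sub(lambda m: pseudonym, text)


if __name__ == "__main__":
    sample = "Mr. Chen agreed. M. Chen signed. Prof. Chen left."
    print(replace_variants(sample, "Michael Chen", "Morgan Ross"))
    # Mr. Morgan Ross agreed. Morgan Ross signed. Prof. Morgan Ross left.
```

Capturing the title in a group is what lets the sketch keep "Mr." or "Prof." in front of the pseudonym while discarding the real surname.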


Bare Last Name Matching

A bare last name (e.g. "Chen" appearing alone) is replaced only when:

  • The name is 4 or more characters long
  • It is unambiguous - no other person in the document shares the same last name
  • The spaCy NLP model is active (in regex-fallback mode, full names may still be present in the text, making bare-last substitution unsafe)

If two people named "Michael Chen" and "Sarah Chen" both appear in the document, bare "Chen" references are left unchanged because they cannot be attributed with certainty.
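
The three conditions above amount to a simple eligibility check. The sketch below is illustrative only: bare_last_names_to_replace, MIN_BARE_LAST_LEN, and the nlp_active flag are hypothetical names, and the input is assumed to be the list of full names found by the primary pass.

```python
from collections import Counter

# Minimum surname length stated in the rules above.
MIN_BARE_LAST_LEN = 4


def bare_last_names_to_replace(full_names: list[str], nlp_active: bool) -> set[str]:
    """Return the surnames that are safe to replace when they appear alone."""
    if not nlp_active:
        # Regex-fallback mode: bare-last substitution is considered unsafe.
        return set()
    # Count how many distinct people carry each surname.
    counts = Counter(
        name.split()[-1] for name in set(full_names) if len(name.split()) >= 2
    )
    return {
        last
        for last, n in counts.items()
        if n == 1 and len(last) >= MIN_BARE_LAST_LEN
    }


if __name__ == "__main__":
    people = ["Michael Chen", "Sarah Chen", "Sarah Johnson", "Amy Li"]
    print(sorted(bare_last_names_to_replace(people, nlp_active=True)))
    # ['Johnson']
```

In this example "Chen" is excluded because two people share it, and "Li" is excluded because it is shorter than four characters.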


Same-name Merging

If the NLP model detects "Dr. Chen" as a separate entity (rather than recognising it as a variant of "Michael Chen"), Xybern checks whether the detected name shares a last name with an already-known person. When there is exactly one match, the existing pseudonym is reused instead of generating a new one.

This means a document where spaCy detects "Michael Chen" and "Dr. Chen" as separate entities will still produce a single consistent pseudonym for both.
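
A rough sketch of the merging rule follows. The names resolve_pseudonym, known, and make_pseudonym are hypothetical, and the behaviour when zero or several known people share the last name (a fresh pseudonym is generated) is an assumption drawn from the "exactly one match" wording rather than documented behaviour.

```python
from typing import Callable


def resolve_pseudonym(
    detected: str,
    known: dict[str, str],
    make_pseudonym: Callable[[], str],
) -> str:
    """Reuse an existing pseudonym when exactly one known person shares the last name."""
    if detected in known:
        return known[detected]

    last = detected.split()[-1]
    matches = [name for name in known if name.split()[-1] == last]
    if len(matches) == 1:
        # "Dr. Chen" is treated as a variant of the single known "... Chen".
        return known[matches[0]]

    # No match, or an ambiguous one: register a new person (assumed behaviour).
    pseudonym = make_pseudonym()
    known[detected] = pseudonym
    return pseudonym


if __name__ == "__main__":
    known = {"Michael Chen": "Morgan Ross"}
    print(resolve_pseudonym("Dr. Chen", known, lambda: "Drew Chen"))
    # Morgan Ross
```

The variant is deliberately not added to known, so a later "Dr. Chen" lookup still sees only one person with that last name.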


Pseudonym Protection

All pseudonyms placed during the primary pass are temporarily protected before the variant pass runs. This prevents a pseudonym that happens to contain a surname found in the document (e.g. "Drew Chen" assigned to a different person) from being incorrectly overwritten by a variant rule targeting "Chen".
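
One common way to implement this kind of protection is to swap each placed pseudonym for an opaque placeholder, run the variant rules, and then swap the pseudonyms back. The sketch below assumes that approach; the protect and restore helpers and the NUL-delimited placeholder format are hypothetical and may not match how Xybern does it.

```python
import re


def protect(text: str, pseudonyms: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each placed pseudonym with an opaque placeholder token."""
    vault: dict[str, str] = {}
    # Longest first, so one pseudonym contained in another is not split.
    for i, pseudonym in enumerate(sorted(set(pseudonyms), key=len, reverse=True)):
        token = f"\x00{i}\x00"
        vault[token] = pseudonym
        text = text.replace(pseudonym, token)
    return text, vault


def restore(text: str, vault: dict[str, str]) -> str:
    """Put the protected pseudonyms back after the variant pass."""
    for token, pseudonym in vault.items():
        text = text.replace(token, pseudonym)
    return text


if __name__ == "__main__":
    # State after the primary pass: "Drew Chen" is a pseudonym, bare "Chen" is real.
    text = 'Morgan Ross (hereinafter "Chen") met Drew Chen.'
    masked, vault = protect(text, ["Morgan Ross", "Drew Chen"])
    masked = re.sub(r"\bChen\b", "Morgan Ross", masked)  # variant rule for "Chen"
    print(restore(masked, vault))
    # Morgan Ross (hereinafter "Morgan Ross") met Drew Chen.
```

Without the protect step, the same variant rule would turn the pseudonym "Drew Chen" into "Drew Morgan Ross".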


Example

Input:

This agreement is between Michael Chen (hereinafter "Chen") and the Company.
Mr. Chen agrees to the terms. Dr. Chen will sign on behalf of the group.
M. Chen has reviewed the document.
Sarah Johnson attended. Ms. Johnson reviewed the contract.

Output:

This agreement is between Morgan Ross (hereinafter "Morgan Ross") and the Company.
Mr. Morgan Ross agrees to the terms. Dr. Morgan Ross will sign on behalf of the group.
Morgan Ross has reviewed the document.
Drew Chen attended. Ms. Drew Chen reviewed the contract.

Every reference to Michael Chen resolves to "Morgan Ross". Sarah Johnson and Ms. Johnson both resolve to "Drew Chen". The two people receive independent pseudonyms.


Limitations

  • Pronouns and contextual references - "he", "his", "the CEO", "the plaintiff" are not resolved. Co-reference resolution covers name variants only; resolving pronouns and role descriptions requires a dedicated co-reference model and is outside the current scope.
  • Ambiguous last names - if two people share a last name, bare-last matching is skipped for that name to avoid misattribution.
  • Single-word names - a name detected as a single token (e.g. a mononym) does not generate variants since there is no first/last name split to work from.