ArXiv Preprint
Language Models (LMs) have been shown to leak information about training data
through sentence-level membership inference and reconstruction attacks.
The risk of LMs leaking Personally Identifiable Information (PII) has
received less attention, which can be attributed to the false assumption
that dataset curation techniques such as scrubbing are sufficient to prevent
PII leakage. Scrubbing techniques reduce but do not prevent the risk of PII
leakage: in practice, scrubbing is imperfect and must balance the trade-off
between minimizing disclosure and preserving the utility of the dataset. On the
other hand, it is unclear to what extent algorithmic defenses such as
differential privacy, designed to guarantee sentence- or user-level privacy,
prevent PII disclosure. In this work, we propose (i) a taxonomy of PII leakage
in LMs, (ii) metrics to quantify PII leakage, and (iii) attacks showing that
PII leakage is a threat in practice. Our taxonomy provides rigorous game-based
definitions for PII leakage via black-box extraction, inference, and
reconstruction attacks with only API access to an LM. We empirically evaluate
attacks against GPT-2 models fine-tuned on three domains: case law, health
care, and e-mails. Our main contributions are (i) novel attacks that can
extract up to 10 times more PII sequences than existing attacks, (ii) showing
that sentence-level differential privacy reduces the risk of PII disclosure but
still leaks about 3% of PII sequences, and (iii) a subtle connection between
record-level membership inference and PII reconstruction.
Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, Santiago Zanella-Béguelin
2023-02-01
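
To make the black-box extraction attack described in the abstract concrete, the sketch below samples generations from a causal LM through its public generation API only and tags candidate PII spans with an off-the-shelf NER model. This is a minimal illustration under assumptions of ours, not the authors' exact pipeline: "gpt2" stands in for a model fine-tuned on a private domain corpus, the default Hugging Face NER model is used as the PII tagger, and the sample count and token budget are arbitrary.

```python
# Minimal sketch of a black-box PII extraction attack: sample from the LM's
# generation API, then flag PII-like entities in the outputs. Illustrative only;
# model choices and parameters are assumptions, not the paper's setup.
from collections import Counter

from transformers import pipeline

# Assumption: "gpt2" stands in for an LM fine-tuned on a private corpus.
generator = pipeline("text-generation", model="gpt2")
# Default NER pipeline as a stand-in PII tagger (person names as a proxy for PII).
ner = pipeline("ner", aggregation_strategy="simple")


def extract_pii(num_samples: int = 100, max_new_tokens: int = 64) -> Counter:
    """Sample (near-)unconditional generations and count PII-like entity mentions."""
    counts: Counter = Counter()
    prompt = generator.tokenizer.bos_token  # start from the BOS token for unconditional sampling
    for _ in range(num_samples):
        text = generator(
            prompt, max_new_tokens=max_new_tokens, do_sample=True
        )[0]["generated_text"]
        for entity in ner(text):
            if entity["entity_group"] == "PER":
                counts[entity["word"].strip()] += 1
    return counts


if __name__ == "__main__":
    # Frequently repeated names are candidate memorized PII sequences.
    for name, freq in extract_pii().most_common(10):
        print(f"{freq:3d}  {name}")
```

In the paper's terminology, the set of distinct tagged spans recovered this way would then be scored against the PII actually present in the training data, e.g. by extraction precision and recall, to quantify leakage.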