Research projects

AnonymAI

Legally Compliant Text and Voice Anonymization through Artificial Intelligence

AnonymAI allows for automatic anonymization and the protection of users’ personal data contained in large collections of unstructured natural language content, in compliance with European and International regulations and standards. 

The service was developed under the “AnonymAI: Legal Compliant Text and Voice Anonymization through Artificial Intelligence” research project (August 2020 – April 2021) supported by the H2020 project “NGI Trust”, in collaboration with ICT Legal Consulting (ICTLC), an international law firm specialized in Privacy Protection, Security and Intellectual Property Law in the ICT field.

The approach and partnership behind AnonymAI integrate innovative technologies with regulatory obligations and make it possible to specify anonymization profiles that are both legally compliant and customized to the needs of end-users (e.g. avoiding the masking of data that is relevant for secondary linguistic analysis whenever possible), thus putting people and their rights at the center of technology design. Our anonymization service includes:

– a cutting-edge automatic anonymization tool that uses NLP technologies in a hybrid approach, combining a Deep Learning system with a rule-based one, to allow more precise configuration for each domain’s unstructured content (text and voice transcriptions). It covers both common personal data (proper names, locations, ID numbers, telephone numbers, and e-mail addresses) and special categories of personal data. This makes the system scalable to new user requirements and to new relevant Personally Identifiable Information (PII), so as to provide services for a wide variety of domains;

– a balanced integration of innovative technologies with legal and regulatory obligations. By providing a set of guidelines, ethical considerations, and a checklist for self-assessment and self-guidance in terms of compliance posture with the legal requirements, we promote effective and fair use of technology in the context of data analytics and data anonymization that guarantees the interests of all actors involved, including data subjects.

The AnonymAI prototype combines state-of-the-art Deep Learning models (fine-tuned multilingual BERT models) with NLP techniques (supporting Gazetteers and pattern-based entity detection) and supports a large hierarchical tag set of information types (developed within the AnonymAI research project in conjunction with ICT Legal Consulting) including both direct identifiers (e.g. name, ID, email) and indirect identifiers (e.g. location, age, gender). Through the integration of domain Gazetteers and Ontologies, the prototype is capable of differentiating between public and private entities’ names in order to apply anonymization only when it is needed.
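The rule-based side of the hybrid approach described above can be illustrated with a minimal Python sketch. It is not the AnonymAI implementation: the tag names, the regular expressions, the gazetteer entries and the functions find_entities and anonymize are all assumptions made up for illustration, and the Deep Learning component is left out entirely. It only shows how pattern-based detection and a gazetteer that marks names as public or private could work together so that public entities are left unmasked.

```python
import re

# Hypothetical tags; the project's actual hierarchical tag set is not listed here.
PATTERNS = {
    "DIRECT/EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DIRECT/PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

# Toy gazetteer (assumption): each known name is marked public or private.
NAME_GAZETTEER = {
    "Maria Rossi": "private",
    "John Smith": "private",
    "European Commission": "public",
}


def find_entities(text):
    """Return sorted (start, end, tag) spans found by patterns and the gazetteer."""
    spans = []
    for tag, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), tag))
    for name, visibility in NAME_GAZETTEER.items():
        if visibility == "public":
            continue  # public entities are deliberately not anonymized
        for match in re.finditer(re.escape(name), text):
            spans.append((match.start(), match.end(), "DIRECT/NAME"))
    return sorted(spans)


def anonymize(text):
    """Replace every detected span with its tag in square brackets."""
    pieces = []
    cursor = 0
    for start, end, tag in find_entities(text):
        if start < cursor:
            continue  # skip spans overlapping one that was already replaced
        pieces.append(text[cursor:start])
        pieces.append(f"[{tag}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = ("Maria Rossi (maria.rossi@example.com, +39 055 1234567) "
          "wrote to the European Commission.")
print(anonymize(sample))
# [DIRECT/NAME] ([DIRECT/EMAIL], [DIRECT/PHONE]) wrote to the European Commission.
```

In a hybrid setup, spans predicted by a fine-tuned model would presumably be merged with these rule-based spans before replacement; how the prototype actually does this is not described here.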

The system is multilingual by design and currently supports English and Italian; other languages can be added by extending the current training dataset.

Textual documents may contain personal and sensitive data that can be used, directly or indirectly, to identify an individual, e.g. in free-text sections of anonymous surveys, in internal company reports, in transcripts of phone conversations, etc. With the introduction of Regulation (EU) 2016/679 (GDPR) and the need to comply with the ISO/IEC 27001 requirements, privacy-enhancing technologies have become particularly important for companies, public administrations, and other organizations.

Anonymization needs vary from domain to domain, and personal data must be distinguished from business-relevant and other useful information. For example, in the medical domain, in order to share data useful for research, medical records must be anonymized with respect to the names of patients and doctors and other personally identifiable information (PII), without removing references to treatments, types of medical devices, or institutions that are essential for enhancing clinical discoveries and treatments.
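The idea of a domain-specific anonymization profile, as in the medical example above, can be sketched as follows. This is an illustrative assumption, not the project's actual profile format: the Entity class, the tag names, MEDICAL_RESEARCH_PROFILE and the helper functions are hypothetical. Entities are assumed to have been detected already and not to overlap; the profile only decides which tags are masked and which are kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # offset of the first character
    end: int    # offset one past the last character
    tag: str    # hypothetical information type, e.g. "PERSON"


# Hypothetical profile: tags to mask when sharing medical records for research.
# Treatments, devices and institutions are not listed, so they are kept.
MEDICAL_RESEARCH_PROFILE = {"PERSON", "ID_NUMBER"}


def span(text, phrase, tag):
    """Build an Entity for the first occurrence of phrase (illustration helper)."""
    start = text.index(phrase)
    return Entity(start, start + len(phrase), tag)


def apply_profile(text, entities, masked_tags):
    """Mask only entities whose tag is in masked_tags; keep all other text."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e.start):
        pieces.append(text[cursor:ent.start])
        if ent.tag in masked_tags:
            pieces.append(f"[{ent.tag}]")
        else:
            pieces.append(text[ent.start:ent.end])
        cursor = ent.end
    pieces.append(text[cursor:])
    return "".join(pieces)


record = "Dr. Bianchi prescribed metformin to Anna Verdi."
detected = [
    span(record, "Bianchi", "PERSON"),
    span(record, "metformin", "TREATMENT"),
    span(record, "Anna Verdi", "PERSON"),
]
print(apply_profile(record, detected, MEDICAL_RESEARCH_PROFILE))
# Dr. [PERSON] prescribed metformin to [PERSON].
```

Keeping the profile separate from detection means the same detected entities could be masked differently for another domain by swapping the set of tags.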

Anonymization and pseudonymization are ways to retain the benefits of using data (and therefore to maintain data-driven procedures) while mitigating the risks that come with processing and storing personal data, including special categories.
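To make the difference between the two concrete: anonymization removes the identifier, while pseudonymization replaces it with a stable stand-in so that records about the same person can still be linked. The sketch below shows one common way to do this, a keyed hash (HMAC-SHA256) from the Python standard library. It is only an assumption for illustration; the text does not say which technique AnonymAI uses, and the function name, the prefix and the key are hypothetical. Under the GDPR, pseudonymized data is still personal data, because whoever holds the key can recompute the mapping from known names.

```python
import hashlib
import hmac


def pseudonym(value: str, key: bytes, prefix: str = "PERSON") -> str:
    """Map a value to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; a longer digest lowers the collision risk.
    return f"{prefix}_{digest[:10]}"


# Hypothetical key; in practice it must be stored separately and kept secret.
secret_key = b"replace-with-a-secret-key"

first = pseudonym("Anna Verdi", secret_key)
second = pseudonym("Anna Verdi", secret_key)
other = pseudonym("Luca Neri", secret_key)

print(first == second)  # the same person always gets the same pseudonym
print(first == other)   # different people get different pseudonyms
```

Because the mapping is deterministic for a given key, analyses that count or link mentions of the same person still work on the pseudonymized text.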

NGI Trust, a Horizon 2020 project of the European Union (grant agreement No. 825618), promotes the development of a human-centric internet and a stronger European ecosystem of researchers, innovators, and technology developers in the field of privacy and trust-enhancing technologies.

The list of projects that have been selected under the third NGI Trust open call can be found here: https://wiki.geant.org/display/NGITrust/Funded+Projects+Call+3.