Privacy Preserving Machine Learning

Privacy Preserving Machine Learning — Course Page

Course taught by Aurélien Bellet
Master 2 Data Science, University of Lille

About this course

Personal data is being collected at an unprecedented scale by businesses and public organizations, driven by the progress of data science and AI. While training machine learning models on such personal or otherwise confidential data can be beneficial in many applications, this can also lead to undesirable (sometimes catastrophic) disclosure of sensitive information. In particular, machine learning models often contain precise information about individual data points that were used to train them. We must therefore deal with two conflicting objectives: maximizing the utility of the machine learning model while protecting the privacy of individuals whose data is used in the analysis. Unfortunately, recent years have shown that standard data anonymization techniques cannot reliably prevent leakage without largely destroying utility.

So how can we achieve both utility and privacy, or rather, obtain a good trade-off between the two? This course focuses on differential privacy, a mathematical definition of privacy that comes with rigorous guarantees, together with an algorithmic framework for designing practical privacy-preserving algorithms for data analytics and machine learning. Differential privacy has become the gold standard in various fields and has recently seen several real-world deployments by companies and government agencies.
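For orientation, the definition referred to above is usually stated as follows. This is the standard textbook formulation (as in Dwork and Roth); the symbols $\mathcal{A}$, $D$, $D'$ and $S$ are notation chosen here and may differ from the notation used in the lectures.

```latex
A randomized algorithm $\mathcal{A}$ is $(\varepsilon, \delta)$-differentially private
if, for all pairs of datasets $D, D'$ that differ in a single record and for all
measurable sets $S$ of possible outputs,
\[
  \Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{A}(D') \in S] + \delta .
\]
% The case $\delta = 0$ is often called pure $\varepsilon$-differential privacy.
```

Smaller values of $\varepsilon$ and $\delta$ mean that the output distribution depends less on any single record, hence stronger privacy.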

This course will start by briefly illustrating why classic data anonymization schemes fail and how machine learning models and other aggregate statistics may leak sensitive information about individual data points. This will set the stage for differential privacy, which provides a principled probabilistic definition of what makes an algorithm private. We will introduce the formal definition of differential privacy and analyze its key properties. We will then turn to the design of differentially private algorithms in the standard centralized model, where a trusted curator wants to release the result of an analysis in a privacy-preserving way. We will introduce key algorithmic tools for privately answering simple numeric and non-numeric queries and prove their privacy and utility guarantees. These mechanisms will serve as building blocks to construct private machine learning algorithms. We will in particular focus on private empirical risk minimization with techniques based on randomly perturbing non-private models (output perturbation) or gradients (private stochastic gradient descent). Finally, we will consider the decentralized model of differential privacy, where individuals or data owners do not trust a curator to handle their private data. We will introduce local differential privacy, discuss intermediate trust models, and present applications to private federated learning where several data owners collaborate to train a model without sharing their data.
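To give a flavor of the "simple numeric queries" mentioned above, here is a minimal sketch of the Laplace mechanism applied to a counting query. It is an illustration written for this page, not code from the course notebooks: the names `laplace_noise` and `private_count` and the toy `ages` data are assumptions, and the floating-point sampling is not hardened for real deployments.

```python
import random

def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records satisfying `predicate`.

    Adding or removing one record changes a count by at most 1, so the
    L1 sensitivity is 1 and Laplace noise of scale 1 / epsilon is used.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical toy data: the true number of people aged 40 or more is 4.
ages = [23, 35, 41, 29, 62, 57, 33, 48]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=random.Random(0))
print(f"noisy count of people aged 40+: {noisy:.2f}")
```

A smaller `epsilon` gives stronger privacy but a larger noise scale, which is the utility/privacy trade-off the course analyzes formally.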

Lecture slides

Lecture 1: Introduction & course overview
Lecture 2: Differential privacy & first building blocks
Lecture 3: The exponential mechanism & advanced composition
Lecture 4: Differentially private empirical risk minimization
Lecture 5: Differentially private stochastic gradient descent
Lecture 6: Beyond the centralized model of differential privacy

Practical sessions in Python (Jupyter notebooks)

Practical 1: Differential privacy for numeric queries
Practical 2: Differential privacy for non-numeric queries & using composition results
Practical 3: Differentially private ERM via output perturbation
Practical 4: Differentially private SGD
Practical 5: Local differential privacy & federated learning
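As a taste of the local model covered in the last practical, the sketch below shows binary randomized response, a classic locally private mechanism in which each individual perturbs their own bit before sending it. It is not taken from the course notebooks: the function names, the synthetic population, and the choice of epsilon are assumptions made for illustration.

```python
import math
import random

def randomize_bit(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def estimate_frequency(reports, epsilon):
    """Debias the noisy reports to estimate the true fraction of 1s."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = f * p + (1 - f) * (1 - p); solve for f.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

# Hypothetical population in which 30% of individuals hold the value 1.
rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000
reports = [randomize_bit(b, 1.0, rng) for b in true_bits]
estimate = estimate_frequency(reports, 1.0)
print(f"estimated fraction of 1s: {estimate:.3f}")
```

No trusted curator ever sees the raw bits here; the price is a noisier estimate than the centralized model would give for the same epsilon.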

Assignments

There will be two assignments. The final grade for the course will be the average of the two assignment grades.

Assignment 1 (lab session): Practical 3 will be graded. The deadline for sending your report is November 21. Refer to the notebook for detailed instructions.
Assignment 2 (paper presentation): In groups of two students, you will choose a research paper on privacy-preserving machine learning and write a report on it. The report should present the problem tackled in the paper, the main results and how they advance the previous literature, a critical view of the merits and drawbacks of the proposed approach, and a discussion of potential open questions. It should be understandable by other students in the class who have not read the paper. Each group will also give a 10-minute presentation to the class (followed by 3 minutes of questions) on January 12. The report (in PDF format) should be sent to me by email no later than January 10.
You can find below a list of suggested papers. Please coordinate with the rest of the class so that no paper is picked twice. You are welcome to suggest a paper that is not on this list, but you should confirm your choice with me first.
- Abadi et al. Deep Learning with Differential Privacy. CCS 2016
- Andrés et al. Geo-Indistinguishability: Differential Privacy for Location-Based Systems. CCS 2013
- Carlini et al. Extracting Training Data from Large Language Models. USENIX Security 2021
- Carlini et al. Membership Inference Attacks From First Principles. S&P 2022
- Cyffers & Bellet. Privacy Amplification by Decentralization. AISTATS 2022
- De et al. Unlocking High-Accuracy Differentially Private Image Classification through Scale. arXiv 2022
- Erlingsson et al. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. CCS 2014
- Feldman et al. Privacy Amplification by Iteration. FOCS 2018
- Feldman et al. Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling. FOCS 2021
- Feyisetan et al. Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations. WSDM 2020
- Garfinkel et al. Issues Encountered Deploying Differential Privacy. WPES@CCS 2018
- Geiping et al. Inverting Gradients - How easy is it to break privacy in federated learning? NeurIPS 2020
- Jayaraman & Evans. Evaluating Differentially Private Machine Learning in Practice. USENIX Security 2019
- McKenna & Sheldon. Permute-and-Flip: A new mechanism for differentially private selection. NeurIPS 2020
- McMahan et al. Learning Differentially Private Recurrent Language Models. ICLR 2018
- Nasr et al. Adversary Instantiation: Lower Bounds for Differentially Private Machine Learning. S&P 2021
- Papernot et al. Scalable Private Learning with PATE. ICLR 2018
- Wang et al. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. UAI 2018
- Yu et al. Differentially Private Fine-tuning of Language Models. ICLR 2022

Textbooks and survey papers

C. Dwork and A. Roth, The Algorithmic Foundations of Differential Privacy, Foundations and Trends in Theoretical Computer Science, 2014
(Reference textbook on differential privacy)
K. Nissim et al., Differential Privacy: A Primer for a Non-technical Audience, Vanderbilt Journal of Entertainment & Technology Law, 2018
(Introduction to differential privacy for a non-technical audience, with many illustrating examples and discussions regarding legal aspects)
S. Vadhan, The Complexity of Differential Privacy, Tutorials on the Foundations of Cryptography, 2017
(Another textbook on differential privacy, with emphasis on computational aspects)
C. Castelluccia and B. Nguyen, Techniques d’anonymisation tabulaire : concepts et mise en œuvre, 2020
(Gentle introduction to classic data anonymization techniques as well as to differential privacy, in French)
P. Kairouz et al., Advances and Open Problems in Federated Learning, 2019
(Reference survey on all aspects of federated learning, with Section 4 devoted to privacy)