March 29, 2023

This is what AI builders should know about data protection and privacy

Last week, OpenAI disclosed a data leak in which snippets of some users’ conversations with ChatGPT were visible to other ChatGPT users. The leak added fuel to the fire of growing concerns around AI and data privacy. Users are flocking to AI chatbots to summarize documents, synthesize notes, debug code, and draft email replies, without thinking twice about whether their prompts contain sensitive business or personal data. This has businesses particularly concerned. Some companies, including JPMorgan, have already moved to restrict employee use of ChatGPT.

Every software company is subject to regulation around data protection and privacy. This is certainly not the first time users or businesses have trusted a third-party app with sensitive data (hello social media, hello B2B SaaS). AI companies must deal with all the usual challenges plus one big new one: black box models that can behave in unexpected ways. The main risk is that a model might reveal sensitive data, either inadvertently or as the result of an adversarial attack. The good news is there are several ways AI builders can mitigate that risk. 

So, how should AI builders think about data protection and privacy? What are some key concerns for builders developing their own AI models or leveraging third-party models like those developed by OpenAI? 

I’ll start with an overview of privacy regulations and concepts that are relevant for businesses today, then dive into the specific considerations for AI builders.

Privacy regulations and challenges for AI

Data protection and privacy requirements are shaped by regulatory compliance. AI builders already have plenty of data regulations to comply with, including GDPR in Europe and CCPA/CPRA in California, as well as subject-specific regulations like HIPAA and PCI-DSS. New, AI-focused regulations are likely to emerge that may clarify how pre-existing privacy regulations apply in the context of AI (the proposed EU AI Act, for example, largely defers to GDPR on data protection and privacy), but for now we’ll focus on known regulations. Without getting into the weeds, privacy regulations require organizations to do things like:

  • Obtain consent from users to collect and process personal data
  • Ensure that some regulated data types are only stored and processed in the country/region where they were generated (data residency)
  • Prevent unauthorized access to or disclosure of regulated data
  • Document how and why regulated data is being processed
  • Process user requests to access or delete personal data 

Complying with all these regulations is complex for any organization handling regulated data. Here we’ll focus on some of the challenges unique to AI. Organizations developing and training AI models face additional complexity for several reasons:

  • Getting the data to train models can be a challenge if regulations govern where or how the data is stored, and who can have access. Even when regulations don’t prohibit it, security teams may be reluctant to give data science teams access to production data.
  • Models can be vulnerable to attacks that allow hackers to extract training data, or may inadvertently leak training data if poorly designed and not properly tested. Security teams may be reluctant to use production data in model training for these reasons.
  • AI models have a reputation for being a “black box.” There’s often a tradeoff between model performance and explainability that can make it hard to document how data is being processed.
  • Deleting user data records may not be enough to satisfy the “right to delete” if models were trained on datasets including those records. Organizations may be required to retrain models from scratch or apply other methods to ensure that the models have “forgotten” individual data. 

Considering these challenges, let’s look at some of the ways AI builders can reduce the risk of data protection and privacy violations. 

How can AI builders reduce the risk of data protection and privacy violations?

Synthetic or anonymized training data

Training data is at the root of many of the challenges above, both for builders training a model from scratch and for those fine-tuning a model with private data. Using real data to train and fine-tune models is ideal from a performance perspective, but it can be risky from a data protection and privacy standpoint if the dataset includes regulated data. There are a number of options that allow organizations to reduce risk by modifying training data or by using training methods that preserve privacy. Different use cases will require different combinations of approaches, depending on the sensitivity of the data and how the model will be used.

Builders can use tools like Gretel, Mostly, or Tonic to create synthetic training data that preserves the characteristics of the original dataset. Synthetic data is then used in place of real data for model training, practically eliminating the risk that real user data could be exposed. Since properly generated synthetic data generally isn’t considered personal data under GDPR and CPRA, privacy and compliance teams don’t need to worry about giving data science access or explaining how the data is processed. It’s not a perfect solution, though. While research suggests that models trained on synthetic data can be just as performant as those trained on real data, some builders may want to verify that by testing against a model trained on the real data before deploying to production.
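
To make the idea concrete (and not as a substitute for the dedicated tools above), here is a minimal sketch of naive tabular synthesis: it fits simple per-column distributions to a real dataset and samples new rows from them. Real synthetic-data tools model joint distributions and correlations far more faithfully; the function name and the example file path are hypothetical.

```python
import numpy as np
import pandas as pd

def naive_synthetic(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Sample a synthetic table from simple per-column distributions.

    Numeric columns are modeled as Gaussians; categorical columns are sampled
    from their observed frequencies. This ignores cross-column correlations,
    which dedicated synthetic-data tools are designed to preserve.
    """
    rng = np.random.default_rng(seed)
    synthetic = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            synthetic[col] = rng.normal(df[col].mean(), df[col].std(ddof=0), n_rows)
        else:
            values, counts = np.unique(df[col].astype(str), return_counts=True)
            synthetic[col] = rng.choice(values, size=n_rows, p=counts / counts.sum())
    return pd.DataFrame(synthetic)

# Hypothetical usage: generate 1,000 synthetic rows from a real dataset.
# real_df = pd.read_csv("customers.csv")
# synth_df = naive_synthetic(real_df, n_rows=1000)
```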

Instead of using fully synthetic datasets, organizations can also use various data anonymization techniques to de-identify production data. At its most basic, this means redacting PII from datasets. If PII can’t be fully removed, it can be masked: replaced with a token or with synthetic data (e.g., replacing my real name “Allison” with the pseudonym “Laura”).
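
As a hedged illustration of basic masking, the sketch below redacts emails and SSNs with regular expressions and replaces known names with consistent pseudonyms. The patterns, name list, and function name are all hypothetical; production de-identification typically relies on trained PII detectors rather than hand-written rules.

```python
import hashlib
import re

# Hypothetical patterns and pseudonym pool for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PSEUDONYMS = ["Laura", "Sam", "Priya", "Diego"]

def mask_pii(text: str, known_names: list[str]) -> str:
    """Redact emails/SSNs and replace known names with consistent pseudonyms."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = SSN_RE.sub("[SSN]", text)
    for name in known_names:
        # Hash the name so the same person always maps to the same pseudonym.
        idx = int(hashlib.sha256(name.encode()).hexdigest(), 16) % len(PSEUDONYMS)
        text = re.sub(rf"\b{re.escape(name)}\b", PSEUDONYMS[idx], text)
    return text

print(mask_pii("Allison (allison@unusual.vc) filed ticket 42.", ["Allison"]))
# Email is redacted; "Allison" becomes whichever pseudonym the hash selects.
```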

If builders do choose to train models with production datasets that include personal information, it’s important to have user consent, typically by ensuring that a privacy policy is in place that specifies how the data will be processed. It’s also important to have good hygiene around training datasets and model lineage, so that you know where and how personal data was processed and can comply with data deletion requests.
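
One lightweight way to build that hygiene, sketched below under assumed names and an assumed schema, is a training manifest that records which (hashed) record IDs fed each model version, so a deletion request can be traced to the models it affects.

```python
import hashlib
import json
from datetime import datetime, timezone

def record_training_manifest(model_name: str, record_ids: list[str], path: str) -> dict:
    """Log which records fed a training run so deletion requests can be traced.

    Storing hashed record IDs (rather than raw identifiers) keeps the manifest
    itself free of personal data while still letting you answer "was this
    user's data in this model?" when a deletion request arrives.
    """
    manifest = {
        "model": model_name,
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "record_id_hashes": sorted(
            hashlib.sha256(rid.encode()).hexdigest() for rid in record_ids
        ),
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

def was_record_used(record_id: str, manifest: dict) -> bool:
    """Check whether a given record contributed to this model version."""
    return hashlib.sha256(record_id.encode()).hexdigest() in manifest["record_id_hashes"]
```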

Privacy-preserving model training

It is possible to train models such that the privacy of individual records in the training data is preserved. Differential privacy provides a way to quantify the risk that a model might memorize and leak its training data. There are different methods for achieving differential privacy in model training, but the most common approaches inject calibrated noise during training (for example, into gradients) or into the model’s output. Training differentially private models makes it possible to use real data while minimizing the risk of a leak.
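
For illustration, here is a minimal DP-SGD-style training step for logistic regression in NumPy: each example’s gradient is clipped and Gaussian noise is added before the update. This is a sketch only; it omits the privacy accounting across steps that an actual guarantee requires, and production work would typically use a library such as Opacus or TensorFlow Privacy rather than hand-rolled code like this.

```python
import numpy as np

def dp_sgd_step(w, X_batch, y_batch, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """One illustrative DP-SGD-style step for logistic regression.

    Each example's gradient is clipped to a fixed L2 norm, then Gaussian noise
    scaled to that norm is added to the summed gradient, bounding how much any
    single record can influence the model update.
    """
    rng = rng or np.random.default_rng(0)
    preds = 1.0 / (1.0 + np.exp(-X_batch @ w))            # sigmoid predictions
    per_example_grads = (preds - y_batch)[:, None] * X_batch
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    grad = (clipped.sum(axis=0) + noise) / len(X_batch)
    return w - lr * grad
```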

For the most sensitive use cases, newer technologies allow for model training while maintaining the privacy of the underlying data. Organizations can leverage confidential computing infrastructure like secure enclaves to train models in an isolated hardware environment. This allows data science teams to train models without ever seeing the training data, keeping data confidential and protected throughout the training process. Secure enclaves can be difficult to set up and may not be appropriate for all use cases, but a number of startups are working to make this technology more accessible. 

Testing for vulnerabilities

Most teams test models for performance before deploying to production, but it’s equally important to stress test for security and privacy. Testing should ensure that models aren’t vulnerable to adversarial machine learning attacks, such as those enumerated in the MITRE ATLAS threat landscape for AI. Prompt injection has gotten a lot of attention recently with the rise of AI chatbots, as attackers were able to craft clever prompts that got Microsoft’s Bing AI to reveal its instructions and codename. Other attack techniques focus on getting the model to reveal its training data: membership inference attacks allow attackers to determine that a particular record was included in the training data, and model inversion attacks allow them to reconstruct training data from model output. Stress testing and penetration testing models before release can help organizations avoid vulnerabilities that would result in a data leak. It’s also key to monitor and test models in production to ensure continued resilience against known threat vectors.
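
As a rough example of what such a test can look like, the sketch below implements the simplest membership inference check: comparing per-example losses on training data versus held-out data. The function name and the threshold in the usage comment are hypothetical placeholders; dedicated red-teaming tools go considerably further than this.

```python
import numpy as np

def membership_inference_auc(member_losses, nonmember_losses):
    """Rough membership-inference check via per-example loss separation.

    If a model systematically assigns lower loss to its training examples than
    to held-out examples, an attacker can exploit that gap to guess membership.
    Returns an AUC-like score: ~0.5 means losses are indistinguishable (good);
    values near 1.0 suggest the model is leaking membership information.
    """
    member_losses = np.asarray(member_losses)
    nonmember_losses = np.asarray(nonmember_losses)
    # Probability that a random member has lower loss than a random non-member.
    lower = (member_losses[:, None] < nonmember_losses[None, :]).mean()
    ties = (member_losses[:, None] == nonmember_losses[None, :]).mean()
    return lower + 0.5 * ties

# Hypothetical usage with losses computed on training vs. held-out data:
# score = membership_inference_auc(train_losses, test_losses)
# assert score < 0.6, "model may be vulnerable to membership inference"
```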

Filtering model inputs and outputs

In production, developers may wish to filter model inputs and outputs as an added layer of protection. From a privacy standpoint, some organizations may wish to redact PII from prompts before they reach the model, to avoid storing or processing PII. They may also wish to check model output for PII or other sensitive information to avoid an accidental leak. As filtering becomes more sophisticated, it will be possible to detect prompt injection and other attacks before they reach the model, protecting models from adversarial activity in production.
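
A minimal sketch of that pattern, with hypothetical regexes and function names, wraps every model call in an input and output filter; `call_llm` stands in for whatever client your stack already provides.

```python
import re
from typing import Callable

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Deliberately simple injection heuristic for illustration; real filters use
# classifiers and allow/deny policies rather than a single regex.
INJECTION_RE = re.compile(r"ignore (all |previous )?instructions", re.IGNORECASE)

def guarded_completion(prompt: str, model_call: Callable[[str], str]) -> str:
    """Wrap a model call with basic input and output filtering."""
    if INJECTION_RE.search(prompt):
        return "Request blocked: possible prompt injection."
    sanitized = EMAIL_RE.sub("[EMAIL]", prompt)   # strip PII before it reaches the model
    output = model_call(sanitized)
    return EMAIL_RE.sub("[EMAIL]", output)        # scrub PII from the response too

# Hypothetical usage:
# reply = guarded_completion("Summarize the note from jane@acme.com", call_llm)
```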

Conclusion

I’ve reviewed some of the unique challenges AI builders face when it comes to data protection and privacy, and addressed some safeguards that can be applied. Currently organizations face tradeoffs between privacy and performance and between privacy and ease of implementation. As AI regulations mature and as privacy-preserving technologies continue to improve, I expect that these tradeoffs will lessen, making privacy-preserving AI easier to achieve. For now, it’s important for AI builders to work with security, privacy, and compliance teams to determine the right approach based on their unique use cases and risk profile.

If you’re a founder building at the intersection of AI and privacy, I would love to hear from you. I’d also love to hear from anyone grappling with these challenges in your organizations today. Get in touch on LinkedIn or at allison@unusual.vc.
