I’ve Been Thinking About How We’re Getting AI Privacy Wrong
Most teams redact. Almost nobody anonymizes. Here’s why that distinction keeps me up at night.
Let me tell you something that happened on a call last month.
A senior engineer at a mid-size fintech company — smart person, clearly cares about doing things right — walked me through their LLM pipeline. They’d built something genuinely impressive. The model was sharp, the prompts were well-crafted, the outputs were good. And then he said, almost as an aside:
“Oh, and before anything goes to the model, someone on the team does a quick scan and removes anything that looks like PII.”
I asked how many records they process per day.
About four thousand, he said.
I didn’t say anything immediately. But I was thinking: that’s not a privacy control. That’s a prayer.
This isn’t a knock on that engineer. Honestly, most teams I talk to are doing the same thing. Redacting manually, assuming it’s good enough, moving on. And I get it — there’s always something more urgent to build. But the gap between what teams think they’re doing and what’s actually happening with their data is wider than most people realize.
So I want to spend this issue actually explaining the difference. Not in a compliance-handbook way. In a “let me tell you what’s really going on” way.
Redaction and anonymization are not the same thing
I know that feels obvious when I say it out loud. But I’ve watched these two words get used interchangeably by engineering teams, compliance leads, and even some legal teams, so let’s be precise.
Redaction is the act of removing or hiding a sensitive value. You take “John Smith, DOB 12/03/1985, Account #78239” and you turn it into “[NAME], DOB [DATE], Account #[ID].” The value is gone. The slot it occupied is still there.
Anonymization is different. It doesn’t leave a slot. It fills the slot with something else — a realistic, synthetic substitute that carries the same semantic weight without carrying any real identity. “John Smith, DOB 12/03/1985” becomes “Alex T., DOB 04/17/1976.”
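To make the difference concrete, here's a toy sketch of both transformations in Python. The patterns and the substitute values are illustrative only, not what a production detector would use:

```python
import re

RECORD = "John Smith, DOB 12/03/1985, Account #78239"

def redact(text):
    # Redaction: remove the value, leave the slot (illustrative patterns only).
    text = re.sub(r"[A-Z][a-z]+ [A-Z][a-z]+", "[NAME]", text)
    text = re.sub(r"\d{2}/\d{2}/\d{4}", "[DATE]", text)
    text = re.sub(r"#\d+", "#[ID]", text)
    return text

def anonymize(text):
    # Anonymization: fill the slot with a realistic synthetic substitute.
    # Hardcoded here for illustration; a real pipeline generates these.
    swaps = {"John Smith": "Alex T.", "12/03/1985": "04/17/1976", "78239": "41207"}
    for real, fake in swaps.items():
        text = text.replace(real, fake)
    return text
```

Run both on the same record and the difference is obvious: `redact` leaves a text full of visible holes, `anonymize` leaves a text that reads exactly like the original.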
To the LLM, both versions are equally useful. The model doesn’t care whether the name is real. It cares about the structure, the relationships, the context. Both versions give it that. But only one of them actually protects the individual whose data you’re processing.
This is the part that I think gets missed in most conversations about AI privacy. People focus on whether the data is “removed” when they should be asking whether it’s been “transformed.” Removal creates gaps. Transformation preserves meaning while eliminating risk.
Three reasons redaction keeps failing in production
I’ve seen this pattern enough times to know it’s not a one-off. Redaction consistently breaks down in AI pipelines for the same reasons.
The structure never goes away.
When you blank out a field, the surrounding text still tells a story. The sentence structure, the field positions, the numerical ranges, the writing style — these all carry information that a determined actor (or a sufficiently capable model) can use to infer what was removed. You haven’t eliminated the signal. You’ve just made it slightly harder to read.
Human review doesn’t scale.
I have a lot of respect for the engineers who are manually reviewing prompts before they go to a model. It takes discipline to build that habit. But four thousand records a day, reviewed by a person? That’s not a system. That’s a bottleneck waiting to become an incident. One late night, one distracted afternoon, one field that slipped through — that’s all it takes.
Blank fields make your AI worse.
This one doesn’t get talked about enough. LLMs reason from context. When chunks of your input read [REDACTED], the model sees gaps, and those gaps genuinely hurt the quality of the output. So you’re not just creating a privacy risk with manual redaction — you’re also degrading the performance of the AI you spent so much time building. Anonymization solves both problems at once.
What a real anonymization pipeline looks like
A proper LLM anonymizer doesn’t just delete things. It intercepts your data before it reaches the model, runs it through a detection and replacement layer, and sends the model a clean version that preserves all the context it needs.
The pipeline, simplified:
Your Data → PII Detection → Synthetic Replacement → LLM → Response → De-anonymization (if needed)
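In code, that flow can be sketched as a single function. Everything here is a placeholder — `detect_pii`, `make_substitute`, and `call_llm` stand in for whatever detection models and LLM client you actually use:

```python
def anonymizing_pipeline(text, detect_pii, make_substitute, call_llm):
    """Sketch of the flow: data -> PII detection -> synthetic replacement
    -> LLM -> response -> de-anonymization (if needed)."""
    mapping = {}  # substitute -> original, kept only if you need to restore
    for original in detect_pii(text):
        substitute = make_substitute(original)
        mapping[substitute] = original
        text = text.replace(original, substitute)
    response = call_llm(text)  # the model only ever sees clean text
    # De-anonymization (if needed): map substitutes back in the response.
    for substitute, original in mapping.items():
        response = response.replace(substitute, original)
    return response
```

One thing worth noticing: if you keep that mapping around, the round trip is reversible, which is pseudonymization in GDPR terms, not anonymization. Discard the mapping (and skip the restore step) when irreversibility is the goal — more on that distinction below.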
That middle step — the detection and replacement — is where most of the engineering complexity lives. And it’s more complex than it looks.
A single NLP model isn’t sufficient. General Named Entity Recognition models are good at catching names and locations. They’re not great at email addresses, financial account numbers, or domain-specific identifiers. You need at least two specialized models running in parallel, with a merge algorithm that handles the cases where both models detect overlapping entities in the same span of text.
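The merge step sounds trivial until you hit overlapping spans. Here's one possible policy, sketched in Python under the assumption that each detector emits `(start, end, label)` character spans — a real merge algorithm would likely also weigh model confidence scores:

```python
def merge_detections(spans_a, spans_b):
    """Merge entity spans from two detectors. On overlap, keep the longer
    (more specific) span. Each span is (start, end, label)."""
    merged = []
    # Sort by start position; for equal starts, longer spans come first.
    for span in sorted(spans_a + spans_b, key=lambda s: (s[0], -(s[1] - s[0]))):
        if merged and span[0] < merged[-1][1]:  # overlaps the previous span
            prev = merged[-1]
            if span[1] - span[0] > prev[1] - prev[0]:
                merged[-1] = span  # the longer detection wins
        else:
            merged.append(span)
    return merged
```

"Longer span wins" is just one resolution strategy; the point is that you need *some* deterministic policy, or the two models will fight over the same text.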
Structured data — CSVs, Excel files, databases — is a completely different problem. You can’t pass a spreadsheet to an NLP model and expect it to behave. The absence of natural sentence structure confuses context-aware models. Real pipelines need to combine NLP inference with heuristic rules: column-header analysis, regex patterns for emails and phone numbers, and thresholding logic that wipes entire columns when the PII density is high enough that row-by-row processing is unnecessary.
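A minimal sketch of that column-level heuristic, assuming the CSV is already parsed into a header row and data rows. The header list and email regex are simplified stand-ins for the much larger rule sets a real pipeline would carry:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PII_HEADERS = {"name", "email", "phone", "ssn", "dob"}  # illustrative subset

def columns_to_wipe(header, rows, density_threshold=0.6):
    """Flag columns whose header looks sensitive, or where the fraction of
    cells matching a PII pattern crosses the threshold."""
    flagged = set()
    for i, col_name in enumerate(header):
        if col_name.strip().lower() in PII_HEADERS:
            flagged.add(i)  # header analysis: the name alone is enough
            continue
        cells = [row[i] for row in rows if i < len(row)]
        hits = sum(1 for c in cells if EMAIL_RE.search(c))
        if cells and hits / len(cells) >= density_threshold:
            flagged.add(i)  # density thresholding: wipe the whole column
    return flagged
```

The payoff of the threshold is speed: once a column clears it, you replace the entire column in one pass instead of running inference on every row.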
And then there’s file reconstruction. Detecting PII in a PDF is one thing. Putting the PDF back together — with the right values replaced, the formatting intact, and the file structurally valid — is a different engineering challenge entirely.
We went deep on all of this in a recent post on the Questa-AI Privacy Café. If you’re building any of this yourself, it’s worth reading: Under the Hood: Building a Privacy-First Anonymizer for LLMs. It covers the dual-model architecture, the merge logic, the CSV heuristics, the multithreading approach, and the full reconstruction pipeline.
The pseudonymization trap that nobody talks about
Quick aside before we go further, because this comes up constantly.
A lot of teams think they’re anonymizing when they’re actually pseudonymizing. These are not the same thing, and the legal difference is significant.
Pseudonymization replaces PII with a placeholder that can be reversed if you have the mapping key. The original data is still recoverable. Under GDPR, pseudonymized data is still classified as personal data and is still fully regulated.
Anonymization produces data that cannot be reversed under any circumstances — even with additional information. Truly anonymized data falls entirely outside the scope of GDPR. You gain real regulatory flexibility, not just the appearance of it.
If your team is using the word “anonymization” to describe a process that could theoretically be reversed with a lookup table, your compliance posture is weaker than you think. Worth having that conversation with your legal team before you need to have it with a regulator.
Who needs this most urgently right now
Honestly? Any team routing real customer or patient data through an LLM pipeline. But some industries are carrying more risk than others.
Healthcare teams are in the most acute position. HIPAA requires strict de-identification for any patient data used outside direct care. There is no compliant version of “we manually reviewed it before sending.”
Financial services teams are dealing with GLBA, SOX, and PCI-DSS simultaneously. Customer transaction data, account details, and financial records flow through AI tools constantly, and the regulatory exposure is substantial.
Legal teams face a combination of regulatory and privilege risk. LLMs processing legal documents without an anonymization layer are exposing both client PII and potentially privileged communications.
BPOs have a multiplier problem. A breach in a BPO context doesn’t just expose one company’s customer base — it can expose every client organization simultaneously.
HR teams are routinely processing some of the most sensitive data in any organization — salary information, performance reviews, medical accommodations, candidate assessments — through AI tools that were not built with that sensitivity in mind.
A few things I get asked about this all the time
Can’t I just prompt the LLM to anonymize the data itself?
You can, and for small-scale, low-stakes use cases, it’s a reasonable starting point. But ad-hoc prompting is inconsistent. It fails on edge cases in ways that are hard to predict. It produces no audit trail. And it gives you no compliance documentation. If you’re processing real customer data at any meaningful scale, you need a dedicated layer, not a prompt instruction.
Won’t anonymized data produce worse model outputs?
No — and this surprises people. The model doesn’t care whether the name is “John Smith” or “Alex T.” It cares about the semantic structure of the input. Synthetic substitutes preserve that structure completely. I’ve never seen a case where properly anonymized data produced meaningfully different output quality compared to the original.
How hard is it to add this to an existing pipeline?
Less hard than you’d expect, if it’s designed to be modular. The anonymization layer sits between your data source and your LLM call. Your prompt templates, your model configuration, your response handling — none of that changes. You’re adding one layer in, and everything downstream becomes safe by default. The hard part is building the detection and replacement logic correctly, which is why purpose-built solutions tend to be worth it.
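The "one layer in" idea can be sketched as a wrapper around whatever LLM call you already have. The function names here are mine, not any particular library's:

```python
def with_anonymization(llm_call, anonymize, deanonymize=None):
    """Wrap an existing LLM call so every prompt is anonymized on the way in.

    anonymize(prompt) -> (clean_prompt, context)
    deanonymize(response, context) -> restored response (optional)
    """
    def wrapped(prompt, **kwargs):
        clean_prompt, context = anonymize(prompt)
        response = llm_call(clean_prompt, **kwargs)
        return deanonymize(response, context) if deanonymize else response
    return wrapped
```

Your prompt templates and response handling call `wrapped` exactly the way they called `llm_call` before — which is the sense in which everything downstream becomes safe by default.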
What I’d actually do if I were starting from scratch today
If I were building an LLM pipeline from the ground up and had to think about this from day one, here’s what I’d prioritize:
• Map the data before you build the prompts. Know exactly what personal data is touching your pipeline and where. Most teams discover this later than they should.
• Treat the anonymization layer as infrastructure, not a feature. It should be in the architecture from the start, not added after a compliance review flags it.
• Don’t confuse pseudonymization with anonymization. Know which one you’re doing and be honest about the regulatory implications of each.
• Test your detection layer adversarially. PII detection models have blind spots. Test with unusual formats, non-English names, domain-specific identifiers, and data structures the model wasn’t trained on.
• Document everything. If you ever need to demonstrate compliance, the documentation of your anonymization approach is what will matter. An automated layer makes this straightforward. A manual process makes it nearly impossible.
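The adversarial-testing bullet above is the easiest one to start on today. A minimal harness might look like this — `detect_pii` is assumed to be your detector, returning the surface strings it found, and the cases are just examples of the categories worth covering:

```python
# (text, PII strings the detector should catch)
ADVERSARIAL_CASES = [
    ("Nguyễn Thị Minh Khai called at 5pm", {"Nguyễn Thị Minh Khai"}),   # non-English name
    ("MRN 00-48271-B admitted Tuesday", {"00-48271-B"}),                 # domain-specific ID
    ("reach me at jane.doe(at)example.com", {"jane.doe(at)example.com"}),  # obfuscated email
]

def run_adversarial_suite(detect_pii):
    """Return the cases where the detector missed expected PII."""
    failures = []
    for text, expected in ADVERSARIAL_CASES:
        missed = expected - set(detect_pii(text))
        if missed:
            failures.append((text, missed))
    return failures
```

Grow the case list from your own data — every format your detector has ever missed in review belongs in it permanently.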
The short version
Redaction is what you do when you’re moving fast and hoping for the best. Anonymization is what you build when you understand the actual risk.
The teams that are getting this right aren’t treating it as a compliance checkbox. They’re treating it as engineering infrastructure — something that gets built in at the foundation, runs automatically, and gets audited the same way any other security control gets audited.
The teams that are getting it wrong are one log file away from a very uncomfortable conversation.
Build the anonymization layer in. Make it default. Don’t make it someone’s job to remember.
That’s it from me this week. Reply and let me know how your team is handling this — I read everything that comes in, and the best conversations I have come from these replies.
— Questa-AI Engineering Team
Where to go from here
If this issue was useful, here are the three places to go deeper:
The practitioner discussion
This same topic generated a good conversation on LinkedIn. Worth reading the comments if you want to see how different teams are approaching it in practice: Redaction vs Anonymization for AI Prompts — Questa-AI on LinkedIn →
The broader context on Medium
For a more structured breakdown of the same topic — including the compliance implications and industry-specific risks — read the Medium piece: Stop Redacting. Start Anonymizing. →