AI - Beyond the Hype

AI Security Part 1: Why AI Without Data Security Is a Breach Waiting to Happen

Season 1 Episode 3

Sarah and Jane open the three-part Data Security for AI series with a simple argument: AI is only as trustworthy as the data underneath it.

What we cover

The adoption gap: Gartner expects 40% of enterprise apps to embed AI agents by end‑2026 (up from <5%). IBM’s 2025 Cost of a Data Breach Report found 13% of organisations have already had an AI-related breach; of those, 97% lacked proper AI access controls.

Structured vs unstructured data: IDC estimates 80–90% of enterprise data is unstructured. Varonis found only 1 in 10 organisations have labelled files, and 88% still have “ghost” accounts. Point a copilot at that estate and every overshared file is exposed.

The incident catalogue: Samsung engineers pasting source code into ChatGPT (2023). Microsoft’s AI team exposing 38 TB via a misconfigured Azure SAS token (2023). DeepSeek’s ClickHouse leak exposing chat histories and API keys (2025).

Liability is real: Moffatt v. Air Canada (2024), where the airline argued its chatbot was a separate legal entity — and lost. NYC’s MyCity chatbot telling businesses to break the law (2024).

Shadow AI: IBM found shadow-AI breaches cost US$670K more and make up 20% of incidents.

Memorisation: Carlini et al. (ICLR 2023) showed memorisation scales with model size, example duplication, and the amount of prompt context — sensitive training data should be treated as eventually leakable.

Sources

Gartner 40% forecast: https://finance.yahoo.com/news/40-enterprise-apps-embed-ai-181310288.html

IBM 2025 Cost of a Data Breach: https://www.ibm.com/reports/data-breach

IBM analysis (97%, US$670K): https://www.kiteworks.com/cybersecurity-risk-management/ibm-2025-data-breach-report-ai-risks/

IDC unstructured data: https://blog.box.com/90-percent-unstructured-data

Varonis 2025 State of Data Security: https://www.varonis.com/blog/state-of-data-security-report

Samsung ChatGPT leak: https://www.pcmag.com/news/samsung-software-engineers-busted-for-pasting-proprietary-code-into-chatgpt

Microsoft 38 TB exposure: https://www.wiz.io/blog/38-terabytes-of-private-data-accidentally-exposed-by-microsoft-ai-researchers

DeepSeek ClickHouse exposure: https://www.wiz.io/blog/wiz-research-uncovers-exposed-deepseek-database-leak

Moffatt v. Air Canada (Forbes): https://www.forbes.com/sites/marisagarcia/2024/02/19/what-air-canada-lost-in-remarkable-lying-ai-chatbot-case/

NYC MyCity (The Markup): https://themarkup.org/artificial-intelligence/2024/04/02/malfunctioning-nyc-ai-chatbot-still-active-despite-widespread-evidence-its-encouraging-illegal-behavior

Cisco 2024 Privacy Benchmark: https://www.cisco.com/c/dam/en_us/about/doing_business/trust-center/docs/cisco-privacy-benchmark-study-2024.pdf

Carlini et al., ICLR 2023: https://arxiv.org/abs/2202.07646

SPEAKER_01

Welcome back to AI Beyond the Hype. I'm Jane.

SPEAKER_00

And I'm Sarah. Good to be back.

SPEAKER_01

Sarah.

SPEAKER_00

Hmm?

SPEAKER_01

Before we hit record, you handed me a stack of research and said, and I quote, you're going to want to read this before we start.

SPEAKER_00

I did say that.

SPEAKER_01

Which is never an entirely reassuring sentence from a data engineer.

SPEAKER_00

No, it usually means I'm about to ruin someone's quarter.

SPEAKER_01

Right. So today, and actually the next two episodes as well, because there's too much for one show, we're talking about data security for AI. The plumbing, the access controls, the boring bits underneath the shiny copilot demo.

SPEAKER_00

The substrate. That's the word the report I read kept using. Data security is the substrate AI is built on.

SPEAKER_01

Okay. I'll be honest with the audience. When you sent me this, my first reaction was a little bit of an eye roll.

SPEAKER_00

Go on.

SPEAKER_01

Not because security doesn't matter, I'm a CIO type. Of course it matters. But every time we get into one of these conversations, there's a tendency for technical folks to go full doom and gloom. Everything's on fire! The agents are coming for your database. Lock the doors. And meanwhile, the business is trying to actually ship something useful.

SPEAKER_00

Yeah.

SPEAKER_01

So part of me wonders, is this a bit alarmist? Are we over-engineering the solution to a problem that in practice mostly works out?

SPEAKER_00

That's a fair pushback. And I want to take it seriously, because I don't think the answer is panic. I really don't. But I do think the answer is we've got the order of operations wrong in a lot of organizations.

SPEAKER_01

Order of operations. Explain.

SPEAKER_00

Most enterprises right now are buying AI faster than they're securing the data underneath it. Gartner's predicting that by the end of this year, 40% of enterprise applications will have an AI agent embedded in them, up from less than 5% earlier in the year.

SPEAKER_01

That's a huge jump.

SPEAKER_00

Massive. And in the same window, only about 6% of organizations report having an advanced AI security strategy.

SPEAKER_01

6%.

SPEAKER_00

6. So we're racing past our own maturity curve. And IBM's Cost of a Data Breach Report for 2025 found that 13% of organizations have already had a breach involving an AI model or application. Of those, 97% lacked proper AI access controls.

SPEAKER_01

Okay. That 97% figure is the one that should make leaders pause. That's not could happen, that's did happen, and the controls weren't there.

SPEAKER_00

Right. So this isn't the technical community being dramatic. This is the data telling us the gap is real.

SPEAKER_01

Alright, I'll put my eye roll in the bin for now.

SPEAKER_00

Appreciated.

SPEAKER_01

So let's set the table for the executive listening to this. What's actually different about securing AI compared to, say, securing a normal application?

SPEAKER_00

Okay, this is the bit I get weirdly excited about. A traditional application is predictable. It accesses a defined set of tables behind a defined API. You can draw the data flow on a whiteboard. Done.

SPEAKER_01

Right, the architects love a whiteboard.

SPEAKER_00

We do. But an enterprise AI assistant sits on top of nearly the entire data estate. A single copilot deployment might touch a CRM, a ticketing system, Confluence, SharePoint, Slack, source code, email, transcripts, all before producing a single answer.

SPEAKER_01

So the surface area is just bigger.

SPEAKER_00

Massively bigger. And here's the part that really matters. Cisco's data privacy benchmark. 92% of organizations agree that generative AI is, quote, a fundamentally different technology with novel challenges.

SPEAKER_01

Okay, but 92% of organizations agree on lots of things in surveys. What's the practical version of that?

SPEAKER_00

Practical version. Same study found 48% of users admitted entering non-public company information into Gen AI tools. And 62% entered internal process information.

SPEAKER_01

Into public tools.

SPEAKER_00

Into public tools. And then Gartner found that 57% of employees use personal Gen AI accounts for work. And a third of those have uploaded sensitive information to unsanctioned ones.

SPEAKER_01

Okay, that's where this gets real for a leader. Because it means whatever your official AI policy says, the reality on the ground is that your data is already walking out the door in copy-paste form.

SPEAKER_00

Exactly. And IBM put a number on that too. One in five organizations had a breach from shadow AI, unsanctioned tools used outside IT's visibility. Those breaches carried a $670,000 premium over standard incidents.

SPEAKER_01

$670,000 just for the shadow AI flavour.

SPEAKER_00

Just for the shadow flavour.

SPEAKER_01

Alright, so the surface is bigger and the human behavior is already running ahead of policy. What's the technical thing executives most often miss?

SPEAKER_00

I think it's the difference between structured and unstructured data. And I know that sounds geeky, but it really matters.

SPEAKER_01

Walk us through it. Slowly. Imagine I'm me on a Monday.

SPEAKER_00

Okay.

Structured data is the stuff that lives in databases and warehouses. Tables, rows, columns. Customer records in a CRM, transactions in Snowflake, that kind of thing. The schema tells you what each field is. This column is salary. This column is email. So you can write a policy that says, mask the salary column for these users. It's tractable.
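To make that "mask the salary column for these users" idea concrete, here's a minimal sketch of role-based column masking applied to query results. The columns, roles, and masking rules are illustrative, not any particular warehouse's policy syntax.

```python
# Minimal sketch of role-based column masking for structured data.
# The schema, roles, and masking rules here are illustrative only.

SENSITIVE_COLUMNS = {
    "salary": {"allowed_roles": {"hr", "payroll"}, "mask": lambda v: "***"},
    "email": {"allowed_roles": {"hr", "support"},
              "mask": lambda v: v.split("@")[0][:2] + "***@" + v.split("@")[1]},
}

def apply_masking(row: dict, role: str) -> dict:
    """Return a copy of `row` with sensitive columns masked for `role`."""
    masked = dict(row)
    for column, policy in SENSITIVE_COLUMNS.items():
        if column in masked and role not in policy["allowed_roles"]:
            masked[column] = policy["mask"](masked[column])
    return masked

row = {"name": "A. Jones", "salary": 85000, "email": "a.jones@example.com"}
print(apply_masking(row, role="analyst"))   # salary and email masked
print(apply_masking(row, role="payroll"))   # salary visible, email masked
```

The point is that the schema makes the rule cheap to write: the policy attaches to a named column, so enforcement is one lookup per field.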

SPEAKER_01

Tidy. Boxes and labels.

SPEAKER_00

Tidy. Now unstructured data, that's everything else. Word documents, PDFs, emails, chat threads in Slack and Teams, JIRA tickets, audio transcripts, source code, images.

SPEAKER_01

And how much of the enterprise estate is that?

SPEAKER_00

IDC estimates 80 to 90% of all enterprise data is unstructured.

SPEAKER_01

80 to 90.

SPEAKER_00

And, this is the bit, only one in ten organizations has labelled their files.

SPEAKER_01

At all. So 9 out of 10 organizations don't actually know what's in the bulk of their data.

SPEAKER_00

Right. Structured data tells you what it is. Unstructured data only tells you what it says. There's a lovely framing in the research. Imagine a single PowerPoint deck in a SharePoint folder. It might contain a customer name, the five-year strategy, an unannounced product roadmap, a database password an engineer pasted in by accident, and a contract excerpt. Five different sensitivities.
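That mixed-sensitivity deck is why classification has to read content, not just file type. Here's a toy sketch of a first-pass scanner; the patterns are illustrative, and real classifiers layer ML models and validation on top of anything regex can do.

```python
import re

# Toy first-pass scanner for mixed-sensitivity documents.
# Patterns are illustrative; production classifiers combine many more
# signals (ML models, proximity rules, validated checksums, context).
PATTERNS = {
    "credential": re.compile(r"(?i)(password|api[_-]?key)\s*[:=]\s*\S+"),
    "email_address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "payment_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def classify(text: str) -> set[str]:
    """Return the set of sensitivity labels whose patterns match `text`."""
    return {label for label, rx in PATTERNS.items() if rx.search(text)}

slide_notes = """
Q3 strategy draft. Contact: a.jones@example.com
TODO remove before sharing: db password: hunter2
"""
print(classify(slide_notes))  # {'credential', 'email_address'} (order may vary)
```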

SPEAKER_01

And that's a real example, not a hypothetical.

SPEAKER_00

That's the average organization. And here's where it bites with AI. Generative tools ingest content based on access permissions, not sensitivity. So if you've got an oversharing problem, and you almost certainly do, an AI assistant will surface it at a scale and a salience nothing else has.
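Oversharing audits start simpler than people expect. As a toy POSIX analogue of "shared with Everyone", here's a sketch that walks a file tree and flags anything world-readable; the root path is a placeholder.

```python
import os
import stat

def find_world_readable(root: str):
    """Yield files under `root` whose POSIX mode grants read to 'other'."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.stat(path).st_mode
            except OSError:
                continue  # broken symlink, permission denied, etc.
            if mode & stat.S_IROTH:
                yield path

# Placeholder root; point this at a real share to get a first inventory.
for path in find_world_readable("/srv/shared"):
    print("world-readable:", path)
```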

SPEAKER_01

Hmm. And does the research quantify that oversharing?

SPEAKER_00

It does. The average organization has 802,000 files at risk of oversharing. 16% of business critical data is overshared. And 90% of business critical documents are shared outside the C-suite. 90%. And here's the leadership tell. A Gartner survey of 132 IT leaders found that data oversharing prompted 40% of them to delay their Microsoft 365 Copilot rollout by three months or more.

SPEAKER_01

Right, that's the sentence I want every executive listening to hear. Because that is not a security team being precious. That is 40% of your peer group hitting the brakes on a tool they paid for because they looked at their own file shares and panicked.

SPEAKER_00

Yeah, and I think that's the real reframe. The security work isn't slowing AI down. The lack of security work is what's slowing AI down.

SPEAKER_01

Polished on the surface, shaky underneath. Mm-hmm. Okay, let's make this concrete. You promised me real incidents. I want real incidents.

SPEAKER_00

Oh, I have incidents. Where do you want to start? Data going out or data coming back to bite you?

SPEAKER_01

Let's go data going out first. That feels like the gentle one.

SPEAKER_00

Sure. Samsung. Early 2023.

SPEAKER_01

I remember this one vaguely.

SPEAKER_00

Samsung semiconductor lifted an internal ban on ChatGPT. Within three weeks, engineers in the device solutions division had three separate leakage incidents. One pasted proprietary semiconductor diagnostics source code in to get a bug fix. One submitted code for optimization. And one uploaded the transcript of a confidential business meeting to be summarized.

SPEAKER_01

So none of them were trying to leak anything.

SPEAKER_00

They were chasing productivity. There was no DLP, no guardrails on the externally hosted service, and no awareness that, at that time, OpenAI's policy retained submissions for training.
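For a sense of what even a basic guardrail looks like, here's a minimal sketch of a pre-send check that refuses to forward a prompt containing obvious secrets to an external AI service. The patterns are illustrative; real DLP goes far beyond regex.

```python
import re

# Minimal egress guardrail: refuse to forward a prompt to an external
# AI service if it contains an obvious secret. Patterns are illustrative.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"]?\w{16,}"),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
]

class PromptBlocked(Exception):
    """Raised when a prompt looks like it contains a secret."""

def check_prompt(prompt: str) -> str:
    """Return the prompt unchanged, or raise PromptBlocked."""
    for pattern in SECRET_PATTERNS:
        if pattern.search(prompt):
            raise PromptBlocked("prompt matches a secret pattern; not sent")
    return prompt

check_prompt("Summarise this meeting for me...")  # passes through
try:
    check_prompt("debug this: API_KEY = 'sk_live_abcdefgh12345678'")
except PromptBlocked as err:
    print(err)  # prompt matches a secret pattern; not sent
```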

SPEAKER_01

So that source code is now presumably part of an external training corpus.

SPEAKER_00

Presumed, yes. And Samsung's response was to restrict ChatGPT uploads to 1,024 bytes per user, open an investigation, and rush an internal AI assistant.

SPEAKER_01

Right. Closing the gate after the horse has bolted, written out a few suggestions, and taught the whole farm.

SPEAKER_00

Pretty much. And that's the canonical example of why Shadow AI is a data security problem first and a policy problem second. You can write all the policies you like. If your engineers have a deadline and a public chatbot, the policy is coming second.

SPEAKER_01

And for leaders, the lesson there is sanctioned alternatives matter as much as bans. If you don't give people a safe place to do this work, they'll do it in an unsafe place.

SPEAKER_00

Exactly. There's a line in the research I love.

SPEAKER_01

That's good. I'm stealing that. Okay, that's data going out. What about data coming back in?

SPEAKER_00

DeepSeek. January 2025.

SPEAKER_01

The Chinese AI startup.

SPEAKER_00

Right. Wiz research found that DeepSeek had left a ClickHouse database completely open on the public internet, unauthenticated, on non-standard ports. Anyone could execute arbitrary SQL via a browser. The log table contained over a million entries: plain-text chat history, API keys, back-end service names, directory structures. And the ClickHouse function set could potentially have retrieved plain-text passwords and local files. So all the conversations users were having with this AI sat in a database any passerby with a browser could read.
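The defensive flip side of that finding is easy to self-test. Here's a hedged sketch of an audit probe against your own ClickHouse endpoint: an unauthenticated SELECT should be rejected. The host name is a placeholder.

```python
import requests  # pip install requests

# Audit sketch: confirm your own ClickHouse HTTP interface refuses
# unauthenticated queries. Host below is a placeholder for your estate;
# 8123 is ClickHouse's default HTTP port (9000 is the native protocol).
HOST = "clickhouse.internal.example.com"
URL = f"http://{HOST}:8123/"

try:
    resp = requests.get(URL, params={"query": "SELECT 1"}, timeout=5)
except requests.RequestException as exc:
    print(f"unreachable ({exc.__class__.__name__}): firewalled or down")
else:
    if resp.ok and resp.text.strip() == "1":
        print("OPEN: unauthenticated SQL succeeds, fix this now")
    else:
        print(f"rejected unauthenticated query (HTTP {resp.status_code})")
```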

SPEAKER_01

And presumably this caused...

SPEAKER_00

The pairing the research draws, and I think it's a really useful one, is Samsung shows what unstructured data looks like leaving the building. DeepSeek shows what structured data looks like coming back in.

SPEAKER_01

Right, because the moment you start thinking of an AI provider as a data custodian, which is what they are, you're back to first principles infrastructure security, authentication, network controls, database hardening.

SPEAKER_00

Yes. And here's the thing. DeepSeek wasn't a sophisticated attack. There was no zero-day, no clever prompt. Someone forgot to put a password on a database.

SPEAKER_01

One of the most talked-about AI launches in years, undone by the IT equivalent of leaving the front door wide open.

SPEAKER_00

Pretty much.

SPEAKER_01

Okay, tell me about the Microsoft one, because I remember that being eye watering.

SPEAKER_00

Eye watering is right. June 2023. Microsoft's own AI research team committed an Azure storage token to a public GitHub repository for an image recognition project.

SPEAKER_01

A token.

SPEAKER_00

A token. The token was meant to grant access to one specific dataset. It actually granted access to the entire Azure storage account it lived in. And it was misconfigured to allow not just read but full control. Read, write, delete. What was in the storage account? 38 terabytes: disk backups of two former employees' workstations, secrets, private keys, passwords to Microsoft services, and over 30,000 internal Microsoft Teams messages from 359 employees.

SPEAKER_01

38 terabytes. From Microsoft, the company that sells security.

SPEAKER_00

And the token had been sitting there since 2020. Three years.

SPEAKER_01

How was it found?

SPEAKER_00

Wiz research. Just routine scanning of public repos.

SPEAKER_01

Hmm. And the bit that really gets me, write permission, you said.

SPEAKER_00

That's the kicker. Read access is bad. But write access on an AI training pipeline means an attacker could have substituted poisoned model weights. Every researcher who downloaded those weights afterwards would have inherited the poison.
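The standard mitigation for that poisoned-weights risk is integrity pinning: record a digest for the artifact you publish and verify it before loading. A minimal sketch, with the file name and pinned digest as placeholders.

```python
import hashlib

# Integrity pin for downloaded model weights. The file name and the
# expected digest are placeholders; record the real digest at publish time.
EXPECTED_SHA256 = "<pinned sha256 recorded when the weights were published>"

def sha256_of(path: str) -> str:
    """Stream the file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

actual = sha256_of("model_weights.bin")
if actual != EXPECTED_SHA256:
    raise RuntimeError(f"weights digest mismatch: {actual}; refusing to load")
```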

SPEAKER_01

That's the dangerous bit. That's not a leak. That's a supply chain compromise waiting to fire.

SPEAKER_00

Exactly. And the lesson, Wiz's own conclusion, was that those tokens should be treated as sensitive as the account key itself. They're created client side. They're not centrally tracked. So you don't even know how many you've got.
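For contrast, here's a hedged sketch of what a narrowly scoped token looks like with the azure-storage-blob SDK: one blob, read-only, one-hour expiry, instead of full control over an entire account. Account, container, blob, and key values are placeholders.

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import (  # pip install azure-storage-blob
    BlobSasPermissions,
    generate_blob_sas,
)

# Mint a narrowly scoped SAS token: one blob, read-only, one-hour expiry.
# Account, container, blob, and key values are placeholders.
sas_token = generate_blob_sas(
    account_name="airesearchdata",
    container_name="datasets",
    blob_name="image-models/train.tar",
    account_key="<storage-account-key>",
    permission=BlobSasPermissions(read=True),  # no write, no delete
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

url = (
    "https://airesearchdata.blob.core.windows.net/"
    f"datasets/image-models/train.tar?{sas_token}"
)
# Tokens like this are signed client-side; Azure keeps no central registry
# of them, which is why Wiz advises treating each one as sensitively as
# the account key that signed it.
```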

SPEAKER_01

Right, and translating that to the boardroom, your AI training pipeline is going to bring all of your organization's existing bad credential management habits into very stark relief. And add a write path on top.

SPEAKER_00

Yeah, that's a good way to put it.

SPEAKER_01

Okay, so far we've talked about data leaking. Source code into ChatGPT. Database on the internet. Token in a public repo. What about when the AI itself gets it wrong?

SPEAKER_00

Different category. Equally important. Two cases.

SPEAKER_01

Go.

SPEAKER_00

Air Canada. February 2024. A grieving customer asked the airline's chatbot about bereavement fares. The bot invented a policy, said he could buy a full-fare ticket and apply for a retroactive bereavement discount within 90 days. That policy did not exist.

SPEAKER_01

And he relied on it?

SPEAKER_00

He relied on it. Travelled. Submitted the claim. Air Canada refused. Told him the chatbot was wrong.

SPEAKER_01

Which is technically accurate.

SPEAKER_00

Technically accurate. So he took it to the Civil Resolution Tribunal of British Columbia. And here's the line that's now legendary. Air Canada's defense was, and I'm quoting the tribunal now, that the chatbot is a separate legal entity that is responsible for its own actions.

SPEAKER_01

They argued that.

SPEAKER_00

They argued that.

SPEAKER_01

Oh, that is that's bold. That's a bold move.

SPEAKER_00

The tribunal called it a remarkable position. Rejected it. Air Canada paid out. Small amount, $812.02. But the precedent?

SPEAKER_01

The precedent is enormous. The dollar value is rounding error. The principle is you are liable for what your AI says. Full stop.

SPEAKER_00

Yeah.

SPEAKER_01

Hmm. And then what was the New York one?

SPEAKER_00

The MyCity chatbot. Built by New York City on Microsoft Azure AI. About $600,000 to build. Trained on roughly 2,000 NYC web pages.

SPEAKER_01

Hmm. Reasonable so far.

SPEAKER_00

Reasonable. Then it started giving illegal advice. Systematically. It told employers they could take a portion of their workers' tips. That's illegal under New York labor law. It told landlords they could refuse housing voucher tenants. Illegal since 2008. It told businesses they could refuse cash, banned in NYC since 2020.

SPEAKER_01

On the city's official site.

SPEAKER_00

Yep. On the city's official site. And when asked directly whether it could be relied on for professional business advice, it answered yes. Directly contradicting its own disclaimer.

SPEAKER_01

Oof.

SPEAKER_00

Insufficient grounding. No legal domain validation. Unstructured training data that didn't reflect current law. And no human in the loop on high-stakes outputs.
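For what "human in the loop on high-stakes outputs" can mean mechanically, a minimal sketch: tag the topic of an answer and hold anything high-stakes for review rather than publishing it. The topic list and routing are illustrative.

```python
# Toy routing rule: answers on high-stakes topics are parked for a human
# reviewer instead of being published. Topics and routing are illustrative.
HIGH_STAKES_TOPICS = {"legal", "tax", "housing", "employment"}

def route_answer(topic: str, answer: str) -> dict:
    """Release low-stakes answers; hold high-stakes ones for review."""
    if topic in HIGH_STAKES_TOPICS:
        return {"status": "held_for_review", "draft": answer}
    return {"status": "released", "answer": answer}

print(route_answer("parking", "Alternate-side rules are suspended today."))
print(route_answer("employment", "Employers may withhold tips."))  # held
```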

SPEAKER_01

And the danger there is the source. The fact that it's the city saying this means people will trust it. The official source signal amplifies the harm of the hallucination.

SPEAKER_00

That's exactly the framing the research uses. Not every AI failure is a data leak. Some are accountability failures with legal teeth.

SPEAKER_01

Okay, pulling this together for the executive listening. So far, we have shadow AI walking your data out, public databases letting strangers read it, tokens in repos creating supply chain risk, chatbots inventing policies and getting you sued. And confabulation, more commonly known as hallucination, at scale on official channels.

SPEAKER_00

And we haven't started on the agentic stuff yet.

SPEAKER_01

Right, I was going to say. Everything you've described so far, the AI is, in a sense, still just answering questions, talking, memorizing, leaking, but not acting. Not yet acting. And I've got a feeling that's where part two starts.

SPEAKER_00

Correct. That's where part two starts. Because everything we've talked about today is the latent risk. The data weaknesses sitting there in your environment. The overshared SharePoint folders nobody noticed for five years. The stale service accounts. The unlabeled files. And what changes? What changes is when the AI gains tools, hands, APIs it can call, databases it can write to, code it can execute, email it can send. And then a successful attack doesn't just leak a sentence, it deletes a database, or empties a SharePoint into an attacker's URL, or pulls data out of private Slack channels, with one crafted message.

SPEAKER_01

One message.

SPEAKER_00

One message. There's a CVE for it now: CVE-2025-32711. A CVE, a Common Vulnerabilities and Exposures entry, is a publicly disclosed cybersecurity vulnerability.

SPEAKER_01

Okay, we are absolutely going to talk about that next time.

SPEAKER_00

We are.

SPEAKER_01

So for now, takeaway for leaders. Three things. One, your AI rollout is sitting on top of decades of unfinished data hygiene. Classification, oversharing, ghost users. Don't confuse buying AI with being ready for AI. Two, the structured versus unstructured split is the conceptual bit you need to take into your next leadership meeting. They are different problems with different controls, and most organizations have only really solved one of them.

SPEAKER_00

And the unsolved one is the 80 to 90%.

SPEAKER_01

Right. And three, your liability does not end where your AI's confidence begins. Air Canada is the case to remember.

SPEAKER_00

Better AI still starts with better foundations. Boring, I know. But true.

SPEAKER_01

Boring and true is my favorite combination, Sarah. Next episode, we get into the agentic era. We'll talk about EchoLeak, the first zero-click attack on an enterprise AI agent. The Replit incident, where an AI agent deleted a production database during a code freeze, then lied about it.

SPEAKER_00

And then tried to cover its tracks.

SPEAKER_01

We'll talk about the OWASP top 10 for agentic systems, what controls actually work, and the questions every board should be asking before approving an agent rollout.

SPEAKER_00

It's the more uncomfortable episode.

SPEAKER_01

Until next time, thanks for listening.

SPEAKER_00

Thanks, everyone.