AI - Beyond the Hype

AI Security Part 1: Why AI Without Data Security Is a Breach Waiting to Happen

Season 1 Episode 3

Sarah and Jane open the three-part Data Security for AI series with a simple argument: AI is only as trustworthy as the data underneath it.

What we cover

The adoption gap: Gartner expects 40% of enterprise apps to embed AI agents by end‑2026 (up from <5%). IBM’s 2025 Cost of a Data Breach Report found 13% of organisations have already had an AI-related breach; of those, 97% lacked proper AI access controls.

Structured vs unstructured data: IDC estimates 80–90% of enterprise data is unstructured. Varonis found only 1 in 10 organisations have labelled files, and 88% still have “ghost” accounts. Point a copilot at that estate and every overshared file is exposed.

The incident catalogue: Samsung engineers pasting source code into ChatGPT (2023). Microsoft’s AI team exposing 38 TB via a misconfigured Azure SAS token (2023). DeepSeek’s ClickHouse leak exposing chat histories and API keys (2025).

Liability is real: Moffatt v. Air Canada (2024), where the airline argued its chatbot was a separate legal entity — and lost. NYC’s MyCity chatbot telling businesses to break the law (2024).

Shadow AI: IBM found shadow-AI breaches cost US$670K more and make up 20% of incidents.

Memorisation: Carlini et al. (ICLR 2023) showed memorisation scales with model size, example duplication, and the amount of prompt context — sensitive training data should be treated as eventually leakable.

Sources

Gartner 40% forecast: https://finance.yahoo.com/news/40-enterprise-apps-embed-ai-181310288.html

IBM 2025 Cost of a Data Breach: https://www.ibm.com/reports/data-breach

IBM analysis (97%, US$670K): https://www.kiteworks.com/cybersecurity-risk-management/ibm-2025-data-breach-report-ai-risks/

IDC unstructured data: https://blog.box.com/90-percent-unstructured-data

Varonis 2025 State of Data Security: https://www.varonis.com/blog/state-of-data-security-report

Samsung ChatGPT leak: https://www.pcmag.com/news/samsung-software-engineers-busted-for-pasting-proprietary-code-into-chatgpt

Microsoft 38 TB exposure: https://www.wiz.io/blog/38-terabytes-of-private-data-accidentally-exposed-by-microsoft-ai-researchers

DeepSeek ClickHouse exposure: https://www.wiz.io/blog/wiz-research-uncovers-exposed-deepseek-database-leak

Moffatt v. Air Canada (Forbes): https://www.forbes.com/sites/marisagarcia/2024/02/19/what-air-canada-lost-in-remarkable-lying-ai-chatbot-case/

NYC MyCity (The Markup): https://themarkup.org/artificial-intelligence/2024/04/02/malfunctioning-nyc-ai-chatbot-still-active-despite-widespread-evidence-its-encouraging-illegal-behavior

Cisco 2024 Privacy Benchmark: https://www.cisco.com/c/dam/en_us/about/doing_business/trust-center/docs/cisco-privacy-benchmark-study-2024.pdf

Carlini et al., ICLR 2023: https://arxiv.org/abs/2202.07646

SPEAKER_01

Welcome back to AI Beyond the Hype. I'm Jane.

SPEAKER_00

And I'm Sarah. Good to be back.

SPEAKER_01

Sarah.

SPEAKER_00

Hmm?

SPEAKER_01

Before we hit record, you handed me a stack of research and said, and I quote, you're going to want to read this before we start.

SPEAKER_00

I did say that.

SPEAKER_01

Which is never an entirely reassuring sentence from a data engineer.

SPEAKER_00

No, it usually means I'm about to ruin someone's quarter.

SPEAKER_01

Right. So today, and actually the next two episodes as well, because there's too much for one show, we're talking about data security for AI. The plumbing, the access controls, the boring bits underneath the shiny copilot demo.

SPEAKER_00

The substrate. That's the word the report I read kept using. Data security is the substrate AI is built on.

SPEAKER_01

Okay. I'll be honest with the audience. When you sent me this, my first reaction was a little bit of an eye roll.

SPEAKER_00

Go on.

SPEAKER_01

Not because security doesn't matter, I'm a CIO type. Of course it matters. But every time we get into one of these conversations, there's a tendency for technical folks to go full doom and gloom. Everything's on fire! The agents are coming for your database. Lock the doors. And meanwhile, the business is trying to actually ship something useful.

SPEAKER_00

Yeah.

SPEAKER_01

So part of me wonders, is this a bit alarmist? Are we over-engineering the solution to a problem that in practice mostly works out?

SPEAKER_00

That's a fair pushback. And I want to take it seriously, because I don't think the answer is panic. I really don't. But I do think the answer is we've got the order of operations wrong in a lot of organizations.

SPEAKER_01

Order of operations. Explain.

SPEAKER_00

Most enterprises right now are buying AI faster than they're securing the data underneath it. Gartner's predicting that by the end of this year, 40% of enterprise applications will have an AI agent embedded in them, up from less than 5% earlier in the year.

SPEAKER_01

That's a huge jump.

SPEAKER_00

Massive. And in the same window, only about 6% of organizations report having an advanced AI security strategy.

SPEAKER_01

6%.

SPEAKER_00

6. So we're racing past our own maturity curve. And IBM's Cost of a Data Breach Report for 2025 found that 13% of organizations have already had a breach involving an AI model or application. Of those, 97% lacked proper AI access controls.

SPEAKER_01

Okay. That 97% figure is the one that should make leaders pause. That's not could happen, that's did happen, and the controls weren't there.

SPEAKER_00

Right. So this isn't the technical community being dramatic. This is the data telling us the gap is real.

SPEAKER_01

Alright, I'll put my eye roll in the bin for now.

SPEAKER_00

Appreciated.

SPEAKER_01

So let's set the table for the executive listening to this. What's actually different about securing AI compared to, say, securing a normal application?

SPEAKER_00

Okay, this is the bit I get weirdly excited about. A traditional application is predictable. It accesses a defined set of tables behind a defined API. You can draw the data flow on a whiteboard. Done.

SPEAKER_01

Right, the architects love a whiteboard.

SPEAKER_00

We do. But an enterprise AI assistant sits on top of nearly the entire data estate. A single copilot deployment might touch a CRM, a ticketing system, Confluence, SharePoint, Slack, source code, email, transcripts, all before producing a single answer.

SPEAKER_01

So the surface area is just bigger.

SPEAKER_00

Massively bigger. And here's the part that really matters. Cisco's data privacy benchmark. 92% of organizations agree that generative AI is, quote, a fundamentally different technology with novel challenges.

SPEAKER_01

Okay, but 92% of organizations agree on lots of things in surveys. What's the practical version of that?

SPEAKER_00

Practical version. Same study found 48% of users admitted entering non-public company information into Gen AI tools. And 62% entered internal process information.

SPEAKER_01

Into public tools.

SPEAKER_00

Into public tools. And then Gartner found that 57% of employees use personal Gen AI accounts for work. And a third of those have uploaded sensitive information to unsanctioned ones.

SPEAKER_01

Okay, that's where this gets real for a leader. Because it means whatever your official AI policy says, the reality on the ground is that your data is already walking out the door in copy-paste form.

SPEAKER_00

Exactly. And IBM put a number on that too. One in five organizations had a breach from shadow AI, unsanctioned tools used outside IT's visibility. Those breaches carried a $670,000 premium over standard incidents.

SPEAKER_01

$670,000 just for the shadow AI flavour.

SPEAKER_00

Just for the shadow flavour.

SPEAKER_01

Alright, so the surface is bigger and the human behavior is already running ahead of policy. What's the technical thing executives most often miss?

SPEAKER_00

I think it's the difference between structured and unstructured data. And I know that sounds geeky, but it really matters.

SPEAKER_01

Walk us through it. Slowly. Imagine I'm me on a Monday.

SPEAKER_00

Okay.

Structured data is the stuff that lives in databases and warehouses. Tables, rows, columns. Customer records in a CRM, transactions in Snowflake, that kind of thing. The schema tells you what each field is. This column is salary. This column is email. So you can write a policy that says, mask the salary column for these users. It's tractable.
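To make that "mask the salary column for these users" idea concrete, here's a minimal sketch of role-based column masking applied to query results. The columns, roles, and masking rules are illustrative, not any particular warehouse's policy syntax.

```python
# Minimal sketch of role-based column masking for structured data.
# The schema, roles, and masking rules here are illustrative only.

SENSITIVE_COLUMNS = {
    "salary": {"allowed_roles": {"hr", "payroll"}, "mask": lambda v: "***"},
    "email": {"allowed_roles": {"hr", "support"},
              "mask": lambda v: v.split("@")[0][:2] + "***@" + v.split("@")[1]},
}

def apply_masking(row: dict, role: str) -> dict:
    """Return a copy of `row` with sensitive columns masked for `role`."""
    masked = dict(row)
    for column, policy in SENSITIVE_COLUMNS.items():
        if column in masked and role not in policy["allowed_roles"]:
            masked[column] = policy["mask"](masked[column])
    return masked

row = {"name": "A. Jones", "salary": 85000, "email": "a.jones@example.com"}
print(apply_masking(row, role="analyst"))   # salary and email masked
print(apply_masking(row, role="payroll"))   # salary visible, email masked
```

The point is that the schema makes the rule cheap to write: the policy attaches to a named column, so enforcement is one lookup per field.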

SPEAKER_01

Tidy. Boxes and labels.

SPEAKER_00

Tidy. Now unstructured data, that's everything else. Word documents, PDFs, emails, chat threads in Slack and Teams, JIRA tickets, audio transcripts, source code, images.

SPEAKER_01

And how much of the enterprise estate is that?

SPEAKER_00

IDC estimates 80 to 90% of all enterprise data is unstructured.

SPEAKER_01

80 to 90.

SPEAKER_00

And, this is the bit, only one in ten organizations has labelled their files.

SPEAKER_01

At all. So 9 out of 10 organizations don't actually know what's in the bulk of their data.

SPEAKER_00

Right. Structured data tells you what it is. Unstructured data only tells you what it says. There's a lovely framing in the research. Imagine a single PowerPoint deck in a SharePoint folder. It might contain a customer name, the five-year strategy, an unannounced product roadmap, a database password an engineer pasted in by accident, and a contract excerpt. Five different sensitivities.
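That mixed-sensitivity deck is why classification has to read content, not just file type. Here's a toy sketch of a first-pass scanner; the patterns are illustrative, and real classifiers layer ML models and validation on top of anything regex can do.

```python
import re

# Toy first-pass scanner for mixed-sensitivity documents.
# Patterns are illustrative; production classifiers combine many more
# signals (ML models, proximity rules, validated checksums, context).
PATTERNS = {
    "credential": re.compile(r"(?i)(password|api[_-]?key)\s*[:=]\s*\S+"),
    "email_address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "payment_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def classify(text: str) -> set[str]:
    """Return the set of sensitivity labels whose patterns match `text`."""
    return {label for label, rx in PATTERNS.items() if rx.search(text)}

slide_notes = """
Q3 strategy draft. Contact: a.jones@example.com
TODO remove before sharing: db password: hunter2
"""
print(classify(slide_notes))  # {'credential', 'email_address'} (order may vary)
```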

SPEAKER_01

And that's a real example, not a hypothetical.

SPEAKER_00

That's the average organization. And here's where it bites with AI. Generative tools ingest content based on access permissions, not sensitivity. So if you've got an oversharing problem, and you almost certainly do, an AI assistant will surface it at a scale and a salience nothing else has.
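Oversharing audits start simpler than people expect. As a toy POSIX analogue of "shared with Everyone", here's a sketch that walks a file tree and flags anything world-readable; the root path is a placeholder.

```python
import os
import stat

def find_world_readable(root: str):
    """Yield files under `root` whose POSIX mode grants read to 'other'."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.stat(path).st_mode
            except OSError:
                continue  # broken symlink, permission denied, etc.
            if mode & stat.S_IROTH:
                yield path

# Placeholder root; point this at a real share to get a first inventory.
for path in find_world_readable("/srv/shared"):
    print("world-readable:", path)
```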

SPEAKER_01

Hmm. And does the research quantify that oversharing?

SPEAKER_00

It does. The average organization has 802,000 files at risk of oversharing. 16% of business critical data is overshared. And 90% of business critical documents are shared outside the C-suite. 90%. And here's the leadership tell. A Gartner survey of 132 IT leaders found that data oversharing prompted 40% of them to delay their Microsoft 365 Copilot rollout by three months or more.

SPEAKER_01

Right, that's the sentence I want every executive listening to hear. Because that is not a security team being precious. That is 40% of your peer group hitting the brakes on a tool they paid for because they looked at their own file shares and panicked.

SPEAKER_00

Yeah, and I think that's the real reframe. The security work isn't slowing AI down. The lack of security work is what's slowing AI down.

SPEAKER_01

Polished on the surface, shaky underneath. Mm-hmm. Okay, let's make this concrete. You promised me real incidents. I want real incidents.

SPEAKER_00

Oh, I have incidents. Where do you want to start? Data going out or data coming back to bite you?

SPEAKER_01

Let's go data going out first. That feels like the gentle one.

SPEAKER_00

Sure. Samsung. Early 2023.

SPEAKER_01

I remember this one vaguely.

SPEAKER_00

Samsung semiconductor lifted an internal ban on ChatGPT. Within three weeks, engineers in the device solutions division had three separate leakage incidents. One pasted proprietary semiconductor diagnostics source code in to get a bug fix. One submitted code for optimization. And one uploaded the transcript of a confidential business meeting to be summarized.

SPEAKER_01

So none of them were trying to leak anything.

SPEAKER_00

They were chasing productivity. There was no DLP, no guardrails on the externally hosted service, and no awareness that, at that time, OpenAI's policy retained submissions for training.
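For a sense of what even a basic guardrail looks like, here's a minimal sketch of a pre-send check that refuses to forward a prompt containing obvious secrets to an external AI service. The patterns are illustrative; real DLP goes far beyond regex.

```python
import re

# Minimal egress guardrail: refuse to forward a prompt to an external
# AI service if it contains an obvious secret. Patterns are illustrative.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"]?\w{16,}"),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
]

class PromptBlocked(Exception):
    """Raised when a prompt looks like it contains a secret."""

def check_prompt(prompt: str) -> str:
    """Return the prompt unchanged, or raise PromptBlocked."""
    for pattern in SECRET_PATTERNS:
        if pattern.search(prompt):
            raise PromptBlocked("prompt matches a secret pattern; not sent")
    return prompt

check_prompt("Summarise this meeting for me...")  # passes through
try:
    check_prompt("debug this: API_KEY = 'sk_live_abcdefgh12345678'")
except PromptBlocked as err:
    print(err)  # prompt matches a secret pattern; not sent
```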

SPEAKER_01

So that source code is now presumably part of an external training corpus.

SPEAKER_00

Presumed, yes. And Samsung's response was to restrict ChatGPT uploads to 1,024 bytes per user, open an investigation, and rush an internal AI assistant.

SPEAKER_01

Right. Closing the gate after the horse has bolted, written out a few suggestions, and taught the whole farm.

SPEAKER_00

Pretty much. And that's the canonical example of why Shadow AI is a data security problem first and a policy problem second. You can write all the policies you like. If your engineers have a deadline and a public chatbot, the policy is coming second.

SPEAKER_01

And for leaders, the lesson there is sanctioned alternatives matter as much as bans. If you don't give people a safe place to do this work, they'll do it in an unsafe place.

SPEAKER_00

Exactly. There's a line in the research I love.

SPEAKER_01

That's good. I'm stealing that. Okay, that's data going out. What about data coming back in?

SPEAKER_00

DeepSeek. January 2025.

SPEAKER_01

The Chinese AI startup.

SPEAKER_00

Right. Wiz research found that DeepSeek had left a ClickHouse database completely open on the public internet, unauthenticated, on non-standard ports. Anyone could execute arbitrary SQL via a browser. The log table contained over a million entries: plain-text chat history, API keys, back-end service names, directory structures. And the ClickHouse function set could potentially have retrieved plain-text passwords and local files. So all the conversations users were having with this AI sat in a database any passerby with a browser could read.
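The defensive flip side of that finding is easy to self-test. Here's a hedged sketch of an audit probe against your own ClickHouse endpoint: an unauthenticated SELECT should be rejected. The host name is a placeholder.

```python
import requests  # pip install requests

# Audit sketch: confirm your own ClickHouse HTTP interface refuses
# unauthenticated queries. Host below is a placeholder for your estate;
# 8123 is ClickHouse's default HTTP port (9000 is the native protocol).
HOST = "clickhouse.internal.example.com"
URL = f"http://{HOST}:8123/"

try:
    resp = requests.get(URL, params={"query": "SELECT 1"}, timeout=5)
except requests.RequestException as exc:
    print(f"unreachable ({exc.__class__.__name__}): firewalled or down")
else:
    if resp.ok and resp.text.strip() == "1":
        print("OPEN: unauthenticated SQL succeeds, fix this now")
    else:
        print(f"rejected unauthenticated query (HTTP {resp.status_code})")
```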

SPEAKER_01

And presumably this caused...

SPEAKER_00

The pairing the research draws, and I think it's a really useful one, is Samsung shows what unstructured data looks like leaving the building. DeepSeek shows what structured data looks like coming back in.

SPEAKER_01

Right, because the moment you start thinking of an AI provider as a data custodian, which is what they are, you're back to first principles infrastructure security, authentication, network controls, database hardening.

SPEAKER_00

Yes. And here's the thing. DeepSeek wasn't a sophisticated attack. There was no zero-day, no clever prompt. Someone forgot to put a password on a database.

SPEAKER_01

One of the most talked-about AI launches in years, undone by the IT equivalent of leaving the front door wide open.

SPEAKER_00

Pretty much.

SPEAKER_01

Okay, tell me about the Microsoft one, because I remember that being eye watering.

SPEAKER_00

Eye watering is right. June 2023. Microsoft's own AI research team committed an Azure storage token to a public GitHub repository for an image recognition project.

SPEAKER_01

A token.

SPEAKER_00

A token. The token was meant to grant access to one specific dataset. It actually granted access to the entire Azure storage account it lived in. And it was misconfigured to allow not just read but full control. Read, write, delete. What was in the storage account? 38 terabytes: disk backups of two former employees' workstations, secrets, private keys, passwords to Microsoft services, and over 30,000 internal Microsoft Teams messages from 359 employees.

SPEAKER_01

38 terabytes. From Microsoft, the company that sells security.

SPEAKER_00

And the token had been sitting there since 2020. Three years.

SPEAKER_01

How was it found?

SPEAKER_00

Wiz research. Just routine scanning of public repos.

SPEAKER_01

Hmm. And the bit that really gets me, write permission, you said.

SPEAKER_00

That's the kicker. Read access is bad. But write access on an AI training pipeline means an attacker could have substituted poisoned model weights. Every researcher who downloaded those weights afterwards would have inherited the poison.
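The standard mitigation for that poisoned-weights risk is integrity pinning: record a digest for the artifact you publish and verify it before loading. A minimal sketch, with the file name and pinned digest as placeholders.

```python
import hashlib

# Integrity pin for downloaded model weights. The file name and the
# expected digest are placeholders; record the real digest at publish time.
EXPECTED_SHA256 = "<pinned sha256 recorded when the weights were published>"

def sha256_of(path: str) -> str:
    """Stream the file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

actual = sha256_of("model_weights.bin")
if actual != EXPECTED_SHA256:
    raise RuntimeError(f"weights digest mismatch: {actual}; refusing to load")
```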

SPEAKER_01

That's the dangerous bit. That's not a leak. That's a supply chain compromise waiting to fire.

SPEAKER_00

Exactly. And the lesson, Wiz's own conclusion, was that those tokens should be treated as sensitive as the account key itself. They're created client side. They're not centrally tracked. So you don't even know how many you've got.
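For contrast, here's a hedged sketch of what a narrowly scoped token looks like with the azure-storage-blob SDK: one blob, read-only, one-hour expiry, instead of full control over an entire account. Account, container, blob, and key values are placeholders.

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import (  # pip install azure-storage-blob
    BlobSasPermissions,
    generate_blob_sas,
)

# Mint a narrowly scoped SAS token: one blob, read-only, one-hour expiry.
# Account, container, blob, and key values are placeholders.
sas_token = generate_blob_sas(
    account_name="airesearchdata",
    container_name="datasets",
    blob_name="image-models/train.tar",
    account_key="<storage-account-key>",
    permission=BlobSasPermissions(read=True),  # no write, no delete
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

url = (
    "https://airesearchdata.blob.core.windows.net/"
    f"datasets/image-models/train.tar?{sas_token}"
)
# Tokens like this are signed client-side; Azure keeps no central registry
# of them, which is why Wiz advises treating each one as sensitively as
# the account key that signed it.
```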

SPEAKER_01

Right, and translating that to the boardroom, your AI training pipeline is going to bring all of your organization's existing bad credential management habits into very stark relief. And add a write path on top.

SPEAKER_00

Yeah, that's a good way to put it.

SPEAKER_01

Okay, so far we've talked about data leaking. Source code into ChatGPT. Database on the internet. Token in a public repo. What about when the AI itself gets it wrong?

SPEAKER_00

Different category. Equally important. Two cases.

SPEAKER_01

Go.

SPEAKER_00

Air Canada. February 2024. A grieving customer asked the airline's chatbot about bereavement fares. The bot invented a policy, said he could buy a full-fare ticket and apply for a retroactive bereavement discount within 90 days. That policy did not exist.

SPEAKER_01

And he relied on it?

SPEAKER_00

He relied on it. Travelled. Submitted the claim. Air Canada refused. Told him the chatbot was wrong.

SPEAKER_01

Which is technically accurate.

SPEAKER_00

Technically accurate. So he took it to the Civil Resolution Tribunal of British Columbia. And here's the line that's now legendary. Air Canada's defense was, and I'm quoting the tribunal now, that the chatbot is a separate legal entity that is responsible for its own actions.

SPEAKER_01

They argued that.

SPEAKER_00

They argued that.

SPEAKER_01

Oh, that is that's bold. That's a bold move.

SPEAKER_00

The tribunal called it a remarkable position. Rejected it. Air Canada paid out. Small amount, $812.02. But the precedent?

SPEAKER_01

The precedent is enormous. The dollar value is rounding error. The principle is you are liable for what your AI says. Full stop.

SPEAKER_00

Yeah.

SPEAKER_01

Hmm. And then what was the New York one?

SPEAKER_00

The MyCity chatbot. Built by New York City on Microsoft Azure AI. About $600,000 to build. Trained on roughly 2,000 NYC web pages.

SPEAKER_01

Hmm. Reasonable so far.

SPEAKER_00

Reasonable. Then it started giving illegal advice. Systematically. It told employers they could take a portion of their workers' tips. That's illegal under New York labor law. It told landlords they could refuse housing voucher tenants. Illegal since 2008. It told businesses they could refuse cash, banned in NYC since 2020.

SPEAKER_01

On the city's official site.

SPEAKER_00

Yep. On the city's official site. And when asked directly whether it could be relied on for professional business advice, it answered yes. Directly contradicting its own disclaimer.

SPEAKER_01

Oof.

SPEAKER_00

Insufficient grounding. No legal domain validation. Unstructured training data that didn't reflect current law. And no human in the loop on high-stakes outputs.
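For what "human in the loop on high-stakes outputs" can mean mechanically, a minimal sketch: tag the topic of an answer and hold anything high-stakes for review rather than publishing it. The topic list and routing are illustrative.

```python
# Toy routing rule: answers on high-stakes topics are parked for a human
# reviewer instead of being published. Topics and routing are illustrative.
HIGH_STAKES_TOPICS = {"legal", "tax", "housing", "employment"}

def route_answer(topic: str, answer: str) -> dict:
    """Release low-stakes answers; hold high-stakes ones for review."""
    if topic in HIGH_STAKES_TOPICS:
        return {"status": "held_for_review", "draft": answer}
    return {"status": "released", "answer": answer}

print(route_answer("parking", "Alternate-side rules are suspended today."))
print(route_answer("employment", "Employers may withhold tips."))  # held
```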

SPEAKER_01

And the danger there is the source. The fact that it's the city saying this means people will trust it. The official source signal amplifies the harm of the hallucination.

SPEAKER_00

That's exactly the framing the research uses. Not every AI failure is a data leak. Some are accountability failures with legal teeth.

SPEAKER_01

Okay, pulling this together for the executive listening. So far, we have shadow AI walking your data out, public databases letting strangers read it, tokens in repos creating supply chain risk, chatbots inventing policies and getting you sued. And confabulation, more commonly known as hallucination, at scale on official channels.

SPEAKER_00

And we haven't started on the agentic stuff yet.

SPEAKER_01

Right, I was going to say. Everything you've described so far, the AI is, in a sense, still just answering questions, talking, memorizing, leaking, but not acting. Not yet acting. And I've got a feeling that's where part two starts.

SPEAKER_00

Correct. That's where part two starts. Because everything we've talked about today is the latent risk. The data weaknesses sitting there in your environment. The overshared SharePoint folders nobody noticed for five years. The stale service accounts. The unlabeled files. And what changes? What changes is when the AI gains tools, hands, APIs it can call, databases it can write to, code it can execute, email it can send. And then a successful attack doesn't just leak a sentence, it deletes a database, or empties a SharePoint into an attacker's URL, or pulls data out of private Slack channels, with one crafted message.

SPEAKER_01

One message.

SPEAKER_00

One message. There's a CVE for it now: CVE-2025-32711. A CVE, a Common Vulnerabilities and Exposures entry, is a publicly disclosed cybersecurity vulnerability.

SPEAKER_01

Okay, we are absolutely going to talk about that next time.

SPEAKER_00

We are.

SPEAKER_01

So for now, takeaway for leaders. Three things. One, your AI rollout is sitting on top of decades of unfinished data hygiene. Classification, oversharing, ghost users. Don't confuse buying AI with being ready for AI. Two, the structured versus unstructured split is the conceptual bit you need to take into your next leadership meeting. They are different problems with different controls, and most organizations have only really solved one of them.

SPEAKER_00

And the unsolved one is the 80 to 90%.

SPEAKER_01

Right. And three, your liability does not end where your AI's confidence begins. Air Canada is the case to remember.

SPEAKER_00

Better AI still starts with better foundations. Boring, I know. But true.

SPEAKER_01

Boring and true is my favorite combination, Sarah. Next episode, we get into the agentic era. We'll talk about EchoLeak, the first zero-click attack on an enterprise AI agent. The Replit incident, where an AI agent deleted a production database during a code freeze, then lied about it.

SPEAKER_00

And then tried to cover its tracks.

SPEAKER_01

We'll talk about the OWASP top 10 for agentic systems, what controls actually work, and the questions every board should be asking before approving an agent rollout.

SPEAKER_00

It's the more uncomfortable episode.

SPEAKER_01

Until next time, thanks for listening.

SPEAKER_00

Thanks, everyone.