AI - Beyond the Hype
AI - Beyond the Hype is a podcast for senior executives, technology leaders, and data professionals who want a clear-eyed view of what it really takes to make AI work in the enterprise.
Each short episode is designed for easy consumption by busy leaders and executives, offering concise, practical conversations on the foundations behind successful AI adoption — from data quality and observability to governance, operating models, architecture, and trust. Through thoughtful, conversational dialogue, the show connects executive priorities with the technical realities that determine whether AI delivers meaningful value or simply creates more noise.
If your organisation is asking big questions about AI readiness, digital transformation, and data-driven decision-making, this podcast is designed to help you quickly separate what sounds impressive from what actually works.
AI Security Part 1: Why AI Without Data Security Is a Breach Waiting to Happen
Sarah and James open the three-part Data Security for AI series with a simple argument: AI is only as trustworthy as the data underneath it.
What we cover
The adoption gap: Gartner expects 40% of enterprise apps to embed AI agents by end‑2026 (up from <5%). IBM’s 2025 Cost of a Data Breach Report found 13% of organisations have had an AI-related breach — 97% lacked proper access controls.
Structured vs unstructured data: IDC estimates 80–90% of enterprise data is unstructured. Varonis found only 1 in 10 organisations have labelled files, and 88% still have “ghost” accounts. Point a copilot at that estate and every overshared file is exposed.
The incident catalogue: Samsung engineers pasting source code into ChatGPT (2023). Microsoft’s AI team exposing 38 TB — via a misconfigured Azure SAS token. DeepSeek’s ClickHouse leak exposing chat histories and API keys (2025).
Liability is real: Moffatt v. Air Canada (2024), where the airline argued its chatbot was a separate legal entity — and lost. NYC's MyCity chatbot giving illegal business advice on the city's official site.
Shadow AI: IBM found shadow-AI breaches cost US$670K more and make up 20% of incidents.
Memorisation: Carlini et al. (ICLR 2023) showed models memorise training data based on size, duplication, and prompt context — sensitive data should be treated as eventually leakable.
Sources
Gartner 40% forecast: https://finance.yahoo.com/news/40-enterprise-apps-embed-ai-181310288.html
IBM 2025 Cost of a Data Breach: https://www.ibm.com/reports/data-breach
IBM analysis (97%, US$670K): https://www.kiteworks.com/cybersecurity-risk-management/ibm-2025-data-breach-report-ai-risks/
IDC unstructured data: https://blog.box.com/90-percent-unstructured-data
Varonis 2025 State of Data Security: https://www.varonis.com/blog/state-of-data-security-report
Samsung ChatGPT leak: https://www.pcmag.com/news/samsung-software-engineers-busted-for-pasting-proprietary-code-into-chatgpt
Microsoft 38 TB exposure: https://www.wiz.io/blog/38-terabytes-of-private-data-accidentally-exposed-by-microsoft-ai-researchers
DeepSeek ClickHouse exposure: https://www.wiz.io/blog/wiz-research-uncovers-exposed-deepseek-database-leak
Moffatt v. Air Canada (Forbes): https://www.forbes.com/sites/marisagarcia/2024/02/19/what-air-canada-lost-in-remarkable-lying-ai-chatbot-case/
NYC MyCity (The Markup): https://themarkup.org/artificial-intelligence/2024/04/02/malfunctioning-nyc-ai-chatbot-still-active-despite-widespread-evidence-its-encouraging-illegal-behavior
Cisco 2024 Privacy Benchmark: https://www.cisco.com/c/dam/en_us/about/doing_business/trust-center/docs/cisco-privacy-benchmark-study-2024.pdf
Carlini et al., ICLR 2023: https://arxiv.org/abs/2202.07646
SPEAKER_01Welcome back to AI Beyond the Hype. I'm James.
SPEAKER_00And I'm Sarah. Good to be back.
SPEAKER_01Sarah.
SPEAKER_00Hmm?
SPEAKER_01Before we hit record, you handed me a stack of research and said, and I quote, you're going to want to read this before we start.
SPEAKER_00I did say that.
SPEAKER_01Which is never an entirely reassuring sentence from a data engineer.
SPEAKER_00No, it usually means I'm about to ruin someone's quarter.
SPEAKER_01Right. So today, and actually the next two episodes as well, because there's too much for one show, we're talking about data security for AI. The plumbing, the access controls, the boring bits underneath the shiny copilot demo.
SPEAKER_00The substrate. That's the word the report I read kept using. Data security is the substrate AI is built on.
SPEAKER_01Okay. I'll be honest with the audience. When you sent me this, my first reaction was a little bit of an eye roll.
SPEAKER_00Go on.
SPEAKER_01Not because security doesn't matter, I'm a CIO type. Of course it matters. But every time we get into one of these conversations, there's a tendency for technical folks to go full doom and gloom. Everything's on fire! The agents are coming for your database. Lock the doors. And meanwhile, the business is trying to actually ship something useful.
SPEAKER_00Yeah.
SPEAKER_01So part of me wonders, is this a bit alarmist? Are we over-engineering the solution to a problem that in practice mostly works out?
SPEAKER_00That's a fair pushback. And I want to take it seriously, because I don't think the answer is panic. I really don't. But I do think the answer is we've got the order of operations wrong in a lot of organizations.
SPEAKER_01Order of operations. Explain.
SPEAKER_00Most enterprises right now are buying AI faster than they're securing the data underneath it. Gartner's predicting that by the end of this year, 40% of enterprise applications will have an AI agent embedded in them, up from less than 5% earlier in the year.
SPEAKER_01That's a huge jump.
SPEAKER_00Massive. And in the same window, only about 6% of organizations report having an advanced AI security strategy.
SPEAKER_016%.
SPEAKER_006. So we're racing past our own maturity curve. And IBM's 2025 Cost of a Data Breach Report found that 13% of organizations have already had a breach involving an AI model or application. Of those, 97% lacked proper AI access controls.
SPEAKER_01Okay. That 97% figure is the one that should make leaders pause. That's not could happen, that's did happen, and the controls weren't there.
SPEAKER_00Right. So this isn't the technical community being dramatic. This is the data telling us the gap is real.
SPEAKER_01Alright, I'll put my eye roll in the bin for now.
SPEAKER_00Appreciated.
SPEAKER_01So let's set the table for the executive listening to this. What's actually different about securing AI compared to, say, securing a normal application?
SPEAKER_00Okay, this is the bit I get weirdly excited about. A traditional application is predictable. It accesses a defined set of tables behind a defined API. You can draw the data flow on a whiteboard. Done.
SPEAKER_01Right, the architects love a whiteboard.
SPEAKER_00We do. But an enterprise AI assistant sits on top of nearly the entire data estate. A single copilot deployment might touch a CRM, a ticketing system, Confluence, SharePoint, Slack, source code, email transcripts, all before producing a single answer.
SPEAKER_01So the surface area is just bigger.
SPEAKER_00Massively bigger. And here's the part that really matters. Cisco's data privacy benchmark. 92% of organizations agree that generative AI is, quote, a fundamentally different technology with novel challenges.
SPEAKER_01Okay, but 92% of organizations agree on lots of things in surveys. What's the practical version of that?
SPEAKER_00Practical version. Same study found 48% of users admitted entering non-public company information into Gen AI tools. And 62% entered internal process information.
SPEAKER_01Into public tools.
SPEAKER_00Into public tools. And then Gartner found that 57% of employees use personal Gen AI accounts for work. And a third of those have uploaded sensitive information to unsanctioned ones.
SPEAKER_01Okay, that's where this gets real for a leader. Because it means whatever your official AI policy says, the reality on the ground is that your data is already walking out the door in copy-paste form.
SPEAKER_00Exactly. And IBM put a number on that too. One in five organizations had a breach from Shadow AI. Unsanctioned tools used outside IT's visibility. Those breaches carried a $670,000 premium over standard incidents.
SPEAKER_01$670,000 just for the shadow AI flavour.
SPEAKER_00Just for the shadow flavour.
SPEAKER_01Alright, so the surface is bigger and the human behavior is already running ahead of policy. What's the technical thing executives most often miss?
SPEAKER_00I think it's the difference between structured and unstructured data. And I know that sounds geeky, but it really matters.
SPEAKER_01Walk us through it. Slowly. Imagine I'm me on a Monday.
SPEAKER_00Okay.
SPEAKER_00Structured data is the stuff that lives in databases and warehouses. Tables, rows, columns. Customer records in a CRM, transactions in Snowflake, that kind of thing. The schema tells you what each field is. This column is salary. This column is email. So you can write a policy that says, mask the salary column for these users. It's tractable. Right.
SPEAKER_01Tidy. Boxes and labels.
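[Show notes aside: Sarah's salary-column example is concrete enough to sketch in code. This is a hypothetical role-based masking rule, not any particular database vendor's masking syntax; the column names and roles are illustrative.]

```python
# Hypothetical column-masking rule for structured data. Because the schema
# tells us which field is which, the policy can name "salary" directly,
# which is exactly the tractability being described.
SENSITIVE_COLUMNS = {"salary", "email"}
PRIVILEGED_ROLES = {"hr_admin", "payroll"}

def mask_row(row: dict, role: str) -> dict:
    """Return a copy of the row with sensitive columns masked for non-privileged roles."""
    if role in PRIVILEGED_ROLES:
        return dict(row)
    return {col: ("****" if col in SENSITIVE_COLUMNS else val)
            for col, val in row.items()}

record = {"name": "A. Patel", "salary": 91000, "email": "a@example.com"}
print(mask_row(record, role="analyst"))   # salary and email masked
print(mask_row(record, role="hr_admin"))  # full record
```

The point is that nothing like this exists for a PowerPoint deck: there is no schema column called "the unannounced roadmap".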
SPEAKER_00Tidy. Now unstructured data, that's everything else. Word documents, PDFs, emails, chat threads in Slack and Teams, JIRA tickets, audio transcripts, source code, images.
SPEAKER_01And how much of the enterprise estate is that?
SPEAKER_00IDC estimates 80 to 90% of all enterprise data is unstructured.
SPEAKER_0180 to 90.
SPEAKER_00And, this is the bit, only one in ten organizations has labelled their files.
SPEAKER_01At all. So 9 out of 10 organizations don't actually know what's in the bulk of their data.
SPEAKER_00Right. Structured data tells you what it is. Unstructured data only tells you what it says. There's a lovely framing in the research. Imagine a single PowerPoint deck in a SharePoint folder. It might contain a customer name, the five-year strategy, an unannounced product roadmap, a database password an engineer pasted in by accident, and a contract excerpt. Five different sensitivities.
SPEAKER_01And that's a real example, not a hypothetical.
SPEAKER_00That's the average organization. And here's where it bites with AI. Generative tools ingest content based on access permissions, not sensitivity. So if you've got an oversharing problem, and you almost certainly do, an AI assistant will surface it at a scale and a salience nothing else has.
SPEAKER_01Hmm. And does the research quantify that oversharing?
SPEAKER_00It does. The average organization has 802,000 files at risk of oversharing. 16% of business critical data is overshared. And 90% of business critical documents are shared outside the C-suite. 90%. And here's the leadership tell. A Gartner survey of 132 IT leaders found that data oversharing prompted 40% of them to delay their Microsoft 365 Copilot rollout by three months or more.
SPEAKER_01Right, that's the sentence I want every executive listening to hear. Because that is not a security team being precious. That is 40% of your peer group hitting the brakes on a tool they paid for because they looked at their own file shares and panicked.
SPEAKER_00Yeah, and I think that's the real reframe. The security work isn't slowing AI down. The lack of security work is what's slowing AI down.
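[Show notes aside: the oversharing figures above come from permission audits, and the core check is simple enough to sketch. Everything here is illustrative: the ACL shape and the broad-access principal names are stand-ins, not any platform's real API.]

```python
# Hypothetical oversharing audit: flag files whose access list includes a
# broad principal. This is the pattern behind most copilot rollout surprises:
# the AI respects permissions, and the permissions are too wide.
BROAD_PRINCIPALS = {"Everyone", "All Company", "Anyone with the link"}

def overshared(files: list[dict]) -> list[str]:
    """Return names of files granted to any broad principal."""
    return [f["name"] for f in files if BROAD_PRINCIPALS & set(f["shared_with"])]

estate = [
    {"name": "strategy.pptx", "shared_with": ["Everyone", "board"]},
    {"name": "timesheet.xlsx", "shared_with": ["payroll"]},
    {"name": "roadmap.docx", "shared_with": ["Anyone with the link"]},
]
print(overshared(estate))  # ['strategy.pptx', 'roadmap.docx']
```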
SPEAKER_01Polished on the surface, shaky underneath.
SPEAKER_00Mm-hmm.
SPEAKER_01Okay, let's make this concrete. You promised me real incidents. I want real incidents.
SPEAKER_00Oh, I have incidents. Where do you want to start? Data going out or data coming back to bite you?
SPEAKER_01Let's go data going out first. That feels like the gentle one.
SPEAKER_00Sure. Samsung. Early 2023.
SPEAKER_01I remember this one vaguely.
SPEAKER_00Samsung semiconductor lifted an internal ban on ChatGPT. Within three weeks, engineers in the device solutions division had three separate leakage incidents. One pasted proprietary semiconductor diagnostics source code in to get a bug fix. One submitted code for optimization. And one uploaded the transcript of a confidential business meeting to be summarized.
SPEAKER_01So none of them were trying to leak anything.
SPEAKER_00They were chasing productivity. There was no DLP, no guardrails on the externally hosted service, and no awareness that, at that time, OpenAI's policy retained submissions for training.
SPEAKER_01So that source code is now presumably part of an external training corpus.
SPEAKER_00Presumed, yes. And Samsung's response was to restrict ChatGPT uploads to 1,024 bytes per user, open an investigation, and rush an internal AI assistant.
SPEAKER_01Right. Closing the gate after the horse has bolted, once it's already written out a few suggestions and taught the whole farm.
SPEAKER_00Pretty much. And that's the canonical example of why Shadow AI is a data security problem first and a policy problem second. You can write all the policies you like. If your engineers have a deadline and a public chatbot, the policy is coming second.
SPEAKER_01And for leaders, the lesson there is sanctioned alternatives matter as much as bans. If you don't give people a safe place to do this work, they'll do it in an unsafe place.
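[Show notes aside: the "safe place" point has a technical counterpart. A sanctioned gateway can at least screen prompts on the way out, which is the control missing in the Samsung story. This is a minimal sketch with deliberately crude placeholder patterns; real DLP tooling is far more sophisticated.]

```python
import re

# Hypothetical outbound check: block or redact obvious secrets before a
# prompt leaves for an external service with unknown retention.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)password\s*[:=]\s*\S+"),
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S+"),
]

def check_outbound(prompt: str) -> tuple[bool, str]:
    """Return (allowed, redacted_prompt); disallow if any secret pattern matches."""
    hits = [p for p in SECRET_PATTERNS if p.search(prompt)]
    if not hits:
        return True, prompt
    redacted = prompt
    for p in hits:
        redacted = p.sub("[REDACTED]", redacted)
    return False, redacted

print(check_outbound("summarise this meeting transcript")[0])   # True
print(check_outbound("debug this: password = hunter2")[0])      # False
```

A check like this does not replace a sanctioned assistant, but it turns "policy says don't" into "the tooling won't".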
SPEAKER_00Exactly. There's a line in the research I love.
SPEAKER_01That's good. I'm stealing that. Okay, that's data going out. What about data coming back in?
SPEAKER_00DeepSeek. January 2025.
SPEAKER_01The Chinese AI startup.
SPEAKER_00Right. Wiz research found that DeepSeek had left a ClickHouse database completely open on the public internet, unauthenticated, on non-standard ports. Open on the internet. Anyone could execute arbitrary SQL via a browser. The log table contained over a million entries, plain text chat history, API keys, back-end service names, directory structures, and the ClickHouse function set could potentially have retrieved plain text passwords and local files. So all the conversations users were having with this AI sat in a database any passerby with a browser could read.
SPEAKER_01And presumably this caused...
SPEAKER_00The pairing the research draws, and I think it's a really useful one, is Samsung shows what unstructured data looks like leaving the building. DeepSeek shows what structured data looks like coming back in.
SPEAKER_01Right, because the moment you start thinking of an AI provider as a data custodian, which is what they are, you're back to first principles infrastructure security, authentication, network controls, database hardening.
SPEAKER_00Yes. And here's the thing. DeepSeek wasn't a sophisticated attack. There was no zero day, no clever prompt. Someone forgot to put a password on a database.
SPEAKER_01The world's most expensive AI training run, undone by the IT equivalent of leaving the front door wide open.
SPEAKER_00Pretty much.
SPEAKER_01Okay, tell me about the Microsoft one, because I remember that being eye watering.
SPEAKER_00Eye watering is right. June 2023. Microsoft's own AI research team committed an Azure storage token to a public GitHub repository for an image recognition project.
SPEAKER_01A token.
SPEAKER_00A token. The token was meant to grant access to one specific dataset. It actually granted access to the entire Azure storage account it lived in. And it was misconfigured to allow not just read, but full control. Read, write, delete. What was in the storage account? 38 TB of disk backups of two former employees' workstations, secrets, private keys, passwords to Microsoft services, and over 30,000 internal Microsoft Teams messages from 359 employees.
SPEAKER_0138 terabytes. From Microsoft, the company that sells security.
SPEAKER_00And the token had been sitting there since 2020. Three years.
SPEAKER_01How was it found?
SPEAKER_00Wiz research. Just routine scanning of public repos.
SPEAKER_01Hmm. And the bit that really gets me, write permission, you said.
SPEAKER_00That's the kicker. Read access is bad. But write access on an AI training pipeline means an attacker could have substituted poisoned model weights. Every researcher who downloaded those weights afterwards would have inherited the poison.
SPEAKER_01That's the dangerous bit. That's not a leak. That's a supply chain compromise waiting to fire.
SPEAKER_00Exactly. And the lesson, Wiz's own conclusion, was that those tokens should be treated as sensitive as the account key itself. They're created client side. They're not centrally tracked. So you don't even know how many you've got.
SPEAKER_01Right, and translating that to the boardroom, your AI training pipeline is going to bring all of your organization's existing bad credential management habits into very stark relief. And add a write path on top.
SPEAKER_00Yeah, that's a good way to put it.
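[Show notes aside: the credential-hygiene point is what pre-commit secret scanning addresses. This is a minimal sketch that keys on the `sv=` (signed version) and `sig=` (signature) query parameters Azure SAS URLs carry; real scanners such as gitleaks combine many rule sets with entropy checks.]

```python
import re

# Minimal sketch of a secret scan for SAS-style storage tokens embedded in
# files about to be committed. Client-side tokens aren't centrally tracked,
# so catching them before they land in a public repo is the cheap control.
SAS_PATTERN = re.compile(r"[?&]sv=[\w-]+.*[?&]sig=[\w%+/=-]+")

def scan_files(files: dict[str, str]) -> list[str]:
    """Return paths of files whose contents look like they embed a SAS token."""
    return [path for path, text in files.items() if SAS_PATTERN.search(text)]

repo = {
    "README.md": "Download the dataset from our storage account.",
    "fetch.py": "URL = 'https://acct.blob.core.windows.net/d?sv=2020-08-04&sig=abc123%3D'",
}
print(scan_files(repo))  # ['fetch.py']
```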
SPEAKER_01Okay, so far we've talked about data leaking. Source code into ChatGPT. Database on the internet. Token in a public repo. What about when the AI itself gets it wrong?
SPEAKER_00Different category. Equally important. Two cases.
SPEAKER_01Go.
SPEAKER_00Air Canada. February 2024. A grieving customer asked their chatbot about bereavement fares. The bot invented a policy, said he could buy a full fare ticket and apply for a retroactive bereavement discount within 90 days. That policy did not exist.
SPEAKER_01And he relied on it?
SPEAKER_00He relied on it. Travelled. Submitted the claim. Air Canada refused. Told him the chat bot was wrong.
SPEAKER_01Which is technically accurate.
SPEAKER_00Technically accurate. So he took it to the Civil Resolution Tribunal of British Columbia. And here's the line that's now legendary. Air Canada's defense was, and I'm quoting the tribunal now, that the chat bot is a separate legal entity that is responsible for its own actions.
SPEAKER_01They argued that.
SPEAKER_00They argued that.
SPEAKER_01Oh, that is that's bold. That's a bold move.
SPEAKER_00The tribunal called it a remarkable position. Rejected it. Air Canada paid out. Small amount, $812.02. But the precedent?
SPEAKER_01The precedent is enormous. The dollar value is rounding error. The principle is you are liable for what your AI says. Full stop.
SPEAKER_00Yeah.
SPEAKER_01Hmm. And then what was the New York one?
SPEAKER_00The MyCity chatbot. Built by New York City on Microsoft Azure AI. About $600,000 to build. Trained on roughly 2,000 NYC web pages.
SPEAKER_01Hmm. Reasonable so far.
SPEAKER_00Reasonable. Then it started giving illegal advice. Systematically. It told employers they could take a portion of their workers' tips. That's illegal under New York labor law. It told landlords they could refuse housing voucher tenants. Illegal since 2008. It told businesses they could refuse cash, banned in NYC since 2020.
SPEAKER_01On the city's official site.
SPEAKER_00Yep. On the city's official site. And when asked directly whether it could be relied on for professional business advice, it answered yes. Directly contradicting its own disclaimer.
SPEAKER_01Oof.
SPEAKER_00Insufficient grounding. No legal domain validation. Unstructured training data that didn't reflect current law. And no human in the loop on high-stakes outputs.
SPEAKER_01And the danger there is the source. The fact that it's the city saying this means people will trust it. The official source signal amplifies the harm of the hallucination.
SPEAKER_00That's exactly the framing the research uses. Not every AI failure is a data leak. Some are accountability failures with legal teeth.
SPEAKER_01Okay, pulling this together for the executive listening. So far, we have shadow AI walking your data out, public databases letting strangers read it, tokens in repos creating supply chain risk, chatbots inventing policies and getting you sued. And confabulation, more commonly known as hallucination, at scale on official channels.
SPEAKER_00And we haven't started on the agentic stuff yet.
SPEAKER_01Right, I was going to say. Everything you've described so far, the AI is, in a sense, still just answering questions, talking, memorizing, leaking, but not acting. Not yet acting. And I've got a feeling that's where part two starts.
SPEAKER_00Correct. That's where part two starts. Because everything we've talked about today is the latent risk. The data weaknesses sitting there in your environment. The overshared SharePoint folders nobody noticed for five years. The stale service accounts. The unlabeled files. And what changes? What changes is when the AI gains tools, hands, APIs it can call, databases it can write to, code it can execute, email it can send. And then a successful attack doesn't just leak a sentence, it deletes a database, or empties a SharePoint into an attacker's URL, or pulls data out of private Slack channels, with one crafted message.
SPEAKER_01One message.
SPEAKER_00One message. There's a CVE for it now: CVE-2025-32711. A CVE is a Common Vulnerabilities and Exposures entry, a publicly disclosed security vulnerability.
SPEAKER_01Okay, we are absolutely going to talk about that next time.
SPEAKER_00We are.
SPEAKER_01So for now, takeaway for leaders. Three things. One, your AI rollout is sitting on top of decades of unfinished data hygiene. Classification, oversharing, ghost users. Don't confuse buying AI with being ready for AI. Two, the structured versus unstructured split is the conceptual bit you need to take into your next leadership meeting. They are different problems with different controls, and most organizations have only really solved one of them.
SPEAKER_00And the unsolved one is the 80 to 90%.
SPEAKER_01Right. And three, your liability does not end where your AI's confidence begins. Air Canada is the case to remember.
SPEAKER_00Better AI still starts with better foundations. Boring, I know. But true.
SPEAKER_01Boring and true is my favorite combination, Sarah. Next episode, we get into the agentic era. We'll talk about EchoLeak, the first zero-click attack on an enterprise AI agent. The Replit incident, where an AI agent deleted a production database during a code freeze, then lied about it.
SPEAKER_00And then tried to cover its tracks.
SPEAKER_01We'll talk about the OWASP top 10 for agentic systems, what controls actually work, and the questions every board should be asking before approving an agent rollout.
SPEAKER_00It's the more uncomfortable episode.
SPEAKER_01Until next time, thanks for listening.
SPEAKER_00Thanks, everyone.