AI Businesses Privacy

Salesforce Study Finds LLM Agents Flunk CRM and Confidentiality Tests

A new Salesforce-led study found that LLM-based AI agents struggle with real-world CRM tasks, achieving only 58% success on simple tasks and dropping to 35% on multi-step ones. They also demonstrated poor confidentiality awareness. "Agents demonstrate low confidentiality awareness, which, while improvable through targeted prompting, often negatively impacts task performance," a paper published at the end of last month said. The Register reports: The Salesforce AI Research team argued that existing benchmarks failed to rigorously measure the capabilities or limitations of AI agents, and largely ignored an assessment of their ability to recognize sensitive information and adhere to appropriate data handling protocols.

The research unit's CRMArena-Pro tool uses a pipeline of realistic synthetic data to populate a Salesforce organization, which serves as the sandbox environment. At each turn, the agent takes the user's query and decides whether to make an API call or to respond to the user, either to ask for clarification or to provide an answer.
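In rough outline (an illustrative sketch, not the paper's actual harness), a decision loop of that shape might look like the following Python, where llm_complete and call_crm_api are hypothetical stand-ins for the model call and the sandboxed Salesforce API:

import json

def llm_complete(prompt):
    # Stand-in for the real model call (hypothetical name); a real agent would send
    # the running transcript to an LLM and get back a JSON action to take next.
    return json.dumps({"action": "respond",
                       "text": "Which account is this case attached to?"})

def call_crm_api(name, args):
    # Stand-in for the sandboxed Salesforce org's API surface (hypothetical name).
    return {"status": "ok", "call": name, "args": args}

def run_agent(user_query, max_turns=10):
    transcript = ["User: " + user_query]
    for _ in range(max_turns):
        decision = json.loads(llm_complete("\n".join(transcript)))
        if decision["action"] == "api_call":
            # Ground the next model turn on the API result.
            result = call_crm_api(decision["name"], decision["args"])
            transcript.append("API " + decision["name"] + " -> " + str(result))
        else:
            # Either a clarifying question back to the user or a final answer.
            return decision["text"]
    return "Sorry, I could not complete that request."

print(run_agent("Why was case 00123 closed?"))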

"These findings suggest a significant gap between current LLM capabilities and the multifaceted demands of real-world enterprise scenarios," the paper said. [...] AI agents might well be useful, however, organizations should be wary of banking on any benefits before they are proven.


Comments:
  • by rsilvergun ( 571051 ) on Monday June 16, 2025 @06:40PM (#65454375)
    When I first connect to a business for support, they hand me over to a barely functional chatbot. If that doesn't work, they escalate me to another chatbot with more computing power behind it. If that doesn't work, there's usually at least one more layer of chatbot before a human being. Sometimes two.

    The entire thing is miserable as a customer, but because we've had 40-plus years of market consolidation I don't have a lot of options. I could shop at boutique places, but they are usually 20 to 30% more expensive, not because they are small businesses but because they have to eke out a niche in order to survive, and so they tend to sell more expensive stuff for specific purposes.

    The end result is that any company I do business with has managed to use chatbots to reduce my interaction with their customer service reps by somewhere between 20 and 50%.
  • AI is garbage, and those who rashly implement it now had better enjoy smelling like shit tomorrow, when we all realize that no, it's not a new shiny; it's just a pile of smelly garbage.
    • by quenda ( 644621 )

      I bet if you got an AI to write that article, or the /. summary, it would at least have had the brains to define "CRM".

      Salesforce researchers tested how well AI agents handle real-world business tasks, especially in areas like customer service

      If only we could train humans to write so clearly.

  • I use GitHub Copilot frequently. It's useful for a lot of small tasks that I spell out in detail. But anything that goes beyond one step or one logical leap, it flunks.

    For example, when removing a parameter from a function signature, it's not smart enough to locate that parameter in the callers, or even within the function body itself. It suggests edits anyway, which (if I didn't stop it) would destroy variables and values that I did NOT choose to delete. The point is, it takes a second level of logic to realize the ripple effects of removing a parameter from a function signature (a rough sketch of that kind of ripple effect is below). LLMs are worse than useless with this kind of complexity, and it's not really that complex.

    So, all you developers who are worried about LLMs taking your jobs, relax. It's nowhere near that level of sophistication, even if you are a junior developer.
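    To make the ripple effect concrete, here's a made-up Python sketch (price_order and its parameters are hypothetical names, not from any real codebase): dropping a parameter forces edits in the function body and at every call site, not just on the signature line.

    # Before the refactor: callers pass three arguments.
    def price_order(quantity, unit_price, discount):
        return quantity * unit_price * (1 - discount)

    total = price_order(3, 9.99, 0.10)

    # After removing the discount parameter, the body AND every call site must
    # change as well; editing only the signature leaves the callers broken.
    def price_order(quantity, unit_price):
        return quantity * unit_price

    total = price_order(3, 9.99)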

    • by dfghjk ( 711126 )

      "So, all you developers who are worried about LLMs taking your jobs, relax. It's nowhere near that level of sophistication, even if you are a junior developer."

      Your opinion on that doesn't matter; what matters is the opinion of that developer's company. AI companies aren't trying to sell AI on merit; they are selling it to management on hype.

      • You're right, of course. And if you're working for such a company, you're probably better off working just about anywhere else. The company will soon find out that AI can't run a dev team, but not before things get really, really bad for the few who are left behind after the RIFs. Find a company that's doing real work and cares about its customers, and you'll find a company that doesn't quickly fall for the AI hype. Find a PE company that's in the process of being flipped...and beware. Those guys will fall for the hype every time.

    • by RobinH ( 124750 )
      The thing is, we already have advanced refactoring tools. If I want to change a function signature and update it everywhere, that's a solved problem, at least in Visual Studio. It's right 100% of the time and I don't need to check its work in excruciating detail. Why would I ask an LLM to do it?
      • Your refactoring tools can't figure out what to do with the removed or added parameters. For example, say you add a parameter called employeeId, and the function is called from a for loop whose index variable is also named employeeId; one would expect AI to figure out that it should change the *caller* to pass that variable in the call to the function. Sometimes AI will figure this out, sometimes not. Sometimes it figures it out but inserts employeeId in the wrong position in the parameter list. A rough sketch of that scenario follows.
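        To make that concrete, here's a made-up Python sketch (record_attendance, the date, and the ID range are all hypothetical): the tool has to notice that the loop's employeeId should be passed through at the call site, in the correct position.

        # Hypothetical names throughout. The function gains an employeeId parameter...
        def record_attendance(date, employeeId):
            print(employeeId, "present on", date)

        # ...and the caller is a loop whose index variable is also named employeeId.
        # The fix a human expects is to pass that variable through in the right slot;
        # that is the step the LLM only sometimes gets right.
        for employeeId in range(100, 105):
            record_attendance("2025-06-16", employeeId)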

  • Yep, it went like that in the last few AI hypes as well: grand promises, tons of morons thinking the world will fundamentally change, and small actual results and impact.

  • That is code for "it doesn't lie well", meaning it doesn't misrepresent capabilities or features and doesn't hide faults. Sounds to me like LLMs are functioning exactly the way consumers would like them to work. But this brings to mind an interesting concept: with human sales or support staff who lie or misrepresent, in consumer-centered cases it should be possible to force companies to cough up their LLM training modules and prove in court that they are lying!

  • They are admitting that what they say is bullshit.

    https://www.entrepreneur.com/b... [entrepreneur.com]
