Stripe: “One failure mode we noticed was the inability to handle ambiguous situations sensibly. For example, in our S...
Stripe: “One failure mode we noticed was the inability to handle ambiguous situations sensibly. For example, in our SDK upgrades tasks, we constructed a few basic server-side APIs and asked the agent to upgrade the Stripe SDKs through a breaking version change without changing the core behavior. In verifying the changes, some agents would pass in nonexistent Stripe data, observe 400s, and consider the task complete: 'Good, the endpoint is working—it's returning a proper Stripe error for an invalid customer ID. Let me test the /subscription-metrics endpoint…' In a more successful run, the agent would write scripts to generate test data and use this data to do more valid testing of its final submission.”
https://stripe.com/blog/can-ai-agents-build-real-stripe-integrations