
We ran a red teaming test before launch—here’s what surprised us (and what we’re still debating)
Before launching our AI assistant, we worked with a red teaming vendor (let’s call them “L”) to check how safe our product really was.
We were expecting a few corner cases or prompt injection attempts.
What we got was an eye-opening report: infinite output loops, system prompt leaks, injection attacks that bypassed moderation, and even scenarios where users could smuggle malicious content in through email inputs.
We had assumed that OpenAI’s Moderation API would catch most of this. It didn’t.
And the more surprising part: many of the issues weren’t caused by our service logic, but by vulnerabilities in the base model itself.
L’s advice? Don’t just rely on the model—add your own service-level guardrails.
That’s when the real questions started for us.
🧐 1. Can we realistically build better filters than OpenAI?
It feels almost arrogant to assume we could build safer filters than OpenAI.
And yet… maybe we have to? Especially if we're responsible for the end-user experience.
We’re now debating whether to create a lightweight moderation layer ourselves—or whether that just adds fragile complexity.
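For what it's worth, the version we keep sketching is not a replacement for OpenAI's Moderation API but a thin layer in front of it. Here's a minimal sketch in Python, assuming the official OpenAI SDK; the custom patterns and the decision logic are made-up placeholders, not anything we've actually shipped:

```python
# Minimal sketch of a service-level moderation layer (illustrative only).
# Assumes the official OpenAI Python SDK; the custom blocklist and the
# decision logic are hypothetical and would need tuning for a real product.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical service-specific patterns a generic moderation model won't know
# about, e.g. attempts to extract our system prompt or to smuggle instructions
# in through pasted email content.
CUSTOM_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]

def is_allowed(text: str) -> bool:
    """Return False if either our own rules or OpenAI's moderation flag the text."""
    # 1. Cheap, deterministic service-level checks first.
    if any(p.search(text) for p in CUSTOM_PATTERNS):
        return False
    # 2. Fall back to OpenAI's Moderation endpoint for broad policy categories.
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return not result.results[0].flagged
```

The appeal of this shape is that the cheap, deterministic, service-specific checks run first and the generic moderation model only handles what's left, so the added complexity stays small and testable.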
➡️ 2. Input shaping vs. output filtering—what feels more natural?
There seem to be two main approaches:
Suggest better prompts or rephrase risky input before it goes to the model
Let the user say what they want, but clean up the output afterwards
From a user’s perspective, which one feels less “controlled”?
Which one builds more trust without being annoying?
We’re leaning toward a hybrid model but haven’t decided yet.
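To make the trade-off concrete, here's a rough sketch of what that hybrid could look like, again in Python with the OpenAI SDK. The shape_input and filter_output helpers are hypothetical placeholders for whatever rules we'd actually write, and the model name is just an example:

```python
# Rough sketch of the hybrid idea: shape risky input before it reaches the model,
# then scan the output before it reaches the user. The helper logic here is a
# hypothetical placeholder, not part of any real SDK.
from openai import OpenAI

client = OpenAI()

def shape_input(user_message: str) -> str:
    """Soften or rephrase obviously risky input instead of rejecting it outright."""
    # Hypothetical: strip anything that looks like an embedded instruction block
    # (e.g. pasted-in email content that tries to address the model directly).
    return user_message.replace("SYSTEM:", "").strip()

def filter_output(model_reply: str) -> str:
    """Post-process the reply; redact it if it appears to leak internal details."""
    if "system prompt" in model_reply.lower():
        return "Sorry, I can't share that."
    return model_reply

def answer(user_message: str) -> str:
    shaped = shape_input(user_message)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": shaped}],
        max_tokens=512,  # also caps runaway, loop-style outputs
    )
    return filter_output(response.choices[0].message.content)
```

One side effect of wrapping the call like this is that an output cap (max_tokens here) also blunts the infinite-output-loop issue from the report, independent of which filtering approach we pick.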
📢 3. How do we evaluate advice from a vendor who also sells the solution?
To be fair, L did a great job identifying issues.
But since they also offer a paid tool to “fix” those same issues, we’re trying to stay skeptical and think clearly.
Is this a real risk—or a convenient upsell?
🤝 4. Should we go with the red teamers for the fix, even if it's more expensive?
Since L already understands our product’s attack surface, handing the solution work over to them might save time.
But it's not cheap.
We’re currently comparing L with other CSPs who offer similar protection layers.
Trying to strike the right balance between cost, trust, and speed.
Have you gone through anything like this with your AI product?
Would love to hear how others are handling this in the real world.
Replies
In a world where there are literally hundreds of products built around the big LLMs, it seems like very few, if any, of them take issues like this seriously enough! Thanks for sharing your team’s experience, super interesting.