Safety & Content Controls

Configure guardrails to ensure your avatar stays on-topic, provides appropriate responses, and knows when to escalate to humans.

Why Safety Controls Matter

An AI avatar can occasionally:

  • Go off-topic or make things up
  • Provide advice outside its expertise
  • Miss signals that a human is needed
  • Discuss topics you'd rather avoid

Safety controls let you prevent these issues proactively.

Topic Boundaries

Define topics your avatar should avoid:

Dashboard Configuration

  1. Go to Dashboard > Avatars > [Your Avatar] > Safety
  2. Under Topics to Avoid, add entries like:
    • "competitor pricing"
    • "legal advice"
    • "medical recommendations"
    • "political opinions"
  3. Save changes

When users ask about avoided topics, the avatar responds with a polite redirect:

"I'm not able to help with that topic, but I can assist you with [relevant alternative]."

Via Personality Prompt

Include boundaries in your personality:

const PERSONALITY = `
You are a customer support agent for TechCorp.
 
## Topics to Avoid
Never discuss:
- Competitor products or pricing
- Legal advice or liability questions
- Medical or health recommendations
- Political or religious topics
 
If asked about these, say: "I'm not the best resource for that topic.
For [topic], I'd recommend consulting a [appropriate professional]."
`;

Human Escalation

Configure when to transfer to a human agent:

Automatic Triggers

Enable in Dashboard > Safety > Human Escalation:

Trigger              | Description
User requests human  | "Can I speak to a real person?"
Negative sentiment   | Frustration or anger detected
Repeated failures    | Avatar can't answer after 3 attempts
Sensitive topics     | Certain keywords detected

Escalation Response

When triggered, the avatar says:

"I think you'd be better served by one of my human colleagues. Let me connect you with our team."

Then, depending on your configuration, the avatar:

  • Shows a contact form
  • Provides support email/phone
  • Integrates with your live chat (Enterprise)

Custom Escalation Logic

<Avatar
  onEscalate={(reason, conversation) => {
    // Your custom escalation handling
    openLiveChat(conversation.id);
  }}
  escalationTriggers={[
    { type: 'keyword', values: ['refund', 'cancel', 'lawyer'] },
    { type: 'sentiment', threshold: -0.5 },
    { type: 'intent', values: ['speak_to_human'] }
  ]}
/>
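The onEscalate callback is ordinary application code, so openLiveChat can do whatever your stack needs. A minimal sketch, assuming a hypothetical /live-chat/sessions endpoint on your own backend (the payload shape here is illustrative, not part of the SDK):

```typescript
// Hypothetical escalation handler: hands the conversation off to your
// own backend, which can open a live-chat session for a human agent.
interface HandoffRequest {
  conversationId: string;
  reason: string;
  requestedAt: string; // ISO 8601 timestamp
}

function buildHandoffRequest(conversationId: string, reason: string): HandoffRequest {
  return {
    conversationId,
    reason,
    requestedAt: new Date().toISOString(),
  };
}

async function openLiveChat(conversationId: string, reason = 'unspecified'): Promise<void> {
  // POST to your own backend; the endpoint path is an assumption.
  await fetch('https://your-site.com/live-chat/sessions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildHandoffRequest(conversationId, reason)),
  });
}
```

Passing the trigger reason along lets your support tooling route keyword escalations (e.g. "lawyer") differently from sentiment escalations.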

Safety Filters

Built-in filters that are always active:

  • PII Protection – The avatar won't repeat back sensitive information such as SSNs or credit card numbers
  • Harmful Content – Blocks generation of harmful or illegal content
  • Jailbreak Prevention – Resists prompt injection attempts

Custom Filters (Pro)

Add your own content filters:

POST /avatars/{avatarId}/safety/filters
{
  "name": "profanity-filter",
  "type": "block",
  "patterns": ["badword1", "badword2"],
  "action": "replace",
  "replacement": "[filtered]"
}
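If you prefer to register filters from code, the request above can be sent with fetch. A sketch, assuming the API base URL shown (api.example.com is a placeholder) and bearer-token auth — check your dashboard for the real values:

```typescript
// Sketch: registering the profanity filter shown above via the HTTP API.
// The base URL and bearer-token auth scheme are assumptions.
interface SafetyFilter {
  name: string;
  type: 'block';
  patterns: string[];
  action: 'replace';
  replacement: string;
}

function buildProfanityFilter(patterns: string[]): SafetyFilter {
  return {
    name: 'profanity-filter',
    type: 'block',
    patterns,
    action: 'replace',
    replacement: '[filtered]',
  };
}

async function createFilter(avatarId: string, apiKey: string, filter: SafetyFilter): Promise<void> {
  const res = await fetch(`https://api.example.com/avatars/${avatarId}/safety/filters`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(filter),
  });
  if (!res.ok) throw new Error(`Filter creation failed: ${res.status}`);
}
```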

Response Validation

Ensure avatar responses meet your standards:

Confidence Scoring

Set a minimum confidence threshold; responses scored below it use the fallback message instead:

<Avatar
  minConfidence={0.7}
  lowConfidenceResponse="I'm not entirely sure about that. Let me connect you with someone who can help."
/>

Response Length Limits

Prevent overly long responses:

<Avatar
  maxResponseLength={500}  // characters
  truncationMessage="..."
/>

Monitoring & Alerts

Stay informed about safety events:

Dashboard Alerts

View safety events in Dashboard > Analytics > Safety Events:

  • Escalation triggers
  • Blocked topics
  • Filter activations
  • Low confidence responses

Webhooks

Receive real-time safety notifications:

POST /webhooks
{
  "url": "https://your-site.com/avatarium-safety",
  "events": ["safety.escalation", "safety.topic_blocked", "safety.filter_triggered"]
}
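On the receiving side, your handler just needs to branch on the event name. A minimal sketch of the routing logic, assuming the delivery payload carries an event field named after the subscriptions above (the exact payload shape isn't documented here — verify it against a real delivery):

```typescript
// Sketch of a dispatcher for the three safety events subscribed above.
// The { event, data } payload shape is an assumption.
type SafetyEvent =
  | 'safety.escalation'
  | 'safety.topic_blocked'
  | 'safety.filter_triggered';

interface SafetyWebhookPayload {
  event: SafetyEvent;
  data: Record<string, unknown>;
}

// Return values are placeholders for real side effects (paging,
// logging, metrics); they make the routing easy to test.
function handleSafetyEvent(payload: SafetyWebhookPayload): string {
  switch (payload.event) {
    case 'safety.escalation':
      // e.g. page the on-call support team
      return 'notified-support';
    case 'safety.topic_blocked':
      // e.g. log for the weekly review of topic boundaries
      return 'logged-topic';
    case 'safety.filter_triggered':
      // e.g. increment a metrics counter
      return 'recorded-metric';
  }
}
```

Wire this into whatever HTTP framework serves the endpoint you registered (https://your-site.com/avatarium-safety in the example above).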

Best Practices

  1. Start restrictive, loosen gradually – Better to over-block initially
  2. Review escalations weekly – Learn what the avatar struggles with
  3. Test adversarially – Try to break your own avatar
  4. Update based on real conversations – Add filters for unexpected issues

Configure in Dashboard →