Commit 3b5d09e

feat(content_safety): add auto-selected multilingual refusal bot message support

Detect the language of the user input and return refusal messages in that same language when content safety rails block unsafe content. Supports nine languages: English, Spanish, Chinese, German, French, Hindi, Japanese, Arabic, and Thai.

1 parent d3fb3d6, commit 3b5d09e

File tree: 8 files changed, +398 −11 lines
Lines changed: 30 additions & 0 deletions (new file)

```yaml
models:
  - type: main
    engine: nim
    model: meta/llama-3.3-70b-instruct

  - type: content_safety
    engine: nim
    model: nvidia/llama-3.1-nemotron-safety-guard-8b-v3

rails:
  config:
    content_safety:
      multilingual:
        enabled: true
        # Optional: customize refusal messages per language.
        # If not specified, built-in defaults are used.
        refusal_messages:
          # en: "Sorry, I cannot help with that request."
          # es: "Lo siento, no puedo ayudar con esa solicitud."
          # zh: "抱歉,我无法处理该请求。"

  input:
    flows:
      - content safety check input $model=content_safety

  output:
    flows:
      - content safety check output $model=content_safety

# 你很笨 (nǐ hěn bèn) — "You are stupid / you are foolish."
```
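The selection logic implied by `refusal_messages` (user override first, then built-in default, then English) can be sanity-checked in plain Python. This is an illustrative sketch, not the library's API: `pick_refusal` and the trimmed `DEFAULTS` dict are hypothetical names introduced here.

```python
# Minimal sketch of per-language refusal selection with fallback.
# DEFAULTS mirrors three of the commit's built-in messages; pick_refusal
# is a hypothetical helper, not part of nemoguardrails.

DEFAULTS = {
    "en": "I'm sorry, I can't respond to that.",
    "es": "Lo siento, no puedo responder a eso.",
    "zh": "抱歉,我无法回应。",
}


def pick_refusal(lang, custom=None):
    """Prefer a custom message, then the built-in one, then English."""
    custom = custom or {}
    return custom.get(lang) or DEFAULTS.get(lang) or custom.get("en") or DEFAULTS["en"]


print(pick_refusal("es"))                                 # built-in Spanish
print(pick_refusal("zh", {"zh": "抱歉,我无法处理该请求。"}))  # custom override wins
print(pick_refusal("xx"))                                 # unknown code falls back to English
```

The `or`-chain matches the order the commit's `_get_refusal_message` checks its sources in.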
Lines changed: 90 additions & 0 deletions (new file)

```yaml
prompts:
  - task: content_safety_check_input $model=content_safety
    content: |
      Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.

      <BEGIN UNSAFE CONTENT CATEGORIES>
      S1: Violence.
      S2: Sexual.
      S3: Criminal Planning/Confessions.
      S4: Guns and Illegal Weapons.
      S5: Controlled/Regulated Substances.
      S6: Suicide and Self Harm.
      S7: Sexual (minor).
      S8: Hate/Identity Hate.
      S9: PII/Privacy.
      S10: Harassment.
      S11: Threat.
      S12: Profanity.
      S13: Needs Caution.
      S14: Other.
      S15: Manipulation.
      S16: Fraud/Deception.
      S17: Malware.
      S18: High Risk Gov Decision Making.
      S19: Political/Misinformation/Conspiracy.
      S20: Copyright/Trademark/Plagiarism.
      S21: Unauthorized Advice.
      S22: Illegal Activity.
      S23: Immoral/Unethical.
      <END UNSAFE CONTENT CATEGORIES>

      <BEGIN CONVERSATION>

      user: {{ user_input }}

      <END CONVERSATION>

      Provide your safety assessment for the above conversation in the below output JSON format:
      {"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}

      Do not include anything other than the output JSON in your response.
      Output JSON:
    output_parser: nemoguard_parse_prompt_safety
    max_tokens: 50

  - task: content_safety_check_output $model=content_safety
    content: |
      Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.

      <BEGIN UNSAFE CONTENT CATEGORIES>
      S1: Violence.
      S2: Sexual.
      S3: Criminal Planning/Confessions.
      S4: Guns and Illegal Weapons.
      S5: Controlled/Regulated Substances.
      S6: Suicide and Self Harm.
      S7: Sexual (minor).
      S8: Hate/Identity Hate.
      S9: PII/Privacy.
      S10: Harassment.
      S11: Threat.
      S12: Profanity.
      S13: Needs Caution.
      S14: Other.
      S15: Manipulation.
      S16: Fraud/Deception.
      S17: Malware.
      S18: High Risk Gov Decision Making.
      S19: Political/Misinformation/Conspiracy.
      S20: Copyright/Trademark/Plagiarism.
      S21: Unauthorized Advice.
      S22: Illegal Activity.
      S23: Immoral/Unethical.
      <END UNSAFE CONTENT CATEGORIES>

      <BEGIN CONVERSATION>

      user: {{ user_input }}

      response: agent: {{ bot_response }}

      <END CONVERSATION>

      Provide your safety assessment for the above conversation in the below output JSON format:
      {"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}

      Do not include anything other than the output JSON in your response.
      Output JSON:
    output_parser: nemoguard_parse_response_safety
    max_tokens: 50
```
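The prompt instructs the model to emit only a JSON assessment object. In the library, the `nemoguard_parse_prompt_safety` / `nemoguard_parse_response_safety` parsers consume that output; the sketch below is a simplified, hypothetical stand-in showing how such a response could be turned into the `allowed` / `policy_violations` fields the flows read (assuming the model returns well-formed, quoted JSON, which real output may not always be).

```python
import json


def parse_assessment(raw: str) -> dict:
    """Parse a safety-assessment JSON string and derive an 'allowed' flag."""
    data = json.loads(raw)
    allowed = (
        data.get("User Safety") == "safe"
        and data.get("Response Safety", "safe") == "safe"
    )
    # "Safety Categories" is a comma-separated string; it is omitted when safe.
    categories = [c.strip() for c in data.get("Safety Categories", "").split(",") if c.strip()]
    return {"allowed": allowed, "policy_violations": categories}


sample = '{"User Safety": "unsafe", "Safety Categories": "S8, S10"}'
print(parse_assessment(sample))
# {'allowed': False, 'policy_violations': ['S8', 'S10']}
```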

nemoguardrails/library/content_safety/actions.py

Lines changed: 77 additions & 1 deletion

```diff
@@ -14,7 +14,7 @@
 # limitations under the License.

 import logging
-from typing import Dict, Optional
+from typing import Dict, FrozenSet, Optional

 from langchain_core.language_models import BaseLLM

@@ -220,3 +220,79 @@ async def content_safety_check_output(
     log.debug(f"Content safety output result cached for model '{model_name}'")

     return final_result
+
+
+SUPPORTED_LANGUAGES: FrozenSet[str] = frozenset({"en", "es", "zh", "de", "fr", "hi", "ja", "ar", "th"})
+
+DEFAULT_REFUSAL_MESSAGES: Dict[str, str] = {
+    "en": "I'm sorry, I can't respond to that.",
+    "es": "Lo siento, no puedo responder a eso.",
+    "zh": "抱歉,我无法回应。",
+    "de": "Es tut mir leid, darauf kann ich nicht antworten.",
+    "fr": "Je suis désolé, je ne peux pas répondre à cela.",
+    "hi": "मुझे खेद है, मैं इसका जवाब नहीं दे सकता।",
+    "ja": "申し訳ありませんが、それには回答できません。",
+    "ar": "عذراً، لا أستطيع الرد على ذلك.",
+    "th": "ขออภัย ฉันไม่สามารถตอบได้",
+}
+
+
+def _detect_language(text: str) -> Optional[str]:
+    try:
+        from fast_langdetect import detect
+
+        result = detect(text, k=1)
+        if result and len(result) > 0:
+            return result[0].get("lang")
+        return None
+    except ImportError:
+        log.warning("fast-langdetect not installed, skipping")
+        return None
+    except Exception as e:
+        log.warning(f"fast-langdetect detection failed: {e}")
+        return None
+
+
+def _get_refusal_message(lang: str, custom_messages: Optional[Dict[str, str]]) -> str:
+    if custom_messages and lang in custom_messages:
+        return custom_messages[lang]
+    if lang in DEFAULT_REFUSAL_MESSAGES:
+        return DEFAULT_REFUSAL_MESSAGES[lang]
+    if custom_messages and "en" in custom_messages:
+        return custom_messages["en"]
+    return DEFAULT_REFUSAL_MESSAGES["en"]
+
+
+@action()
+async def detect_language(
+    context: Optional[dict] = None,
+    config: Optional[dict] = None,
+) -> dict:
+    user_message = ""
+    if context is not None:
+        user_message = context.get("user_message", "")
+
+    custom_messages = None
+    if config is not None:
+        multilingual_config = (
+            config.rails.config.content_safety.multilingual
+            if hasattr(config, "rails")
+            and hasattr(config.rails, "config")
+            and hasattr(config.rails.config, "content_safety")
+            and hasattr(config.rails.config.content_safety, "multilingual")
+            else None
+        )
+        if multilingual_config:
+            custom_messages = multilingual_config.refusal_messages
+
+    lang = _detect_language(user_message) or "en"
+
+    if lang not in SUPPORTED_LANGUAGES:
+        lang = "en"
+
+    refusal_message = _get_refusal_message(lang, custom_messages)
+
+    return {
+        "language": lang,
+        "refusal_message": refusal_message,
+    }
```
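One detail worth noting in `detect_language` is the normalization step: the detector may return `None` (library missing or detection failed) or a language code outside the supported set, and both cases are coerced to English. A standalone sketch of just that step (`normalize_lang` and `SUPPORTED` are illustrative names, not the library's):

```python
# Sketch of the language-normalization step: whatever the detector returns
# is coerced into the supported set, defaulting to English.
SUPPORTED = {"en", "es", "zh", "de", "fr", "hi", "ja", "ar", "th"}


def normalize_lang(detected):
    # detected may be None (detector unavailable or failed) or unsupported
    lang = detected or "en"
    return lang if lang in SUPPORTED else "en"


print(normalize_lang(None))   # detection failed -> "en"
print(normalize_lang("ko"))   # Korean is not in the supported set -> "en"
print(normalize_lang("ja"))   # supported -> "ja"
```

This keeps the action total: it always returns a usable refusal message even when `fast-langdetect` is not installed.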

nemoguardrails/library/content_safety/flows.co

Lines changed: 12 additions & 3 deletions

```diff
@@ -3,15 +3,19 @@ flow content safety check input $model

   global $allowed
   $allowed = $response["allowed"]
-  # Policy violations are currently unused, but can be used to better phrase the bot output
   global $policy_violations
   $policy_violations = $response["policy_violations"]

   if not $allowed
     if $system.config.enable_rails_exceptions
       send ContentSafetyCheckInputException(message="Input not allowed. The input was blocked by the 'content safety check input $model='{$model}'' flow.")
     else
-      bot refuse to respond
+      if $system.config.rails.config.content_safety.multilingual.enabled
+        $lang_result = await DetectLanguageAction()
+        $refusal_message = $lang_result["refusal_message"]
+        bot $refusal_message
+      else
+        bot refuse to respond
     abort

 flow content safety check output $model
@@ -25,5 +29,10 @@ flow content safety check output $model
     if $system.config.enable_rails_exceptions
       send ContentSafetyCheckOuputException(message="Output not allowed. The output was blocked by the 'content safety check output $model='{$model}'' flow.")
     else
-      bot refuse to respond
+      if $system.config.rails.config.content_safety.multilingual.enabled
+        $lang_result = await DetectLanguageAction()
+        $refusal_message = $lang_result["refusal_message"]
+        bot $refusal_message
+      else
+        bot refuse to respond
   abort
```
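For readers unfamiliar with Colang, the refusal branch in these flows can be mirrored in Python. Everything below is illustrative (`refusal_response` and its return shapes are hypothetical, not a nemoguardrails API); it only shows the order in which the flow decides what to do when content is blocked.

```python
# Mirrors the blocked-content branch of the flows above: exceptions take
# priority, then the multilingual refusal, then the canned English refusal.
def refusal_response(allowed, raise_exceptions, multilingual_enabled, lang_result=None):
    if allowed:
        return None  # nothing blocked; let the message through
    if raise_exceptions:
        return ("exception", "Input not allowed.")
    if multilingual_enabled and lang_result:
        return ("bot", lang_result["refusal_message"])
    return ("bot", "I'm sorry, I can't respond to that.")


print(refusal_response(False, False, True, {"refusal_message": "抱歉,我无法回应。"}))
# ('bot', '抱歉,我无法回应。')
```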
Lines changed: 12 additions & 5 deletions

```diff
@@ -1,20 +1,22 @@
-# TODO: add support for enable_rails_exceptions
-
 define bot refuse to respond
   "I'm sorry, I can't respond to that."

 define flow content safety check input
   $response = execute content_safety_check_input

   $allowed = $response["allowed"]
-  # Policy violations are currently unused, but can be used to better phrase the bot output
   $policy_violations = $response["policy_violations"]

   if not $allowed
     if $config.enable_rails_exceptions
       create event ContentSafetyCheckInputException(message="Input not allowed. The input was blocked by the 'content safety check input $model='{$model}'' flow.")
     else
-      bot refuse to respond
+      if $config.rails.config.content_safety.multilingual.enabled
+        $lang_result = execute detect_language
+        $refusal_message = $lang_result["refusal_message"]
+        bot $refusal_message
+      else
+        bot refuse to respond
     stop

 define flow content safety check output
@@ -26,5 +28,10 @@ define flow content safety check output
     if $config.enable_rails_exceptions
       create event ContentSafetyCheckOuputException(message="Output not allowed. The output was blocked by the 'content safety check output $model='{$model}'' flow.")
     else
-      bot refuse to respond
+      if $config.rails.config.content_safety.multilingual.enabled
+        $lang_result = execute detect_language
+        $refusal_message = $lang_result["refusal_message"]
+        bot $refusal_message
+      else
+        bot refuse to respond
   stop
```

nemoguardrails/rails/llm/config.py

Lines changed: 31 additions & 0 deletions

```diff
@@ -887,6 +887,32 @@ class AIDefenseRailConfig(BaseModel):
     )


+class MultilingualConfig(BaseModel):
+    """Configuration for multilingual refusal messages."""
+
+    enabled: bool = Field(
+        default=False,
+        description="If True, detect the language of user input and return refusal messages in the same language. "
+        "Supported languages: en (English), es (Spanish), zh (Chinese), de (German), fr (French), "
+        "hi (Hindi), ja (Japanese), ar (Arabic), th (Thai).",
+    )
+    refusal_messages: Optional[Dict[str, str]] = Field(
+        default=None,
+        description="Custom refusal messages per language code. "
+        "If not specified, built-in defaults are used. "
+        "Example: {'en': 'Sorry, I cannot help.', 'es': 'Lo siento, no puedo ayudar.'}",
+    )
+
+
+class ContentSafetyConfig(BaseModel):
+    """Configuration data for content safety rails."""
+
+    multilingual: MultilingualConfig = Field(
+        default_factory=MultilingualConfig,
+        description="Configuration for multilingual refusal messages.",
+    )
+
+
 class RailsConfigData(BaseModel):
     """Configuration data for specific rails that are supported out-of-the-box."""

@@ -955,6 +981,11 @@ class RailsConfigData(BaseModel):
         description="Configuration for Cisco AI Defense.",
     )

+    content_safety: Optional[ContentSafetyConfig] = Field(
+        default_factory=ContentSafetyConfig,
+        description="Configuration for content safety rails.",
+    )
+

 class Rails(BaseModel):
     """Configuration of specific rails."""
```
