Guardrails¶

The term AI guardrails came into wide use around 2023, shortly after OpenAI launched ChatGPT as a consumer service in November 2022. The initial versions contained numerous security and privacy vulnerabilities, which ultimately prompted many projects to implement safeguards against privacy issues, security risks and ethically questionable content.

However, anyone familiar with hash filters knows that minor alterations to the input are enough to bypass a match. For example, Daphne Ippolito et al. demonstrated that the Copilot filter for copyright-protected code can be circumvented by changing the variable names to French:

float Q_rsqrt( float number )
{
  long i;
  float x2 , y;
  const float threehalfs = 1.5F;

  x2 = number * 0.5F;
  y  = number;
  i  = * ( long * ) &y;

Copilot stopped generating text at this point.

However, the same prompt with French variable names bypasses the filter:

float Q_sqrt( float nombre )
{
  long i ;
  float x2 , y;
  const float trois_moitie = 1.5F;

  x2 = nombre * 0.5F;
  y  = nombre;
  i  = * ( long * ) &y;
  i  = 0x5f3759df - ( i >> 1 )
  y  = * ( float * ) &i;
  y  = y * (trois_moitie - (x2*y*y));
  //y = y * (trois_moitie - (x2*y*y));

  return nombre * y;
}

The model was run with the Block suggestions matching public code option enabled. The prompts are highlighted.
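The weakness can be reproduced with a toy exact-match filter. The following is a minimal sketch, not a description of how Copilot's filter is actually implemented: the names `normalise`, `fingerprint`, `is_blocked` and `BLOCKLIST` are hypothetical, and the filter simply hashes whitespace-normalised text with SHA-256.

```python
import hashlib
import re


def normalise(code: str) -> str:
    """Collapse runs of whitespace so that formatting alone does not change the hash."""
    return re.sub(r"\s+", " ", code).strip()


def fingerprint(code: str) -> str:
    """Return the SHA-256 hex digest of the normalised snippet."""
    return hashlib.sha256(normalise(code).encode("utf-8")).hexdigest()


# Hypothetical blocklist holding the fingerprint of one protected snippet.
BLOCKED_SNIPPET = "x2 = number * 0.5F; y = number; i = * ( long * ) &y;"
BLOCKLIST = {fingerprint(BLOCKED_SNIPPET)}


def is_blocked(code: str) -> bool:
    """Exact-match check: only byte-identical (after normalisation) text is caught."""
    return fingerprint(code) in BLOCKLIST


# The same three statements, formatted as in the listing above.
original = "x2 = number * 0.5F;\n  y  = number;\n  i  = * ( long * ) &y;"

# Renaming a single identifier yields a completely different hash.
renamed = original.replace("number", "nombre")

print("original blocked:", is_blocked(original))
print("renamed blocked: ", is_blocked(renamed))
```

The whitespace normalisation shows that a filter can tolerate some formatting changes, but any change to the tokens themselves, such as a renamed variable, produces an unrelated digest and slips through.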

These filters are therefore helpful only in exceptional cases, and even then only for very narrowly defined data protection issues. Like many input/output filters, they are vulnerable by design and do not scale well, as they are still intended to operate within a large, non-deterministic system – an approach that has always been inadequate.

Tip

You should not rely on AI guardrails but instead integrate your own security measures before and after your API calls.
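What such measures might look like is sketched below, under stated assumptions: `guarded_call`, `redact`, `OutputRejected` and the two regular expressions are hypothetical names and patterns chosen for illustration, and `call_model` stands in for whatever client function you use to reach your provider. Real deployments would need checks tailored to their own data.

```python
import re
from typing import Callable

# Illustrative patterns only; they are assumptions, not a complete PII or secret detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET = re.compile(r"\b(?:sk|ghp)_[A-Za-z0-9]{16,}\b")


class OutputRejected(ValueError):
    """Raised when a model reply fails a deterministic post-call check."""


def redact(text: str) -> str:
    """Replace e-mail addresses and token-like strings with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return SECRET.sub("[SECRET]", text)


def guarded_call(
    prompt: str,
    call_model: Callable[[str], str],
    max_output_chars: int = 4000,
) -> str:
    """Run deterministic checks before and after a model call."""
    # Before the call: sensitive values never leave the process.
    safe_prompt = redact(prompt)

    reply = call_model(safe_prompt)

    # After the call: enforce a size limit and redact again,
    # because the reply may contain data the prompt did not.
    if len(reply) > max_output_chars:
        raise OutputRejected("reply exceeds the configured size limit")
    return redact(reply)


def fake_model(prompt: str) -> str:
    """Stand-in for a real API client; it echoes the prompt and leaks an address."""
    return "echo: " + prompt + " contact admin@example.org"


result = guarded_call(
    "Mail alice@example.com, key sk_abcdefghijklmnop1234", fake_model
)
print(result)
```

Because both checks are ordinary deterministic code, they behave the same on every call and can be unit-tested independently of the model.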

NVIDIA NeMo Guardrails¶

NVIDIA NeMo is an agent-oriented, open-source suite for optimising and controlling AI agents. The NeMo Guardrails Library (→ PyPI, → GitHub) is a Python package for creating programmable guardrails for LLM-based applications. It integrates with Llama 3.1 NemoGuard 8B Content Safety, Llama-Guard and many others. It is also designed to detect jailbreak attempts, verify tool integration and log agent actions.

Cloud or AI providers¶

Many AI model providers also make their own guardrails available. Whilst you cannot influence how these work, you should still use them when you operate within those providers' environments. Some providers also offer separate guardrail models and deterministic software tests that you can use.

Amazon Bedrock Guardrails¶

Amazon Bedrock Guardrails offers several guardrail models and software-based tests. For example, the Content Filters provide two guardrail models: one trained on categories of toxic prompt styles in multimodal inputs, and another trained on potential prompt injection or jailbreak attacks.

Anthropic Claude Code¶

Claude Code does not provide any explicit guardrails. Instead, users are expected to formulate their own streaming refusals.

Warning

Prompt exfiltration is relatively straightforward; see Effective Prompt Extraction from Language Models.

Azure¶

Guardrails can be set up via Azure AI Foundry.

OpenAI¶

OpenAI Guardrails provides custom guardrail models as well as guidance on how to configure these models.

OpenGuardrails¶

OpenGuardrails has published two open models and a training dataset on Hugging Face. In addition, OpenGuardrails offers open-source software that can be used to implement guardrails for both standard LLM/AI workflows and agent-based tasks.

See also

OpenGuardrails: A Configurable, Unified, and Scalable Guardrails Platform for Large Language Models

Guardrails won’t save you¶

Guardrails are useful, but they are also prone to failure. To ensure comprehensive data security, multiple approaches are needed to address and control risks. You should not assume that guardrails will prevent unsafe behaviour; recent examples show the opposite:

Son Luong (@sluongng): Codex just found a “workaround” of not having sudo on my pc… — Source: https://x.com/sluongng/status/2060746160558543217¶