XSS Prevention
This document details the Cross-Site Scripting (XSS) prevention implementation, including HTML sanitization, field validators, and max length enforcement standards.
Overview
XSS prevention is not “sanitize every string”. A professional baseline is:
- Validate inputs by type/format/length (reject invalid values).
- Sanitize only where you intentionally accept HTML (e.g., rich-text fields).
- Never mutate credentials (e.g., passwords). Prevent XSS by not reflecting/logging secrets and by escaping output where appropriate.
Minimum requirements:
- Max length validation on every user-controlled string field.
- Allowlist validation for identifiers and other constrained strings (reject on mismatch).
- HTML sanitization only for fields that are designed to store/render HTML.
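The output-escaping point in the overview can be illustrated with the standard library alone. The following is a minimal sketch, assuming values are interpolated into HTML by hand; `render_greeting` is a hypothetical helper, and a template engine with autoescaping enabled would normally do this step for you.

```python
import html


def render_greeting(display_name: str) -> str:
    """Render a plain-text value into HTML, escaping it at output time.

    `render_greeting` is a hypothetical helper used only for illustration.
    """
    # html.escape converts &, < and > (and quotes when quote=True) into
    # entities, so the stored value stays untouched and the browser
    # receives inert text instead of markup.
    return f"<p>Hello, {html.escape(display_name, quote=True)}!</p>"


if __name__ == "__main__":
    print(render_greeting("<script>alert(1)</script>"))
```

Escaping at render time keeps stored data unmodified, which is why plain-text fields only need validation on input.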
Core Implementation
sanitize_html() Function
Standard Location: utils/security.py
Purpose: Remove dangerous HTML/JavaScript while preserving safe content
Features:
- Removes dangerous tags: <script>, <iframe>, <object>, <embed>
- Removes event handlers: onclick, onerror, onload, etc.
- Removes the javascript: protocol
- Preserves safe HTML tags for rich text
- Alphanumeric IDs pass through unchanged
Standard Implementation:
import bleach


def sanitize_html(value: str) -> str:
    """
    Sanitize HTML content while preserving safe formatting.
    Uses bleach library.
    """
    if not value:
        return value

    # Allow safe HTML tags
    allowed_tags = ['p', 'br', 'strong', 'em', 'u', 'a', 'ul', 'ol', 'li', 'h1', 'h2', 'h3']
    allowed_attributes = {'a': ['href', 'title']}
    allowed_protocols = ['http', 'https', 'mailto']

    # Sanitize using bleach.
    # strip=True removes disallowed tags instead of escaping them;
    # the text inside a removed tag is kept as plain text.
    clean_value = bleach.clean(
        value,
        tags=allowed_tags,
        attributes=allowed_attributes,
        protocols=allowed_protocols,
        strip=True
    )
    return clean_value
What Gets Sanitized
Removed:
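- <script> tags and inline JavaScript execution
- <iframe> embeddings
- Event handlers (onclick, onerror, etc.)
- javascript: and data: protocols
- Executable contexts for eval() and similar functions (the enclosing tags and attributes are removed; any leftover text is inert)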
Preserved:
- Regular text
- Safe HTML formatting (<p>, <strong>, <em>)
- Links with sanitized URLs
- Special characters in passwords (!, @, #) (passwords should not be sanitized; see below)
Schema Implementation Patterns
Basic Pattern
from pydantic import BaseModel, Field, field_validator
from utils.security import sanitize_html


class EntityCreate(BaseModel):
    name: str = Field(..., max_length=100)
    description: str = Field(..., max_length=500)

    # Only sanitize fields that are intended to accept HTML/rich text.
    # Here that is description; name is plain text constrained by max_length.
    @field_validator('description')
    @classmethod
    def sanitize_description(cls, v: str) -> str:
        return sanitize_html(v)
Identifier validation (IDs): allowlist + reject (do not sanitize)
Identifiers are not “rich text”. Treat them as constrained inputs:
- Allowlist characters
- Reject invalid values (422)
- Keep max lengths small and consistent
Recommended ID rule (example):
- length: 1–100
- chars: alphanumeric plus _ and -
import re
from pydantic import BaseModel, Field, field_validator

ID_RE = re.compile(r"^[a-zA-Z0-9_-]{1,100}$")


def validate_id(value: str) -> str:
    if not value:
        raise ValueError("ID is required")
    if not ID_RE.fullmatch(value):
        raise ValueError("Invalid ID format")
    return value


class EntityGet(BaseModel):
    entity_id: str = Field(..., max_length=100)

    _validate_ids = field_validator("entity_id")(lambda cls, v: validate_id(v))
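To show how the allowlist rule behaves on typical inputs, here is a small standalone sketch that repeats `ID_RE` and `validate_id` from above so it runs on its own. The `is_valid_id` wrapper and the sample values are hypothetical and exist only for the demonstration.

```python
import re

# Same allowlist as above: 1-100 characters from [a-zA-Z0-9_-].
ID_RE = re.compile(r"^[a-zA-Z0-9_-]{1,100}$")


def validate_id(value: str) -> str:
    if not value:
        raise ValueError("ID is required")
    if not ID_RE.fullmatch(value):
        raise ValueError("Invalid ID format")
    return value


def is_valid_id(value: str) -> bool:
    """Hypothetical convenience wrapper: True if validate_id accepts value."""
    try:
        validate_id(value)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    samples = ["user_123", "org-42", "<script>", "a b", "x" * 101, "abc\n", ""]
    for candidate in samples:
        # Print a shortened repr so the oversized sample stays readable.
        print(repr(candidate[:20]), is_valid_id(candidate))
```

Using `fullmatch` matters here: with `re.match`, the `$` anchor would also accept a value that ends in a newline, such as `"abc\n"`.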
List Fields
@field_validator('tags', mode='before')
@classmethod
def sanitize_tags(cls, v):
    if v and isinstance(v, list):
        return [sanitize_html(item)[:50] for item in v if isinstance(item, str)]
    return v
Nested Structures
@field_validator('metadata', mode='before')
@classmethod
def sanitize_metadata(cls, v):
    if not v or not isinstance(v, dict):
        return v
    return {
        k: sanitize_html(val)[:5000] if isinstance(val, str) else val
        for k, val in v.items()
    }
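The validator above only visits the top level of the dict. If metadata can contain dicts or lists inside dicts, one possible extension is a recursive walk. This is a sketch under assumptions, not part of the documented implementation: `clean_nested`, its `max_depth` guard, and the default limits are hypothetical, and `html.escape` is used purely as a stand-in cleaner so the example runs without bleach. In the real schema you would pass `sanitize_html` instead.

```python
import html
from typing import Any, Callable


def clean_nested(
    value: Any,
    clean: Callable[[str], str],
    max_len: int = 5000,
    max_depth: int = 5,
) -> Any:
    """Apply `clean` to every string inside nested dicts and lists."""
    if max_depth < 0:
        # Reject pathological nesting instead of recursing without bound.
        raise ValueError("metadata is nested too deeply")
    if isinstance(value, str):
        return clean(value)[:max_len]
    if isinstance(value, dict):
        return {
            k: clean_nested(v, clean, max_len, max_depth - 1)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [clean_nested(v, clean, max_len, max_depth - 1) for v in value]
    # Numbers, booleans, and None pass through unchanged.
    return value


if __name__ == "__main__":
    # html.escape is only a stand-in cleaner for this demonstration.
    print(clean_nested({"note": {"body": ["<b>hi</b>"]}, "count": 3}, html.escape))
```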
Max Length Validation
Every string field has a max length to prevent payload abuse:
| Field Type | Max Length | Example |
|---|---|---|
| ID Fields | 100 chars | user_id, org_id |
| Passwords | 128 chars (minimum 8) | User passwords |
| Names/Titles | 100-300 chars | Entity names |
| Descriptions | 500-1000 chars | Short descriptions |
| Long Text | 2000-5000 chars | Messages, content |
| Rich Text | 10000 chars | HTML content |
| URLs | 500-1000 chars | Web addresses |
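One way to keep these limits consistent across schemas is a single lookup table. The sketch below is an assumption about how that could be organized: `MAX_LENGTHS`, its keys, and `check_max_length` are hypothetical names, and the values are the upper bounds of the ranges in the table.

```python
# Hypothetical central table of upper bounds, taken from the ranges above.
MAX_LENGTHS = {
    "id": 100,
    "password": 128,
    "name": 300,
    "description": 1000,
    "long_text": 5000,
    "rich_text": 10000,
    "url": 1000,
}


def check_max_length(field_type: str, value: str) -> str:
    """Reject (rather than truncate) values over the limit for field_type."""
    limit = MAX_LENGTHS[field_type]
    if len(value) > limit:
        # Report the limit, not the value, so oversized input is not echoed back.
        raise ValueError(f"{field_type} exceeds maximum length of {limit}")
    return value


if __name__ == "__main__":
    print(check_max_length("id", "user_123"))
```

Rejecting instead of truncating keeps the stored value identical to what the client sent, which matches the "reject invalid values" baseline in the overview.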
Password fields: never sanitize (do not mutate credentials)
Passwords (and other secrets) must not be sanitized or transformed.
- Validate length (and optionally basic character constraints if you have a requirement).
- Never include passwords in error messages.
- Never log passwords.
- Hash and store using a strong password hashing scheme (e.g., bcrypt via passlib).
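The first two rules can be sketched as a length-only check that returns the password byte-for-byte and raises an error that never contains the submitted value. The names `validate_password`, `MIN_PASSWORD_LENGTH`, and `MAX_PASSWORD_LENGTH` are hypothetical; the 8 and 128 bounds come from the table above. Hashing is left out here because it depends on the hashing library in use.

```python
MIN_PASSWORD_LENGTH = 8
MAX_PASSWORD_LENGTH = 128


def validate_password(password: str) -> str:
    """Check length only and return the password exactly as received."""
    if not (MIN_PASSWORD_LENGTH <= len(password) <= MAX_PASSWORD_LENGTH):
        # The message states the rule and never includes the submitted value.
        raise ValueError(
            f"Password must be {MIN_PASSWORD_LENGTH}-{MAX_PASSWORD_LENGTH} characters long"
        )
    return password


if __name__ == "__main__":
    # Characters that an HTML sanitizer would alter are preserved untouched.
    print(validate_password('p@ss<w>&"rd!#') == 'p@ss<w>&"rd!#')
```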
Implementation Checklist
When adding new input schemas:
- Add max_length to all user-controlled string fields
- For ID fields: add allowlist validation (reject invalid values; do not sanitize)
- For HTML/rich-text fields only: add field_validator with sanitize_html()
- Sanitize list items individually
- Sanitize nested dict values
- Test with XSS payloads
- Test max length enforcement
- Verify normal content preserved
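For the "Test with XSS payloads" item, a test needs some way to decide whether sanitized output still contains active markup. The sketch below is one possible heuristic built on the standard library's `html.parser`. `ActiveMarkupFinder` and `find_active_markup` are hypothetical names, and the check is a test aid under these assumptions, not a sanitizer and not an exhaustive detector.

```python
from html.parser import HTMLParser
from typing import List

DANGEROUS_TAGS = {"script", "iframe", "object", "embed"}


class ActiveMarkupFinder(HTMLParser):
    """Collects markup that should never survive sanitization."""

    def __init__(self) -> None:
        super().__init__()
        self.findings: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in DANGEROUS_TAGS:
            self.findings.append(f"tag:{tag}")
        for name, value in attrs:
            if name.startswith("on"):
                self.findings.append(f"attr:{name}")
            url = (value or "").strip().lower()
            if name in ("href", "src") and url.startswith("javascript:"):
                self.findings.append(f"url:{name}")


def find_active_markup(html_text: str) -> List[str]:
    """Return a list of findings; an empty list means nothing was flagged."""
    finder = ActiveMarkupFinder()
    finder.feed(html_text)
    finder.close()
    return finder.findings


if __name__ == "__main__":
    for payload in ["<img src=x onerror=alert(1)>", "<p>safe <strong>text</strong></p>"]:
        print(payload, find_active_markup(payload))
```

In a test suite, each payload would be passed through sanitize_html() first, and the test would assert that `find_active_markup` returns an empty list for the result.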
Example: Complete Schema
from typing import List, Optional

from pydantic import BaseModel, Field, field_validator
from utils.security import sanitize_html


class ProductCreate(BaseModel):
    """Schema for creating a product with XSS prevention."""

    # Required fields with validation
    name: str = Field(..., max_length=100)
    description: str = Field(..., max_length=1000)
    price: float = Field(..., gt=0)

    # Optional fields
    category: Optional[str] = Field(None, max_length=50)
    tags: Optional[List[str]] = []
    image_url: Optional[str] = Field(None, max_length=500)

    # Sanitize only the field treated as rich text. name, category, and
    # image_url are plain values bounded by max_length; passing them through
    # sanitize_html() would HTML-escape characters such as "&" and corrupt
    # names and URLs.
    @field_validator('description')
    @classmethod
    def sanitize_description(cls, v: str) -> str:
        return sanitize_html(v)

    # Sanitize list fields
    @field_validator('tags', mode='before')
    @classmethod
    def sanitize_tags(cls, v):
        if v and isinstance(v, list):
            return [sanitize_html(item)[:50] for item in v if isinstance(item, str)]
        return v