XSS Prevention

This document details the Cross-Site Scripting (XSS) prevention implementation, including HTML sanitization, field validators, and max length enforcement standards.

Overview

XSS prevention is not “sanitize every string”. A professional baseline is:

Validate inputs by type/format/length (reject invalid values).
Sanitize only where you intentionally accept HTML (e.g., rich-text fields).
Never mutate credentials (e.g., passwords). Prevent XSS by not reflecting/logging secrets and by escaping output where appropriate.

Minimum requirements:

Max length validation on every user-controlled string field.
Allowlist validation for identifiers and other constrained strings (reject on mismatch).
HTML sanitization only for fields that are designed to store/render HTML.
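
Escaping on output, the third part of the baseline, can be sketched with the standard library's html.escape. The render_comment helper and its markup are hypothetical and exist only for illustration; in a real application a template engine configured for autoescaping would normally do this step.

```python
import html

def render_comment(author: str, body: str) -> str:
    """Build an HTML fragment from plain-text fields (hypothetical helper)."""
    # Escape at output time; the stored values stay exactly as the user typed them.
    safe_author = html.escape(author, quote=True)
    safe_body = html.escape(body, quote=True)
    return f'<div class="comment"><b>{safe_author}</b>: {safe_body}</div>'

print(render_comment("Ann", "<script>alert(1)</script>"))
# <div class="comment"><b>Ann</b>: &lt;script&gt;alert(1)&lt;/script&gt;</div>
```

Plain-text fields handled this way need validation (type, format, length) on input but no sanitization, because the markup characters are neutralized when the value is rendered.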

Core Implementation

`sanitize_html()` Function

Standard Location: utils/security.py

Purpose: Remove dangerous HTML/JavaScript while preserving safe content

Features:

Removes dangerous tags: <script>, <iframe>, <object>, <embed>
Removes event handlers: onclick, onerror, onload, etc.
Removes javascript: protocol
Preserves safe HTML tags for rich text
Alphanumeric IDs pass through unchanged

Standard Implementation:

import bleach

def sanitize_html(value: str) -> str:
    """
    Sanitize HTML content while preserving safe formatting.
    Uses bleach library.
    """
    if not value:
        return value
    
    # Allow safe HTML tags
    allowed_tags = ['p', 'br', 'strong', 'em', 'u', 'a', 'ul', 'ol', 'li', 'h1', 'h2', 'h3']
    allowed_attributes = {'a': ['href', 'title']}
    allowed_protocols = ['http', 'https', 'mailto']
    
    # Sanitize using bleach
    clean_value = bleach.clean(
        value,
        tags=allowed_tags,
        attributes=allowed_attributes,
        protocols=allowed_protocols,
        strip=True
    )
    
    return clean_value

What Gets Sanitized

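Removed:

<script> tags (with strip=True the tag itself is dropped, but any text inside it is kept as inert plain text)
<iframe>, <object>, and <embed> embeddings
Event handler attributes (onclick, onerror, etc.)
javascript: and data: URLs in href (only http, https, and mailto are allowed)
Any other tag or attribute that is not on the allowlist

Plain text such as "eval(...)" is not removed: outside a script context it is just text and cannot execute.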

Preserved:

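Regular text (bare &, <, and > characters are HTML-escaped)
Safe HTML formatting (<p>, <strong>, <em>)
Links with sanitized URLs

Passwords are out of scope here: never pass them through sanitize_html(), because escaping characters such as & or < would change the credential (see the password section below).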

Schema Implementation Patterns

Basic Pattern

from pydantic import BaseModel, Field, field_validator
from utils.security import sanitize_html

class EntityCreate(BaseModel):
    # Plain text: length-limited here and escaped on output, not sanitized.
    name: str = Field(..., max_length=100)
    # Rich text: the only field passed through sanitize_html().
    description: str = Field(..., max_length=500)

    # Only sanitize fields that are intended to accept HTML/rich text.
    _sanitize = field_validator('description')(
        lambda cls, v: sanitize_html(v)
    )

Identifier validation (IDs): allowlist + reject (do not sanitize)

Identifiers are not “rich text”. Treat them as constrained inputs:

Allowlist characters
Reject invalid values (422)
Keep max lengths small and consistent

Recommended ID rule (example):

length: 1–100
chars: alphanumeric plus _ and -

import re

from pydantic import BaseModel, Field, field_validator

ID_RE = re.compile(r"^[a-zA-Z0-9_-]{1,100}$")

def validate_id(value: str) -> str:
    if not value:
        raise ValueError("ID is required")
    if not ID_RE.fullmatch(value):
        raise ValueError("Invalid ID format")
    return value

class EntityGet(BaseModel):
    entity_id: str = Field(..., max_length=100)

    _validate_ids = field_validator("entity_id")(lambda cls, v: validate_id(v))
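
To show what the rule accepts and rejects, here is a small standalone sketch that uses only the standard library. It repeats the validator so that it runs on its own; is_valid_id and the sample values are hypothetical and used only for the demonstration.

```python
import re

# Same rule as the recommended example: 1-100 characters from [a-zA-Z0-9_-].
ID_RE = re.compile(r"[a-zA-Z0-9_-]{1,100}")

def validate_id(value: str) -> str:
    # fullmatch requires the whole string to match, so a trailing newline is rejected too.
    if not value or not ID_RE.fullmatch(value):
        raise ValueError("Invalid ID format")
    return value

def is_valid_id(value: str) -> bool:
    """Hypothetical convenience wrapper used only for this demonstration."""
    try:
        validate_id(value)
    except ValueError:
        return False
    return True

samples = ["user_42", "org-7", "", "a" * 101, "<script>", "user 42", "user_42\n"]
for sample in samples:
    # Only the first two samples are accepted.
    print(repr(sample[:20]), is_valid_id(sample))
```

Because invalid values are rejected rather than cleaned, an ID that reaches application code is already known to contain no markup characters.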

List Fields

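@field_validator('tags', mode='before')
@classmethod
def sanitize_tags(cls, v):
    # mode='before' runs ahead of type validation, so leave non-lists for Pydantic to reject.
    if not isinstance(v, list):
        return v
    # Reject bad items instead of silently dropping them or truncating sanitized output,
    # which could cut an HTML entity such as "&amp;" in half.
    if any(not isinstance(item, str) or len(item) > 50 for item in v):
        raise ValueError("each tag must be a string of at most 50 characters")
    return [sanitize_html(item) for item in v]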

Nested Structures

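@field_validator('metadata', mode='before')
@classmethod
def sanitize_metadata(cls, v):
    # mode='before' runs ahead of type validation, so leave non-dicts for Pydantic to reject.
    if not isinstance(v, dict):
        return v
    cleaned = {}
    for k, val in v.items():
        if isinstance(val, str):
            # Reject over-long values rather than truncating sanitized HTML mid-entity.
            if len(val) > 5000:
                raise ValueError("metadata values must be at most 5000 characters")
            val = sanitize_html(val)
        cleaned[k] = val
    return cleaned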

Max Length Validation

Every string field has a max length to prevent payload abuse:

Field Type	Max Length (typical range)	Example
ID Fields	100 chars	`user_id`, `org_id`
Passwords	128 chars (minimum 8)	User passwords
Names/Titles	100-300 chars	Entity names
Descriptions	500-1000 chars	Short descriptions
Long Text	2000-5000 chars	Messages, content
Rich Text	10000 chars	HTML content
URLs	500-1000 chars	Web addresses

Password fields: never sanitize (do not mutate credentials)

Passwords (and other secrets) must not be sanitized or transformed.

Validate length (and optionally basic character constraints if you have a requirement).
Never include passwords in error messages.
Never log passwords.
Hash and store using a strong password hashing scheme (e.g., bcrypt via passlib).
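
A minimal sketch of these rules, using only the standard library, might look like the following. The function name and the constants are hypothetical; the 8 and 128 limits come from the length table, and hashing is left to whichever password hashing library the project uses.

```python
MIN_PASSWORD_LENGTH = 8
MAX_PASSWORD_LENGTH = 128

def validate_password(password: str) -> str:
    """Check the length only and return the password unchanged."""
    if not (MIN_PASSWORD_LENGTH <= len(password) <= MAX_PASSWORD_LENGTH):
        # Generic message: the submitted value is never echoed back.
        raise ValueError(
            f"Password must be {MIN_PASSWORD_LENGTH}-{MAX_PASSWORD_LENGTH} characters long"
        )
    # No stripping, escaping, or normalization: the credential is passed on as typed.
    return password

original = " p@ss<w>&rd! "
assert validate_password(original) == original
```

The returned value is byte-for-byte what the user submitted, so the hash computed at registration and the hash computed at login are derived from the same string.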

Implementation Checklist

When adding new input schemas:

Add max_length to all user-controlled string fields
For ID fields: add allowlist validation (reject invalid values; do not sanitize)
For HTML/rich-text fields only: add field_validator with sanitize_html()
Sanitize list items individually
Sanitize nested dict values
Test with XSS payloads
Test max length enforcement
Verify normal content preserved
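
The "Test with XSS payloads" and "Verify normal content preserved" items can be sketched as a small regression check. It assumes the bleach package is installed and repeats the standard sanitize_html() so that it runs on its own; the payload list, the FORBIDDEN markers, and check_payloads are hypothetical and far from exhaustive.

```python
import bleach

def sanitize_html(value: str) -> str:
    # Copy of the standard implementation, repeated so the sketch is self-contained.
    if not value:
        return value
    return bleach.clean(
        value,
        tags=['p', 'br', 'strong', 'em', 'u', 'a', 'ul', 'ol', 'li', 'h1', 'h2', 'h3'],
        attributes={'a': ['href', 'title']},
        protocols=['http', 'https', 'mailto'],
        strip=True,
    )

XSS_PAYLOADS = [
    '<script>alert(1)</script>',
    '<img src=x onerror=alert(1)>',
    '<a href="javascript:alert(1)">click</a>',
    '<p onclick="steal()">hi</p>',
    '<iframe src="https://evil.example"></iframe>',
]

# Substrings that must never survive sanitization.
FORBIDDEN = ('<script', '<iframe', '<img', 'onerror', 'onclick', 'javascript:')

def check_payloads() -> list[str]:
    """Return the payloads whose sanitized form still contains a forbidden marker."""
    failures = []
    for payload in XSS_PAYLOADS:
        cleaned = sanitize_html(payload).lower()
        if any(marker in cleaned for marker in FORBIDDEN):
            failures.append(payload)
    return failures

assert check_payloads() == []
# Normal rich text should come back unchanged.
assert sanitize_html('<p>Hello <strong>world</strong></p>') == '<p>Hello <strong>world</strong></p>'
```

The check looks for surviving markup rather than exact output, since with strip=True the text inside a removed tag (for example alert(1)) is expected to remain as inert plain text.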

Example: Complete Schema

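from typing import List, Optional
from urllib.parse import urlsplit

from pydantic import BaseModel, Field, field_validator
from utils.security import sanitize_html

class ProductCreate(BaseModel):
    """Schema for creating a product with XSS prevention."""

    # Required fields with validation
    name: str = Field(..., max_length=100)          # plain text: escape on output
    description: str = Field(..., max_length=1000)  # rich text: sanitized below
    price: float = Field(..., gt=0)

    # Optional fields
    category: Optional[str] = Field(None, max_length=50)
    tags: Optional[List[str]] = Field(default_factory=list)
    image_url: Optional[str] = Field(None, max_length=500)

    # Sanitize only the field that is designed to hold HTML
    _sanitize_description = field_validator('description')(
        lambda cls, v: sanitize_html(v)
    )

    # Validate URLs instead of sanitizing them: HTML-escaping would turn "&" in a
    # query string into "&amp;" and corrupt the stored address.
    @field_validator('image_url')
    @classmethod
    def validate_image_url(cls, v):
        if v is not None and urlsplit(v).scheme not in ('http', 'https'):
            raise ValueError("image_url must be an http or https URL")
        return v

    # Sanitize list items individually; reject bad items instead of truncating or dropping them
    @field_validator('tags', mode='before')
    @classmethod
    def sanitize_tags(cls, v):
        if not isinstance(v, list):
            return v
        if any(not isinstance(item, str) or len(item) > 50 for item in v):
            raise ValueError("each tag must be a string of at most 50 characters")
        return [sanitize_html(item) for item in v]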