HTML Sanitizer API

Draft Community Group Report

This version:
https://wicg.github.io/sanitizer-api/
Issue Tracking:
GitHub
Editors:
Frederik Braun (Mozilla)
Mario Heiderich (Cure53)
Daniel Vogelheim (Google LLC)

Abstract

This document specifies a set of APIs which allow developers to take untrusted HTML input and sanitize it for safe insertion into a document’s DOM.

Status of this document

This specification was published by the Web Platform Incubator Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.

1. Introduction

This section is not normative.

Web applications often need to work with strings of HTML on the client side, for example as part of a client-side templating solution or when rendering user-generated content. It is difficult to do so in a safe way. The naive approach of joining strings together and stuffing them into an Element's innerHTML is fraught with risk, as it can cause JavaScript execution in a number of unexpected ways.

Libraries like [DOMPURIFY] attempt to manage this problem by carefully parsing and sanitizing strings before insertion, by constructing a DOM and filtering its members through an allow-list. This has proven to be a fragile approach, as the parsing APIs exposed to the web don’t always map in reasonable ways to the browser’s behavior when actually rendering a string as HTML in the "real" DOM. Moreover, the libraries need to keep on top of browsers' changing behavior over time; things that once were safe may turn into time-bombs based on new platform-level features.

The browser has a fairly good idea of when it is going to execute code. We can improve upon the user-space libraries by teaching the browser how to render HTML from an arbitrary string in a safe manner, and do so in a way that is much more likely to be maintained and updated along with the browser’s own changing parser implementation. This document outlines an API which aims to do just that.

1.1. Goals

1.2. API Summary

The Sanitizer API offers functionality to parse a string containing HTML into a DOM tree, and to filter the resulting tree according to a user-supplied configuration. The methods come in two by two flavours:

2. Framework

2.1. Sanitizer API

The Element interface defines two methods, setHTML() and setHTMLUnsafe__TO_BE_MERGED(). Both of these take a DOMString with HTML markup, and an optional configuration.

partial interface Element {
  [CEReactions] undefined setHTMLUnsafe__TO_BE_MERGED(DOMString html, optional SanitizerConfig config = {});
  [CEReactions] undefined setHTML(DOMString html, optional SanitizerConfig config = {});
};

Note: The setHTMLUnsafe method is meant to be merged with [HTML]'s setHTMLUnsafe. To prevent tooling errors about overloaded methods we’ll rename them here by appending __TO_BE_MERGED to those names.

Element's setHTMLUnsafe__TO_BE_MERGED(html, options) method steps are:
  1. Let target be this’s template contents if this is a template element; otherwise this.

  2. Set and filter HTML given target, this, html, options, and false.

Element's setHTML(html, options) method steps are:
  1. Let target be this’s template contents if this is a template; otherwise this.

  2. Set and filter HTML given target, this, html, options, and true.
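A minimal usage sketch, assuming a browser that implements this draft. The `SetHTMLCapable` type and the `renderUntrusted` helper are hypothetical names introduced for illustration; the feature test and the plain-text fallback are assumptions of the sketch, not behavior defined by this specification.

```typescript
// Hypothetical structural type: just enough of Element for this sketch.
interface SetHTMLCapable {
  setHTML?: (html: string, config?: object) => void;
  textContent: string | null;
}

// Insert untrusted markup with setHTML() when it is available. Otherwise
// fall back to treating the input as plain text (an assumption of this
// sketch, chosen because it never executes script).
function renderUntrusted(target: SetHTMLCapable, html: string): "sanitized" | "text" {
  if (typeof target.setHTML === "function") {
    target.setHTML(html);
    return "sanitized";
  }
  target.textContent = html;
  return "text";
}
```

In a page, `target` would be an Element such as the result of `document.querySelector(...)`; the return value only reports which path was taken.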

These methods are mirrored on the ShadowRoot:

partial interface ShadowRoot {
  [CEReactions] undefined setHTMLUnsafe__TO_BE_MERGED(DOMString html, optional SanitizerConfig config = {});
  [CEReactions] undefined setHTML(DOMString html, optional SanitizerConfig config = {});
};

ShadowRoot's setHTMLUnsafe__TO_BE_MERGED(html, options) method steps are:
  1. Set and filter HTML using this (as target), this’s shadow host (as context element), html, options, and false.

ShadowRoot's setHTML(html, options) method steps are:
  1. Set and filter HTML using this (as target), this’s shadow host (as context element), html, options, and true.

The Document interface gains two new methods which parse an entire Document:

partial interface Document {
  static Document parseHTMLUnsafe__TO_BE_MERGED(DOMString html, optional SanitizerConfig config = {});
  static Document parseHTML(DOMString html, optional SanitizerConfig config = {});
};

The parseHTMLUnsafe__TO_BE_MERGED(html, options) method steps are:
  1. Let document be a new Document, whose content type is "text/html".

    Note: Since document does not have a browsing context, scripting is disabled.

  2. Set document’s allow declarative shadow roots to true.

  3. Parse HTML from a string given document and html.

  4. Let config be the result of calling canonicalize a configuration on options["sanitizer"] and false.

  5. If config is not empty, then call sanitize on document’s root node with config.

  6. Return document.

The parseHTML(html, options) method steps are:
  1. Let document be a new Document, whose content type is "text/html".

    Note: Since document does not have a browsing context, scripting is disabled.

  2. Set document’s allow declarative shadow roots to true.

  3. Parse HTML from a string given document and html.

  4. Let config be the result of calling canonicalize a configuration on options["sanitizer"] and true.

  5. Call sanitize on document’s root node with config.

  6. Return document.

2.2. The Configuration Dictionary

dictionary SanitizerElementNamespace {
  required DOMString name;
  DOMString? _namespace = "http://www.w3.org/1999/xhtml";
};

// Used by "elements"
dictionary SanitizerElementNamespaceWithAttributes : SanitizerElementNamespace {
  sequence<SanitizerAttribute> attributes;
  sequence<SanitizerAttribute> removeAttributes;
};

typedef (DOMString or SanitizerElementNamespace) SanitizerElement;
typedef (DOMString or SanitizerElementNamespaceWithAttributes) SanitizerElementWithAttributes;

dictionary SanitizerAttributeNamespace {
  required DOMString name;
  DOMString? _namespace = null;
};
typedef (DOMString or SanitizerAttributeNamespace) SanitizerAttribute;

dictionary SanitizerConfig {
  sequence<SanitizerElementWithAttributes> elements;
  sequence<SanitizerElement> removeElements;
  sequence<SanitizerElement> replaceWithChildrenElements;

  sequence<SanitizerAttribute> attributes;
  sequence<SanitizerAttribute> removeAttributes;

  boolean comments;
  boolean dataAttributes;
};
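As an illustration, the dictionaries above can be mirrored as TypeScript types and used to write a configuration. This is a sketch only: the type declarations are a hand translation of the IDL, and `commentConfig` with its element and attribute choices is a hypothetical example, not a recommended policy.

```typescript
// Hand-written TypeScript mirror of the IDL above. The leading underscore
// in `_namespace` is Web IDL identifier escaping; the key is `namespace`.
interface SanitizerElementNamespace { name: string; namespace?: string | null; }
interface SanitizerAttributeNamespace { name: string; namespace?: string | null; }
type SanitizerAttribute = string | SanitizerAttributeNamespace;

interface SanitizerElementNamespaceWithAttributes extends SanitizerElementNamespace {
  attributes?: SanitizerAttribute[];
  removeAttributes?: SanitizerAttribute[];
}
type SanitizerElement = string | SanitizerElementNamespace;
type SanitizerElementWithAttributes = string | SanitizerElementNamespaceWithAttributes;

interface SanitizerConfig {
  elements?: SanitizerElementWithAttributes[];
  removeElements?: SanitizerElement[];
  replaceWithChildrenElements?: SanitizerElement[];
  attributes?: SanitizerAttribute[];
  removeAttributes?: SanitizerAttribute[];
  comments?: boolean;
  dataAttributes?: boolean;
}

// Hypothetical allow-list configuration for simple formatted text.
// Strings are the shortcut syntax; the dictionary form adds per-element
// attribute lists.
const commentConfig: SanitizerConfig = {
  elements: ["p", "em", "strong", { name: "a", attributes: ["href"] }],
  replaceWithChildrenElements: ["span"],
  attributes: ["href", "title"],
  comments: false,
};
```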

3. Algorithms

To set and filter HTML, given an Element or DocumentFragment target, an Element contextElement, a string html, a dictionary options, and a boolean safe:
  1. If safe and contextElement’s local name is "script" and contextElement’s namespace is the HTML namespace or the SVG namespace, then return.

  2. Let config be the result of calling canonicalize a configuration on options["sanitizer"] and safe.

  3. Let newChildren be the result of the HTML fragment parsing algorithm given contextElement, html, and true.

  4. Let fragment be a new DocumentFragment whose node document is contextElement’s node document.

  5. For each node in newChildren, append node to fragment.

  6. If config is not empty, then run sanitize on fragment using config.

  7. Replace all with fragment within target.

3.1. Sanitization Algorithms

For the main sanitize operation, using a ParentNode node, a canonical SanitizerConfig config, run these steps:
  1. Assert: config is canonical.

  2. Let current be node.

  3. For each child in current’s children:

    1. Assert: child implements Text, Comment, or Element.

      Note: Currently, this algorithm is only called on output of the HTML parser, for which this assertion should hold. If in the future this algorithm is used in different contexts, this assumption needs to be re-examined.

    2. If child implements Text:

      1. continue.

    3. else if child implements Comment:

      1. If config’s comments is not true:

        1. remove child.

    4. else:

      1. Let elementName be a SanitizerElementNamespace with child’s local name and namespace.

      2. If config["elements"] exists and config["elements"] does not contain elementName:

        1. remove child.

      3. else if config["removeElements"] exists and config["removeElements"] contains elementName:

        1. remove child.

      4. If config["replaceWithChildrenElements"] exists and config["replaceWithChildrenElements"] contains elementName:

        1. Call sanitize on child with config.

        2. Call replace all with child’s children within child.

      5. If elementName equals «[ "name" → "template", "namespace" → HTML namespace ]»:

        1. Then call sanitize on child’s template contents with config.

      6. If child is a shadow host:

        1. Then call sanitize on child’s shadow root with config.

      7. For each attr in child’s attribute list:

        1. Let attrName be a SanitizerAttributeNamespace with attr’s local name and namespace.

        2. If config["attributes"] exists and config["attributes"] does not contain attrName:

          1. If "data-" is a code unit prefix of attr’s local name and if attr’s namespace is null and if config["dataAttributes"] exists and is false:

            1. Remove attr from child.

        3. else if config["removeAttributes"] exists and config["removeAttributes"] contains attrName:

          1. Remove attr from child.

        4. If config["elements"][elementName] exists, and if config["elements"][elementName]["attributes"] exists, and if config["elements"][elementName]["attributes"] does not contain attrName:

          1. Remove attr from child.

        5. If config["elements"][elementName] exists, and if config["elements"][elementName]["removeAttributes"] exists, and if config["elements"][elementName]["removeAttributes"] contains attrName:

          1. Remove attr from child.

        6. If «[elementName, attrName]» matches an entry in the navigating URL attributes list, and if attr’s protocol is "javascript:":

          1. Then remove attr from child.

        7. Call sanitize on child’s shadow root with config.

      8. else:

        1. remove child.
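The overall shape of the traversal can be sketched on a toy node model. This is a simplified reading under stated assumptions, not the normative algorithm: `ToyNode`, `ToyConfig`, and `sanitizeChildren` are hypothetical names, names are plain strings without namespaces, only global allow-lists are modeled, the replace-with-children check is applied before the allow-list check, and the function returns a new list instead of mutating the tree. Templates, shadow roots, per-element attribute lists, and "javascript:" URL handling are left out.

```typescript
// Toy stand-ins for Text, Comment, and Element nodes.
type ToyNode =
  | { kind: "text"; data: string }
  | { kind: "comment"; data: string }
  | { kind: "element"; name: string; attrs: { [key: string]: string }; children: ToyNode[] };

interface ToyConfig {
  elements: string[];                    // allowed element names
  replaceWithChildrenElements: string[]; // elements unwrapped to their children
  attributes: string[];                  // allowed attribute names
  comments: boolean;                     // whether comments are kept
}

function sanitizeChildren(children: ToyNode[], config: ToyConfig): ToyNode[] {
  const out: ToyNode[] = [];
  for (const child of children) {
    if (child.kind === "text") {
      out.push(child); // text always passes through
      continue;
    }
    if (child.kind === "comment") {
      if (config.comments) out.push(child);
      continue;
    }
    // Sanitize the subtree first so unwrapped children are already clean.
    const kids = sanitizeChildren(child.children, config);
    if (config.replaceWithChildrenElements.indexOf(child.name) !== -1) {
      out.push(...kids); // drop the element, keep its sanitized children
      continue;
    }
    if (config.elements.indexOf(child.name) === -1) {
      continue; // not allowed: remove the element and its subtree
    }
    const attrs: { [key: string]: string } = {};
    for (const key of Object.keys(child.attrs)) {
      if (config.attributes.indexOf(key) !== -1) attrs[key] = child.attrs[key];
    }
    out.push({ kind: "element", name: child.name, attrs, children: kids });
  }
  return out;
}
```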

3.2. Configuration Processing

A config is valid if all these conditions are met:
  1. config is a dictionary

  2. config’s key set does not contain both "elements" and "removeElements"

  3. config’s key set does not contain both "removeAttributes" and "attributes".

  4. For any key of «[ "elements", "removeElements", "replaceWithChildrenElements", "attributes", "removeAttributes" ]» where config[key] exists:

    1. config[key] is valid.

  5. If config["elements"] exists, then for any element in config["elements"] that is a dictionary:

    1. element does not contain both "attributes" and "removeAttributes".

    2. If either element["attributes"] or element["removeAttributes"] exists, then it is valid.

    3. Let tmp be a dictionary, and for any key «[ "elements", "removeElements", "replaceWithChildrenElements", "attributes", "removeAttributes" ]» tmp[key] is set to the result of canonicalize a sanitizer element list called on config[key], and HTML namespace as default namespace for the element lists, and null as default namespace for the attributes lists.

      Note: The intent here is to make assertions about list elements, regardless of whether the string shortcut syntax or the explicit dictionary syntax is used. For example, "img" in elements and { name: "img" } in removeElements refer to the same element. An implementation might well do this without explicitly canonicalizing the lists at this point.

      1. Given these canonicalized name lists, all of the following conditions hold:

        1. The intersection between tmp["elements"] and tmp["removeElements"] is empty.

        2. The intersection between tmp["removeElements"] and tmp["replaceWithChildrenElements"] is empty.

        3. The intersection between tmp["replaceWithChildrenElements"] and tmp["elements"] is empty.

        4. The intersection between tmp["attributes"] and tmp["removeAttributes"] is empty.

      2. Let tmpattrs be tmp["attributes"] if it exists, and otherwise built-in default config["attributes"].

      3. For any item in tmp["elements"]:

        1. If either item["attributes"] or item["removeAttributes"] exists:

          1. Then the difference between it and tmpattrs is empty.

A list of names is valid if all these conditions are met:
  1. list is a list.

  2. For all of its members name:

    1. name is a string or a dictionary.

    2. If name is a dictionary:

      1. name["name"] exists and is a string.
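A few of the validity conditions above can be sketched as follows. This covers only a subset (the mutually exclusive keys and the element-list intersections) and assumes the HTML namespace as the default for element names; `Name`, `ConfigLike`, `nameKey`, `overlaps`, and `violatesChecks` are hypothetical names introduced for this sketch.

```typescript
// Names may use the string shortcut or the explicit dictionary syntax.
type Name = string | { name: string; namespace?: string | null };

interface ConfigLike {
  elements?: Name[];
  removeElements?: Name[];
  replaceWithChildrenElements?: Name[];
  attributes?: Name[];
  removeAttributes?: Name[];
}

const HTML_NS = "http://www.w3.org/1999/xhtml";

// Reduce either syntax to one comparable key (namespace plus name).
function nameKey(n: Name, defaultNamespace: string | null): string {
  if (typeof n === "string") return JSON.stringify([defaultNamespace, n]);
  const ns = n.namespace === undefined ? defaultNamespace : n.namespace;
  return JSON.stringify([ns, n.name]);
}

// True if the two lists share at least one name after normalization.
function overlaps(a: Name[], b: Name[], defaultNamespace: string | null): boolean {
  const keys = a.map((n) => nameKey(n, defaultNamespace));
  return b.some((n) => keys.indexOf(nameKey(n, defaultNamespace)) !== -1);
}

// True if the config breaks one of the conditions modeled here.
function violatesChecks(config: ConfigLike): boolean {
  if (config.elements !== undefined && config.removeElements !== undefined) return true;
  if (config.attributes !== undefined && config.removeAttributes !== undefined) return true;
  const allowed = config.elements || [];
  const removed = config.removeElements || [];
  const unwrapped = config.replaceWithChildrenElements || [];
  return overlaps(removed, unwrapped, HTML_NS) || overlaps(unwrapped, allowed, HTML_NS);
}
```

Comparing normalized keys is what lets `"img"` and `{ name: "img" }` be recognized as the same entry.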

A config is canonical if all these conditions are met:
  1. config is valid.

  2. config’s key set is a subset of «[ "elements", "removeElements", "replaceWithChildrenElements", "attributes", "removeAttributes", "comments", "dataAttributes" ]»

  3. config’s key set contains either:

    1. both "elements" and "attributes", but neither of "removeElements" or "removeAttributes".

    2. or both "removeElements" and "removeAttributes", but neither of "elements" or "attributes".

  4. For any key of «[ "replaceWithChildrenElements", "removeElements", "attributes", "removeAttributes" ]» where config[key] exists:

    1. config[key] is canonical.

  5. If config["elements"] exists:

    1. config["elements"] is canonical.

  6. For any key of «[ "comments", "dataAttributes" ]»:

    1. if config[key] exists, config[key] is a boolean.

A list of names is canonical if all these conditions are met:
  1. list is a list.

  2. For all of list’s members name:

    1. name is a dictionary.

    2. name’s key set equals «[ "name", "namespace" ]»

    3. name’s values are strings.

A list of elements is canonical if all these conditions are met:
  1. list is a list.

  2. For all of list’s members name:

    1. name is a dictionary.

    2. name’s key set equals one of:

      1. «[ "name", "namespace" ]»

      2. «[ "name", "namespace", "attributes" ]»

      3. «[ "name", "namespace", "removeAttributes" ]»

    3. name["name"] and name["namespace"] are strings.

    4. name["attributes"] and name["removeAttributes"] are canonical if they exist.

To canonicalize a configuration config with a boolean safe:

Note: The initial asserts check properties of the built-in constants, like the defaults and the lists of known elements and attributes.

  1. Assert: built-in default config is canonical.

  2. Assert: built-in default config["elements"] is a subset of known elements.

  3. Assert: built-in default config["attributes"] is a subset of known attributes.

  4. Assert: «[ "elements" → known elements, "attributes" → known attributes, ]» is canonical.

  5. If config is empty and not safe, then return «[]».

  6. If config is not valid, then throw a TypeError.

  7. Let result be a new dictionary.

  8. For each key of «[ "elements", "removeElements", "replaceWithChildrenElements" ]»:

    1. If config[key] exists, set result[key] to the result of running canonicalize a sanitizer element list on config[key] with HTML namespace as the default namespace.

  9. For each key of «[ "attributes", "removeAttributes" ]»:

    1. If config[key] exists, set result[key] to the result of running canonicalize a sanitizer element list on config[key] with null as the default namespace.

  10. Set result["comments"] to config["comments"].

  11. Let default be the result of canonicalizing a configuration for the built-in default config.

  12. If safe:

    1. If config["elements"] exists:

      1. Let elementBlockList be the difference between known elements and default["elements"].

        Note: The "natural" way to enforce the default element list would be to intersect with it. But that would also eliminate any unknown element (i.e., one not defined by HTML, like <foo>). So we construct this helper to be able to use it to subtract any "unsafe" elements.

      2. Set result["elements"] to the difference of result["elements"] and elementBlockList.

    2. If config["removeElements"] exists:

      1. Set result["elements"] to the difference of default["elements"] and result["removeElements"].

      2. Remove "removeElements" from result.

    3. If neither config["elements"] nor config["removeElements"] exist:

      1. Set result["elements"] to default["elements"].

    4. If config["attributes"] exists:

      1. Let attributeBlockList be the difference between known attributes and default["attributes"].

      2. Set result["attributes"] to the difference of result["attributes"] and attributeBlockList.

    5. If config["removeAttributes"] exists:

      1. Set result["attributes"] to the difference of default["attributes"] and result["removeAttributes"].

      2. Remove "removeAttributes" from result.

    6. If neither config["attributes"] nor config["removeAttributes"] exist:

      1. Set result["attributes"] to default["attributes"].

  13. Else (if not safe):

    1. If neither config["elements"] nor config["removeElements"] exist:

      1. Set result["elements"] to default["elements"].

    2. If neither config["attributes"] nor config["removeAttributes"] exist:

      1. Set result["attributes"] to default["attributes"].

  14. Assert: result is valid.

  15. Assert: result is canonical.

  16. Return result.
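The element handling in the safe branch (step 12) can be sketched as below. The sketch assumes names are plain strings in the HTML namespace, and `safeElements` and its parameters are hypothetical names. The lists used in the usage note are made-up stand-ins, since the actual built-in lists are not spelled out in this document.

```typescript
// Members of a that do not occur in b, in a's order.
function difference(a: string[], b: string[]): string[] {
  return a.filter((item) => b.indexOf(item) === -1);
}

// Compute the effective element allow-list for safe mode.
function safeElements(
  known: string[],
  defaults: string[],
  requested: { elements?: string[]; removeElements?: string[] },
): string[] {
  if (requested.elements !== undefined) {
    // Block known-but-not-default elements; unknown elements survive.
    const elementBlockList = difference(known, defaults);
    return difference(requested.elements, elementBlockList);
  }
  if (requested.removeElements !== undefined) {
    // A remove-list is rewritten as an allow-list over the defaults.
    return difference(defaults, requested.removeElements);
  }
  return defaults.slice();
}
```

With the stand-in lists `known = ["div", "p", "script"]` and `defaults = ["div", "p"]`, a request for `["p", "script", "foo"]` yields `["p", "foo"]`: the known but non-default `script` is subtracted, while the unknown `foo` is kept, which is the case the note in step 12 describes.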

In order to canonicalize a sanitizer element list list, with a default namespace defaultNamespace, run the following steps:
  1. Let result be a new ordered set.

  2. For each name in list, call canonicalize a sanitizer name on name with defaultNamespace and append to result.

  3. Return result.

In order to canonicalize a sanitizer name name, with a default namespace defaultNamespace, run the following steps:
  1. Assert: name is either a DOMString or a dictionary.

  2. If name is a DOMString, then return «[ "name" → name, "namespace" → defaultNamespace]».

  3. Assert: name is a dictionary and name["name"] exists.

  4. Return «[
    "name" → name["name"],
    "namespace" → name["namespace"] if it exists, otherwise defaultNamespace
    ]».
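The two canonicalization steps above can be sketched as follows. `CanonicalName`, `NameInput`, `canonicalizeName`, and `canonicalizeList` are hypothetical names for this sketch, which follows the steps as written and therefore does not carry over per-element "attributes" or "removeAttributes" entries.

```typescript
interface CanonicalName { name: string; namespace: string | null; }
type NameInput = string | { name: string; namespace?: string | null };

const HTML_NAMESPACE = "http://www.w3.org/1999/xhtml";

// String shortcut: apply the default namespace. Dictionary: keep its
// namespace if present, otherwise fall back to the default.
function canonicalizeName(name: NameInput, defaultNamespace: string | null): CanonicalName {
  if (typeof name === "string") {
    return { name: name, namespace: defaultNamespace };
  }
  const ns = name.namespace !== undefined ? name.namespace : defaultNamespace;
  return { name: name.name, namespace: ns };
}

// Canonicalize every entry. The result models an ordered set, so an
// entry equal to an earlier one is not appended again.
function canonicalizeList(list: NameInput[], defaultNamespace: string | null): CanonicalName[] {
  const result: CanonicalName[] = [];
  for (const item of list) {
    const c = canonicalizeName(item, defaultNamespace);
    const present = result.some((e) => e.name === c.name && e.namespace === c.namespace);
    if (!present) result.push(c);
  }
  return result;
}
```

Element lists would pass `HTML_NAMESPACE` as the default, and attribute lists would pass `null`.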

3.3. Supporting Algorithms

For the canonicalized element and attribute name lists used in this spec, list membership is based on matching both "name" and "namespace" entries: A Sanitizer name list contains an item if there exists an entry of list that is an ordered map, and where item["name"] equals entry["name"] and item["namespace"] equals entry["namespace"].

Set difference (or set subtraction) is a clone of a set A, but with all members removed that occur in a set B: To compute the difference of two ordered sets A and B:
  1. Let set be a new ordered set.

  2. For each item of A:

    1. If B does not contain item, then append item to set.

  3. Return set.

Equality for ordered sets is equality of its members, but without regard to order: Ordered sets A and B are equal if both A is a superset of B and B is a superset of A.
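The three supporting operations can be sketched on arrays of canonical names. The function names (`contains`, `difference`, `isSuperset`, `setsEqual`) and the `CanonicalName` type are hypothetical and chosen for this illustration.

```typescript
interface CanonicalName { name: string; namespace: string | null; }

// Membership compares both "name" and "namespace".
function contains(list: CanonicalName[], item: CanonicalName): boolean {
  return list.some((entry) => entry.name === item.name && entry.namespace === item.namespace);
}

// Members of a that do not occur in b, preserving a's order.
function difference(a: CanonicalName[], b: CanonicalName[]): CanonicalName[] {
  const set: CanonicalName[] = [];
  for (const item of a) {
    if (!contains(b, item)) set.push(item);
  }
  return set;
}

// a is a superset of b if every member of b occurs in a.
function isSuperset(a: CanonicalName[], b: CanonicalName[]): boolean {
  return b.every((item) => contains(a, item));
}

// Equality ignores order: each set must be a superset of the other.
function setsEqual(a: CanonicalName[], b: CanonicalName[]): boolean {
  return isSuperset(a, b) && isSuperset(b, a);
}
```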

3.4. Defaults

Note: The defaults should follow a certain form, which is checked for at the beginning of canonicalize a configuration.

The built-in default config is as follows:

{
  elements: [....],
  attributes: [....],
  comments: true,
}

The known elements are as follows:

[
  { name: "div", namespace: "http://www.w3.org/1999/xhtml" },
  ...
]

The known attributes are as follows:

[
  { name: "class", namespace: null },
  ...
]

Note: The known elements and known attributes should be derived from the HTML5 specification, rather than being explicitly listed here. Currently, there are no mechanics to do so.

The navigating URL attributes list, for which "javascript:" navigations are unsafe, is as follows:

«[
[ { "name" → "a", "namespace" → "HTML namespace" }, { "name" → "href", "namespace" → null } ],
[ { "name" → "area", "namespace" → "HTML namespace" }, { "name" → "href", "namespace" → null } ],
[ { "name" → "form", "namespace" → "HTML namespace" }, { "name" → "action", "namespace" → null } ],
[ { "name" → "input", "namespace" → "HTML namespace" }, { "name" → "formaction", "namespace" → null } ],
[ { "name" → "button", "namespace" → "HTML namespace" }, { "name" → "formaction", "namespace" → null } ],
]»
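A check against this list might look like the sketch below. It assumes the element is in the HTML namespace and the attribute has a null namespace, so only local names are compared. `hasJavascriptScheme` is a hypothetical helper that approximates the URL parser's input clean-up (stripping leading C0 control or space characters, removing ASCII tab and newline, and matching the scheme case-insensitively) rather than invoking a real URL parser.

```typescript
// [element local name, attribute local name] pairs from the list above.
const NAVIGATING_URL_ATTRIBUTES: Array<[string, string]> = [
  ["a", "href"],
  ["area", "href"],
  ["form", "action"],
  ["input", "formaction"],
  ["button", "formaction"],
];

// Approximation of "the URL's scheme is javascript" (an assumption of
// this sketch; a real implementation would use its URL parser).
function hasJavascriptScheme(value: string): boolean {
  const cleaned = value
    .replace(/^[\u0000-\u0020]+/, "") // leading C0 control or space
    .replace(/[\t\n\r]/g, "");        // ASCII tab or newline anywhere
  return /^javascript:/i.test(cleaned);
}

// True if the attribute should be removed as a "javascript:" navigation.
function isUnsafeNavigation(elementName: string, attrName: string, value: string): boolean {
  const listed = NAVIGATING_URL_ATTRIBUTES.some(
    (pair) => pair[0] === elementName && pair[1] === attrName,
  );
  return listed && hasJavascriptScheme(value);
}
```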

4. Security Considerations

The Sanitizer API is intended to prevent DOM-based Cross-Site Scripting by traversing supplied HTML content and removing elements and attributes according to a configuration. The specified API must not support the construction of a Sanitizer object that leaves script-capable markup in place; if it does, that is a bug in the threat model.

That said, there are security issues that even correct usage of the Sanitizer API cannot protect against. These scenarios are laid out in the following sections.

4.1. Server-Side Reflected and Stored XSS

This section is not normative.

The Sanitizer API operates solely in the DOM and adds a capability to traverse and filter an existing DocumentFragment. The Sanitizer does not address server-side reflected or stored XSS.

4.2. DOM clobbering

This section is not normative.

DOM clobbering describes an attack in which malicious HTML confuses an application by naming elements through id or name attributes such that properties like children of an HTML element in the DOM are overshadowed by the malicious content.

The Sanitizer API does not protect against DOM clobbering attacks in its default state, but can be configured to remove id and name attributes.
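A configuration along these lines might look like the sketch below. The constant name `antiClobberingConfig` is hypothetical, and how the object is passed to setHTML() follows the IDL in section 2.1 of this draft.

```typescript
// Remove the attributes that DOM clobbering relies on. Everything else is
// left to the built-in defaults, because no allow-lists are given.
const antiClobberingConfig: { removeAttributes: string[] } = {
  removeAttributes: ["id", "name"],
};
```

Removing `id` and `name` can also break legitimate uses such as fragment links and form field names, so authors would need to weigh that trade-off for their content.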

4.3. XSS with Script gadgets

This section is not normative.

Script gadgets are a technique in which an attacker uses existing application code from popular JavaScript libraries to cause their own code to execute. This is often done by injecting innocent-looking code or seemingly inert DOM nodes that are only parsed and interpreted by a framework, which then executes JavaScript based on that input.

The Sanitizer API cannot prevent these attacks. It does, however, require page authors to explicitly allow unknown elements in general. Authors must additionally explicitly configure unknown attributes and elements, as well as markup that is known to be widely used for templating and framework-specific code, such as data- and slot attributes and elements like <slot> and <template>. We believe that these restrictions are not exhaustive and encourage page authors to examine their third-party libraries for this behavior.

4.4. Mutated XSS

This section is not normative.

Mutated XSS or mXSS describes an attack based on parser context mismatches when parsing an HTML snippet without the correct context. In particular, when a parsed HTML fragment has been serialized to a string, the string is not guaranteed to be parsed and interpreted exactly the same when inserted into a different parent element. An example for carrying out such an attack is by relying on the change of parsing behavior for foreign content or misnested tags.

The Sanitizer API offers help against Mutated XSS, but relies on some amount of cooperation by the developers. The sanitize() function does not handle strings and is therefore unaffected. The setHTML function combines sanitization with DOM modification and can implicitly apply the correct context. The sanitizeFor() function combines parsing and sanitization, and relies on the developer to supply the correct context for the eventual application of its result.

If the data to be sanitized is available as a node tree, we encourage authors to use the sanitize() function of the API which returns a DocumentFragment and avoids risks that come with serialization and additional parsing. Directly operating on a fragment after sanitization also comes with a performance benefit, as the cost of additional serialization and parsing is avoided.

A more complete treatment of mXSS can be found in [MXSS].

5. Acknowledgements

Cure53’s [DOMPURIFY] is a clear inspiration for the API this document describes, as is Internet Explorer’s window.toStaticHTML().

References

Normative References

[DOM]
Anne van Kesteren. DOM Standard. Living Standard. URL: https://dom.spec.whatwg.org/
[HTML]
Anne van Kesteren; et al. HTML Standard. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[INFRA]
Anne van Kesteren; Domenic Denicola. Infra Standard. Living Standard. URL: https://infra.spec.whatwg.org/
[SPEECH-API]
Web Speech API. cg-draft. URL: https://wicg.github.io/speech-api/
[URLPATTERN]
Ben Kelly; Jeremy Roman; 宍戸俊哉 (Shunya Shishido). URL Pattern Standard. Living Standard. URL: https://urlpattern.spec.whatwg.org/
[WEBIDL]
Edgar Chen; Timothy Gu. Web IDL Standard. Living Standard. URL: https://webidl.spec.whatwg.org/

Informative References

[DOMPURIFY]
DOMPurify. URL: https://github.com/cure53/DOMPurify
[MXSS]
mXSS Attacks: Attacking well-secured Web-Applications by using innerHTML Mutations. URL: https://cure53.de/fp170.pdf

IDL Index

partial interface Element {
  [CEReactions] undefined setHTMLUnsafe__TO_BE_MERGED(DOMString html, optional SanitizerConfig config = {});
  [CEReactions] undefined setHTML(DOMString html, optional SanitizerConfig config = {});
};

partial interface ShadowRoot {
  [CEReactions] undefined setHTMLUnsafe__TO_BE_MERGED(DOMString html, optional SanitizerConfig config = {});
  [CEReactions] undefined setHTML(DOMString html, optional SanitizerConfig config = {});
};

partial interface Document {
  static Document parseHTMLUnsafe__TO_BE_MERGED(DOMString html, optional SanitizerConfig config = {});
  static Document parseHTML(DOMString html, optional SanitizerConfig config = {});
};

dictionary SanitizerElementNamespace {
  required DOMString name;
  DOMString? _namespace = "http://www.w3.org/1999/xhtml";
};

// Used by "elements"
dictionary SanitizerElementNamespaceWithAttributes : SanitizerElementNamespace {
  sequence<SanitizerAttribute> attributes;
  sequence<SanitizerAttribute> removeAttributes;
};

typedef (DOMString or SanitizerElementNamespace) SanitizerElement;
typedef (DOMString or SanitizerElementNamespaceWithAttributes) SanitizerElementWithAttributes;

dictionary SanitizerAttributeNamespace {
  required DOMString name;
  DOMString? _namespace = null;
};
typedef (DOMString or SanitizerAttributeNamespace) SanitizerAttribute;

dictionary SanitizerConfig {
  sequence<SanitizerElementWithAttributes> elements;
  sequence<SanitizerElement> removeElements;
  sequence<SanitizerElement> replaceWithChildrenElements;

  sequence<SanitizerAttribute> attributes;
  sequence<SanitizerAttribute> removeAttributes;

  boolean comments;
  boolean dataAttributes;
};