- htmLawed should work with PHP 4.4 and higher. Either
file or copy-paste the entire code.
+ htmLawed works in PHP version 4.4 or higher. Either
file, or copy-paste the entire code. To use with PHP 4.3, have the following code included:
should be in the same directory on the web-server).
: For code for usage of the htmLawed class (for htmLawed in OOP), please refer to this
on the htmLawed website; the filtering itself can be configured, etc., as described here.
2.4 Performance time & memory usage
(to top)
- The time and memory used by htmLawed depends on its configuration and the size of the input, and the amount, nestedness and well-formedness of the HTML markup within it. In particular, tag balancing and beautification each can increase the processing time by about a quarter.
+ The time and memory consumed by htmLawed during text processing depend on its configuration, the size of the input, and the amount, nesting depth and well-formedness of the HTML markup within the input. In particular, tag balancing and beautification can each increase the processing time by about a quarter.
The htmLawed
demo can be used to evaluate the performance and effects of different types of input and
$config.
@@ -582,15 +608,13 @@ A PHP Labware internal utility -
2.5 Some security risks to keep in mind
(to top)
- When setting the parameters/arguments (like those to allow certain HTML elements) for use with htmLawed, one should bear in mind that the setting may let through potentially
dangerous HTML code which is meant to steal user-data, deface a website, render a page non-functional, etc.
-
- Unless end-users, either people or software, supplying the content are completely trusted, security issues arising from the degree of HTML usage permission has to be kept in mind. For example, following increase security risks:
+ When setting the parameters/arguments (like those to allow certain HTML elements) for use with htmLawed, one should bear in mind that the setting may let through potentially
dangerous HTML code which is meant to steal user-data, deface a website, render a page non-functional, etc. Unless the end-users supplying the content, whether people or software, are completely trusted, security issues arising from the degree of HTML usage permitted through htmLawed's settings should be considered. For example, the following increase security risks:
* Allowing
script,
applet,
embed,
iframe or
object elements, or certain of their attributes like
allowscriptaccess
* Allowing HTML comments (some Internet Explorer versions are vulnerable with, e.g.,
<!--[if gte IE 4]><script>alert("xss");</script><![endif]-->
- * Allowing dynamic CSS expressions (a feature of the IE browser)
+ * Allowing dynamic CSS expressions (some Internet Explorer versions are vulnerable)
* Allowing the
style attribute
@@ -598,7 +622,7 @@ A PHP Labware internal utility -
*style* attribute brings in risks of click-jacking, phishing, web-page overlays, etc., even when the safe parameter is enabled (see section 3.6). Except for URLs and a few other things like CSS dynamic expressions, htmLawed currently does not check every CSS style property. It does provide ways for the code-developer implementing htmLawed to do such checks through htmLawed's
$spec argument, and through the
hook_tag parameter (see
section 3.4.8 for more). Disallowing
style completely and relying on CSS classes and stylesheet files is recommended.
- htmLawed does not check or correct the character
encoding of the input it receives. In conjunction with permitting circumstances such as when the character encoding is left undefined through HTTP headers or HTML
meta tags, this can permit an exploit (like Google's UTF-7/XSS vulnerability of the past).
+ htmLawed does not check or correct the character
encoding of the input it receives. In conjunction with permissive circumstances, such as when the character encoding is left undefined through HTTP headers or HTML
meta tags, this can allow for an exploit (like Google's
UTF-7/XSS vulnerability of the past).
2.8 Limitations & work-arounds
(to top)
- htmLawed's main objective is to make the input text
more standard-compliant, secure for web-page readers, and free of HTML elements and attributes considered undesirable by the administrator. Some of its current limitations, regardless of this objective, are noted below along with work-arounds.
+ htmLawed's main objective is to make the input text
more standard-compliant, secure for readers, and free of HTML elements and attributes considered undesirable by the administrator. Some of its current limitations, regardless of this objective, are noted below along with work-arounds.
- It should be borne in mind that no browser application is 100% standard-compliant, and that some of the standard specs (like asking for normalization of white-spacing within
textarea elements) are clearly wrong. Regarding security, note that
unsafe HTML code is not necessarily legally invalid.
+ It should be borne in mind that no browser application is 100% standard-compliant, and that some of the standard specifications (like asking for normalization of white-spacing within
textarea elements) are clearly wrong. Regarding security, note that
unsafe HTML code is not legally invalid per se.
- * htmLawed is meant for input that goes into the
body of HTML documents. HTML's head-level elements are not supported, nor are the frameset elements
frameset,
frame and
noframes.
+ * htmLawed is meant for input that goes into the
body of HTML documents. HTML's head-level elements are not supported, nor are the frameset elements
frameset,
frame and
noframes. Content of the latter elements can, however, be individually filtered through htmLawed.
* It cannot transform the non-standard
embed elements to the standard-compliant
object elements. Yet, it can allow
embed elements if permitted (
embed is widely used and supported). Admins can certainly use the
hook_tag parameter (
section 3.4.9) to deploy a custom embed-to-object converter function.
@@ -721,7 +745,7 @@ A PHP Labware internal utility -
width="20m" with the dimension in non-standard m is let through. Implementing universal and strict attribute value checks can make htmLawed slow and resource-intensive. Admins should look at the hook_tag parameter (section 3.4.9) or
$spec to enforce finer checks.
- * The attributes, deprecated (which can be transformed too) or not, that it supports are largely those that are in the specs. Only a few of the proprietary attributes are supported.
+ * The attributes, deprecated (which can be transformed too) or not, that it supports are largely those that are in the specifications. Only a few of the proprietary attributes are supported.
* Except for contained URLs and dynamic expressions (also optional), htmLawed does not check CSS style property values. Admins should look at using the
hook_tag parameter (
section 3.4.9) or
$spec for finer checks. Perhaps the best option is to disallow
style but allow
class attributes with the right
oneof or
match values for
class, and have the various class style properties in
.css CSS stylesheet files.
@@ -733,11 +757,11 @@ A PHP Labware internal utility -
http to https. Having absolute URLs may be a standard-requirement, e.g., when HTML is embedded in email messages, whereas altering URLs for other purposes is beyond htmLawed's goals. Admins may be able to use a custom hook function to enforce such checks (hook_tag parameter; see section 3.4.9).
- * Pairs of opening and closing tags that do not enclose any content (like
<em></em>) are not removed. This may be against the standard specs for certain elements (e.g.,
table). However, presence of such standard-incompliant code will not break the display or layout of content. Admins can also use simple regex-based code to filter out such code.
+ * Pairs of opening and closing tags that do not enclose any content (like
<em></em>) are not removed. This may be against the standard specifications for certain elements (e.g.,
table). However, presence of such standard-incompliant code will not break the display or layout of content. Admins can also use simple regex-based code to filter out such code.
- * htmLawed does not check for certain element orderings described in the standard specs (e.g., in a
table,
tbody is allowed before
tfoot). Admins may be able to use a custom hook function to enforce such checks (
hook_tag parameter; see
section 3.4.9).
+ * htmLawed does not check for certain element orderings described in the standard specifications (e.g., in a
table,
tbody is allowed before
tfoot). Admins may be able to use a custom hook function to enforce such checks (
hook_tag parameter; see
section 3.4.9).
- * htmLawed does not check the number of nested elements. E.g., it will allow two
caption elements in a
table element, illegal as per the specs. Admins may be able to use a custom hook function to enforce such checks (
hook_tag parameter; see
section 3.4.9).
+ * htmLawed does not check the number of nested elements. E.g., it will allow two
caption elements in a
table element, illegal as per the specifications. Admins may be able to use a custom hook function to enforce such checks (
hook_tag parameter; see
section 3.4.9).
* htmLawed might convert certain entities to actual characters and remove backslashes and CSS comment-markers (
/*) in
style attribute values in order to detect malicious HTML like crafted IE-specific dynamic expressions like
expression.... If this is too harsh, admins can allow CSS expressions through htmLawed core but then use a custom function through the
hook_tag parameter (
section 3.4.9) to more specifically identify CSS expressions in the
style attribute values. Also, using
$config["style_pass"], it is possible to have htmLawed pass
style attribute values without even looking at them (
section 3.4.8).
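The regex-based removal of empty tag pairs mentioned in the list above could look something like the following. This is only a minimal sketch: remove_empty_pairs and its default element list are hypothetical and not part of htmLawed, and the pattern assumes that attribute values do not contain a literal > character.

```php
<?php
// Hypothetical helper, not part of htmLawed: removes tag pairs that enclose
// nothing but white-space, for a caller-chosen list of element names.
function remove_empty_pairs($html, $elements = array('em', 'strong', 'span', 'b', 'i', 'u'))
{
    $names = implode('|', array_map('preg_quote', $elements));
    // Opening tag (with optional attributes), optional white-space, matching closing tag
    $pattern = '`<(' . $names . ')\b[^>]*>\s*</\1\s*>`i';
    // Repeat, so that pairs emptied by an earlier pass are removed as well
    do {
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null) {
            return $html; // PCRE error; return the last good text
        }
        $html = $result;
    } while ($count > 0);
    return $html;
}

echo remove_empty_pairs('<p>Hi <em></em><strong><em> </em></strong>there</p>');
```

Such a function would typically be run on the output of htmLawed, and the element list should leave out elements like td for which empty pairs are meaningful.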
@@ -745,7 +769,9 @@ A PHP Labware internal utility -
section 3.1).
- * htmLawed does not check or correct the character encoding of the input it receives. In conjunction with permitting circumstances such as when the character encoding is left undefined through HTTP headers or HTML
meta tags, this can permit an exploit (like Google's UTF-7/XSS vulnerability of the past).
+ * htmLawed does not check or correct the character encoding of the input it receives. In conjunction with permissive circumstances, such as when the character encoding is left undefined through HTTP headers or HTML
meta tags, this can permit an exploit (like Google's
UTF-7/XSS vulnerability of the past). Also, htmLawed can mangle input text if it is not well-formed in terms of character encoding. Administrators can consider using code available elsewhere to check well-formedness of input text characters to correct any defect.
+
+ * htmLawed is expected to work with input texts in ASCII-compatible single-byte encodings such as national variants of ASCII (like ISO-646-DE/German of the ISO 646 standard), extended ASCII variants (like ISO 8859-9/Turkish of the ISO 8859/ISO Latin standard), ISO 8859-based Windows variants (like Windows 1252), EBCDIC, Shift JIS (Japanese), GB-Roman (Chinese), and KS-Roman (Korean). It should also properly handle texts with variable-byte encodings like UTF-7 (Unicode) and UTF-8 (Unicode). However, htmLawed may mangle input texts with double-byte encodings like UTF-16 (Unicode), JIS X 0208:1997 (Japanese) and KS X 1001:1992 (Korean), or the quadruple-byte UTF-32 (Unicode) encoding. If an input text has such an encoding, administrators can use PHP's
iconv functions, or some other means, to convert the text to UTF-8 before passing it to htmLawed.
* Like any script using PHP's PCRE regex functions, PHP setup-specific low PCRE limit values can cause htmLawed to at least partially fail with very long input texts.
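For the encoding limitation noted in the list above, a conversion to UTF-8 before filtering might be sketched as follows. The helper name to_utf8 is hypothetical, and the source encoding is assumed to be known to the administrator (e.g., from an HTTP header); the htmLawed call is guarded so that the snippet also runs where htmLawed.php has not been included.

```php
<?php
// Hypothetical helper: converts text from a known source encoding to UTF-8.
// Returns null if the text is not valid in the stated source encoding.
function to_utf8($text, $from)
{
    // iconv() raises a notice and returns false on invalid input; @ hides the notice
    $converted = @iconv($from, 'UTF-8', $text);
    return ($converted === false) ? null : $converted;
}

// "<b>hi</b>" encoded as UTF-16LE (two bytes per character)
$in = "<\x00b\x00>\x00h\x00i\x00<\x00/\x00b\x00>\x00";
$utf8 = to_utf8($in, 'UTF-16LE');

if ($utf8 !== null && function_exists('htmLawed')) {
    $out = htmLawed($utf8); // filter only after the conversion has succeeded
}
```

Returning null instead of a partially converted string leaves it to the calling code to decide whether to reject the input.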
@@ -832,12 +858,21 @@ A PHP Labware internal utility -
$spec = 'a=title';
+ $out = htmLawed($in, $config, $spec);
+
+
+ Allowing a custom attribute, vFlag, in img and permitting custom use of the standard attribute, rel, in input --
+
+
+ $spec = 'img=vFlag; input=rel';
+
+
$out = htmLawed($in, $config, $spec);
Some case-studies are presented below.
- 1. A blog administrator wants to allow only a, em, strike, strong and u in comments, but needs strike and u transformed to span for better XHTML 1-strict compliance, and, he wants the a links to be to http or https resources:
+ 1. A blog administrator wants to allow only a, em, strike, strong and u in comments, but needs strike and u transformed to span for better XHTML 1-strict compliance, and wants the a links to point only to http or https resources:
$processed = htmLawed($in, array('elements'=>'a, em, strike, strong, u', 'make_tag_strict'=>1, 'safe'=>1, 'schemes'=>'*:http, https'), 'a=href');
@@ -1689,7 +1724,7 @@ A PHP Labware internal utility -
3.9 Retaining non-HTML tags in input with mixed markup
(to top)
- htmLawed does not remove certain characters that though invalid are nevertheless discouraged in HTML documents as per the specs (see
section 5.1). This can be utilized to deal with input that contains mixed markup. Input that may have HTML markup as well as some other markup that is based on the
<,
> and
& characters is considered to have mixed markup. The non-HTML markup can be rather proprietary (like markup for emoticons/smileys), or standard (like MathML or SVG). Or it can be programming code meant for execution/evaluation (such as embedded PHP code).
+ htmLawed does not remove certain characters that, though valid, are nevertheless
discouraged in HTML documents as per the specifications (see
section 5.1). This can be utilized to deal with input that contains mixed markup. Input that may have HTML markup as well as some other markup that is based on the
<,
> and
& characters is considered to have mixed markup. The non-HTML markup can be rather proprietary (like markup for emoticons/smileys), or standard (like MathML or SVG). Or it can be programming code meant for execution/evaluation (such as embedded PHP code).
To deal with such mixed markup, the input text can be pre-processed to hide the non-HTML markup by specifically replacing the
<,
> and
& characters with some of the HTML-discouraged characters (see
section 3.1.2). Post-htmLawed processing, the replacements are reverted.
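The hide-and-revert approach described above might be sketched as below for embedded PHP code. The function names are hypothetical, and the bytes \x80, \x81 and \x82 are only assumed to be among the discouraged characters of section 3.1.2 and to be absent from the input (which may not hold for UTF-8 text with non-ASCII characters); the htmLawed call is guarded so that the snippet runs without htmLawed.php.

```php
<?php
// Callback: replaces <, > and & within one matched PHP block with placeholders
function hide_cb($m)
{
    return strtr($m[0], array('<' => "\x80", '>' => "\x81", '&' => "\x82"));
}

// Hypothetical helper: hides embedded PHP blocks from the HTML filter
function hide_php_blocks($text)
{
    return preg_replace_callback('`<\?php.*?\?>`s', 'hide_cb', $text);
}

// Hypothetical helper: reverts the placeholders after filtering
function reveal_hidden($text)
{
    return strtr($text, array("\x80" => '<', "\x81" => '>', "\x82" => '&'));
}

$in = 'a <b>x</b> <?php echo 1 > 0 && true; ?> z';
$hidden = hide_php_blocks($in);
$filtered = function_exists('htmLawed') ? htmLawed($hidden) : $hidden;
$out = reveal_hidden($filtered);
```

Only the non-HTML markup is hidden; the HTML in the rest of the text is still seen, and filtered, by htmLawed.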
@@ -1718,7 +1753,7 @@ A PHP Labware internal utility -
4.1 Support
(to top)
- A careful re-reading of this documentation will very likely answer your questions.
+ A careful reading of this documentation may provide an answer.
Software updates and forum-based community-support may be found at
http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed. For general PHP issues (not htmLawed-specific), support may be found through internet searches and at
http://php.net.
@@ -1728,18 +1763,18 @@ A PHP Labware internal utility -
(to top)
See
section 2.8.
-
- Readers are advised to cross-check information given in this document.
-htmLawed 1.1.13, 22 July 2012
+
+htmLawed 1.1.14, 8 August 2012
Copyright Santosh Patnaik
Dual licensed with LGPL 3 and GPL 2+
A PHP Labware internal utility - http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed
- htmLawed is a highly customizable single-file PHP script to make text secure, and standard- and admin policy-compliant for use in the body of HTML 4, XHTML 1 or 1.1, or generic XML documents. It is thus a configurable input (X)HTML filter, processor, purifier, sanitizer, beautifier, etc., and an alternative to the HTMLTidy application.
+ htmLawed is a PHP script to process text with HTML markup to make it more compliant with HTML standards and administrative policies. It works by making HTML well-formed with balanced and properly nested tags, neutralizing code that may be used for cross-site scripting (XSS) attacks, allowing only specified HTML tags and attributes, and so on. Such lawing in of HTML in text used in (X)HTML or XML documents ensures that it is in accordance with the aesthetics, safety and usability requirements set by administrators.
- The lawing in of input text is needed to ensure that HTML code in the text is standard-compliant, does not introduce security vulnerabilities, and does not break the aesthetics, design or layout of web-pages. htmLawed tries to do this by, for example, making HTML well-formed with balanced and properly nested tags, neutralizing code that may be used for cross-site scripting (XSS) attacks, and allowing only specified HTML elements/tags and attributes.
+ htmLawed is highly customizable, and fast with low memory usage. Its free and open-source code is in one small file, does not require extensions or libraries, and works in older versions of PHP as well. It is a good alternative to the HTML Tidy application.
1.1 Example uses @@ -151,8 +151,8 @@ A PHP Labware internal utility - img ^`
(to top)
+ * can restrict elements ^~`
+ * ensures proper closure of empty elements like img ^`
* transform deprecated elements like u ^~`
* HTML comments and CDATA sections can be permitted ^~`
* elements like script, object and form can be permitted ~
@@ -161,7 +161,7 @@ A PHP Labware internal utility - alt for image ^`
- * transform deprecated attributes ^~`
+ * transforms deprecated attributes ^~`
* attributes declared only once ^`
* restrict attribute values, including element-specifically ^~`
@@ -214,52 +214,74 @@ A PHP Labware internal utility -
1.3 History
- htmLawed was developed for use with LabWiki, a wiki software developed at PHP Labware, as a suitable software could not be found. Existing PHP software like Kses and HTMLPurifier were deemed inadequate, slow, resource-intensive, or dependent on external applications like HTML Tidy.
+ htmLawed was created in 2007 for use with LabWiki, wiki software developed at PHP Labware, as no suitable software could be found. Existing PHP software like Kses and HTMLPurifier were deemed inadequate, slow, resource-intensive, or dependent on an extension or external application like HTML Tidy. The core logic of htmLawed, that of identifying HTML elements and attributes, was based on the Kses (version 0.2.2) HTML filter software of Ulf Harnhammar (it can still be used with code that uses Kses; see section 2.6).
- htmLawed started as a modification of Ulf Harnhammar's Kses (version 0.2.2) software, and is compatible with code that uses Kses; see section 2.6.
+ See section 4.3 for a detailed log of changes in htmLawed over the years, and section 4.10 for acknowledgements.
1.4 License & copyright
(to top)
- htmLawed is free and open-source software dual licensed under LGPL license version 3, and GPL license version 2 (or later), and copyrighted by Santosh Patnaik, MD, PhD.
+ htmLawed is free and open-source software copyrighted by Santosh Patnaik, MD, PhD, and dual licensed under LGPL version 3 and GPL version 2 (or later).
1.5 Terms used here
(to top)
- * administrator - or admin; person setting up the code to pass input through htmLawed; also, user
+ In this document, only HTML body-level elements are considered. htmLawed does not have support for head-level elements, body, and the frame-level elements, frameset, frame and noframes, and these elements are ignored here.
+
+ * administrator - or admin; person setting up the code that utilizes htmLawed; also, user
* attributes - name-value pairs like href="http://x.com" in opening tags
- * author - writer
+ * author - see writer
* character - atomic unit of text; internally represented by a numeric code-point as specified by the encoding or charset in use
 * entity - markup like &gt; and &nbsp; used to refer to a character
* element - HTML element like a and img
- * element content - content between the opening and closing tags of an element, like click of <a href="x">click</a>
+ * element content - content between the opening and closing tags of an element, like click of the <a href="x">click</a> element
* HTML - implies XHTML unless specified otherwise
- * input - text string given to htmLawed to process
+ * HTML body - complete HTML documents typically have a head and a body container; information in head specifies the title of the document, etc., whereas that in body specifies what is to be displayed on a web-page; htmLawed is concerned only with the elements for body, except frame, frameset and noframes
+ * input - text given to htmLawed to process
* processing - involves filtering, correction, etc., of input
- * safe - absence or reduction of certain characters and HTML elements and attributes in the input that can otherwise potentially and circumstantially expose web-site users to security vulnerabilities like cross-site scripting attacks (XSS)
- * scheme - URL protocol like http and ftp
- * specs - standard specifications
+ * safe - absence or reduction of certain characters and HTML elements and attributes in HTML of text that can otherwise potentially, and circumstantially, expose text readers to security vulnerabilities like cross-site scripting attacks (XSS)
+ * scheme - a URL protocol like http and ftp
+ * specifications - standard specifications, for HTML4, HTML5, Ruby, etc.
* style property - terms like border and height for which declarations are made in values for the style attribute of elements
* tag - markers like <a href="x"> and </a> delineating element content; the opening tag can contain attributes
* tag content - consists of tag markers < and >, element names like div, and possibly attributes
* user - administrator
* writer - end-user like a blog commenter providing the input that is to be processed; also, author
+
+1.6 Availability
+ (to top)
+
+ htmLawed can be downloaded for free at its website. Besides the htmLawed.php file, the download has the htmLawed documentation (this document) in plain text and HTML formats, a script for testing, and a text file for test-cases. htmLawed is also available as a PHP class (OOP code) on its website.
+