1 About htmLawed
htmLawed is a PHP script to process text with HTML markup to make it more compliant with HTML standards and administrative policies. It works by making HTML well-formed with balanced and properly nested tags, neutralizing code that may be used for cross-site scripting (XSS) attacks, allowing only specified HTML tags and attributes, and so on. Such lawing in of HTML in text used in (X)HTML or XML documents ensures that it is in accordance with the aesthetics, safety and usability requirements set by administrators.
htmLawed is highly customizable, fast, and light on memory. Its free and open-source code is in one small file, does not require extensions or libraries, and works in older versions of PHP as well. It is a good alternative to the HTML Tidy application.
1.1 Example uses
* Filtering of text submitted as comments on blogs to allow only certain HTML elements
* Making RSS/Atom newsfeed item-content standard-compliant: often one uses an excerpt from an HTML document for the content, and with unbalanced tags, non-numerical entities, etc., such excerpts may not be XML-compliant
* Text processing for stricter XML standard-compliance: e.g., lowercasing the x in hexadecimal numeric entities becomes necessary if an XHTML document with MathML content needs to be served as application/xml
* Scraping text or data from web-pages
* Pretty-printing HTML code
1.2 Features
Key: * security feature, ^ standard compliance, ~ requires setting right options, ` different from Kses
* make input more secure and standard-compliant
* use for HTML 4, XHTML 1.0 or 1.1, or even generic XML documents ^~`
* beautify or compact HTML ^~`
* can restrict elements ^~`
* ensures proper closure of empty elements like img ^`
* transform deprecated elements like u ^~`
* HTML comments and CDATA sections can be permitted ^~`
* elements like script, object and form can be permitted ~
* restrict attributes, including element-specifically ^~`
* remove invalid attributes ^`
* element and attribute names are lower-cased ^
* provide required attributes, like alt for image ^`
* transforms deprecated attributes ^~`
* attributes declared only once ^`
* restrict attribute values, including element-specifically ^~`
* a value is declared for empty (minimized) attributes like checked ^
* check for potentially dangerous attribute values *~
* ensure unique id attribute values ^~`
* double-quote attribute values ^
* lower-case standard attribute values like password ^`
* permit custom, non-standard attributes as well as custom rules for standard attributes ~`
* attribute-specific URL protocol/scheme restriction *~`
* disable dynamic expressions in style values *~`
* neutralize invalid named character entities ^`
* convert hexadecimal numeric entities to decimal ones, or vice versa ^~`
* convert named entities to numeric ones for generic XML use ^~`
* remove null characters *
* neutralize potentially dangerous proprietary Netscape Javascript entities *
* replace potentially dangerous soft-hyphen character in URL-accepting attribute values with spaces *
* remove common invalid characters not allowed in HTML or XML ^`
* replace characters from Microsoft applications like Word that are discouraged in HTML or XML ^~`
* neutralize entities for characters invalid or discouraged in HTML or XML ^`
* appropriately neutralize <, &, ", and > characters ^*`
* understands improperly spaced tag content (like, spread over more than a line) and properly spaces them `
* attempts to balance tags for well-formedness ^~`
* understands when omittable closing tags like </p> (allowed in HTML 4, transitional, e.g.) are missing ^~`
* attempts to permit only validly nested tags ^~`
* option to remove or neutralize bad content ^~`
* attempts to rectify common errors of plain-text misplacement (e.g., directly inside blockquote) ^~`
* fast, non-OOP code of ~45 kb incurring peak basal memory usage of ~0.5 MB
* compatible with pre-existing code using Kses (the filter used by WordPress)
* optional anti-spam measures such as addition of rel="nofollow" and link-disabling ~`
* optionally makes relative URLs absolute, and vice versa ~`
* optionally mark & to identify the entities for &, < and > introduced by htmLawed ~`
* allows deployment of powerful hook functions to inject HTML, consolidate style attributes to class, finely check attribute values, etc. ~`
* independent of character encoding of input and does not affect it
* tolerance for ill-written HTML to a certain degree
1.3 History
htmLawed was created in 2007 for use with LabWiki, a wiki software developed at PHP Labware, as suitable software could not be found. Existing PHP software like Kses and HTMLPurifier were deemed inadequate, slow, resource-intensive, or dependent on an extension or external application like HTML Tidy. The core logic of htmLawed, that of identifying HTML elements and attributes, was based on the Kses (version 0.2.2) HTML filter software of Ulf Harnhammar (it can still be used with code that uses Kses; see section 2.6).
See section 4.3 for a detailed log of changes in htmLawed over the years, and section 4.10 for acknowledgements.
1.4 License & copyright
htmLawed is free and open-source software copyrighted by Santosh Patnaik, MD, PhD, and dual-licensed under the LGPL version 3 and the GPL version 2 (or later).
1.5 Terms used here
In this document, only HTML body-level elements are considered. htmLawed does not have support for head-level elements, body, and the frame-level elements, frameset, frame and noframes, and these elements are ignored here.
* administrator - or admin; person setting up the code that utilizes htmLawed; also, user
* attributes - name-value pairs like href="http://x.com" in opening tags
* author - see writer
* character - atomic unit of text; internally represented by a numeric code-point as specified by the encoding or charset in use
* entity - markup like &gt; and &#160; used to refer to a character
* element - HTML element like a and img
* element content - content between the opening and closing tags of an element, like click of the <a href="x">click</a> element
* HTML - implies XHTML unless specified otherwise
* HTML body - Complete HTML documents typically have a head and a body container. Information in head specifies title of the document, etc., whereas that in the body informs what is to be displayed on a web-page; it is only the elements for body, except frame, frameset and noframes, that htmLawed is concerned with
* input - text given to htmLawed to process
* processing - involves filtering, correction, etc., of input
* safe - absence or reduction of certain characters and HTML elements and attributes in HTML of text that can otherwise potentially, and circumstantially, expose text readers to security vulnerabilities like cross-site scripting attacks (XSS)
* scheme - a URL protocol like http and ftp
* specifications - standard specifications, for HTML4, HTML5, Ruby, etc.
* style property - terms like border and height for which declarations are made in values for the style attribute of elements
* tag - markers like <a href="x"> and </a> delineating element content; the opening tag can contain attributes
* tag content - consists of tag markers < and >, element names like div, and possibly attributes
* user - administrator
* writer - end-user like a blog commenter providing the input that is to be processed; also, author
1.6 Availability
htmLawed can be downloaded for free at its
website. Besides the
htmLawed.php file, the download has the htmLawed documentation (this document) in plain
text and
HTML formats, a script for
testing, and a text file for
test-cases. htmLawed is also available as a PHP class (OOP code) on its website.
2 Usage
htmLawed works in PHP version 4.4 or higher. Either
include() the
htmLawed.php file, or copy-paste the entire code. To use with PHP 4.3, have the following code included:
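if(!function_exists('ctype_digit')){
  // Fallback: true only for non-empty strings made up entirely of the digits 0-9
  function ctype_digit($var){
    return is_string($var) && preg_match('/^[0-9]+\z/', $var) === 1;
  }
}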
2.1 Simple
The input text to be processed,
$text, is passed as an argument of type string;
htmLawed() returns the processed string:
$processed = htmLawed($text);
With the
htmLawed class (
section 1.6), usage is:
$processed = htmLawed::hl($text);
Notes: (1) If input is from a
$_GET or
$_POST value, and
magic quotes are enabled on the PHP setup, run
stripslashes() on the input before passing to htmLawed. (2) htmLawed does not have support for head-level elements,
body, and the frame-level elements,
frameset,
frame and
noframes.
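Note (1) can be sketched as follows. The helper name hl_prepare_input() and the sample string are hypothetical, and the version check assumes that magic quotes can only be active on PHP older than 5.4. The call to htmLawed() runs only if htmLawed.php has been included.

```php
<?php
// Hypothetical helper (not part of htmLawed): undo magic-quotes escaping.
function hl_prepare_input($raw, $magic_quotes_on)
{
    return $magic_quotes_on ? stripslashes($raw) : $raw;
}

// Magic quotes were removed in PHP 5.4, so only older versions are asked about them.
$magic_quotes_on = version_compare(PHP_VERSION, '5.4.0', '<') && get_magic_quotes_gpc();

// Sample value standing in for a submitted form field.
$raw = isset($_POST['comment']) ? $_POST['comment'] : 'He said "hi"';
$text = hl_prepare_input($raw, $magic_quotes_on);

// Assumes htmLawed.php has been included; skipped otherwise.
if (function_exists('htmLawed')) {
    $processed = htmLawed($text);
}
```

Running stripslashes() on input that was never escaped would eat legitimate backslashes, which is why the sketch makes the check explicit instead of always stripping.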
By default, htmLawed will process the text allowing all valid HTML elements/tags, secure URL scheme/CSS style properties, etc. It will allow
CDATA sections and HTML comments, balance tags, and ensure proper nesting of elements. Such actions can be configured using two other optional arguments --
$config and
$spec:
$processed = htmLawed($text, $config, $spec);
The
$config and
$spec arguments are detailed below. Some examples are shown in
section 2.9. For maximum protection against
XSS and other scripting attacks (e.g., by disallowing Javascript code), consider using the
safe parameter; see
section 3.6.
2.2 Configuring htmLawed using the $config parameter
$config instructs htmLawed on how to tackle certain tasks. When
$config is not specified, or not set as an array (e.g.,
$config = 1), htmLawed will take default actions. One or many of the task-action or value-specification pairs can be specified in
$config as array key-value pairs. If a parameter is not specified, htmLawed will use the default value/action indicated further below.
$config = array('comment'=>0, 'cdata'=>1);
$processed = htmLawed($text, $config);
Or,
$processed = htmLawed($text, array('comment'=>0, 'cdata'=>1));
Below are the possible value-specification combinations. In PHP code, values that are integers should not be quoted and should be used as numeric types (unless meant as string/text).
Key:
* default,
^ different default when htmLawed is used in the Kses-compatible mode (see
section 2.6),
~ different default when
valid_xhtml is set to
1 (see
section 3.5),
" different default when
safe is set to
1 (see
section 3.6)
abs_url
Make URLs absolute or relative;
$config["base_url"] needs to be set; see
section 3.4.4
-1 - make relative
0 - no action *
1 - make absolute
and_mark
Mark
& characters in the original input; see
section 3.2
anti_link_spam
Anti-link-spam measure; see
section 3.4.7
0 - no measure taken *
array("regex1", "regex2") - will ensure a
rel attribute with
nofollow in its value in case the
href attribute value matches the regular expression pattern
regex1, and/or will remove
href if its value matches the regular expression pattern
regex2. E.g.,
array("/./", "`://\W*(?!(abc\.com|xyz\.org))`") (the second pattern uses back-tick delimiters because the pattern itself contains /); see
section 3.4.7 for more.
anti_mail_spam
Anti-mail-spam measure; see
section 3.4.7
0 - no measure taken *
word -
@ in mail address in
href attribute value is replaced with specified
word
balance
Balance tags for well-formedness and proper nesting; see
section 3.3.3
0 - no
1 - yes *
base_url
Base URL value that needs to be set if
$config["abs_url"] is not
0; see
section 3.4.4
cdata
Handling of
CDATA sections; see
section 3.3.1
0 - don't consider
CDATA sections as markup and proceed as if plain text ^"
1 - remove
2 - allow, but neutralize any
<,
>, and
& inside by converting them to named entities
3 - allow *
clean_ms_char
Replace discouraged characters introduced by Microsoft Word, etc.; see
section 3.1
0 - no *
1 - yes
2 - yes, but replace special single & double quotes with ordinary ones
comment
Handling of HTML comments; see
section 3.3.1
0 - don't consider comments as markup and proceed as if plain text ^"
1 - remove
2 - allow, but neutralize any
<,
>, and
& inside by converting to named entities
3 - allow *
css_expression
Allow dynamic CSS expression by not removing the expression from CSS property values in
style attributes; see
section 3.4.8
0 - remove *
1 - allow
deny_attribute
Denied HTML attributes; see
section 3.4
0 - none *
string - dictated by values in
string
on* (like
onfocus) attributes not allowed - "
direct_nest_list
Allow direct nesting of a list within another without requiring it to be a list item; see
section 3.3.4
0 - no *
1 - yes
elements
Allowed HTML elements; see
section 3.3
* -center -dir -font -isindex -menu -s -strike -u - ~
applet, embed, iframe, object, script not allowed - "
hexdec_entity
Allow hexadecimal numeric entities and do not convert to the more widely accepted decimal ones, or convert decimal to hexadecimal ones; see
section 3.2
0 - no
1 - yes *
2 - convert decimal to hexadecimal ones
hook
Name of an optional hook function to alter the input string,
$config or
$spec before htmLawed starts its main work; see
section 3.7
0 - no hook function *
name -
name is name of the hook function (
kses_hook ^)
hook_tag
Name of an optional hook function to alter tag content finalized by htmLawed; see
section 3.4.9
0 - no hook function *
name -
name is name of the hook function
keep_bad
Neutralize bad tags by converting
< and
> to entities, or remove them; see
section 3.3.3
0 - remove ^
1 - neutralize both tags and element content
2 - remove tags but neutralize element content
3 and
4 - like
1 and
2 but remove if text (
pcdata) is invalid in parent element
5 and
6 * - like
3 and
4 but line-breaks, tabs and spaces are left
lc_std_val
For XHTML compliance, predefined, standard attribute values, like
get for the
method attribute of
form, must be lowercased; see
section 3.4.5
0 - no
1 - yes *
make_tag_strict
Transform/remove these non-strict XHTML elements, even if they are allowed by the admin:
applet center dir embed font isindex menu s strike u; see
section 3.3.2
0 - no ^
1 - yes, but leave
applet,
embed and
isindex elements that currently can't be transformed *
2 - yes, removing
applet,
embed and
isindex elements and their contents (nested elements remain) ~
named_entity
Allow non-universal named HTML entities, or convert to numeric ones; see
section 3.2
0 - convert
1 - allow *
no_deprecated_attr
Allow deprecated attributes or transform them; see
section 3.4.6
0 - allow ^
1 - transform, but
name attributes for
a and
map are retained *
2 - transform
parent
Name of the parent element, possibly imagined, that will hold the input; see
section 3.3
safe
Magic parameter to make input the most secure against XSS without needing to specify other relevant
$config parameters; see
section 3.6
0 - no *
1 - will auto-adjust other relevant
$config parameters (indicated by
" in this list)
schemes
Array of attribute-specific, comma-separated, lower-cased list of schemes (protocols) allowed in attributes accepting URLs (or
! to
deny any URL);
* covers all unspecified attributes; see
section 3.4.3
href: aim, feed, file, ftp, gopher, http, https, irc, mailto, news, nntp, sftp, ssh, telnet; *:file, http, https *
*: ftp, gopher, http, https, mailto, news, nntp, telnet ^
href: aim, feed, file, ftp, gopher, http, https, irc, mailto, news, nntp, sftp, ssh, telnet; style: !; *:file, http, https "
show_setting
Name of a PHP variable to assign the
finalized $config and
$spec values; see
section 3.8
style_pass
Do not look at
style attribute values, letting them through without any alteration
0 - no *
1 - htmLawed will let through any
style value; see
section 3.4.8
tidy
Beautify or compact HTML code; see
section 3.3.5
-1 - compact
0 - no *
1 or
string - beautify (custom format specified by
string)
unique_ids
id attribute value checks; see
section 3.4.2
0 - no ^
1 - remove duplicate and/or invalid ones *
word - remove invalid ones and replace duplicate ones with new and unique ones based on the
word; the admin-specified
word, like
my_, should begin with a letter (a-z) and can contain letters, digits,
.,
_,
-, and
:.
valid_xhtml
Magic parameter to make input the most valid XHTML without needing to specify other relevant
$config parameters; see
section 3.5
0 - no *
1 - will auto-adjust other relevant
$config parameters (indicated by
~ in this list)
xml:lang
Auto-adding
xml:lang attribute; see
section 3.4.1
0 - no *
1 - add if
lang attribute is present
2 - add if
lang attribute is present, and remove
lang ~
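As a sketch of how several of these parameters might be combined for, say, a comment form: the values below are illustrative choices and not recommendations made by this document, hl_valid_id_prefix() is a hypothetical helper that checks a unique_ids word against the rule stated above, and the format of the deny_attribute string is an assumption (see section 3.4). The htmLawed() call runs only if htmLawed.php has been included.

```php
<?php
// Hypothetical helper: a unique_ids word must begin with a letter and may
// contain letters, digits, '.', '_', '-' and ':'.
function hl_valid_id_prefix($word)
{
    return is_string($word) && preg_match('/^[a-z][a-z\d._:-]*\z/i', $word) === 1;
}

$prefix = 'my_';

$config = array(
    'safe'           => 1,        // auto-adjust the parameters marked with " in the list
    'comment'        => 1,        // remove HTML comments
    'cdata'          => 1,        // remove CDATA sections
    'deny_attribute' => 'style',  // assumed string format; see section 3.4
    // Back-tick delimiters are used because the second pattern contains '/'.
    'anti_link_spam' => array('`.`', '`://\W*(?!(abc\.com|xyz\.org))`'),
    'unique_ids'     => hl_valid_id_prefix($prefix) ? $prefix : 1,
    'tidy'           => -1,       // compact the output
);

// Assumes htmLawed.php has been included; skipped otherwise.
if (function_exists('htmLawed')) {
    $processed = htmLawed('<p id="x">Hi <a href="http://abc.com">there</a></p>', $config);
}
```

Integer values are left unquoted, as advised above, and unique_ids falls back to 1 if the chosen word does not satisfy the naming rule.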
2.3 Extra HTML specifications using the $spec parameter
The
$spec argument of htmLawed can be used to disallow an otherwise legal attribute for an element, or to restrict the attribute's values. This can also be helpful as a security measure (e.g., in certain versions of browsers, certain values can cause buffer overflows and denial of service attacks), or in enforcing admin policies.
$spec is specified as a string of text containing one or more
rules, with multiple rules separated from each other by a semi-colon (
;). E.g.,
$spec = 'i=-*; td, tr=style, id, -*; a=id(match="/[a-z][a-z\d.:\-`"]*/i"/minval=2), href(maxlen=100/minlen=34); img=-width,-alt';
$processed = htmLawed($text, $config, $spec);
Or,
$processed = htmLawed($text, $config, 'i=-*; td, tr=style, id, -*; a=id(match="/[a-z][a-z\d.:\-`"]*/i"/minval=2), href(maxlen=100/minlen=34); img=-width,-alt');
A rule begins with an HTML
element name(s) (
rule-element), for which the rule applies, followed by an equal (
=) sign. A rule-element may represent multiple elements if comma (,)-separated element names are used. E.g.,
th,td,tr=.
Rest of the rule consists of comma-separated HTML
attribute names. A minus (
-) character before an attribute means that the attribute is not permitted inside the rule-element. E.g.,
-width. To deny all attributes,
-* can be used.
The following are examples of rule excerpts with rule-element
a and the attributes that are being permitted:
*
a= - all
*
a=id - all
*
a=href, title, -id, -onclick - all except
id and
onclick
*
a=*, id, -id - all except
id
*
a=-* - none
*
a=-*, href, title - none except
href and
title
*
a=-*, -id, href, title - none except
href and
title
Rules regarding
attribute values are optionally specified inside round brackets after attribute names in slash ('/')-separated
parameter = value pairs. E.g.,
title(maxlen=30/minlen=5). None or one or more of the following parameters may be specified:
*
oneof - one or more choices separated by
| that the value should match; if only one choice is provided, then the value must match that choice
*
noneof - one or more choices separated by
| that the value should not match
*
maxlen and
minlen - upper and lower limits for the number of characters in the attribute value; specified in numbers
*
maxval and
minval - upper and lower limits for the numerical value specified in the attribute value; specified in numbers
*
match and
nomatch - pattern that the attribute value should or should not match; specified as PHP/PCRE-compatible regular expressions with delimiters and possibly modifiers
*
default - a value to force on the attribute if the value provided by the writer does not fit any of the specified parameters
If
default is not set and the attribute value does not satisfy any of the specified parameters, then the attribute is removed. The
default value can also be used to force all attribute declarations to take the same value (by getting the values declared illegal by setting, e.g.,
maxlen to
-1).
Examples with
input <input title="WIDTH" value="10em" /><input title="length" value="5" /> are shown below.
Rule:
input=title(maxlen=60/minlen=6), value
Output:
<input value="10em" /><input title="length" value="5" />
Rule:
input=title(), value(maxval=8/default=6)
Output:
<input title="WIDTH" value="6" /><input title="length" value="5" />
Rule:
input=title(nomatch=%w.d%i), value(match=%em%/default=6em)
Output:
<input value="10em" /><input title="length" value="6em" />
Rule:
input=title(oneof=height|depth/default=depth), value(noneof=5|6)
Output:
<input title="depth" value="10em" /><input title="depth" />
Special characters: The characters
;,
,,
/,
(,
),
|,
~ and space have special meanings in the rules. Words in the rules that use such characters, or the characters themselves, should be
escaped by enclosing in pairs of double-quotes (
"). A back-tick (
`) can be used to escape a literal
". An example rule illustrating this is
input=value(maxlen=30/match="/^\w/"/default="your `"ID`"").
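The escaping convention can be applied programmatically when rules are assembled from variables. The following is a minimal sketch; hl_spec_quote() is a hypothetical helper, not an htmLawed function, and it does not attempt to handle values that themselves contain a literal back-tick.

```php
<?php
// Hypothetical helper: enclose a $spec word in double-quotes and escape any
// literal double-quote inside it with a back-tick.
function hl_spec_quote($value)
{
    return '"' . str_replace('"', '`"', $value) . '"';
}

$spec = 'input=value(maxlen=30'
      . '/match=' . hl_spec_quote('/^\w/')
      . '/default=' . hl_spec_quote('your "ID"')
      . ')';

echo $spec, "\n";
```

The resulting string is the example rule given above: the regular expression is quoted because it contains /, and the default value is quoted because it contains a space and double-quotes.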
Note: To deny an attribute for all elements for which it is legal,
$config["deny_attribute"] (see
section 3.4) can be used instead of
$spec. Also, attributes can be allowed element-specifically through
$spec while being denied globally through
$config["deny_attribute"]. The
hook_tag parameter (
section 3.4.9) can also be possibly used to implement a functionality like that achieved using
$spec functionality.
$spec can also be used to permit custom, non-standard attributes as well as custom rules for standard attributes. Thus, the following value of
$spec will permit the custom uses of the standard
rel attribute in
input (not permitted as per standards) and of a non-standard attribute,
vFlag, in
img.
$spec = 'img=vFlag; input=rel'
The attribute names can contain letters, colons (:) and hyphens (-), but they must start with a letter.
2.4 Performance time & memory usage
The time and memory consumed during text processing by htmLawed depend on its configuration, the size of the input, and the amount, nestedness and well-formedness of the HTML markup within the input. In particular, tag balancing and beautification can each increase the processing time by about a quarter.
The htmLawed
demo can be used to evaluate the performance and effects of different types of input and
$config.
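Timing and memory use can also be measured locally with a few lines of PHP. The sketch below is only one way to do this; hl_benchmark() is a hypothetical helper, the sample input is made up, and strtoupper() is used purely as a stand-in so that the code runs when htmLawed.php has not been included.

```php
<?php
// Hypothetical helper: time a filter callback over several runs.
function hl_benchmark($filter, $input, $runs = 5)
{
    $output = '';
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $output = call_user_func($filter, $input);
    }
    return array(
        'seconds_per_run'   => (microtime(true) - $start) / $runs,
        'peak_memory_bytes' => memory_get_peak_usage(), // peak for the whole script so far
        'output'            => $output,
    );
}

// strtoupper() is only a stand-in so that the sketch runs without htmLawed.php.
$filter = function_exists('htmLawed') ? 'htmLawed' : 'strtoupper';
$result = hl_benchmark($filter, str_repeat('<p>Some <b>sample</b> text</p>', 1000));

printf("%.6f s/run, peak %d bytes\n", $result['seconds_per_run'], $result['peak_memory_bytes']);
```

To compare configurations, the callback could be replaced with a small wrapper function that calls htmLawed() with a particular $config.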
2.5 Some security risks to keep in mind
When setting the parameters/arguments (like those to allow certain HTML elements) for use with htmLawed, one should bear in mind that the setting may let through potentially dangerous HTML code which is meant to steal user-data, deface a website, render a page non-functional, etc. Unless end-users, either people or software, supplying the content are completely trusted, security issues arising from the degree of HTML usage permitted through htmLawed's setting should be considered. For example, the following increase security risks:
* Allowing
script,
applet,
embed,
iframe or
object elements, or certain of their attributes like
allowscriptaccess
* Allowing HTML comments (some Internet Explorer versions are vulnerable with, e.g., <!--[if gte IE 4]><script>alert("xss");</script><![endif]-->)
* Allowing dynamic CSS expressions (some Internet Explorer versions are vulnerable)
* Allowing the
style attribute
To remove insecure HTML, code-developers using htmLawed must set $config appropriately. E.g., $config["elements"] = "* -script" to deny the script element (section 3.3), $config["safe"] = 1 to auto-configure certain htmLawed parameters for maximizing security (section 3.6), etc.
Permitting the style attribute brings in risks of click-jacking, phishing, web-page overlays, etc., even when the safe parameter is enabled (see section 3.6). Except for URLs and a few other things like CSS dynamic expressions, htmLawed currently does not check every CSS style property. It does provide ways for the code-developer implementing htmLawed to do such checks through htmLawed's $spec argument, and through the hook_tag parameter (see section 3.4.8 for more). Disallowing style completely and relying on CSS classes and stylesheet files is recommended.
htmLawed does not check or correct the character
encoding of the input it receives. In conjunction with permissive circumstances, such as when the character encoding is left undefined through HTTP headers or HTML
meta tags, this can allow for an exploit (like Google's
UTF-7/XSS vulnerability of the past).
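Because htmLawed leaves encoding issues to the calling code, an application may want to check the input and declare the encoding itself. The following is a minimal sketch that assumes the application works in UTF-8; hl_is_valid_utf8() is a hypothetical helper and the sample text is made up.

```php
<?php
// Hypothetical helper: true if $text is well-formed UTF-8. An empty pattern
// with the 'u' modifier makes PCRE validate the subject string as UTF-8.
function hl_is_valid_utf8($text)
{
    return preg_match('//u', $text) === 1;
}

$text = "caf\xC3\xA9"; // the word "cafe" with an accented e, encoded as UTF-8

if (hl_is_valid_utf8($text)) {
    // Declare the encoding so that browsers do not have to guess it.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
    // Assumes htmLawed.php has been included; skipped otherwise.
    if (function_exists('htmLawed')) {
        $processed = htmLawed($text);
    }
} else {
    // Reject the input, or convert it from its actual encoding, before filtering.
    $processed = '';
}
```

The validity check alone does not stop a UTF-7 payload, since such a payload consists of plain ASCII characters and is therefore valid UTF-8. It is the explicit charset declaration that keeps a browser from interpreting the page as UTF-7.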
2.6 Use without modifying old kses() code
The
Kses PHP script is used by many applications (like
WordPress). It is possible to have such applications use htmLawed instead, since it is compatible with code that calls the
kses() function declared in the
Kses file (usually named
kses.php). E.g., application code like this will continue to work after replacing
Kses with htmLawed:
$comment_filtered = kses($comment_input, array('a'=>array(), 'b'=>array(), 'i'=>array()));
For some of the
$config parameters, htmLawed will use values other than the default ones. These are indicated by
^ in
section 2.2. To force htmLawed to use other values, function
kses() in the htmLawed code should be edited -- a few configurable parameters/variables need to be changed.
If the application uses a
Kses file that has the
kses() function declared, then, to have the application use htmLawed instead of
Kses, simply rename
htmLawed.php (to
kses.php, e.g.) and replace the
Kses file (or just replace the code in the
Kses file with the htmLawed code). If the
kses() function in the
Kses file had been renamed by the application developer (e.g., in
WordPress, it is named
wp_kses()), then appropriately rename the
kses() function in the htmLawed code.
If the Kses file used by the application has been highly altered by the application developers, then one may need a different approach. E.g., with WordPress, it is best to copy the htmLawed code to wp-includes/kses.php, rename the newly added function kses() to wp_kses(), and delete the code for the original wp_kses() function.
If the
Kses code has a non-empty hook function (e.g.,
wp_kses_hook() in case of
WordPress), then the code for htmLawed's
kses_hook() function should be appropriately edited. However, the requirement of the hook function should be re-evaluated considering that htmLawed has extra capabilities. With
WordPress, the hook function is an essential one. The following code is suggested for the htmLawed
kses_hook() in case of
WordPress:
function kses_hook($string, &$cf, &$spec){
  // kses compatibility
  $allowed_html = $spec;
  $allowed_protocols = array();
  foreach($cf['schemes'] as $v){
    foreach($v as $k2=>$v2){
      if(!in_array($k2, $allowed_protocols)){
        $allowed_protocols[] = $k2;
      }
    }
  }
  return wp_kses_hook($string, $allowed_html, $allowed_protocols);
  // eof
}
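To see what the nested loops in the suggested hook compute, the same logic can be run on its own. This sketch assumes, as the loops imply, that the schemes value is an array keyed by attribute name whose entries are arrays keyed by scheme name; the sample data and the function name hl_collect_protocols() are hypothetical.

```php
<?php
// Hypothetical stand-alone version of the loops in the suggested kses_hook():
// gather every scheme name that appears under any attribute, without repeats.
function hl_collect_protocols($schemes)
{
    $protocols = array();
    foreach ($schemes as $per_attribute) {
        foreach (array_keys($per_attribute) as $scheme) {
            if (!in_array($scheme, $protocols)) {
                $protocols[] = $scheme;
            }
        }
    }
    return $protocols;
}

// Made-up sample in the assumed internal layout.
$sample = array(
    'href' => array('http' => 1, 'https' => 1, 'mailto' => 1),
    '*'    => array('file' => 1, 'http' => 1, 'https' => 1),
);

print_r(hl_collect_protocols($sample)); // http, https, mailto, file
```

The attribute-specific grouping is lost in the process, which matches the flat list of protocols that the hook passes on as $allowed_protocols.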
2.7 Tolerance for ill-written HTML
htmLawed can work with ill-written HTML code in the input. However, HTML that is too ill-written may not be read as HTML, and may therefore get identified as mere plain text. The following statements indicate the degree of looseness that htmLawed can work with, and can be provided in instructions to writers:
* Tags must be flanked by < and > with no > inside -- any needed > should be put in as &gt;. It is possible for tag content (element name and attributes) to be spread over many lines instead of being on one. A space may be present between the tag content and >, like <div > and <img / >, but not after the <.
* Element and attribute names need not be lower-cased.
* Attribute string of elements may be liberally spaced with tabs, line-breaks, etc.
* Attribute values may be single- and not double-quoted.
* Left-padding of numeric entities (like, &#0160;, &#x07ff;) with 0 is okay as long as the number of characters between the & and the ; does not exceed 8. All entities must end with ; though.
* Named character entities must be properly cased. Thus, &Lt; or &TILDE; will not be recognized as entities and will be neutralized.
* HTML comments should not be inside element tags (they can be between tags), and should begin with <!-- and end with -->. Characters like <, >, and & may be allowed inside depending on $config, but any --> inside should be put in as --&gt;. Any -- inside will be automatically converted to -, and a space will be added before the comment delimiter -->.
* CDATA sections should not be inside element tags, and can be in element content only if plain text is allowed for that element. They should begin with <![CDATA[ and end with ]]>. Characters like <, >, and & may be allowed inside depending on $config, but any ]]> inside should be put in as ]]&gt;.
* For attribute values, character entities &lt;, &gt; and &amp; should be used instead of characters <, > and & (when & is not part of a character entity). This applies even for Javascript code in values of attributes like onclick.
* Characters <, >, & and " that are part of actual Javascript, etc., code in script elements should be used as such and not be put in as entities like &gt;. Otherwise, though the HTML will be valid, the code may fail to work. Further, if such characters have to be used, then they should be put inside CDATA sections.
* Simple instructions like "an opening tag cannot be present between two closing tags" and "nested elements should be closed in the reverse order of how they were opened" can help authors write balanced HTML. If tags are imbalanced, htmLawed will try to balance them, but in the process, depending on
$config["keep_bad"], some code/text may be lost.
* Input authors should be notified of admin-specified allowed elements, attributes, configuration values (like conversion of named entities to numeric ones), etc.
* With
$config["unique_ids"] not
0 and the
id attribute being permitted, writers should carefully avoid using duplicate or invalid
id values as even though htmLawed will correct/remove the values, the final output may not be the one desired. E.g., when
<a id="home"></a><input id="home" /><label for="home"></label> is processed into
<a id="home"></a><input id="prefix_home" /><label for="home"></label>.
* Even if intended HTML is lost from an ill-written input, the processed output will be more secure and standard-compliant.
* For URLs, unless $config["schemes"] is appropriately set, writers should avoid using escape characters or entities in schemes. E.g., &#104;ttp (which many browsers will read as the harmless http) may be considered bad by htmLawed.
* htmLawed will attempt to put plain text present directly inside
blockquote,
form,
map and
noscript elements (illegal as per the specifications) inside auto-generated
div elements.
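When input for htmLawed is assembled by code and not typed by a writer, the guideline about entities in attribute values can be followed with PHP's htmlspecialchars(). This is a minimal sketch; the helper name hl_attr() and the sample values are hypothetical.

```php
<?php
// Hypothetical helper: encode <, >, & and quotes for use inside an attribute value.
function hl_attr($value)
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

$tag = '<a href="' . hl_attr('search.php?q=a&b=<c>') . '"'
     . ' title="' . hl_attr('5 > 3') . '">link</a>';

echo $tag, "\n";
// <a href="search.php?q=a&amp;b=&lt;c&gt;" title="5 &gt; 3">link</a>
```

By default htmlspecialchars() also re-encodes the & of entities that are already present in the value; passing false as its fourth argument leaves existing entities alone.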
2.8 Limitations & work-arounds
htmLawed's main objective is to make the input text
more standard-compliant, secure for readers, and free of HTML elements and attributes considered undesirable by the administrator. Some of its current limitations, regardless of this objective, are noted below along with work-arounds.
It should be borne in mind that no browser application is 100% standard-compliant, and that some of the standard specifications (like asking for normalization of white-spacing within
textarea elements) are clearly wrong. Regarding security, note that
unsafe HTML code is not legally invalid per se.
* htmLawed is meant for input that goes into the
body of HTML documents. HTML's head-level elements are not supported, nor are the frameset elements
frameset,
frame and
noframes. Content of the latter elements can, however, be individually filtered through htmLawed.
* It cannot transform the non-standard
embed elements to the standard-compliant
object elements. Yet, it can allow
embed elements if permitted (
embed is widely used and supported). Admins can certainly use the
hook_tag parameter (
section 3.4.9) to deploy a custom embed-to-object converter function.
* The only non-standard element that may be permitted is
embed; others like
noembed and
nobr cannot be permitted without modifying the htmLawed code.
* It cannot handle input that has non-HTML code like SVG and MathML. One way around is to break the input into pieces and pass only those without non-HTML code to htmLawed. Another is described in section 3.9. A third way may be to somehow take advantage of the $config["and_mark"] parameter (see section 3.2).
* By default, htmLawed won't check many attribute values for standard compliance. E.g.,
width="20m" with the dimension in non-standard
m is let through. Implementing universal and strict attribute value checks can make htmLawed slow and resource-intensive. Admins should look at the
hook_tag parameter (
section 3.4.9) or
$spec to enforce finer checks.
* The attributes, deprecated (which can be transformed too) or not, that it supports are largely those that are in the specifications. Only a few of the proprietary attributes are supported.
* Except for contained URLs and dynamic expressions (also optional), htmLawed does not check CSS style property values. Admins should look at using the
hook_tag parameter (
section 3.4.9) or
$spec for finer checks. Perhaps the best option is to disallow
style but allow
class attributes with the right
oneof or
match values for
class, and have the various class style properties in
.css CSS stylesheet files.
* htmLawed does not parse emoticons, decode
BBcode, or
wikify, auto-converting text to proper HTML. Similarly, it won't convert line-breaks to
br elements. Such functions are beyond its purview. Admins should use other code to pre- or post-process the input for such purposes.
* htmLawed cannot be used to have links force-opened in new windows (by auto-adding appropriate
target and
onclick attributes to
a). Admins should look at Javascript-based DOM-modifying solutions for this. Admins may also be able to use a custom hook function to enforce such checks (
hook_tag parameter; see
section 3.4.9).
* Nesting-based checks are not possible. E.g., one cannot disallow
p elements specifically inside
td while permitting it elsewhere. Admins may be able to use a custom hook function to enforce such checks (
hook_tag parameter; see
section 3.4.9).
* Except for optionally converting absolute or relative URLs to the other type, htmLawed will not alter URLs (e.g., to change the value of query strings or to convert
http to
https). Having absolute URLs may be a standard-requirement, e.g., when HTML is embedded in email messages, whereas altering URLs for other purposes is beyond htmLawed's goals. Admins may be able to use a custom hook function to make such alterations (
hook_tag parameter; see
section 3.4.9).
* Pairs of opening and closing tags that do not enclose any content (like
<em></em>) are not removed. This may be against the standard specifications for certain elements (e.g.,
table). However, presence of such standard-incompliant code will not break the display or layout of content. Admins can also use simple regex-based code to filter out such code.
* htmLawed does not check for certain element orderings described in the standard specifications (e.g., in a
table,
tbody is allowed before
tfoot). Admins may be able to use a custom hook function to enforce such checks (
hook_tag parameter; see
section 3.4.9).
* htmLawed does not check the number of nested elements. E.g., it will allow two
caption elements in a
table element, illegal as per the specifications. Admins may be able to use a custom hook function to enforce such checks (
hook_tag parameter; see
section 3.4.9).
* htmLawed might convert certain entities to actual characters and remove backslashes and CSS comment-markers (
/*) in
style attribute values in order to detect malicious HTML like crafted IE-specific dynamic expressions like
expression.... If this is too harsh, admins can allow CSS expressions through htmLawed core but then use a custom function through the
hook_tag parameter (
section 3.4.9) to more specifically identify CSS expressions in the
style attribute values. Also, using
$config["style_pass"], it is possible to have htmLawed pass
style attribute values without even looking at them (
section 3.4.8).
* htmLawed does not correct certain possible attribute-based security vulnerabilities (e.g.,
<a href="http://x%22+style=%22background-image:xss">x</a>). These arise when browsers mis-identify markup in
escaped text, defeating the very purpose of escaping text (a bad browser will read the given example as
<a href="http://x" style="background-image:xss">x</a>).
* Because of poor Unicode support in PHP, htmLawed does not remove the
high value HTML-invalid characters with multi-byte code-points. Such characters, however, are extremely unlikely to be in the input (see
section 3.1).
* htmLawed does not check or correct the character encoding of the input it receives. In conjunction with permitting circumstances such as when the character encoding is left undefined through HTTP headers or HTML
meta tags, this can permit an exploit (like Google's
UTF-7/XSS vulnerability of the past). Also, htmLawed can mangle input text if it is not well-formed in terms of character encoding. Administrators can consider using code available elsewhere to check well-formedness of input text characters to correct any defect.
* htmLawed is expected to work with input texts in ASCII-compatible single byte encodings such as national variants of ASCII (like ISO-646-DE/German of the ISO 646 standard), extended ASCII variants (like ISO 8859-9/Turkish of the ISO 8859/ISO Latin standard), ISO 8859-based Windows variants (like Windows 1252), EBCDIC, Shift JIS (Japanese), GB-Roman (Chinese), and KS-Roman (Korean). It should also properly handle texts with variable byte encodings like UTF-7 (Unicode) and UTF-8 (Unicode). However, htmLawed may mangle input texts with double byte encodings like UTF-16 (Unicode), JIS X 0208:1997 (Japanese) and KS X 1001:1992 (Korean), or the UTF-32 (Unicode) quadruple byte encoding. If an input text has such an encoding, administrators can use PHP's
iconv functions, or some other means, to convert text to UTF-8 before passing it to htmLawed.
* Like any script using PHP's PCRE regex functions, PHP setup-specific low PCRE limit values can cause htmLawed to at least partially fail with very long input texts.
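Two of the work-arounds mentioned in the list above, converting the input to UTF-8 beforehand and filtering out empty tag pairs with a regex afterwards, can be sketched as follows. This is only a minimal illustration: the function names to_utf8 and strip_empty_pairs are hypothetical and not part of htmLawed, the caller is assumed to know the source encoding, and the list of elements to strip is supplied by the caller because some elements (like textarea or td) can legitimately be empty.

```php
<?php
// Hypothetical helpers for pre- and post-processing; not part of htmLawed.

// Convert text in a known encoding (e.g., UTF-16LE) to UTF-8 before filtering.
function to_utf8($text, $from_encoding)
{
    $converted = iconv($from_encoding, 'UTF-8', $text);
    // iconv() returns false on failure; an empty string is returned in that case
    return ($converted === false) ? '' : $converted;
}

// Remove pairs of opening and closing tags that enclose nothing but
// optional white-space, but only for the element names given in $names.
function strip_empty_pairs($html, array $names)
{
    $alternatives = implode('|', array_map('preg_quote', $names));
    $pattern = '`<(' . $alternatives . ')(?:\s[^>]*)?>\s*</\1\s*>`i';
    // Repeat so that pairs emptied by an earlier pass are removed too
    do {
        $html = preg_replace($pattern, '', $html, -1, $count);
    } while ($count > 0);
    return $html;
}

// "hi" encoded as UTF-16LE (two bytes per character)
echo to_utf8("h\x00i\x00", 'UTF-16LE'), "\n";

// Prints <p>Text and more</p>
echo strip_empty_pairs('<p>Text<em></em> and <b><i> </i></b>more</p>', array('em', 'b', 'i')), "\n";
```

The conversion would be done before the htmLawed call, and the stripping after it, when the markup is already well-formed.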
2.9 Examples of usage
(to top)
Safest, allowing only
safe HTML markup --
$config = array('safe'=>1);
$out = htmLawed($in, $config);
Simplest, allowing all valid HTML markup except
javascript: --
$out = htmLawed($in);
Allowing all valid HTML markup including
javascript: --
$config = array('schemes'=>'*:*');
$out = htmLawed($in, $config);
Allowing only
safe HTML and the elements
a,
em, and
strong --
$config = array('safe'=>1, 'elements'=>'a, em, strong');
$out = htmLawed($in, $config);
Not allowing elements
script and
object --
$config = array('elements'=>'* -script -object');
$out = htmLawed($in, $config);
Not allowing attributes
id and
style --
$config = array('deny_attribute'=>'id, style');
$out = htmLawed($in, $config);
Permitting only attributes
title and
href --
$config = array('deny_attribute'=>'* -title -href');
$out = htmLawed($in, $config);
Remove bad/disallowed tags altogether instead of converting them to entities --
$config = array('keep_bad'=>0);
$out = htmLawed($in, $config);
Allowing attribute
title only in
a and not allowing attributes
id,
style, or scriptable
on* attributes like
onclick --
$config = array('deny_attribute'=>'title, id, style, on*');
$spec = 'a=title';
$out = htmLawed($in, $config, $spec);
Allowing a custom attribute,
vFlag, in
img and permitting custom use of the standard attribute,
rel, in
input --
$spec = 'img=vFlag; input=rel';
$out = htmLawed($in, $config, $spec);
Some case-studies are presented below.
1. A blog administrator wants to allow only
a,
em,
strike,
strong and
u in comments, but needs
strike and
u transformed to
span for better XHTML 1-strict compliance, and he wants the
a links to point only to
http or
https resources:
$processed = htmLawed($in, array('elements'=>'a, em, strike, strong, u', 'make_tag_strict'=>1, 'safe'=>1, 'schemes'=>'*:http, https'), 'a=href');
2. An author uses a custom-made web application to load content on his web-site. He is the only one using that application and the content he generates has all types of HTML, including scripts. The web application uses htmLawed primarily as a tool to correct errors that creep in while writing HTML and to take care of the occasional
bad characters in copy-paste text introduced by Microsoft Office. The web application provides a preview before submitted input is added to the content. For the previewing process, htmLawed is set up as follows:
$processed = htmLawed($in, array('css_expression'=>1, 'keep_bad'=>1, 'make_tag_strict'=>1, 'schemes'=>'*:*', 'valid_xhtml'=>1));
For the final submission process,
keep_bad is set to
6. A value of
1 for the preview process allows the author to note and correct any HTML mistake without losing any of the typed text.
3. A data-miner is scraping information in a specific table of similar web-pages and is collating the data rows, and uses htmLawed to reduce unnecessary markup and white-spaces:
$processed = htmLawed($in, array('elements'=>'tr, td', 'tidy'=>-1), 'tr, td =');
3 Details
(to top)
3.1 Invalid/dangerous characters
(to top)
Valid characters (more correctly, their code-points) in HTML or XML are, hexadecimally,
9,
a,
d,
20 to
d7ff, and
e000 to
10ffff, except
fffe and
ffff (decimally,
9,
10,
13,
32 to
55295, and
57344 to
1114111, except
65534 and
65535). htmLawed removes the invalid characters
0 to
8,
b,
c, and
e to
1f.
Because of PHP's poor native support for multi-byte characters, htmLawed cannot check for the remaining invalid code-points. However, for various reasons, it is very unlikely for any of those characters to be in the input.
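The removal of the invalid single-byte characters listed above can be pictured with a short regex-based sketch. The function name strip_invalid_control_chars is hypothetical, and the code is a simplified stand-in for illustration, not htmLawed's own implementation.

```php
<?php
// Hypothetical helper; a simplified illustration, not htmLawed's code.
function strip_invalid_control_chars($text)
{
    // Hexadecimal 0-8, b, c and e-1f are invalid;
    // 9 (tab), a (line-feed) and d (carriage-return) are valid and kept.
    return preg_replace('/[\x00-\x08\x0b\x0c\x0e-\x1f]/', '', $text);
}

$in = "Tab\tkept, null\x00 and bell\x07 removed,\r\nline-break kept";
echo strip_invalid_control_chars($in);
```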
Characters that are discouraged (see
section 5.1) but not invalid are not removed by htmLawed.
It (function
hl_tag()) also replaces the potentially dangerous (in some Mozilla [Firefox] and Opera browsers) soft-hyphen character (code-point, hexadecimally,
ad, or decimally,
173) in attribute values with spaces. Where required, the characters
<,
>,
&, and
" are converted to entities.
With
$config["clean_ms_char"] set as
1 or
2, many of the discouraged characters (decimal code-points
127 to
159 except
133) that many Microsoft applications incorrectly use (as per the
Windows 1252 [
Cp-1252] or a similar encoding system), and the character for decimal code-point
133, are converted to appropriate decimal numerical entities (or removed for a few cases) -- see the appendix in
section 5.4. This can help avoid some display issues arising from copying-pasting of content.
With
$config["clean_ms_char"] set as
2, characters for the hexadecimal code-points
82,
91, and
92 (for special single-quotes), and
84,
93, and
94 (for special double-quotes) are converted to ordinary single and double quotes respectively and not to entities.
The character values are replaced with entities/characters and not character values referred to by the entities/characters to keep this task independent of the character-encoding of input text.
The
$config["clean_ms_char"] parameter should not be used if authors do not copy-paste Microsoft-created text, or if the input text is not believed to use the
Windows 1252 (
Cp-1252) or a similar encoding like
Cp-1251. Further, the input form and the web-pages displaying it or its content should have the character encoding appropriately marked-up.
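The difference between the two clean_ms_char settings for the special quote characters can be sketched as below. The function clean_ms_quotes is hypothetical and covers only the six quote characters named above; the entity values are the Unicode equivalents of those bytes in Windows 1252, and it is an assumption that they match the full table in section 5.4. The sketch works on single bytes, so it is only meaningful for input in a single-byte encoding like Windows 1252, not for UTF-8 text.

```php
<?php
// Hypothetical, partial illustration of the two clean_ms_char modes.
function clean_ms_quotes($text, $mode = 1)
{
    // Mode 1: special quotes become decimal numeric entities
    $entities = array(
        "\x82" => '&#8218;', "\x91" => '&#8216;', "\x92" => '&#8217;',
        "\x84" => '&#8222;', "\x93" => '&#8220;', "\x94" => '&#8221;',
    );
    // Mode 2: special quotes become ordinary single or double quotes
    $plain = array(
        "\x82" => "'", "\x91" => "'", "\x92" => "'",
        "\x84" => '"', "\x93" => '"', "\x94" => '"',
    );
    return strtr($text, ($mode == 2) ? $plain : $entities);
}

echo clean_ms_quotes("\x93Quoted\x94 and it\x92s", 1), "\n";
echo clean_ms_quotes("\x93Quoted\x94 and it\x92s", 2), "\n";
```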
3.2 Character references/entities
(to top)
Valid character entities take the form
&*; where
* is
#x followed by a hexadecimal number (hexadecimal numeric entity; like
&#xa0; for non-breaking space), or alphanumeric like
gt (external or named entity; like
&nbsp; for non-breaking space), or
# followed by a number (decimal numeric entity; like
&#160; for non-breaking space). Character entities referring to the soft-hyphen character (the
&shy; or
\xad character; hexadecimal code-point
ad [decimal
173]) in URL-accepting attribute values are always replaced with spaces; soft-hyphens in attribute values introduce vulnerabilities in some older versions of the Opera and Mozilla [Firefox] browsers.
htmLawed (function
hl_ent()):
* Neutralizes entities with multiple leading zeroes or missing semi-colons (potentially dangerous)
* Lowercases the
X (for XML-compliance) and
A-F of hexadecimal numeric entities
* Neutralizes entities referring to characters that are HTML-invalid (see
section 3.1)
* Neutralizes entities referring to characters that are HTML-discouraged (code-points, hexadecimally,
7f to
84,
86 to
9f, and
fdd0 to
fddf, or decimally,
127 to
132,
134 to
159, and
64976 to
64991). Entities referring to the remaining discouraged characters (see
section 5.1 for a full list) are let through.
* Neutralizes named entities that are not in the specs.
* Optionally converts valid HTML-specific named entities except
&gt;,
&lt;,
&quot;, and
&amp; to decimal numeric ones (hexadecimal if $config["hexdec_entity"] is
2) for generic XML-compliance. For this,
$config["named_entity"] should be
0.
* Optionally converts hexadecimal numeric entities to the more widely supported decimal ones. For this,
$config["hexdec_entity"] should be
0.
* Optionally converts decimal numeric entities to the hexadecimal ones. For this,
$config["hexdec_entity"] should be
2.
Neutralization refers to the
entitification of
& to
&amp;.
Note: htmLawed does not convert entities to the actual characters represented by them; one can pass the htmLawed output through PHP's
html_entity_decode function for that.
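As a small illustration of the note above, the following sketch decodes entities in already-filtered text with PHP's html_entity_decode. The sample string stands in for htmLawed output. Decoding brings back literal < and > characters, so the decoded text should be treated as plain text and not placed in a web-page without escaping.

```php
<?php
// Sample text standing in for htmLawed output
$filtered = 'Fish &amp; chips &lt;b&gt; &#169; &#x263a;';

// ENT_QUOTES also decodes quote entities; the third argument sets the output encoding
$decoded = html_entity_decode($filtered, ENT_QUOTES, 'UTF-8');

echo $decoded;
```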
Note: If
$config["and_mark"] is set, and set to a value other than
0, then the
& characters in the original input are replaced with the control character for the hexadecimal code-point
6 (
\x06;
& characters introduced by htmLawed, e.g., after converting
< to
&lt;, are not affected). This allows one to distinguish, say, an
&gt; introduced by htmLawed and an
&gt; put in by the input writer, and can be helpful in further processing of the htmLawed-processed text (e.g., to identify the character sequence
o(><)o to generate an emoticon image). When this feature is active, admins should ensure that the htmLawed output is not directly used in web pages or XML documents as the presence of the
\x06 can break documents. Before use in such documents, and preferably before any storage, any remaining
\x06 should be changed back to
&, e.g., with:
$final = str_replace("\x06", '&', $prelim);
Also, see
section 3.9.
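The emoticon case mentioned above can be sketched like this. The $prelim string only simulates what htmLawed output might look like with and_mark active, under the assumption that the writer typed the literal emoticon characters followed by an entity; the image file name smile.png is hypothetical.

```php
<?php
// Simulated htmLawed output with and_mark on (an assumption for illustration):
// the writer's literal > and < were entitified by htmLawed with a real &,
// whereas the & of the entity typed by the writer became \x06.
$prelim = "o(&gt;&lt;)o \x06gt; 5";

// Only the sequence typed with literal angle brackets matches here
$with_emoticon = str_replace('o(&gt;&lt;)o', '<img src="smile.png" alt="smile" />', $prelim);

// Change any remaining \x06 back to & before storage or display
$final = str_replace("\x06", '&', $with_emoticon);

echo $final;
```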
3.3 HTML elements
(to top)
htmLawed can be configured to allow only certain HTML elements (tags) in the input. Disallowed elements (just tag-content, and not element-content), based on
$config["keep_bad"], are either
neutralized (converted to plain text by entitification of
< and
>) or removed.
E.g., with only
em permitted:
Input:
<em>My</em> website is <a href="http://a.com">a.com</a>.
Output, with
$config["keep_bad"] = 0:
<em>My</em> website is a.com.
Output, with
$config["keep_bad"] not
0:
<em>My</em> website is &lt;a href="http://a.com"&gt;a.com&lt;/a&gt;.
See
section 3.3.3 for differences between the various non-zero
$config["keep_bad"] values.
htmLawed by default permits these 86 elements:
a, abbr, acronym, address, applet, area, b, bdo, big, blockquote, br, button, caption, center, cite, code, col, colgroup, dd, del, dfn, dir, div, dl, dt, em, embed, fieldset, font, form, h1, h2, h3, h4, h5, h6, hr, i, iframe, img, input, ins, isindex, kbd, label, legend, li, map, menu, noscript, object, ol, optgroup, option, p, param, pre, q, rb, rbc, rp, rt, rtc, ruby, s, samp, script, select, small, span, strike, strong, sub, sup, table, tbody, td, textarea, tfoot, th, thead, tr, tt, u, ul, var
Except for
embed (included because of its wide-spread use) and the Ruby elements (
rb,
rbc,
rp,
rt,
rtc,
ruby; part of XHTML 1.1), these are all the elements in the HTML 4/XHTML 1 specs. Strict-specific specs. exclude
center,
dir,
font,
isindex,
menu,
s,
strike, and
u.
With
$config["safe"] = 1, the default set will exclude
applet,
embed,
iframe,
object and
script; see
section 3.6.
When
$config["elements"], which specifies allowed elements, is
properly defined, and neither empty nor set to
0 or
*, the default set is not used. To have elements added to or removed from the default set, a
+/- notation is used. E.g.,
*-script-object implies that only
script and
object are disallowed, whereas
*+noembed means that
noembed is also allowed. Elements can also be specified as comma separated names. E.g.,
a, b, i means only
a,
b and
i are permitted. In this notation,
*,
+ and
- have no significance and can actually cause a mis-reading.
Some more examples of
$config["elements"] values indicating permitted elements (note that empty spaces are liberally allowed for clarity):
*
a, blockquote, code, em, strong -- only
a,
blockquote,
code,
em, and
strong
*
*-script -- all excluding
script
*
* -center -dir -font -isindex -menu -s -strike -u -- only XHTML-Strict elements
*
*+noembed-script -- all including
noembed excluding
script
Some mis-usages (and the resulting permitted elements) that can be avoided:
*
-* -- none; instead of htmLawed, one might just use, e.g., the
htmlspecialchars() PHP function
*
*, -script -- all except
script; admin probably meant
*-script
*
-*, a, em, strong -- all; admin probably meant
a, em, strong
*
* -- all; admin need not have set
elements
*
*-form+form -- all; a
+ will always over-ride any
-
*
*, noembed -- only
noembed; admin probably meant
*+noembed
*
a, +b, i -- only
a and
i; admin probably meant
a, b, i
Basically, when using the
+/- notation, commas (
,) should not be used, and vice versa, and
* should be used with the former but not the latter.
Note: Even if an element that is not in the default set is allowed through
$config["elements"], like
noembed in the last example, it will eventually be removed during tag balancing unless such balancing is turned off (
$config["balance"] set to
0). Currently, the only way around this, which actually is simple, is to edit the various arrays in the function
hl_bal() to accommodate the element and its nesting properties.
A possible second way to specify allowed elements is to set
$config["parent"] to an element name that supposedly will hold the input, and to set
$config["balance"] to
1. During tag balancing (see
section 3.3.3), all elements that cannot legally nest inside the parent element will be removed. The parent element is auto-reset to
div if
$config["parent"] is empty,
body, or an element not in htmLawed's default set of 86 elements.
Tag transformation is possible for improving XHTML-Strict compliance -- most of the deprecated elements are removed or converted to valid XHTML-Strict ones; see
section 3.3.2.
3.3.1 Handling of comments and CDATA sections
(to top)
CDATA sections have the format
<![CDATA[...anything but not "]]>"...]]>, and HTML comments,
<!--...anything but not "-->"... -->. Neither HTML comments nor
CDATA sections can reside inside tags. HTML comments can exist anywhere else, but
CDATA sections can exist only where plain text is allowed (e.g., immediately inside
td element content but not immediately inside
tr element content).
htmLawed (function
hl_cmtcd()) handles HTML comments or
CDATA sections depending on the values of
$config["comment"] or
$config["cdata"]. If
0, such markup is not looked for and the text is processed like plain text. If
1, it is removed completely. If
2, it is preserved but any
<,
> and
& inside are changed to entities. If
3, they are left as such.
Note that for the last two cases, HTML comments and
CDATA sections will always be removed from tag content (function
hl_tag()).
Examples:
Input:
<!-- home link --><a href="home.htm"><![CDATA[x=&y]]>Home</a>
Output (
$config["comment"] = 0, $config["cdata"] = 2):
&lt;!-- home link --&gt;<a href="home.htm"><![CDATA[x=&amp;y]]>Home</a>
Output (
$config["comment"] = 1, $config["cdata"] = 2):
<a href="home.htm"><![CDATA[x=&amp;y]]>Home</a>
Output (
$config["comment"] = 2, $config["cdata"] = 2):
<!-- home link --><a href="home.htm"><![CDATA[x=&amp;y]]>Home</a>
Output (
$config["comment"] = 2, $config["cdata"] = 1):
<!-- home link --><a href="home.htm">Home</a>
Output (
$config["comment"] = 3, $config["cdata"] = 3):
<!-- home link --><a href="home.htm"><![CDATA[x=&y]]>Home</a>
For standard-compliance, comments are given the form
<!--comment -->, and any
-- in the content is made
-.
When
$config["safe"] = 1, CDATA sections and comments are considered plain text unless
$config["comment"] or
$config["cdata"] is explicitly specified; see
section 3.6.
3.3.2 Tag-transformation for better XHTML-Strict
(to top)
If
$config["make_tag_strict"] is set and not
0, the following non-XHTML-Strict elements (and attributes), even if admin-permitted, are mutated as indicated (element content remains intact; function
hl_tag2()):
* applet - (based on
$config["make_tag_strict"], unchanged (
1) or removed (
2))
* center -
div style="text-align: center;"
* dir -
ul
* embed - (based on
$config["make_tag_strict"], unchanged (
1) or removed (
2))
* font (face, size, color) -
span style="font-family: ; font-size: ; color: ;" (size transformation
reference)
* isindex - (based on
$config["make_tag_strict"], unchanged (
1) or removed (
2))
* menu -
ul
* s -
span style="text-decoration: line-through;"
* strike -
span style="text-decoration: line-through;"
* u -
span style="text-decoration: underline;"
For an element with a pre-existing
style attribute value, the extra style properties are appended.
Example input:
<center>
The PHP <s>software</s> script used for this <strike>web-page</strike> web-page is <font style="font-weight: bold " face=arial size='+3' color = "red ">htmLawedTest.php</font>, from <u style= 'color:green'>PHP Labware</u>.
</center>
The output:
<div style="text-align: center;">
The PHP <span style="text-decoration: line-through;">software</span> script used for this <span style="text-decoration: line-through;">web-page</span> web-page is <span style="font-weight: bold; font-family: arial; color: red; font-size: 200%;">htmLawedTest.php</span>, from <span style="color:green; text-decoration: underline;">PHP Labware</span>.
</div>
3.3.3 Tag balancing and proper nesting
(to top)
If
$config["balance"] is set to
1, htmLawed (function
hl_bal()) checks and corrects the input to have properly balanced tags and legal element content (i.e., any element nesting should be valid, and plain text may be present only in the content of elements that allow them).
Depending on the value of
$config["keep_bad"] (see
section 2.2 and
section 3.3), illegal content may be removed or neutralized to plain text by converting < and > to entities:
0 - remove; this option is available only to maintain Kses-compatibility and should not be used otherwise (see
section 2.6)
1 - neutralize tags and keep element content
2 - remove tags but keep element content
3 and
4 - like
1 and
2, but keep element content only if text (
pcdata) is valid in parent element as per specs
5 and
6 - like
3 and
4, but line-breaks, tabs and spaces are left
Example input (disallowing the
p element):
<*> Pseudo-tags <*>
<xml>Non-HTML tag xml</xml>
<p>
Disallowed tag p
</p>
<ul>Bad<li>OK</li></ul>
The output with
$config["keep_bad"] = 1:
&lt;*&gt; Pseudo-tags &lt;*&gt;
&lt;xml&gt;Non-HTML tag xml&lt;/xml&gt;
&lt;p&gt;
Disallowed tag p
&lt;/p&gt;
<ul>Bad<li>OK</li></ul>
The output with
$config["keep_bad"] = 3:
&lt;*&gt; Pseudo-tags &lt;*&gt;
&lt;xml&gt;Non-HTML tag xml&lt;/xml&gt;
&lt;p&gt;
Disallowed tag p
&lt;/p&gt;
<ul><li>OK</li></ul>
The output with
$config["keep_bad"] = 6:
&lt;*&gt; Pseudo-tags &lt;*&gt;
Non-HTML tag xml
Disallowed tag p
<ul><li>OK</li></ul>
An option like
1 is useful, e.g., when a writer previews his submission, whereas one like
3 is useful before content is finalized and made available to all.
Note: In the example above, unlike
<*>,
<xml> gets considered as a tag (even though there is no HTML element named
xml). Thus, the
keep_bad parameter's value affects
<xml> but not
<*>. In general, text matching the regular expression pattern
<(/?)([a-zA-Z][a-zA-Z1-6]*)([^>]*?)\s?> is considered a tag (phrase enclosed by the angled brackets
< and
>, and starting [with an optional slash preceding] with an alphanumeric word that starts with a letter...), and is subjected to the
keep_bad value.
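The pattern quoted above can be tried out directly with PHP's PCRE functions. In this sketch, the function name looks_like_tag is hypothetical; the pattern itself is copied from the text, with backticks as delimiters since it contains a slash.

```php
<?php
// Hypothetical helper that applies the tag-identifying pattern quoted in the text
function looks_like_tag($text)
{
    $pattern = '`<(/?)([a-zA-Z][a-zA-Z1-6]*)([^>]*?)\s?>`';
    return preg_match($pattern, $text) === 1;
}

foreach (array('<xml>', '</p>', '<h1 class="x">', '<*>', '< em>') as $candidate) {
    echo $candidate, ' => ', (looks_like_tag($candidate) ? 'tag' : 'not a tag'), "\n";
}
```

The first three candidates are treated as tags, and so would be subjected to the keep_bad value, whereas the last two are not.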
Nesting/content rules for each of the 86 elements in htmLawed's default set (see
section 3.3) are defined in function
hl_bal(). This means that if a non-standard element besides
embed is being permitted through
$config["elements"], the element's tag content will end up getting removed if
$config["balance"] is set to
1.
Plain text and/or certain elements nested inside
blockquote,
form,
map and
noscript need to be in block-level elements. This point is often missed during manual writing of HTML code. htmLawed attempts to address this during balancing. E.g., if the parent container is set as
form, the input
B:<input type="text" value="b" />C:<input type="text" value="c" /> is converted to
<div>B:<input type="text" value="b" />C:<input type="text" value="c" /></div>.
3.3.4 Elements requiring child elements
(to top)
As per specs, the following elements require legal child elements nested inside them:
blockquote, dir, dl, form, map, menu, noscript, ol, optgroup, rbc, rtc, ruby, select, table, tbody, tfoot, thead, tr, ul
In some cases, the specs stipulate the number and/or the ordering of the child elements. A
table can have 0 or 1
caption,
tbody,
tfoot, and
thead, but they must be in this order:
caption,
thead,
tfoot,
tbody.
htmLawed currently does not check for conformance to these rules. Note that any non-compliance in this regard will not introduce security vulnerabilities, crash browser applications, or affect the rendering of web-pages.
With
$config["direct_list_nest"] set to
1, htmLawed will allow direct nesting of an
ol or
ul list within another
ol or
ul without requiring the child list to be within an
li of the parent list. While this is not standard-compliant, directly nested lists are rendered properly by almost all browsers. The parameter
$config["direct_list_nest"] has no effect if tag-balancing (
section 3.3.3) is turned off.
3.3.5 Beautify or compact HTML
(to top)
By default, htmLawed will neither
beautify HTML code by formatting it with indentations, etc., nor will it make it compact by removing un-needed white-space. (It does always properly white-space tag content.)
As per the HTML standards, spaces, tabs and line-breaks in web-pages (except those inside
pre elements) are all considered equivalent, and referred to as
white-spaces. Browser applications are supposed to consider contiguous white-spaces as just a single space, and to disregard white-spaces trailing opening tags or preceding closing tags. This white-space
normalization allows the use of text/code beautifully formatted with indentations and line-spacings for readability. Such
pretty HTML can, however, increase the size of web-pages, or make the extraction or scraping of plain text cumbersome.
With the
$config parameter
tidy, htmLawed can be used to beautify or compact the input text. Input with just plain text and no HTML markup is also subject to this. Besides
pre, the
script and
textarea elements, CDATA sections, and HTML comments are not subjected to the tidying process.
To
compact, use
$config["tidy"] = -1; single instances or runs of white-spaces are replaced with a single space, and white-spaces trailing and leading open and closing tags, respectively, are removed.
To
beautify,
$config["tidy"] is set as
1, or for customized tidying, as a string like
2s2n. The
s or
t character specifies the use of spaces or tabs for indentation. The first and third characters, any of the digits 0-9, specify the number of spaces or tabs per indentation, and any parental lead spacing (extra indenting of the whole block of input text). The
r and
n characters are used to specify line-break characters:
n for
\n (Unix/Mac OS X line-breaks),
rn or
nr for
\r\n (Windows/DOS line-breaks), or
r for
\r.
The
$config["tidy"] value of
1 is equivalent to
2s0n. Other
$config["tidy"] values are read loosely: a value of
4 is equivalent to
4s0n;
t2, to
1t2n;
s, to
2s0n;
2TR, to
2t0r;
T1, to
1t1n;
nr3, to
3s0nr, and so on. Except in the indentations and line-spacings, runs of white-spaces are replaced with a single space during beautification.
Input formatting using
$config["tidy"] is not recommended when input text has mixed markup (like HTML + PHP).
3.4 Attributes
(to top)
htmLawed will only permit attributes described in the HTML specs (including deprecated ones). It also permits some attributes for use with the
embed element (the non-standard
embed element is supported in htmLawed because of its widespread use), and the
xml:space attribute (valid only in XHTML 1.1). A list of such 111 attributes and the elements they are allowed in is in
section 5.2. Using the
$spec argument, htmLawed can be forced to permit custom, non-standard attributes as well as custom rules for standard attributes (
section 2.3).
When
$config["deny_attribute"] is not set, or set to
0, or empty (
""), all the 111 attributes are permitted. Otherwise,
$config["deny_attribute"] can be set as a list of comma-separated names of the denied attributes.
on* can be used to refer to the group of potentially dangerous, script-accepting attributes:
onblur,
onchange,
onclick,
ondblclick,
onfocus,
onkeydown,
onkeypress,
onkeyup,
onmousedown,
onmousemove,
onmouseout,
onmouseover,
onmouseup,
onreset,
onselect and
onsubmit.
Note that attributes specified in
$config["deny_attribute"] are denied globally, for all elements. To deny attributes for only specific elements,
$spec (see
section 2.3) can be used.
$spec can also be used to element-specifically permit an attribute otherwise denied through
$config["deny_attribute"].
With
$config["safe"] = 1 (
section 3.6), the
on* attributes are automatically disallowed.
Note: To deny all but a few attributes globally, a simpler way to specify
$config["deny_attribute"] would be to use the notation
* -attribute1 -attribute2 .... Thus, a value of
* -title -href implies that except
href and
title (where allowed as per standards) all other attributes are to be removed. With this notation, the value for the parameter
safe (
section 3.6) will have no effect on
deny_attribute.
htmLawed (function
hl_tag()) also:
* Lower-cases attribute names
* Removes duplicate attributes (last one stays)
* Gives attributes the form
name="value" and single-spaces them, removing unnecessary white-spacing
* Provides
required attributes (see
section 3.4.1)
* Double-quotes values and escapes any
" inside them
* Replaces the possibly dangerous soft-hyphen characters (hexadecimal code-point
ad) in the values with spaces
* Allows custom function to additionally filter/modify attribute values (see
section 3.4.9)
3.4.1 Auto-addition of XHTML-required attributes
(to top)
If indicated attributes for the following elements are found missing, htmLawed (function
hl_tag()) will add them (with values same as attribute names unless indicated otherwise below):
* area - alt (
area)
* area, img - src, alt (
image)
* bdo - dir (
ltr)
* form - action
* map - name
* optgroup - label
* param - name
* script - type (
text/javascript)
* textarea - rows (
10), cols (
50)
Additionally, with
$config["xml:lang"] set to
1 or
2, if the
lang but not the
xml:lang attribute is declared, then the latter is added too, with a value copied from that of
lang. This is for better standard-compliance. With
$config["xml:lang"] set to
2, the
lang attribute is removed (XHTML 1.1 specs).
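The xml:lang rule described above can be sketched on a simple array of attribute names and values. The function apply_xml_lang and the array representation are hypothetical, chosen only to illustrate the rule; htmLawed's internal handling may differ.

```php
<?php
// Hypothetical helper; $attributes maps attribute names to values for one tag.
function apply_xml_lang(array $attributes, $mode)
{
    if ($mode != 1 && $mode != 2) {
        return $attributes; // feature is off; nothing is changed
    }
    // Copy lang to xml:lang when only lang is declared
    if (isset($attributes['lang']) && !isset($attributes['xml:lang'])) {
        $attributes['xml:lang'] = $attributes['lang'];
    }
    // With a value of 2, lang is dropped (XHTML 1.1 specs)
    if ($mode == 2) {
        unset($attributes['lang']);
    }
    return $attributes;
}

print_r(apply_xml_lang(array('lang' => 'en', 'title' => 'x'), 1));
print_r(apply_xml_lang(array('lang' => 'en', 'title' => 'x'), 2));
```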
Note that the
name attribute for
map, invalid in XHTML 1.1, is also transformed if required -- see
section 3.4.6.
3.4.2 Duplicate/invalid id values
(to top)
If
$config["unique_ids"] is
1, htmLawed (function
hl_tag()) removes
id attributes with values that are not XHTML-compliant (must begin with a letter and can contain letters, digits,
:,
.,
- and
_) or duplicate. If
$config["unique_ids"] is a word, any duplicate but otherwise valid value will be appropriately prefixed with the word to ensure its uniqueness. The word should begin with a letter and should contain only letters, numbers,
:,
.,
_ and
-.
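The validity and uniqueness rules described above can be sketched as below. Both function names are hypothetical, and the code only mirrors the rules as stated in the text (the behavior with unique_ids set to 1); it is not htmLawed's own implementation.

```php
<?php
// Hypothetical helper: must begin with a letter, and can contain
// letters, digits, colons, periods, hyphens and underscores.
function is_xhtml_id($value)
{
    return preg_match('/^[a-zA-Z][a-zA-Z0-9:._-]*$/D', $value) === 1;
}

// Hypothetical helper: keep only valid values that were not seen before
function filter_ids(array $ids)
{
    $seen = array();
    $kept = array();
    foreach ($ids as $id) {
        if (is_xhtml_id($id) && !isset($seen[$id])) {
            $seen[$id] = 1;
            $kept[] = $id;
        }
    }
    return $kept;
}

// 2nd is invalid (starts with a digit) and the second top is a duplicate
print_r(filter_ids(array('top', '2nd', 'nav:main', 'top')));
```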
Even if multiple inputs need to be filtered (through multiple calls to htmLawed), htmLawed ensures uniqueness of
id values as it uses a global variable (
$GLOBALS["hl_Ids"] array). Further, an admin can restrict the use of certain
id values by presetting this variable before htmLawed is called into use. E.g.:
$GLOBALS['hl_Ids'] = array('top'=>1, 'bottom'=>1, 'myform'=>1); // id values not allowed in input
$processed = htmLawed($text); // filter input
3.4.3 URL schemes (protocols) and scripts in attribute values
(to top)
htmLawed edits attributes that take URLs as values if they are found to contain un-permitted schemes. E.g., if the
afp scheme is not permitted, then
<a href="afp://domain.org"> becomes
<a href="denied:afp://domain.org">, and if Javascript is not permitted
<a onclick="javascript:xss();"> becomes
<a onclick="denied:javascript:xss();">.
By default htmLawed permits these schemes in URLs for the
href attribute:
aim, feed, file, ftp, gopher, http, https, irc, mailto, news, nntp, sftp, ssh, telnet
Also, only
file,
http and
https are permitted in attributes whose names start with
o (like
onmouseover), and in these attributes that accept URLs:
action, cite, classid, codebase, data, href, longdesc, model, pluginspage, pluginurl, src, style, usemap
These default sets are used when
$config["schemes"] is not set (see
section 2.2). To over-ride the defaults,
$config["schemes"] is defined as a string of semi-colon-separated sub-strings of type
attribute: comma-separated schemes. E.g.,
href: mailto, http, https; onclick: javascript; src: http, https. For unspecified attributes,
file,
http and
https are permitted. This can be changed by passing schemes for
* in
$config["schemes"]. E.g.,
href: mailto, http, https; *: http, https.
* can be put in the list of schemes to permit all protocols. E.g.,
style: *; src: http, https results in protocols not being checked in
style attribute values. However, in such cases, any relative-to-absolute URL conversion, or vice versa, (
section 3.4.4) is not done.
Thus,
to allow Javascript, one can set
$config["schemes"] as
href: mailto, http, https; *: http, https, javascript, or
href: mailto, http, https, javascript; *: http, https, javascript, or
*: *, and so on.
As a side-note, one may find
style: * useful as URLs in
style attributes can be specified in a variety of ways, and the patterns that htmLawed uses to identify URLs may mistakenly identify non-URL text.
! can be put in the list of schemes to disallow all protocols as well as
local URLs. Thus, with
href: http; style: !, '<a href="http://cnn.com" style="background-image: url('local.jpg');">CNN</a>' will become '<a href="http://cnn.com" style="background-image: url('denied:local.jpg');">CNN</a>'.
Note: If URL-accepting attributes other than those listed above are being allowed, then the scheme will not be checked unless the attribute name contains the string
src (e.g.,
dynsrc) or starts with
o (e.g.,
onbeforecopy).
With
$config["safe"] = 1, all URLs are disallowed in the
style attribute values.
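Because the schemes value is a plain string, it can be convenient to assemble it from an array. The sketch below only builds the string in the format described above; build_schemes is a hypothetical helper that is not part of htmLawed, and the chosen attributes and schemes are examples.

```php
<?php
// Hypothetical helper (not part of htmLawed): builds a schemes string of the
// form "attribute: scheme, scheme; attribute: scheme" from an array.
function build_schemes(array $map){
  $parts = array();
  foreach($map as $attribute => $schemes){
    $parts[] = $attribute. ': '. implode(', ', $schemes);
  }
  return implode('; ', $parts);
}

$config = array();
$config['schemes'] = build_schemes(array(
  'href' => array('mailto', 'http', 'https'),
  'src'  => array('http', 'https'),
  '*'    => array('http', 'https'), // applies to unspecified attributes
));
```

The resulting $config array would then be passed as the second argument to htmLawed().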
3.4.4 Absolute & relative URLs in attribute values
htmLawed can make absolute URLs in attributes like
href relative (
$config["abs_url"] is
-1), and vice versa (
$config["abs_url"] is
1). URLs in scripts are not considered for this, and neither are URLs like
#section_6 (fragment),
?name=Tim#show (starting with query string), and
;var=1?name=Tim#show (starting with parameters). Further, this requires that
$config["base_url"] be set properly, with the
:// and a trailing slash (
/), with no query string, etc. E.g.,
file:///D:/page/,
https://abc.com/x/y/, or
http://localhost/demo/ are okay, but
file:///D:/page/?help=1,
abc.com/x/y/ and
http://localhost/demo/index.htm are not.
For making absolute URLs relative, only those URLs that have the
$config["base_url"] string at the beginning are converted. E.g., with
$config["base_url"] = "https://abc.com/x/y/",
https://abc.com/x/y/a.gif and
https://abc.com/x/y/z/b.gif become
a.gif and
z/b.gif respectively, while
https://abc.com/x/c.gif is not changed.
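The base URL requirements and the absolute-to-relative conversion can be illustrated with a short sketch. It is not htmLawed's implementation: is_usable_base_url and make_relative are hypothetical helpers that only mirror the rules stated above.

```php
<?php
// Hypothetical helper (not htmLawed's code): checks a base URL against the
// requirements above -- has '://', ends with '/', and has no query or fragment.
function is_usable_base_url($base){
  return strpos($base, '://') !== false
    && substr($base, -1) === '/'
    && strpos($base, '?') === false
    && strpos($base, '#') === false;
}

// Hypothetical helper: strips the base URL from the front of an absolute URL;
// URLs that do not begin with the base are returned unchanged.
function make_relative($url, $base){
  if(is_usable_base_url($base) && strpos($url, $base) === 0){
    return substr($url, strlen($base));
  }
  return $url;
}

$base = 'https://abc.com/x/y/';
$r1 = make_relative('https://abc.com/x/y/a.gif', $base);   // under the base
$r2 = make_relative('https://abc.com/x/y/z/b.gif', $base); // under the base
$r3 = make_relative('https://abc.com/x/c.gif', $base);     // outside the base
```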
When making relative URLs absolute, only the scheme, network location (host-name) and path values of the base URL are inherited. See
section 5.5 for more about the URL specification as per RFC
1808.
3.4.5 Lower-cased, standard attribute values
Optionally, for standard-compliance, htmLawed (function
hl_tag()) lower-cases standard attribute values to give, e.g.,
input type="password" instead of
input type="Password", if
$config["lc_std_val"] is
1. Attribute values matching those in the first list below, for any of the elements in the second list (plus those for the
type attribute of
button or
input) are lower-cased:
all, baseline, bottom, button, center, char, checkbox, circle, col, colgroup, cols, data, default, file, get, groups, hidden, image, justify, left, ltr, middle, none, object, password, poly, post, preserve, radio, rect, ref, reset, right, row, rowgroup, rows, rtl, submit, text, top
a, area, bdo, button, col, form, img, input, object, option, optgroup, param, script, select, table, td, tfoot, th, thead, tr, xml:space
The following
empty (
minimized) attributes are always assigned lower-cased values (same as the names):
checked, compact, declare, defer, disabled, ismap, multiple, nohref, noresize, noshade, nowrap, readonly, selected
3.4.6 Transformation of deprecated attributes
If
$config["no_deprecated_attr"] is
1 or
2, then deprecated attributes (see appendix in
section 5.2) are removed and, in most cases, their values are transformed to CSS style properties and added to the
style attributes (function
hl_tag()). Except for
bordercolor for
table,
tr and
td, the scores of proprietary attributes that were never part of any cross-browser standard are not supported.
Note: The attribute
target for
a is allowed even though it is not in XHTML 1.0 specs. This is because of the attribute's wide-spread use and browser-support, and because the attribute is valid in XHTML 1.1 onwards.
* align - for
img with value of
left or
right, becomes, e.g.,
float: left; for
div and
table with value
center, becomes
margin: auto; all others become, e.g.,
text-align: right
* bgcolor - E.g.,
bgcolor="#ffffff" becomes
background-color: #ffffff
* border - E.g.,
border="10" becomes
border: 10px
* bordercolor - E.g.,
bordercolor="#999999" becomes
border-color: #999999;
* compact -
font-size: 85%
* clear - E.g.,
clear="all" becomes
clear: both
* height - E.g.,
height="10" becomes
height: 10px and
height="*" becomes
height: auto
* hspace - E.g.,
hspace="10" becomes
margin-left: 10px; margin-right: 10px
* language -
language="VBScript" becomes
type="text/vbscript"
* name - E.g.,
name="xx" becomes
id="xx"
* noshade -
border-style: none; border: 0; background-color: gray; color: gray
* nowrap -
white-space: nowrap
* size - E.g.,
size="10" becomes
height: 10px
* start - removed
* type - E.g.,
type="i" becomes
list-style-type: lower-roman
* value - removed
* vspace - E.g.,
vspace="10" becomes
margin-top: 10px; margin-bottom: 10px
* width - like
height
Example input:
<img src="j.gif" alt="image" name="dad's" /><img src="k.gif" alt="image" id="dad_off" name="dad" />
<br clear="left" />
<hr noshade size="1" />
<img name="img" src="i.gif" align="left" alt="image" hspace="10" vspace="10" width="10em" height="20" border="1" style="padding:5px;" />
<table width="50em" align="center" bgcolor="red">
<tr>
<td width="20%">
<div align="center">
<h3 align="right">Section</h3>
<p align="right">Para</p>
<ol type="a" start="e"><li value="x">First item</li></ol>
</div>
</td>
<td width="*">
<ol type="1"><li>First item</li></ol>
</td>
</tr>
</table>
<br clear="all" />
And the output with
$config["no_deprecated_attr"] = 1:
<img src="j.gif" alt="image" /><img src="k.gif" alt="image" id="dad_off" />
<br style="clear: left;" />
<hr style="border-style: none; border: 0; background-color: gray; color: gray; size: 1px;" />
<img src="i.gif" alt="image" width="10em" height="20" style="padding:5px; float: left; margin-left: 10px; margin-right: 10px; margin-top: 10px; margin-bottom: 10px; border: 1px;" id="img" />
<table width="50em" style="margin: auto; background-color: red;">
<tr>
<td style="width: 20%;">
<div style="margin: auto;">
<h3 style="text-align: right;">Section</h3>
<p style="text-align: right;">Para</p>
<ol style="list-style-type: lower-latin;"><li>First item</li></ol>
</div>
</td>
<td style="width: auto;">
<ol style="list-style-type: decimal;"><li>First item</li></ol>
</td>
</tr>
</table>
<br style="clear: both;" />
For
lang, deprecated in XHTML 1.1, transformation is taken care of through
$config["xml:lang"]; see
section 3.4.1.
The attribute
name is deprecated in
form,
iframe, and
img, and is replaced with
id if an
id attribute doesn't exist and if the
name value is appropriate for
id. For such replacements for
a and
map, for which the
name attribute is deprecated in XHTML 1.1,
$config["no_deprecated_attr"] should be set to
2 (when set to
1, for these two elements, the
name attribute is retained).
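A few of the attribute-to-CSS mappings listed in this section can be written out as a small function to make the pattern concrete. This is an illustrative sketch only: deprecated_to_css is a hypothetical helper, covers just five attributes, and does not reproduce the logic of hl_tag().

```php
<?php
// Hypothetical helper (not htmLawed's code): maps a few deprecated attributes
// to the CSS declarations listed above; returns '' for attributes not covered.
function deprecated_to_css($name, $value){
  switch($name){
    case 'bgcolor':
      return 'background-color: '. $value. ';';
    case 'hspace':
      return 'margin-left: '. $value. 'px; margin-right: '. $value. 'px;';
    case 'vspace':
      return 'margin-top: '. $value. 'px; margin-bottom: '. $value. 'px;';
    case 'nowrap':
      return 'white-space: nowrap;';
    case 'clear':
      // 'all' has no CSS equivalent of the same name and becomes 'both'
      return 'clear: '. ($value == 'all' ? 'both' : $value). ';';
  }
  return '';
}

$style = deprecated_to_css('bgcolor', 'red'). ' '. deprecated_to_css('clear', 'all');
```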
3.4.7 Anti-spam & href
htmLawed (function
hl_tag()) can check the
href attribute values (link addresses) as an anti-spam (email or link spam) measure.
If
$config["anti_mail_spam"] is not
0, the
@ of email addresses in
href values like
mailto:a@b.com is replaced with text specified by
$config["anti_mail_spam"]. The text should be of a form that makes it clear to others that the address needs to be edited before a mail is sent; e.g.,
<remove_this_antispam>@ (makes the example address
a<remove_this_antispam>@b.com).
For regular links, one can choose to have a
rel attribute with
nofollow in its value (which tells some search engines to not follow a link). This can discourage link spammers. Additionally, or as an alternative, one can choose to empty the
href value altogether (disable the link).
For use of these options,
$config["anti_link_spam"] should be set as an array with values
regex1 and
regex2, both or one of which can be empty (like
array("", "regex2")) to indicate that that option is not to be used. Otherwise,
regex1 or
regex2 should be PHP- and PCRE-compatible regular expression patterns:
href values will be matched against them and those matching the pattern will accordingly be treated.
Note that the regular expressions should have
delimiters, and be well-formed and preferably fast. Absolute efficiency/accuracy is often not needed.
An example, to have a
rel attribute with
nofollow for all links, and to disable links that do not point to domains
abc.com and
xyz.org:
$config["anti_link_spam"] = array('`.`', '`://\W*(?!(abc\.com|xyz\.org))`');
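Before deploying such patterns, it may help to try them against a few sample addresses with PHP's preg_match(). The sketch below uses the two patterns from the example above; the sample links and the $report array are made up for illustration, and the str_replace() call only mimics the described effect of anti_mail_spam rather than calling htmLawed.

```php
<?php
// The two patterns from the example above, with backtick delimiters
$regex1 = '`.`';                             // matches any non-empty href value
$regex2 = '`://\W*(?!(abc\.com|xyz\.org))`'; // matches hosts other than abc.com and xyz.org

$config = array(
  'anti_mail_spam' => '<remove_this_antispam>@',
  'anti_link_spam' => array($regex1, $regex2),
);

// Made-up sample links to check how each pattern would classify them
$links = array('http://abc.com/page', 'https://xyz.org/', 'http://spam.example/buy');
$report = array();
foreach($links as $href){
  $report[$href] = array(
    'nofollow' => preg_match($regex1, $href) === 1,
    'disabled' => preg_match($regex2, $href) === 1,
  );
}

// Mimics the described replacement of '@' in a mailto address
$masked = str_replace('@', $config['anti_mail_spam'], 'mailto:a@b.com');
```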
3.4.8 Inline style properties
htmLawed can check URL schemes and dynamic expressions (to guard against Javascript, etc., script-based insecurities) in inline CSS style property values in the
style attributes. (CSS properties like
background-image that accept URLs in their values are noted in
section 5.3.) Dynamic CSS expressions that allow scripting in the IE browser, and can be a vulnerability, can be removed from property values by setting
$config["css_expression"] to
1 (default setting). Note that when
$config["css_expression"] is set to
1, htmLawed will remove
/* from the
style values.
Note: Because of the various ways of representing characters in attribute values (URL-escapement, entitification, etc.), htmLawed might alter the values of the
style attributes, and may even falsely identify dynamic CSS expressions and URL schemes in them. If this is an important issue, checking of URLs and dynamic expressions can be turned off (
$config["schemes"] = "...style:*...", see
section 3.4.3, and
$config["css_expression"] = 0). Alternatively, admins can use their own custom function for finer handling of
style values through the
hook_tag parameter (see
section 3.4.9).
It is also possible to have htmLawed let through any
style value by setting
$config["style_pass"] to
1.
As such, it is better to set up a CSS file with class declarations, disallow the
style attribute, set a
$spec rule (see
section 2.3) for
class for the
oneof or
match parameter, and ask writers to make use of the
class attribute.
3.4.9 Hook function for tag content
It is possible to utilize a custom hook function to alter the tag content htmLawed has finalized (i.e., after it has checked/corrected for required attributes, transformed attributes, lower-cased attribute names, etc.).
When
$config parameter
hook_tag is set to the name of a function, htmLawed (function
hl_tag()) will pass on the element name, and, in the case of an opening tag, the
finalized attribute name-value pairs as array elements to the function. The function, after completing a task such as filtering or tag transformation, will typically return an empty string, the full opening tag string like
<element_name attribute_1_name="attribute_1_value"...> (for empty elements like
img and
input, the element-closing slash
/ should also be included), etc.
Any
hook_tag function, since htmLawed version 1.1.11, also receives names of elements in closing tags, such as
a in the closing
</a> tag of the element
<a href="http://cnn.com">CNN</a>. Unlike for opening tags, no other value (i.e., the attribute name-value array) is passed to the function since a closing tag contains only element names. Typically, the function will return an empty string or a full closing tag (like
</a>).
This is a
powerful functionality that can be exploited for various objectives: consolidate-and-convert inline
style attributes to
class, convert
embed elements to
object, permit only one
caption element in a
table element, disallow embedding of certain types of media,
inject HTML, use
CSSTidy to sanitize
style attribute values, etc.
As an example, the custom hook code below can be used to force a series of specifically ordered
id attributes on all elements, and a specific
param element inside all
object elements:
function my_tag_function($element, $attribute_array=0){
  // If second argument is not received, it means a closing tag is being handled
  if(is_numeric($attribute_array)){
    return "</$element>";
  }
  static $id = 0;
  // Remove any duplicate element
  if($element == 'param' && isset($attribute_array['allowscriptaccess'])){
    return '';
  }
  $new_element = '';
  // Force a serialized ID number
  $attribute_array['id'] = 'my_'. $id;
  ++$id;
  // Inject param for allowscriptaccess
  if($element == 'object'){
    $new_element = '<param id="my_'. $id. '" allowscriptaccess="never" />';
    ++$id;
  }
  $string = '';
  foreach($attribute_array as $k=>$v){
    $string .= " {$k}=\"{$v}\"";
  }
  static $empty_elements = array('area'=>1, 'br'=>1, 'col'=>1, 'embed'=>1, 'hr'=>1, 'img'=>1, 'input'=>1, 'isindex'=>1, 'param'=>1);
  return "<{$element}{$string}". (isset($empty_elements[$element]) ? ' /' : ''). '>'. $new_element;
}
The
hook_tag parameter is different from the
hook parameter (
section 3.7).
Snippets of hook function code developed by others may be available on the
htmLawed website.
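A smaller hook_tag function may make the calling convention easier to see, in particular the handling of closing tags. The sketch below renames b and i elements to strong and em in both opening and closing tags; rename_tag_hook and its rename table are hypothetical, and the function is written only against the behavior described in this section.

```php
<?php
// Hypothetical hook_tag function: renames 'b' to 'strong' and 'i' to 'em'
// in both opening and closing tags, leaving attributes as received.
function rename_tag_hook($element, $attribute_array = 0){
  static $rename = array('b'=>'strong', 'i'=>'em');
  if(isset($rename[$element])){
    $element = $rename[$element];
  }
  // No attribute array received: a closing tag is being handled
  if(!is_array($attribute_array)){
    return "</{$element}>";
  }
  $string = '';
  foreach($attribute_array as $name => $value){
    $string .= " {$name}=\"{$value}\"";
  }
  // Empty elements need the element-closing slash
  $empty = array('area'=>1, 'br'=>1, 'col'=>1, 'embed'=>1, 'hr'=>1, 'img'=>1, 'input'=>1, 'isindex'=>1, 'param'=>1);
  return "<{$element}{$string}". (isset($empty[$element]) ? ' /' : ''). '>';
}

// The function name is what gets passed in the configuration
$config = array('hook_tag' => 'rename_tag_hook');
```

Because the same table is consulted for opening and closing tags, the two always stay in step.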
3.5 Simple configuration directive for most valid XHTML
If
$config["valid_xhtml"] is set to
1, some relevant
$config parameters (indicated by
~ in
section 2.2) are auto-adjusted. This allows one to pass the
$config argument with a simpler value. If a value for a parameter auto-set through
valid_xhtml is still manually provided, then that value will over-ride the auto-set value.
3.6 Simple configuration directive for most safe HTML
Safe HTML refers to HTML that is restricted to reduce the vulnerability for scripting attacks (such as XSS) based on HTML code which otherwise may still be legal and compliant with the HTML standard specs. When elements such as
script and
object, and attributes such as
onmouseover and
style are allowed in the input text, an input writer can introduce malevolent HTML code. Note that what is considered
safe depends on the nature of the web application and the trust-level accorded to its users.
htmLawed allows an admin to use
$config["safe"] to auto-adjust multiple
$config parameters (such as
elements which declares the allowed element-set), which otherwise would have to be manually set. The relevant parameters are indicated by
" in
section 2.2. Thus, one can pass the
$config argument with a simpler value.
With the value of
1, htmLawed considers
CDATA sections and HTML comments as plain text, and prohibits the
applet,
embed,
iframe,
object and
script elements, and the
on* attributes like
onclick. (There are
$config parameters like
css_expression that are not affected by the value set for
safe but whose default values still contribute towards a more
safe output.) Further, URLs with schemes (see
section 3.4.3) are neutralized so that, e.g.,
style="moz-binding:url(http://danger)" becomes
style="moz-binding:url(denied:http://danger)".
Admins, however, may still want to completely deny the
style attribute, e.g., with code like
$processed = htmLawed($text, array('safe'=>1, 'deny_attribute'=>'style'));
Permitting the
style attribute brings in risks of
click-jacking, etc. CSS property values can render a page non-functional or be used to deface it. Except for URLs, dynamic expressions, and some other things, htmLawed does not completely check
style values. It does provide ways for the code-developer implementing htmLawed to do such checks through the
$spec argument, and through the
hook_tag parameter (see
section 3.4.8 for more). Disallowing style completely and relying on CSS classes and stylesheet files is recommended.
If a value for a parameter auto-set through
safe is still manually provided, then that value can over-ride the auto-set value. E.g., with
$config["safe"] = 1 and
$config["elements"] = "*+script",
script, but not
applet, is allowed.
A page illustrating the efficacy of htmLawed's anti-XSS abilities with
safe set to
1 against XSS vectors listed by
RSnake may be available
here.
3.7 Using a hook function
If
$config["hook"] is not set to
0, then htmLawed will allow preliminarily processed input to be altered by a hook function named by
$config["hook"] before starting the main work (but after handling of characters, entities, HTML comments and
CDATA sections -- see code for function
htmLawed()).
The hook function also allows one to alter the
finalized values of
$config and
$spec.
Note that the
hook parameter is different from the
hook_tag parameter (
section 3.4.9).
Snippets of hook function code developed by others may be available on the
htmLawed website.
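A hook function of this kind might look like the sketch below. The argument list (the text first, then $config and $spec) is an assumption and should be checked against the code of htmLawed(); my_hook and the clean-up it performs are purely illustrative.

```php
<?php
// Hypothetical hook function. The argument list (text, then $config and $spec)
// is an assumption; check the code of htmLawed() for the exact call.
function my_hook($text, $config = array(), $spec = array()){
  // Example task: collapse runs of three or more line breaks into two
  return preg_replace('`(\r?\n){3,}`', "\n\n", $text);
}

// The function name is what gets passed in the configuration
$config = array('hook' => 'my_hook');
```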
3.8 Obtaining finalized parameter values
htmLawed can assign the
finalized $config and
$spec values to a variable named by
$config["show_setting"]. The variable, made global by htmLawed, is set as an array with three keys:
config, with the
$config value,
spec, with the
$spec value, and
time, with a value that is the Unix time (the output of PHP's
microtime() function) when the value was assigned. Admins should use a PHP-compliant variable name (e.g., one that does not begin with a numerical digit) that does not conflict with variable names in their non-htmLawed code.
The values, which are also post-hook function (if any), can be used to auto-generate information (on, e.g., the elements that are permitted) for input writers.
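Reading the finalized values back might look like the sketch below. It assumes htmLawed.php has already been included, so the call is guarded with function_exists(); the variable name hl_settings is an arbitrary choice made for this illustration.

```php
<?php
// 'hl_settings' is an arbitrary, PHP-compliant variable name chosen for this sketch
$config = array('safe'=>1, 'show_setting'=>'hl_settings');

// Assumes htmLawed.php has been included; nothing is done otherwise
if(function_exists('htmLawed')){
  $processed = htmLawed('<em>text</em>', $config);
  // The named variable is made global by htmLawed
  $finalized_config = $GLOBALS['hl_settings']['config'];
  $finalized_spec   = $GLOBALS['hl_settings']['spec'];
  $assigned_at      = $GLOBALS['hl_settings']['time'];
}
```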
3.9 Retaining non-HTML tags in input with mixed markup
htmLawed does not remove certain characters that, though valid, are nevertheless
discouraged in HTML documents as per the specifications (see
section 5.1). This can be utilized to deal with input that contains mixed markup. Input that may have HTML markup as well as some other markup that is based on the
<,
> and
& characters is considered to have mixed markup. The non-HTML markup can be rather proprietary (like markup for emoticons/smileys), or standard (like MathML or SVG). Or it can be programming code meant for execution/evaluation (such as embedded PHP code).
To deal with such mixed markup, the input text can be pre-processed to hide the non-HTML markup by specifically replacing the
<,
> and
& characters with some of the HTML-discouraged characters (see
section 3.1.2). Post-htmLawed processing, the replacements are reverted.
An example (mixed HTML and PHP code in input text):
$text = preg_replace('`<\?php(.+?)\?>`sm', "\x83?php\\1?\x84", $text);
$processed = htmLawed($text);
$processed = preg_replace('`\x83\?php(.+?)\?\x84`sm', '<?php$1?>', $processed);
This code will not work if
$config["clean_ms_char"] is set to
1 (
section 3.1), in which case one should instead deploy a hook function (
section 3.7). (htmLawed internally uses certain control characters, code-points
1 to
7, and use of these characters as markers in the logic of hook functions may cause issues.)
Admins may also be able to use
$config["and_mark"] to deal with such mixed markup; see
section 3.2.
4 Other
4.3 Change-log
(The release date for the downloadable package of files containing documentation, demo script, test-cases, etc., besides the
htmLawed.php file, may be updated without a change-log entry if the secondary files, but not htmLawed per se, are revised.)
Version number - Release date. Notes
1.1.14 - 8 August 2012. Fix for possible segmental loss of incremental indentation during
tidying when
balance is disabled; fix for non-effectuation under some circumstances of a corrective behavior to preserve plain text within elements like
blockquote.
1.1.13 - 22 July 2012. Added feature allowing use of custom, non-standard attributes or custom rules for standard attributes
1.1.12 - 5 July 2012. Fix for a bug in identifying an unquoted value of the
face attribute
1.1.11 - 5 June 2012. Fix for possible problem with handling of multi-byte characters in attribute values in an mbstring.func_overload environment.
$config["hook_tag"], if specified, now receives names of elements in closing tags.
1.1.10 - 22 October 2011. Fix for a bug in the
tidy functionality that caused the entire input to be replaced with a single space; new parameter,
$config["direct_list_nest"] to allow direct descendance of a list in a list. (5 April 2012. Dual licensing from LGPLv3 to LGPLv3 and GPLv2+.)
1.1.9.5 - 6 July 2011. Minor correction of a rule for nesting of
li within
dir
1.1.9.4 - 3 July 2010. Parameter
schemes now accepts
! so any URL, even a local one, can be
denied. An issue in which a second URL value in
style properties was not checked was fixed.
1.1.9.3 - 17 May 2010. Checks for correct nesting of
param
1.1.9.2 - 26 April 2010. Minor fix regarding rendering of denied URL schemes
1.1.9.1 - 26 February 2010. htmLawed now uses the LGPL version 3 license; support for
flashvars attribute for
embed
1.1.9 - 22 December 2009. Soft-hyphens are now removed only from URL-accepting attribute values
1.1.8.1 - 16 July 2009. Minor code-change to fix a PHP error notice
1.1.8 - 23 April 2009. Parameter
deny_attribute now accepts the wild-card
*, making it simpler to specify its value when all but a few attributes are being denied; fixed a bug in interpreting
$spec
1.1.7 - 11-12 March 2009. Attributes globally denied through
deny_attribute can be allowed element-specifically through
$spec;
$config["style_pass"], which lets through any
style value, introduced; altered logic to catch certain types of dynamic crafted CSS expressions
1.1.3-6 - 28-31 January - 4 February 2009. Altered logic to catch certain types of dynamic crafted CSS expressions
1.1.2 - 22 January 2009. Fixed bug in parsing of
font attributes during tag transformation
1.1.1 - 27 September 2008. Better nesting correction when omittable closing tags are absent
1.1 - 29 June 2008.
$config["hook_tag"] and
$config["tidy"] introduced for custom tag/attribute check/modification/injection and output compaction/beautification; fixed a regex-in-$spec parsing bug
1.0.9 - 11 June 2008. Fix for a bug in checks for invalid HTML code-point entities
1.0.8 - 15 May 2008. Permit
bordercolor attribute for
table,
td and
tr
1.0.7 - 1 May 2008. Support for
wmode attribute for
embed;
$config["show_setting"] introduced; improved
$config["elements"] evaluation
1.0.6 - 20 April 2008.
$config["and_mark"] introduced
1.0.5 - 12 March 2008.
style URL schemes essentially disallowed when $config
safe is on; improved regex for CSS expression search
1.0.4 - 10 March 2008. Improved corrections for
blockquote,
form,
map and
noscript
1.0.3 - 3 March 2008. Character entities for soft-hyphens are now replaced with spaces (instead of being removed); fix for a bug allowing
td directly inside
table;
$config["safe"] introduced
1.0.2 - 13 February 2008. Improved implementation of
$config["keep_bad"]
1.0.1 - 7 November 2007. Improved regex for identifying URLs, protocols and dynamic expressions (
hl_tag() and
hl_prot()); no error display with
hl_regex()
1.0 - 2 November 2007. First release
4.4 Testing
To test htmLawed using a form interface, a
demo web-page is provided with the htmLawed distribution (
htmLawed.php and
htmLawedTest.php should be in the same directory on the web-server). A file with
test-cases is also provided.
4.5 Upgrade, & old versions
Upgrading is as simple as replacing the previous version of
htmLawed.php (assuming it was not modified for customized features). As htmLawed output is almost always used in static documents, upgrading should not affect old, finalized content.
Important: The following upgrades may affect the functionality of a specific htmLawed installation:
(1) From version 1.1-1.1.10 to 1.1.11 (or later), if a
hook_tag function is in use: In version 1.1.11, elements in closing tags (and not just the opening tags) are also passed to the function. There are no attribute names/values to pass, so a
hook_tag function receives only the element name. The
hook_tag function therefore may have to be edited. See
section 3.4.9.
Old versions of htmLawed may be available online. E.g., for version 1.0, check
http://www.bioinformatics.org/phplabware/downloads/htmLawed1.zip, for 1.1.1, htmLawed111.zip, and for 1.1.10, htmLawed1110.zip.
4.6 Comparison with HTMLPurifier
The HTMLPurifier PHP library by Edward Yang is a very good HTML filtering script that uses object oriented PHP code. Compared to htmLawed, it (as of year 2010):
* does not support PHP versions older than 5.0 (HTMLPurifier dropped PHP 4 support after version 2)
* is 15-20 times bigger (scores of files totalling more than 750 kb)
* consumes 10-15 times more RAM (just including the HTMLPurifier files without calling the filter requires a few MBs of memory)
* is expectedly slower
* does not allow admins to fully allow all valid HTML (because of incomplete HTML support, it always considers elements like
script illegal)
* lacks many of the extra features of htmLawed (like entity conversions and code compaction/beautification)
* has poor documentation
However, HTMLPurifier has finer checks for character encodings and attribute values, and can log warnings and errors. Visit the HTMLPurifier
website for updated information.
4.7 Use through application plug-ins/modules
Plug-ins/modules to implement htmLawed in applications such as Drupal and DokuWiki may have been developed. Please check the application websites and the forum on the htmLawed
site.
4.8 Use in non-PHP applications
Non-PHP applications written in Python, Ruby, etc., may be able to use htmLawed through system calls to the PHP engine. Such code may have been documented on the internet. Also check the forum on the htmLawed
site.
4.9 Donate
A donation in any currency and amount to appreciate or support this software can be sent by
PayPal to this email address: drpatnaik at yahoo dot com.
4.10 Acknowledgements
Nicholas Alipaz, Bryan Blakey, Pádraic Brady, Ulf Harnhammer, Gareth Heyes, Klaus Leithoff, Lukasz Pilorz, Shelley Powers, Edward Yang, and many anonymous users.
Thank you!
5 Appendices
5.1 Characters discouraged in XHTML
Characters represented by the following hexadecimal code-points are
not invalid, even though some validators may issue messages stating otherwise.
7f to
84,
86 to
9f,
fdd0 to
fddf,
1fffe,
1ffff,
2fffe,
2ffff,
3fffe,
3ffff,
4fffe,
4ffff,
5fffe,
5ffff,
6fffe,
6ffff,
7fffe,
7ffff,
8fffe,
8ffff,
9fffe,
9ffff,
afffe,
affff,
bfffe,
bffff,
cfffe,
cffff,
dfffe,
dffff,
efffe,
effff,
ffffe,
fffff,
10fffe and
10ffff
5.2 Valid attribute-element combinations
Valid attribute-element combinations as per
W3C specs.
* includes deprecated attributes (marked
^), attributes for the non-standard
embed element (marked
*), and the proprietary
bordercolor (marked
~)
* only non-frameset, HTML body elements
*
name for
a and
map, and
lang are invalid in XHTML 1.1
*
target is valid for
a in XHTML 1.1 and higher
*
xml:space is only for XHTML 1.1
abbr - td, th
accept - form, input
accept-charset - form
accesskey - a, area, button, input, label, legend, textarea
action - form
align - caption^, embed, applet, iframe, img^, input^, object^, legend^, table^, hr^, div^, h1^, h2^, h3^, h4^, h5^, h6^, p^, col, colgroup, tbody, td, tfoot, th, thead, tr
alt - applet, area, img, input
archive - applet, object
axis - td, th
bgcolor - embed, table^, tr^, td^, th^
border - table, img^, object^
bordercolor~ - table, td, tr
cellpadding - table
cellspacing - table
char - col, colgroup, tbody, td, tfoot, th, thead, tr
charoff - col, colgroup, tbody, td, tfoot, th, thead, tr
charset - a, script
checked - input
cite - blockquote, q, del, ins
classid - object
clear - br^
code - applet
codebase - object, applet
codetype - object
color - font
cols - textarea
colspan - td, th
compact - dir, dl^, menu, ol^, ul^
coords - area, a
data - object
datetime - del, ins
declare - object
defer - script
dir - bdo
disabled - button, input, optgroup, option, select, textarea
enctype - form
face - font
flashvars* - embed
for - label
frame - table
frameborder - iframe
headers - td, th
height - embed, iframe, td^, th^, img, object, applet
href - a, area
hreflang - a
hspace - applet, img^, object^
ismap - img, input
label - option, optgroup
language - script^
longdesc - img, iframe
marginheight - iframe
marginwidth - iframe
maxlength - input
method - form
model* - embed
multiple - select
name - button, embed, textarea, applet^, select, form^, iframe^, img^, a^, input, object, map^, param
nohref - area
noshade - hr^
nowrap - td^, th^
object - applet
onblur - a, area, button, input, label, select, textarea
onchange - input, select, textarea
onfocus - a, area, button, input, label, select, textarea
onreset - form
onselect - input, textarea
onsubmit - form
pluginspage* - embed
pluginurl* - embed
prompt - isindex
readonly - textarea, input
rel - a
rev - a
rows - textarea
rowspan - td, th
rules - table
scope - td, th
scrolling - iframe
selected - option
shape - area, a
size - hr^, font, input, select
span - col, colgroup
src - embed, script, input, iframe, img
standby - object
start - ol^
summary - table
tabindex - a, area, button, input, object, select, textarea
target - a^, area, form
type - a, embed, object, param, script, input, li^, ol^, ul^, button
usemap - img, input, object
valign - col, colgroup, tbody, td, tfoot, th, thead, tr
value - input, option, param, button, li^
valuetype - param
vspace - applet, img^, object^
width - embed, hr^, iframe, img, object, table, td^, th^, applet, col, colgroup, pre^
wmode - embed
xml:space - pre, script, style
These are allowed in all but the shown elements:
class - param, script
dir - applet, bdo, br, iframe, param, script
id - script
lang - applet, br, iframe, param, script
onclick - applet, bdo, br, font, iframe, isindex, param, script
ondblclick - applet, bdo, br, font, iframe, isindex, param, script
onkeydown - applet, bdo, br, font, iframe, isindex, param, script
onkeypress - applet, bdo, br, font, iframe, isindex, param, script
onkeyup - applet, bdo, br, font, iframe, isindex, param, script
onmousedown - applet, bdo, br, font, iframe, isindex, param, script
onmousemove - applet, bdo, br, font, iframe, isindex, param, script
onmouseout - applet, bdo, br, font, iframe, isindex, param, script
onmouseover - applet, bdo, br, font, iframe, isindex, param, script
onmouseup - applet, bdo, br, font, iframe, isindex, param, script
style - param, script
title - param, script
xml:lang - applet, br, iframe, param, script
5.3 CSS 2.1 properties accepting URLs
background
background-image
content
cue-after
cue-before
cursor
list-style
list-style-image
play-during
5.4 Microsoft Windows 1252 character replacements
Key:
d double,
l left,
q quote,
r right,
s. single
Code-point (decimal) - hexadecimal value - replacement entity - represented character
127 - 7f - (removed) - (not used)
128 - 80 - € - euro
129 - 81 - (removed) - (not used)
130 - 82 - ‚ - baseline s. q
131 - 83 - ƒ - florin
132 - 84 - „ - baseline d q
133 - 85 - … - ellipsis
134 - 86 - † - dagger
135 - 87 - ‡ - d dagger
136 - 88 - ˆ - circumflex accent
137 - 89 - ‰ - per mille
138 - 8a - Š - S Hacek
139 - 8b - ‹ - l s. guillemet
140 - 8c - Œ - OE ligature
141 - 8d - (removed) - (not used)
142 - 8e - Ž - Z dieresis
143 - 8f - (removed) - (not used)
144 - 90 - (removed) - (not used)
145 - 91 - ‘ - l s. q
146 - 92 - ’ - r s. q
147 - 93 - “ - l d q
148 - 94 - ” - r d q
149 - 95 - • - bullet
150 - 96 - – - en dash
151 - 97 - — - em dash
152 - 98 - ˜ - tilde accent
153 - 99 - ™ - trademark
154 - 9a - š - s Hacek
155 - 9b - › - r s. guillemet
156 - 9c - œ - oe ligature
157 - 9d - (removed) - (not used)
158 - 9e - ž - z Hacek
159 - 9f - Ÿ - Y dieresis
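The table above amounts to a byte-for-replacement lookup. A minimal sketch of that idea follows, covering only a subset of the rows. The function name is hypothetical, and writing each replacement as a decimal numeric character reference is an assumption of this sketch, not something the table states. Byte-wise replacement like this is only meaningful for single-byte (Windows-1252) input; in UTF-8 text, bytes in this range are parts of multi-byte characters.

```php
<?php
// Hypothetical helper, not htmLawed's own code: replace a few Windows-1252
// bytes as listed in the table above. The numeric-entity form is an assumption.
function sketch_replace_cp1252(string $text): string
{
    $map = [
        "\x7f" => '',        // (not used): removed
        "\x80" => '&#8364;', // euro
        "\x81" => '',        // (not used): removed
        "\x85" => '&#8230;', // ellipsis
        "\x91" => '&#8216;', // left single quote
        "\x92" => '&#8217;', // right single quote
        "\x93" => '&#8220;', // left double quote
        "\x94" => '&#8221;', // right double quote
        "\x96" => '&#8211;', // en dash
        "\x97" => '&#8212;', // em dash
        "\x99" => '&#8482;', // trademark
    ];
    // strtr() with an array replaces every listed byte in one pass.
    return strtr($text, $map);
}

echo sketch_replace_cp1252("\x93Hi\x94 \x96 5\x80\x81"), "\n";
// prints: &#8220;Hi&#8221; &#8211; 5&#8364;
```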
5.5 URL format
An absolute URL has a protocol or scheme, a network location or hostname, and optional path, parameters, query and fragment segments. Thus, an absolute URL has this generic structure:
(scheme) : (//network location) /(path) ;(parameters) ?(query) #(fragment)
A scheme can contain only letters, digits, +, . and -. When the : is followed by //, the hostname is the portion after the // and up to the first / (or up to the end if there is none); e.g., abc.com in ftp://abc.com/def. Otherwise, it consists of everything after the : (e.g., def@abc.com in mailto:def@abc.com).
Relative URLs do not have explicit schemes and network locations; such values are inherited from a base URL.
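The scheme and hostname rules above can be sketched in a few lines of PHP. This is an illustration only: the function name is hypothetical, it is not how htmLawed itself parses URLs, and it does not separate the path, parameters, query or fragment segments.

```php
<?php
// Hypothetical helper, not part of htmLawed: split a URL into scheme and
// hostname following the rules described in this section.
function sketch_split_url(string $url): array
{
    // A scheme has only letters, digits, "+", "." and "-", and ends at the first ":".
    if (!preg_match('/^([a-z\d+.-]+):(.*)$/is', $url, $m)) {
        // No explicit scheme: a relative URL, which inherits from a base URL.
        return ['scheme' => null, 'host' => null];
    }
    $scheme = strtolower($m[1]);
    $rest = $m[2];
    if (strncmp($rest, '//', 2) === 0) {
        // ":" followed by "//": hostname runs up to the first "/" or to the end.
        $rest = (string) substr($rest, 2);
        $slash = strpos($rest, '/');
        $host = $slash === false ? $rest : substr($rest, 0, $slash);
    } else {
        // Otherwise the hostname is everything after the ":".
        $host = $rest;
    }
    return ['scheme' => $scheme, 'host' => $host];
}
```

With this sketch, ftp://abc.com/def yields the scheme ftp and the hostname abc.com, and mailto:def@abc.com yields mailto and def@abc.com, matching the examples in the text.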
5.6 Brief on htmLawed code
Much of the code's logic and reasoning can be understood from the documentation above.
The output of htmLawed is a text string containing the processed input. There is no custom error tracking.
Function arguments for htmLawed are:
* $in - first argument; a text string; the input text to be processed. Any extraneous slashes added by PHP when magic quotes are enabled should be removed beforehand using PHP's stripslashes() function.
* $config - second argument; an associative array; optional; named $C within htmLawed code. The array has keys with names like balance and keep_bad. The values, which can be boolean, string, or array depending on the key, are read to set the configurable parameters that the keys indicate. Any configurable parameter for which the user does not specify a value through $config receives a default value. Finalized $config is thus a filtered and possibly larger array.
* $spec - third argument; a text string; optional. The string has rules, written in an htmLawed-designated format, specifying element-specific attribute and attribute value restrictions. Function hl_spec() is used to convert the string to an associative array, named $S within htmLawed code, for internal use. Finalized $spec is thus an array.
Finalized $config and $spec are made global variables while htmLawed is at work. Values of any pre-existing global variables with the same names are noted and restored after htmLawed finishes processing the input (to capture the finalized values, the show_settings parameter of $config should be used). Depending on $config, another global variable, hl_Ids, may be set to track id attribute values for uniqueness. Unlike the other two variables, this one is not reset (or unset) post-processing.
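The note-and-restore handling of global variables described above follows a common PHP pattern, sketched below. The helper name is hypothetical and this is not htmLawed's actual code. The text does not say what happens when no earlier global of the same name existed; the sketch unsets the variable in that case, which is an assumption.

```php
<?php
// Hypothetical helper, not htmLawed's code: expose $value as the global $name
// while $work runs, then put back whatever was there before.
function sketch_with_global(string $name, $value, callable $work)
{
    $had = array_key_exists($name, $GLOBALS);
    $old = $had ? $GLOBALS[$name] : null;  // note any pre-existing value
    $GLOBALS[$name] = $value;              // "globalize" the finalized value
    try {
        return $work();
    } finally {
        if ($had) {
            $GLOBALS[$name] = $old;        // restore the caller's value
        } else {
            unset($GLOBALS[$name]);        // assumption: remove it if none existed
        }
    }
}
```

A caller that already has its own global $C would thus find it unchanged after processing, which is the behavior the paragraph describes for $config and $spec.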
Except for the main function htmLawed() and the functions kses() and kses_hook(), htmLawed's functions are name-spaced using the hl_ prefix. The functions and their roles are:
* hl_attrval - checking attribute values against $spec
* hl_bal - tag balancing
* hl_cmtcd - handling CDATA sections and HTML comments
* hl_ent - entity handling
* hl_prot - checking a URL scheme/protocol
* hl_regex - checking syntax of a regular expression
* hl_spec - converting user-supplied $spec value to one used by htmLawed internally
* hl_tag - handling tags
* hl_tag2 - transforming tags
* hl_tidy - compact/beautify HTML
* hl_version - reporting htmLawed version
* htmLawed - main function
* kses - main function of kses
* kses_hook - hook function of kses
The last two are for compatibility with pre-existing code that uses the kses script. htmLawed's kses() basically passes on the filtering task to the htmLawed() function after deciphering $config and $spec from the argument values supplied to it. kses_hook() is an empty function meant to be filled with custom code if users of the kses script had a hook function of their own.
htmLawed() finalizes $spec (with the help of hl_spec()) and $config, and globalizes them. Finalization of $config involves setting a default value wherever an inappropriate or invalid one is supplied. This includes calling hl_regex() to check the well-formedness of regular expression patterns if such expressions are user-supplied through $config.
htmLawed() then removes invalid characters like nulls and x01 and appropriately handles entities using hl_ent(). HTML comments and CDATA sections are identified and treated as per $config with the help of hl_cmtcd(). When retained, the < and > characters identifying them, and the <, > and & characters inside them, are replaced with control characters (code-points 1 to 5) until any tag balancing is completed.
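The masking step above keeps markup-like characters inside comments from being mistaken for tags during later processing. A minimal sketch of the idea, for HTML comments only, is given below. The function names are hypothetical, and the assignment of particular code-points to particular characters is an assumption; the text says only that code-points 1 to 5 are used.

```php
<?php
// Hypothetical sketch, not htmLawed's code. Assumed mapping: delimiters
// "<" and ">" become \x01 and \x02; inner "&", "<", ">" become \x03, \x04, \x05.
function sketch_mask_comments(string $html): string
{
    return preg_replace_callback(
        '/<(!--.*?--)>/s',
        function (array $m): string {
            $inner = strtr($m[1], ['&' => "\x03", '<' => "\x04", '>' => "\x05"]);
            return "\x01" . $inner . "\x02";
        },
        $html
    );
}

// Reverse the masking once tag processing and balancing are done.
function sketch_unmask(string $text): string
{
    return strtr($text, [
        "\x01" => '<', "\x02" => '>', "\x03" => '&', "\x04" => '<', "\x05" => '>',
    ]);
}
```

The sketch relies on the input already being free of these control characters, which matches the order of steps in the text: invalid characters are removed before comments and CDATA sections are handled.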
After this initial processing, htmLawed() identifies tags using regex and processes them with the help of hl_tag(), a large function that analyzes tag content, filtering it as per HTML standards, $config and $spec. Among other things, hl_tag() transforms deprecated elements using hl_tag2(), removes attributes from closing tags, checks attribute values as per $spec rules using hl_attrval(), and checks URL protocols using hl_prot().
htmLawed() performs tag balancing and nesting checks with a call to hl_bal(), and optionally compacts/beautifies the output with proper white-spacing with a call to hl_tidy(). The latter temporarily replaces white-space and the <, > and & characters inside pre, script and textarea elements, and inside HTML comments and CDATA sections, with control characters (code-points 1 to 5, and 7).
htmLawed permits the use of custom code or hook functions at two stages. The first, called inside htmLawed(), allows the input text as well as the finalized $config and $spec values to be altered right after the initial processing (see section 3.7). The second is called by hl_tag() once the tag content is finalized (see section 3.4.9).
The functionality of htmLawed is dictated by the external HTML standard. It is thus coded for a clear-cut objective with not much concern for tweakability. The code is only minimally annotated with comments -- it is not meant to instruct; PHP developers familiar with the HTML specifications will see the logic, and others can always refer to the htmLawed documentation. The compact structuring of the statements is meant to aid a quick grasp of the logic.
htmLawed 1.1.14, 8 August 2012
Copyright Santosh Patnaik
Dual licensed with LGPL 3 and GPL 2+
A PHP Labware internal utility - http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed