WebApp Sec mailing list archives

Re: Preventing cross site scripting


From: "Tim Greer" <chatmaster () charter net>
Date: Fri, 20 Jun 2003 13:49:04 -0700



----- Original Message -----
From: "Laurian Gridinoc" <laur () grapefruitdesign com>
To: "Tim Greer" <chatmaster () charter net>
Cc: <webappsec () securityfocus com>
Sent: Friday, June 20, 2003 2:55 PM
Subject: Re: Preventing cross site scripting


On Fri, 2003-06-20 at 20:11, Tim Greer wrote:
Please provide some examples of this. I'd like to see your idea(s) at work
and how they would solve this problem. I'm honestly not quite clear on the
context in which you mean this to solve the problem, and I'm interested in
knowing. I'm not sure I agree right now, so some examples illustrating it
would be great--if you'd be so kind. Thanks.

This thread started with `how to safely export HTML mail messages to the
web'.
That may require dealing with some of the following issues:

1. broken markup (<ni <foo href="a"d"" bar='> baz> &quot no semicolon)
2. unacceptable entities
3. unacceptable tags (applet, object)
4. unacceptable attributes on acceptable tags (onmouseover, ...)
5. unacceptable attribute values (href="javascript:...", width="100000")
6. unacceptable text tokens (offensive words)

I suggest dealing with them in the stated order, and not treating the HTML
string as a mere string, but dissecting it into markup and content; clean
the markup (first elements, then attributes of the accepted elements), then
the text.

[1] is wonderfully solved by filtering through Tidy with XML (XHTML)
output - this becomes the input for the next steps.
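Tidy itself is an external tool, but the property the next steps depend
on - well-formed XML - is easy to check. A minimal sketch using only
Python's standard library (my own illustration, not part of the original
pipeline):

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """True if the string parses as XML (i.e. what Tidy's XHTML output
    should look like); False for tag soup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(is_well_formed('<p>ok</p>'))          # True
print(is_well_formed('<p><b>broken</p>'))   # False
```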

The rest of the issues can be controlled by an XSL transformation on the
XML generated above.

[2] with a proper DTD you may alter the `rendering' of any unaccepted
entity. Let's say that I want to change &Acirc; (capital A, circumflex
accent) to a plain capital A instead, simply by defining it in the DTD:
<!ENTITY Acirc "A">

Note that &lt;, &gt;, &amp; and &quot; cannot be handled this way.
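The same fold-to-ASCII idea can be sketched outside the DTD as well; in
Python (my own illustration), Unicode decomposition does the equivalent of
mapping &Acirc; to A:

```python
import unicodedata

def ascii_fold(text: str) -> str:
    """Decompose accented letters (NFKD) and drop the combining marks,
    mirroring the DTD trick of remapping &Acirc; to a plain "A"."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(ascii_fold('\u00c2'))       # 'A'   (that is, &Acirc;)
print(ascii_fold('na\u00efve'))   # 'naive'
```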

[3] unacceptable tags. Nowadays it is preferable to use whitelists, but
let's look at a blacklist solution first:

<!-- drop script silently -->
<xsl:template match="script" />

<!-- or drop script and leave a note -->
<xsl:template match="script">
    <xsl:comment>here was an evil script</xsl:comment>
</xsl:template>

<!-- drop applet but preserve its content (e.g. the `backup' markup for
user agents that don't understand the applet tag) -->
<xsl:template match="applet">
    <xsl:apply-templates />
</xsl:template>

<!-- and accept everything since this is a blacklist solution -->
<xsl:template match="*|@*|text()|comment()">
    <xsl:copy>
        <xsl:apply-templates select="*|@*|text()|comment()" />
    </xsl:copy>
</xsl:template>

The whitelist solution would match only accepted tags:

<!-- accept only p, ul, li and attributes on them (and text nodes too,
and comments) -->
<xsl:template match="p|ul|li|@*|text()|comment()">
    <xsl:copy>
        <xsl:apply-templates select="*|@*|text()|comment()" />
    </xsl:copy>
</xsl:template>
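The same whitelist idea can be sketched in plain Python with the standard
library's DOM-ish tree (the tag names and the drop-vs-unwrap policy below
are my own choices, not from the thread):

```python
import xml.etree.ElementTree as ET

ALLOWED_TAGS = {'p', 'b', 'ul', 'li'}   # kept as-is
DROP_TAGS = {'script', 'style'}         # removed together with their content
# everything else is unwrapped: its text survives, its markup does not
# (like the applet template above)

def sanitize(elem):
    """Return a filtered copy of `elem` (assumed whitelisted itself)."""
    out = ET.Element(elem.tag, dict(elem.attrib))
    out.text = elem.text
    for child in elem:
        _filter_into(out, child)
    return out

def _filter_into(parent, child):
    if child.tag in DROP_TAGS:
        _append_text(parent, child.tail)      # keep only the text after it
    elif child.tag in ALLOWED_TAGS:
        clean = sanitize(child)
        clean.tail = None
        parent.append(clean)
        _append_text(parent, child.tail)
    else:
        # unwrap: hoist the text and the (filtered) children
        _append_text(parent, child.text)
        for grandchild in child:
            _filter_into(parent, grandchild)
        _append_text(parent, child.tail)

def _append_text(parent, text):
    """Attach `text` after whatever is already inside `parent`."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or '') + text
    else:
        parent.text = (parent.text or '') + text

root = ET.fromstring('<p>Hi <script>evil()</script><b>bold</b> end</p>')
print(ET.tostring(sanitize(root), encoding='unicode'))
# <p>Hi <b>bold</b> end</p>
```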

[4] unacceptable attributes, blacklist version:

<!-- accept everything on `a' except on* attributes -->
<xsl:template match="a">
    <xsl:element name="a">
        <xsl:for-each select="@*">
            <xsl:if test="not(starts-with(name(), 'on'))">
                <xsl:variable name="attribute">
                    <xsl:value-of select="name()" />
                </xsl:variable>
                <xsl:attribute name="{$attribute}">
                    <xsl:value-of select="." />
                </xsl:attribute>
            </xsl:if>
        </xsl:for-each>
        <xsl:apply-templates />
    </xsl:element>
</xsl:template>

Whitelist version:

<!-- accept only href and title on `a' -->
<xsl:template match="a">
    <xsl:element name="a">
        <xsl:attribute name="href">
            <xsl:value-of select="@href" />
        </xsl:attribute>
        <xsl:attribute name="title">
            <xsl:value-of select="@title" />
        </xsl:attribute>
        <xsl:apply-templates />
    </xsl:element>
</xsl:template>
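An attribute whitelist is even shorter in a DOM-walking sketch (again my
own illustration; the per-element table is an assumption):

```python
import xml.etree.ElementTree as ET

# Per-element attribute whitelist (illustrative):
ALLOWED_ATTRS = {'a': {'href', 'title'}}

def strip_attributes(elem):
    """Recursively drop every attribute not whitelisted for its element."""
    allowed = ALLOWED_ATTRS.get(elem.tag, set())
    for name in list(elem.attrib):
        if name not in allowed:
            del elem.attrib[name]
    for child in elem:
        strip_attributes(child)

root = ET.fromstring('<a href="/x" onclick="evil()" title="t">link</a>')
strip_attributes(root)
print(sorted(root.attrib))   # ['href', 'title']
```

Note one difference from the XSLT whitelist above: this version copies an
attribute only when it is actually present, whereas the XSLT template
emits empty href and title attributes when the source lacks them.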

[5, 6] unacceptable attribute and text values. Here is where it gets
awkward: the string manipulation functions in XSL are few and not as
powerful as regexes, but it is not impossible to build proper value
validation.

On strings (node and attribute names, attribute and text node values)
you have just concat, contains, starts-with, string-length, substring,
substring-after, substring-before and translate; almost nothing compared
to the power of regexes, but in the end it is not a contest of writing it
all on one line.

I'm not writing this to say regexes are bad; I'm just stating that not
everything that can be held in a string should be treated as one. HTML
should be represented as (parsed into) a DOM tree, where only
node/attribute names, attribute values, text nodes and comments are
separate strings. Whatever cannot be divided any further (an atom) into
another set of tokens should be validated as a string or a number;
however, an attribute value that should represent a URL should be
validated with a parser specifically built for that task (based on the
URL grammar).
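As a sketch of that last point, Python's standard urllib.parse can play
the role of the URL parser, with a scheme whitelist on top (the whitelist
itself is my own illustration):

```python
from urllib.parse import urlparse

# '' covers scheme-less relative URLs such as "page.html" or "/inbox"
SAFE_SCHEMES = {'', 'http', 'https', 'mailto'}

def is_safe_href(value: str) -> bool:
    """Accept an href only when its parsed scheme is on the whitelist."""
    try:
        scheme = urlparse(value.strip()).scheme
    except ValueError:
        return False
    return scheme.lower() in SAFE_SCHEMES

print(is_safe_href('https://example.com/'))   # True
print(is_safe_href('javascript:alert(1)'))    # False
print(is_safe_href('page.html'))              # True
```

A production check would also have to strip embedded control characters
first, since some browsers tolerate things like a tab in the middle of
"javascript:"; this sketch only handles the plain case.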


It's interesting, but I don't believe in a blacklist approach. It's
impossible to catch everything; you can only go one way safely, and that's
a whitelist. Again, interesting idea, but I don't see the advantage for me
personally. Whatever works, works, though. Also, regexes don't have to be
written on one line. In Perl, for example, simply use the /x modifier and
you can break one up to be very readable. I don't think anything for this
task would work better (as accurate and dynamic) than a few good regexes.
It would save a lot of time coding routines and logic, be more efficient,
and be a complete solution. You can easily create conditions that can be
specified, controlled or altered per user, or just have degrees of
defaults, which could handle anything quite easily. We're literally
talking about a few lines of code that will be more efficient.
Nonetheless, if you develop anything along the lines you describe, please
let me know; I'd like to check out what you're doing.
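For what it's worth, the readable multi-line regex idea looks like this in
Python, whose re.VERBOSE flag plays the same role as Perl's /x modifier
(the pattern itself is only an illustration, not from the thread):

```python
import re

# Flag attribute values whose scheme is "javascript", tolerating the
# interleaved-whitespace games some browsers accept.
JS_HREF = re.compile(r"""
    ^ \s*                                         # leading whitespace
    j \s* a \s* v \s* a \s* s \s* c \s* r \s* i \s* p \s* t
    \s* :                                         # scheme separator
    """, re.VERBOSE | re.IGNORECASE)

print(bool(JS_HREF.match('JavaScript: alert(1)')))   # True
print(bool(JS_HREF.match('http://example.com/')))    # False
```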
Cheers!
--
Regards,
Tim Greer  chatmaster () charter net
Server administration, security, programming, consulting.

