WebApp Sec mailing list archives

Re: Canonicalization Of User Input In PHP


From: Paul Johnston <paul () westpoint ltd uk>
Date: Wed, 19 Jan 2005 14:16:54 +0000

Hi,

In general I feel that trying to develop a generic "sanitize_input" function is not fruitful. The set of dangerous characters depends on where the string is used. For example, I recently audited some code that had such a function, "safe_io", which called the MySQL and HTML escaping functions. It was rigorously applied to inputs; however, I found some places where variables protected like that were passed to the shell, where neither kind of escaping helps. Also, such functions can very easily corrupt data.
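
To make both problems concrete, here is a minimal sketch of what such a generic helper might look like. The function name generic_sanitize is hypothetical, and addslashes() is only a stand-in for the MySQL escaping call; the audited code itself is not shown in this thread.

```php
<?php
// Hypothetical reconstruction of a "sanitize everything" helper.
// addslashes() stands in for the MySQL escaping function (assumption).
function generic_sanitize(string $s): string
{
    return htmlspecialchars(addslashes($s), ENT_QUOTES, 'UTF-8');
}

// Data corruption: a legitimate name picks up a backslash and HTML
// entities, and that mangled form is what ends up being stored.
$stored = generic_sanitize("O'Brien & Sons");
// $stored is now: O\&#039;Brien &amp; Sons

// No shell protection: ";" is special to the shell, but it is special to
// neither SQL nor HTML, so it passes through untouched.
$arg = generic_sanitize('report.txt; id');
// $arg is still: report.txt; id
```

If $arg were then concatenated into a shell command, the part after the semicolon would run as a second command.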

For escaping dangerous characters, I advocate escaping very close to where the string will be reparsed, e.g. system("program " + escape_shell(args)). Applying this principle to cross-site scripting means escaping HTML as it is generated. A consequence of this is that your database may contain HTML special characters. However, as you follow this principle, you become more and more encouraged to just make everything binary safe and sidestep the dangerous characters problem altogether. For SQL queries, most interfaces support some kind of parameterised queries that are binary safe. As for passing data to the shell, it usually turns out the only good policy is to avoid this at all costs. I haven't mentioned string length; most of the time a long string is not dangerous.
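
As a rough PHP illustration of escaping at the point of use, the sketch below shows one helper per destination. The function names, the "users" table and the log path are all made up for the example, and the shell helper assumes a Unix-like system.

```php
<?php
// HTML: escape at the moment the markup is generated, not on input.
function render_comment(string $comment): string
{
    return '<p>' . htmlspecialchars($comment, ENT_QUOTES, 'UTF-8') . '</p>';
}

// SQL: a parameterised query via PDO. The value travels separately from
// the query text, so no character in $name is special.
// (The "users" table is hypothetical.)
function find_user(PDO $db, string $name): array
{
    $stmt = $db->prepare('SELECT id, name FROM users WHERE name = ?');
    $stmt->execute([$name]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Shell: best avoided. If it really cannot be, quote each argument right
// where the command line is built. (The log path is hypothetical.)
function build_grep_command(string $pattern): string
{
    return 'grep -F -- ' . escapeshellarg($pattern) . ' /var/log/app.log';
}
```

Note that the database would hold the comment exactly as the user typed it; only the HTML output is escaped.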

One thing to note: in many situations the programmer should be able to get a complete list of dangerous characters, because they control the code that reparses the string. However, the most notable exception is HTML. Here the client's browser does the parsing, and the programmer has no control over it. Various browser-specific features require protecting more characters.
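
One way to cope with this in PHP is to escape conservatively and per output context. The sketch below is only illustrative: the helper names are invented, and it covers just two contexts (HTML text or quoted attribute values, and a value embedded in an inline script).

```php
<?php
// Text nodes and *quoted* attribute values: escape both quote styles and
// replace invalid UTF-8 rather than passing it to the browser.
function html_escape(string $s): string
{
    return htmlspecialchars($s, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// A value placed inside an inline <script> block: JSON-encode it and hex
// escape <, >, &, ' and " so the HTML parser never sees them.
// JSON_THROW_ON_ERROR needs PHP 7.3 or later.
function js_value($value): string
{
    return json_encode(
        $value,
        JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT | JSON_THROW_ON_ERROR
    );
}
```

Declaring the page's character set explicitly and always quoting attribute values removes some of the browser guesswork that this kind of escaping cannot address by itself.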

Now, escaping bad characters is just one part of the puzzle. As a major second line of defence, every input value should be whitelist validated as early as possible. Any input encoding (e.g. URL encoding) must be decoded before this validation. Handling UTF-8 requires some consideration here. Whitelist validation is a major defence against the possibility that you've missed a character from your dangerous character list. Also, this is a good place to put sensible length limits. However, for many inputs quite permissive validation is the only acceptable option. A regex I often use is ^[\x20-\x7e]*$ (printable ASCII only). While helpful, this does nothing to protect against HTML or SQL special characters, so it is not a defence by itself.
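
A minimal PHP sketch of this permissive check, combined with a length limit, might look like the following. The function name and the default limit of 255 bytes are arbitrary choices for the example. One PHP-specific detail: with preg_match(), "$" also matches just before a trailing newline, so the sketch anchors with \A and \z instead of ^ and $.

```php
<?php
// Accepts only printable ASCII (space through tilde), up to $maxLength bytes.
function is_printable_ascii(string $value, int $maxLength = 255): bool
{
    if (strlen($value) > $maxLength) {
        return false;
    }
    // \A and \z anchor to the absolute start and end of the string.
    return preg_match('/\A[\x20-\x7e]*\z/', $value) === 1;
}

// The same pattern with ^...$ lets a trailing newline through:
$loose = preg_match('/^[\x20-\x7e]*$/', "hello\n");   // 1 (matches)
$strict = is_printable_ascii("hello\n");              // false
```

As the text says, a value that passes can still contain quotes, angle brackets and so on, so escaping at the point of use is still required.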

So, the overall sequence is:

   input -> unescape -> validate -> [do stuff] -> escape -> output
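
The sequence above can be sketched in PHP roughly as follows. This is only an illustration: the function name greet and the 64-byte limit are invented, and the input is assumed to be a raw, still URL-encoded value. PHP has already URL-decoded the values in $_GET, so they should not be decoded a second time.

```php
<?php
function greet(string $rawName): string
{
    // unescape: decode the transport encoding exactly once
    $name = rawurldecode($rawName);

    // validate: whitelist of printable ASCII, 1 to 64 bytes
    if (preg_match('/\A[\x20-\x7e]{1,64}\z/', $name) !== 1) {
        throw new InvalidArgumentException('invalid name');
    }

    // [do stuff]: work with the plain, unescaped value
    $greeting = 'Hello, ' . $name;

    // escape -> output: escape for HTML only as the HTML is generated
    return '<p>' . htmlspecialchars($greeting, ENT_QUOTES, 'UTF-8') . '</p>';
}

$html = greet('O%27Brien%20%3Cb%3E');
// $html is: <p>Hello, O&#039;Brien &lt;b&gt;</p>
```

Because decoding happens once and before validation, an encoded NUL byte such as %00 is rejected, and a double-encoded value such as %2541 is treated as the literal text %41 rather than being decoded again later.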

I haven't mentioned "canonicalisation" but that is implicit in the above sequence. Of course, all this only protects you against one class of attacks. While the inputs are now validated, they remain untrusted. You still have to design your logic correctly. Also, you need to consider carefully what your inputs are, e.g. it would be easy to protect all form input fields, but forget to apply the same validation to cookies. For network applications it is usually clear what the inputs are (but take care of things like reverse DNS lookups). A traditional Unix situation where the inputs are many and varied is a setuid executable - something very difficult to secure.
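
One way to avoid forgetting a source in PHP is to push every source through the same check. The sketch below is an assumption-laden example: the function name is invented, and it reuses a printable-ASCII whitelist with a 255-byte limit for both keys and values. PHP turns parameters such as a[]=1 into arrays, so non-string values are rejected as well.

```php
<?php
// Returns the names of all fields that fail validation, e.g. "cookie[sid]".
function find_invalid_inputs(array $sources, int $maxLength = 255): array
{
    $pattern = '/\A[\x20-\x7e]*\z/';
    $rejected = [];
    foreach ($sources as $sourceName => $fields) {
        foreach ($fields as $key => $value) {
            $ok = is_string($value)
                && strlen($value) <= $maxLength
                && preg_match($pattern, $value) === 1
                && preg_match($pattern, (string) $key) === 1;
            if (!$ok) {
                $rejected[] = $sourceName . '[' . $key . ']';
            }
        }
    }
    return $rejected;
}

// Apply the same rule to every request source, not only form fields.
$rejected = find_invalid_inputs([
    'get'    => $_GET,
    'post'   => $_POST,
    'cookie' => $_COOKIE,
]);
```

Request headers and anything derived from the network, such as reverse DNS names, would need to be added to the list in the same way.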

Regards,

Paul



warnings () envisagement com wrote:

I am working on implementing a basic PHP user input validation scheme
and have come across several references to canonicalizing input before
performing validation. After researching this topic on the net I have finally
reached a point where I feel okay asking for help.

At this point I have found a few basic functions related to this subject, but
I am getting lost in alphabet soup (UTF-8, RFC 2279, ISO 10646, ...) and
I am reaching a momentary saturation point where I am finding the learning
curve is only getting steeper the more I learn.

For the basic validation I have found the following set of PHP filters via the
owasp.org site.

http://www.owasp.org/software/labs/phpfilters.html
// sanitize.inc.php
// Sanitization functions for PHP
// by: Gavin Zuchlinski, Jamie Pratt, Hokkaido
// webpage: http://libox.net
// Last modified: December 21, 2003

Now these functions are fairly clear and easy to understand and have
generally validated what I have come to understand as best practices,
as I have experience with fault tolerant coding, just not security. But the issue I am having trouble coming to terms with is canonicalization of the data.
Beyond the above routines, I have also found the urldecode() function in
the PHP manual.

At this point I feel (weakly, not securely) that one should use the following
to canonicalize the data prior to validating any input.

reset($_GET);
foreach($_GET as $key => $value){
   // Transform to canonical form.
   $ckey = my_utf8_decode(urldecode($key));
   $cvalue = my_utf8_decode(urldecode($value));
   if( $ckey != sanitize_paranoid_string($ckey) ||
           $cvalue != sanitize_paranoid_string($cvalue) ){
       header('Location: http://www.somesight.net/index.php');
       exit; // header() alone does not stop the script
   }
}

I understand this example is simplistic, but is this a proper way
to canonicalize the input values?  Or am I missing something here?

Should I be looking at the following too?

$_SERVER['CONTENT_TYPE'] == 'application/x-www-form-urlencoded'

Is this data even trustworthy? My first guess is that it could be forged in
the header data.

Any input would be appreciated.

thanks,

Sean


--
Paul Johnston, GSEC
Internet Security Specialist
Westpoint Limited
Albion Wharf, 19 Albion Street,
Manchester, M1 5LN
England
Tel: +44 (0)161 237 1028
Fax: +44 (0)161 237 1031
email: paul () westpoint ltd uk
web: www.westpoint.ltd.uk

