WebApp Sec mailing list archives

Re: Canonicalization Of User Input In PHP

From: Paul Johnston <paul () westpoint ltd uk>
Date: Wed, 19 Jan 2005 14:16:54 +0000

Hi,

In general I feel that trying to develop a generic "sanitize_input" function is not fruitful. The set of dangerous characters depends on where the string is used. For example, I recently audited some code that had such a function, "safe_io", which called the MySQL and HTML escaping functions. It was rigorously called for all inputs; however, I found some places where variables protected like that were passed to the shell. Also, such functions can very easily corrupt data.
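
To make the point about context concrete, here is a minimal sketch. The safe_io() below is a hypothetical stand-in (the audited code is not shown in this message) that only applies HTML escaping; the sketch then compares its output with what a shell would need, and shows how applying it twice corrupts data.

```php
<?php
// Hypothetical stand-in for a generic "sanitize everything" helper.
// It only knows about HTML, which is exactly the problem.
function safe_io(string $s): string
{
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}

$input = 'report.txt; rm -rf ~';

// HTML escaping leaves shell metacharacters such as ';' untouched.
$html_safe = safe_io($input);

// Escaping for the shell is a different transformation altogether.
$shell_safe = escapeshellarg($input);

// Applying the generic helper twice corrupts data:
// '&' becomes '&amp;' and then '&amp;amp;'.
$name  = 'Marks & Spencer';
$twice = safe_io(safe_io($name));

echo $html_safe, "\n", $shell_safe, "\n", $twice, "\n";
```

The "sanitized" value is byte-for-byte identical to the input here, so passing it to the shell would still run the second command.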

For escaping dangerous characters, I advocate escaping very close to where the string will be reparsed, e.g. system("program " + escape_shell(args)). Applying this principle to cross-site scripting means escaping HTML as it is generated. A consequence of this is that your database may contain HTML special characters. However, as you follow this principle, you become more and more encouraged to just make everything binary safe and sidestep the dangerous characters problem. For SQL queries, most interfaces support some kind of parameterised queries that are binary safe. As for passing to the shell, it usually turns out the only good policy is to avoid this at all costs. I haven't mentioned string length; most of the time a long string is not dangerous.
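
As a rough PHP sketch of "escape close to where the string is reparsed": the function names, the comments table and its columns are all hypothetical, and save_comment() assumes a PDO connection is already available. Only htmlspecialchars(), escapeshellarg() and the PDO prepare/execute calls are standard PHP.

```php
<?php
// Escape for HTML at the moment the HTML is generated.
function render_comment(string $author, string $body): string
{
    $e = fn(string $s): string => htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
    return '<p><b>' . $e($author) . '</b>: ' . $e($body) . '</p>';
}

// Parameterised query: the values travel separately from the SQL text,
// so no SQL escaping is needed. Table and column names are hypothetical.
function save_comment(PDO $db, string $author, string $body): void
{
    $stmt = $db->prepare('INSERT INTO comments (author, body) VALUES (?, ?)');
    $stmt->execute([$author, $body]);
}

// Shell: best avoided, but if it cannot be, escape right at the call site.
function word_count_command(string $path): string
{
    return 'wc -l ' . escapeshellarg($path);
}

// The stored value keeps its HTML special characters; only output is escaped.
$stored = "O'Reilly <script>alert(1)</script>";
echo render_comment('Paul', $stored), "\n";
echo word_count_command('notes; rm -rf ~'), "\n";
```

With this layout the database holds the original bytes, and each consumer (HTML page, SQL engine, shell) gets the escaping that matches its own parser.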

One thing to note: in many situations the programmer should be able to get a complete list of dangerous characters, because they control the code that reparses the string. However, the most notable exception is HTML. Here the client's browser does the parsing and the programmer has no control. Various browser-specific features require protecting more characters.

Now, escaping bad characters is just one part of the puzzle. As a major second line of defence, every input value should be whitelist validated as early as possible. Any input encoding (e.g. URL encoding) must be decoded before this validation. Handling UTF-8 requires some consideration here. Whitelist validation is a major defence against the possibility that you've missed a character from your dangerous character list. Also, this is a good place to put sensible length limits. However, for many inputs quite permissive validation is the only acceptable option; a regex I often use is ^[\x20-\x7e]*$. While helpful, this does nothing to block HTML or SQL special characters, so it is not a defence by itself.
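
A minimal sketch of such whitelist checks in PHP follows. The function names, the 200-character limit and the username rule are assumptions made for illustration, not part of any library.

```php
<?php
// Permissive whitelist: printable ASCII only, with a length cap.
// The 200-character default is an arbitrary, hypothetical limit.
function is_printable_ascii(string $value, int $max_len = 200): bool
{
    if (strlen($value) > $max_len) {
        return false;
    }
    // \z anchors at the very end of the string.
    return preg_match('/^[\x20-\x7e]*\z/', $value) === 1;
}

// A stricter whitelist for a field with a known shape
// (hypothetical username rule: 3 to 16 lowercase letters, digits, '_').
function is_valid_username(string $value): bool
{
    return preg_match('/^[a-z0-9_]{3,16}\z/', $value) === 1;
}

var_dump(is_printable_ascii('Hello, world'));  // bool(true)
var_dump(is_printable_ascii("line\nbreak"));   // bool(false): control character
var_dump(is_printable_ascii('<script>'));      // bool(true): still needs escaping on output
var_dump(is_valid_username('paul_j'));         // bool(true)
```

In PCRE, `$` without the D modifier also matches just before a final newline, so the sketch uses `\z` instead; that is a tightening of the regex quoted above, not part of it.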


So, the overall sequence is:

   input -> unescape -> validate -> [do stuff] -> escape -> output
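
The sequence above can be sketched end to end as follows. handle_name_param(), the "name" parameter and the 50-character limit are hypothetical; only the ordering of the steps is the point.

```php
<?php
// input -> unescape -> validate -> [do stuff] -> escape -> output
function handle_name_param(string $raw_query): string
{
    // input: the raw, still URL-encoded query string, e.g. "name=Paul%20J"
    $name = null;
    foreach (explode('&', $raw_query) as $pair) {
        $parts = explode('=', $pair, 2);
        // unescape: decode exactly once
        if (urldecode($parts[0]) === 'name') {
            $name = urldecode($parts[1] ?? '');
        }
    }

    // validate: whitelist the decoded value and apply a length limit
    if ($name === null || strlen($name) > 50
        || preg_match('/^[\x20-\x7e]*\z/', $name) !== 1) {
        return 'Invalid input';
    }

    // do stuff: application logic works on the plain, unescaped value
    $greeting = 'Hello, ' . $name . '!';

    // escape + output: HTML-escape at the point the HTML is produced
    return '<p>' . htmlspecialchars($greeting, ENT_QUOTES, 'UTF-8') . '</p>';
}

echo handle_name_param('name=Paul%20%3Cb%3E'), "\n";
echo handle_name_param('name=%0A'), "\n";
```

Note that PHP has already URL-decoded the values in $_GET by the time a script sees them, so calling urldecode() on them again decodes twice (for example, %2527 would become %27 and then a single quote).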

I haven't mentioned "canonicalisation" but that is implicit in the above sequence. Of course, all this only protects you against one class of attacks. While the inputs are now validated, they remain untrusted. You still have to design your logic correctly. Also, you need to consider carefully what your inputs are, e.g. it would be easy to protect all form input fields, but forget to apply the same validation to cookies. For network applications it is usually clear what the inputs are (but take care with things like reverse DNS lookups). A traditional Unix situation where the inputs are many and varied is a setuid executable, something very difficult to secure.
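
One way to avoid forgetting an input source is to run the same check over every request array in one place. The sketch below assumes a hypothetical validate_sources() helper and the permissive printable-ASCII rule with an arbitrary 200-character limit; a real application would likely want per-field rules.

```php
<?php
// Run one validator over every source of request data, not just form fields.
// Returns a list of "source:key" entries that failed the check.
function validate_sources(array $sources): array
{
    // Printable ASCII, at most 200 characters (hypothetical limit).
    $ok = fn($v): bool => is_string($v)
        && preg_match('/^[\x20-\x7e]{0,200}\z/', $v) === 1;

    $failed = [];
    foreach ($sources as $source_name => $values) {
        foreach ($values as $key => $value) {
            // Keys are attacker-controlled too. Nested arrays (e.g. a[]=1)
            // are not strings, so they are rejected in this sketch.
            if (!$ok((string) $key) || !$ok($value)) {
                $failed[] = $source_name . ':' . $key;
            }
        }
    }
    return $failed;
}

$failed = validate_sources([
    'get'    => $_GET,
    'post'   => $_POST,
    'cookie' => $_COOKIE,
]);
```

Other inputs, such as request headers or the result of a reverse DNS lookup, would need to be added to the list in the same way.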


Regards,

Paul



warnings () envisagement com wrote:

I am working on implementing a basic PHP user input validation scheme and have come across several references to canonicalizing input before performing validation. After researching this topic on the net I have finally reached a point where I feel okay asking for help.

At this point I have found a few basic functions related to this subject, but I am getting lost in alphabet soup (UTF-8, RFC 2279, ISO 10646, ...) and I am reaching a momentary saturation point where I am finding the learning curve is only getting steeper with the more I learn.

For the basic validation I have found the following set of PHP filters via the owasp.org site.

http://www.owasp.org/software/labs/phpfilters.html
// sanitize.inc.php
// Sanitization functions for PHP
// by: Gavin Zuchlinski, Jamie Pratt, Hokkaido
// webpage: http://libox.net
// Last modified: December 21, 2003

Now these functions are fairly clear and easy to understand and have generally validated what I have come to understand as best practices, as I have experience with fault tolerant coding, just not security. But the issue I am having trouble coming to terms with is canonicalization of the data.

Beyond the above routines, I have also found the urldecode() function in the PHP manual.

At this point I feel (weakly, not securely) that one should use the following to canonicalize the data prior to validating any input.

reset($_GET);
foreach($_GET as $key => $value){
   // Transform to canonical form.
   $ckey = my_utf8_decode(urldecode($key));
   $cvalue = my_utf8_decode(urldecode($value));
   if( $ckey != sanitize_paranoid_string($ckey) ||
           $cvalue != sanitize_paranoid_string($cvalue) ){
       header('location:www.somesight.net/index.php');
   }
}

I understand this example is simplistic, but is this a proper way
to canonicalize the input values?  Or am I missing something here?

Should I be looking at the following too?

$_SERVER['CONTENT_TYPE'] == 'application/x-www-form-urlencoded'

Is this data even trustworthy? At first guess I would think it could be forged in the header data.

Any input would be appreciated.

thanks,

Sean


--
Paul Johnston, GSEC
Internet Security Specialist
Westpoint Limited
Albion Wharf, 19 Albion Street,
Manchester, M1 5LN
England
Tel: +44 (0)161 237 1028
Fax: +44 (0)161 237 1031
email: paul () westpoint ltd uk
web: www.westpoint.ltd.uk
