
[RFC] Improve NSE HTTP architecture.


From: Djalal Harouni <tixxdz () opendz org>
Date: Tue, 14 Jun 2011 14:46:55 +0100

Hi list,

Henri and I started thinking about this at the beginning of GSoC, and
then I wrote this proposal to discuss and address the current
limitations of the NSE HTTP architecture and how we can improve it,
taking Nmap and NSE properties into consideration.

Ideas from various nmap-dev threads on the same subject were also
compiled into this proposal along with our own ideas. I hope that these
solutions are practical and not too complex.
Feedback is welcome; we count on you, the Web specialists.

Note: I have also studied some parts of w3af [1], another cool Open Source
project, but there are implementation differences, so most of these ideas
are based on Nmap and NSE capabilities.


Table:
------
1) Introduction
2) Motivation
3) Nmap Information exported to NSE
4) Current HTTP scripts
5) Crawler and http-enum
6) Improve HTTP fingerprints and http-enum
7) Improve other HTTP scripts
8) Conclusion


1) Introduction:
----------------
Currently there are more than 20 HTTP scripts, most of them discovery
scripts that perform checks/tests in order to identify HTTP
applications. These tests can be incorporated into the http-enum script to
reduce the size of the loaded and running code, and to achieve better
performance. Of course this will reduce the number of HTTP scripts,
but writing an entire NSE script for a simple check that can be done in
5-10 Lua instructions is not the best solution either.


2) Motivation:
--------------
* Create a scalable architecture for NSE HTTP scanning:
  - Avoid or reduce the "one script per check" pattern.
  - Avoid sending irrelevant requests to the targets
    (e.g. do not probe WordPress for a Django vulnerability).
  - Improve the processing of the data and the information gathered.
  - Reduce the size of the running code.

* Writing and contributing HTTP checks and scripts should be easier.

* Separate the script logic from the tests:
  Many tests, like vulnerability checks, can be automated, i.e. share
  common logic. Currently this can be done by http-enum, but there is
  still room for improvement.

* Improve HTTP fingerprints format and http-enum script.


3) Nmap Information exported to NSE:
------------------------------------
This proposal relies on some of the Nmap information that should be
exported to NSE scripts:

* The user-specified script category selection: "--script='categories'".

* Nmap --version-intensity: version scan intensity.
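
Assuming these values do get exported, the fingerprint loader could use
the version scan intensity roughly as in the sketch below; the
nmap.version_intensity() call is an assumption here, standing in for
whatever export Nmap ends up providing.

  local nmap = require "nmap"

  -- Decide whether a fingerprint is "common" enough for the requested
  -- version scan intensity (default rarity 7, like version probes).
  local function allowed_by_intensity(fingerprint)
    local intensity = nmap.version_intensity()  -- hypothetical export
    return (fingerprint.intensity or 7) <= intensity
  end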


4) Current HTTP scripts:
------------------------
In order to have a better idea of what we are trying to achieve, we'll start
by enumerating some of the current HTTP scripts and their aims:

* http-enum: enumerates directories used by popular web applications and
  servers. For us this is the most interesting script.
  Script categories are: {"discovery", "intrusive", "vuln"}

* http-auth: retrieves the authentication scheme.
  This script checks for the 'www-authenticate' header.
  Categories of the script: {"default", "auth", "safe"}.

* http-brute: a script to bruteforce HTTP authentication.
  This script is in the "intrusive" and "auth" categories (bruteforce).

* http-date: gets the date from the HTTP service.
  This is a discovery script that checks for the 'date' header.

* http-headers: displays the returned HTTP headers.
  This is a discovery script.

* ...


By looking at these scripts we can see that most of them are discovery
scripts; however, there are also brute force and vulnerability scripts.

Some of these scripts can be included in http-fingerprints.lua and used by
http-enum.


5) Crawler and http-enum:
-------------------------
(We assume that there is a crawler.)

We can take advantage of the facilities NSE already provides, like
dependencies (scripts wait for the crawler), but NSE parallelism
is the way to go, and since Lua threads are very cheap we can optimize
this to achieve better performance.

Note: if caching is available, then http-enum with its matching code and
other HTTP scripts can end up in a situation where they never yield, since
there are no network operations.
A solution for the http-enum matching code (which is the bulk of the code)
would be to use coroutines and make them yield explicitly, as sketched below.
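
A minimal sketch of that idea: the matching loop processes cached responses
and yields explicitly every few iterations so other NSE threads get
scheduled. run_match() is a placeholder for the real matching routine, and
stdnse.sleep() with a zero timeout is just one possible explicit yield point.

  local stdnse = require "stdnse"

  -- Hypothetical matching loop over cached (already fetched) responses.
  -- There is no network operation here, so it yields explicitly from time
  -- to time instead of monopolizing the scheduler.
  local function match_cached_responses(cache, fingerprints, run_match)
    local results = {}
    for i, response in ipairs(cache) do
      for _, fp in ipairs(fingerprints) do
        local out = run_match(fp, response)   -- placeholder matching routine
        if out then
          results[#results + 1] = out
        end
      end
      if i % 16 == 0 then
        stdnse.sleep(0)   -- explicit yield point for the NSE scheduler
      end
    end
    return results
  end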


The crawler worker threads will push data into a single cache, and
other scripts like the http-enum script will use it; this optimizes the
arrival rate, and other threads/scripts (http-enum) can read the cache and
perform matches. So if the crawler workers are running, the http-enum
master thread should wait for an event from the crawler before starting
its threads. This will depend on the crawler's internal design, which is
not discussed here, but perhaps we can perform context switches between the
crawler threads and http-enum threads (or other scripts) based on the
recursion depth level; I mean that crawler threads can signal and yield.
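
A rough sketch of this hand-off, assuming the shared cache lives in the
registry and using NSE's condition variables (nmap.condvar); the crawler
side and the http-enum side would of course live in different threads:

  local nmap = require "nmap"

  nmap.registry.httpcache = nmap.registry.httpcache or {}
  local cache = nmap.registry.httpcache
  local cv = nmap.condvar(cache)

  -- In a crawler worker thread: push a fetched response and wake waiters.
  local function crawler_push(response)
    cache[#cache + 1] = response
    cv("signal")
  end

  -- In the http-enum master thread: wait until there is data to match.
  local function wait_for_responses()
    while #cache == 0 do
      cv("wait")   -- yields until a crawler worker signals the cache
    end
  end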

So currently we consider that the crawler, which is a discovery script,
and other discovery scripts like http-enum must run at the same dependency
level.
If the crawler is not activated or selected, then the http-enum script
should behave normally and use coroutines.

More information about the parallelism subject can be found in the NSE
parallelism page [2].


6) Improve HTTP fingerprints and http-enum:
-------------------------------------------
The first http-fingerprints.lua format was proposed by Ron [3], and Patrik
has also proposed another format [4]. I think that we should also
consider some good points from Patrik's format, and use or combine both
designs with some small modifications. So in this solution
there will be two files: http-matchers.lua and http-fingerprints.lua.

* http-matchers.lua: for more general matches that will be used all the
  time against any HTTP path when a specific condition is met.

* http-fingerprints.lua: for specific fingerprints and matches, when we
  know that probing HTTP path X can lead us to match Y and to results Z.


http-fingerprints.lua:
----------------------
First, this file should be split based on categories, so that when a user
specifies the script categories we do not load all the fingerprints, which
saves memory.

* http-fp-vuln.lua for vulnerability and exploit checks, with a
  severity rating field, as it is currently documented in the
  http-fingerprints.lua file,
  e.g. 'info', 'low priority', 'warning' and 'critical'.

* http-fp-discovery.lua for discovery checks.

* http-fp-auth.lua for authentication and brute force checks.


The fields of the fingerprint tables can be:
* categories: a list of categories, the same categories used by NSE;
  not all of them, but 'discovery', 'version', 'auth' and 'vuln'
  seem OK (e.g. 'attack' would map to 'vuln' or 'exploit').

  A function should be added to check the currently selected script
  categories ("--script='categories'") and whether the fingerprint is
  allowed to be loaded; it returns true if it is,
  e.g. stdnse.check_script_categories(fingerprint_categories).
  A sketch of such a function follows this field list.

  This way the user's script category selection is not ignored, which
  includes the results of the boolean operators. Only the appropriate
  checks will be loaded (we avoid loading all the fingerprints).

  Note: the 'http-enum.categories' script argument should take precedence
  over '--script=categories' or over the default categories.

* app: the type of application to check: blog, databases, printers, etc.
  We can use a script argument for this: 'http-enum.app'; the default value
  would be 'all'.
  - 'http-enum.app=databases' for database checks only.
  - 'http-enum.app={databases,wiki}' for wiki and database checks.

* intensity: a rarity value used to decide if we should load the
  fingerprint and run the checks; however, this will apply only in some
  defined cases, like the matches of the http-matchers.lua file.
  We can use a script argument for this ('http-enum.intensity') or use
  the intensity value exposed by Nmap by default.

* severity: an output field for vulnerability and exploit checks.
  e.g. 'info', 'low priority', 'warning' and 'critical'.

* probes: as they are specified now in http-fingerprints.lua (no changes).

* matches: as they are specified in http-fingerprints.lua, with more fields:
  * path: a regex match against the HTTP path.
  * status: a regex match against the returned status line.
  * status_code: a number that contains the returned status code.
    If you are sure about the status_code then use this field.
    We should use only one of the 'status' and 'status_code' fields, not
    both of them.
  * header['x']: a regex match against the X field of the header.
  * body: a regex match against the body.
  The fields that can be used to ignore a match are:
  * ignore_header or ignore_header['x']
  * ignore_body
  
  * Handlers, in order (functions that will be executed if there is a match):
    - mangle_handler: mangle data.
    - misc_handler: perform different tasks,
      e.g. store data in the registry, perform other complex operations,
           load other fingerprints or another fingerprint file, etc.
    - output_handler: if set, it will take precedence over the output field,
      and the result of this function should be the string output.

* output: the returned output.
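
A sketch of the category check mentioned above; it assumes the user's
'--script' category selection is exported to NSE as a plain list of strings
(the 'selected' parameter here), which is exactly the kind of export
discussed in section 3:

  -- Return true if at least one of the fingerprint's categories was
  -- selected by the user, so the fingerprint may be loaded.
  local function check_script_categories(fingerprint_categories, selected)
    for _, fpcat in ipairs(fingerprint_categories) do
      for _, usercat in ipairs(selected) do
        if fpcat == usercat then
          return true
        end
      end
    end
    return false
  end

  -- e.g. only load a vuln fingerprint if 'vuln' (or another of its
  -- categories) was selected:
  --   if check_script_categories({'vuln'}, selected) then ... end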


fingerprint {
  categories = {'discovery'},
  app = "wiki",
  -- intensity,

  probes = {
    {path="/wiki/", method="HEAD"},
    {...},
  },

  matches = {
    {
      status_code = 200,
      -- regex checks:
      header = { ['date'] = "(.*)" },
      body = "<(.*)>",
      -- severity,

      -- mangle_handler = function(data)
      --   -- process data
      -- end,
      misc_handler = function(data)
        -- process data.
        -- cache data or save it in the registry.
        -- load another fingerprint or match file.
      end,
      output_handler = function(data)
        -- process output
      end,
      -- output = "wiki...",
    },
  },
}


The misc handler can perform different tasks; one of them is the ability
to load other fingerprints/matchers and probes. We can start with a set
of default fingerprints and probes, and then load new ones when there is
a match. In other words, there will be matches that can load other
fingerprints and matches.
This idea was suggested by Henri and it fits perfectly with the spirit of
web discovery: first try to detect the main application and its components,
then their versions, etc.
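
For example, a match's misc handler could pull in a dedicated fingerprint
file once the main application is identified; load_fingerprint_file() below
is a hypothetical helper, and the wp-content check is just an illustration:

  misc_handler = function(data)
    -- if the response looks like WordPress, load the WordPress-specific
    -- fingerprints for the next round of probes and matches.
    if data.body and data.body:match("wp%-content") then
      load_fingerprint_file("http-fp-wordpress.lua")  -- hypothetical helper
    end
  end,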


http-matchers.lua:
------------------
A file that contains matchers that will be used all the time when the
specific condition of the match is met (e.g. match 5XX pages). This is like
what Martin suggested [5], and it would use the same format as the
fingerprint table.
If the crawler is running, then a file like this will give us more general
information, since there will be links and responses that can't be
enumerated by the http-fingerprints.lua file.

Patrik's examples make use of a special httpmatch.lua library [4], which
offers regex-capture replacement,
e.g. #status_1# is replaced by the first capture of the status field.
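
This is not the actual httpmatch.lua code, just a sketch of how such token
replacement could be implemented; 'captures' is assumed to map a field name
to the list of captures produced by its regex:

  -- Replace #field_n# tokens in an output template with the n-th capture
  -- of that field's regex, e.g. #header.server_1#.
  local function expand_output(template, captures)
    return (template:gsub("#([%w%.%-]+)_(%d+)#", function(field, n)
      local caps = captures[field]
      return (caps and caps[tonumber(n)]) or ""
    end))
  end

  -- expand_output("Server information: #header.server_1#",
  --               { ["header.server"] = { "Apache/2.2.17" } })
  -- returns: "Server information: Apache/2.2.17"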

  Modified examples from Patrik's url.txt file [4]:
  match {
      status_code = 200,
      header = { ['x-powered-by'] = "(.*)" },
      output = "#header.x-powered-by_1#"
  }

  match {
      status = (options.debug and "([23].*)" or "-1"),
      header = { ['server'] = "(.*)" },
      output = "Server information: #header.server_1#"
  }

  Of course we must add categories and header fields like 'content-type'
  to these matches, to avoid grepping an image, a PDF, etc. We can also use
  the 'path' or 'ignore_header' fields in this situation.

* Matches can also have multiple handlers (Lua functions) to perform
  complex operations, to mangle the response, or to modify and adapt the
  output.


* A real example would be to include the check of the http-auth
  script in the http-matchers.lua file; this way the check can be
  performed against different pages and not only the root one.
  We can add more header fields, an intensity field, etc.

  match {
    categories = {'auth', 'discovery'},
    status = "401",
    header = { ['www-authenticate'] = "(.*)" },
    output = "auth...",
  }

  A better check with less code. 


Notes:
* One of the 'status' or 'status_code' fields should be present, since this
  can help the script locate the correct matches, in order to improve
  performance and to reduce the O(n) linear search time. The ideal solution
  would be to directly map the match to achieve O(1) search time, but there
  can be situations where there are several matches with multiple regex
  fields. So the proposed solution would be to use the 'status' or
  'status_code' fields to filter and store the matches.

  --Examples:
  matchers = {
    -- matches that use the 'status_code' field.
    [200] = {
      match {
        status_code = 200,
        ...
      },
      match {
        status_code = 200,
        ...
      },
    },
    [401] = {
      match {
        status_code = 401,
        ...
      },
    },

    -- matches that use the 'status' line field.
    ['status'] = {
      match {
        status = "([23].*)"
        ...
      },
    },
  }

  In this situation the script or the matching code will have to traverse
  the matchers[status_code] and the matchers['status'] lists to find the
  correct match (see the sketch after these notes). Perhaps we can use
  other fields, like the header fields, to improve this?

* The http-matchers.lua file can also be split based on categories.
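
A sketch of the lookup described in the notes above, assuming the response
table from the http library with its 'status' and 'status-line' fields:

  -- Find candidate matches for a response: O(1) jump to the bucket for the
  -- exact status code, then scan the regex-based 'status' bucket.
  local function find_matches(matchers, response)
    local found = {}
    for _, m in ipairs(matchers[response.status] or {}) do
      found[#found + 1] = m
    end
    for _, m in ipairs(matchers['status'] or {}) do
      if string.match(response["status-line"] or "", m.status) then
        found[#found + 1] = m
      end
    end
    return found
  end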


7) Improve other HTTP scripts:
------------------------------

* Add fingerprints and matchers dynamically:
There is also the idea of letting scripts register fingerprint tables
dynamically, especially scripts that register fingerprints or matchers
with a mangle handler to process and modify the response (or the cached
data) before it is returned to other scripts, or a script with a match
that reports all the broken links, etc.
This sort of script can be very useful, and the fingerprints can be used
by the http-enum script, if it's running. We say http-enum since we assume
that this script will be well written with coroutine support, and any
matching logic should go there.
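
A minimal sketch of such dynamic registration, assuming the registered
tables are simply collected in the registry where http-enum (or whichever
script runs the matching) picks them up; the table layout follows the
fingerprint format above:

  local nmap = require "nmap"

  -- Append a fingerprint/match table to a shared registry list; http-enum
  -- would merge these with the statically loaded fingerprints.
  local function register_fingerprint(fp)
    nmap.registry.http_fingerprints = nmap.registry.http_fingerprints or {}
    table.insert(nmap.registry.http_fingerprints, fp)
  end

  -- e.g. a script that wants every broken link reported could register:
  register_fingerprint {
    categories = {'discovery'},
    matches = {
      { status_code = 404, output = "Broken link" },
    },
  }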

If we do this, then perhaps we should adapt some of the http-enum script
arguments, so they can be used by all the HTTP scripts:
* 'http-enum.categories' to 'http.categories'.
* 'http-enum.intensity' to 'http.intensity'.


HTTP scripts:
* http-brute: the design of this script can be improved a lot.
  If the crawler and the http-enum script are running, then a match table
  registered dynamically by the http-brute script, which checks the returned
  status code and the 'www-authenticate' header field, will be used by the
  http-enum script to discover multiple protected paths. These paths can be
  saved in the registry by the match's misc handler, and later the http-brute
  script will try to brute force them.
  So in this situation http-brute will depend on the http-enum script.

* http-auth: we have already said that this can be converted into a general
  match in the http-matchers.lua file. The downside is that we would remove
  this script. If we don't want to remove the script, we can modify it to
  make it register that match dynamically.

* http-date: we can also convert this script into a simple general
  fingerprint or make the script register the fingerprint dynamically.
  fingerprint {
      categories = {'discovery', 'safe'},
      probes = {
          {path='/', method='HEAD'},
      },
      matches = {
          {
              status_code = 200,
              header = { ['date'] = "(.+)" },
              output_handler = function(data)
                  -- parse the captured date (#header.date_1#)
              end,
          },
      },
  }

* http-headers: like the previous http-date script. 

* http-wp-plugins: this script can also be improved.
  Make the script register a fingerprint dynamically that will try to
  identify whether the Web application is WordPress, its path and the
  plugins path. The fingerprint will be used by http-enum if it is running,
  otherwise it will be used by this script, and if there is a match then
  another fingerprint file will be loaded in order to brute force the
  WordPress plugins. We propose this design since fingerprinting and
  enumerating Web applications is more appropriate for the http-enum
  script, if it's running.


8) Conclusion:
--------------
Most of the solutions discussed here can improve NSE HTTP performance and
resource usage. The other solutions can perhaps help us add HTTP
discovery and vulnerability checks more easily.

The downsides of this proposal are that it lacks real code and that we are
not going to do the implementation right now.


References:
-----------
[1] http://w3af.sourceforge.net/
[2] http://nmap.org/book/nse-parallelism.html
[3] http://seclists.org/nmap-dev/2010/q4/112
[4] http://seclists.org/nmap-dev/2011/q2/377
[5] http://seclists.org/nmap-dev/2010/q4/136


-- 
tixxdz
http://opendz.org

