WebApp Sec mailing list archives

Re: Combatting automated download of dynamic websites?


From: Jayson Anderson <sonick () sonick com>
Date: Mon, 29 Aug 2005 14:56:49 -0700

I've not yet seen a recursion-prevention implementation that can provide
unencumbered service to valid customers while effectively denying access
to leisurely yet persistent recursive gets. You can definitely prevent
the aggressive recursion that is default wget behavior, but I've yet to
see anything confound a crawler that is both lazy and determined.
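
To illustrate with a rough Python simulation (the 30-requests-per-minute
token bucket below is just a number I picked, not any module's default):
default wget firing back-to-back requests trips the limit almost
immediately, while a crawler that sleeps five seconds between pages never
does.

    class TokenBucket:
        """Per-client token bucket: `rate` tokens per second, at most `burst` stored."""
        def __init__(self, rate, burst, now=0.0):
            self.rate, self.burst = rate, burst
            self.tokens, self.last = float(burst), now

        def allow(self, now):
            # Refill proportionally to elapsed time, then spend one token if available.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    def simulate(interval_s, requests=200, rate=0.5, burst=10):
        """Count how many of `requests` get denied when they arrive `interval_s` apart."""
        bucket = TokenBucket(rate, burst)      # 0.5 tokens/s is 30 requests/minute
        t, denied = 0.0, 0
        for _ in range(requests):
            if not bucket.allow(t):
                denied += 1
            t += interval_s
        return denied

    print("aggressive, 0.05 s gap:", simulate(0.05), "of 200 denied")  # most denied
    print("patient,    5 s gap:   ", simulate(5.0), "of 200 denied")   # none denied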

On Mon, 2005-08-29 at 10:18 +0200, Matthijs R. Koot wrote:
Hi folks,

Which preventive or repressive measures could one apply to protect
larger dynamic websites against automated downloading by tools such as
WebCopier and Teleport Pro (or curl, for that matter)?

For a website like Amazon's, I reckon some technical measures would be
in place to protect against 'leakage' of all product information by such
tools (assuming such measures are justified by calculated risk). The
data we publish online are important company gems: we want them to be
accessible to any visitor, yet protected against systematic download,
whether unintentional (like Internet Explorer's built-in MSIECrawler) or
intentional (WebCopier, Teleport, ...).

Consider this:
detailpage.html?bid=0000001
detailpage.html?bid=0000002
detailpage.html?bid=0000003
(...)

Or with multiple levels:
detailpage.html?bid=0000001&t=1
detailpage.html?bid=0000001&t=2
detailpage.html?bid=0000002&t=1
detailpage.html?bid=0000002&t=2
(...)
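
A plain rate limit would also ignore the shape of that traffic: a client
walking bid=0000001, 0000002, 0000003 in order. Something like this rough
Python sketch (history length and threshold are arbitrary choices) could
flag enumeration per client IP:

    from collections import defaultdict, deque

    HISTORY = 20   # how many recent bid values to remember per client

    recent_bids = defaultdict(lambda: deque(maxlen=HISTORY))

    def looks_like_enumeration(client, bid, threshold=0.8):
        """True once most of a client's recent detail-page hits step through bids in order."""
        hist = recent_bids[client]
        hist.append(int(bid))
        if len(hist) < HISTORY:
            return False
        steps = [b - a for a, b in zip(hist, list(hist)[1:])]
        sequential = sum(1 for s in steps if 0 < s <= 2)   # tolerate small gaps
        return sequential / len(steps) >= threshold

    # A scraper walking detailpage.html?bid=0000001, 0000002, ... gets flagged quickly.
    for i in range(1, 40):
        if looks_like_enumeration("10.0.0.7", f"{i:07d}"):
            print(f"flagged 10.0.0.7 after bid={i:07d}")
            break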

Specifically, I was wondering whether it's possible and sensible to limit
the number of requests allowed for certain pages per minute or hour. At the
same time, the data displayed by detailpage.html should be indexable by
Google, so the data itself can't be hidden behind a user login, and
client-side scripting is no use either since Google doesn't interpret it.
I'm using Apache 2 on Red Hat Enterprise 4 and know about mod_throttle
(which doesn't work with Apache 2) and mod_security (which also offers some
'throttling' functionality, but only acts on individual requests and can't
remember request sequences).
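
The cross-request memory I'm after isn't much code in itself; the hard part
is presumably sharing the state across Apache processes. A minimal,
framework-agnostic Python sketch (in-process state only, limits invented) of
what I mean, counting detail-page hits per client over a sliding one-hour
window:

    import time
    from collections import defaultdict, deque

    WINDOW_S = 3600          # look at the last hour
    LIMIT_PER_WINDOW = 300   # invented ceiling for detail-page hits per client

    _hits = defaultdict(deque)   # client key -> timestamps of recent requests

    def over_limit(client_key, now=None):
        """Record one request and report whether the client exceeded the window limit."""
        now = time.time() if now is None else now
        window = _hits[client_key]
        window.append(now)
        while window and window[0] <= now - WINDOW_S:
            window.popleft()
        return len(window) > LIMIT_PER_WINDOW

    # In whatever serves detailpage.html:
    #   if over_limit(client_key):
    #       serve a 503 or a CAPTCHA page instead of the product data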

I'd also suppose that the proxy servers of large ISPs, like AOL, are a big
complication, since many legitimate visitors can appear to come from a
single IP address.
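
One way I could imagine softening that is to not key any counters on the
source IP alone but on IP plus a session cookie, falling back to IP plus
User-Agent, so a single AOL proxy address isn't treated as one visitor. A
sketch (the cookie name is made up):

    import hashlib

    def client_key(ip, cookies, user_agent):
        """Identifier the counters key on: IP plus session cookie, else IP plus User-Agent."""
        session = cookies.get("session_id")          # hypothetical cookie, set on first visit
        tail = session if session else (user_agent or "unknown")
        return hashlib.sha1(f"{ip}|{tail}".encode()).hexdigest()

    # Two users behind the same proxy get distinct keys...
    print(client_key("205.188.1.1", {"session_id": "a1"}, "Mozilla/4.0"))
    print(client_key("205.188.1.1", {"session_id": "b2"}, "Mozilla/4.0"))
    # ...while a cookie-refusing bulk downloader collapses to one key per IP and agent.
    print(client_key("205.188.1.1", {}, "Teleport Pro"))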

Any ideas?

Best regards,
Matthijs

