Detecting and blocking bad robots

Source: http://www.laconicsecurity.com/detecting-and-blocking-bad-robots.html

It is often in the interest of web robot authors to obfuscate the true identity of their requests. These obfuscations often consist of changing the HTTP User-Agent header without modifying the other HTTP headers to match. By leveraging existing passive browser fingerprinting projects, it is possible to detect these robots. If desired, these requests can be blocked using applications such as modsecurity in Apache, or the native configuration files of web servers such as lighttpd or Apache.

Authors of noncompliant web robots frequently spoof the HTTP User-Agent header, which is nominally used to identify the application making an HTTP request. The motives for spoofing the User-Agent vary, but organizations that engage in the practice often scrape e-mail addresses, search for intellectual property violations, or perform other data mining tasks. These robots rarely obey the robots.txt exclusion standard. By leveraging the research of the browserrecon project, or merely through extensive logfile analysis, it is possible to detect and block poorly programmed bots that spoof the HTTP User-Agent header without spoofing the other HTTP headers that normally accompany that User-Agent.

A simple example may be helpful to illustrate this. In the HTTP header from a simple IE 7 request, the following headers will be set:

GET / HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, ..., */*
Accept-Language: en-us
UA-CPU: x86
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT ...)
Host: example.com
Connection: Keep-Alive

However, under Firefox 3.0.3, this request is made:

GET / HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; ...) Firefox/3.0.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

If the request is made using curl, the following headers are sent:

GET / HTTP/1.1
User-Agent: curl/7.18.0 (i486-pc-linux-gnu) libcurl/7.18.0 OpenSSL/0.9.8g ...
Host: example.com
Accept: */*

It should be readily apparent that the distinct sets of HTTP headers sent by different browsers allow us to detect unsophisticated web robots. For example, if curl were to send a fake User-Agent header claiming to be Firefox or MSIE, we could detect the mismatch, because curl sets different HTTP header values than those browsers do. Most bots with a fake User-Agent send a set of HTTP headers completely inconsistent with that User-Agent. Blocking or monitoring these requests can be performed through custom modsecurity rules, or natively in many HTTP servers. As an example, I will discuss how these requests can be blocked in lighttpd and Apache.
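The consistency check described above can be sketched in a few lines of Python. This is only an illustration, not code from the original article: the function name `header_mismatches` and its `path` parameter are hypothetical, and the expected header values are taken from the sample requests shown earlier, so they apply only to those browser versions.

```python
import re

def header_mismatches(headers, path="/"):
    """Return reasons why the headers contradict the claimed User-Agent."""
    # HTTP header names are case-insensitive, so normalise them to lower case.
    h = {name.lower(): value for name, value in headers.items()}
    ua = h.get("user-agent", "")
    via_proxy = "via" in h  # some checks are skipped for proxied requests
    reasons = []

    if re.search(r"Firefox/[1-3]\.", ua):
        if not h.get("accept-language"):
            reasons.append("no Accept-Language")
        if h.get("accept-encoding") != "gzip,deflate":
            reasons.append("unexpected Accept-Encoding")
        if "utf-8" not in h.get("accept-charset", "").lower():
            reasons.append("no UTF-8 in Accept-Charset")
        if not via_proxy:
            if h.get("connection") != "keep-alive":
                reasons.append("unexpected Connection")
            if h.get("keep-alive") not in ("300", "115"):
                reasons.append("unexpected Keep-Alive")
    elif re.search(r"MSIE [6-8]\.", ua):
        # IE 7 sends no Accept-Language when fetching favicon.ico.
        if "favicon.ico" not in path and not h.get("accept-language"):
            reasons.append("no Accept-Language")
        if not h.get("accept"):
            reasons.append("no Accept")
        if not via_proxy:
            if h.get("accept-encoding") != "gzip, deflate":
                reasons.append("unexpected Accept-Encoding")
            if h.get("connection") != "Keep-Alive":
                reasons.append("unexpected Connection")
    # Any other User-Agent (including an honest curl) is not checked here.
    return reasons


if __name__ == "__main__":
    # curl's default headers with a spoofed Firefox User-Agent
    spoofed = {
        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0; ...) Firefox/3.0.3",
        "Host": "example.com",
        "Accept": "*/*",
    }
    for reason in header_mismatches(spoofed):
        print(reason)
```

A request that reports no mismatches is not necessarily genuine; it only means the sender copied the headers checked here.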

The concept behind blocking bad web bots is simple: we add rules to the configuration file to validate that the HTTP headers in every request are consistent with the headers that would be sent by the claimed User-Agent. The following configuration example in lighttpd should be fairly self-explanatory (note that lighttpd has limited support for filtering on HTTP headers, so I put together a simple patch for lighttpd 1.4.26 that adds support for the Accept, Accept-Language, and other important headers):

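# Firefox identifies itself as "Firefox/3.0.3", so match on the slash
$HTTP["useragent"] =~ "(Firefox/[1-3])" {
  $HTTP["accept-language"] == "" {
    url.access-deny = ( "" ) }
  $HTTP["accept-encoding"] != "gzip,deflate" {
    url.access-deny = ( "" ) }
  $HTTP["accept-charset"] !~ "(utf-8|UTF-8)" {
    url.access-deny = ( "" ) }
  # Skip these checks when a Via header shows the request passed through a proxy
  $HTTP["via"] == "" {
    $HTTP["connection"] != "keep-alive" {
      url.access-deny = ( "" ) }
    # !~ (regex match) rather than != so that either value is accepted
    $HTTP["keepalive"] !~ "^(300|115)$" {
      url.access-deny = ( "" ) }
  }
}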

$HTTP["useragent"] =~ "(MSIE [6-8])" {
  # IE7 doesn't set accept-language for favicon
  $HTTP["url"] !~ "(favicon.ico)" {
    $HTTP["accept-language"] == "" {
      url.access-deny = ( "" ) }
  }
  $HTTP["accept"] == "" {
    url.access-deny = ( "" ) }
  $HTTP["via"] == "" {
    $HTTP["accept-encoding"] != "gzip, deflate" {
      url.access-deny = ( "" ) }
    $HTTP["connection"] != "Keep-Alive" {
      url.access-deny = ( "" ) }
  }
}

We can use mod_rewrite to accomplish the same task in Apache:

# [OR] binds more tightly than the implicit AND between conditions, so the
# User-Agent test is stated once per rule and only the header checks are ORed.
RewriteCond %{HTTP_USER_AGENT}      Firefox/[1-3]
RewriteCond %{HTTP:Accept-Language} ^$                             [OR]
RewriteCond %{HTTP:Accept-Encoding} !^gzip,deflate$                [OR]
RewriteCond %{HTTP:Accept-Charset}  !(UTF-8|utf-8)
RewriteRule (.*) - [F]

# Skip these checks when a Via header shows the request passed through a proxy
RewriteCond %{HTTP_USER_AGENT}      Firefox/[1-3]
RewriteCond %{HTTP:Via}             ^$
RewriteCond %{HTTP:Connection}      !^keep-alive$                  [OR]
RewriteCond %{HTTP:Keep-Alive}      !^(300|115)$
RewriteRule (.*) - [F]

# Patterns containing a space must be quoted, otherwise the text after the
# space is parsed as the flags argument.
# IE7 doesn't set Accept-Language for favicon
RewriteCond %{HTTP_USER_AGENT}      "MSIE [6-8]"
RewriteCond %{REQUEST_URI}          !favicon\.ico
RewriteCond %{HTTP:Accept-Language} ^$
RewriteRule (.*) - [F]

RewriteCond %{HTTP_USER_AGENT}      "MSIE [6-8]"
RewriteCond %{HTTP:Accept}          ^$
RewriteRule (.*) - [F]

# [OR] joins only the two header checks; the result is ANDed with the rest
RewriteCond %{HTTP_USER_AGENT}      "MSIE [6-8]"
RewriteCond %{HTTP:Via}             ^$
RewriteCond %{HTTP:Accept-Encoding} "!^gzip, deflate$"             [OR]
RewriteCond %{HTTP:Connection}      !^Keep-Alive$
RewriteRule (.*) - [F]

RewriteCond %{HTTP_USER_AGENT}      "MSIE 7"
RewriteCond %{HTTP:UA-CPU}          ^$
RewriteRule (.*) - [F]

More sophisticated approaches can be used to detect web robots, such as those described by Park and Lee in "Securing Web Service by Automatic Robot Detection".

The decision to implement rules that block HTTP requests on a production web server should be weighed carefully, as there is always the possibility of blocking legitimate users. Implementing such rules in modsecurity for detection only is highly recommended; the configuration changes above are given as examples to illustrate reasonable logic for detecting such requests.

Update: I have started the Roboticity project for web robot detection in PHP. Roboticity scores requests based upon header values, which can then be used to take action based upon the likelihood the request came from a misbehaved web robot.
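To illustrate the idea of scoring a request rather than making a yes/no decision, here is a minimal Python sketch. It is not Roboticity's code or algorithm: the checks, the weights, the thresholds, and the names `robot_score` and `action_for` are all hypothetical and chosen only for this example.

```python
# Hypothetical (weight, description, predicate) triples. The weights are
# illustrative only and are not taken from Roboticity.
CHECKS = [
    (3, "missing Accept-Language", lambda h: not h.get("accept-language")),
    (2, "missing Accept-Encoding", lambda h: not h.get("accept-encoding")),
    (2, "missing or wildcard-only Accept",
     lambda h: h.get("accept", "*/*").strip() == "*/*"),
    (1, "missing Connection", lambda h: not h.get("connection")),
]

def robot_score(headers):
    """Return (score, reasons) for a request claiming to be a browser."""
    # HTTP header names are case-insensitive, so normalise them to lower case.
    h = {name.lower(): value for name, value in headers.items()}
    # Only requests whose User-Agent claims to be a browser are scored;
    # a client that identifies itself honestly (e.g. curl) scores zero.
    if not h.get("user-agent", "").startswith("Mozilla/"):
        return 0, []
    hits = [(weight, desc) for weight, desc, failed in CHECKS if failed(h)]
    return sum(weight for weight, _ in hits), [desc for _, desc in hits]

def action_for(score, log_at=3, block_at=6):
    """Map a score to an action using hypothetical thresholds."""
    if score >= block_at:
        return "block"
    if score >= log_at:
        return "log"
    return "allow"


if __name__ == "__main__":
    # curl's default headers with a spoofed Firefox User-Agent
    spoofed = {
        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0; ...) Firefox/3.0.3",
        "Host": "example.com",
        "Accept": "*/*",
    }
    score, reasons = robot_score(spoofed)
    print(score, action_for(score), reasons)
```

A graded score leaves room for known quirks: an IE 7 favicon request without Accept-Language trips one check and is only logged under these thresholds, while a request that fails every check is blocked.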

Feel free to contact me if you have any questions.