<p>As Stack Overflow grows, we’re starting to look closely at our IIS logs to identify problem HTTP clients – things like <a href="https://serverfault.com/questions/20383/dynamically-blocking-excessive-http-bandwith-use" rel="noopener nofollow ugc">rogue web spiders</a>, users who have a large page set to refresh every second, poorly written one-off web scrapers, tricksy users who try to increment page count a zillion times, and so forth.</p>
<p>I’ve come up with a few <a href="http://www.microsoft.com/DownLoads/details.aspx?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&displaylang=en" rel="noopener nofollow ugc">LogParser</a> queries that help us identify most of the oddities and abnormalities when pointed at an IIS log file.</p>
<p><strong>Top bandwidth usage by URL</strong></p>
<pre><code class="lang-auto">SELECT top 50 DISTINCT
SUBSTR(TO_LOWERCASE(cs-uri-stem), 0, 55) AS Url,
Count(*) AS Hits,
AVG(sc-bytes) AS AvgBytes,
SUM(sc-bytes) as ServedBytes
FROM {filename}
GROUP BY Url
HAVING Hits >= 20
ORDER BY ServedBytes DESC
</code></pre>
<p>url hits avgbyte served</p>
<hr>
<p>/favicon.ico 16774 522 8756028<br>
/content/img/search.png 15342 446 6842532</p>
<p><strong>Top hits by URL</strong></p>
<pre><code class="lang-auto">SELECT TOP 100
cs-uri-stem as Url,
COUNT(cs-uri-stem) AS Hits
FROM {filename}
GROUP BY cs-uri-stem
ORDER BY COUNT(cs-uri-stem) DESC
</code></pre>
<p>url hits</p>
<hr>
<p>/content/img/sf/vote-arrow-down.png 14076<br>
/content/img/sf/vote-arrow-up.png 14018</p>
<p><strong>Top bandwidth and hits by IP / User-Agent</strong></p>
<pre><code class="lang-auto">SELECT TOP 30
c-ip as Client,
SUBSTR(cs(User-Agent), 0, 70) as Agent,
Sum(sc-bytes) AS TotalBytes,
Count(*) as Hits
FROM {filename}
group by c-ip, cs(User-Agent)
ORDER BY TotalBytes desc
</code></pre>
<p>client user-agent totbytes hits</p>
<hr>
<p>66.249.68.47 Mozilla/5.0+(compatible;+Googlebot/2.1; 135131089 16640<br>
194.90.190.41 omgilibot/0.3++<a href="http://omgili.com" rel="noopener nofollow ugc">omgili.com</a> 133805857 6447</p>
<p><strong>Top bandwidth by hour by IP / User-Agent</strong></p>
<pre><code class="lang-auto">SELECT TOP 30
TO_STRING(time, 'h') as Hour,
c-ip as Client,
SUBSTR(cs(User-Agent), 0, 70) as Agent,
Sum(sc-bytes) AS TotalBytes,
count(*) as Hits
FROM {filename}
group by c-ip, cs(User-Agent), hour
ORDER BY sum(sc-bytes) desc
</code></pre>
<p>hr client user-agent totbytes hits</p>
<hr>
<p>9 194.90.190.41 omgilibot/0.3++<a href="http://omgili.com" rel="noopener nofollow ugc">omgili.com</a> 30634860 1549<br>
10 194.90.190.41 omgilibot/0.3++<a href="http://omgili.com" rel="noopener nofollow ugc">omgili.com</a> 29070370 1503</p>
<p><strong>Top hits by hour by IP / User-Agent</strong></p>
<pre><code class="lang-auto">SELECT TOP 30
TO_STRING(time, 'h') as Hour,
c-ip as Client,
SUBSTR(cs(User-Agent), 0, 70) as Agent,
count(*) as Hits,
Sum(sc-bytes) AS TotalBytes
FROM {filename}
group by c-ip, cs(User-Agent), hour
ORDER BY Hits desc
</code></pre>
<p>hr client user-agent hits totbytes</p>
<hr>
<p>10 194.90.190.41 omgilibot/0.3++<a href="http://omgili.com" rel="noopener nofollow ugc">omgili.com</a> 1503 29070370<br>
12 66.249.68.47 Mozilla/5.0+(compatible;+Googlebot/2.1 1363 13186302</p>
<p>The {filename} of course would be a path to an IIS logfile, such as</p>
<p><code>c:\working\sologs\u_ex090708.log</code></p>
<p>I did a lot of web searches for good IIS LogParser queries and found precious little. These 5, above, have helped us tremendously in identifying serious problem clients. But I’m wondering – what are we missing?</p>
<p>What other ways are there to slice and dice the IIS logs (preferably <em>with LogParser queries</em>) to mine them for statistical anomalies? <strong>Do you have any good IIS LogParser queries you run on your servers?</strong></p>