Christof Meerwald@blog.www


Sun Mar 01 19:29:22 2009 GMT: Native IPv6

Sat Feb 21 19:54:10 2009 GMT: Open Watcom 1.8

Tue Feb 17 07:58:57 2009 GMT: msnbot turning evil

As an update to the previous entry: when I created the robots.txt file, I had hoped that msnbot would act on it. What happened instead is really outrageous: the moment msnbot requested the robots.txt file, it simply changed its User-Agent header so that it no longer identified itself as msnbot, but continued requesting exactly the same pages at the same rate as before.

And yes, I have checked the DNS records and whois information for the offending IP addresses (65.55.51.34 and 65.55.51.37) to confirm that they really belong to Microsoft/MSN.
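
For anyone wanting to repeat the DNS part of that check, here is a minimal Python sketch of a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, check that the hostname ends in a suffix you trust, then resolve the hostname back and make sure the original IP is among the results. The function name verify_crawler_ip and the allowed_suffixes parameter are made up for illustration, and the suffix you pass in is your own assumption about which domain the crawler operator uses. It does not cover the whois lookup.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes, *,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check (illustrative helper).

    ip               -- client address taken from the access log
    allowed_suffixes -- hostname suffixes you trust (an assumption you supply)
    reverse, forward -- resolver functions, injectable so the check can be
                        exercised without network access
    """
    try:
        hostname = reverse(ip)[0]          # PTR lookup
    except OSError:                        # herror/gaierror are OSError subclasses
        return False

    hostname = hostname.rstrip(".").lower()
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False

    try:
        addresses = forward(hostname)[2]   # A lookup on the PTR result
    except OSError:
        return False

    # Only trust the PTR record if the name maps back to the same address.
    return ip in addresses
```

The second lookup matters because anyone who controls the reverse zone for an address range can make a PTR record claim any hostname they like.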

So, this is how the User-Agent change looks in the Apache access_log:

65.55.51.37 - - [16/Feb/2009:15:28:38 -0800] "GET /index.php?title=Special:Recentchanges&hidebots=0&days=14&limit=100&feed=rss HTTP/1.1" 200 28455 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.51.37 - - [16/Feb/2009:15:30:10 -0800] "GET /robots.txt HTTP/1.1" 200 389 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.51.37 - - [16/Feb/2009:15:30:10 -0800] "GET /index.php?title=Special:Recentchanges&hideliu=0&hidebots=&feed=atom HTTP/1.1" 200 13171 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; InfoPath.2; .NET CLR 3.5.21022; .NET CLR 3.0.30618;.NET CLR 3.5.30729;)"

65.55.51.34 - - [16/Feb/2009:15:28:25 -0800] "GET /index.php?title=Special:Recentchanges&hideanons=1&hideliu=1&hidemyself=1&feed=rss HTTP/1.1" 200 13174 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.51.34 - - [16/Feb/2009:15:28:27 -0800] "GET /robots.txt HTTP/1.1" 200 389 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.51.34 - - [16/Feb/2009:15:28:27 -0800] "GET /index.php?title=Special:Recentchanges&days=7&hidemyself=1&feed=atom HTTP/1.1" 200 13176 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; InfoPath.2; .NET CLR 3.5.21022; .NET CLR 3.0.30618;.NET CLR 3.5.30729;)"
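
The pattern in these log lines (a robots.txt fetch immediately followed by a request from the same IP with a different User-Agent) can be searched for mechanically. The following is a rough Python sketch, assuming the Apache "combined" log format shown above; the function name agent_switches is made up for this example and it is not the tool that was used to find the lines quoted here.

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def agent_switches(lines):
    """Return (ip, agent_before, agent_after) for every client whose first
    request after fetching /robots.txt carries a different User-Agent."""
    pending = {}    # ip -> User-Agent used for its latest robots.txt fetch
    switches = []
    for line in lines:
        m = LOG_RE.match(line.strip())
        if not m:
            continue    # skip lines that are not in combined format
        ip, path, agent = m["ip"], m["path"], m["agent"]
        if path == "/robots.txt":
            pending[ip] = agent
        elif ip in pending:
            before = pending.pop(ip)
            if agent != before:
                switches.append((ip, before, agent))
    return switches
```

Calling agent_switches(open("access_log")) on a log containing the excerpt above should report both 65.55.51.37 and 65.55.51.34. It only compares the single request that follows each robots.txt fetch, so it will miss a client that waits a while before switching.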

Mon Feb 16 21:27:05 2009 GMT: msnbot considered harmful

Or should I say "msnbot's DDoS attack on the Open Watcom web server"? Peter Chapin noticed high server load on the Open Watcom server, so I took a look and found that msnbot was hitting the server hard. In fact, I counted hits from 99 different IP addresses associated with msnbot, and the bot appears to be extremely interested in every conceivable variation of the "recent changes" page on the wiki (which, of course, can be quite CPU intensive to generate).

As a first step, I have created a robots.txt file to tell msnbot to slow down (and to prevent it from crawling the "recent changes" page). Hopefully, this will improve the situation within the next few hours; otherwise I will have to block msnbot from the server completely.
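
To illustrate what such rules mean to a well-behaved crawler, here is a small Python sketch using the standard library's urllib.robotparser. The robots.txt content in it is hypothetical (the actual file on the server is not reproduced in this post, and the delay value is invented); it only assumes a Crawl-delay line plus a Disallow rule for the "recent changes" URLs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, not the file on the server.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow: /index.php?title=Special:Recentchanges
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent, path):
    """True if the rules above permit `agent` to request `path`."""
    return parser.can_fetch(agent, path)


def delay_for(agent):
    """Seconds the agent is asked to wait between requests, or None."""
    return parser.crawl_delay(agent)
```

With these rules, may_fetch("msnbot", ...) is false for any URL starting with /index.php?title=Special:Recentchanges, and delay_for("msnbot") returns the requested delay. Crawl-delay is a non-standard extension, so whether a given crawler honours it is up to that crawler.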

But one would really expect that a bot (or a network of bots) would be clever enough not to open lots of concurrent connections to a single server and automatically slow down a bit when the server takes a long time to respond to requests...
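
What "slowing down a bit" could look like on the crawler side, as a minimal Python sketch: double the wait between requests to a host whenever a response was slow, and shorten it again gradually once the host responds quickly. The function name, the thresholds and the limits are all invented for illustration; this is not a description of how msnbot or any other real crawler works.

```python
def next_delay(current_delay, response_time, *,
               slow_threshold=2.0, step=1.0, floor=1.0, ceiling=300.0):
    """Return the number of seconds to wait before the next request
    to the same host.

    current_delay  -- delay used before the previous request (seconds)
    response_time  -- how long the previous response took (seconds)
    slow_threshold -- responses slower than this count as "server is busy"
    step           -- how much to shorten the delay after a fast response
    floor, ceiling -- bounds on the resulting delay
    """
    if response_time > slow_threshold:
        proposed = current_delay * 2      # back off quickly under load
    else:
        proposed = current_delay - step   # recover slowly when things are fine
    return max(floor, min(ceiling, proposed))
```

A crawler would keep one such delay per host and also cap the number of concurrent connections to that host, so that a fleet of bots sharing the state behaves like a single polite client.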

Sat Feb 14 17:27:08 2009 GMT: Switched broadband connection to IDNet

I have just switched my broadband connection from Tiscali to IDNet, as the connection via Tiscali had been getting worse over the past few months (not only was it slow at weekends, their TCP port blocking also became more and more annoying). The migration went quite well and I am now enjoying a much faster connection with the ADSL modem syncing at 7616 kbps (although I haven't really tested what that translates to in terms of real-world download speed).

BTW, what's really nice about IDNet is that they don't tie you into a lengthy contract, which suits me as I am keeping an eye on moving to a mobile broadband connection in the future.

Sun Feb 08 15:47:27 2009 GMT: Xref header filtering for newscache

I have finally implemented Xref header filtering in newscache. This is useful as some newsreaders (like slrn) get confused by incorrect Xref headers. If you have configured only a single upstream server in newscache, you won't see the problem, but if you are using different upstream servers for different hierarchies (as I do) and you happen to be reading an article that has been cross-posted to those hierarchies, not all article numbers in the Xref header are valid.

Let's look at an example: say you have 2 upstream servers (s1.example.com for hier1.* and s2.example.com for hier2.*) configured in newscache and an article has been posted to hier1.test and hier2.test. On s1.example.com the Xref header might look like "Xref: s1.example.com hier1.test:210 hier2.test:220" and on s2.example.com the Xref header might look like "Xref: s2.example.com hier1.test:110 hier2.test:120". Now if you are reading the article in hier1.test (via s1.example.com) you don't want newscache to return the Xref header as is because article number 220 in hier2.test refers to a different article on your system (as you are reading hier2.test via s2.example.com).

Ideally, you would want newscache to return the correct article number for hier2.test in the Xref header, but that would involve too much processing (as you might have to fetch the article from s2.example.com to get the article number). The next best thing (and this is what I have implemented) is to filter the Xref header to only include groups you are reading via the same server. So if you are reading the article in hier1.test, you will get "Xref: s1.example.com hier1.test:210" and if you are reading the article in hier2.test, you will get "Xref: s2.example.com hier2.test:120" for the Xref headers. This means that your newsreader might not be able to automatically mark the article as read in the other newsgroup.
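
The filtering rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the newscache code; filter_xref and upstream_for are names made up for this example, and the routing table mirrors the two-server example above.

```python
def filter_xref(header, group_server, current_server):
    """Keep only the group:number entries that are read via current_server.

    header         -- full header line, e.g. "Xref: host group:num group:num"
    group_server   -- function mapping a newsgroup name to its upstream server
    current_server -- upstream server the article is being read from
    """
    name, host, *entries = header.split()
    kept = [
        entry for entry in entries
        if group_server(entry.rsplit(":", 1)[0]) == current_server
    ]
    return " ".join([name, host, *kept])


def upstream_for(group):
    """Hypothetical routing table: hier1.* via s1, hier2.* via s2."""
    routes = {"hier1.": "s1.example.com", "hier2.": "s2.example.com"}
    for prefix, server in routes.items():
        if group.startswith(prefix):
            return server
    return None
```

Applied to the example headers, filter_xref("Xref: s1.example.com hier1.test:210 hier2.test:220", upstream_for, "s1.example.com") yields "Xref: s1.example.com hier1.test:210", and the header from s2.example.com is reduced to its hier2.test:120 entry in the same way.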

As always, the package for Ubuntu hardy heron (i386 and amd64) is available from my .deb packages page. BTW, thanks to Henrik for pointing out how to clarify the initial version of this blog post.

Mon Feb 02 18:05:46 2009 GMT: Finally some snow around here

and the country almost comes to a standstill with complete chaos.

Sat Jan 31 10:23:16 2009 GMT: (Ab)Using OpenVPN

Sat Jan 31 10:05:03 2009 GMT: Twinkle 1.4 Bug

Sun Jan 04 16:17:39 2009 GMT: Python DB-API Rant

Sat Dec 27 21:24:53 2008 GMT: vAdmin SNMP Service

Mon Dec 22 10:26:51 2008 GMT: 64-bit vServer


This Web page is licensed under the Creative Commons Attribution - NonCommercial - Share Alike License. Any use is subject to the Privacy Policy.

Revision: 1.14, cmeerw.org/blog/593.html
Last modified: Mon Sep 03 18:19:55 2018
Christof Meerwald <cmeerw@cmeerw.org>
XMPP: cmeerw@cmeerw.org