Blocking WebWise (Phorm) by User-Agent



 
 
#1
April 23rd 09, 01:09 PM, posted to uk.telecom.broadband
Chris Hills

Hi

I operate a few websites and in addition to having my domains
blacklisted by Phorm I want to exclude them using robots.txt just in
case. However, the BT WebWise guide at [1] says that their crawler obeys
entries for Yahoo's and Google's crawlers as well as "*", but does not
list their own crawler user agent. This means that in order to block
their crawler you would have to either block Google, Yahoo or both. Of
course I do not want to do that. Does anyone know what user-agent they
use? I made an inquiry using the form on the WebWise site but they refuse
to answer my question as I am not a BT customer. Alternatively, one
could redirect the crawler to a different robots.txt file, but to do
this one would need to know the IP address(es) from which the crawler
operates.

Regards,

Chris Hills

[1]
www2.bt.com/static/i/btretail/webwise/help.html#how-do-i-prevent-webwise-from-scanning-my-site

#2
April 23rd 09, 03:29 PM, posted to uk.telecom.broadband
Richard Tobin

Chris Hills wrote:

> I operate a few websites and in addition to having my domains
> blacklisted by Phorm I want to exclude them using robots.txt just in
> case. However, the BT WebWise guide at [1] says that their crawler obeys
> entries for Yahoo's and Google's crawlers as well as "*", but does not
> list their own crawler user agent. This means that in order to block
> their crawler you would have to either block Google, Yahoo or both. Of
> course I do not want to do that. Does anyone know what user-agent they
> use? I made an inquiry using the form on the WebWise site but they refuse
> to answer my question as I am not a BT customer.


I also asked them about this, and have received only an automated
acknowledgment.

> Alternatively, one
> could redirect the crawler to a different robots.txt file, but to do
> this one would need to know the IP address(es) from which the crawler
> operates.


I did find a list of addresses to block, but I don't remember where.
It's somewhere on the web.

-- Richard
--
Please remember to mention me / in tapes you leave behind.

#3
April 24th 09, 08:29 AM, posted to uk.telecom.broadband
Dave Saville

On Thu, 23 Apr 2009 12:09:48 UTC, Chris Hills wrote:

> Hi
>
> I operate a few websites and in addition to having my domains
> blacklisted by Phorm I want to exclude them using robots.txt just in
> case. However, the BT WebWise guide at [1] says that their crawler obeys
> entries for Yahoo's and Google's crawlers as well as "*", but does not
> list their own crawler user agent. This means that in order to block
> their crawler you would have to either block Google, Yahoo or both. Of
> course I do not want to do that. Does anyone know what user-agent they
> use? I made an inquiry using the form on the WebWise site but they refuse
> to answer my question as I am not a BT customer. Alternatively, one
> could redirect the crawler to a different robots.txt file, but to do
> this one would need to know the IP address(es) from which the crawler
> operates.


Surely *something* is going to show up in your web logs? Tip: When
processing log files, exclude what you know you don't want. Then
whatever is left is out of the ordinary. You can't program for what
you don't know is there (yet) :-)
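
A minimal sketch of that exclude-what-you-know approach, assuming the access log uses the Apache combined format, where the user agent is the last quoted field on each line. The `KNOWN_AGENTS` list, the `unknown_agents` helper and the sample lines are illustrative assumptions, not real traffic.

```python
import re
from collections import Counter

# Substrings of user agents that are already accounted for (illustrative).
KNOWN_AGENTS = ("Googlebot", "Slurp", "Mozilla")

# In the combined log format the user agent is the final quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def unknown_agents(lines):
    """Count user agents that match none of the known substrings."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that do not end in a quoted field
        agent = match.group(1)
        if not any(known in agent for known in KNOWN_AGENTS):
            counts[agent] += 1
    return counts

sample = [
    '192.0.2.1 - - [24/Apr/2009:09:00:00 +0100] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux)"',
    '192.0.2.2 - - [24/Apr/2009:09:00:05 +0100] "GET /robots.txt HTTP/1.1" 200 64 "-" "ExampleCrawler/1.0"',
]
print(unknown_agents(sample))
```

Whatever survives the filter is the "out of the ordinary" remainder. A crawler that sends a browser-like user agent would slip through a substring list like this one, so the list needs tuning against real logs.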

--
Regards
Dave Saville

NB Remove nospam. for good email address

#4
April 24th 09, 02:07 PM, posted to uk.telecom.broadband
Chris Hills

On 24/04/09 09:29, Dave Saville wrote:
> Surely *something* is going to show up in your web logs? Tip: When
> processing log files, exclude what you know you don't want. Then
> whatever is left is out of the ordinary. You can't program for what
> you don't know is there (yet) :-)


Dave

When my sites get accessed the user agents are indeed logged. However,
once I know what the crawler agent is, it will be too late since it will
already have been crawled :-)

Regards,

Chris

#5
April 24th 09, 05:05 PM, posted to uk.telecom.broadband
Invalid

Chris Hills writes:
> On 24/04/09 09:29, Dave Saville wrote:
>> Surely *something* is going to show up in your web logs? Tip: When
>> processing log files, exclude what you know you don't want. Then
>> whatever is left is out of the ordinary. You can't program for what
>> you don't know is there (yet) :-)
>
> Dave
>
> When my sites get accessed the user agents are indeed logged. However,
> once I know what the crawler agent is, it will be too late since it
> will already have been crawled :-)
>
> Regards,
>
> Chris

Does Phorm crawl the sites? Will the traffic they profile show up in the
logs in any way which differs from the original requester?

AIUI Phorm's methodology is to look at the web pages individuals are
browsing in order to profile the individual not the website. It
identifies the individual from cookies set on the user's machine, and
then passes on the original request to the website as if it came from
the individual.

Phorm say (see http://www.cl.cam.ac.uk/~rnc1/080518-phorm.pdf) that when
the website is first visited (by any ISP customer) the robots.txt file
is retrieved and cached (for a month). That implies you might see in the
log one request for robots.txt once a month from a Phorm IP. The rest of
the traffic from your website that they profile will be unidentifiable
by you in any way.
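
If that reading is right, the only trace in the logs would be a client that asks for robots.txt and nothing else. A minimal sketch of how one might look for such clients, assuming common or combined log format. The `robots_only_clients` helper and the sample addresses (taken from the documentation ranges) are illustrative assumptions, and an ordinary well-behaved crawler that found nothing it was allowed to fetch would look the same.

```python
import re
from collections import defaultdict

# Client address and request path from a common/combined-format log line.
LINE_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)')

def robots_only_clients(lines):
    """Return addresses whose every logged request was for /robots.txt."""
    paths_by_client = defaultdict(set)
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            client, path = match.groups()
            paths_by_client[client].add(path)
    return sorted(
        client
        for client, paths in paths_by_client.items()
        if paths == {"/robots.txt"}
    )

sample = [
    '203.0.113.9 - - [24/Apr/2009:10:00:00 +0100] "GET /robots.txt HTTP/1.1" 200 64',
    '203.0.113.9 - - [24/Apr/2009:10:00:02 +0100] "GET /index.html HTTP/1.1" 200 2048',
    '198.51.100.7 - - [24/Apr/2009:10:05:00 +0100] "GET /robots.txt HTTP/1.1" 200 64',
]
print(robots_only_clients(sample))
```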

The same document also suggests that they will only respect a
User-Agent: * construction and not one targeted at their Agent. See Para
44 "we work on the basis that if a site allows spidering of its
contents by search engines, then its material is being openly published.
Conversely, if the site has disallowed spidering and indexing by search
engines, we respect those restrictions in robots.txt". If they aren't
going to respect a User-Agent: Phorm (and why would they) then there is
no real point in knowing what the agent is really called anyway.

I suspect there is going to be no way a website can block Phorm while
allowing Google etc. to index it without someone resorting to the
courts.
--
Invalid

#6
April 24th 09, 10:27 PM, posted to uk.telecom.broadband
Digby

On Thu, 23 Apr 2009 14:09:48 +0200, Chris Hills wrote:

> Hi
>
> I operate a few websites and in addition to having my domains
> blacklisted by Phorm I want to exclude them using robots.txt just in
> case. However, the BT WebWise guide at [1] says that their crawler obeys
> entries for Yahoo's and Google's crawlers as well as "*", but does not
> list their own crawler user agent. This means that in order to block
> their crawler you would have to either block Google, Yahoo or both. Of
> course I do not want to do that. Does anyone know what user-agent they
> use? I made an inquiry using the form on the WebWise site but they refuse
> to answer my question as I am not a BT customer. Alternatively, one
> could redirect the crawler to a different robots.txt file, but to do
> this one would need to know the IP address(es) from which the crawler
> operates.
>
> Regards,
>
> Chris Hills
>
> [1]
> www2.bt.com/static/i/btretail/webwise/help.html#how-do-i-prevent-webwise-from-scanning-my-site


Webwise are also maintaining an opt-out list of websites.
You can join Amazon and Wikipedia and put your sites on the list by
emailing .

I would direct you to the appropriate page on the BT website, but it's
unavailable at the moment while they update the site.

There's a little more information here:
http://blog.gidley.co.uk/2009/04/pho...r-opt-out.html

#7
April 25th 09, 09:48 AM, posted to uk.telecom.broadband
Alex Fraser

Invalid wrote:
> AIUI Phorm's methodology is to look at the web pages individuals are
> browsing in order to profile the individual not the website. It
> identifies the individual from cookies set on the user's machine, and
> then passes on the original request to the website as if it came from
> the individual.


This is as I understand it too.

> Phorm say (see http://www.cl.cam.ac.uk/~rnc1/080518-phorm.pdf) that when
> the website is first visited (by any ISP customer) the robots.txt file
> is retrieved and cached (for a month). That implies you might see in the
> log one request for robots.txt once a month from a Phorm IP. The rest of
> the traffic from your website that they profile will be unidentifiable
> by you in any way.


Yes, this makes sense.

> The same document also suggests that they will only respect a
> User-Agent: * construction and not one targeted at their Agent. See Para
> 44

[snip]

> I suspect there is going to be no way a website can block Phorm while
> allowing Google etc. to index it without someone resorting to the courts.


If you know the possible addresses they request /robots.txt from, you
may be able to arrange for the server to return a special robots.txt to
them.
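
A minimal sketch of that idea: choose the robots.txt body from the requesting address. The networks below are placeholders from the documentation ranges, since nobody in this thread has the real addresses, and `robots_for` is a hypothetical helper rather than part of any web server.

```python
import ipaddress

# Placeholder range from the documentation blocks (RFC 5737); the real
# crawler addresses are not known in this thread.
SPECIAL_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

DEFAULT_ROBOTS = "User-agent: *\nDisallow:\n"
RESTRICTIVE_ROBOTS = "User-agent: *\nDisallow: /\n"

def robots_for(client_ip):
    """Return the robots.txt body to serve to the given client address."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in SPECIAL_NETWORKS):
        return RESTRICTIVE_ROBOTS
    return DEFAULT_ROBOTS

print(robots_for("192.0.2.10"))
print(robots_for("203.0.113.5"))
```

The same selection could be done in the web server's own configuration. The Python version only illustrates the decision being made.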

More simply, you can allow specific crawlers but disallow all others
(including Webwise/Phorm), eg:

User-agent: Google
User-agent: Yahoo
(etc)
Disallow:

User-agent: *
Disallow: /
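
As a rough check of how a standards-following parser reads rules of this shape, here is a sketch using Python's standard-library `urllib.robotparser`. It substitutes the crawler tokens `Googlebot` and `Slurp` for the placeholder names above, and `SomeOtherBot` is a made-up agent. It says nothing about how WebWise itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Slurp
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named crawlers get the empty Disallow (everything allowed);
# any other agent falls through to the "*" record.
for agent in ("Googlebot", "Slurp", "SomeOtherBot"):
    print(agent, parser.can_fetch(agent, "/index.html"))
```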

Alex

#8
April 26th 09, 08:33 AM, posted to uk.telecom.broadband
Denis McMahon

On Apr 23, 1:09 pm, Chris Hills wrote:

> I operate a few websites and in addition to having my domains
> blacklisted by Phorm I want to exclude them using robots.txt just in
> case.


Surely the best solution is to use the apache configs or .htaccess
files to deny the ip ranges involved at every level of the websites
concerned?
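
A minimal sketch of generating such deny rules from a list of ranges, assuming Apache 2.4 `Require` syntax (older 2.2 servers use `Order` and `Deny from` instead). The ranges shown are documentation placeholders, not Phorm's addresses, and `htaccess_deny_rules` is a hypothetical helper.

```python
import ipaddress

def htaccess_deny_rules(ranges):
    """Render Apache 2.4 directives that refuse the given address ranges."""
    lines = ["<RequireAll>", "    Require all granted"]
    for cidr in ranges:
        # ip_network raises ValueError for malformed or non-network input.
        network = ipaddress.ip_network(cidr)
        lines.append(f"    Require not ip {network}")
    lines.append("</RequireAll>")
    return "\n".join(lines)

print(htaccess_deny_rules(["192.0.2.0/24", "198.51.100.0/24"]))
```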

Remember that for any crawler, robots.txt is a matter of good manners
and etiquette, not a set-in-stone rule that must be obeyed.

Denis

#9
April 26th 09, 10:50 AM, posted to uk.telecom.broadband
Dave {Reply Address in.Sig}

Denis McMahon wrote:

> On Apr 23, 1:09 pm, Chris Hills wrote:
>
>> I operate a few websites and in addition to having my domains
>> blacklisted by Phorm I want to exclude them using robots.txt just in
>> case.
>
> Surely the best solution is to use the apache configs or .htaccess
> files to deny the ip ranges involved at every level of the websites
> concerned?
>
> Remember that for any crawler, robots.txt is a matter of good manners
> and etiquette, not a set-in-stone rule that must be obeyed.

I occasionally scan my web logs and any crawler found to be ignoring
robots.txt gets an immediate ban in the firewall.

Does anyone have a definitive list of Phorm IP ranges?
--
Dave
da (without the space)
So many gadgets, so little time.

#10
April 27th 09, 01:41 PM, posted to uk.telecom.broadband
Richard Tobin

Alex Fraser wrote:

> More simply, you can allow specific crawlers but disallow all others
> (including Webwise/Phorm), eg:
>
> User-agent: Google
> User-agent: Yahoo
> (etc)
> Disallow:
>
> User-agent: *
> Disallow: /


The BT website implies that if you allow Google or Yahoo, Phorm
will take that as permission.
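
A sketch of that reading, purely as a model of the claim: treat a site as permitted if any search crawler may fetch it. The tokens `Googlebot` and `Slurp` and the `read_as_permission` helper are assumptions for illustration. The thread does not establish which user-agent strings WebWise actually checks or how it combines them.

```python
from urllib.robotparser import RobotFileParser

# Tokens assumed to stand in for "Google or Yahoo" (not confirmed by BT).
SEARCH_AGENTS = ("Googlebot", "Slurp")

def read_as_permission(robots_txt, path="/"):
    """True if any of the listed search crawlers may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return any(parser.can_fetch(agent, path) for agent in SEARCH_AGENTS)

allow_list = (
    "User-agent: Googlebot\n"
    "User-agent: Slurp\n"
    "Disallow:\n"
    "\n"
    "User-agent: *\n"
    "Disallow: /\n"
)
block_all = "User-agent: *\nDisallow: /\n"

print(read_as_permission(allow_list))
print(read_as_permission(block_all))
```

Under this model the allow-list file still counts as permission, and only a file that shuts out the search engines as well does not.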

If you know their IP addresses, you can block them at your firewall.

-- Richard
--
Please remember to mention me / in tapes you leave behind.
 



