Google strikes again

Introduction: practical and philosophical speech

My relationship with the Google search engine, and with Google services in general, is a love-hate one. They were (very) good at the very beginning, but now they are more and more committed to growing the Dark Side of the Web. Commercial crap everywhere...

Currently I am contributing to a wiki (RosettaCode) where of course we have to deal with that Thing called Law when submitting content. Recently someone created a task that involved using the Google search engine in some way. Luckily, someone realized that the task was against Google's Terms of Service. They have such a TOS...

5. Use of the Services by you

5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.

For fun, let's imagine a similar TOS for a site... The site provides the content Google can index, and that content is the reason why people use search engines. Without sites, search engines are dead. So the sites provide a Service to Google (and Google, by providing its service, must agree with the TOS of the site from which it fished the data).

5. Use of the Service by Google

5.3 Google agrees not to access (or attempt to access) this Service without allowing this Service, or other Services or persons, to use the indexed materials by any means they can and want; in particular Google agrees to allow access to the indexed materials of this Service through any automated means (including use of scripts or web crawlers).

This basically means: if you index me, then you must allow access to those indexes by any means, human or automated. One can reply: if you don't want Google to index your Site (which from Google's point of view is a Service), put up a proper robots.txt, since they respect it.

But let us invert the reasoning. The only thing that binds me to respect the Google TOS is the TOS itself, which has legal weight. To know that I am violating it (and so, according to some, the law), I must read it. Once I've read it, I know what I can and can't do; I also know, e.g., that I can't write a bot which ignores Google's robots.txt (robots.txt files by themselves have to do with politeness, not legal stuff).

If I do not read it, and create such a bot and let it run, I violate the TOS, which has legal weight. Since the law does not excuse ignorance, I am at fault and can (in the end...) be sued (of course something less annoying can happen before that, like a simple going-to-sue-you threat). So the important fact is the existence of the TOS, and compliance rests on the shoulders of the user, not of the usee (the one being used).

If it is true for us, it must be true for Google too. The issue is not compliance with a robots.txt, which is guaranteed; the issue is compliance with a TOS they must read before indexing my site...! I allow indexing, but under some Terms of Usage (should I say Terms of Indexing instead?); they need to comply, so they need to read the terms and agree. If they do not agree, they can't index my site. How that happens technically I don't care; it's not my problem (e.g. they can put sites having such a TOS (or TOI) on a blacklist).

Google can play on the SEO-ill society. It seems as if the Web needs Google more than Google needs the Web. Since the need appears asymmetrical, Google (arrogantly) lets you think they provide a Service which is vital to you. Wipe the slate clean of this belief, please. Of course an index for the Web is really useful, but no one says it's vital and, most importantly, no one says it must be Google-like (e.g. even though Yahoo! and Google may be tied in some way, their TOS are different; I have not found anything as annoying as Google's section five in the Yahoo! TOS).

They (Google is not alone) are commercializing the Web, transforming it into a showcase from which they can grab money and impose their conditions on us... It's important that we realize that they need us, while the opposite is not so true (believe it or not)! So let us demand something from them! Do they need money to run a search engine? Sure, and surely they earn enough to let me use automated means instead of their interfaces.

Can this reasoning be applied even to a commercial site? Yes and no; it depends. I am not interested in the pacts a commercial entity can forge with search engines in order to gain pole position and good rankings. I am talking about the common wo/man, sitting on a chair using his/her own computer, building his/her own site and/or running his/her own Wiki. About his/her freedom, his/her right to dictate Terms of Indexing, Terms of Usage, or whatever else, and so to allow e.g. indexing if and only if the right to exploit the index by any means is guaranteed, without the need to explicitly and directly inform anyone (like Google) about that TOS/TOI (it's enough to put a small link at the bottom of the page!).

We must stop the numerical asymmetry [1] from being used to eat slices of our freedom, transforming us into servants of people who decide (instead of us) what Democracy is and what we can do (to let their business grow, even against our basic interests).

Let us think about an independent, free (as in freedom) search engine, one that won't be poisoned by the money-for-nothing mirage (and chicks for free) and by commercial spam/crap. This is the future. (How far away...? Too far, maybe.)

For the sake of the Nation, this Google must die

For the sake of the Nation, this Google must die, must die, must die, this Google must die.

We can start by reading and learning how to search... Then we can use Google as they want us to use it, and hit our target anyway. After all, why crawl it or use it by automated means? The correct answer is a question: Why not? (Indeed a question is always a wrong answer...)

And we are already halfway to the goal (do not ask what the goal is: just imagine it exists).

Now, let's try something simple like

wget -e robots=off "http://www.google.com/search?q=xmav"

We get a sad and annoying

HTTP request sent, awaiting response... 403 Forbidden
20:17:38 ERROR 403: Forbidden.

(I am almost sure this did not happen in the past; anyway, it happens now.)

Let's try with Yahoo! instead

wget -e robots=off "http://search.yahoo.com/search?p=xmav"

It worked perfectly! Now let's go back to Google... What does "forbidden" mean?! It means that Google checks the HTTP headers: that's the only way it can distinguish between wget and browsers. Maybe it does not analyse them very deeply, though. Maybe it just checks the User-Agent header. So let's play with that one.

I first tried the following

wget -U "Unknown/1.0 (X11; Linux; en-GB) Godo/1.0.0" \
        "http://www.google.com/search?q=xmav"

... And it worked!! The string is modelled on the real User-Agent string Opera sends, so the following will work too:

wget -U "Opera/9.64 (X11; Linux i686; U; en-GB) Presto/2.1.1" \
        "http://www.google.com/search?q=xmav"

It is still not exactly how a browser contacts the server, but clearly Google does not analyze the other headers. So they probably think the vast majority of lusers are able to use wget, but not to add a -U option! (Maybe they are right.) This also means that they just recognize the default wget User-Agent, which suggests they keep a list of known User-Agents that are not allowed to send requests. But we have also shown that if the User-Agent is unknown, they prefer to let it through.

I've also tried -U "", and it worked!!
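The guess above can be sketched as a tiny model of the check the server might be doing. This is only an illustration of the deny-list idea: the function name is_blocked_ua and the entries in the list are my own assumptions, not anything Google has published. Only the default wget string is known (from the experiment) to be refused.

```shell
#!/bin/sh
# Hypothetical model of a server-side User-Agent deny-list.
# The patterns below are guesses for illustration, not Google's real list.
is_blocked_ua() {
    case "$1" in
        Wget/*|curl/*|libwww-perl/*) return 0 ;;  # known automated clients: refused
        *) return 1 ;;                            # anything else, even empty: let through
    esac
}

# Try the three kinds of string used in the experiments.
for ua in "Wget/1.11.4" "Unknown/1.0 (X11; Linux; en-GB) Godo/1.0.0" ""; do
    if is_blocked_ua "$ua"; then
        echo "403 Forbidden for User-Agent '$ua'"
    else
        echo "200 OK for User-Agent '$ua'"
    fi
done
```

If the real check works anything like this, every string that is not on the list gets through, which would match what the -U experiments showed.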

Now let me state Rule Number 1: do not abuse it. If you plan to make several requests, use an option like --wait, or a sleep (preferably random) between one request and the next. It may take longer, but this way you do not risk being discovered (and once that happens, you don't know how Google will behave; they could add more checks and make it harder to exploit the search engine this way; they could ... and so on).
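A random sleep between requests could look roughly like the sketch below. The helper names random_delay and fetch_all, the 5 to 15 second range, the DRY_RUN switch and the output file names are all assumptions made for the example; query terms are used as-is, so they must not contain spaces or other characters that need URL-encoding.

```shell
#!/bin/sh
# Print a random whole number of seconds between $1 and $2 (inclusive).
random_delay() {
    # Two random bytes from /dev/urandom, read as an unsigned integer (0..65535).
    n=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
    echo $(( $1 + n % ($2 - $1 + 1) ))
}

# Hypothetical helper: one request per query term, with a random pause after each.
# With DRY_RUN=1 it only prints what it would do and sends nothing.
fetch_all() {
    for q in "$@"; do
        delay=$(random_delay 5 15)
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "wget -U \"\" -O $q.html \"http://www.google.com/search?q=$q\"; sleep $delay"
        else
            wget -U "" -O "$q.html" "http://www.google.com/search?q=$q"
            sleep "$delay"
        fi
    done
}

# Show the plan without touching the network.
DRY_RUN=1 fetch_all xmav rosetta
```

When several URLs are given to a single wget invocation, wget's own --wait option, together with --random-wait, does a similar job without any scripting.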

We can try something similar with curl, or write a complex Perl script using LWP.
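For curl, a minimal sketch might be the following. curl sets the User-Agent with -A; whether Google treats curl's default string the way it treats wget's is something I have not checked, so take this as an assumption. The function name search_with_curl and the CURL override variable are made up for the example.

```shell
#!/bin/sh
# CURL can be overridden (e.g. CURL=echo) to inspect the command without sending it.
CURL=${CURL:-curl}

# Same Opera-like string used with wget above.
UA="Opera/9.64 (X11; Linux i686; U; en-GB) Presto/2.1.1"

# Hypothetical helper: fetch one results page for the query term given as $1.
search_with_curl() {
    # -s: no progress meter; -A: set the User-Agent header; the page goes to stdout.
    "$CURL" -s -A "$UA" "http://www.google.com/search?q=$1"
}
```

Calling search_with_curl xmav > xmav.html would then save the page, and the same Rule Number 1 applies.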

Final note

Times are a-changing. I remember (but memory can fail, of course) that I was very enthusiastic about Google when I started to learn the basics of web scraping, since I noticed that their search results contained HTML comments that made them simpler to parse (even without using an HTML parser or similar). That is no longer so.

I think Google is going too far into business, and in doing so it is becoming (or has already become) a search engine like the others (say, like Yahoo!, which has a TOS that sounds a little bit better). What are its distinguishing features, and why is it the search engine most used by the masses?

Why should it be considered, or used, as if it were special? It's just a tool. And when a tool does not fit a need, we use another tool, and we may discover that after all we don't miss the previous one... Maybe. Anyway, there's no reason to give up Google if we think it does something better (for images it seems to me better than other search engines). But we should find a way to stop policies we have the duty to condemn.


[1] They often say that something is impossible because, while we (as individuals) are too many, they (as holdings, companies, firms and so on) are few. So it is easy for us to check their TOS or whatever, while it would be hard for them to track all our TOS or whatever. (Replace TOS with whatever you want to make the argument more general.) This is why we are often forced to inform them about something actively, while they are only required to inform us passively. This explanation would be fine if we were fifty or more years in the past. Nowadays a lot of things can easily be made symmetrical. But still they keep saying it's impossible; and the reason is that our world would be a step closer to Democracy if the Symmetry were less broken (broken less often). (Democracy has almost nothing to do with what is simply called democracy: the latter is just a rough approximation, and its degree of roughness makes a lot of "things" in our "democratic" world far less Democratic than "they" want us to believe.)

