Google Search Engine Optimisation Search Engines Statistics

On-line Tools for Searching the AOL Data

A number of tools are appearing that let you interrogate the released AOL data:

AOL Search Database (currently under heavy load)

and (IMO) a superior site:

DontDelete (new domain so you may have to access via

I am currently indexing the entire dataset myself (2.1GB of search goodness) and hope to bring you some Irish-related queries from the data in the next day or two.


Michael Duz has created this tool for calculating the increase in traffic from higher SERP positions which you might find handy.
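I don't know how the tool does its sums internally, but the basic idea can be sketched in a few lines of Python: if you know roughly what share of searchers click each position, the traffic gain from moving up is just the ratio of the two click-through rates. The function names and the click-through figures below are my own placeholders for illustration, not measured values and not taken from the tool.

```python
# PLACEHOLDER click-through rates per SERP position (illustrative only,
# not measured data). Swap in figures from your own analytics.
PLACEHOLDER_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.05}


def traffic_multiplier(ctr_by_position, current, target):
    """Return how many times more clicks `target` gets than `current`."""
    return ctr_by_position[target] / ctr_by_position[current]


def estimated_visits(monthly_searches, ctr_by_position, position):
    """Estimate monthly visits for a keyword at a given position."""
    return monthly_searches * ctr_by_position[position]


# Example: a keyword searched 1,000 times a month, moving from 4th to 1st.
before = estimated_visits(1000, PLACEHOLDER_CTR, 4)
after = estimated_visits(1000, PLACEHOLDER_CTR, 1)
gain = traffic_multiplier(PLACEHOLDER_CTR, current=4, target=1)
print(f"{before:.0f} -> {after:.0f} visits (x{gain:.1f})")
```

The result is only as good as the click-through table you feed it, so treat the output as a rough estimate.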


AOL Release (and quickly remove) Search Records of 0.5m Users

[EDIT] You can find some mined gems from this data over at the plentyoffish blog (and while you're there, learn about a guy who makes >$10k PER DAY from Adsense on his free dating site).

According to this post, AOL released, and very, very promptly removed, the entire search records of 500,000 users collected over a three-month period.

Apart from the obvious privacy concerns (most likely the reason for the removal), this data represents a unique opportunity to research what people search for and the iterative approach they take within their searches. You can see the initial search patterns people use and how they refine those search patterns to find the results they want.
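To give a flavour of what that kind of research might look like, here is a minimal Python sketch that groups each user's queries into "refinement chains". It assumes you already have rows of (user ID, query, timestamp); the function name, the made-up sample rows and the 30-minute cut-off for starting a new chain are all my own assumptions, not anything defined by the data set.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def refinement_chains(rows, max_gap=timedelta(minutes=30)):
    """Group each user's queries into chains of successive refinements.

    rows: iterable of (user_id, query, timestamp) tuples.
    A new chain starts when the gap between two queries exceeds max_gap
    (an assumed heuristic). Immediate repeats of a query are collapsed.
    """
    by_user = defaultdict(list)
    for user_id, query, ts in rows:
        by_user[user_id].append((ts, query))

    chains = []
    for user_id, items in by_user.items():
        items.sort()  # chronological order per user
        chain, last_ts = [], None
        for ts, query in items:
            if last_ts is not None and ts - last_ts > max_gap:
                chains.append((user_id, chain))
                chain = []
            if not chain or chain[-1] != query:
                chain.append(query)
            last_ts = ts
        if chain:
            chains.append((user_id, chain))
    return chains


# Made-up sample rows, purely for illustration.
rows = [
    ("u1", "cheap flights", datetime(2006, 3, 1, 10, 0)),
    ("u1", "cheap flights dublin", datetime(2006, 3, 1, 10, 2)),
    ("u1", "cheap flights dublin", datetime(2006, 3, 1, 10, 3)),
    ("u1", "weather", datetime(2006, 3, 1, 15, 0)),
]
for user, chain in refinement_chains(rows):
    print(user, " -> ".join(chain))
```

Collapsing immediate repeats matters because the same query can appear on several consecutive lines, for example when a user clicks more than one result.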

It is also interesting because, to the best of my knowledge, AOL Search repackages Google’s search results, so in essence this is really Google data (Google also recently announced its intention to release 30GB of word/phrase data).

From the ReadMe.txt :

This collection consists of ~20M web queries collected from ~650k users over three months.

The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
AnonID – an anonymous user ID number.
Query – the query issued by the user, case shifted with
most punctuation removed.
QueryTime – the time at which the query was submitted for search.
ItemRank – if the user clicked on a search result, the rank of the
item on which they clicked is listed.
ClickURL – if the user clicked on a search result, the domain portion of
the URL in the clicked result is listed.

Each line in the data represents one of two types of events:
1. A query that was NOT followed by the user clicking on a result item.
2. A click through on an item in the result list returned from a query.

Normalized queries:
36,389,567 lines of data
21,011,340 instances of new queries (w/ or w/o click-through)
7,887,022 requests for “next page” of results
19,442,629 user click-through events
16,946,938 queries w/o user click-through
10,154,742 unique (normalized) queries
657,426 unique user ID’s
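If you want to poke at the files yourself, a minimal Python sketch for reading the five fields and telling the two event types apart might look like this. I am assuming tab-separated lines with a header row and empty ItemRank/ClickURL fields on query-only lines; the ReadMe excerpt above does not spell out the delimiter, and the `Event` and `read_events` names and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter
from typing import Iterator, NamedTuple, Optional


class Event(NamedTuple):
    anon_id: str
    query: str
    query_time: str
    item_rank: Optional[int]  # None when no result was clicked
    click_url: Optional[str]  # None when no result was clicked

    @property
    def is_click(self) -> bool:
        return self.item_rank is not None


def read_events(lines) -> Iterator[Event]:
    """Yield one Event per data line, skipping a header row if present."""
    # ASSUMPTION: fields are tab-separated and unquoted.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0] == "AnonID":
            continue
        # Pad short rows in case query-only lines omit the last two fields.
        row = (row + [""] * 5)[:5]
        anon_id, query, query_time, rank, url = row
        yield Event(anon_id, query, query_time,
                    int(rank) if rank else None,
                    url or None)


# Made-up sample lines in the assumed layout.
sample = io.StringIO(
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1\texample query\t2006-03-01 10:00:00\t\t\n"
    "1\texample query refined\t2006-03-01 10:01:00\t2\thttp://www.example.com\n"
)
events = list(read_events(sample))
counts = Counter("click" if e.is_click else "no_click" for e in events)
print(counts)
```

As a sanity check on this two-way split, the ReadMe's own totals agree with it: 19,442,629 click-through events plus 16,946,938 queries without a click-through gives the 36,389,567 lines of data.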

And this being the wonderful Internet, the 439MB compressed file is still floating around with the filename AOL-data.tgz. Here are some mirrors I know of:…BB5BE….01.txt.gz….02.txt.gz….03.txt.gz….04.txt.gz….05.txt.gz….06.txt.gz….07.txt.gz….08.txt.gz….09.txt.gz….10.txt.gz


Google Webmaster Central

Google has updated its Sitemaps application and added a number of new features under the umbrella ‘Google Webmaster Central’.


The Sitemaps UI seems to have been cleaned up quite a bit and my first impressions are that it is somewhat easier to navigate and problems/errors are far better highlighted now.

So far I have only found one new feature – ‘Preferred Domain’. This feature allows you to choose whether the www or the non-www version of your domain is displayed in Google’s index.


[Update] I have just noticed that if you click on the ‘Preferred domain’ link, the navigation bar gives you a new option, ‘Crawl rate’. At the moment this option is only available from that particular page (to me anyway) and returns a 404 Not Found error. I expect this option will allow you to tell Google to ease off if Googlebot is too aggressive when crawling your site.


Yet more Video Blogs from Matt Cutts

Matt Cutts has posted two more videos:

1. Session 9: All about datacenters, which has some info about variations across datacenters;
2. Session 10: Lightning Round!, which covers some other bits ‘n’ pieces on how Google scores HTML tags, in particular the bold tag versus strong and the i tag versus em.


More Matt Cutts Video Blogs

Some comments about duplicate content and Google Analytics: Session 7.

And some discussion about algorithms, updates and how Google PageRank is constantly updating (as opposed to Toolbar PageRank, which is updated only every so often): Session 8: Google Terminology

Again one or two interesting tidbits.


Excellent Academic Study of Clickstreams

You can find a very good synopsis of a recent study by Hamburg University into Internet usage habits over at

You should check out the clickthrough heatmap for Google (Figure 5: The Golden Triangle – Eye Tracking on Google Results (Hotchkiss 2005)) which shows the importance of the top 3 positions in Google SERPs.

The synopsis contains some other good reference material also.

If you are interested in the original Hamburg University study (not a bad read if you have the time and interest) you can view it here.


Matt Cutts Video Blog

Some people might be interested in Matt Cutts’ latest comments on Google and SEO.

There’s nothing too exciting here, but if you are doing your own SEO and want to learn a little about how Google crawls and indexes your site, he gives some good information and tips.

You can find the videos here:

qualities of a good site
some SEO myths
Optimize for Search Engines or for Users
Session 4
Session 5
Session 6: All about Supplemental Results

If you’re interested in Google (and I think you should be if you optimise for the Irish market) then you might gain one or two good insights from his comments.

Private Meta Search – a good thing for SEO?

From the IXQuick press release:

As personal privacy concerns create growing alarm about the freedom of the Internet, the Ixquick metasearch engine has taken a pioneering step: starting today, Ixquick will permanently delete all personal search details gleaned from its users from the log files.

This may be a welcome development for searchers but doesn’t bode well for SE techniques such as geo-targeting results.

Services such as IXQuick also make SEO and SEM more difficult, again because (I presume) the SEs cannot geo-track the searcher:

Ixquick’s Meta Search feature enables the user to simultaneously search 12 of the best search engines. However, Ixquick does not share the user’s personal data with these individual search engines in any circumstances.

I sent IXQuick some feedback requesting info on what data is stripped from the request and how geo-targeting is handled.

If this type of meta search engine really takes off I think it will make SEO and SEM that bit more difficult.