Critical Analysis of Web Crawlers’ Algorithms
Minou Parhizkar 0527553
Abstract- A web crawler is a program or automated script which browses the World Wide Web in a methodical, automated manner. The objective of this paper is to make a critical analysis of the algorithms used by web crawlers. It reviews and evaluates the various approaches and methods used by different web search engines to catalog information.
Index Terms-
Web Crawler, Search Engines, WWW, SEO
•I. INTRODUCTION
The software that searches for information and returns sites which provide that information is referred to as a search engine or web crawler. Everyone uses web crawlers – indirectly, at least! Every time you search the Internet using a service such as AltaVista, Excite, or Lycos, you're making use of an index that's based on the output of a web crawler. Web crawlers – also known as spiders, robots, or wanderers – are software programs that automatically traverse the Web. Search engines use crawlers to find what's on the Web; then they construct an index of the pages that were found.
Search engines use spiders to index websites. When you submit your website pages to a search engine by completing their required submission page, the search engine spider will index your entire site. A 'spider' is an automated program that is run by the search engine system. The spider visits a web site, reads the content on the actual site, reads the site's Meta tags, and also follows the links that the site connects to. The spider then returns all that information to a central repository, where the data is indexed. It will visit each link you have on your website and index those sites as well. Some spiders will only index a certain number of pages on your site.
A spider's index is almost like a book: it contains a table of contents, the actual content, and the links and references for all the websites the spider finds during its search, and a spider may index up to a million pages a day.
Example: Google spider
When you ask a search engine to locate information, it is actually searching through the index which it has created and not actually searching the Web. Different search engines produce different rankings because not every search engine uses the same algorithm to search through the indices.
One of the things that a search engine algorithm scans for is the frequency and location of keywords on a web page, but it can also detect artificial keyword stuffing or spamdexing. The algorithms then analyze the way that pages link to other pages on the Web. By checking how pages link to each other, an engine can determine both what a page is about and whether the keywords of the linked pages are similar to the keywords on the original page. Most of the top-ranked search engines are crawler based search engines, while some may be based on human compiled directories. The people behind the search engines want the same thing every webmaster wants – traffic to their site. Since their content is mainly links to other sites, the thing for them to do is to make their search engine bring up the most relevant sites to the search query, and to display the best of these results first. In order to accomplish this, they use a complex set of rules called algorithms. When a search query is submitted to a search engine, sites are determined to be relevant or not relevant to the search query according to these algorithms, and are then ranked with the best matches, as calculated by these algorithms, listed first.
Search engines keep their algorithms secret and change them often in order to prevent webmasters from manipulating their databases and dominating search results. They also want to provide new sites at the top of the search results on a regular basis rather than always having the same old sites show up month after month. An important difference to realize is that search engines and directories are not the same. Search engines use a spider to “crawl” the web and the web sites they find, as well as submitted sites. As they crawl the web, they gather the information that is used by their algorithms in order to rank your site.
This paper aims to critically analyze various search engines, explain how they work, and compare their algorithms.
•II. Working of Web Crawlers – A Detailed Look
Let us now look in more detail at how search engines work. Crawler based search engines are primarily composed of three parts: the spider, the index, and the search software.
A search engine robot's action is called spidering, as it resembles the way a multi-legged spider moves. The spider's job is to go to a web page, read the contents, connect to any other pages on that web site through links, and bring back the information. From one page it will travel to several pages, and this proliferation follows several parallel and nested paths simultaneously. Spiders revisit the site at some interval, maybe a month to a few months, and re-index the pages. This way any changes that may have occurred in your pages can also be reflected in the index. The spiders automatically visit your web pages and create their listings. An important aspect is to study what factors promote a "deep crawl" – the depth to which the spider will go into your website from the page it first visited. Listing (submitting or registering) with a search engine is a step that can accelerate and increase the chances of that engine "spidering" your pages.
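To make the spidering process concrete, the following is a minimal sketch of a breadth-first crawler in Python. The seed URL, depth limit, and politeness delay are illustrative choices rather than parameters of any real engine, and a production spider would additionally honour robots.txt, deduplicate content, and distribute the work across many machines.

```python
import time
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_depth=2, delay=1.0):
    """Breadth-first crawl from a seed URL, returning {url: html}."""
    seen, pages = {seed}, {}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        if depth < max_depth:
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
        time.sleep(delay)  # politeness delay between requests
    return pages

# Example: pages = crawl("http://example.com/", max_depth=1)
```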
The spider's movement across web pages stores those pages in its memory, but the key action is in indexing. The index is a huge database containing all the information brought back by the spider, and it is constantly being updated as the spider collects more information. The entire page is not indexed, and the searching and page-ranking algorithm is applied only to the index that has been created. Most search engines claim that they index the full visible body text of a page. In a subsequent section, we explain the key considerations to ensure that indexing of your web pages improves relevance during search. The combined understanding of the indexing and the page-ranking process will lead to developing the right strategies. The Meta tags 'Description' and 'Keywords' have a vital role as they are indexed in a specific way. Some of the top search engines do not index keywords that they consider spam. They will also not index certain 'stop words' (commonly used words such as 'a', 'the' or 'of') so as to save space or speed up the process. Images are obviously not indexed, but image descriptions, Alt text or text within comments are included in the index by some search engines.
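The indexing step itself can be sketched in a few lines: tokenize each page's text, drop stop words, and record every remaining word in an inverted index that maps the word to the pages and positions where it occurs. The tiny stop-word list and regular-expression tokenizer below are deliberate simplifications; real engines also index Meta tags, anchor text, and formatting information.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "the", "of", "and", "to", "in"}  # illustrative subset only

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """pages: {url: text}. Returns {word: {url: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            if word not in STOP_WORDS:
                index[word][url].append(position)
    return index

index = build_index({
    "http://example.com/a": "The spider reads the content of the page",
    "http://example.com/b": "Meta tags describe the page content",
})
print(index["content"])   # the word occurs in both example pages
```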
The search engine software or program is the final part. When a person requests a search on a keyword or phrase, the search engine software searches the index for relevant information. The software then provides a report back to the searcher with the most relevant web pages listed first. The algorithm-based processes used to determine ranking of results are discussed in greater detail later.
Directories, by contrast, compile listings of websites into specific industry and subject categories, and they usually carry a short description of the website. Inclusion in directories is a human task and requires submission to the directory producers. Visitors and researchers over the net quite often use these directories to locate relevant sites and information sources. Thus directories assist in structured search. Another important reason is that crawler engines quite often find websites to crawl through their listing and links in directories. Yahoo and The Open Directory are amongst the largest and most well known directories. LookSmart is a directory that provides results to partner sites such as MSN Search, Excite and others. Lycos is an example of a site that pioneered the search engine but shifted to the Directory model depending on AlltheWeb.com for its listings.
Hybrid Search Engines are both crawler based as well as human powered. In plain words, these search engines have two sets of listings based on both the mechanisms mentioned above. The best example of hybrid search engines is Yahoo, which has got a human powered directory as well as a Search toolbar administered by Google. Although, such engines provide both listings they are generally dominated by one of the two mechanisms. Yahoo is known more for its directory rather than crawler based search engine.
Search engines rank web pages according to the software's understanding of the web page's relevancy to the term being searched. To determine relevancy, each search engine follows its own group of rules. The most important rules are:
- The location of keywords on your web page; and
- How often those keywords appear on the page (the frequency).
For example, if the keyword appears in the title of the page, then it would be considered to be far more relevant than the keyword appearing in the text at the bottom of the page. Search engines consider keywords to be more relevant if they appear sooner on the page (like in the headline) rather than later. The idea is that you’ll be putting the most important words – the ones that really have the relevant information – on the page first.
Search engines also consider the frequency with which keywords appear. The frequency is usually determined by how often the keywords are used out of all the words on a page. If the keyword is used 4 times out of 100 words, the frequency would be 4%. Of course, you can now develop the perfect relevant page with one keyword at 100% frequency – just put a single word on the page and make it the title of the page as well. Unfortunately, the search engines don’t make things that simple.
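A toy scoring function along these lines might add a fixed bonus when the keyword appears in the title or early on the page, plus a contribution proportional to its frequency. The weights below are arbitrary and only meant to show the shape of such a location-plus-frequency rule, not any engine's actual formula.

```python
def relevance(keyword, title, body, title_weight=3.0, early_weight=1.5):
    """Toy relevance score from keyword location and frequency."""
    keyword = keyword.lower()
    words = body.lower().split()
    if not words:
        return 0.0
    frequency = words.count(keyword) / len(words)   # e.g. 4 of 100 words -> 0.04
    score = 100 * frequency                         # frequency component
    if keyword in title.lower().split():
        score += title_weight                       # keyword appears in the title
    if keyword in words[:20]:
        score += early_weight                       # keyword appears early on the page
    return score

print(relevance("insurance", "Cheap Insurance Quotes",
                "insurance " * 4 + "word " * 96))   # 4% frequency plus location bonuses
```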
While all search engines do follow the same basic rules of relevancy, location and frequency, each search engine has its own special way of determining rankings. To make things more interesting, the search engines change the rules from time to time so that the rankings change even if the web pages have remained the same. One method of determining relevancy used by some search engines (like HotBot and Infoseek), but not others (like Lycos), is Meta tags. Meta tags are hidden HTML codes that provide the search engine spiders with potentially important information like the page description and the page keywords.
Meta tags are often labeled as the secret to getting high rankings, but Meta tags alone will not get you a top 10 ranking. On the other hand, they certainly don't hurt. Detailed information on Meta tags and other ways of improving search engine ranking is given later in this paper.
In the early days of the web, webmasters would repeat a keyword hundreds of times in the Meta tags and then add it hundreds of times to the text on the web page by making it the same color as the background. However, now, major search engines have algorithms that may exclude a page from ranking if it has resorted to “keyword spamming”; in fact some search engines will downgrade ranking in such cases and penalize the page.
Link analysis and ‘clickthrough’ measurement are certain other factors that are “off the page” and yet crucial in the ranking mechanism adopted by some leading search engines. This is quickly emerging as the most important determinant of ranking, but before we study this, we must first look at the most popular search engines and then look at the various steps you can take to improve your success at each of the stages – spidering, indexing and ranking.
For March 2003, according to a study by Jupiter Media Metrix, there were an estimated 114 million Internet users online in the US at work or at home, 80 percent of whom are estimated to have made some type of search request during the month.
•III. A Summarised Comparison of Search Engines
Yahoo!
- has been in the search game for many years;
- is better than MSN but nowhere near as good as Google at determining if a link is a natural citation or not;
- has a ton of internal content and a paid inclusion program, both of which give them an incentive to bias search results toward commercial results;
- things like cheesy off-topic reciprocal links still work great in Yahoo!
MSN Search
- is new to the search game;
- is bad at determining if a link is natural or artificial in nature due to sucking at link analysis;
- places too much weight on the page content;
- their poor relevancy algorithms cause a heavy bias toward commercial results;
- likes bursty recent links; new sites that are generally un-trusted in other systems can rank quickly in MSN Search;
- things like cheesy off-topic reciprocal links still work great in MSN Search.
Google
- has been in the search game a long time, and saw the web graph when it was much cleaner than the current web graph;
- is much better than the other engines at determining if a link is a true editorial citation or an artificial link;
- looks for natural link growth over time;
- heavily biases search results toward informational resources;
- trusts old sites way too much; a page on a site or sub-domain of a site with significant age or link-related trust can rank much better than it should, even with no external citations;
- has aggressive duplicate content filters that filter out many pages with similar content;
- if a page is obviously focused on a term they may filter the document out for that term; on-page variation and link anchor text variation are important; a page with a single reference or a few references of a modifier will frequently outrank pages that are heavily focused on a search phrase containing that modifier;
- crawl depth is determined not only by link quantity, but also by link quality; excessive low quality links may make your site less likely to be crawled deep or even included in the index;
- things like cheesy off-topic reciprocal links are generally ineffective in Google when you consider the associated opportunity cost.
Ask
- looks at topical communities;
- due to their heavy emphasis on topical communities they are slow to rank sites until those sites are heavily cited from within their topical community;
- due to their limited market share they probably are not worth paying much attention to, unless you are in a vertical where they have a strong brand that drives significant search traffic.
•IV. Detailed Analysis of Search Engines
Now that we have understood the basics and working of web crawlers and reviewed a summarised comparison of a few major search engines on the market, we are in a position to analyse and compare them in detail and get into the nitty-gritty technical details. The sections below deal with each of these engines one by one.
•V. Yahoo!
Yahoo! was founded in 1994 by David Filo and Jerry Yang as a directory of websites. For many years they outsourced their search service to other providers, but by the end of 2002 they realized the importance and value of search and started aggressively acquiring search companies.
Overture purchased AllTheWeb and AltaVista. Yahoo! purchased Inktomi (in December 2002) and then consumed Overture (in July of 2003), and combined the technologies from the various search companies they bought to make a new search engine.
•a) On Page Content
Yahoo! offers a paid inclusion program, so when Yahoo! Search users click on high ranked paid inclusion results in the organic search results Yahoo! profits. In part to make it easy for paid inclusion participants to rank, I believe Yahoo! places greater weight on on-the-page content than a search engine like Google does.
Being the #1 content destination site on the web, Yahoo! has a boatload of their own content which they frequently reference in the search results. Since they have so much of their own content and make money from some commercial organic search results it might make sense for them to bias their search results a bit toward commercial websites.
Using descriptive page titles and page content goes a long way in Yahoo!
In my opinion their results seem to be biased more toward commerce than informational sites, when compared with Google.
•b) Crawling
Yahoo! is pretty good at crawling sites deeply so long as they have sufficient link popularity to get all their pages indexed. One note of caution is that Yahoo! may not want to deeply index sites with many variables in the URL string, especially since:
- Yahoo! already has a boatload of their own content they would like to promote (including verticals like Yahoo! Shopping); and
- Yahoo! offers paid inclusion, which can help Yahoo! increase revenue by charging merchants to index some of their deep database contents.
You can use Yahoo! Site Explorer to see how well they are indexing your site and which sites link at your site.
•c) Query Processing
Certain words in a search query are better at defining the goals of the searcher. If you search Yahoo! for something like "how to SEO", many of the top ranked results will have "how to" and "SEO" in the page titles, which might indicate that Yahoo! puts quite a bit of weight even on common words that occur in the search query.
Yahoo! seems to be more about text matching when compared to Google, which seems to be more about concept matching.
•d) Link Reputation
Yahoo! is still fairly easy to manipulate using low- to mid-quality links and somewhat-to-aggressively focused anchor text. Rand Fishkin recently posted about many Technorati pages ranking well for their core terms in Yahoo!. Those pages primarily have the exact same anchor text in almost all of the links pointing at them.
Sites with the trust score of Technorati may be able to get away with more unnatural patterns than most webmasters can, but I have seen sites flamethrown with poorly mixed anchor text on low quality links, only to see the sites rank pretty well in Yahoo! quickly.
•e) Page vs Site
A few years ago at a Search Engine Strategies conference Jon Glick stated that Yahoo! looked at both links to a page and links to a site when determining the relevancy of a page. Pages on newer sites can still rank well even if their associated domain does not have much trust built up yet so long as they have some descriptive inbound links.
•f) Site Age
Yahoo! may place some weight on older sites, but the effect is nowhere near as pronounced as the effect in Google’s SERPs.
It is not unreasonable for new sites to rank in Yahoo! in as little as 2 or 3 months.
•g) Paid Search
Yahoo! prices their ads in an open auction, with the highest bidder ranking the highest. By early 2007 they aim to make Yahoo! Search Marketing more of a closed system which factors clickthrough rate (and other algorithmic factors) into their ad ranking algorithm.
Yahoo! also offers a paid inclusion program which charges a flat rate per click to list your site in Yahoo!’s organic search results.
Yahoo! also offers a contextual ad network. The Yahoo! Publisher program does not have the depth that Google’s ad system has, and they seem to be trying to make up for that by biasing their targeting to more expensive ads, which generally causes their syndicated ads to have a higher click cost but lower average clickthrough rate.
•h) Editorial
Yahoo! has many editorial elements to their search product. When a person pays for Yahoo! Search Submit that content is reviewed to ensure it matches Yahoo!’s quality guidelines. Sites submitted to the Yahoo! Directory are reviewed for quality as well.
In addition to those two forms of paid reviews, Yahoo! also frequently reviews their search results in many industries. For competitive search queries some of the top search results may be hand coded. When I searched for Viagra, for example, the top 5 listings looked useful, and then I had to scroll down to #82 before I found another result that wasn't spammy.
Yahoo! also manually reviews some of the spammy categories somewhat frequently and then reviews other samples of their index. Sometimes you will see a referral like http://corp.yahoo-inc.com/project/health-blogs/keepers if they reviewed your site and rated it well.
Sites which have been editorially reviewed and were of decent quality may be given a small boost in relevancy score. Sites which were reviewed and are of poor quality may be demoted in relevancy or removed from the search index.
Yahoo! has published their content quality guidelines. Some sites that are filtered out of search results by automated algorithms may return if the site cleans up the associated problems, but typically if any engine manually reviews your site and removes it for spamming you have to clean it up and then plead your case.
•i) Social Aspects
Yahoo! firmly believes in the human aspect of search. They paid many millions of dollars to buy Del.icio.us, a social bookmarking site. They also have a similar product native to Yahoo! called My Yahoo!
Yahoo! has also pushed a question answering service called Yahoo! Answers which they heavily promote in their search results and throughout their network. Yahoo! Answers allows anyone to ask or answer questions. Yahoo! is also trying to mix amateur content from Yahoo! Answers with professionally sourced content in verticals such as Yahoo! Tech.
•j) Yahoo! SEO Tools
Yahoo! has a number of useful SEO tools.
- Overture Keyword Selector Tool – shows prior month search volumes across Yahoo! and their search network.
- Overture View Bids Tool – displays the top ads and bid prices by keyword in the Yahoo! Search Marketing ad network.
- Yahoo! Site Explorer – shows which pages Yahoo! has indexed from a site and which pages they know of that link at pages on your site.
- Yahoo! Mindset – shows how Yahoo! can bias search results more toward informational or commercial results.
- Yahoo! Advanced Search Page – makes it easy to look for .edu and .gov backlinks.
- Yahoo! Buzz – shows current popular searches.
•k) Yahoo! Business Perspectives
Being the largest content site on the web makes Yahoo! run into some inefficiency issues due to being a large internal customer. For example, Yahoo! Shopping was a large link buyer for a period of time while Yahoo! Search pushed that they didn’t agree with link buying. Offering paid inclusion and having so much internal content makes it make sense for Yahoo! to have a somewhat commercial bias to their search results.
They believe strongly in the human and social aspects of search, pushing products like Yahoo! Answers and My Yahoo!.
I think Yahoo!’s biggest weakness is the diverse set of things that they do. In many fields they not only have internal customers, but in some fields they have product duplication, like with Yahoo! My Web and Del.icio.us.
•l) Search Marketing Perspective
I believe if you do standard textbook SEO practices and actively build quality links it is reasonable to expect to be able to rank well in Yahoo! within 2 or 3 months. If you are trying to rank for highly spammed keyword phrases keep in mind that the top 5 or so results may be editorially selected, but if you use longer tail search queries or look beyond the top 5 for highly profitable terms you can see that many people are indeed still spamming them to bits.
As Yahoo! pushes more of their vertical offerings it may make sense to give your site and brand additional exposure to Yahoo!’s traffic by doing things like providing a few authoritative answers to topically relevant questions on Yahoo! Answers.
•VI. MSN Search
MSN Search had many incarnations, being powered by the likes of Inktomi and Looksmart for a number of years. After Yahoo! bought Inktomi and Overture it was obvious to Microsoft that they needed to develop their own search product. They launched their technology preview of their search engine around July 1st of 2004. They formally switched from Yahoo! organic search results to their own in house technology on January 31st, 2005.
•a) On Page Content
Using descriptive page titles and page content goes a long way to help you rank in MSN. I have seen examples of many domains that ranked for things like
state name + insurance type + insurance
on sites that were not very authoritative which only had a few instances of state name and insurance as the anchor text. Adding the word health, life, etc. to the page title made the site relevant for those types of insurance, in spite of the site having few authoritative links and no relevant anchor text for those specific niches.
Additionally, internal pages on sites like those can rank well for many relevant queries just by being hyper focused, but MSN currently drives little traffic when compared with the likes of Google.
•b) Crawling
MSN has got better at crawling, but I still think Yahoo! and Google are much better at crawling. It is best to avoid session IDs, sending bots cookies, or using many variables in the URL strings. MSN is nowhere near as comprehensive as Yahoo! or Google at crawling deeply through large sites like eBay.com or Amazon.com.
•c) Query Processing
I believe MSN might be a bit better than Yahoo! at processing queries for meaning instead of taking them quite so literally, but I do not believe they are as good as Google is at it.
While MSN offers a tool that estimates how commercial a page or query is, I think their lack of ability to distinguish quality links from low quality links makes their results exceptionally biased toward commercial results.
•d) Link Reputation
By the time Microsoft got in the search game the web graph was polluted with spammy and bought links. Because of this, and Microsoft’s limited crawling history, they are not as good as the other major search engines at telling the difference between real organic citations and low quality links.
MSN search reacts much more quickly than the other engines at ranking new sites due to link bursts. Sites with relatively few quality links that gain enough descriptive links are able to quickly rank in MSN. I have seen sites rank for one of the top few dozen most expensive phrases on the net in about a week.
•e) Page vs Site
I think all major search engines consider site authority when evaluating individual pages, but with MSN it seems as though you do not need to build as much site authority as you would to rank well in the other engines.
•f) Site Age
Due to MSN’s limited crawling history and the web graph being highly polluted before they got into search they are not as good as the other engines at determining age related trust scores. New sites doing general textbook SEO and acquiring a few descriptive inbound links (perhaps even low quality links) can rank well in MSN within a month.
•g) Paid Search
Microsoft’s paid search product, AdCenter, is the most advanced search ad platform on the web. Like Google, MSN ranks ads based on both max bid price and ad clickthrough rate. In addition to those relevancy factors MSN also allows you to place adjustable bids based on demographic details. For example, a mortgage lead from a wealthy older person might be worth more than an equivalent search from a younger and poorer person.
•h) Editorial
All major search engines have internal relevancy measurement teams. MSN seems to be highly lacking in this department, or they are trying to use the fact that their search results are spammy as a marketing angle.
MSN is running many promotional campaigns to try to get people to try out MSN Search, and in many cases some of the searches they are sending people to have bogus spam or pornography type results in them. A good example of this is when they used Stacy Keibler to market their Celebrity Maps product. As of writing this, their top search result for Stacy Keibler is still pure spam.
Based on MSN’s lack of feedback or concern toward the obvious search spam noted above on a popular search marketing community site I think MSN is trying to automate much of their spam detection, but it is not a topic you see people talk about very often. Here are MSN’s Guidelines for Successful Indexing, but they still have a lot of spam in their search results.
•i) Social Aspects
Microsoft continues to lag in understanding what the web is about. Executives there should read The Cluetrain Manifesto. Twice. Or maybe three times.
They don’t get the web. They are a software company posing as a web company.
They launch many products as though they have the market stranglehold monopolies they once enjoyed, and as though they are not rapidly losing them. Many of Microsoft’s most innovative moves get little coverage because when they launch key products they often launch them without supporting other browsers and trying to lock you into logging in to Microsoft.
•j) MSN SEO Tools
MSN has a wide array of new and interesting search marketing tools. Their biggest limiting factor with them is that they have limited search market share.
Some of the more interesting tools are:
- Keyword Search Funnel Tool – shows terms that people search for before or after they search for a particular keyword.
- Demographic Prediction Tool – predicts the demographics of searchers by keyword or of site visitors by website.
- Online Commercial Intention Detection Tool – estimates the probability of a search query or web page being commercial, informational, or transactional in nature.
- Search Result Clustering Tool – clusters search results based on related topics.
You can view more of their tools under the demo section at Microsoft’s Adlab.
•VII. Google Search
Google sprang out of a Stanford research project to find authoritative link sources on the web. In January of 1996 Larry Page and Sergey Brin began working on BackRub.
After they tried shopping the Google search technology to no avail they decided to set up their own search company. Within a few years of forming the company they won distribution partnerships with AOL and Yahoo! that helped build their brand as the industry leader in search. Traditionally search was viewed as a loss leader.
Google did not have a profitable business model until the third iteration of their popular AdWords advertising program in February of 2002, and was worth over 100 billion dollars by the end of 2005.
•a) On Page Content
If a phrase is obviously targeted (i.e., the exact same phrase appears in most of the following locations: most of your inbound links, your internal links, the start of your page title, the beginning of your first page header, etc.) then Google may filter the document out of the search results for that phrase. Other search engines may have similar algorithms, but if they do, those algorithms are not as sophisticated or aggressively deployed as those used by Google.
Google is scanning millions of books, which should help them create an algorithm that is pretty good at differentiating real text patterns from spammy manipulative text (although I have seen many garbage content cloaked pages ranking well in Google, especially for 3 and 4 word search queries).
You need to write naturally and make your copy look more like a news article than a heavily SEOed page if you want to rank well in Google. Sometimes using fewer occurrences of the phrase you want to rank for will be better than using more.
You also want to sprinkle modifiers and semantically related text in your pages that you want to rank well in Google.
Some of Google’s content filters may look at pages on a page by page basis while others may look across a site or a section of a site to see how similar different pages on the same site are. If many pages are exceptionally similar to content on your own site or content on other sites Google may be less willing to crawl those pages and may throw them into their supplemental index. Pages in the supplemental index rarely rank well, since generally they are trusted far less than pages in the regular search index.
Duplicate content detection is not just based on some magical percentage of similar content on a page, but is based on a variety of factors. Both Bill Slawski and Todd Malicoat offer great posts about duplicate content detection. This shingles PDF explains some duplicate content detection techniques.
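One published family of duplicate-detection techniques, described in the shingles paper referenced above, breaks each document into overlapping word n-grams ('shingles') and compares the resulting sets with the Jaccard coefficient. The 4-word shingle size used below is only an example.

```python
def shingles(text, k=4):
    """Set of overlapping k-word shingles of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Resemblance of two shingle sets: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

doc1 = "search engines use crawlers to find what is on the web"
doc2 = "search engines use spiders to find what is on the web"
print(jaccard(shingles(doc1), shingles(doc2)))  # shared shingles reveal the overlapping passage
```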
•b) Crawling
While Google is more efficient at crawling than competing engines, it appears as though with Google's BigDaddy update they are looking at both inbound and outbound link quality to help set crawl priority, crawl depth, and whether or not a site even gets crawled at all. To quote Matt Cutts:
The sites that fit “no pages in Bigdaddy” criteria were sites where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling.
In the past crawl depth was generally a function of PageRank (PageRank is a measure of link equity – and the more of it you had the better you would get indexed), but now adding in this crawl penalty for having an excessive portion of your inbound or outbound links pointing into low quality parts of the web creates an added cost which makes dealing in spammy low quality links far less appealing for those who want to rank in Google.
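Nobody outside Google knows the actual formula, but the behaviour described above can be illustrated with a toy crawl-priority score that starts from a link-equity estimate such as PageRank and is discounted, or zeroed out entirely, as the share of low-quality inbound or outbound links grows. Every threshold and weight here is invented purely for illustration.

```python
def crawl_priority(pagerank, spammy_inlink_ratio, spammy_outlink_ratio,
                   penalty_threshold=0.5):
    """Toy crawl priority: link equity discounted by low-quality link exposure.

    pagerank:            rough link-equity score for the site (0..1)
    spammy_*_ratio:      fraction of in/out links judged low quality (0..1)
    penalty_threshold:   ratio above which crawling is suppressed entirely
    """
    worst = max(spammy_inlink_ratio, spammy_outlink_ratio)
    if worst >= penalty_threshold:
        return 0.0                      # a "no pages in Bigdaddy" style outcome
    return pagerank * (1.0 - worst)     # otherwise: discounted link equity

print(crawl_priority(0.8, 0.1, 0.2))    # healthy site: crawled deeply
print(crawl_priority(0.8, 0.1, 0.7))    # heavy spam exposure: not crawled
```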
•c) Query Processing
While I mentioned above that Yahoo! seemed to have a bit of a bias toward commercial search results it is also worth noting that Google’s organic search results are heavily biased toward informational websites and web pages.
Google is much better than Yahoo! or MSN at determining the true intent of a query and trying to match that instead of doing direct text matching. Common words like how to may be significantly deweighted compared to other terms in the search query that provide a better discrimination value.
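A standard way to deweight common words such as 'how to' is inverse document frequency: a term that occurs in almost every document has little discrimination value and therefore contributes little to the score. The sketch below computes a smoothed IDF over a toy collection; Google's real term weighting is, of course, not public.

```python
import math

def idf(term, documents):
    """Inverse document frequency: rarer terms get higher weight."""
    containing = sum(1 for doc in documents if term in doc.lower().split())
    return math.log((len(documents) + 1) / (containing + 1))  # smoothed

docs = [
    "how to bake bread at home",
    "how to change a car battery",
    "how to configure a web crawler",
    "pagerank algorithm explained",
]
for term in ("how", "to", "crawler", "pagerank"):
    print(term, round(idf(term, docs), 3))
# "how" and "to" occur in most documents and score low;
# "crawler" and "pagerank" are rare and score much higher.
```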
Google and some of the other major search engines may try to answer many common related questions to the concept being searched for. For example, in a given set of search results you may see any of the following:
- a relevant .gov and/or .edu document
- a recent news article about the topic
- a page from a well known directory such as DMOZ or the Yahoo! Directory
- a page from Wikipedia
- an archived page from an authority site about the topic
- the authoritative document about the history of the field and recent changes
- a smaller hyper-focused authority site on the topic
- a PDF report on the topic
- a relevant Amazon, eBay, or shopping comparison page on the topic
- one of the most well branded and well known niche retailers catering to that market
- product manufacturer or wholesaler sites
- a blog post / review from a popular community or blog site about a slightly broader field
Some of the top results may answer specific relevant queries or be hard to beat, while others might be easy to compete with. You just have to think of how and why each result was chosen to be in the top 10 to learn which one you will be competing against and which ones may perhaps fall away over time.
•d) Link Reputation
PageRank is a weighted measure of link popularity, but Google’s search algorithms have moved far beyond just looking at PageRank.
As mentioned above, gaining an excessive number of low quality links may hurt your ability to get indexed in Google, so stay away from known spammy link exchange hubs and other sources of junk links. I still sometimes get a few junk links, but I make sure that I try to offset any junky link by getting a greater number of good links.
If your site ranks well some garbage automated links will end up linking to you whether you like it or not. Don't worry about those links, just worry about trying to get a few real high quality editorial links.
Google is much better at being able to determine the difference between real editorial citations and low quality, spammy, bought, or artificial links.
When determining link reputation, Google (and other engines) may look at:
- link age
- rate of link acquisition
- anchor text diversity
- deep link ratio
- link source quality (based on who links to them and who else they link at)
- whether links are editorial citations in real content (or whether they are on spammy pages or near other obviously non-editorial links)
- whether anybody actually clicks on the link
It is generally believed that .edu and .gov links are trusted highly in Google because they are generally harder to influence than the average .com link, but keep in mind that there are some junky .edu links too (I have seen stuff like .edu casino link exchange directories).
When getting links for Google it is best to look in virgin lands that have not been combed over heavily by other SEOs. Either get real editorial citations or get citations from quality sites that have not yet been abused by others. Google may strip the ability to pass link authority (even from quality sites) if those sites are known obvious link sellers or other types of link manipulators. Make sure you mix up your anchor text and get some links with semantically related text.
Google likely collects usage data via Google search, Google Analytics, Google AdWords, Google AdSense, Google News, Google Accounts, Google Notebook, Google Calendar, Google Talk, Google's feed reader, Google search history annotations, and Gmail. They also created a Firefox browser bookmark synch tool and an anti-phishing tool which is built into Firefox, and have a relationship with Opera (another web browser company). Most likely they can lay some of this data over the top of the link graph to record a corroborating source of the legitimacy of the linkage data. Other search engines may also look at usage data.
•e) Page vs Site
Sites need to earn a certain amount of trust before they can rank for competitive search queries in Google. If you put up a new page on a new site and expect it to rank right away for competitive terms you are probably going to be disappointed.
If you put that exact same content on an old trusted domain and link to it from another page on that domain it can leverage the domain trust to quickly rank and bypass the concept many people call the Google Sandbox.
Many people have been exploiting this algorithmic hole by throwing up spammy subdomains on free hosting sites or other authoritative sites that allow users to sign up for a cheap or free publishing account. This is polluting Google’s SERPs pretty bad, so they are going to have to make some major changes on this front pretty soon.
•f) Site Age
Google filed a patent about information retrieval based on historical data which stated many of the things they may look for when determining how much to trust a site. Many of the things I mentioned in the link section above are relevant to the site age related trust (ie: to be well trusted due to site age you need to have at least some link trust score and some age score).
I have seen some old sites with exclusively low quality links rank well in Google based primarily on their site age, but if a site is old AND has powerful links it can go a long way to helping you rank just about any page you write (so long as you write it fairly naturally).
Older trusted sites may also be given a pass on many things that would cause newer lesser trusted sites to be demoted or de-indexed.
The Google Sandbox is a concept many SEOs mention frequently. The idea of the 'box is that new sites that should be relevant struggle to rank for some queries they would be expected to rank for. While some people have dismissed the existence of the sandbox as garbage, Google's Matt Cutts said in an interview that they did not intentionally create the sandbox effect, but that it was created as a side effect of their algorithms:
“I think a lot of what’s perceived as the sandbox is artefacts where, in our indexing, some data may take longer to be computed than other data.”
•g) Paid Search
Google AdWords factors both max bid price and clickthrough rate into its ad ranking algorithm. In addition, they automate reviewing landing page quality and use that as another factor in their ad relevancy algorithm to reduce the amount of arbitrage and other noisy signals in the AdWords program.
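The publicly described shape of this ranking multiplies the advertiser's maximum bid by a quality estimate dominated by clickthrough rate, with landing-page quality folded in. The sketch below is a simplified illustration of that idea, not the actual AdWords formula.

```python
def ad_rank(max_bid, clickthrough_rate, landing_page_quality=1.0):
    """Toy ad score: bid weighted by expected clickthrough and page quality."""
    return max_bid * clickthrough_rate * landing_page_quality

ads = [
    ("ad-a", ad_rank(max_bid=2.00, clickthrough_rate=0.010)),
    ("ad-b", ad_rank(max_bid=1.00, clickthrough_rate=0.030)),
    ("ad-c", ad_rank(max_bid=1.50, clickthrough_rate=0.020, landing_page_quality=0.5)),
]
for name, score in sorted(ads, key=lambda pair: pair[1], reverse=True):
    print(name, score)   # a lower bid with a better ad can outrank a higher bid
```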
The Google AdSense program is an extension of Google AdWords which offers a vast ad network across many content websites that distribute contextually relevant Google ads. These ads are sold on a cost per click or flat rate CPM basis.
•h) Editorial
Google is known to be far more aggressive with their filters and algorithms than the other search engines are. They are known to throw the baby out with the bath water quite often. They flat out despise relevancy manipulation, and have shown they are willing to trade some short term relevancy if it guides people along toward making higher quality content.
Short term if your site is filtered out of the results during an update it may be worth looking into common footprints of sites that were hurt in that update, but it is probably not worth changing your site structure and content format over one update if you are creating true value add content that is aimed at your customer base. Sometimes Google goes too far with their filters and then adjusts them back.
Google published their official webmaster guidelines and their thoughts on SEO. Matt Cutts is also known to publish SEO tips on his personal blog. Keep in mind that Matt’s job as Google’s search quality leader may bias his perspective a bit.
Google Sitemaps gives you a bit of useful information from Google about what keywords your site is ranking for and which keywords people are clicking on your listing.
•i) Social Aspects
Google allows people to write notes about different websites they visit using Google Notebook. Google also allows you to mark and share your favorite feeds and posts. Google also lets you flavorize search boxes on your site to be biased towards the topics your website covers.
Google is not as entrenched in the social aspects of search as Yahoo! is, but Google seems to throw out many more small tests hoping that one will perhaps stick. They are trying to make software more collaborative and trying to get people to share things like spreadsheets and calendars, while also integrating chat into email. If they can create a framework where things mesh well they may be able to gain further marketshare by offering free productivity tools.
•j) Google SEO Tools
- Google Sitemaps – helps you determine if Google is having problems indexing your site.
- AdWords Keyword Tool – shows keywords related to an entered keyword, web page, or web site.
- AdWords Traffic Estimator – estimates the bid price required to rank #1 on 85% of Google AdWords ads near searches on Google, and how much traffic an AdWords ad would drive.
- Google Suggest – auto-completes search queries based on the most common searches starting with the characters or words you have entered.
- Google Trends – shows multi-year search trends.
- Google Sets – creates semantically related keyword sets based on keyword(s) you enter.
- Google Zeitgeist – shows quickly rising and falling search queries.
- Google related sites – shows sites that Google thinks are related to your site (related:www.site.com).
- Google related word search – shows terms semantically related to a keyword (~term -term).
•k) Business Perspectives
Google has the largest search distribution, the largest ad network, and by far the most efficient search ad auction. They have aggressively extended their brand and amazing search distribution network through partnerships with small web publishers, traditional media companies, portals like AOL, computer and other hardware manufacturers such as Dell, and popular web browsers such as Firefox and Opera.
I think Google’s biggest strength is also their biggest weakness. With some aspects of business they are exceptionally idealistic. While that may provide them an amazingly cheap marketing vehicle for spreading their messages and core beliefs it could also be part of what unravels Google.
As they throw out bits of their relevancy in an attempt to keep their algorithm hard to manipulate they create holes where competing search businesses can become more efficient.
In the real world there are celebrity endorsements. Google's idealism, associated with their hatred of bought links and other things which act much like online celebrity endorsements, may leave holes in their algorithms, business model, and business philosophy that allow a competitor to sneak in and grab a large segment of the market by making the celebrity endorsement factor part of the way that businesses are marketed.
•VIII. Ask Search
Ask was originally created as Ask Jeeves, and was founded by Garrett Gruener and David Warthen in 1996 and launched in April of 1997. It was a natural query processing engine that used editors to match common search queries, and backfilled the search results via a meta search engine that searched other popular engines.
As the web scaled and other search technologies improved Ask Jeeves tried using other technologies, such as Direct Hit (which roughly based popularity on page views until it was spammed to death), and then in 2001 they acquired Teoma, which is the core search technology they still use today. In March of 2005 InterActive Corp. announced they were buying Ask Jeeves, and by March of 2006 they dumped Jeeves, changing the brand to Ask.
•a) On Page Content
For topics where there is a large community Ask is good at matching concepts and authoritative sources. Where those communities do not exist Ask relies a bit much on the on page content and is pretty susceptible to repetitive keyword dense search spam.
•b) Crawling
Ask is generally slower at crawling new pages and sites than the other major engines are. They also own Bloglines, which gives them incentive to quickly index popular blog content and other rapidly updated content channels.
•c) Query Processing
I believe Ask has a heavy bias toward topical authority sites independent of anchor text or on-the-page content. This has a large effect on the result set they provide for any query, in that it creates a result set that is more conceptually and community oriented than keyword oriented.
•d) Link Reputation
Ask is focused on topical communities using a concept they call Subject-Specific Popularity(SM). This means that if you are entering a saturated or hyper-saturated field, Ask will generally be one of the slowest engines to rank your site, since they will only trust it after many topical authorities have shown they trust it by citing it. Due to their heavy bias toward topical communities, for generic searches they seem to weigh how many quality related citations you have far more heavily than anchor text. For queries where there is not much of a topical community their relevancy algorithms are nowhere near as sharp.
•e) Page vs Site
Pages on a well referenced trusted site tend to rank better than one would expect. For example, I saw some spammy press releases on a popular press release site ranking well for some generic SEO related queries. Presumably many companies link to some of their press release pages and this perhaps helps those types of sites be seen as community hubs.
•f) Site Age
Directly I do not believe it is much of a factor. Indirectly I believe it is important in that it usually takes some finite amount of time to become a site that is approved by your topical peers.
•g) Paid Search
Ask gets most of their paid search ads from Google AdWords. Some ad buyers in verticals where Ask users convert well may also want to buy ads directly from Ask. Ask will only place their internal ads above the Google AdWords ads if they feel the internal ads will bring in more revenue.
•h) Editorial
Ask heavily relies upon topical communities and industry experts to in essence be the editors of their search results. They give an overview of their ExpertRank technology on their web search FAQ page. While they have such limited distribution that few people talk about their search spam policies, they reference a customer feedback form on their editorial guidelines page.
•i) Social Aspects
Ask is a true underdog in the search space. While they offer Bloglines and many of the save a search personalization type features that many other search companies offer they do not have the critical mass of users that some of the other major search companies have.
•j) Ask SEO Tools
Ask search results show related search phrases in the right hand column. Due to the nature of their algorithms Ask is generally not good at offering link citation searches, but recently their Bloglines service has allowed you to look for blog citations by authority, date, or relevance.
•IX. Technical Working of a Search Engine – Taking Google as an Example
•1) Google Architecture Overview
In this section, we will give a high level overview of how the whole system works, as pictured in the figure below. Further sections will discuss the applications and data structures not mentioned in this section. Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux.
In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of “barrels”, creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.
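Given the pairs of docIDs in the links database, PageRank can be computed with the well-known power-iteration method. The sketch below follows the standard formulation with a damping factor of 0.85; the toy link graph and the fixed iteration count are illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: list of (source_docid, target_docid) pairs. Returns {docid: rank}."""
    nodes = {doc for pair in links for doc in pair}
    out_links = {doc: [] for doc in nodes}
    for src, dst in links:
        out_links[src].append(dst)
    rank = {doc: 1.0 / len(nodes) for doc in nodes}
    for _ in range(iterations):
        new_rank = {doc: (1.0 - damping) / len(nodes) for doc in nodes}
        for src in nodes:
            targets = out_links[src] or list(nodes)   # dangling pages spread rank evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

print(pagerank([(1, 2), (2, 3), (3, 1), (4, 3)]))  # page 3 gathers the most rank
```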
The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
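The sorter's job, re-keying hits from docID order into wordID order, is essentially the following transformation, shown here in memory for clarity even though Google performed it in place on disk-resident barrels.

```python
from collections import defaultdict

def invert(forward_index):
    """forward_index: {docid: {wordid: [hit, ...]}} -> {wordid: [(docid, hits)]}."""
    inverted = defaultdict(list)
    for docid in sorted(forward_index):            # barrels are sorted by docID
        for wordid, hits in forward_index[docid].items():
            inverted[wordid].append((docid, hits))
    return {wordid: postings for wordid, postings in sorted(inverted.items())}

forward = {
    1: {10: [0, 7], 42: [3]},       # docID 1 contains wordID 10 at positions 0 and 7
    2: {10: [5]},
}
print(invert(forward))              # {10: [(1, [0, 7]), (2, [5])], 42: [(1, [3])]}
```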
•2) Major Data Structures
Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched at little cost. Although CPUs and bulk input/output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.
•a) BigFiles
BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles allocation and deallocation of file descriptors, since the operating systems do not provide enough for our needs. BigFiles also support rudimentary compression options.
•b) Repository
The repository contains the full HTML of every web page. Each page is compressed using zlib. The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib's speed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib's 3 to 1 compression. In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL, as can be seen in the figure below. The repository requires no other data structures to be used in order to access it. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.
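A minimal sketch of such a record layout is shown below, using zlib and a docID/length/URL prefix as described. The exact byte widths chosen here are assumptions for illustration; the paper does not specify them.

```python
import struct
import zlib

def pack_record(docid, url, html):
    """Prefix: docID (8 bytes), compressed length (4 bytes), URL length (2 bytes)."""
    url_bytes = url.encode("utf-8")
    compressed = zlib.compress(html.encode("utf-8"))
    header = struct.pack(">QIH", docid, len(compressed), len(url_bytes))
    return header + url_bytes + compressed

def unpack_record(record):
    docid, comp_len, url_len = struct.unpack(">QIH", record[:14])
    url = record[14:14 + url_len].decode("utf-8")
    html = zlib.decompress(record[14 + url_len:14 + url_len + comp_len]).decode("utf-8")
    return docid, url, html

record = pack_record(7, "http://example.com/", "<html><body>hello</body></html>")
print(unpack_record(record))
```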
•c) Document Index
The document index keeps information about each document. It is a fixed width ISAM (Index sequential access mode) index, ordered by docID. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title. Otherwise the pointer points into the URLlist which contains just the URL. This design decision was driven by the desire to have a reasonably compact data structure, and the ability to fetch a record in one disk seek during a search.
Additionally, there is a file which is used to convert URLs into docIDs. It is a list of URL checksums with their corresponding docIDs and is sorted by checksum. In order to find the docID of a particular URL, the URL’s checksum is computed and a binary search is performed on the checksums file to find its docID. URLs may be converted into docIDs in batch by doing a merge with this file. This is the technique the URLresolver uses to turn URLs into docIDs. This batch mode of update is crucial because otherwise we must perform one seek for every link which assuming one disk would take more than a month for our 322 million link dataset.
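The URL-to-docID lookup can be sketched as a binary search over a checksum-sorted table; CRC32 stands in below for whatever checksum function Google actually used.

```python
import bisect
import zlib

def build_checksum_table(url_to_docid):
    """Returns a list of (checksum, docid) pairs sorted by checksum."""
    return sorted((zlib.crc32(url.encode("utf-8")), docid)
                  for url, docid in url_to_docid.items())

def lookup_docid(table, url):
    """Binary search the sorted checksum table for the URL's docID."""
    checksum = zlib.crc32(url.encode("utf-8"))
    i = bisect.bisect_left(table, (checksum, -1))
    if i < len(table) and table[i][0] == checksum:
        return table[i][1]
    return None

table = build_checksum_table({"http://a.example/": 1, "http://b.example/": 2})
print(lookup_docid(table, "http://b.example/"))   # -> 2
```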
•d) Lexicon
The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory for a reasonable price. In the current implementation we can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts — a list of the words (concatenated together but separated by nulls) and a hash table of pointers. For various functions, the list of words has some auxiliary information which is beyond the scope of this paper to explain fully.
•e) Hit Lists
A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization — simple encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman coding. In the end we chose a hand optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding. The details of the hits are shown in Figure below.
Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. Plain hits include everything else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document (all positions higher than 4095 are labeled 4096). Font size is represented relative to the rest of the document using three bits (only 7 values are actually used because 111 is the flag that signals a fancy hit). A fancy hit consists of a capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits for position in anchor and 4 bits for a hash of the docID the anchor occurs in. This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docIDhash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font.
The length of a hit list is stored before the hits themselves. To save space, the length of the hit list is combined with the wordID in the forward index and the docID in the inverted index. This limits it to 8 and 5 bits respectively (there are some tricks which allow 8 bits to be borrowed from the wordID). If the length is longer than would fit in that many bits, an escape code is used in those bits, and the next two bytes contain the actual length.
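The two-byte plain-hit layout described above can be reproduced with ordinary bit packing. The field order chosen below (capitalization bit, then font size, then position) is one plausible arrangement, since the paper specifies the field widths but not the exact bit positions.

```python
def encode_plain_hit(capitalized, font_size, position):
    """1 capitalization bit, 3 font-size bits (0-6; 7 marks a fancy hit), 12 position bits."""
    position = min(position, 4095)          # positions above 4095 are clamped
    assert 0 <= font_size <= 6              # 7 is reserved as the fancy-hit flag
    return (int(capitalized) << 15) | (font_size << 12) | position

def decode_plain_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF

hit = encode_plain_hit(capitalized=True, font_size=2, position=431)
print(hex(hit), decode_plain_hit(hit))      # fits in two bytes (0x0000-0xFFFF)
```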
•f) Forward Index
The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordID’s. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID’s with hitlists which correspond to those words. This scheme requires slightly more storage because of duplicated docIDs but the difference is very small for a reasonable number of buckets and saves considerable time and coding complexity in the final indexing phase done by the sorter. Furthermore, instead of storing actual wordID’s, we store each wordID as a relative difference from the minimum wordID that falls into the barrel the wordID is in. This way, we can use just 24 bits for the wordID’s in the unsorted barrels, leaving 8 bits for the hit list length.
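The delta trick for wordIDs in a barrel can be sketched as follows: each wordID is stored as an offset from the barrel's minimum wordID so that it fits in 24 bits, leaving 8 bits for the hit-list length.

```python
def pack_barrel_entry(base_wordid, wordid, hit_count):
    """24-bit wordID delta plus 8-bit hit-list length, packed into 32 bits."""
    delta = wordid - base_wordid
    assert 0 <= delta < (1 << 24) and 0 <= hit_count < (1 << 8)
    return (delta << 8) | hit_count

def unpack_barrel_entry(base_wordid, packed):
    return base_wordid + (packed >> 8), packed & 0xFF

BARREL_BASE = 5_000_000                      # minimum wordID stored in this barrel
packed = pack_barrel_entry(BARREL_BASE, 5_000_123, hit_count=3)
print(unpack_barrel_entry(BARREL_BASE, packed))   # -> (5000123, 3)
```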
•g)