Google Architecture Overview

Monday, January 3, 2011

In this post, I explain how the Google search engine works as a whole. The figure below shows the complete system.

Figure: High-level Google architecture

This is a high-level view of the Google architecture. Several distributed crawlers download pages from the web. The URLserver sends lists of URLs to the crawlers. Each crawler fetches the web pages and sends them to the storeserver, which compresses them and stores them in a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexer reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The indexer then distributes these hits into a set of barrels.
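To make the storage and indexing steps more concrete, here is a minimal Python sketch. It is only an illustration, not Google's implementation: the `Repository` class, the `index_document` function, the use of zlib for compression, and the CRC32-based choice of barrel are all my own assumptions. A hit here is reduced to a word and its position.

```python
import zlib
from collections import defaultdict


class Repository:
    """Toy in-memory stand-in for the storeserver plus repository (hypothetical)."""

    def __init__(self):
        self.url_to_docid = {}  # URL -> docID
        self.pages = {}         # docID -> zlib-compressed page bytes

    def docid_for(self, url):
        # Assign the next free docID the first time a URL is seen.
        if url not in self.url_to_docid:
            self.url_to_docid[url] = len(self.url_to_docid)
        return self.url_to_docid[url]

    def store(self, url, html):
        # Compress the fetched page before storing it.
        docid = self.docid_for(url)
        self.pages[docid] = zlib.compress(html.encode("utf-8"))
        return docid

    def load(self, docid):
        # Uncompress a stored page, as the indexer would before parsing.
        return zlib.decompress(self.pages[docid]).decode("utf-8")


def index_document(docid, text, barrels):
    """Turn a document into (word, position) hits and spread them over barrels."""
    for position, word in enumerate(text.lower().split()):
        # Assumption: pick the barrel from a stable checksum of the word.
        barrel = barrels[zlib.crc32(word.encode("utf-8")) % len(barrels)]
        barrel[docid].append((word, position))


repo = Repository()
docid = repo.store("http://example.com/", "the quick brown fox jumps over the lazy dog")
barrels = [defaultdict(list) for _ in range(4)]
index_document(docid, repo.load(docid), barrels)
total_hits = sum(len(hits) for barrel in barrels for hits in barrel.values())
print(docid, total_hits)
```

Every word occurrence becomes exactly one hit in exactly one barrel, so the total number of hits equals the number of words in the document.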

The indexer also parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
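As a rough illustration of link extraction, the sketch below uses Python's standard `html.parser` module to collect one record per link. The `AnchorExtractor` class and the record layout (source docID, raw href, anchor text) are assumptions made for this example, not the actual format of the anchors file.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (source_docid, href, anchor_text) records from one page (hypothetical)."""

    def __init__(self, source_docid):
        super().__init__()
        self.source_docid = source_docid
        self.records = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            text = " ".join("".join(self._text).split())
            self.records.append((self.source_docid, self._href, text))
            self._href = None


extractor = AnchorExtractor(source_docid=0)
extractor.feed(
    '<p>See <a href="/about.html">About  us</a> and '
    '<a href="http://example.org/">Example</a>.</p>'
)
extractor.close()
anchor_records = extractor.records
print(anchor_records)
```

Note that the href is stored exactly as it appears in the page, so it may still be a relative URL at this stage.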

The URLresolver reads the anchors file and converts relative URLs into absolute URLs, then into docIDs. It also generates a database of links, which is used to compute the PageRank of every document.
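The two jobs described above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: `resolve_anchors` and `pagerank` are hypothetical names, anchor records are (source docID, href, text) tuples, and the links database is just a list of (source docID, target docID) pairs. The rank update follows the formula given in the referenced paper, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), with damping factor d = 0.85 and a fixed number of iterations.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, docid_to_url, url_to_docid):
    """Convert each anchor's href to an absolute URL, then to a docID pair."""
    links = []
    for source_docid, href, _text in anchors:
        absolute = urljoin(docid_to_url[source_docid], href)
        if absolute not in url_to_docid:
            # Assumption: unseen URLs simply get the next free docID.
            url_to_docid[absolute] = len(url_to_docid)
        links.append((source_docid, url_to_docid[absolute]))
    return links


def pagerank(links, num_docs, d=0.85, iterations=20):
    """Simple iterative PageRank over (source, target) docID pairs."""
    out_degree = [0] * num_docs
    for source, _target in links:
        out_degree[source] += 1
    ranks = [1.0] * num_docs
    for _ in range(iterations):
        incoming = [0.0] * num_docs
        for source, target in links:
            incoming[target] += ranks[source] / out_degree[source]
        ranks = [(1 - d) + d * total for total in incoming]
    return ranks


docid_to_url = {0: "http://example.com/docs/index.html"}
url_to_docid = {"http://example.com/docs/index.html": 0}
anchors = [
    (0, "../about.html", "About us"),
    (0, "http://example.org/", "Example"),
    (0, "/about.html", "About"),
]
links = resolve_anchors(anchors, docid_to_url, url_to_docid)
ranks = pagerank(links, num_docs=len(url_to_docid))
print(links, ranks)
```

Both `../about.html` and `/about.html` resolve to the same absolute URL, so they map to the same docID. In this toy run the page that receives two links ends up with the highest rank, and the page with no incoming links keeps the minimum value of 1 - d.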

The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to create the inverted index.
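The re-sorting step can be pictured with a small Python sketch. The data layout is an assumption for illustration only: a forward barrel is a list of (docID, wordID, position) hits, and the inverted index maps each wordID to its list of (docID, position) entries. The function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(forward_barrel):
    """Re-sort docID-ordered hits by wordID and group them per word."""
    inverted = defaultdict(list)
    # Sort by wordID first, then by docID and position within each word.
    for doc_id, word_id, position in sorted(
        forward_barrel, key=lambda hit: (hit[1], hit[0], hit[2])
    ):
        inverted[word_id].append((doc_id, position))
    return dict(inverted)


# Hits as the indexer might leave them: grouped by docID.
forward_barrel = [(0, 7, 0), (0, 3, 1), (1, 3, 0), (1, 9, 1), (2, 7, 4)]
inverted = build_inverted_index(forward_barrel)
print(inverted)
```

After sorting, all occurrences of one word sit next to each other, which is what lets a query look up a word once and read off every document that contains it.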

The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list, together with the lexicon produced by the indexer, and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon, together with the inverted index and the PageRanks, to answer queries.
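Here is a hedged Python sketch of how these last pieces might fit together. Everything in it is an assumption for illustration: `dump_lexicon` and `search` are hypothetical functions, the indexer's lexicon is a word-to-wordID dictionary, the inverted index is a list indexed by offset, and results are ordered by PageRank alone, which is a much simpler ranking than a real searcher would use.

```python
def dump_lexicon(word_to_wordid, wordid_offsets):
    """Combine the indexer's lexicon with the sorter's (wordID, offset) list."""
    offsets = dict(wordid_offsets)
    return {
        word: offsets[word_id]
        for word, word_id in word_to_wordid.items()
        if word_id in offsets
    }


def search(query, lexicon, postings, pageranks):
    """Return docIDs containing every query word, highest PageRank first."""
    doc_sets = []
    for word in query.lower().split():
        offset = lexicon.get(word)
        if offset is None:
            return []  # An unknown word means no document can match.
        doc_sets.append(set(postings[offset]))
    if not doc_sets:
        return []
    matches = set.intersection(*doc_sets)
    return sorted(matches, key=lambda doc: (-pageranks.get(doc, 0.0), doc))


word_to_wordid = {"google": 3, "search": 7, "engine": 9}
wordid_offsets = [(3, 0), (7, 1), (9, 2)]
postings = [[0, 1], [0, 2], [1]]  # postings[offset] -> docIDs
pageranks = {0: 0.15, 1: 0.235, 2: 0.1925}

lexicon = dump_lexicon(word_to_wordid, wordid_offsets)
print(search("google", lexicon, postings, pageranks))
print(search("google search", lexicon, postings, pageranks))
```

The new lexicon lets the searcher go straight from a query word to its place in the inverted index without needing the wordID in between.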

Reference: The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page

