Parts of a Search Engine
While there are different ways to organize web content, every crawling search
engine has the same basic parts:
• a crawler
• an index (or catalog)
• a search interface
Crawler (or Spider)
The crawler does just what its name implies. It scours the web by following links,
refreshing its copies of pages it already knows, and adding new pages when it comes across them. Each search
engine has periods of deep crawling and periods of shallow crawling. There is also
a scheduler mechanism to prevent a spider from overloading servers and to tell the
spider what documents to crawl next and how frequently to crawl them.
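
To make the scheduler's two jobs concrete (spacing out requests to any one server, and deciding what to crawl next and how often), here is a minimal sketch. It is not how any particular search engine works; the class name CrawlScheduler, the 5-second per-host delay, and the per-URL recrawl intervals are all assumptions chosen for illustration.

```python
import heapq
from urllib.parse import urlparse


class CrawlScheduler:
    """Toy crawl frontier: hands out URLs that are due, spacing out hits per host."""

    def __init__(self, min_host_delay=5.0):
        # Minimum seconds between two requests to the same host (assumed value).
        self.min_host_delay = min_host_delay
        self._queue = []        # heap of (next_due_time, url)
        self._interval = {}     # url -> seconds between recrawls
        self._host_ready = {}   # host -> earliest time it may be hit again

    def add(self, url, interval, now=0.0):
        """Register a URL that is first due at `now` and recrawled every `interval` seconds."""
        self._interval[url] = interval
        heapq.heappush(self._queue, (now, url))

    def next_url(self, now):
        """Return a due URL whose host may be contacted at `now`, or None."""
        deferred = []
        chosen = None
        while self._queue and self._queue[0][0] <= now:
            due, url = heapq.heappop(self._queue)
            host = urlparse(url).netloc
            if self._host_ready.get(host, 0.0) <= now:
                chosen = url
                # Block this host for a while, then schedule the next recrawl.
                self._host_ready[host] = now + self.min_host_delay
                heapq.heappush(self._queue, (now + self._interval[url], url))
                break
            # Host was hit too recently: keep the URL for a later call.
            deferred.append((due, url))
        for item in deferred:
            heapq.heappush(self._queue, item)
        return chosen


scheduler = CrawlScheduler(min_host_delay=5.0)
scheduler.add("http://example.com/a", interval=600)
scheduler.add("http://example.com/b", interval=600)
scheduler.add("http://example.org/", interval=86400)

first = scheduler.next_url(now=0.0)    # a page on example.com
second = scheduler.next_url(now=0.0)   # example.com is cooling down, so another host
third = scheduler.next_url(now=1.0)    # nothing may be fetched yet
fourth = scheduler.next_url(now=5.0)   # example.com is allowed again
```

A heap keeps the "what is due next" lookup cheap, and the per-host timestamp is what stops the spider from hammering a single server even when many of its pages are due at once.
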
Rapidly changing or highly important documents are more likely to get crawled
frequently. Crawl frequency typically has little effect on search
relevancy; it simply helps the search engines keep fresh content in their index. The
home page of CNN.com might get crawled once every ten minutes. A popular,
rapidly growing forum might get crawled a few dozen times each day. A static site
with little link popularity and rarely changing content might only get crawled once
or twice a month.
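
One simple way to arrive at crawl rates like these is to adapt the interval to how often a page actually changes. The sketch below is a generic heuristic, not a description of any search engine's real policy: the function name next_interval, the halve-or-double rule, and the ten-minute and thirty-day bounds are assumptions, and it leaves out document importance entirely.

```python
TEN_MINUTES = 10 * 60              # assumed lower bound, in seconds
THIRTY_DAYS = 30 * 24 * 60 * 60    # assumed upper bound, in seconds


def next_interval(current, changed, lo=TEN_MINUTES, hi=THIRTY_DAYS):
    """Return the next recrawl interval in seconds.

    Halve the interval when the page changed since the last visit,
    double it when it did not, and clamp the result to [lo, hi].
    """
    proposed = current / 2 if changed else current * 2
    return max(lo, min(hi, proposed))


# A page that keeps changing drifts toward the ten-minute floor.
busy = 3600
for _ in range(5):
    busy = next_interval(busy, changed=True)

# A static page drifts toward the thirty-day ceiling.
static = 3600
for _ in range(12):
    static = next_interval(static, changed=False)
```

Under these assumptions a busy home page settles at the shortest interval, while a rarely changing site ends up being visited about once a month.
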
The main benefit of having a frequently crawled page is that you can get your new
sites, pages, or projects crawled quickly by linking to them from that powerful or
frequently changing page.
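
The reason this works is that a crawler finds new pages by reading the links on pages it already visits. As a rough sketch of that discovery step, using only Python's standard library: the names LinkExtractor and discover_new_urls and the example URLs are hypothetical, and a real crawler would also handle robots.txt, redirects, and duplicate URL forms.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def discover_new_urls(base_url, html, known_urls):
    """Return links found in `html` that are not yet in `known_urls`, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    unique_links = dict.fromkeys(parser.links)  # drop repeats, keep order
    return [url for url in unique_links if url not in known_urls]


# A frequently crawled page that links to one old page and one new project.
page = '<a href="/new-project">New</a> <a href="http://example.com/old">Old</a>'
known = {"http://example.com/", "http://example.com/old"}
new_urls = discover_new_urls("http://example.com/", page, known)
```

Each time the popular page is recrawled, any link on it that the crawler has not seen before becomes a candidate for the crawl queue, which is why a link from such a page gets new content picked up quickly.
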
