Search engine technology

Modern web search engines are complex software systems built on technology that has evolved over many years. The largest search engines, such as Google and Yahoo!, utilize tens or hundreds of thousands of computers to process billions of web pages and return results for thousands of searches per second. The high volume of queries and text processing requires the software to run in a highly distributed environment with a high degree of redundancy. Modern search engines have the following main components:

Crawl

The first step in preparing web pages for search is to find and index them. In the past, search engines started with a small list of URLs known as a seed list, fetched the content, parsed those pages for links, fetched the web pages pointed to by those links, which in turn yielded new links, and the cycle continued until enough pages were found. Most modern search engines instead use a continuous crawl method rather than discovery from a seed list. The continuous crawl method is an extension of the discovery method, but there is no seed list because the crawl never stops: the current list of pages is visited at regular intervals, and new pages are found when links are added to or removed from those pages. Many search engines use sophisticated scheduling algorithms to decide when to revisit a particular page. These algorithms range from a constant visit interval with higher priority for more frequently changing pages to an adaptive visit interval based on several criteria, such as frequency of change, popularity and overall quality of the site, speed of the web server serving the page, and resource constraints such as the amount of hardware and the bandwidth of the Internet connection. Search engines crawl many more pages than they make available for searching, because crawlers find large numbers of duplicate pages on the web and many pages contain no useful content; duplicate and useless content often represents more than half the pages available for indexing.
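As an illustration of the adaptive visit-interval approach, the following Python sketch shrinks a page's revisit interval when the page has changed since the last fetch and grows it when it has not. The multiplicative factors, bounds, and default interval here are illustrative assumptions, not any particular engine's policy.

    import heapq
    import time

    class RevisitScheduler:
        """Toy adaptive revisit scheduler: pages that change often are
        revisited sooner. All constants are illustrative."""

        MIN_INTERVAL = 3600          # at most one visit per hour
        MAX_INTERVAL = 30 * 86400    # at least one visit per 30 days

        def __init__(self):
            self.queue = []      # min-heap of (next_visit_time, url)
            self.interval = {}   # current revisit interval per URL

        def add(self, url, interval=86400):
            self.interval[url] = interval
            heapq.heappush(self.queue, (time.time() + interval, url))

        def record_visit(self, url, changed):
            # Multiplicative decrease when the page changed, increase
            # when it did not: a common adaptive-scheduling heuristic.
            factor = 0.5 if changed else 2.0
            nxt = min(max(self.interval[url] * factor, self.MIN_INTERVAL),
                      self.MAX_INTERVAL)
            self.interval[url] = nxt
            heapq.heappush(self.queue, (time.time() + nxt, url))

        def next_due(self):
            # Pop the next URL whose revisit time has arrived, if any.
            if self.queue and self.queue[0][0] <= time.time():
                return heapq.heappop(self.queue)[1]
            return None

A production scheduler would also weigh page popularity, site quality, server speed, and crawl-budget constraints, as described above.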

Link Map

Pages discovered by crawlers are fed into an (often distributed) service that creates a link map of the pages. A link map is a graph structure in which pages are represented as nodes connected by the links among them. This data is stored in data structures that allow fast access by algorithms which compute a popularity score for pages on the web, essentially based on how many links point to a page and the quality of those links. One such algorithm, PageRank, proposed by Google founders Larry Page and Sergey Brin, is well known and has attracted a lot of attention. The idea of using link analysis to compute a popularity rank predates PageRank, and many variants of the same idea are currently in use. These ideas fall into three main categories: rank of individual pages, rank of web sites, and the nature of web site content (Jon Kleinberg's HITS algorithm). Search engines often differentiate between internal and external links, on the assumption that links on a page pointing to other pages on the same site are less valuable, because they are often created by web site owners to artificially inflate the rank of their own sites and pages. Link map data structures typically also store the anchor text embedded in links, because anchor text often provides a high-quality short summary of a web page's content.
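Since PageRank is named above, here is a minimal power-iteration sketch of it in Python over a small link map. The damping factor of 0.85 is the conventional value; the dangling-page handling and fixed iteration count are common simplifications rather than a reproduction of any production system.

    def pagerank(links, damping=0.85, iterations=50):
        """Minimal power-iteration PageRank. `links` maps each page to
        the list of pages it links to; a page with no outgoing links is
        treated as linking to every page (a standard convention)."""
        pages = set(links) | {t for ts in links.values() for t in ts}
        n = len(pages)
        rank = dict.fromkeys(pages, 1.0 / n)
        for _ in range(iterations):
            new = dict.fromkeys(pages, (1.0 - damping) / n)
            for page in pages:
                targets = links.get(page) or pages   # dangling -> all
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            rank = new
        return rank

    # Toy link map: two pages link to A, so A earns the highest rank.
    ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
    print(max(ranks, key=ranks.get))   # -> A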

Index

Indexing is the process of extracting text from web pages, tokenizing it, and then creating an index structure (an inverted index) that can be used to quickly find which pages contain a particular word. Search engines differ considerably in their tokenization process. The issues involved in tokenization include detecting the encoding used for the page; determining the language of the content (some pages use multiple languages); finding word, sentence, and paragraph boundaries; combining multiple adjacent words into one phrase; and normalizing case and stemming words to their roots (lower-casing and stemming apply only to some languages). This phase also decides which sections of a page to index and how much text to index from very large pages (such as technical manuals). Search engines also differ in the document formats they can interpret and extract text from.
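A minimal sketch of the inverted-index idea follows. The toy tokenizer simply lower-cases text and splits on non-alphanumeric characters, standing in for the much richer pipeline (encoding and language detection, phrase formation, stemming) described above.

    import re
    from collections import defaultdict

    def tokenize(text):
        # Naive tokenizer: lower-case, then split on non-alphanumerics.
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_inverted_index(documents):
        # Map each token to the set of document IDs that contain it.
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for token in tokenize(text):
                index[token].add(doc_id)
        return index

    docs = {1: "Web crawlers fetch pages", 2: "Pages link to other pages"}
    index = build_inverted_index(docs)
    print(sorted(index["pages"]))   # -> [1, 2]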

Some search engines go through the indexing process every few weeks and refresh the complete index used to serve web search requests, while others continuously update small fragments of the index. Before web pages can be indexed, an algorithm decides which node (a server in a distributed service) will index any given page and makes that information available as metadata to other components of the search engine. The index structure is complex and typically employs some compression algorithm. The choice of compression algorithm involves a trade-off between on-disk storage space and the speed of decompression needed to satisfy search requests. The largest search engines use thousands of computers to index pages in parallel.
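To make the storage-versus-decompression-speed trade-off concrete, here is a sketch of variable-byte coding, one classic family of posting-list compression schemes (not claimed to be the scheme any particular engine uses). Document IDs in a posting list are first converted to gaps between successive IDs, which keeps the numbers small and the encoding compact.

    def vbyte_encode(numbers):
        # 7 payload bits per byte; the high bit marks a number's last byte.
        out = bytearray()
        for n in numbers:
            chunk = []
            while True:
                chunk.append(n & 0x7F)
                n >>= 7
                if n == 0:
                    break
            chunk[0] |= 0x80               # flag the low-order (final) byte
            out.extend(reversed(chunk))
        return bytes(out)

    def vbyte_decode(data):
        out, n = [], 0
        for byte in data:
            n = (n << 7) | (byte & 0x7F)
            if byte & 0x80:                # high bit set: number complete
                out.append(n)
                n = 0
        return out

    # Store a posting list as gaps between ascending document IDs.
    postings = [3, 17, 18, 150]
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    encoded = vbyte_encode(gaps)
    assert vbyte_decode(encoded) == gaps
    print(len(encoded), "bytes for", len(postings), "postings")   # 5 bytes

Decoding is a tight byte-at-a-time loop, which is why variable-byte schemes decompress quickly at the cost of a somewhat larger index than bit-level codes.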

See also

*Search engine
*Web crawler
*Search engine indexing

