URL normalization

URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.

Search engines employ URL normalization in order to assign importance to web pages and to reduce indexing of duplicate pages. Web crawlers perform URL normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached.

Normalization process

There are several types of normalization that may be performed:
* Converting the scheme and host to lower case. The scheme and host components of the URL are case-insensitive. Most normalizers will convert them to lowercase. Example: HTTP://www.Example.com/ → http://www.example.com/

* Adding a trailing slash. Directories are indicated with a trailing slash, which should be included in URLs. Example: http://www.example.com → http://www.example.com/

* Removing the directory index. Default directory indexes are generally not needed in URLs. Examples: http://www.example.com/default.asp → http://www.example.com/ and http://www.example.com/a/index.html → http://www.example.com/a/

* Converting the entire URL to lower case. Some web servers that run on top of case-insensitive file systems allow URLs to be case-insensitive. URLs from a case-insensitive web server may be converted to lowercase to avoid ambiguity. Example: http://www.example.com/BAR.html → http://www.example.com/bar.html

* Capitalizing letters in escape sequences. All letters within a percent-encoding triplet (e.g., "%3A") are case-insensitive, and should be capitalized. Example: http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b

* Removing the fragment. The fragment component of a URL is usually removed. Example: http://www.example.com/bar.html#section1 → http://www.example.com/bar.html

* Removing the default port. The default port (port 80 for the “http” scheme) may be removed from (or added to) a URL. Example: http://www.example.com:80/bar.html → http://www.example.com/bar.html

* Removing dot-segments. The segments “..” and “.” are usually removed from a URL according to the algorithm described in RFC 3986 (or a similar algorithm). Example: http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html

* Removing “www” as the first domain label. Some websites operate in two Internet domains: one whose least significant label is “www” and another whose name is the result of omitting the least significant label from the name of the first. For example, http://example.com/ and http://www.example.com/ may access the same website. Although many websites redirect the user to the non-www address (or vice versa), some do not. A normalizer may perform extra processing to determine if there is a non-www equivalent and then normalize all URLs to the non-www form. Example: http://www.example.com/ → http://example.com/

* Sorting the variables of active pages. Some active web pages have more than one variable in the URL. A normalizer can extract all the variables with their data, sort them into alphabetical order (by variable name), and reassemble the URL. Example: http://www.example.com/display?lang=en&article=fred → http://www.example.com/display?article=fred&lang=en

* Removing arbitrary querystring variables. An active page may expect certain variables to appear in the querystring; all unexpected variables should be removed. Example: http://www.example.com/display?id=123&fakefoo=fakebar → http://www.example.com/display?id=123

* Removing default querystring variables. A page renders identically whether a variable set to its default value is present in the querystring or not. When a default value appears in the querystring, it should be removed. Example: http://www.example.com/display?id=&sort=ascending → http://www.example.com/display

* Removing the "?" when the querystring is empty. When the querystring is empty, there is no need for the "?". Example: http://www.example.com/display? → http://www.example.com/display
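
Several of the steps above can be applied without any knowledge of the web server. The following is a minimal Python sketch of those steps only, not a definitive implementation. The names normalize, upper_escapes, remove_dot_segments and DEFAULT_PORTS are hypothetical, the dot-segment routine is a simplified stand-in for the RFC 3986 algorithm, and the code assumes URLs without user information or IPv6 host literals.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed table of default ports; extend as needed.
DEFAULT_PORTS = {"http": 80, "https": 443}


def upper_escapes(text):
    """Capitalize the hex digits of percent-encoded triplets (%c2 -> %C2)."""
    return re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), text)


def remove_dot_segments(path):
    """Drop "." and ".." segments from an absolute path (simplified)."""
    segments = path.split("/")
    output = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            if is_last:
                output.append("")  # "/a/." becomes "/a/"
        elif segment == "..":
            if len(output) > 1:  # never climb above the root
                output.pop()
            if is_last:
                output.append("")  # "/a/b/.." becomes "/a/"
        else:
            output.append(segment)
    return "/".join(output)


def normalize(url, sort_query=False):
    """Apply the server-independent normalization steps to one URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # An empty path becomes "/" (trailing slash on the bare host).
    path = remove_dot_segments(upper_escapes(parts.path)) or "/"
    query = upper_escapes(parts.query)
    if sort_query and query:
        pairs = query.split("&")
        # Stable sort by variable name; values are left untouched.
        query = "&".join(sorted(pairs, key=lambda pair: pair.split("=", 1)[0]))
    # Passing "" as the fragment drops it; an empty query drops the "?".
    return urlunsplit((scheme, netloc, path, query, ""))


print(normalize("HTTP://www.Example.com:80/../a/b/../c/./d.html#top"))
# http://www.example.com/a/c/d.html
```

The remaining steps (removing “www”, removing directory indexes, lowercasing the whole URL, and removing unexpected or default variables) are left out on purpose, because they are only safe when the behaviour of the particular website is known.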

Normalization based on URL lists

Some normalization rules may be developed for specific websites by examining URL lists obtained from previous crawls or web server logs. For example, if the URL http://foo.org/story?id=xyz appears in a crawl log several times along with http://foo.org/story_xyz, we may assume that the two URLs are equivalent and can be normalized to one of the URL forms.
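
Once such an equivalence has been observed, it can be written down as a site-specific rewrite rule. The sketch below is illustrative only: the rule table SITE_RULES and the function apply_site_rules are hypothetical names, and the single rule simply encodes the foo.org example above, choosing the story_xyz form as the canonical one.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement template).
# The one rule here rewrites http://foo.org/story?id=X to http://foo.org/story_X.
SITE_RULES = [
    (re.compile(r"^http://foo\.org/story\?id=([^&#]+)$"), r"http://foo.org/story_\1"),
]


def apply_site_rules(url, rules=SITE_RULES):
    """Return the URL rewritten by the first matching rule, or unchanged."""
    for pattern, replacement in rules:
        rewritten, count = pattern.subn(replacement, url)
        if count:
            return rewritten
    return url


print(apply_site_rules("http://foo.org/story?id=xyz"))
# http://foo.org/story_xyz
```

URLs that match no rule pass through unchanged, so a table like this can be applied after the general normalization steps without affecting other sites.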

Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URLs with similar text) rules that can be applied to URL lists. They showed that once the correct DUST rules were found and applied with a canonicalization algorithm, they were able to find up to 68% of the redundant URLs in a URL list.

References

* RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax


See also

*Web crawler
*Uniform Resource Locator


Wikimedia Foundation. 2010.
