Nutch

Developer(s): Apache Software Foundation
Stable release: 1.3 (June 7, 2011)
Development status: Active
Written in: Java
Operating system: Cross-platform
Type: Search engine
License: Apache License 2.0
Website: nutch.apache.org

Nutch is an open source web search engine project that uses Lucene Java for its search and index components.

Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multimachine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities were later spun out into their own subproject, called Hadoop.

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.[1]

Advantages [2]

Nutch has several advantages over a simple fetcher:

  • a highly scalable and relatively feature-rich crawler
  • politeness features, such as obeying robots.txt rules
  • robust and scalable - you can run Nutch on a cluster of 100 machines
  • quality - you can bias the crawling to fetch “important” pages first
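
As a rough illustration of two of the points above (politeness and biasing the crawl toward "important" pages), the following Python sketch filters candidate URLs against robots.txt rules and then orders the remainder by score. It is not Nutch code: Nutch is written in Java and its scoring works differently. The function name `build_frontier`, the `example-bot` user agent, the robots.txt content, and the scores are all hypothetical and chosen only for the example.

```python
import heapq
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site (not taken from Nutch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_frontier(scored_urls, robots_txt, user_agent="example-bot"):
    """Return the URLs allowed by robots_txt, highest score first."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Politeness: drop every URL that the robots.txt rules disallow.
    # heapq is a min-heap, so the score is negated to pop the best page first.
    heap = [
        (-score, url)
        for url, score in scored_urls
        if parser.can_fetch(user_agent, url)
    ]
    heapq.heapify(heap)

    ordered = []
    while heap:
        _, url = heapq.heappop(heap)
        ordered.append(url)
    return ordered


# Hypothetical candidate pages with made-up importance scores.
candidates = [
    ("http://example.com/about", 0.2),
    ("http://example.com/private/report", 0.9),
    ("http://example.com/", 1.0),
]
print(build_frontier(candidates, ROBOTS_TXT))
```

In this sketch the `/private/report` page is skipped even though it has a high score, because the robots.txt check runs before any prioritization.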

Scalability

IBM Research studied the performance[3] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.[4] It found that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the Power5.

The ClueWeb09 dataset (used in e.g. TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.[5]

Related projects

  • Hadoop - Java framework that supports distributed applications running on large clusters
  • nutchWAX - Uses Nutch to search a web archive
  • Sixearch - An unstructured peer network application, which provides a complementary way for users to actively and collaboratively share their own document collections.


Wikimedia Foundation. 2010.
