Web-scraping software comparison

This article provides a basic feature comparison for several types of web scraping software. Additional feature details are available from the individual products' websites and/or articles. This article is not all-inclusive or necessarily up to date.

The comparisons are made on the stable versions of software – not the upcoming versions or beta releases – and without the use of any add-ons, extensions or external programs (unless specified in footnotes).

Prices

The following companies are listed with their pricing and trial details.

Cross-platform

The following software packages have their operating system compatibility listed.

Features & Capabilities

The following table highlights several key features that are available in web-scraping software packages. Definitions of the features and capabilities are listed below the table.

Features and Capabilities Definitions

* RSS Feed: The program can place the output into an RSS Feed.

* Interact w/ Database: The program can read inputs/write outputs into a Database.

* Extract Links: The program can collect the links it finds on a page as it looks for data.

* Write to XML: The program can write data into XML format.

* Download Files: The program will download files as well as scrape data from a web page.

* Anonymous Proxies: Sites sometimes block scrapers by blocking the IP addresses from which the user is scraping. Anonymous proxies let the user keep scraping the site by sending requests from different IP addresses.

* Export to spreadsheet: This indicates that the program can export the data scraped into a spreadsheet.

* Built-in Timer: When there is a built-in timer, the user can more easily scrape at a desired set time.

* Extract Table: The program can extract the data from a table on the scraped web page.

* Traverse pages/Fill forms: The program can move from page to page and fill in forms automatically.

* Standalone web-scraper robots: Once a robot has been created, the program runs the scrape automatically, so the user does not have to do it manually.

* Server: The program can run as a server, much as a database server does. Programs written by the user can then invoke it to scrape the data they need.

* Custom parsing: Uses customizable delimiters for parsing rather than a structure such as the DOM. This gives greater versatility at the cost of greater complexity.

* DOM parsing: Uses objects from the DOM to parse HTML.

* Visual Learning: The program generates web extraction code or rules from a visual demonstration, including a recording interface.

* Multi-thread: The program can scrape web data using multiple threads.
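Several of the definitions above (Extract Links, Extract Table, Custom parsing, Export to spreadsheet, Write to XML) can be illustrated with a minimal Python sketch that uses only the standard library. It is not taken from any product in this comparison: the sample page, the class name LinkAndTableParser and the helper functions are all hypothetical, and the tag-aware parser shown is a simplified stand-in for DOM parsing rather than a full DOM implementation.

```python
import csv
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample page, invented for this illustration.
SAMPLE_HTML = """
<html><body>
<a href="/page2.html">Next</a>
<table>
<tr><th>Product</th><th>Price</th></tr>
<tr><td>Widget</td><td>9.99</td></tr>
<tr><td>Gadget</td><td>19.50</td></tr>
</table>
<span>Stock: [[42]]</span>
</body></html>
"""


class LinkAndTableParser(HTMLParser):
    """Tag-aware parsing: collect link targets and table cells."""

    def __init__(self):
        super().__init__()
        self.links = []        # "Extract Links"
        self.rows = []         # "Extract Table"
        self._row = None       # row currently being read, if any
        self._in_cell = False  # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_between(text, start, end):
    """Custom parsing: return the text between two delimiters, or None."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return text[i:j] if j != -1 else None


def rows_to_csv(rows):
    """Export to spreadsheet: serialize rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def rows_to_xml(rows):
    """Write to XML: the first row supplies the element names."""
    header, *records = rows
    root = ET.Element("rows")
    for record in records:
        item = ET.SubElement(root, "row")
        for name, value in zip(header, record):
            ET.SubElement(item, name.lower()).text = value
    return ET.tostring(root, encoding="unicode")


parser = LinkAndTableParser()
parser.feed(SAMPLE_HTML)
```

After the page is fed in, parser.links holds the link targets and parser.rows holds the table, which rows_to_csv and rows_to_xml can serialize. The call parse_between(SAMPLE_HTML, "[[", "]]") shows the delimiter approach: it needs no knowledge of the page structure, but the user has to choose delimiters that stay stable when the page changes.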

See also

* Screen scraping
* Web scraping

Notes

For software that does not incorporate a timer but can run on Linux, a cron job can run the application on a predefined schedule. Similarly, most Windows scraping programs can be run with command-line options from the Windows Task Scheduler to get a timer effect.
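The scheduling that a timer or an external scheduler performs can also be approximated in a small wrapper script. The following Python sketch is only an illustration under stated assumptions: the function names seconds_until and run_daily are hypothetical, and the job passed in stands for whatever command starts the scraper.

```python
import time
from datetime import datetime, timedelta

# For comparison, a cron entry for a daily 03:00 run would look like
# this (the path is a hypothetical placeholder):
#   0 3 * * * /path/to/scraper


def seconds_until(hour, minute, now):
    """Seconds from `now` until the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # The time has already passed today, so schedule it for tomorrow.
        target += timedelta(days=1)
    return (target - now).total_seconds()


def run_daily(job, hour, minute):
    """Call job() once a day at hour:minute, forever (blocks the caller)."""
    while True:
        time.sleep(seconds_until(hour, minute, datetime.now()))
        job()
```

Unlike cron or the Task Scheduler, a wrapper of this kind has to keep running between scrapes and does not survive a reboot, which is one reason an operating system scheduler is usually the simpler choice.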

References

*(1) http://www.screen-scraper.com/download/choose_version.php (Screen-scraper versions)
*(2) http://www.download.com/Data-Ferret/3000-2650_4-10526777.html?hhTest
*(3) http://www.irobotsoft.com/download.htm
*(4) http://www.newprosoft.com/web-content-extractor.htm
*(5) http://softbytelabs.com/us/products.html
*(6) http://www.tethyssolutions.com/all-automation-software-versions.htm
*(7) http://www.mozenda.com/mozenda-products.php


Wikimedia Foundation. 2010.
