Website Parse Template

Infobox file format
name = Website Parse Template
icon =

extension = .icdl
mime =
type code =
uniform type =
magic =
owner = [http://www.omfica.org/ OMFICA]
genre = Website Parse Template
container for = ICDL Crawling
extended from = XML
extended to =
standard =
url = [http://www.omfica.org/npo_website_template.php WPT]

Website Parse Template (WPT) is an XML-based open format that describes the HTML structure of website pages. The WPT format allows web crawlers to generate Semantic Web RDF descriptions for web pages. WPT is compatible with existing Semantic Web concepts defined by the W3C (RDF and OWL) and with the UNL specifications.

WPT Syntax

A Website Parse Template consists of the following sections:

* "Ontology", where the publisher defines the concepts and relations used in the website.
* "Templates", where the publisher provides templates for groups of web pages that are similar in content category and structure. The publisher gives the XPath or TagIDs of the HTML elements and links them to concepts in the website Ontology.
* "URLs", where the publisher provides URL patterns that collect a group of web pages and link them to a "Parse Template". In the URLs section the publisher can also separate out part of a URL as a concept and link it to the website Ontology.

A Website Parse Template begins with an opening <icdl> tag and ends with a closing </icdl> tag. A single Website Parse Template refers to one host, while a single host may have several Website Parse Templates describing its HTML structure. The host must be specified at the beginning of the Website Parse Template, in the <icdl> tag:

. . . . . . . . . . . . . . . . . . .
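
The overall layout described above can be checked mechanically. The following is a minimal sketch in Python, not the official WPT syntax: the element names (icdl, ontology, template, urls) come from this article, while the attribute names (host, name, language, template), the sample host and the function summarize_wpt are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical WPT document. Element names follow the article; the
# attribute names and all values are assumptions for this sketch.
WPT_SOURCE = """
<icdl host="http://www.example.com">
  <ontology name="music" language="icdl:ontology"/>
  <template name="artist_page" language="icdl:template"/>
  <urls name="artist_urls" template="artist_page"/>
</icdl>
"""


def summarize_wpt(source: str) -> dict:
    """Return the host and the names of the sections found in a WPT document."""
    root = ET.fromstring(source.strip())
    if root.tag != "icdl":
        raise ValueError("a WPT document must be enclosed in an <icdl> tag")
    host = root.get("host")
    if not host:
        raise ValueError("the <icdl> tag must specify a host")
    # Group the names of the direct children by section type.
    sections = {"ontology": [], "template": [], "urls": []}
    for child in root:
        if child.tag in sections:
            sections[child.tag].append(child.get("name"))
    return {"host": host, "sections": sections}


summary = summarize_wpt(WPT_SOURCE)
print(summary["host"])
print(summary["sections"])
```

A document whose root tag lacks the (assumed) host attribute is rejected, which mirrors the requirement that the host be given in the opening tag.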

WPT Ontology

The Ontology section enumerates and defines all concepts used in the website. The listed concepts must be enclosed within <ontology> tags. The ontology name (any reasonable string) must be specified, along with the supported language ("icdl:ontology", "owl" or "") used to specify the concepts.

Example 1. Concepts used in Yahoo! Music for "artist" object

Each concept's definition should start with a <concept> tag and end with a </concept> tag. The <inherit> tag expresses inheritance relations and the <has> tag expresses attribute relations between two concepts. Every defined concept has a default attribute, the object identifier (id), which web crawlers use to coordinate the attributes of the same object across different pages of the same website.
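
The inheritance and attribute relations can be modelled in a few lines. This is an illustrative Python sketch rather than part of the WPT format: the Concept class and the attributes_of function are hypothetical, as are the concept names "person", "genre" and "album"; only "artist", "fullname" and the default "id" attribute appear in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    inherits: list = field(default_factory=list)  # parent concepts ("inherit" relation)
    has: list = field(default_factory=list)       # attribute concepts ("has" relation)


def attributes_of(name: str, ontology: dict) -> list:
    """Collect the attributes of a concept: the default id, then inherited
    attributes, then the concept's own attributes."""
    attributes = ["id"]  # every concept carries the default object identifier
    visited = set()

    def visit(concept_name: str) -> None:
        if concept_name in visited:  # guard against inheritance cycles
            return
        visited.add(concept_name)
        concept = ontology[concept_name]
        for parent in concept.inherits:
            visit(parent)
        for attribute in concept.has:
            if attribute not in attributes:
                attributes.append(attribute)

    visit(name)
    return attributes


# Hypothetical ontology for an "artist" object.
ontology = {
    "person": Concept("person", has=["fullname"]),
    "artist": Concept("artist", inherits=["person"], has=["genre", "album"]),
}
print(attributes_of("artist", ontology))
```

Because every concept exposes the same id attribute, a crawler could use it as a key to merge attributes of one object gathered from different pages.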

Website Parse Template provides several predefined concepts that are common to all kinds of websites:

* "Menu" - navigation bar/menu
* "Logo" - design element/logo
* "Content" - element that contains the main textual content of the page
* "Advertisement" - advertisement/banner
* "External Link" - element that contains external links

WPT Templates

The Templates section contains a number of templates, each of which refers to a single group of similarly structured web pages. XPath references or TagIDs of HTML elements are used to link structured content to the defined concepts. A template description starts with an opening <template> tag and ends with a closing </template> tag. The <template> tag must specify the template name and the language used for the template description. Any string can be chosen as the template name, but the language must be one of the supported language types, e.g. "icdl:template", "rdf" or "".
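
The idea of linking HTML elements to ontology concepts through XPath can be sketched as follows. This Python example is an assumption-laden illustration, not WPT syntax: the page markup, the element ids, the "biography" concept and the apply_template function are hypothetical, and the page is assumed to be well-formed XHTML so that the standard library's limited XPath support can be used.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page. Real-world HTML often needs a lenient parser.
PAGE = """
<html><body>
  <div id="header"><h1 id="artist-name">Example Artist</h1></div>
  <div id="bio"><p>Short biography text.</p></div>
</body></html>
"""

# Hypothetical template: ontology concept -> XPath of the element holding it.
TEMPLATE = {
    "fullname": ".//h1[@id='artist-name']",
    "biography": ".//div[@id='bio']/p",
}


def apply_template(page: str, template: dict) -> dict:
    """Extract one value per concept by evaluating its XPath on the page."""
    root = ET.fromstring(page.strip())
    record = {}
    for concept, xpath in template.items():
        element = root.find(xpath)
        if element is not None:
            record[concept] = (element.text or "").strip()
    return record


record = apply_template(PAGE, TEMPLATE)
print(record)
```

Concepts whose XPath matches nothing are simply left out of the record, so one template can be applied to pages in the same group even when an optional element is missing.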

Example 2. Simple template for single artist page on Yahoo! Music

A web page may contain structured repeatable content () enclosed in one main HTML element (), which are specified as follows:

Example 3. Repeatable content representation

If the specified complex HTML element is already described by another template, the tag can be used to point to that template block. This makes it possible to create hierarchical relations between WPT templates, so that web crawlers can use the specified reference(s) to identify the same object in different pages of a given website.
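
One way to picture such references is a template whose field points at another template instead of at a single element. The Python sketch below is hypothetical throughout: the rule encoding ("xpath" and "ref" tuples), the template names, the page markup and the extract function are assumptions and do not reproduce the WPT tag used for references.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed album page that embeds an artist block.
PAGE = """
<html><body>
  <h1 id="album-title">Example Album</h1>
  <div id="artist-box"><span class="name">Example Artist</span></div>
</body></html>
"""

# Each field is either ("xpath", path) or ("ref", container_path, template_name).
TEMPLATES = {
    "artist_block": {"fullname": ("xpath", ".//span[@class='name']")},
    "album_page": {
        "title": ("xpath", ".//h1[@id='album-title']"),
        "artist": ("ref", ".//div[@id='artist-box']", "artist_block"),
    },
}


def extract(element, template_name: str, templates: dict) -> dict:
    """Apply a template to an element, following references to other templates."""
    record = {}
    for concept, rule in templates[template_name].items():
        if rule[0] == "xpath":
            found = element.find(rule[1])
            has_text = found is not None and found.text
            record[concept] = found.text.strip() if has_text else None
        else:
            _, container_path, referenced = rule
            container = element.find(container_path)
            # Recurse into the referenced template, scoped to the container.
            record[concept] = (
                extract(container, referenced, templates)
                if container is not None
                else None
            )
    return record


root = ET.fromstring(PAGE.strip())
album = extract(root, "album_page", TEMPLATES)
print(album)
```

Reusing the artist block this way means the artist is described once and recognised wherever the block appears.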

Example 4. Hierarchic relations between WPT Templates

URLs section

This section defines the URLs/URL patterns that correspond to the groups of similarly structured web pages described in the Templates section. Like the Templates section, the URLs section may consist of several blocks, each of which should start with a <urls> tag and end with a </urls> tag.

Example 5. URLs/URL patterns

Any string can be chosen as the name of a URLs block, but the template must be the name of a template described in the Templates section. The URL pattern provided in "Example 5" also includes the real URL it represents. Regular expression (RegExp) syntax is used to describe URL patterns. The concepts needed for a URL pattern definition (such as "id" and "fullname") must be defined beforehand in the Ontology section.
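
How a regular expression can both select a template and pull concepts out of a URL is sketched below in Python. The host music.example.com, the URL layout, the template name and the match_url function are assumptions for illustration; only the concept names "id" and "fullname" are taken from this article.

```python
import re

# Hypothetical URL pattern keyed by the template it selects. The named
# groups correspond to concepts that would be defined in the Ontology section.
URL_PATTERNS = {
    "artist_page": re.compile(
        r"^http://music\.example\.com/artist/(?P<fullname>[a-z0-9-]+)/(?P<id>\d+)$"
    ),
}


def match_url(url: str):
    """Return (template name, concepts extracted from the URL), or None."""
    for template_name, pattern in URL_PATTERNS.items():
        match = pattern.match(url)
        if match:
            return template_name, match.groupdict()
    return None


result = match_url("http://music.example.com/artist/example-artist/12345")
print(result)
```

A URL that matches no pattern returns None, which a crawler could treat as a page not covered by any template.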

See also

* ICDL Crawling
* Open Market For Internet Content Accessibility
* Semantic Web
* World Wide Web Consortium
* RDF
* OWL
* Regular Expressions
* Universal Networking Language

External links

* [http://www.omfica.org OMFICA]
* [http://www.omfica.org/editor/index.php ICDL Editor]
* [http://www.w3.org W3C]
* [http://www.regular-expressions.info Regular Expressions]
* [http://www.undl.org UNDL]


Wikimedia Foundation. 2010.
