Tag soup

In Web development, tag soup refers to HTML code written for a Web page without regard for the rules of HTML structure and semantics. Generally, tag soup is created when the author uses HTML to produce a presentational document rather than a semantic one. Because web browsers have always treated HTML errors leniently, browser implementers also use the term to refer to all HTML: browsers must parse HTML as tag soup and recover from its errors, whereas in XML, according to the specification, errors need not, and should not, be corrected.

Tag soup is characterized by a large number of common authoring mistakes, such as malformed HTML tags, improperly nested HTML elements, unescaped special characters (especially ampersands (&) and less-than signs (<)), and the use of presentational HTML elements and attributes to create visual effects without respect for their implied meaning (that is, against their semantic purpose).
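The unescaped-character mistake mentioned above can be avoided mechanically. The following is a minimal sketch using Python's standard library; the helper name escape_text is hypothetical and chosen only for illustration.

```python
from html import escape


def escape_text(text):
    """Return text with &, < and > replaced by character references.

    quote=False leaves quotation marks alone, which is sufficient for
    text placed between tags (attribute values would need quote=True).
    """
    return escape(text, quote=False)


if __name__ == "__main__":
    print(escape_text("Fish & chips cost < 5 pounds"))
```

Escaping at the point where text is inserted into markup keeps a stray ampersand or less-than sign from being read as the start of an entity or a tag.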

Although often thought of as typifying private and semi-professional or hobbyist Web sites, tag soup is created by many professional web page layout programs, and written by hand by many professional web developers for some of the highest-profile sites.

Overview

Tag soup is a term used to denigrate various practices in web authoring. Some of these (roughly ordered from most severe to least severe) include:

# Malformed markup where tags are improperly nested. For example, the following:

<p>This is a malformed fragment of <em>HTML</p></em>


# Misuse of semantic elements for presentational purposes. This may include the use of "table" elements and "img" elements to provide pixel-level layout instead of using CSS. Another common example is the use of the blockquote element for paragraph indentation of non-quoted material.
# Invalid markup where elements are improperly nested according to the DTD for the document. Examples of this include nesting a "ul" element directly inside another "ul" element, which is invalid under any of the HTML 4.01 or XHTML DTDs.
# Misuse of elements for semantic overcompensation. There has been a trend to overcompensate by being hyper-semantic. Some authors use the semantic element "strong" instead of the "b" element for bold text even when the purpose is not to emphasize strongly. The use of the semantic strong emphasis element is mistakenly seen as more correct even when the author does not intend to convey strong emphasis.
# Use of proprietary or discontinued elements and attributes instead of W3C-recommended ones. This includes use of elements such as "embed", which serves much the same purpose as "object".
# HTML compared to XHTML. Because XHTML requires browsers to avoid rendering malformed code, the XML parser behind XHTML has far fewer demands placed upon it than the HTML parser. This has led some to refer to HTML in all of its non-XHTML versions as tag soup.
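The first item in the list, improperly nested tags, can be detected with a simple stack of open elements. The sketch below is an illustration only: the class NestingChecker, the function find_nesting_problems and the short VOID_ELEMENTS set are hypothetical names, and the checker knows nothing about the optional end tags that HTML permits (for "p" or "li", for example), so it can report valid HTML as a problem.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for illustration only).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class NestingChecker(HTMLParser):
    """Records closing tags that do not match the most recently opened element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open element names
        self.problems = []   # descriptions of mis-nested closing tags

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            expected = self.open_tags[-1] if self.open_tags else None
            self.problems.append(f"</{tag}> closed while <{expected}> was open")


def find_nesting_problems(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    print(find_nesting_problems("<p>This is a malformed fragment of <em>HTML</p></em>"))
```

The parser itself raises no error for the malformed fragment; it is the stack comparison that exposes the mismatch.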

Causes and implications

Malformed markup

Malformed markup is arguably the most severe problem in web authoring. However, thanks to better education and information, and perhaps a knock-on effect from XHTML, malformed markup is becoming less common. When pages render drastically differently across browsers, malformed content is often the reason. This is because browsers, when faced with malformed markup, must infer the author's meaning. They must infer closing tags where they expect them and then infer opening tags to match other closing tags. The interpretation can vary markedly from one browser to the next. [http://ian.hixie.ch/ Ian Hickson] wrote a detailed article investigating the differences between [http://ln.hixie.ch/?start=1037910467&count=1 how browsers handle tag soup].

While many graphical web editors produce well-formed markup, an author writing code manually with a text editor and then testing in only one browser can easily miss such errors. The visual, aural, and tactile presentation can therefore vary drastically from one browser to another as each tries to “correct” the author's intent in different ways and then applies styling to those “corrections”.
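To make the idea of inferred closing tags concrete, the following sketch applies one possible repair strategy: when a closing tag arrives out of order, every element opened after its matching start tag is closed first, and stray closing tags are dropped. This is an assumption made for illustration, not the algorithm of any particular browser; the names NaiveRepairer and repair are hypothetical, and void elements such as "br" are not handled.

```python
from html.parser import HTMLParser


class NaiveRepairer(HTMLParser):
    """Rebuilds markup so that every element is closed in nesting order."""

    def __init__(self):
        super().__init__()
        self.stack = []  # names of elements still open
        self.out = []    # pieces of the repaired output

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        # Reuse the start tag exactly as written, attributes included.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened after the matching start tag, then the tag itself.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)


def repair(markup):
    repairer = NaiveRepairer()
    repairer.feed(markup)
    repairer.close()
    # Close whatever the author never closed.
    while repairer.stack:
        repairer.out.append(f"</{repairer.stack.pop()}>")
    return "".join(repairer.out)


if __name__ == "__main__":
    print(repair("<p>This is a malformed fragment of <em>HTML</p></em>"))
```

A different but equally defensible strategy, such as reopening the "em" element after the paragraph ends, would produce a different document tree from the same input, which is the kind of divergence between browsers described above.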

Misuse of semantic elements and attributes

Some design idioms that were once good workarounds given the lack of presentational elements in early HTML specifications are now considered tag soup. These include the use of HTML table elements for layout rather than for tabular data, the HTML font element, and single-pixel GIF images used for spacing (spacer GIFs). It is now advisable that CSS be used in place of such hacks. Another common approach, perhaps necessary before CSS, was to choose heading elements (for example, "h1" or "h2") according to their default display in a given browser and not according to the hierarchical level of the heading (in other words, chapter, section, subsection).

Invalid markup

Invalid markup here means only the use of attributes and elements where they do not belong. For example, placing a "cite" attribute on a "cite" element is invalid since the HTML and XHTML DTDs do not ascribe any meaning to that attribute on that element. Similarly, including a "p" element within the content of an "em" element is also invalid. With the move toward separating malformed markup from invalid markup, the problems with invalid markup have increasingly been seen as less severe. Some have begun to advocate looser content models that allow greater flexibility in authoring HTML documents (whether in HTML or XHTML). However, use of invalid markup can blur the author's intended meaning, though not as severely as malformed markup.

Many graphical web editors still produce invalid markup. Moreover, many professional web designers and authors pay little attention to issues of validity. Invalid markup is common on sites throughout the World Wide Web.

Misuse of elements for semantic overcompensation

There has been a trend to overcompensate by being hyper-semantic. Some authors use the semantic element "strong" instead of the "b" element for bold text even when the purpose is not to emphasize strongly. The use of the semantic strong emphasis element is mistakenly seen as more correct even when the author does not intend to convey strong emphasis.

Another example is an author who wants to visually present languages other than the document's root language in italics but, to be "semantically correct", chooses the "em" element rather than the "i" (italics) element, like this:

As they say in France, <em>C'est la vie</em>.

In this case, the emphasis element conveys no more meaning than the italics element. In some ways it is worse, since the emphasis element implies the italics are due to emphasis rather than a change in language, whereas the italics element would have made clear that the italics were strictly presentational and that the semantic content was missing. A more meaningful approach is to use the "span" element, attach a class attribute with a value such as "other-language", and then use CSS to style that class. The recommended solution [http://www.w3.org/TR/html401/struct/dirlang.html#h-8.1 W3C recommendation regarding specifying the language of HTML content] is to use the lang attribute with an ISO language code:

As they say in France, <span lang="fr">C'est la vie</span>.

If all browsers targeted by the design support CSS2.1, this can be styled using the selector span[lang|="fr"] [http://www.w3.org/TR/CSS21/selector.html CSS2.1 selectors]. If that support cannot be guaranteed, adding class="other-language" as well and styling on that basis may be safer.
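The matching rule behind the |= attribute selector is simple: the attribute value must either equal the given string or begin with it followed by a hyphen. The sketch below mimics that rule to find "span" elements by language. The names lang_matches, LangFinder and find_lang_spans are hypothetical, and comparing case-insensitively is an assumption made here because language codes are commonly written in mixed case.

```python
from html.parser import HTMLParser


def lang_matches(value, prefix):
    """True if value equals prefix or starts with prefix plus a hyphen."""
    value = value.lower()
    prefix = prefix.lower()
    return value == prefix or value.startswith(prefix + "-")


class LangFinder(HTMLParser):
    """Collects the lang values of span elements that match a language prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.matches = []

    def handle_starttag(self, tag, attrs):
        lang = dict(attrs).get("lang")
        if tag == "span" and lang is not None and lang_matches(lang, self.prefix):
            self.matches.append(lang)


def find_lang_spans(markup, prefix):
    finder = LangFinder(prefix)
    finder.feed(markup)
    finder.close()
    return finder.matches


if __name__ == "__main__":
    sample = 'As they say in France, <span lang="fr">C\'est la vie</span>.'
    print(find_lang_spans(sample, "fr"))
```

Under this rule a value such as "fr-CA" matches the prefix "fr", while an unrelated code that merely starts with the same letters, such as "fra", does not.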

Use of proprietary/discontinued elements

In the early age of the web (much of the 1990s), the semantic design of the official HTML specification became increasingly strained by the desire of designers to display content in visually vibrant ways. The browser developers therefore pushed HTML into increasingly non-semantic avenues. They also introduced new semantics, often with conflicting names or for redundant purposes. This meant there were versions of HTML that worked in one browser but not in another. The growth of the W3C and, in particular, its introduction of CSS (CSS1 became a Recommendation in 1996, followed by CSS2 in 1998) helped to provide an outlet for presentational properties that did not require presentational HTML elements and attributes.

Many of these attributes and elements have either been combined into a single semantic construct (such as the "applet", "embed" and "object" elements) or have been deprecated (such as the "s", "strike" and "u" elements). Nevertheless, browser developers have continued to introduce new elements to HTML when they have perceived a need. Some browsers include tabindex attributes on any element. WebKit developers aligned with Apple introduced the "canvas" element that behaves much like the "object" or "embed" element. Mozilla then introduced their own "canvas" element, which behaves even more like the "object" element.

HTML compared to XHTML

Because XHTML requires browsers to avoid rendering malformed code, the XML parser behind XHTML has far fewer demands placed upon it than the HTML parser. This has led some to refer to HTML in all of its non-XHTML versions as tag soup. Web browsers shoulder the greatest burden in parsing and simultaneously correcting malformed HTML markup. In contrast, web browsers need perform no corrections whatsoever on malformed XHTML markup. The browser is expected simply to fail to render the page and to display an error message pointing the user to the first instance of malformed content. This practice greatly simplifies XML parsing compared to HTML parsing. It has also led some to call HTML parsers tag soup parsers, meaning that everything parsed as HTML is parsed as tag soup. In particular, this is the sense of tag soup that refers to malformed markup. If or when XHTML takes hold, and as XML parsers mature, the rendering process should speed up.
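The contrast between the two parsing models can be observed with the parsers in Python's standard library, which serve here only as stand-ins for a browser's XML and HTML parsers. In this sketch the names parses_as_xml, TagLogger and html_events are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

FRAGMENT = "<p>This is a malformed fragment of <em>HTML</p></em>"


def parses_as_xml(markup):
    """Return True if an XML parser accepts the markup as well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


class TagLogger(HTMLParser):
    """Logs start and end tags in the order a lenient HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(markup):
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


if __name__ == "__main__":
    print(parses_as_xml(FRAGMENT))  # the XML parser stops at the first error
    print(html_events(FRAGMENT))    # the HTML parser reports every tag it sees
```

The XML parser refuses the fragment outright, while the HTML parser hands the mis-nested tags on and leaves any recovery to the code that consumes them.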

Evolving specifications to solve tag soup

While some of the issues of tag soup are due to shortcomings of browsers and sometimes due to a lack of information for web authors, some of the proliferation of tag soup was due to missing links in the web standards themselves. The W3C has spearheaded several efforts to address the shortcomings of web standards.

Cascading Style Sheets (CSS)

Cascading Style Sheets (CSS) provide a mechanism to select specific elements within markup and style them according to the designer's intent. By providing extremely flexible and device-independent styling of semantic markup, CSS tries to overcome the urge of authors to use semantic elements for presentational purposes. In fact, it even seeks to eliminate the need for any presentational markup whatsoever. While early web authors and designers had little choice but to employ presentational elements and misuse semantic ones to create visually effective web pages, CSS has largely done away with the need for those methods. However, old habits die hard, and many authors continue to employ these tag soup hacks unnecessarily rather than embrace CSS.

XML and XHTML

XHTML is a reformulation of the HTML language based on XML. XHTML was developed to address many of the problems associated with tag soup.

First, XML separates the well-formedness of a document from its validity. In HTML and SGML, these two concepts are intertwined. Because XML requires all elements to be explicitly closed, authors can first determine whether a document has well-formedness errors before checking its validity. In HTML, these two operations are inherently mixed together.
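The two-stage check can be sketched as follows: the markup is first parsed as XML, and only a well-formed document is then tested against a content model. The FORBIDDEN_CHILDREN table below is a hypothetical, heavily simplified stand-in for a DTD that covers just two rules mentioned in this article, and the function name check is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: parent element -> children it may not contain directly.
FORBIDDEN_CHILDREN = {
    "em": {"p"},   # a paragraph may not appear inside inline emphasis
    "ul": {"ul"},  # a nested list must sit inside an li element
}


def check(markup):
    """Return (status, messages), where status is 'malformed', 'invalid' or 'valid'."""
    # Stage 1: well-formedness. A parse error ends the check immediately.
    try:
        root = ET.fromstring(markup)
    except ET.ParseError as err:
        return ("malformed", [str(err)])

    # Stage 2: validity, checked only on a well-formed tree.
    errors = []
    for parent in root.iter():
        for child in parent:
            if child.tag in FORBIDDEN_CHILDREN.get(parent.tag, set()):
                errors.append(f"<{child.tag}> not allowed directly inside <{parent.tag}>")
    return ("invalid", errors) if errors else ("valid", [])


if __name__ == "__main__":
    print(check("<p><em>x</p></em>"))             # fails stage 1
    print(check("<div><em><p>x</p></em></div>"))  # passes stage 1, fails stage 2
    print(check("<p><em>x</em></p>"))             # passes both stages
```

Because the second stage always receives an unambiguous tree, a validity error can be reported precisely instead of being tangled up with guesses about where elements begin and end.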

Second, the XML specification clearly defines what a conforming user agent (such as a web browser) must do when malformed code is encountered. Thus, a browser interpreting a Web page as XHTML will refuse to display the page if it encounters a well-formedness error. This helps ensure that authors who test XHTML code against a conforming browser are immediately informed of malformation problems, perhaps the most severe problem facing web browsers. When code is malformed, the intent of the author is extremely ambiguous. Without the directives of XML, HTML browsers must run complicated algorithms to infer the author's intended meaning. If more and more authors use XHTML-based authoring tools, the problem of malformed documents could be eliminated, even if users continued to browse with HTML-only browsers.

Third, XML and XHTML introduce the concept of namespaces. Whereas HTML has a "class" attribute to support author-defined custom semantics, XML offers a much more complete solution through namespaces. With namespaces, authors or communities of authors can define new elements and attributes with new semantics and intermix them within their XHTML documents. Namespaces ensure that element names from the various namespaces will not be conflated. For example, a "table" element could be defined in a new namespace with semantics different from the HTML "table" element, and the browser would be able to differentiate between the two. By providing namespaces, XHTML combined with CSS allows authoring communities to easily extend the semantic vocabulary of documents. This accommodates the use of proprietary elements so long as those elements can be presented to the intended audience through complete style sheet definitions (including aural/speech and tactile styles).
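The "table" example can be illustrated with a namespace-aware XML parser. In the sketch below the furniture namespace URI, the "legs" element and the function name tables_by_namespace are hypothetical; only the XHTML namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
FURNITURE_NS = "http://example.org/furniture"  # hypothetical namespace for illustration

DOC = f"""<html xmlns="{XHTML_NS}" xmlns:f="{FURNITURE_NS}">
  <body>
    <table><tr><td>cell</td></tr></table>
    <f:table><f:legs>4</f:legs></f:table>
  </body>
</html>"""


def tables_by_namespace(document):
    """Return the qualified names of the table elements found in each namespace."""
    root = ET.fromstring(document)
    # ElementTree writes a qualified name as {namespace-uri}localname.
    html_tables = root.findall(f".//{{{XHTML_NS}}}table")
    furniture_tables = root.findall(f".//{{{FURNITURE_NS}}}table")
    return {
        "xhtml": [element.tag for element in html_tables],
        "furniture": [element.tag for element in furniture_tables],
    }


if __name__ == "__main__":
    print(tables_by_namespace(DOC))
```

Both elements share the local name "table", yet the parser reports them under different qualified names, so neither a processor nor a style sheet needs to confuse one with the other.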

Note that XHTML is parsed as XML by major web browsers only if it is served using the MIME type application/xhtml+xml. Many XHTML documents are currently served on the Web using the MIME type text/html, in order to ensure backwards compatibility with older browsers and with all current releases of Microsoft Internet Explorer. [http://www.w3.org/TR/xhtml1/#guidelines XHTML™ 1.0 The Extensible HyperText Markup Language (Second Edition): A Reformulation of HTML 4 in XML 1.0, Appendix C. HTML Compatibility Guidelines], W3C Recommendation, 26 January 2000, revised 1 August 2002 (accessed 2008-09-13): "XHTML Documents which follow the guidelines set forth in Appendix C, "HTML Compatibility Guidelines" may be labeled with the Internet Media Type "text/html" [RFC2854], as they are compatible with most HTML browsers. Those documents, and any other document conforming to this specification, may also be labeled with the Internet Media Type "application/xhtml+xml" as defined in [RFC3236]. For further information on using media types with XHTML, see the informative note [XHTMLMIME]." See also the discussion of this issue in the XHTML article.
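One way a server might decide between the two MIME types is to inspect the Accept header sent by the browser. The following is a minimal sketch under that assumption: the function name choose_content_type is hypothetical, and quality values (the q parameters) are deliberately ignored, which a production implementation would need to handle.

```python
def choose_content_type(accept_header):
    """Pick application/xhtml+xml only when the client lists it explicitly."""
    # Split "type/subtype;q=0.9, ..." into bare media types, discarding parameters.
    offered = [part.split(";")[0].strip().lower() for part in accept_header.split(",")]
    if "application/xhtml+xml" in offered:
        return "application/xhtml+xml"
    # Fall back to text/html for clients that do not announce XHTML support.
    return "text/html"


if __name__ == "__main__":
    print(choose_content_type("text/html,application/xhtml+xml,application/xml;q=0.9"))
    print(choose_content_type("text/html, */*"))
```

A browser that receives text/html will parse the document with its tag soup parser even if the markup itself is well-formed XHTML.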

Tools to fix tag soup

* HTML Tidy has been ported to almost all platforms
* Aggiorno is a Visual Studio add-in that focuses on making web sites standards compliant

Wikimedia Foundation. 2010.
