Do you need to create standards-compliant Web pages? You can do all of the work by hand, or use a little tool called HTML Tidy to help you out. This review looks at what HTML Tidy can and can't do, and how it can help you clean up your Web pages.
A Web author worth his or her salt wants to create Web pages that comply with standards. That means no badly formed HTML -- no missing tags, no proprietary extensions, no constructs that break in all but one or two browsers. These errors are annoying, and they can be avoided with little time and effort.
There are numerous applications and online services that validate HTML syntax. More often than not, though, they're good but not great. Most will check HTML, but not correct it. If you have a lot of files, you have to check each one and make the corrections by hand, which takes a lot of time and effort. All in all, just about every app and service out there is either too bulky or lacks the functionality that you need.
Weighing in at only a few hundred kilobytes (believe me, this is tiny for a program that packs this much punch), HTML Tidy is the closest you'll get to a perfect HTML utility. Not only does it check HTML files, Tidy fixes the problems it finds and does a whole lot more. Tidy proves that a lot of functions can be crammed into a tiny package.
Tidy is an anachronism in the world of the graphical user interface. It's a command line application, meaning you have to type a string of commands to get Tidy to run. It may sound like an old-fashioned way of doing things; however, it's anything but. The command line interface gives Tidy a great deal of flexibility. And it has a feature -- discussed later -- that can save you keystrokes.
How It Works
Tidy fixes a number of common, and not so common, mistakes in HTML files. It does this by analyzing the markup in a file and comparing it to the HTML 4.01 specification. Depending on the options you specify, Tidy can fix the problems it finds or it can generate a log detailing the errors.
The range of problems Tidy can fix is impressive. It can add missing end tags and fix mismatched ones, correct tags that are in the wrong order, insert quotes around attribute values, and even add a missing > to a tag. One of the few things Tidy can't do is add SUMMARY attributes to tables. You'll have to go in and add that attribute manually.
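The two modes described above, reporting problems versus fixing them, map onto a handful of command line switches. The Python sketch below only builds the argument lists and does not run anything. It assumes a tidy executable on your path, the usual switches -e (show only errors and warnings), -f (write the report to a file), -m (modify the file in place) and -q (quiet), and Tidy's documented exit codes. The function names and file names are invented for illustration.

```python
# Sketch: build Tidy command lines for "report only" and "fix in place".
# Assumes a `tidy` executable on the PATH; helper names are hypothetical.

# Exit codes as documented for Tidy: 0 = clean, 1 = warnings, 2 = errors.
EXIT_MEANINGS = {
    0: "no problems found",
    1: "warnings issued",
    2: "errors found",
}


def report_command(html_path, log_path):
    """Arguments that check a file and write the findings to a log."""
    return ["tidy", "-q", "-e", "-f", log_path, html_path]


def fix_command(html_path):
    """Arguments that correct a file in place."""
    return ["tidy", "-q", "-m", html_path]


def describe_exit(code):
    """Translate Tidy's exit status into a short message."""
    return EXIT_MEANINGS.get(code, "unexpected exit status %d" % code)


if __name__ == "__main__":
    # Show the two command lines for a sample file.
    print(" ".join(report_command("index.html", "index.log")))
    print(" ".join(fix_command("index.html")))
```

Keeping the report and fix commands separate lets you look at the log first and decide whether you trust Tidy to rewrite the file.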
Total Control
Tidy's 30+ command line options are usually enough for most purposes. But they only scratch the surface of what this utility can do. You can tap into all of Tidy's power and functionality by using a configuration file. A configuration file is simply a text file listing various program options that gives you access to many of Tidy's extended features. These features include HTML to XHTML conversion, fixing the so-called HTML produced by Microsoft Word, adding ALT text to images, etc. There are even options for formatting markup, and for dealing with scripting languages. You simply specify the configuration file on the command line, and let Tidy do the rest. You can find complete descriptions of the options at the Tidy home page (see the end of this article for the link).
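To give a feel for the format, here is a rough Python sketch that turns a dictionary of options into configuration file text, one "name: value" pair per line. The option names (output-xhtml, word-2000, alt-text, indent, wrap) are standard Tidy settings, but the values, the helper name and the file names tidy.conf and page.html are examples only, not recommendations.

```python
# Sketch: render a Tidy configuration file from a dictionary of options.
# Option names are standard Tidy settings; the values are examples only.

def render_config(options):
    """Return config-file text: one 'name: value' pair per line."""
    lines = []
    for name, value in options.items():
        if isinstance(value, bool):
            value = "yes" if value else "no"   # Tidy spells booleans yes/no
        lines.append("%s: %s" % (name, value))
    return "\n".join(lines) + "\n"


EXAMPLE_OPTIONS = {
    "output-xhtml": True,   # convert HTML to XHTML
    "word-2000": True,      # clean up markup produced by Microsoft Word
    "alt-text": "image",    # default ALT text for images that lack it
    "indent": "auto",       # indent block-level content where sensible
    "wrap": 72,             # wrap lines at column 72
}

if __name__ == "__main__":
    # Print the text you would save as, say, tidy.conf.
    print(render_config(EXAMPLE_OPTIONS), end="")
```

With the output saved as tidy.conf, the command line becomes something like tidy -config tidy.conf -m page.html.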
For all their usefulness, configuration files can be cumbersome to create. You not only have to sift through the options to find the ones you want to use, but you also have to spend time building a configuration file. And because no two sets of HTML files are exactly the same, you may have to create multiple files. Keeping track of them can be a chore.
A Touch of Style
Thanks to the influence of Netscape and Microsoft, far too many Web authors use extensions like <font> and <center>. Tidy has a neat option for replacing these tags with Cascading Style Sheet (CSS) properties, making the markup compliant with the HTML 4.01 and XHTML standards.
The CSS option does a good job of replacing non-standard markup, but not with the CSS you might use. The CSS Tidy adds to a file looks something like this:
You'll undoubtedly need to do some manual editing to fit the tidied files into your format.
XML and XHTML Anyone?
Being at the leading edge of Web standards, Tidy also supports XML and XHTML. Using configuration file options, you can convert an HTML file to XHTML or XML. Or at least that's what the documentation says. The XHTML conversion works very well. Tidy adds the XHTML document type and namespace to the header of an HTML file, and converts HTML tags to their XHTML equivalents. For example, tags like <br> and <hr> become <br /> and <hr />. The conversion checks out using the W3C's XHTML validator, and renders well in any browser.
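Why the trailing slash matters can be seen with any XML parser. The Python sketch below uses the standard library's xml.etree.ElementTree on two hand-written fragments (they are not Tidy output, and the helper name is made up). It checks well-formedness only, which is a much weaker test than the W3C validator mentioned above.

```python
# Sketch: HTML-style empty tags are not well-formed XML; XHTML-style ones are.
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Hand-written fragments for illustration only.
HTML_STYLE = "<p>First line<br>Second line</p>"
XHTML_STYLE = "<p>First line<br />Second line</p>"

if __name__ == "__main__":
    for label, fragment in (("HTML", HTML_STYLE), ("XHTML", XHTML_STYLE)):
        print(label, "well-formed:", is_well_formed(fragment))
```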
I've never had much luck converting a Web document to XML, however. Instead of an XML file, the output is still HTML. This was really the only disappointing aspect of Tidy. But that doesn't mean it has no XML capabilities. Using the -xml command line option, which tells Tidy that the input is XML, Tidy can fix errors in XML files. Not every error, mind you -- Tidy can't cope with CDATA (text intended to be interpreted as containing no XML markup), for example -- but it catches most of the major ones.
Strengths and Weaknesses
One of Tidy's biggest strengths is its portability. Versions of Tidy are available for over 15 platforms, including Windows, DOS, Mac OS, several flavours of UNIX/Linux, and BeOS. On top of that, Tidy is an Open Source application. If there isn't a version for your favourite operating system and you program in C, you can download the source code (the building blocks of a program) and start hacking.
Tidy is also integrated with a number of text and HTML editors, including NoteTab Pro and HTML-Kit (for Windows), as well as Bluefish and Quanta (for Linux). And this is just the tip of the iceberg. Several other editors have also integrated HTML Tidy, and you can even add Tidy support to a growing number of editors.
Being a command line tool, Tidy won't appeal to anyone who is used to point-and-click convenience. You can, however, create a batch file or shell script to execute Tidy with the options you commonly use. The only other downside is that you sometimes have to run Tidy two or more times to completely clean a file.
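As a rough stand-in for such a batch file or shell script, the Python sketch below runs Tidy with a fixed set of options and repeats the run until Tidy stops issuing warnings or a pass limit is reached. It assumes a tidy executable on your path and Tidy's documented exit codes (0 clean, 1 warnings, 2 errors). The function names, the option list and the limit of three passes are my own choices, not anything Tidy prescribes.

```python
# Sketch: a Python stand-in for the batch file or shell script.
# Assumes a `tidy` executable on the PATH; names here are hypothetical.
import subprocess
import sys

COMMON_OPTIONS = ["-q", "-m"]   # quiet, modify in place; adjust to taste


def run_tidy(path):
    """Run Tidy once on `path` and return its exit status."""
    return subprocess.call(["tidy"] + COMMON_OPTIONS + [path])


def tidy_until_clean(path, runner=run_tidy, max_passes=3):
    """Re-run Tidy while it reports warnings (exit status 1).

    Stops as soon as the status is anything other than 1, or after
    max_passes runs. Returns (passes used, last exit status).
    """
    passes, status = 0, None
    for passes in range(1, max_passes + 1):
        status = runner(path)
        if status != 1:
            break
    return passes, status


if __name__ == "__main__":
    # Usage: python tidyall.py file1.html file2.html ...
    for name in sys.argv[1:]:
        passes, status = tidy_until_clean(name)
        print("%s: %d pass(es), exit status %d" % (name, passes, status))
```

The pass limit matters because some warnings never go away on their own. A table without a SUMMARY attribute, for instance, will be flagged on every run until you add the attribute yourself.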
All told, Tidy is an indispensable tool for any Web author. It can save you a lot of time finding and correcting errors in your HTML. And it can ensure that your Web documents comply with standards. For that reason alone, Tidy is worth the download.