Application Security Testing: An Integral Part of DevOps
Producing Valid HTML
Fact: most web pages are invalid. In 2001, Dagfinn Parnas analyzed the code of 2.4 million web sites and concluded that less than 1 percent of the pages were valid HTML. More recent studies looked at the home pages of the organization members of the W3C and at the sites of well-known bloggers who write about web standards. Although you might expect the sample of individuals and organizations selected in those two studies to be more likely to use valid HTML on their sites, the studies concluded that even for this sample the percentage of valid HTML is in the low single digits.
The reason we have so much invalid HTML on the web today is that historically browsers have been going to great lengths trying to render invalid HTML. The initial goal was to make the life of the HTML author easier: even if your HTML is not really valid, the browser will not complain and will display something based on some heuristic. In most cases the browser is able to make correct assumptions, and the page comes out just as you intended. Historically, as features were added to HTML, browsers became larger pieces of software, with a lot of code implementing those heuristic dealings with invalid HTML. And with just a small percentage of web pages being valid, you can safely bet that browsers will continue to support invalid HTML as they do today for the foreseeable future.
Then what is your incentive for writing valid HTML? After all, you just want your page to be rendered by the browser the way you intended. So as long as you get the intended result, why would it matter if the HTML sent to the browser is valid or invalid? We will argue here that it does matter, and that producing valid HTML has direct benefits for you, the web developer.
We all know that there are differences between browsers: a given page might look fine under Firefox and Safari, but will have problems with Internet Explorer, or vice versa. Browsers implement the HTML specification more or less closely and may make different assumptions because there is room for interpretation in the specifications or just simply because they have bugs. But handling invalid HTML is completely outside of the scope of the HTML specification. So when it comes to invalid HTML, browsers are on their own, and in our experience you are much more likely to see differences between browsers with invalid HTML than with valid HTML. So you will benefit from generating valid HTML just for that reason.
When rendered by the browser, it will look something like this:
Now imagine that text in each list item becomes longer. To make the list easier to read you decide it makes sense to make each list item a paragraph; this way the browser will add some space between each item. You do this by modifying the HTML as follows:
Can you see the error? Yes, the paragraph should go inside the <li> element, instead of going around it. But if you write the preceding code, chances are you won't even find out about your mistake because the browser will render it just fine and give you the expected result. If this appears in a static page, and the page renders as you expect, there isn't much harm done. However, now consider that you have a button on the page that moves the second item in the list to the first position. For this you add IDs on the <li> elements:
This code essentially takes the element with ID "second" and moves it before the element with ID "first". Now one would expect the same code to work if you add a <p> element around <li> and move the ID to the <p> element, as in:
In this case your code does not work. It does not work because you have wrongly assumed that the browser saw your HTML the way you wrote it and created a DOM that looks like Figure 1.
Instead, Internet Explorer and Firefox create a DOM that looks like Figure 2. Note that because this DOM is created by the browser based on invalid HTML, it is entirely possible for other browsers to create yet another DOM, further complicating the issue.