Daniel Martin (dtm) wrote,
@ 2007-04-15 11:42:00
Entry tags:html, tech

Most HTML templating languages are written incorrectly
Continuing my pattern of occasional technical posts just so that my journal won't be completely dormant, here's another one:

If you do much web development at all, you probably work with a template language of some kind. You know, the kind of thing where you write HTML with various placeholders in spots that get filled in by the web application - examples include JSP pages, Django's template system, Smarty templates, PHP pages, or HTML::Mason.

Anyway, the problem with virtually every HTML templating language out there is that they make it easier for the person writing HTML templates to add an XSS hole than to avoid it. This isn't a matter of making it possible for page writers to shoot themselves in the foot - that's always going to be possible, given any reasonable system - it's a matter of making it easier to do than to avoid.

I'm going to pick JSP as an example, because it's just horrendous in this regard, especially in JSP 2.0, where they make doing the wrong thing extra easy. Let's say you have a page where you echo back to the user something they typed in. Say you have a search page and you preface search results with something like:

Search results for “pirate monkeys”:

when the user types "pirate monkeys" into the search box. Now, in JSP you might well be tempted to code that as:

<b>Search results for &ldquo;${param.q}&rdquo;</b>:

And in your testing, this would appear to work just fine (assuming that the search box wound up in a parameter named q). However, if you did that you'd be opening yourself up to an XSS attack - in short, you'd be making it possible for someone to construct a URL that they can hand to your users to cause nasty things to happen. Instead, the correct way to write that is:

<b>Search results for &ldquo;<c:out value="${param.q}"/>&rdquo;</b>:

Now, which is easier to type? Which one is therefore more likely to be typed in a hurry when you've got to get the site up by the end of the week?

JSPs have another way of including dynamic content, by putting snippets of Java code inside <%= %> tags, but it suffers from the same problem. The default behavior is to include the text verbatim, without any HTML escaping, and that's just wrong.
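
To make the difference concrete, here is a minimal sketch in Python rather than JSP. The function names and the hostile sample input are made up for illustration; the only library call is the standard library's html.escape. It shows roughly what verbatim inclusion and escaped inclusion do with the same query string.

```python
from html import escape

def render_verbatim(q):
    # Roughly what ${param.q} or <%= %> does: paste the text in as-is.
    return "<b>Search results for &ldquo;" + q + "&rdquo;</b>:"

def render_escaped(q):
    # Roughly what <c:out> does: HTML-escape the text before including it.
    return "<b>Search results for &ldquo;" + escape(q) + "&rdquo;</b>:"

hostile = '<script>alert("xss")</script>'
print(render_verbatim(hostile))  # the script tag reaches the browser intact
print(render_escaped(hostile))   # the script tag is shown as harmless text
```

For an ordinary query like "pirate monkeys" the two functions give identical output, which is why the problem tends not to show up in testing.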

Other templating languages make it easier to properly escape output, but still tend to make it easier to include stuff verbatim (a rarely needed option) than to include stuff properly escaped. For example, a snippet on the Django page linked to above gives:

{% for story in story_list %}
<h2>
<a href="{{ story.get_absolute_url }}">
{{ story.headline|upper }}
</a>
</h2>
<p>{{ story.tease|truncatewords:"100" }}</p>
{% endfor %}


Which is all fine and good if you never let your users submit story content. The XSS-proof version of that snippet is:

{% for story in story_list %}
<h2>
<a href="{{ story.get_absolute_url|escape }}">
{{ story.headline|upper|escape }}
</a>
</h2>
<p>{{ story.tease|truncatewords:"100"|escape }}</p>
{% endfor %}


They make it easier to escape variable values than JSP does, but don't make it the default. The other templating languages cited above suffer from a similar design flaw. HTML::Mason at least allows the site administrator to set a site-wide "default escape", so that on that site the right thing is easier than adding XSS bugs.

If you ever find yourself in the position of designing an HTML template language, please make HTML-escaping the default behavior when including variables. Note that it is much better to err on the side of over-escaping content than under-escaping, because things that are over-escaped will be caught in development/testing, since they produce visible artifacts. Things that are under-escaped will not be caught until someone starts stealing your users' valuable information through an XSS hole you left on some obscure page you never really thought about.
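
As a rough sketch of what "escape by default" could look like, here is a toy renderer in Python. The placeholder syntax ({name} for escaped, {name:raw} for verbatim) and the render function are hypothetical, invented for this example, and not taken from any of the template systems mentioned above.

```python
import re
from html import escape

# Hypothetical placeholder syntax: {name} is escaped, {name:raw} is verbatim.
_PLACEHOLDER = re.compile(r"\{(\w+)(:raw)?\}")

def render(template, values):
    def substitute(match):
        text = str(values[match.group(1)])
        # Escaping is the default; opting out takes extra typing.
        return text if match.group(2) else escape(text)
    return _PLACEHOLDER.sub(substitute, template)

print(render("<p>{headline}</p>", {"headline": "Pirates & <Monkeys>"}))
print(render("<p>{body:raw}</p>", {"body": "<em>trusted markup</em>"}))
```

If someone forgets the :raw marker on trusted markup, the page shows literal &lt;em&gt; text, which is the kind of visible artifact that gets noticed in testing.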




Flexy is the answer
(Anonymous)
2007-04-15 09:04 pm UTC
For PHP at least... By default, placeholder contents are htmlentitied; if you want a variable to output HTML, you have to say so with the html filter (e.g. {somevalue:h} will output HTML with angle brackets intact, while {somevalue} will produce escaped content). It does the right thing by default, and more people ought to use it...

-metapundit
http://www.metapundit.net/sections/blog


RXML
ecmanaut.blogspot.com
2007-04-16 04:35 am UTC
The only template language I've come across that really does the right thing is RXML, the Roxen Macro Language (http://docs.roxen.com/roxen/4.5/web_developer_manual/entity/encoding.xml) (since version 2).

By default, it quotes all output according to the quoting rules of the content-type the document is served in, or of the attribute of some macro tag. Thus it properly handles the different attribute syntaxes understood in XML and HTML, it knows what to quote how for text/javascript, SQL queries used in data fetching for this and that database, and so on, and is generally really pleasant to work with.

The format to include an appropriately (content-type / context sensitively) quoted variable is &form.q; (equivalent of the example given above), or &roxen.version;, and you can override with your own pick of quoting using &form.q:js; for javascript quoting, or even a series of quotings applied after one another; &form.q:mysql:html;. Opting out of quoting is available with the quoting scheme "none", so where needed, &form.q:none; does the trick.


Why not just use XSL?
(Anonymous)
2007-04-16 07:25 am UTC
Why not simply change the custom, quirky templating language to a standards-based, well-formed one and use XSL? Our company uses 3 different development architectures, but they all use XSL as the templating layer, and it's sped up development and roll-out of all our applications.


Re: Why not just use XSL?
whumpdotcom
2007-04-16 03:20 pm UTC
The common complaint is that XSLT is "too hard," which it isn't.

But to do it right means your developers need to grok functional programming methods, and for many people trained in procedural or OO, that's not a simple context shift.



smin
2007-04-16 09:26 am UTC
PHP suffers from the same issue with the <?= syntax. I considered writing an extension to add a <?~ operator which would output htmlspecialchars()'d strings, but I never got started on it. In the same vein as the statement above on JSP, but in PHP: why are the two most important functions from an XSS and SQL injection perspective, htmlspecialchars() and mysql_real_escape_string(), the longest function names in the language?



kragen
2007-04-21 10:30 pm UTC
Nevow's Stan has what I think is a better answer. Rather than putting the information about whether a particular field is supposed to be quoted or not in the template, where it is difficult to verify that it matches the logic producing that field, it puts it in the value to be interpolated. If you put ordinary strings into your template, they are automatically quoted as HTML; if you have a variable that contains raw HTML that you don't want to quote, you have to put it into a special kind of object that has a different "flatten method."

In this way, the XSS-free-ness of the data flow is verifiable incrementally: user inputs start as strings, and if they remain strings, you're safe, because they'll be quoted. If at any stage you do something funky that will avoid a string being quoted, that decision is clearly located at one point in your program, hopefully next to the code that makes sure that string is safe to not be quoted.

This is the dynamically-typed equivalent of Joel Spolsky's suggested Hungarian solution to the problem.
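
A rough Python sketch of that idea follows. It is not Nevow's actual API: the Raw class and the flatten and render functions are names invented here for illustration, and the only real library call is the standard library's html.escape.

```python
from html import escape

class Raw:
    """Hypothetical marker type for markup that is already safe HTML."""
    def __init__(self, markup):
        self.markup = markup

def flatten(value):
    # Raw values pass through untouched; anything else is treated as
    # plain text and quoted as HTML.
    if isinstance(value, Raw):
        return value.markup
    return escape(str(value))

def render(parts):
    return "".join(flatten(part) for part in parts)

term = "<i>pirate</i> monkeys"  # user input stays an ordinary string
page = render([Raw("<b>Search results for &ldquo;"), term, Raw("&rdquo;</b>:")])
print(page)
```

The decision not to quote something is visible at exactly the places where Raw(...) is constructed, which is what makes the data flow checkable piece by piece.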



dtm
2007-04-23 01:13 am UTC
"This is the dynamically-typed equivalent of Joel Spolsky's suggested Hungarian solution to the problem."
I'd call it rather the strongly typed equivalent of Spolsky's Hungarian notation solution. I've seen very similar things worked out in Haskell, which is statically typed, but again (like Python) strongly typed. Presumably a similar system could be worked out for Java.

And while I agree that there is strong advantage to having variables richly annotated with what kind of text they contain, you're still going to need some way in the template to say something about the context in which some piece of text is being expanded - for example, if you want to pass a variable into some bit of javascript, you really don't want your template engine to turn & into &amp;, but it would be nice to backslash quote marks and other potentially dangerous things in the string. So even with this kind of approach, you're still going to want to annotate spots in the template to describe something about where the field is being used, so that the appropriate type of quoting transformation can occur.

That being said, with this approach there's virtually no call for some context marker that says "this here should be plaintext" (I guess maybe you might want that inside a CDATA chunk, but even there you'd want some way to quote the CDATA end sequence), so it's guaranteed that the default context is going to be the context in which the way to quote things is to html-encode them.

There's also the occasional time when you'll want multiple encodings applied, such as when substituting into a javascript variable that you then insert into the innerHTML attribute of something else on the page.
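
Here is a small Python sketch of both cases, under the assumption that the value lands inside a double-quoted JavaScript string in a script block. The function names are made up for this example; json.dumps and html.escape are standard library calls.

```python
import json
from html import escape

def quote_js_string(text):
    # json.dumps yields a double-quoted literal with quotes and backslashes
    # escaped, and it leaves & alone. "<" is additionally rewritten so that
    # "</script>" in the data cannot end the enclosing script block.
    return json.dumps(text).replace("<", "\\u003c")

def quote_html_then_js(text):
    # For a string that JavaScript will later assign to innerHTML:
    # HTML-escape first (the inner layer), then quote as a JavaScript
    # string (the outer layer, which the browser undoes first).
    return quote_js_string(escape(text))

value = 'Tom & "Jerry" </script>'
print("var label = " + quote_js_string(value) + ";")
print("el.innerHTML = " + quote_html_then_js(value) + ";")
```

The encodings are applied in the reverse of the order in which the browser decodes them, so the order matters when more than one is needed.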



kragen
2007-04-23 10:36 pm UTC
"I'd call it rather the strongly typed equivalent of Spolsky's Hungarian notation solution."

Maybe; it's pretty easy to violate the typing discipline and destroy the safety properties one might hope it would provide. You just define a class (or, in the Haskell version, a language) whose flatten method returns something unsafe --- or maybe even instantiate an existing class or language with an unsafe string.

"There's also the occasional time when you'll want multiple encodings applied, such as when substituting into a javascript variable that you then insert into the innerHTML attribute of something else on the page."

There's very little excuse for that; certainly the case you're citing would have been better off with document.createTextNode, which happens automatically if you use MochiKit.DOM, which is a sort of JavaScript port of Stan.

"So even with this kind of approach, you're still going to want to annotate spots in the template to describe something about where the field is being used..."

Yes, you are right. I mean, you could put that into the values to be substituted as well, but that's clearly the Wrong Thing. That's what I did the last time I implemented this approach, though. (I find that I haven't yet posted that code to kragen-hacks, so I'll send it out this week.)



kragen
2007-04-23 11:00 pm UTC
The code is in http://pobox.com/~kragen/sw/laptoptable.py now, and reading over it I notice that I'm not sure whether that's what I did or not.

Because it's like Nevow, it doesn't really have a separate textual template as such --- it just has a tree of elements, each of which includes its children rendered as HTML. The <script> tag renders its children differently --- it just asks them to convert themselves to strings rather than HTML --- because it has a CDATA content model in the HTML DTD.

You could treat "Search results for #{term}:" as syntactic sugar for ["Search results for ", as_html(term), ":"] though. Hmm...
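
A quick Python sketch of that desugaring, for what it's worth. The as_html here is only a stand-in that HTML-quotes plain text (an assumption, not necessarily what the function in laptoptable.py does), and expand is a name invented for this example.

```python
import re
from html import escape

def as_html(value):
    # Stand-in for the as_html named above: quote plain text as HTML.
    return escape(str(value))

def expand(template, values):
    # re.split with one capture group alternates literal text (even
    # positions) and placeholder names (odd positions).
    pieces = re.split(r"#\{(\w+)\}", template)
    return [as_html(values[p]) if i % 2 else p for i, p in enumerate(pieces)]

parts = expand("Search results for #{term}:", {"term": "pirate <monkeys>"})
print(parts)
print("".join(parts))
```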


