From the Top: MIME and Content Negotiation

Content negotiation at its simplest is a conversation between your web server and a user agent (browser, search engine bot, etc.) to determine the preferred format or version of a resource to serve. In this, the second article in my series “From the Top”, I will introduce you to the web (head) waiter that knows how to correctly serve your web page to a user agent.

MIME and Content Negotiation

To negotiate, a user agent sends a Hypertext Transfer Protocol (HTTP) “Accept” header to the web server. The header lists the Multipurpose Internet Mail Extensions (MIME) types the user agent prefers, each with an optional ranking or weighting (the Quality Value, q) of how well it understands that type. The ranking runs from 0 to 1 with up to three decimal places, and if no q is given for a MIME type then q=1.0 is assumed.

MIME, as the name suggests, was first used as an extension to email but is also used by HTTP. It is simply a way to define what type of media (resource) is being sent, be it an image, Flash, text etc. Each resource has a MIME type consisting of two parts separated by a forward slash, “/”. The first part is called the top-level media type and the second is the subtype. For example, the MIME type for a Graphics Interchange Format (GIF) image is image/gif.

A Mozilla accept header may look like this:
Accept: text/xml, application/xml, application/xhtml+xml, text/html;q=0.9, text/plain;q=0.8, image/png, image/jpeg, image/gif;q=0.2, */*;q=0.1
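
To make the header format concrete, here is a minimal PHP sketch that splits an Accept header into MIME types and their Quality Values. The function name parse_accept_header is my own invention for illustration, and the sketch assumes PHP 7.3 or later; it is not a complete implementation of the HTTP rules (it ignores parameters other than q, for instance).

```php
<?php
/**
 * Parse an HTTP Accept header into an array of MIME type => quality value.
 * A type listed without a q parameter gets the default of 1.0.
 * (Hypothetical helper, written for this article.)
 */
function parse_accept_header(string $header): array
{
    $types = [];
    foreach (explode(',', $header) as $part) {
        $params = array_map('trim', explode(';', $part));
        $type = strtolower(array_shift($params));
        if ($type === '') {
            continue; // skip empty segments such as a trailing comma
        }
        $q = 1.0; // default when no q parameter is present
        foreach ($params as $param) {
            if (preg_match('/^q\s*=\s*([01](?:\.\d{0,3})?)$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        $types[$type] = $q;
    }
    arsort($types); // highest quality value first
    return $types;
}

$accept = 'text/xml, application/xml, application/xhtml+xml, text/html;q=0.9, '
        . 'text/plain;q=0.8, image/png, image/jpeg, image/gif;q=0.2, */*;q=0.1';
$prefs = parse_accept_header($accept);

echo $prefs['application/xhtml+xml'], "\n"; // 1
echo $prefs['text/html'], "\n";             // 0.9
```

For the Mozilla header above, application/xhtml+xml carries the implied q of 1 and so outranks text/html at 0.9.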

For Hypertext Markup Language (HTML), the MIME type is text/html. User agents parse it in a very forgiving way, but this “tag soup” rendering mode is slower to display as a result.

If you are using Extensible Hypertext Markup Language (XHTML) markup, the correct MIME type is application/xhtml+xml. This is because XHTML is HTML reformulated as an Extensible Markup Language (XML) application and as such must be well-formed: no overlapping elements, properly closed tags, attribute values enclosed in quotes and care taken with case-sensitivity (all elements and attributes written in lowercase). Due to these quality checks, user agents are able to handle the markup more efficiently and render it more quickly than tag soup.
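
If you want to check well-formedness before a page ever reaches a browser, an XML parser will do it for you. The sketch below assumes PHP's bundled DOM extension is available; is_well_formed is a hypothetical helper name. Note that it only tests well-formedness, not validity against the XHTML DTD, so it will not catch uppercase element names.

```php
<?php
/**
 * Return true when $markup parses as well-formed XML.
 * (Hypothetical helper; relies on the DOM and libxml extensions.)
 */
function is_well_formed(string $markup): bool
{
    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

$good = '<p>Properly <em>nested</em> and closed.<br /></p>';
$bad  = '<p>Overlapping <b><i>elements</b></i> and an unclosed<br></p>';

var_dump(is_well_formed($good)); // bool(true)
var_dump(is_well_formed($bad));  // bool(false)
```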

User agent support for application/xhtml+xml

Unfortunately, Internet Explorer 6 Service Pack 2 (IE 6 SP2) and below do not understand this particular MIME type and will attempt to download the page as an XML file. Support for the MIME type is also buggy in several other user agents, although compatibility should improve as time goes by. Until then, there are two ways around this showstopper. The first is to use XHTML 1.0 in what is termed backwards compatibility mode and the other is through content negotiation.

XHTML 1.0 Backwards Compatibility Mode

By following the World Wide Web Consortium (W3C) guidelines, XHTML 1.0 can be served with the MIME type of text/html. This mode relies on such techniques as including a space before the trailing /> when closing an empty element, avoiding white space and line breaks in attribute values, and encoding ampersands in content, including in Uniform Resource Identifiers (URIs) referenced in hyperlinks. A lot of web developers coding to web standards (including the author and this website at time of writing) work in this mode, and the correctness of this method is a matter of hot debate. Although we may not be utilising XML within a particular website at launch, XHTML 1.0 for me at least offers forward compatibility with future XML applications I may be asked to implement; HTML cannot offer that. I can and do serve XHTML 1.0 as text/html without specific content negotiation and it works (obviously), but I want to do this right and so does this article series.

Gotchas of Serving application/xhtml+xml

Before we embark on serving our XHTML with the correct MIME type there are several issues that you must consider — remember you might be coding for a Content Management System (CMS) / multi-author environment. The text editor needs to be capable and configured properly.

  • Code must be well-formed. Remember XHTML is a reformulation of HTML as an application of XML. As such, it must be well-formed.
  • The XML Declaration is required for character sets other than UTF-8 and UTF-16. It is part of the XML Prolog and goes on line 1 of your code in the format <?xml version="1.0" encoding="yourChosenCharset"?>
  • Stylesheets may be referenced with an XML stylesheet Processing Instruction (PI) as part of the XML Prolog (along with the DOCTYPE). The Processing Instruction <?xml-stylesheet href="myStyle.css" type="text/css"?> is written much the same way as the HTML 4 <link rel="stylesheet">. If you are serving alternative stylesheets then <link href="myStyle.css" title="Medium" rel="alternate stylesheet" type="text/css" /> becomes <?xml-stylesheet alternate="yes" href="myStyle.css" title="Medium" type="text/css"?> as a style sheet PI. The W3C have written a very clear normative recommendation on associating style sheets with XML documents, including several more examples.
  • Only five named character entities are “safe”: &lt;, &gt;, &amp;, &quot; and &apos;. It should be noted however that &apos; is undefined in HTML 4 and unsupported in Internet Explorer. You will need to ensure that all other character references are numeric in nature. Lachy’s log explains character references in greater detail.
  • Anything within style or script tags is treated as XML, so you must wrap content that uses < or & in a Character Data (CDATA) section.
  • No elements are inferred, for example, tbody.
  • Scripting with document.write doesn’t work; you must use the Document Object Model (DOM) core methods. If you use Google’s AdSense on your website then you may need to apply a Google-approved workaround (if they haven’t fixed it already).
  • Cascading Style Sheets (CSS) are applied slightly differently. For example, to apply a background colour to the body element would require the html element to be styled also as the body element doesn’t cover the whole viewport when using XHTML.
  • HTML comments in scripts or styles, for example <script type="text/javascript"><!-- ... //--></script>, will result in a fatal error (and the page won’t display as a result) in an XHTML document served as application/xhtml+xml. This is due to the fact that in XML the last pair of hyphens causes a well-formedness error. The correct way to write script or style blocks for XHTML when served as application/xhtml+xml is in the format <script type="text/javascript"><![CDATA[ { // do something } //]]></script>. Of course, the easiest way to avoid all this is to put your scripts and styles into external files in the first place. Lachlan Hunt has a great in-depth article that goes into the whys and wherefores of HTML comments in scripts.
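
The character reference gotcha in particular lends itself to automation. As a rough sketch, assuming PHP with the mbstring extension, the hypothetical helper below rewrites HTML 4 named entities as numeric character references while leaving the five safe ones alone. Treat it as a starting point for cleaning up CMS content rather than a finished filter.

```php
<?php
/**
 * Replace named character references with numeric ones,
 * leaving the five XML-safe entities untouched.
 * (Hypothetical helper; needs the mbstring extension for mb_ord.)
 */
function to_numeric_references(string $markup): string
{
    $safe = ['lt', 'gt', 'amp', 'quot', 'apos'];
    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($safe): string {
            if (in_array($m[1], $safe, true)) {
                return $m[0]; // safe in XML, keep as written
            }
            // Decode the single entity to its UTF-8 character.
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
            if ($char === $m[0]) {
                return $m[0]; // unknown entity name: leave it for a human to fix
            }
            return '&#' . mb_ord($char, 'UTF-8') . ';';
        },
        $markup
    );
}

echo to_numeric_references('caf&eacute; &amp; cr&egrave;me &copy; 2005'), "\n";
// caf&#233; &amp; cr&#232;me &#169; 2005
```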

Doing the “Right Thing” ™

Rather than letting the server decide whether to serve a page as index.xhtml or index.html (note these would be separate files) based on the preferences sent in the accept header, content negotiation should be configured on the web server if you have access, or through scripting in your template. If you have an Apache server, I’ll send you off to read the manual now, as I want to concentrate on providing an overview of scripting a solution in this article.

Whether you use ASP.NET, PHP, etc., the following thought process is required:

  1. Lower the Quality of Source (qs) parameter for application/xhtml+xml on the server to account for possibly incomplete accept headers.
  2. Specifically test for the W3C validator as it doesn’t send a complete accept header.
  3. Parse the HTTP Accept header (HTTP_ACCEPT in the server variables) to find out the user agent’s preference.
  4. Send the preferred MIME type.
  5. Send a Vary header to inform proxy servers that content negotiation is taking place.
  6. Send the correct DOCTYPE.
  7. Send the appropriate opening html tag — this will be discussed further in next week’s article.
  8. If application/xhtml+xml is preferred, send the XML Declaration — do not include it with text/html as this will put IE for Windows into Quirks Mode.
  9. If application/xhtml+xml is preferred, send the XML Stylesheet declaration(s).
  10. If text/html is preferred, the closing of empty elements with “ />” needs to be changed to “>”.
  11. For text/html it is best to define the character encoding in the HTTP header rather than hard code <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> into your pages or templates. Again, the W3C have a very straightforward document explaining server configuration techniques.
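
Steps 1 to 5 above can be sketched in a few lines of PHP. This is only an illustration under my own assumptions: the function names q_value and negotiate_mime are invented for this article, the source quality of 0.9 for application/xhtml+xml is an arbitrary figure, and wildcard types such as */* are deliberately ignored so that a user agent sending only */* receives text/html.

```php
<?php
/**
 * Return the q value a client gives $type in $accept, or 0.0 if the type
 * is not listed. A type listed without a q parameter counts as 1.0.
 * (Hypothetical helper; other parameters before q are not handled.)
 */
function q_value(string $accept, string $type): float
{
    $pattern = '/(?:^|,)\s*' . preg_quote($type, '/')
             . '\s*(?:;\s*q\s*=\s*([01](?:\.\d{1,3})?))?\s*(?:[;,]|$)/i';
    if (!preg_match($pattern, $accept, $m)) {
        return 0.0;
    }
    return isset($m[1]) && $m[1] !== '' ? (float) $m[1] : 1.0;
}

/**
 * Choose between application/xhtml+xml and text/html.
 * $xhtmlQs is the server-side source quality from step 1 (assumed value).
 */
function negotiate_mime(string $accept, string $userAgent, float $xhtmlQs = 0.9): string
{
    // Step 2: the W3C validator does not send a complete Accept header.
    if (stripos($userAgent, 'W3C_Validator') !== false) {
        return 'application/xhtml+xml';
    }
    // Steps 1 and 3: weight the client's preference by our source quality.
    $xhtml = q_value($accept, 'application/xhtml+xml') * $xhtmlQs;
    $html  = q_value($accept, 'text/html');
    return ($xhtml > 0 && $xhtml >= $html) ? 'application/xhtml+xml' : 'text/html';
}

$firefox = 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,'
         . 'text/plain;q=0.8,image/png,*/*;q=0.5';
$mime = negotiate_mime($firefox, 'Mozilla/5.0');

// Steps 4 and 5: only meaningful when running under a web server.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header("Content-Type: $mime; charset=utf-8");
    header('Vary: Accept');
}
echo $mime, "\n"; // application/xhtml+xml
```

With the source quality at 0.9, a user agent that lists both types without any q values is served text/html, which is the cautious outcome step 1 is after.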

Code Example — PHP

The following code snippet is a small modification to Neil Crosby’s original work. Please test and refine it on a development server; I offer no warranties on it working straight off the bat. Write a simple include at the top of your web page (or template) to reference this external file.

<?php
$charset = "utf-8";
$mime = "text/html";

function fix_code($buffer) {
   return (str_replace(" />", ">", $buffer));
}

if(isset($_SERVER["HTTP_ACCEPT"]) &&
   stristr($_SERVER["HTTP_ACCEPT"],"application/xhtml+xml")) {
   # if there's a fractional Q value (for example q=0.9) for
   # "application/xhtml+xml" then also retrieve the Q value for "text/html"
   if(preg_match("/application\/xhtml\+xml;q=0(\.[1-9]+)/i",
                 $_SERVER["HTTP_ACCEPT"], $matches)) {
      $xhtml_q = $matches[1];
      if(preg_match("/text\/html;q=0(\.[1-9]+)/i",
                    $_SERVER["HTTP_ACCEPT"], $matches)) {
         $html_q = $matches[1];
         # if the Q value for XHTML is greater than or equal to that
         # for HTML then use the "application/xhtml+xml" mimetype
         if($xhtml_q >= $html_q) {
            $mime = "application/xhtml+xml";
         }
      }
   # if there was no Q value, then just use the
   # "application/xhtml+xml" mimetype
   } else {
      $mime = "application/xhtml+xml";
   }
}

# special check for the W3C_Validator
if (isset($_SERVER["HTTP_USER_AGENT"]) &&
    stristr($_SERVER["HTTP_USER_AGENT"],"W3C_Validator")) {
   $mime = "application/xhtml+xml";
}

# set the prolog_type according to the mime type which was determined
if($mime == "application/xhtml+xml") {
   $prolog_type = "<?xml version=\"1.0\" encoding=\"$charset\" ?>
<?xml-stylesheet type=\"text/css\" href=\"/styles/initial.css\" media=\"all\"?>
<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">
<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en-GB\" lang=\"en-GB\" dir=\"ltr\">\n";

} else {
   ob_start("fix_code");
   $prolog_type = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/strict.dtd\">
<html lang=\"en-GB\" dir=\"ltr\">\n";
}

# finally, output the mime type and prolog type
header("Content-Type: $mime;charset=$charset");
header("Vary: Accept");
print $prolog_type;
?>

Code Example — asp.NET

Via Roger Johansson’s article Content Negotiation, Justin Perkins provides an ASP.NET example (same disclaimer as before):

string http_accept = Request.ServerVariables["HTTP_ACCEPT"];
string http_user_agent = Request.ServerVariables["HTTP_USER_AGENT"];

// IndexOf returns 0 when the header starts with the MIME type, so test against -1
if (((http_accept != null) && (http_accept.ToLower().IndexOf("application/xhtml+xml") > -1)) || ((http_user_agent != null) && (http_user_agent.ToLower().IndexOf("w3c_validator") > -1))){
    Response.ContentType = "application/xhtml+xml";
    Response.Write("<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n");
}
else{
    Response.ContentType = "text/html";
}
Response.Charset = "iso-8859-1";
Response.AddHeader("Vary", "Accept");

Conclusion

Developing now with XHTML 1.0 allows for forward compatibility but many developers only deploy websites with XHTML 1.0 in backwards compatibility mode (using the text/html MIME type). In order to get the most benefit from XHTML 1.0, developers need to properly consider the issues surrounding the application/xhtml+xml MIME type and implement content negotiation accordingly. It is both worthwhile and achievable and once the solution has been written it is available for re-use within your quality development framework.

Next in “From the Top”

Next week’s article will (thankfully) be shorter (perhaps!) and will explain why the HTML element is actually required.

7 Responses to From the Top: MIME and Content Negotiation

  1. Torsten says:

    Nice article TSG. Couple of additional comments/points:

    It might also be worth mentioning that IE7 will not support application/xhtml+xml either.

    I’m not an expert on PHP, or regular expressions for that matter, but I believe the example PHP script is inappropriate. The regular expression:

    /application\/xhtml\+xml;q=0(\.[1-9]+)/i

    will only match a fractional ‘q’ value. As you said in your write up, the absence of a q value implies a q value of 1 which the example script also doesn’t cater for.

    Keep up the good work.

  2. Simon Jessey says:

    The following code snippet is a small modification to Neil Crosby’s original work.

    Actually, Neil’s work is based on an article that I wrote in October 2003 (which you cite in your references).

  3. Neil Crosby says:

    Thanks for the link through to my article Karl – it was a nice surprise to come home from work to.

    Torsten: I can’t recall my exact thinking behind using that regex in that way, but what I can say is that I’ve been using that script (or versions of it) for about two years now, and I haven’t had it blow up in my face yet.

    I’ve just had a quick play with Firefox, changing “network.http.accept.default” (via about:config) to contain various different q-values and I couldn’t get the script to do anything unexpected. However I am not convinced that you don’t (whoa, too many negatives there) have a point, so I’m going to go back and have a proper look at the script later in the week.

  4. Neil Crosby says:

    And yes, my article is based on Simon’s original work – it wouldn’t have been written if he hadn’t done his stuff first.

  5. Torsten says:

    Hi Neil,

    As I say, I’m not an expert, but I am rather surprised that it works.

    The accept header for Firefox 1.07 under GNU/Linux is as follows –

    text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5

    …your regular expression doesn’t match anything in that string. By my understanding your script would therefore serve text/html to Firefox when, according to its accept header anyway, it prefers application/xhtml+xml.

    I just had a look at Simon’s article and his reg exp seems much better. It is in fact very similar to the one I use myself:

    $htmlmimetype = '/text\/html(?:;q=([01](?:\.\d{1,3})?))?/i';
    $xhtmlmimetype = '/application\/xhtml\+xml(?:;q=([01](?:\.\d{1,3})?))?/i';

    …so I would say that wouldn’t I ;o)

  6. Karl Dawson says:

    Update to article:

    Clarified and expanded on the fact that the XML Declaration and XML-stylesheet reference (along with the DOCTYPE) are part of the XML Prolog – and how to write each part.

    Clarified that &apos; is unsupported in HTML 4.

    Reworked HTML comments in scripts.

    Added Point 11 in “Doing the Right Thing” section.

    My thanks to Lachlan Hunt, Jukka K. Korpela, Ben Millard and Torsten (above and forum member at Accessify.com)

  7. Torsten says:

    Neil, my apologies. I’ve just looked over your script again and I realise I was reading it all wrong. Your initial stristr search for the string application/xhtml+xml is sufficient, and the reg exp only comes into play if a q value of anything less than 1 is specified in the accept header. The logic threw me initially, but I see now that it would work perfectly well.