Aspen

A Python web framework that makes the most of the filesystem.
Simplates are the main attraction.

Unicode

According to Guido, Python “has excellent support for Unicode, and will keep getting better.” The same is true of ... oh no! Snowman is being attacked by Comet! But, look! Linear Buck, hero from beyond the Basic Multilingual Plane, is coming to his rescue! Hooray for Linear Buck!



𐂂

In designing Aspen’s Unicode handling, the following priorities have been in view:

  1. Aspen should handle Unicode securely.
  2. Aspen should observe standards.
  3. Aspen should interoperate with consumer-grade web browsers (Internet Explorer, etc.) where they diverge from standards.
  4. Aspen should enable access to raw bytestrings for advanced use cases.

This document describes Aspen’s approach to Unicode security, and then describes Aspen’s algorithms for decoding Requests and encoding Responses, with reference to the de jure standards, de facto browser behavior, and advanced use cases.

Security

The canonical reference for security issues related to Unicode is Unicode Technical Report #36 (TR36), “Unicode Security Considerations,” from the Unicode Consortium.

Most of the discussion revolves around spoofing websites by registering visually confusing domain names such as paypаl.com, where the second ‘a’ is actually from the Cyrillic and not the Latin alphabet. That’s a problem for browser vendors to solve, and for you to take advantage of, if you’re a Bad Guy like Comet (just watch out for Linear Buck!).
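To see why such a name fools the eye but not the computer, here is a minimal sketch using only Python's standard unicodedata module. The helper non_ascii_chars is a hypothetical name invented for this illustration and is not part of Aspen.

```python
import unicodedata

latin = "paypal.com"
# The second "a" is written as an escape so the difference is visible:
# U+0430 is CYRILLIC SMALL LETTER A.
spoof = "payp\u0430l.com"


def non_ascii_chars(name):
    """Return (index, code point, Unicode name) for each non-ASCII character.

    Hypothetical helper for illustration only.
    """
    return [
        (i, f"U+{ord(c):04X}", unicodedata.name(c, "UNKNOWN"))
        for i, c in enumerate(name)
        if ord(c) > 127
    ]


# The two strings render almost identically but do not compare equal.
print(latin == spoof)
print(non_ascii_chars(spoof))
```

A check like this only reveals that a character falls outside ASCII. Deciding whether a mixed-script name is actually malicious is the harder problem that TR36 discusses.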

What Snowman has to worry about are the “Non-Visual Security Issues.” The basic idea is that any algorithm that mutates character data is a chance for Comet to game that algorithm. If Comet can sneak in an extra path separator or remove a quotation mark, then she may be able to traverse Snowman’s filesystem or inject some extra SQL. What is Snowman to do?

Validate late!

After validating your inputs, make sure that you don’t transcode the data again before using it. Here’s a simple illustration:

  1. Comet sends a request for /..%2Fetc%2Fpassword.
  2. Snowman conscientiously checks the request for path separators (“/”).
  3. Snowman doesn’t find any path separators and lets the request through.
  4. Later on in his program, Snowman decodes the percent-encoding in this value. Guess what %2F decodes to.
  5. Finally, Snowman runs open("/../etc/password").read() and returns the result to Comet.
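The steps above can be sketched in a few lines of Python. This is an illustration of the ordering problem only, not Aspen's code: the function names validate_early and validate_late are made up for this example, and it assumes the value being checked is a single path segment with the leading slash already removed.

```python
from urllib.parse import unquote


def validate_early(segment):
    """Snowman's mistake: check first, decode afterwards."""
    if "/" in segment:
        raise ValueError("path separator in segment")
    # Transcoding AFTER validation: %2F turns into "/" here,
    # and nobody looks at the value again.
    return unquote(segment)


def validate_late(segment):
    """Decode first, then check the value that will actually be used."""
    decoded = unquote(segment)
    if "/" in decoded:
        raise ValueError("path separator in segment")
    return decoded


segment = "..%2Fetc%2Fpassword"

# The early check passes, and the decoded result contains separators.
print(validate_early(segment))

# The late check sees the decoded value and rejects it.
try:
    validate_late(segment)
except ValueError as exc:
    print("rejected:", exc)
```

The only difference between the two functions is the order of the decode and the check, which is the whole point of validating late.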

Obviously this is a contrived example, but it makes the point. TR36 mentions seven algorithms in Unicode and goes into the details of how to game them.

The good news is that Python handles almost all of these for us, and Aspen handles the rest. If Aspen is given an HTTP Request that doesn’t decode cleanly according to the algorithm below, then it returns a 400 Bad Request.

Decoding Requests

Here are the parts of the Request with notes on how Aspen decodes them:

request
    line
        method           subset of ASCII, per spec
        uri
            path         subset of ASCII, per spec (but WSGI servers do
                          things)
            querystring  subset of ASCII, per spec (but IE sends raw
                          UTF-8)
        version
    headers              ???
    body                 ???

If a browser or other program sends anything else to Aspen, it’ll get 400 Bad Request.
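As a rough sketch of the "decode strictly or answer 400" idea, the following treats the whole request line as ASCII and rejects anything else. It is not Aspen's actual algorithm: BadRequest and decode_request_line are hypothetical names, and a strict version like this would also reject the raw UTF-8 querystrings that the notes above say IE sends.

```python
class BadRequest(Exception):
    """Hypothetical exception standing in for a 400 Bad Request response."""

    status = 400


def decode_request_line(raw):
    """Decode a raw request line (bytes) into its parts, or raise BadRequest.

    Assumes every part must be ASCII, which is stricter than what
    real-world browsers always send.
    """
    try:
        line = raw.decode("ascii")
    except UnicodeDecodeError:
        raise BadRequest("request line is not ASCII")

    parts = line.split(" ")
    if len(parts) != 3:
        raise BadRequest("malformed request line")

    method, uri, version = parts
    # Split the URI into path and querystring at the first "?".
    path, _, querystring = uri.partition("?")
    return method, path, querystring, version


print(decode_request_line(b"GET /foo?bar=1 HTTP/1.1"))

try:
    decode_request_line("GET /snowman?q=\u2603 HTTP/1.1".encode("utf-8"))
except BadRequest as exc:
    print(exc.status, exc)
```

Failing closed like this keeps any later code from ever seeing a value that did not decode cleanly.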

Encoding Responses

The Aspen Response object takes body as a bytestring or iterable of bytestrings. If you set response.charset in a template resource then that will be added to Content-Type if your mimetype is of major type 'text'. There is no default charset for static resources, which means HTTP-conformant clients will try ISO-8859-1, but most will probably try to guess based on how the bytes smell.
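The charset rule described above might look something like the sketch below. The function content_type_header is a hypothetical name for illustration and does not reflect how Aspen's Response object is actually implemented. It only shows the rule that a charset is appended when the mimetype's major type is 'text'.

```python
def content_type_header(mimetype, charset=None):
    """Build a Content-Type value, adding charset only for text/* types.

    Hypothetical helper; not part of Aspen's API.
    """
    major_type = mimetype.split("/", 1)[0]
    if charset is not None and major_type == "text":
        return f"{mimetype}; charset={charset}"
    return mimetype


# The body is bytes, so the application encodes it with the same
# charset it advertises in the header.
charset = "UTF-8"
body = "Hooray for Linear Buck! \U00010082".encode(charset)

print(content_type_header("text/html", charset))
print(content_type_header("image/png", charset))
print(content_type_header("text/plain"))
print(len(body))
```

With no charset given, the header is left bare and the client is left to choose or guess an encoding, as described above for static resources.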
