Fewer HTTP Verbs

Brett Slatkin suggests that we reduce the number of verbs in HTTP 2.0:

Practically speaking there are only two HTTP verbs: read and write, GET and POST. The semantics of the others (put, head, options, delete, trace, connect) are most commonly expressed in headers, URL parameters, and request bodies, not request methods. The unused verbs are a clear product of bike-shedding, an activity that specification writers love.

Interestingly, HTTP 1.0 only defined GET, POST, and HEAD back in 1996.

I could get behind the idea of just having GET, POST, and HEAD. In practice these tend to be the safest verbs to use. It would also put an end to having to talk about the semantics of PUT every six months.

Those who insist that all things must be REST or they are useless won’t like this. They could find a way to get over that.

TCP Over HTTP, A.K.A. HTTP 2.0

Skimming through the HTTP 2.0 draft RFC that was posted yesterday I’m left with the distinct feeling of implementing TCP on top of HTTP:

HTTP 2.0 Framing

I’m in the camp that believes that future versions of HTTP should continue to be a text-based protocol ( with compression support ).

Most weeks I look at several raw HTTP requests and responses. Yes, there will still be tools like cURL ( which I love ) to dig into HTTP transactions, so it isn’t the end of the world. Still, I am sad to see something that is currently fairly easy to follow turn into something significantly more complex.

reddit.com HTTP Response Headers

I found an old note to myself to look at the HTTP response headers for reddit.com. So I did this:

$ curl -v -s http://www.reddit.com/ > /dev/null
* About to connect() to www.reddit.com port 80 (#0)
* Trying…
* connected
* Connected to www.reddit.com (…) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
> Host: www.reddit.com
> Accept: */*
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=UTF-8
< Server: '; DROP TABLE servertypes; --
< Vary: accept-encoding
< Date: Wed, 22 May 2013 14:37:25 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< Connection: Transfer-Encoding
<
{ [data not shown]
* Connection #0 to host www.reddit.com left intact
* Closing connection #0

Fun Server entry in there. Reminded me of little Bobby Tables from xkcd.

I’m sure this has made the rounds in other places. Unfortunately my note didn’t indicate where I first saw this.

iOS6 Safari Caching POST Responses

With the release of iOS6, mobile Safari started caching POST responses. Mark Nottingham talks through the related RFCs to see how this lines up with the HTTP specs. Worth a read for the details; here is the conclusion:

even without the benefit of this context, they’re still clearly violating the spec; the original permission to cache in 2616 was contingent upon there being explicit freshness information (basically, Expires or Cache-Control: max-age).

So, it’s a bug. Unfortunately, it’s one that will make people trust caches even less, which is bad for the Web. Hopefully, they’ll do a quick fix before developers feel they need to work around this for the next five years.

Over the years I’ve run across a handful of services and applications that claim to be able to cache HTTP POST responses. In every case that turned out to be a bad decision.
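The RFC 2616 rule Mark references boils down to a simple check: a cache may only store a POST response when the response carries explicit freshness information. Here is a minimal sketch of that decision in Python ( the header names are real; the function itself is just an illustration and ignores less common directives like s-maxage ):

```python
def post_response_cacheable(headers):
    """Per RFC 2616, a POST response is only cacheable when it carries
    explicit freshness information ( Expires or Cache-Control: max-age )."""
    # Normalize header names for a case-insensitive lookup.
    h = {k.lower(): v for k, v in headers.items()}

    cache_control = h.get('cache-control', '').lower()
    if 'no-store' in cache_control:
        return False
    if 'max-age' in cache_control:
        return True
    return 'expires' in h

# A POST response with no freshness info must not be cached.
print(post_response_cacheable({'Content-Type': 'text/html'}))      # False
# Explicit max-age makes it cacheable.
print(post_response_cacheable({'Cache-Control': 'max-age=3600'}))  # True
```

What iOS6 Safari did was cache POST responses even when the first case applied, which is exactly the bug.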

Google’s plusone.js Doesn’t Support HTTP Compression

I was surprised to see that Google’s plusone.js doesn’t support HTTP compression. Here is a quick test with:
curl -v --compressed https://apis.google.com/js/plusone.js > /dev/null

Request Headers:

> GET /js/plusone.js HTTP/1.1
> User-Agent: curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8r zlib/1.2.3
> Host: apis.google.com
> Accept: */*
> Accept-Encoding: deflate, gzip

Response Headers:

< HTTP/1.1 200 OK
< Content-Type: text/javascript; charset=utf-8
< Expires: Fri, 18 Nov 2011 02:35:20 GMT
< Date: Fri, 18 Nov 2011 02:35:20 GMT
< Cache-Control: private, max-age=3600
< X-Content-Type-Options: nosniff
< X-Frame-Options: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< Server: GSE
< Transfer-Encoding: chunked

You'll notice there is no Content-Encoding: gzip header in the response.

We'll have to get Steve Souders to pester them about that.
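The check itself is easy to automate. This is a rough sketch, not a full client: it only looks at a dict of response headers, and assumes the request was sent with Accept-Encoding: deflate, gzip ( which is what curl --compressed does ):

```python
def served_compressed(response_headers):
    """Return True if the server compressed the response body.

    Assumes the request advertised 'Accept-Encoding: deflate, gzip'.
    """
    encoding = response_headers.get('Content-Encoding', '').lower()
    return encoding in ('gzip', 'deflate')

# Headers from the plusone.js response above: no Content-Encoding at all.
plusone = {
    'Content-Type': 'text/javascript; charset=utf-8',
    'Server': 'GSE',
    'Transfer-Encoding': 'chunked',
}
print(served_compressed(plusone))  # False
```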

Timing Details With cURL

Jon’s recent Find the Time to First Byte Using Curl post reminded me about the additional timing details that cURL can provide.

cURL supports formatted output for the details of the request ( see the cURL manpage for details, under “-w, --write-out <format>” ). For our purposes we’ll focus just on the timing details that are provided.

Step one: create a new file, curl-format.txt, and paste in:

            time_namelookup:  %{time_namelookup}\n
               time_connect:  %{time_connect}\n
            time_appconnect:  %{time_appconnect}\n
           time_pretransfer:  %{time_pretransfer}\n
              time_redirect:  %{time_redirect}\n
         time_starttransfer:  %{time_starttransfer}\n
                 time_total:  %{time_total}\n

Step two: make a request:

curl -w "@curl-format.txt" -o /dev/null -s http://wordpress.com/

What this does:

  • -w "@curl-format.txt" tells cURL to use our format file
  • -o /dev/null redirects the output of the request to /dev/null
  • -s tells cURL not to show a progress meter
  • http://wordpress.com/ is the URL we are requesting

And here is what you get back:

            time_namelookup:  0.001
               time_connect:  0.037
            time_appconnect:  0.000
           time_pretransfer:  0.037
              time_redirect:  0.000
         time_starttransfer:  0.092
                 time_total:  0.164

Jon was looking specifically at time to first byte, which is the time_starttransfer line. The other timing details include DNS lookup, TCP connect, pre-transfer negotiations, redirects (in this case there were none), and of course the total time.

The format file for this output provides a reasonable level of flexibility; for instance, you could make it CSV formatted for easy parsing. You might want to do that if you were running this as a cron job to track timing details of a specific URL.
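A CSV version of the format file might look like this ( the file name is arbitrary, say curl-format-csv.txt ):

```
%{time_namelookup},%{time_connect},%{time_appconnect},%{time_pretransfer},%{time_redirect},%{time_starttransfer},%{time_total}\n
```

Point cURL at it with -w "@curl-format-csv.txt" and append the output to a log file from your cron job, and you get one comma separated line per run.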

For details on the other information that cURL can provide using -w check out the cURL manpage.

ETag Survey

In the last few weeks I’ve had conversations with a couple of different people about their sites not using ETags correctly. This led me to wonder how many of the top sites on the web have a similar problem.

I downloaded the list of top U.S. sites from Quantcast and wrote a simple PHP script to see which of them included an ETag header in their HTTP response. I ran checks on the top 1,000 sites from that list. Of those, 136 included an ETag header in the response. Here are some of the interesting points from those 136:

– 9 of them indicated they were weak validators ( W/ )
– 2 had values of “”
– 1 had a completely empty value
– most used double quotes around the entire value, 6 didn’t use quotes at all
– 1 used a date value of “Sun, 17 Jul 2011 17:14:09 -0400”

Each response was checked for an ETag header; if it had one, another request was sent with the If-None-Match header, using the value of the ETag. Sites that are using ETags correctly will detect this and send back a “304 Not Modified” status. Ultimately, however, I settled on four possible results for sites using ETags:

WORKS ( ETAG_WORKS ) : does exactly what it should, returning “304 Not Modified” when appropriate
WORKS, sort of, web server farm with different ETag values ( ETAG_WORKS_FARM ) : only does the right thing if you happen to hit the same backend web server repeatedly, which you can’t really control
FAILS, the ETag value changes ( ETAG_FAILS_CHANGE ) : this is a failure where the site returns a different ETag value on every request, making it impossible to ever get a match
FAILS, ignored If-None-Match ( ETAG_FAILS_IGNORE ) : the site consistently returns the same ETag value, but always forces a re-download of the resource even when a correct If-None-Match value is provided

The server farm situation is an interesting one. To test for that, each time an ETag check request fails for a site I send another dozen requests to see if any of those succeed. That isn’t a perfect solution; all of the requests come from the same IP in a short period of time, so it is reasonable that some sites will send all of those requests to the same back end server in their farm. That said, this technique did get a few hits and was very easy to implement.
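The classification logic comes down to comparing the status code and ETag across the two requests. The actual survey script is PHP; this is just an illustrative sketch of the decision in Python:

```python
def classify_etag(first_etag, second_status, second_etag):
    """Classify ETag behavior from a conditional follow-up request.

    first_etag:    ETag from the initial response
    second_status: status code of the request sent with
                   If-None-Match: <first_etag>
    second_etag:   ETag from that second response
    """
    if second_status == 304:
        return 'ETAG_WORKS'
    if second_etag != first_etag:
        # Could also be ETAG_WORKS_FARM; the survey retries a dozen
        # more times before settling on that label.
        return 'ETAG_FAILS_CHANGE'
    return 'ETAG_FAILS_IGNORE'

print(classify_etag('"abc123"', 304, None))        # ETAG_WORKS
print(classify_etag('"abc123"', 200, '"def456"'))  # ETAG_FAILS_CHANGE
print(classify_etag('"abc123"', 200, '"abc123"'))  # ETAG_FAILS_IGNORE
```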

Here are the numbers for each of the possible categories, remember this is out of a total of 136:

ETAG_WORKS : 54 ( 39.7% )
ETAG_WORKS_FARM : 11 ( 8.1% )
ETAG_FAILS_CHANGE : 24 ( 17.6% )
ETAG_FAILS_IGNORE : 47 ( 34.6% )

Not exactly stellar results. More than half of the sites using ETags completely fail at using them correctly. To make matters even worse, the first site to use ETags correctly was ranked #62 on the Quantcast list. There were 8 sites ranked higher than that ( #5, #17, #35, #38, #48, #49, #51, and #55 ) that all failed. The good news in all of this: there is plenty of room for improvement.

The code (which is very basic) for running the survey is available at https://github.com/josephscott/etag-survey. That also contains the Quantcast list I used (downloaded 6 Sep 2011) and the results of the run (also dated 6 Sep 2011).

I need to look at the code for httparchive.org and see if this is something that could be easily added to their test suite. I’m hoping that the number of sites correctly using ETags will go up over time.

Performance Trends For Top Sites On The Web

Steve Souders posted an update on the HTTP performance trends for top sites, based on data gathered via http://httparchive.org/. Here are the bottom line numbers:

Here’s a recap of the performance indicators from Nov 15 2010 to Aug 15 2011 for the top ~13K websites:

  • total transfer size grew from 640 kB to 735 kB
  • requests per page increased from 69 to 76
  • sites with redirects went up from 58% to 64%
  • sites with errors is up from 14% to 25%
  • the use of Google Libraries API increased from 10% to 14%
  • Flash usage dropped from 47% to 45%
  • resources that are cached grew from 39% to 42%

I was surprised by the total transfer size increase. If you followed that trend on a weekly basis, every Friday for those 9 months you added another ~2.4 kB to the total transfer size of your site. Not much for any given week, but it adds up fast.

Cookies That Won’t Die: evercookie

User tracking on the web is an interesting field. There are projects like evercookie that provide insight into the different techniques that are available given today’s web client technologies.

The methods listed by evercookie that I thought were particularly curious are:

– Storing cookies in RGB values of auto-generated, force-cached PNGs using HTML5 Canvas tag to read pixels (cookies) back out

– Storing cookies in HTTP ETags

And of course the various methods used by evercookie all cause the original HTTP cookie to re-spawn.
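The ETag trick deserves a closer look, since it abuses exactly the conditional request mechanism from the ETag survey above: the server hands each new visitor a unique ETag, and the browser faithfully echoes it back in If-None-Match on every later visit. Here is a minimal server-side sketch of the idea ( hypothetical function, not evercookie’s actual code ):

```python
import uuid

def track_visitor(if_none_match, known_ids):
    """Use the ETag/If-None-Match mechanism as a tracking identifier.

    if_none_match: value of the If-None-Match request header, or None
    known_ids:     set of identifiers handed out so far
    Returns (visitor_id, status_code, etag_header).
    """
    if if_none_match in known_ids:
        # Returning visitor: the browser echoed our identifier back.
        return if_none_match, 304, if_none_match
    # New visitor: mint an identifier and send it as the ETag.
    visitor_id = uuid.uuid4().hex
    known_ids.add(visitor_id)
    return visitor_id, 200, visitor_id

known = set()
vid, status, etag = track_visitor(None, known)  # first visit
vid2, status2, _ = track_visitor(etag, known)   # return visit
print(vid == vid2, status, status2)  # True 200 304
```

No cookies involved, and clearing cookies doesn’t touch it; only clearing the browser cache does.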

The code is available at https://github.com/samyk/evercookie and is worth a look if you are interested in this sort of thing. The evercookie page has descriptions of how some of the techniques work, along with a sample piece of code to get started with.

HTTP Basic Auth with httplib2

While working on pressfs I ran into an issue with httplib2 using HTTP Basic Authentication.

Here is some example code:

import httplib2

if __name__ == '__main__' :
    httplib2.debuglevel = 1

    h = httplib2.Http()
    h.add_credentials( 'username', 'password' )

    resp, content = h.request( 'http://www.google.com/', 'GET' )

If you run this you’ll notice that httplib2 doesn’t actually include the HTTP Basic Auth details in the request, even though the code specifically asks it to do so. By design it will always make one request with no authentication details and then check to see if it gets an HTTP 401 Unauthorized response back. If and only if it gets a 401 response back will it then make a second request that includes the authentication data.

Bottom line, I didn’t want to make two HTTP requests when only one was needed (huge performance hit). There is no option to force the authentication header to be sent on the first request, so you have to do it manually:

import base64
import httplib2

if __name__ == '__main__' :
    httplib2.debuglevel = 1

    h = httplib2.Http()
    # strip() removes the trailing newline that encodestring appends
    auth = base64.encodestring( 'username' + ':' + 'password' ).strip()

    resp, content = h.request(
        'http://www.google.com/',
        'GET',
        headers = { 'Authorization' : 'Basic ' + auth }
    )

Watching the output of this you’ll see the authentication header in the request.

Someone else already opened an issue about this ( Issue 130 ); unfortunately, Joe Gregorio has indicated that he has no intention of ever fixing this :-(

On the up side, working around this deficiency only takes a little bit of extra code.
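One detail to watch when building the header by hand: base64.encodestring appends a trailing newline, which has no business inside an HTTP header. base64.b64encode avoids the problem entirely ( shown here with Python 3 style bytes ):

```python
import base64

# Build the exact value HTTP Basic Auth requires:
# base64 of "username:password", with no stray whitespace.
credentials = b'username:password'
auth = base64.b64encode(credentials).decode('ascii')
header = {'Authorization': 'Basic ' + auth}

print(header['Authorization'])  # Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```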