In the last few weeks I’ve had conversations with a couple of different people about their sites not using ETags correctly. This led me to wonder how many of the top sites on the web have a similar problem.
I downloaded the list of top U.S. sites from Quantcast and wrote a simple PHP script to see which of them included an ETag header in their HTTP response. I ran checks on the top 1,000 sites from that list. Of those 136 included an ETag header in the response. Here are some of the interesting points from those 136:
– 9 of them indicated they were weak validators ( W/ )
– 2 had values of “”
– 1 had a completely empty value
– most used double quotes around the entire value, 6 didn’t use quotes at all
– 1 used a date value of “Sun, 17 Jul 2011 17:14:09 -0400”
Each response was checked for an ETag header, if it had one then another request was sent with the If-None-Match header, using the value of the ETag. For sites that are using ETags correctly they will detect this and send back a “304 Not Modified” status. Ultimately however I settled on four possible results for sites using ETags:
– WORKS ( ETAG_WORKS ) : does exactly what it should, returning “304 Not Modified” when appropriate
– WORKS, sort of, web server farm with different ETag values ( ETAG_WORKS_FARM ) : only does the right thing if you happen to hit the same backend web server repeatedly, which you can’t really control
– FAILS, the ETag value changes ( ETAG_FAILS_CHANGE ) : this is a failure where the site returns a different ETag value on every request, making it impossible to ever get a match
– FAILS, ignored If-None-Match ( ETAG_FAILS_IGNORE ) : the site consistently returns the same ETag value, but always forces a re-download of the resource even when a correct If-None-Match value is provided
The server farm situation is an interesting one. To test for that each time an ETag check request fails for a site I send another dozen requests to see if any of those succeed. That isn’t a perfect solution, all of the requests come from the same IP in a short period of time, so it is reasonable that some sites will send all of those requests to the same back end server in their farm. That said, this technique did get a few hits and was very easy to implement.
Here are the numbers for each of the possible categories, remember this is out of a total of 136:
– ETAG_WORKS : 54 ( 39.7% )
– ETAG_WORKS_FARM : 11 ( 8% )
– ETAG_FAILS_CHANGE : 24 ( 17.6% )
– ETAG_FAILS_IGNORE : 47 ( 34.5% )
Not exactly stellar results. More than half of the sites using ETags completely fail at using them correctly. To make matters even worse, the first site to use ETags correctly was ranked number #62 on the Quantcast list. There were 8 other sites ranked higher than that ( #5, #17, #35, #38, #48, #49, #51, and #55 ) that all failed. The good news in all of this: there is plenty of room for improvement.
The code (which is very basic) for running the survey is available at https://github.com/josephscott/etag-survey. That also contains the Quantcast list I used (downloaded 6 Sep 2011) and the results of the run (also dated 6 Sep 2011).
I need to look at the code for httparchive.org and see if this is something that could be easily added to their test suite. I’m hoping that the number of sites correctly using ETags will go up over time.