Have you ever thought about how your site’s performance compares to the web as a whole? Or maybe you’re curious how popular a particular web feature is. How much is too much JavaScript? The HTTP Archive has been keeping track of how the web is built since 2010. It enables you to find answers to questions about the state of the web past and present.
Paul Calvano explores how the HTTP Archive works, how people are using this dataset, and some ways that Akamai has leveraged data within the HTTP Archive to help its customers.
Paul Calvano
@paulcalvano
Akamai
About Me
● Web Performance Architect @ Akamai
● HTTP Archive / BigQuery Addict :)
● Working on #WebPerf since 2000
● https://paulcalvano.com
● @paulcalvano on Twitter
How it Works
● Alexa’s top 500,000 websites
○ Home pages
○ Desktop and emulated mobile
○ Increasing to 1,000,000 soon!
● Powered by WebPageTest
○ Records HAR trace
○ Executes custom metrics
○ Records Lighthouse audits
● httparchive.org
○ Trends and stats
○ Discussion forum
● BigQuery and Cloud Storage
○ Queryable database
○ Raw HARs
A Peek Inside the Databerg…
Dataset | Description | Size | Rows
summary_pages | Summary of all desktop and mobile pages | ~340 MB | Desktop: ~460K; Mobile: ~450K
summary_requests | Summary of all HTTP requests for desktop and mobile | ~45 GB | Desktop: ~48 million; Mobile: ~44 million
pages | JSON-encoded parent document HAR data | ~5 GB | Desktop: ~460K; Mobile: ~450K
requests | JSON-encoded subresource HAR data | ~290 GB | Desktop: ~48 million; Mobile: ~44 million
response_bodies | JSON-encoded response bodies for textual subresources | ~915 GB | Desktop: ~18 million; Mobile: ~14 million
lighthouse | JSON-encoded Lighthouse report (mobile only) | ~140 GB | Mobile: ~450K
* Row and size stats are based on the 5/15/18 run
HTTP Archive referenced in research papers
goo.gl/kxgzM1
● ". . . In this article we utilize the httparchive.org [9] publicly available dataset of captured web performance metrics . . ."
○ Desktop and mobile web page comparison: characteristics, trends, and implications. IEEE Communications Magazine (Volume: 52, Issue: 9, September 2014)
● ". . . Recent stats from httparchive.org show that the top 300K URLs in the world need on average 38(!) TCP connections to display the site . . ."
○ HTTP2 explained. Computer Communication Review 44.3 (2014): 120-128.
● ". . . We make extensive use of the [...] data available at HTTP Archive to expose the characteristics of 3rd Party assets embedded into the top 16,000 Alexa webpages . . ."
○ Are 3rd Parties Slowing Down the Mobile Web? Proceedings of the Eighth Wireless of the Students, by the Students, and for the Students Workshop. ACM, 2016.
Last Mile Acceleration (LMA)
● Akamai feature to gzip compress content at the CDN edge
○ Helps out when origins do not compress certain resources
● Compression is based on HTTP Content Type
● Old defaults were not sufficient and usually required updating…
● We updated this a few years ago, using HTTP Archive data
Querying the HTTP Archive for Compression Stats
bit.ly/2y1fKNI
-- Per MIME type: total responses and how many were gzip- or deflate-encoded
SELECT mimeType, COUNT(*) total,
  SUM(IF(resp_content_encoding = "gzip",1,0)) gzip,
  SUM(IF(resp_content_encoding = "deflate",1,0)) deflate,
  SUM(IF(resp_content_encoding IN("gzip","deflate"),0,1)) NoCompression,
  ROUND(
    SUM(
      IF(resp_content_encoding IN("gzip", "deflate"),1,0)
    ) / COUNT(*),2) CompressedPercentage
-- Table names that start with a digit must be quoted with backticks
FROM `httparchive.summary_requests.2018_05_15_desktop`
GROUP BY mimeType
HAVING total > 1000
ORDER BY gzip DESC
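To make the aggregation easier to follow without BigQuery access, here is a minimal Python sketch of the same grouping logic. The sample rows, the function name `compression_stats`, and the `threshold` parameter are illustrative assumptions; they are not HTTP Archive data or part of the original query.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for summary_requests:
# (mimeType, resp_content_encoding)
SAMPLE_REQUESTS = [
    ("text/html", "gzip"),
    ("text/html", "gzip"),
    ("text/html", ""),
    ("application/json", "deflate"),
    ("application/json", ""),
    ("font/ttf", ""),
]


def compression_stats(rows, threshold=0):
    """Group rows by MIME type and count gzip/deflate/uncompressed responses."""
    counts = defaultdict(lambda: {"total": 0, "gzip": 0, "deflate": 0})
    for mime_type, encoding in rows:
        bucket = counts[mime_type]
        bucket["total"] += 1
        if encoding in ("gzip", "deflate"):
            bucket[encoding] += 1

    report = []
    for mime_type, c in counts.items():
        if c["total"] <= threshold:  # mirrors HAVING total > threshold
            continue
        compressed = c["gzip"] + c["deflate"]
        report.append({
            "mimeType": mime_type,
            "total": c["total"],
            "gzip": c["gzip"],
            "deflate": c["deflate"],
            "no_compression": c["total"] - compressed,
            "compressed_pct": round(compressed / c["total"], 2),
        })
    # mirrors ORDER BY gzip DESC
    return sorted(report, key=lambda row: row["gzip"], reverse=True)


for row in compression_stats(SAMPLE_REQUESTS):
    print(row)
```

MIME types with a large `no_compression` count and a low `compressed_pct` are the candidates worth a closer look.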
Some New LMA Defaults
● text/javascript
● font/ttf
● application/javascript
● text/xml
● application/json
● application/xml
● ...
Many Content-Types that did not match the original defaults!
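As a rough illustration of compression keyed on Content-Type, the sketch below gzips a response body only when its type is on an allow list. This is not Akamai's implementation: the function names are hypothetical, and the allow list contains only the types named on this slide.

```python
import gzip

# Assumption: only the content types listed on the slide; a real list would be longer.
COMPRESSIBLE_TYPES = {
    "text/javascript",
    "font/ttf",
    "application/javascript",
    "text/xml",
    "application/json",
    "application/xml",
}


def normalize_content_type(header_value):
    """Drop parameters such as '; charset=utf-8' and lowercase the media type."""
    return header_value.split(";", 1)[0].strip().lower()


def maybe_compress(body, content_type, content_encoding=""):
    """Return (body, encoding), gzipping only uncompressed, allow-listed types."""
    if content_encoding:
        # The origin already encoded the response; leave it untouched.
        return body, content_encoding
    if normalize_content_type(content_type) not in COMPRESSIBLE_TYPES:
        return body, ""
    return gzip.compress(body), "gzip"


payload = b'{"status": "ok"}' * 100
compressed, encoding = maybe_compress(payload, "application/json; charset=utf-8")
print(encoding, len(payload), "->", len(compressed))
```

Normalizing the header first matters because a lookup on the raw value would miss types that carry a charset parameter.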
What About Brotli?
● New compression algorithm developed by Google researchers
● 5% - 25% Reduction over Gzip Compression
● Supported by most browsers
● Let’s extend the previous query to include Brotli compression
Compression Stats - gzip and brotli
bit.ly/2Mcawl1
-- Same breakdown as before, with Brotli ("br") counted separately
SELECT mimeType, COUNT(*) total,
  SUM(IF(resp_content_encoding = "gzip",1,0)) gzip,
  SUM(IF(resp_content_encoding = "br",1,0)) brotli,
  SUM(IF(resp_content_encoding = "deflate",1,0)) deflate,
  SUM(IF(resp_content_encoding IN("gzip","deflate","br"),0,1)) NoCompression,
  ROUND(
    SUM(
      IF(resp_content_encoding IN("gzip", "deflate", "br"),1,0)
    ) / COUNT(*),2) CompressedPercentage,
  ROUND(
    SUM(
      IF(resp_content_encoding = "br",1,0)
    ) / COUNT(*),2) BrotliCompressedPercentage
-- Table names that start with a digit must be quoted with backticks
FROM `httparchive.summary_requests.2018_05_15_desktop`
GROUP BY mimeType
HAVING total > 1000
ORDER BY brotli DESC
Examining Compression By Content Type - With Brotli
Brotli use has been growing, but where is it most prevalent?
Brotli Usage: Mostly JavaScript, CSS and HTML Resources
When we exclude Google and Facebook content, the bulk of Brotli encoded content is JS and CSS
Compression Level = Overhead
Data on compression speeds from quixdb.github.io/squash-benchmark/#results
Most byte savings are obtained by using the highest compression level.
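The trade-off between compression level and CPU time can be measured directly. The sketch below uses Python's standard `zlib` module as a stand-in, because Brotli is not in the standard library; the levels, payload, and function name are illustrative assumptions, and the numbers it prints are not the squash-benchmark results.

```python
import time
import zlib

# Hypothetical payload: repetitive JavaScript-like text.
SAMPLE_PAYLOAD = b"function add(a, b) { return a + b; }\n" * 5000


def compare_levels(payload, levels=(1, 6, 9)):
    """Compress the payload at each zlib level, recording output size and time."""
    results = []
    for level in levels:
        started = time.perf_counter()
        compressed = zlib.compress(payload, level)
        elapsed = time.perf_counter() - started
        # Sanity check: compression must be lossless.
        assert zlib.decompress(compressed) == payload
        results.append({
            "level": level,
            "bytes": len(compressed),
            "seconds": elapsed,
        })
    return results


for result in compare_levels(SAMPLE_PAYLOAD):
    print(result)
```

Running the same comparison on real CSS and JS files gives a more realistic picture than a synthetic payload.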
Resource Optimizer: Automated Brotli Compression at the Edge
● Automatically compresses CSS and JS with Brotli
● Resources are compressed offline and then cached
● Brotli compression level 11, without the overhead!
Case Study #2
Server Technologies
1. Akamai Varnish Connector
2. Security Vulnerability Research
How Many Akamai Customers Use Varnish at the Origin?
● HTTP Archive data helped determine which of Akamai’s customers were using Varnish at the origin.
● Akamai Product Management was able to discuss desirable functionality with existing customers.
Investigating Security Threats - 0 Day Vulnerability
● HTTP Archive
○ Investigate other sites that contain similar characteristics
■ Server, Via, Url Regex Patterns, Other Headers
○ Export a list of sites that appear vulnerable
○ Cross Reference with Akamai Account Data
■ Notify 24x7 Security Contacts
■ Help customers proactively protect themselves
● No CVE or Historical Attack Patterns
○ Targeted attack observed and mitigated (Kona Managed Security)
○ Akamai WAF rules prepared
○ Vendor notified
● Example: 0 Day Vulnerability on an Ecommerce App Server
Another 0-Day: How Can We Identify Sites Running Drupal
Drupal announced that a “highly critical” security release would be happening within a week.
The expectation was that this would give sites time to prepare for an emergency security patch before 0-day exploits began…
https://www.drupal.org/sa-core-2018-002
Identifying Sites Running Drupal with HTTP Archive
First Try:
● WHERE url LIKE "%drupal.js%"
● Found 97 sites using Akamai and Drupal
Second Try:
● Expires header = 'Sun, 19 Nov 1978 05:00:00 GMT'
○ https://www.ostraining.com/blog/drupal/5-ways-drupal/
● Found more sites using Akamai and Drupal
○ ~26K total requests with this expires header!
What Did We Do With This Data?
● Customer Outreach (Are you aware and prepared to patch?)
● Prepared WAF rules for those not able to apply patches immediately.
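The two detection heuristics above can be combined into a single check. This is a minimal sketch over hypothetical request records; the field names `page`, `url`, and `expires` and the `example` domains are assumptions for illustration, not HTTP Archive column names or real sites.

```python
# The Expires value quoted on the slide as a Drupal fingerprint.
DRUPAL_EXPIRES = "Sun, 19 Nov 1978 05:00:00 GMT"

# Hypothetical request records.
SAMPLE_REQUESTS = [
    {"page": "https://a.example/", "url": "https://a.example/misc/drupal.js",
     "expires": ""},
    {"page": "https://b.example/", "url": "https://b.example/",
     "expires": DRUPAL_EXPIRES},
    {"page": "https://c.example/", "url": "https://c.example/app.js",
     "expires": "Thu, 01 Jan 2026 00:00:00 GMT"},
]


def looks_like_drupal(request):
    """First try: URL contains drupal.js. Second try: the fixed Expires header."""
    return ("drupal.js" in request["url"]
            or request.get("expires") == DRUPAL_EXPIRES)


def drupal_sites(requests):
    """Return the distinct pages with at least one Drupal-looking request."""
    return sorted({r["page"] for r in requests if looks_like_drupal(r)})


print(drupal_sites(SAMPLE_REQUESTS))
```

The resulting page list is what would then be cross-referenced with account data for outreach.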
Investigating Security Threats - CryptoCurrency Miners
● Do any of my customers have cryptocurrency miners?
○ Are they aware?
○ Do they know how it got there?
https://discuss.httparchive.org/t/the-performance-impact-of-cryptocurrency-mining-on-the-web/1126/
Now Easier with Wappalyzer!
● Wappalyzer is a cross-platform utility that uncovers technologies used on websites.
● Integrated into HTTP Archive since April 2018
https://discuss.httparchive.org/t/using-wappalyzer-to-analyze-cpu-times-across-js-frameworks/1336/
Investigating Security Threats - What Domains are Serving Malware?
● Akamai’s ETP Service
○ Millions of malicious domains and IP addresses.
○ Are any of my customers serving 3rd party content from known malware hosts?
● HTTP Archive parsed against the ETP DB
○ Notified accounts if they served content to the HTTP Archive from known malware hosts
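Cross-referencing request hostnames against a list of known-bad hosts is essentially a set lookup. The sketch below shows the idea with a hypothetical blocklist and hypothetical request records; it does not reflect the ETP database format or any real malware hosts.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist standing in for a threat-intelligence feed.
KNOWN_BAD_HOSTS = {"malware.example", "bad-cdn.example"}

# Hypothetical requests: the page that was crawled and a resource it loaded.
SAMPLE_REQUESTS = [
    {"page": "https://shop.example/", "url": "https://shop.example/app.js"},
    {"page": "https://shop.example/", "url": "https://bad-cdn.example/ad.js"},
    {"page": "https://news.example/", "url": "https://cdn.example/lib.js"},
]


def flagged_pages(requests, bad_hosts):
    """Map each page to the set of known-bad hosts it loaded content from."""
    flagged = {}
    for request in requests:
        host = urlsplit(request["url"]).hostname
        if host in bad_hosts:
            flagged.setdefault(request["page"], set()).add(host)
    return flagged


print(flagged_pages(SAMPLE_REQUESTS, KNOWN_BAD_HOSTS))
```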
Case Study #3
Third Party Research
1. How Do 3rd Parties Influence Render Time?
2. Researching a specific 3rd party
Do Third Parties Impact Load Time?
https://discuss.httparchive.org/t/analyzing-3rd-party-performance-via-http-archive-crux/1359
● CrUX = Chrome User Experience Report
● JOINed with HTTP Archive data for Alexa ranks
● Load times are faster for sites with less third-party content.
Which 3rd Party Content Types Load Before Render Start?
bit.ly/2sN2TJZ
-- renderStart is divided by 1000 to match the units of startedDateTime
SELECT mimeType,
  COUNT(*) num_requests,
  SUM(
    IF(req.startedDateTime < (pages.startedDateTime + (renderStart/1000) ),1,0)
  ) BeforeRenderStart,
  SUM(
    IF(req.startedDateTime < (pages.startedDateTime + (renderStart/1000) ),0,1)
  ) AfterRenderStart
FROM `httparchive.summary_requests.2017_09_01_desktop` req
JOIN (
  SELECT rank, NET.HOST(url) hostname, url, pageid, startedDateTime, renderStart
  FROM `httparchive.summary_pages.2017_09_01_desktop`
) pages ON pages.pageid = req.pageid
-- Third party = request host differs from the page host; top 100K ranked sites only
WHERE NET.HOST(req.url) != pages.hostname AND rank > 0 AND rank < 100000
GROUP BY mimeType
HAVING num_requests > 1000
ORDER BY BeforeRenderStart DESC
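The core of this query is two tests: whether a request is third party, and whether it started before the page's render start. Here is a minimal Python sketch of both, assuming (as the query's division by 1000 implies) that `startedDateTime` is in seconds and `renderStart` is in milliseconds. The sample page, requests, and function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical page and request records using the summary tables' column names.
PAGE = {"url": "https://www.example.com/", "startedDateTime": 1504224000,
        "renderStart": 1500}
REQUESTS = [
    {"url": "https://www.example.com/app.js",
     "mimeType": "application/javascript", "startedDateTime": 1504224000},
    {"url": "https://cdn.example.net/tag.js",
     "mimeType": "application/javascript", "startedDateTime": 1504224001},
    {"url": "https://img.example.org/pixel.gif",
     "mimeType": "image/gif", "startedDateTime": 1504224003},
]


def is_third_party(request_url, page_url):
    """Same test as the query: the request host differs from the page host."""
    return urlsplit(request_url).hostname != urlsplit(page_url).hostname


def started_before_render(request, page):
    """Compare in seconds: page start plus renderStart converted from ms."""
    render_time = page["startedDateTime"] + page["renderStart"] / 1000
    return request["startedDateTime"] < render_time


def count_by_mime_type(page, requests):
    """Count third-party requests per MIME type, before and after render start."""
    before, after = Counter(), Counter()
    for request in requests:
        if not is_third_party(request["url"], page["url"]):
            continue
        target = before if started_before_render(request, page) else after
        target[request["mimeType"]] += 1
    return before, after


before, after = count_by_mime_type(PAGE, REQUESTS)
print("before:", dict(before))
print("after:", dict(after))
```

Note that comparing hostnames treats subdomains of the same site as third parties, which is the same simplification the query makes.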
What 3rd Party Content Loads Before RenderStart?
discuss.httparchive.org/t/which-3rd-party-content-loads-before-render-start/1084
Which 3rd Parties Load Before Render Start?
bit.ly/2glVquX
-- Same comparison as before, now broken down by third-party hostname
SELECT NET.HOST(req.url) thirdparty,
  mimeType,
  COUNT(*) num_requests,
  SUM(
    IF(req.startedDateTime < (pages.startedDateTime + (renderStart/1000) ),1,0)
  ) BeforeRenderStart,
  SUM(
    IF(req.startedDateTime < (pages.startedDateTime + (renderStart/1000) ),0,1)
  ) AfterRenderStart
FROM `httparchive.summary_requests.2017_09_01_desktop` req
JOIN (
  SELECT rank, NET.HOST(url) hostname, url, pageid, startedDateTime, renderStart
  FROM `httparchive.summary_pages.2017_09_01_desktop`
) pages ON pages.pageid = req.pageid
WHERE NET.HOST(req.url) != pages.hostname AND rank > 0 AND rank < 100000
GROUP BY thirdparty, mimeType
HAVING num_requests > 100
ORDER BY BeforeRenderStart DESC
Which Websites Load <3rd Party> Before vs After Render Time?
bit.ly/2LuD1Jq
-- Per site: requests to one third party (here ensighten.com) before and after render start
SELECT rank, site,
  SUM(
    IF(req.startedDateTime < (pages.startedDateTime + (renderStart/1000) ),1,0)
  ) BeforeRenderStart,
  SUM(
    IF(req.startedDateTime < (pages.startedDateTime + (renderStart/1000) ),0,1)
  ) AfterRenderStart
FROM `httparchive.summary_requests.2017_09_01_desktop` req
INNER JOIN (
  SELECT rank, NET.HOST(url) site, pageid, startedDateTime, renderStart
  FROM `httparchive.summary_pages.2017_09_01_desktop`
) pages ON pages.pageid = req.pageid
WHERE NET.HOST(req.url) LIKE "%ensighten.com%" AND rank > 0
GROUP BY rank, site
-- Keep only sites that load at least one of these resources after render start
HAVING AfterRenderStart>0
ORDER BY rank ASC
Which Sites Load a Specific 3rd Party Before RenderStart?
● This query outputs a summary containing the following information:
○ Which sites use the 3rd party?
○ How many resources are served by it?
○ How many of them are loaded before/after the page renders?
● Results can help sites learn from each other’s best practices
○ Even across industries!!!
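For readers who want to see the shape of that summary without running BigQuery, the sketch below builds the same per-site rows in Python. All records are hypothetical, the vendor hostname `tagvendor.example` is a placeholder for the third party being researched, and `startedDateTime` is assumed to be in seconds with `renderStart` in milliseconds.

```python
from urllib.parse import urlsplit

# Hypothetical pages and requests using the summary tables' column names.
PAGES = [
    {"pageid": 1, "rank": 10, "url": "https://shop.example.com/",
     "startedDateTime": 1000, "renderStart": 2000},
    {"pageid": 2, "rank": 5, "url": "https://news.example.org/",
     "startedDateTime": 2000, "renderStart": 1000},
]
REQUESTS = [
    {"pageid": 1, "url": "https://cdn.tagvendor.example/a.js", "startedDateTime": 1001},
    {"pageid": 1, "url": "https://cdn.tagvendor.example/b.js", "startedDateTime": 1003},
    {"pageid": 2, "url": "https://cdn.tagvendor.example/a.js", "startedDateTime": 2000},
    {"pageid": 2, "url": "https://other.example/x.js", "startedDateTime": 2005},
]


def third_party_summary(pages, requests, vendor):
    """Per site, count the vendor's requests before and after render start."""
    pages_by_id = {page["pageid"]: page for page in pages}
    rows = {}
    for request in requests:
        host = urlsplit(request["url"]).hostname or ""
        if vendor not in host:  # mirrors LIKE "%vendor%"
            continue
        page = pages_by_id[request["pageid"]]
        row = rows.setdefault(page["pageid"], {
            "rank": page["rank"],
            "site": urlsplit(page["url"]).hostname,
            "before": 0,
            "after": 0,
        })
        render_time = page["startedDateTime"] + page["renderStart"] / 1000
        key = "before" if request["startedDateTime"] < render_time else "after"
        row[key] += 1
    # mirrors HAVING AfterRenderStart > 0 and ORDER BY rank ASC
    kept = [row for row in rows.values() if row["after"] > 0]
    return sorted(kept, key=lambda row: row["rank"])


for row in third_party_summary(PAGES, REQUESTS, "tagvendor.example"):
    print(row)
```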