New Insights on the Ubiquity of Third-Party Web Tracking

NYU Center for Data Science
Jul 11, 2019

Analysis of 3.5 billion web pages identifies trackers on 90% of privacy-noncritical sites and 60% of privacy-critical sites

To some extent, Internet users expect to be tracked online. Anyone who uses social media cedes some degree of privacy when they accept Terms and Conditions agreements. But new research from Sebastian Schelter, CDS Moore-Sloan Data Science Fellow, and Jérôme Kunegis, University of Namur, Belgium, reveals the full extent to which third-party tracking pervades the Internet. The researchers’ work “leverages the largest empirical web tracking dataset collected so far,” which allows them “to derive a ranking measure for tracker occurrences based on aggregated network centrality rather than simple domain counts.”
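
The difference between counting domains and aggregating centrality can be illustrated with a minimal Python sketch. It assumes each domain already has a centrality score (for example, from a link-graph analysis) and that a tracker's score is the sum of the scores of the domains embedding it. The function name, the domain and tracker names, and the scores are hypothetical, and the summation is an assumption about how the aggregation might work, not the authors' published method.

```python
from collections import defaultdict

def rank_trackers(embeddings, centrality):
    """Score trackers two ways: by domain count and by summed centrality.

    embeddings: iterable of (domain, tracker) pairs
    centrality: dict mapping domain -> centrality score (assumed precomputed)
    """
    by_count = defaultdict(int)
    by_centrality = defaultdict(float)
    # Deduplicate so each (domain, tracker) pair is counted once
    for domain, tracker in set(embeddings):
        by_count[tracker] += 1
        by_centrality[tracker] += centrality.get(domain, 0.0)
    return dict(by_count), dict(by_centrality)

# Hypothetical data: tracker-a sits on three small sites,
# tracker-b on a single highly central site.
centrality = {
    "small1.example": 0.125,
    "small2.example": 0.125,
    "small3.example": 0.125,
    "big.example": 0.5,
}
embeddings = [
    ("small1.example", "tracker-a"),
    ("small2.example", "tracker-a"),
    ("small3.example", "tracker-a"),
    ("big.example", "tracker-b"),
]
counts, scores = rank_trackers(embeddings, centrality)
```

In this toy data, a simple count favors tracker-a (three domains against one), while the centrality-weighted score favors tracker-b, because the one site it appears on carries more weight than the three small sites combined.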

Websites have always been able to track which of their own pages users read. Third-party tracking, widespread since the advent of Web 2.0, goes further: it allows outside companies to follow user activity across sites through embedded links or images. From a browser’s “fingerprint,” third-party trackers can often re-identify a user and estimate their approximate geolocation. In their research, the authors define third-party trackers as entities whose “main purpose or the business model of their owning company depends on collecting browsing data of users.”

To determine the predominant tracking companies on the Web and how many websites they track, Schelter and Kunegis processed over 3.5 billion web pages (more than 200 terabytes of data) from the publicly available CommonCrawl 2012 dataset; they identified over 140 million third-party embeddings from 355 tracking services in over 41 million domains. Their detailed analysis determined that third-party trackers are embedded in about 90% of privacy-noncritical websites in their corpus. The overwhelming majority of third-party tracking was conducted by Google, Facebook, and Twitter; Google alone had a tracking presence on over 50% of the sites in the dataset. This confirms the already established power of Google Analytics.
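
To make the idea of a "third-party embedding" concrete, here is a minimal Python sketch that scans a page's HTML for embedded resources and reports which known tracker domains they point to. It is an illustration only, not the authors' pipeline: the three-entry `TRACKER_DOMAINS` set is a hypothetical stand-in for the study's list of 355 services, the choice of tags and the suffix-matching rule are assumptions, and all function and class names are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical sample list; the study's actual list covers 355 tracking services
TRACKER_DOMAINS = {"google-analytics.com", "facebook.com", "twitter.com"}

class EmbedCollector(HTMLParser):
    """Collects the hostnames of resources embedded via src attributes."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag in {"script", "img", "iframe"}:
            for name, value in attrs:
                if name == "src" and value:
                    # Relative URLs have no hostname and are skipped
                    host = urlparse(value).hostname
                    if host:
                        self.hosts.add(host.lower())

def matches(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)

def third_party_trackers(page_url, html):
    """Return tracker domains embedded in html that do not own the page."""
    page_host = urlparse(page_url).hostname or ""
    collector = EmbedCollector()
    collector.feed(html)
    found = set()
    for host in collector.hosts:
        for domain in TRACKER_DOMAINS:
            # Skip first-party embeds, e.g. facebook.com loading its own assets
            if matches(host, domain) and not matches(page_host, domain):
                found.add(domain)
    return found

sample_html = (
    '<script src="https://www.google-analytics.com/ga.js"></script>'
    '<img src="/logo.png">'
    '<iframe src="//platform.twitter.com/widgets.html"></iframe>'
)
trackers = third_party_trackers("http://example.org/recipes", sample_html)
```

Repeating a check like this over every page in a crawl and aggregating the results per domain gives the kind of embedding counts reported above. A real analysis would also need to handle malformed HTML, dynamically inserted scripts, and a far larger tracker list.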

But what about privacy-critical sites, which may reveal sensitive user information? The researchers compared privacy-critical sites from four categories (health, addiction, sexuality, gender identity) with sites from four privacy-noncritical categories (cooking & food, soccer, television, video games) selected from the DMOZ database, “a large, human-edited directory of the web which provides an extensive labeling of websites.” According to the content-specific analysis, third-party tracking still covers about 60% of privacy-critical sites. While this lower incidence of tracking suggests that these websites take measures to avoid trackers, Schelter and Kunegis note that the more prevalent trackers on privacy-critical sites were smaller companies less likely to be perceived as trackers.

The researchers also performed an analysis by country, which showed that Google, Facebook, and Twitter emerged in the top ten trackers by rankshare for 46 of the 50 countries represented in the CommonCrawl 2012 dataset. The four outliers: China, Russia, Iran, and Ukraine. (With regard to China, however, “findings support a previous study which concluded that Google still operates tracking services on Chinese websites, despite its proclaimed retreat from the Chinese market in 2010.”) Schelter and Kunegis leveraged a large number of additional datasets with political and economic information to investigate these outliers. Ultimately, their findings “indicate that a positive characteristic such as freedom of the press is accompanied by a potentially very negative characteristic: the recording of people’s browsing behavior by companies outside of the legal control of their countries institutions.”

By Paul Oliver
