LinkedIn announced some upgrades to their data collection abilities on their Engineering Blog. The post was pretty tech-heavy, so here’s the simplified version for those who don’t consider database architecture water cooler conversation.
The importance of data for any advertising platform
Before I explain-ish what LinkedIn is doing, let me explain why it matters. Data collection is crucial to advertising platforms, especially since Facebook and Google have the data suction power of black holes. Demographic targeting and audience preferences are essential to successful ads, so data is nearly the Alpha and Omega of a platform that relies on advertising dollars. In the words of Lin, the author of the LinkedIn post:
Bringing all these external and internal datasets together into one central data repository for analytics (HDFS) allows for the generation of some really interesting and powerful insights that drive marketing, sales and member-facing data products.
LinkedIn collects a LOT of data
According to LinkedIn, the site collects outside data in addition to all the internal datasets generated by user actions, which include things like member profile updates, posts to the news feed, comments and clicks. External data sources include platforms like Google, Facebook and Twitter, as well as data gathered for marketing purposes. The internal data alone represents hundreds of terabytes every day. External data is not as voluminous, but LinkedIn has no control over how this data is shared, so it is still challenging to collect and sort.
This very confusing picture will give you an even better idea of the data collection challenges they’re dealing with, and how they’re handling them.
Improved sorting leads to improved collection and analysis
(Very) long story short, LinkedIn discovered several commonalities across their data sets, which allowed them to sort and filter the information they gather more efficiently. That in turn let them streamline data collection into Gobblin, a single data collection system instead of many. As LinkedIn puts it:
Gobblin is targeted at “gobbling in” all of LinkedIn’s internal and external datasets through a single framework.
Gobblin is now successfully dealing with tens of terabytes of data on a daily basis.