Big Data vs. Sampling


Merriam-Webster defines sampling as:

  1. the act, process, or technique of selecting a suitable sample; specifically: the act, process, or technique of selecting a representative part of a population for the purpose of determining parameters or characteristics of the whole population

  2. a small part selected as a sample for inspection or analysis: "ask a sampling of people which candidate they favor"

In statistics, sampling is the practice of viewing or polling a representative subset of a population. Sampling introduces statistical error, so results are often published as some value plus or minus some error range, e.g., 47% +/- 5%. It’s possible to produce accurate but useless statistical results in this manner. For example, if the result of a political poll in a race between two candidates is 47% +/- 5%, one interpretation is “it’s too close to call.” On second thought, that’s not useless. But it may be less useful for a candidate who wants to spend large sums of money preparing for a victory celebration.
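
Where does a figure like 47% +/- 5% come from? Here’s a minimal sketch, assuming a simple random sample and the usual normal approximation for a polled proportion (the sample size and z-score below are illustrative):

```python
import math

def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a sampled proportion.

    p_hat: observed proportion (e.g. 0.47 for 47%)
    n:     sample size
    z:     z-score for the desired confidence level (1.96 ~ 95%)
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Roughly 385 respondents gets you close to +/- 5% at 95% confidence.
p_hat, n = 0.47, 385
moe = margin_of_error(p_hat, n)
print(f"{p_hat:.0%} +/- {moe:.1%}")  # ~47% +/- 5.0%
```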

The Vs. Part…

Sampling informs analysts and data scientists of an approximation and a range of potential error. Sampling says we don’t need to poll every individual to achieve a result that is “good enough.”

“Big data” attempts to poll every individual.

The Problem I Am Trying to Solve

Is more data better? In his 2012 book, Antifragile, Nassim Nicholas Taleb (fooledbyrandomness.com | @nntaleb) – the first data philosopher I encountered – states:

“The fooled-by-data effect is accelerating. There is a nasty phenomenon called ‘Big Data’ in which researchers have brought cherry-picking to an industrial level. Modernity provides too many variables (but too little data per variable), and the spurious relationships grow much, much faster than real information, as noise is convex and information is concave.” – Nassim Nicholas Taleb, Antifragile, p. 416
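
Taleb’s claim is easy to demonstrate. Here’s a quick simulation (my own illustration, not from the book): with many candidate variables, few observations per variable, and nothing but randomness in play, a handful of variables still clear a standard significance bar and look like “real information.”

```python
import numpy as np

rng = np.random.default_rng(42)

n_obs = 30          # few observations per variable
n_vars = 200        # many candidate variables
threshold = 0.36    # |r| at roughly p < 0.05 for n = 30 (approximate)

# Pure noise: no variable has any real relationship with the target.
target = rng.standard_normal(n_obs)
noise_vars = rng.standard_normal((n_vars, n_obs))

# Count variables whose correlation with the target looks "significant."
corrs = np.array([np.corrcoef(target, v)[0, 1] for v in noise_vars])
spurious = int(np.sum(np.abs(corrs) > threshold))

print(f"{spurious} of {n_vars} noise variables clear the bar by chance alone")
# Expect roughly 5% of them (~10) to look meaningful despite being pure noise.
```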

According to Taleb, there’s a bias for error embedded in big data; more is not better, it’s worse. I’ve experienced this with business intelligence solutions and said this about data quality in data warehouse solutions:

“The ratio of good:bad data in a useless / inaccurate data warehouse is surprisingly high; almost always north of 95% and often higher than 99%.”

In other words, even a data warehouse its users consider useless or inaccurate is made up almost entirely of good data. It takes only a few percent (often less than 1%) of bad data to destroy confidence in the whole thing.

The Solution

The solution to bad data has always been (and remains) data quality. “Just eliminate the inaccurate data” sounds simple, but it’s not an easy problem to solve. In data science, data quality is the next-to-the-longest long pole. (Data integration is the longest long pole.) The solution for the first- and second-longest poles in data science is the same: automation.
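
What might automation look like at its simplest? Here’s a minimal sketch of an automated data-quality gate; the rows, column names, and rules are hypothetical and invented for illustration (this is not DILM Suite code):

```python
from datetime import date

# Hypothetical staging rows -- illustrative only.
rows = [
    {"order_id": 1001, "order_date": date(2017, 5, 1), "amount": 250.00},
    {"order_id": 1002, "order_date": None,             "amount": 120.50},
    {"order_id": 1003, "order_date": date(2017, 5, 3), "amount": -75.00},
]

# Each rule returns True when a row passes.
rules = {
    "order_id is present":    lambda r: r["order_id"] is not None,
    "order_date is present":  lambda r: r["order_date"] is not None,
    "amount is non-negative": lambda r: r["amount"] is not None and r["amount"] >= 0,
}

def run_quality_gate(rows, rules):
    """Split rows into clean and quarantined, recording which rules failed."""
    clean, quarantined = [], []
    for row in rows:
        failures = [name for name, rule in rules.items() if not rule(row)]
        (quarantined if failures else clean).append((row, failures))
    return clean, quarantined

clean, quarantined = run_quality_gate(rows, rules)
print(f"{len(clean)} clean rows, {len(quarantined)} quarantined rows")
for row, failures in quarantined:
    print(row["order_id"], "failed:", ", ".join(failures))
```

The design choice that matters is quarantining failed rows along with the reason they failed, rather than silently dropping them, so bad data can be measured and corrected instead of quietly eroding trust in the warehouse.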

At Enterprise Data & Analytics, we’re automating data quality. I mention it not by way of advertisement (although a geek’s gotta eat), but to inform you of another research focus area in our enterprise. Are we the only people trying to solve this problem? Goodness no. (That would be tragic!) We are focused on automating data science. You can learn more about our solutions for automating data wrangling at DILM Suite.

:{>

Learn More:
Biml in the Enterprise Data Integration Lifecycle (Password: BimlRocks)
From Zero to Biml – 19-22 Jun 2017, London 
IESSIS1: Immersion Event on Learning SQL Server Integration Services – 2-6 Oct 2017, Chicago

Tools:
SSIS Framework Community Edition
Biml Express Metadata Framework
SSIS Catalog Compare
DILM Suite

Andy Leonard

andyleonard.blog

Christian, husband, dad, grandpa, Data Philosopher, Data Engineer, Azure Data Factory, SSIS guy, and farmer. I was cloud before cloud was cool. :{>
