Since Internet connectivity became part of daily life, applications have collected data on a scale that few can fathom at this point. This collection has produced data sets so massive that they have been dubbed "Big Data." While the phrase is enough to interest a room of businesspeople, many engineers and statisticians have been baffled as to how to dismantle such complex data sets. Dr. Nick Duffield of the Computer Engineering and Systems Group, an expert on Big Data, has made significant progress on the issue with his research. In his latest seminar, Dr. Duffield detailed where exactly the issues surface with large data sets and suggested some methods to help avoid the constraints of interpreting Big Data.
Data comes from more places than one might think these days. An enormous amount pours in every day from physical measurements, the medical field, user Internet activity, and business records. With all this data come some major challenges, which Dr. Duffield broke down in his presentation. The most obvious challenge is simply storing what could be petabytes or exabytes of data. Even where storage is technically possible, the resources a company like UPS or Google would have to devote to it would be uneconomical, and the manpower and time needed to crunch through Big Data would be insurmountable for many companies. Even if all that were feasible for a very wealthy, large company, Dr. Duffield said that the data is often incomplete or has "statistical properties that are difficult to model."
Therefore, Big Data must somehow be compressed or summarized so that it can be handled accurately. Dr. Duffield suggests sampling large data sets to create manageable data on which useful analysis can take place. Sampling is a desirable middle ground among all the constraints mentioned above. Unfortunately, sampling by itself is not the complete answer: the underlying distributions in the data must be understood for sampling to be performed properly. These complexities give rise to a number of methodologies that Dr. Duffield expounded on in his talk. While Bernoulli and Poisson sampling are the more familiar methods for interpreting data, he also explained how non-uniform methods paired with the Horvitz-Thompson estimator can be used to digest Big Data. ISPs have also turned to graph sampling with subgraph counting to analyze data, but obtaining a representative sample from a large graph is extremely difficult.
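To make the idea concrete, here is a rough sketch (not taken from the seminar) of non-uniform Poisson sampling combined with Horvitz-Thompson reweighting: each item is kept with a probability proportional to its size, and each retained item is weighted by the inverse of that probability so the estimated total remains unbiased. The function name and the synthetic "flow sizes" below are illustrative assumptions, not Dr. Duffield's code.

```python
import random

def poisson_sample_ht(values, target_size):
    """Poisson sampling with size-proportional inclusion probabilities,
    plus a Horvitz-Thompson estimate of the population total."""
    total = sum(values)
    # Inclusion probability for each item, proportional to its size
    # (capped at 1 so very large items are always kept).
    probs = [min(1.0, target_size * v / total) for v in values]
    sample, estimate = [], 0.0
    for v, p in zip(values, probs):
        if random.random() < p:   # independent Bernoulli trial per item
            sample.append(v)
            estimate += v / p     # Horvitz-Thompson: reweight by 1/p
    return sample, estimate

random.seed(1)
# Synthetic, heavily skewed data standing in for e.g. network flow sizes.
data = [random.expovariate(1.0) for _ in range(10000)]
sample, est = poisson_sample_ht(data, target_size=500)
# Only a few percent of items are retained, yet the Horvitz-Thompson
# estimate stays close to the true total sum(data).
```

The key property is that large items, which dominate the total, are sampled with high probability, while the inverse-probability weights correct for the small items that are mostly dropped.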
The point for Big Data at the moment is that there is "a need to work across disciplines," according to Dr. Duffield. He would like to attract students and collaborators at the union of education and application. "The highest impact work is done in the intersection(s) of groups: as in statistics, mathematics…etc.," Dr. Duffield said.
Dr. Duffield is a Professor in the Department of Electrical and Computer Engineering at Texas A&M University. From 1995 until 2013, he worked at AT&T Labs-Research, where he was a Distinguished Member of Technical Staff and an AT&T Fellow. He has twice received the ACM SIGMETRICS Test of Time Award, in 2012 and 2013. He is also a Fellow of the IEEE.