Author: Marcel van den Bosch
Date: 25-jan-2015
Nowadays, more and more companies are experimenting with Big Data concepts. During these experiments, questions arise about how to approach the various types of analyses. In my work as a data scientist, I have noticed one important question that influences how the analysis process is set up. This question also determines what the Data Analytics environment should look like.
Going beyond descriptive and diagnostic questions
Traditional Business Intelligence (BI) focuses on answering descriptive questions (e.g. How many sales did we have in 2014?) and diagnostic questions (e.g. What happened to the sales figures after our promotion campaign in Arnhem?). For most BI consultants, this is business as usual.
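To make the difference concrete, here is a minimal sketch in Python with pandas: the descriptive question is a simple count and sum, while the diagnostic question is a before/after comparison. The file sales.csv, its columns (order_date, city, amount) and the campaign date are just assumptions for illustration.

```python
import pandas as pd

# Hypothetical sales extract; file and column names are assumptions.
sales = pd.read_csv("sales.csv", parse_dates=["order_date"])

# Descriptive: "How many sales did we have in 2014?"
sales_2014 = sales[sales["order_date"].dt.year == 2014]
print("Number of sales in 2014:", len(sales_2014))
print("Revenue in 2014:", sales_2014["amount"].sum())

# Diagnostic: "What happened to the sales figures after our promotion campaign in Arnhem?"
campaign_start = pd.Timestamp("2014-06-01")   # assumed campaign date
arnhem = sales[sales["city"] == "Arnhem"]
before = arnhem.loc[arnhem["order_date"] < campaign_start, "amount"].sum()
after = arnhem.loc[arnhem["order_date"] >= campaign_start, "amount"].sum()
print("Arnhem revenue before vs. after the campaign:", before, "vs.", after)
```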
When I look at the BI roadmaps of some organizations, I see a common path: build a bigger data warehouse and make it faster with clever in-memory technologies.
If we extend that line a bit further, I believe that predictive questions (e.g. What is likely to happen?) and prescriptive questions (e.g. What is the best thing to do?) are next in line.
Approaching the limits: Machine learning and prediction
We can use machine learning techniques – such as Artificial Neural Networks – and other prediction methods to give organizations answers to these kinds of questions.
However, most of the software packages used to perform these kinds of predictions require the entire dataset to be accessible in one place – usually the machine's main memory.
If you are like me and want to go BIG, you might want to work with ALL the data to get the maximum value out of it. That memory constraint then becomes a very practical limitation.
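To illustrate this memory-bound way of working, here is a minimal sketch that trains a small Artificial Neural Network with scikit-learn (just one possible tool; the file churn.csv and its layout are assumptions for illustration). The entire training set has to be loaded into RAM before training even starts.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# The whole training set is materialised in memory before training starts.
# churn.csv and its layout (last column = label) are assumptions for illustration.
data = np.loadtxt("churn.csv", delimiter=",", skiprows=1)
X, y = data[:, :-1], data[:, -1]

# A small feed-forward Artificial Neural Network.
model = MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=500)
model.fit(X, y)   # works fine – until X no longer fits in RAM

print("Training accuracy:", model.score(X, y))
```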
Bringing the data to the analysis
To answer questions about the present situation (descriptive, diagnostic), we have really cool BI environments that first pull the data from our data warehouses, do some aggregation and filtering, and visually present the results. Even my favorite statistical programming language – R – works by loading the data into memory and then performing the analysis on it.
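In code, this "bring the data to the analysis" pattern typically looks something like the sketch below (Python with pandas here; the warehouse file, table and column names are assumptions for illustration): first pull the rows out of the warehouse into local memory, then aggregate and present them.

```python
import sqlite3
import pandas as pd

# Pull the rows out of the warehouse into local memory first...
# (warehouse.db, the sales table and its columns are assumptions)
conn = sqlite3.connect("warehouse.db")
sales = pd.read_sql("SELECT region, order_date, amount FROM sales", conn)

# ...then everything else happens on an in-memory copy of the data.
sales["month"] = pd.to_datetime(sales["order_date"]).dt.to_period("M")
monthly = sales.groupby(["region", "month"])["amount"].sum()
print(monthly.head())
```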
Bringing the analysis to the data
If you want to build a predictive model on terabytes of data, you will not be able to do this on a single machine. You will need an environment with technologies that can handle prediction and machine learning in a distributed and parallel way. Scalability is an important aspect to take into account.
When working in a distributed environment, the hardest parts to scale are network and disk I/O. If we need to copy several terabytes of data across the processing nodes, we will definitely pay a performance penalty. It could even make the analysis practically impossible.
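A quick back-of-envelope calculation makes the point – the 5 TB dataset size and the 1 Gbit/s link speed below are just assumptions:

```python
# How long does it take just to ship 5 TB over a 1 Gbit/s link?
data_bytes = 5 * 10**12         # assumed dataset size: 5 TB
link_bits_per_sec = 1 * 10**9   # assumed network speed: 1 Gbit/s
hours = data_bytes * 8 / link_bits_per_sec / 3600
print(f"~{hours:.0f} hours of pure transfer time")   # roughly 11 hours
```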
Most modern big data platforms provide a way to bring the analysis to the data. We use a framework – such as Hadoop – together with algorithms that allow us to work intelligently on parts of the data, so that each worker node only processes a chunk of the entire dataset. This is the opposite of transferring all the data to a single processing unit.
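As a minimal sketch of this idea, consider a Hadoop Streaming style job written in Python (the CSV layout city,amount and the exact job wiring are assumptions for illustration). Each mapper runs on a worker node against only its local chunk of the data; the framework then groups and sorts the emitted keys before the reducer aggregates them.

```python
#!/usr/bin/env python
"""Hadoop Streaming style sketch: total sales per city.

Input lines are assumed to look like "city,amount". Each mapper only sees
the chunk (HDFS block) assigned to its worker node; Hadoop groups and sorts
the emitted keys before the reducer runs.
"""
import sys


def mapper(lines):
    # Emit "city<TAB>amount" for every record in this node's chunk of the data.
    for line in lines:
        city, amount = line.strip().split(",")
        print(f"{city}\t{amount}")


def reducer(lines):
    # Input arrives grouped by key; sum the amounts per city.
    current_city, total = None, 0.0
    for line in lines:
        city, amount = line.strip().split("\t")
        if city != current_city:
            if current_city is not None:
                print(f"{current_city}\t{total}")
            current_city, total = city, 0.0
        total += float(amount)
    if current_city is not None:
        print(f"{current_city}\t{total}")


if __name__ == "__main__":
    # Roughly how Hadoop Streaming would invoke this script:
    #   hadoop jar hadoop-streaming.jar -input sales -output totals \
    #          -mapper "job.py map" -reducer "job.py reduce" -file job.py
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if mode == "map" else reducer)(sys.stdin)
```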
The Big Data analysis design question
When embarking on a Big Data analytics journey, it is important to consider the question: “Should we bring the data to the analysis, or should we bring the analysis to the data?”
The answer to this question will give you clear guidance on how to structure your analysis and what your Big Data Analytics platform should look like…