The National Science Foundation defines Big Data in its paper “Core Techniques and Technologies for Advancing Big Data Science & Engineering” as large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future. To put the size of the Big Data phenomenon in perspective (according to Intel): from the dawn of time until 2003 mankind generated 5 Exabytes of data; in 2012 it generated 2.7 Zettabytes, roughly 500 times more; and by 2015 that figure is estimated to have grown to 8 Zettabytes. So how does this mass of available information affect the scientific community?
First, you can thank the scientific community for the development of Big Data, emanating as it does from Big Science. Scientists made the link between having an enormous amount of data to work with and the much greater likelihood of finding something useful in it, spawning projects in astronomy imaging (planet detection), physics research (supercollider data analytics), medical research (drug interaction), weather prediction and others.
It is also the scientific community that is at the forefront of the new technologies making Big Data Analytics possible and cost-effective. Major projects are underway to evaluate core technologies and tools that take advantage of collections of large data sets to accelerate scientific progress, such as:
- Data collection and management (DCM) – including new data storage, I/O systems, and architectures for continuously generated data, as well as shared and widely-distributed static and real-time data
- Data analytics (DA) – with the development of new algorithms, programming languages, data structures, and data prediction tools; new computational models and the underlying mathematical and statistical theory needed to capture important performance characteristics of computing over massive data sets
- E-science collaboration environments (ESCE) – novel collaboration environments for diverse and distant groups of researchers and students to coordinate their work (e.g., through data and model sharing and software reuse, tele-presence capability, crowd-sourcing, social networking capabilities) with greatly enhanced efficiency and effectiveness for scientific collaboration; along with automation of the discovery process (e.g., through machine learning, data mining, and automated inference)
A very relevant example of a technology created to address this emerging market is the Hadoop framework from the Apache Software Foundation. Hadoop redefines the way data is managed and analysed by harnessing a distributed grid of computing resources, using a simple programming model to enable distributed processing of large data sets on clusters of computers. Its technology stack includes common utilities, a distributed file system, analytics, data storage platforms, and an application layer that manages distributed processing, parallel computation, workflow, and configuration management. This approach allows Hadoop to deliver the high availability, massive scalability and response times needed for Big Data Analytics.
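The simple programming model behind Hadoop is commonly known as MapReduce: a map step turns each input record into key/value pairs, the framework groups the pairs by key, and a reduce step combines the values for each key. The sketch below imitates those three steps in plain Python on a single machine, purely to illustrate the idea. It does not use any Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence inside a single process.

    On a real cluster the map and reduce calls would be spread across
    many machines; here they simply run one after another.
    """
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input: two short lines of text.
    lines = ["big data big science", "science needs data"]
    print(run_job(lines))
```

Because each call to `map_phase` depends only on its own record, and each call to `reduce_phase` only on its own key, both steps can in principle be spread across many machines, which is what makes the model suitable for clusters.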
It is also the scientific community that may well determine the direction of Big Data in the business world. In a recent briefing on Big Data given to the US Congress, Farnam Jahanian, the assistant director for Computer and Information Science and Engineering at the National Science Foundation, explained the implications of Big Data and how it will influence our thinking as we move forward:
First, insights and more accurate predictions from large and complex collections of data have important implications for the economy. Access to this information is transforming traditional businesses and creating new markets. Big Data is driving the creation of new IT products and services based on business intelligence and data analytics, and is boosting the productivity of firms that use it to make better decisions and identify new business trends.
Second, advances in Big Data are critical to accelerating the pace of discovery in almost every science and engineering discipline. From new insights about protein structure, biomedical research, clinical decision making and climate modelling to new ways to mitigate and respond to natural disasters and new strategies for effective learning and education, there are enormous opportunities for data-driven discovery.
Finally, Big Data also has the potential to solve some of the world’s most pressing challenges – in science, education, environment and sustainability, medicine, commerce and cyber and national security – with huge social benefits and laying the foundation for a nation’s competitiveness for many decades to come.
Big Data and its associated Analytics offer the scientific community a wealth of opportunities to advance numerous fields of research and to take the lead in one of the most significant IT trends in decades, with the potential for a paradigm shift leading to Big Science 2.