Taming Big Data with Apache Spark and Python - Hands On!
How Big Data works: how is big data collected and stored?
Smartphones let us upload all our data to the cloud, and large companies like Google invite us into their ecosystems. Simply put, we live in the Big Data era. But what does this really mean?
What is Big Data?
Big data refers to large volumes of structured or unstructured data. It is processed with specialized automated tools and used for statistics, analysis, prediction, and decision making.
The term "big data" itself was coined by Nature editor Clifford Lynch in a 2008 special issue about the explosive growth in the volume of information in the world. Lynch classified any stream of heterogeneous data exceeding 150 GB per day as big data, but there is still no single agreed criterion (see details in the Taming Big Data with Apache Spark and Python - Hands On! book by Frank Kane).
Until 2011, big data was analyzed only within scientific and statistical research. But by the beginning of 2012, data volumes had grown so large that they needed to be systematized and put to practical use.
Since 2014, the world's leading universities have been teaching Big Data in applied engineering and IT programs. IT corporations then joined in collecting and analyzing it: first Microsoft, IBM, Oracle, and EMC, followed by Google, Apple, Facebook, and Amazon. Today, big data is used by large companies in every industry, as well as by government agencies.
What are the characteristics of Big Data?
Meta Group identified the main characteristics of big data, known as the "three V's":
- Volume - the amount of data: 150 GB per day or more;
- Velocity - the speed at which data arrays accumulate and are processed. Big data is updated continuously, so intelligent technology is needed to process it in real time;
- Variety - the diversity of data types. Data can be structured, unstructured, or semi-structured. For example, the data stream in social networks is unstructured: it can contain text posts, photos, or videos.
Today, three more characteristics are often added to these three:
- Veracity - the reliability of both the dataset itself and the results of its analysis;
- Variability - volatility. Data streams have peaks and troughs, influenced by seasons or social events. The more unstable and volatile a data stream is, the harder it is to analyze;
- Value - significance. Like any information, big data can be simple or difficult to understand and analyze. Social-network posts are an example of simple data; banking transactions are an example of complex data.
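The Variety characteristic above is easiest to see in code. A minimal Python sketch (the sample records below are invented for illustration) contrasts how structured, semi-structured, and unstructured data are handled:

```python
import csv
import io
import json

# Structured data: a fixed schema, e.g. a CSV export of transactions.
structured = "user_id,amount\n1,100.50\n2,42.00\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured data: self-describing, but the schema varies per record (JSON).
semi_structured = '{"user_id": 1, "tags": ["photo", "travel"], "likes": 17}'
post = json.loads(semi_structured)

# Unstructured data: free text with no schema; only heuristics apply.
unstructured = "Just landed in Lisbon! Amazing sunset over the river."
word_count = len(unstructured.split())

print(rows[0]["amount"])  # structured: fields addressable by column name
print(post["tags"])       # semi-structured: nested, optional fields
print(word_count)         # unstructured: needs text-processing heuristics
```

The design point is that each step down this ladder loses addressability: structured data can be queried by column, semi-structured data must be navigated field by field, and unstructured data requires dedicated processing (which is where tools like Apache Spark come in).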