Big Data Analysis with Apache Hadoop and Spark

This article introduces the analysis of large volumes of data (Big Data) using Apache Hadoop and Apache Spark, two of the most powerful Big Data tools available today.

Parallel Programming

Because we are in a boom of digital transformation, almost everything is likely to be used as data. In addition, technological systems have ever greater capacity.

This means that the amount of available data is massive, giving rise to Big Data, which is characterized by the so-called 4 Vs:

  • Volume: an immense amount of data is produced (generated by operational applications, user interactions, external data…).
  • Velocity: the rate at which data arrives is continuous and increasing, so the amount of data grows massively every second.
  • Variety: the data sources are very diverse, ranging from operational databases to social networks or HTML pages.
  • Veracity: not all data is reliable, and only trustworthy data should be taken into account.

Sequential algorithms running on a single machine are not the best option for analyzing this volume of information because they would be too inefficient and expensive. These algorithms rely on vertical scalability: improving performance by adding more resources to the same node.

The approach to scaling in Big Data ecosystems is to use parallel algorithms, so that several machines or nodes can each perform simpler tasks simultaneously in order to carry out the overall analysis. These algorithms rely on horizontal scalability: better performance by adding more nodes.

MapReduce Paradigm

The MapReduce paradigm emerged to run parallel algorithms on systems with large amounts of data. This paradigm is based on two main functions:

  • Map: on each node, reads the data from disk in key-value form and performs a key-value transformation on it.
  • Reduce: performs aggregation tasks on the outputs generated by the map.

Application example of the MapReduce paradigm

Let’s look at a simple example of the MapReduce paradigm:

Initial state

From 3 data sources that contain the stock of oils in a supermarket, we want to obtain the total stock of each type of oil:

  • Node 1:
    Sunflower oil, Brand A, 8
    Sunflower oil, Brand B, 10
    Olive oil, Brand A, 5
  • Node 2:
    Sunflower oil, Brand A, 1
    Sunflower oil, Brand B, 0
    Olive oil, Brand A, 96
  • Node 3:
    Sunflower oil, Brand Z, 100
    Olive oil, Brand A, 0

Map Operation

In this case, after applying the map operation, the following key-value pairs would be obtained (the key is the type of oil and the value is the number of units):

  • [Node 1]: <Sunflower, 8>, <Sunflower, 10>, <Olive, 5>
  • [Node 2]: <Sunflower, 1>, <Sunflower, 0>, <Olive, 96>
  • [Node 3]: <Sunflower, 100>, <Olive, 0>

Reduce Operation

After applying the reduce operation, the sum of the values for each key would be performed:

  • <Sunflower, 119>
  • <Olive, 101>
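
To make the paradigm concrete, here is a minimal Python sketch that simulates this map, shuffle and reduce flow in memory. It is a toy illustration of the idea rather than Hadoop itself, and the record layout (product, brand, units) is taken from the example above.

```python
from collections import defaultdict

# Raw records as they sit on each node: (product, brand, units).
nodes = {
    "node1": [("Sunflower oil", "Brand A", 8), ("Sunflower oil", "Brand B", 10), ("Olive oil", "Brand A", 5)],
    "node2": [("Sunflower oil", "Brand A", 1), ("Sunflower oil", "Brand B", 0), ("Olive oil", "Brand A", 96)],
    "node3": [("Sunflower oil", "Brand Z", 100), ("Olive oil", "Brand A", 0)],
}

def map_phase(records):
    """Map: turn each record into an <oil type, units> key-value pair."""
    return [(product.split()[0], units) for product, _brand, units in records]

# Each node maps its own records independently.
mapped = [pair for records in nodes.values() for pair in map_phase(records)]

# Shuffle: group the emitted pairs by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: aggregate (sum) the values of each key.
totals = {key: sum(values) for key, values in grouped.items()}
print(totals)  # {'Sunflower': 119, 'Olive': 101}
```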

Apache Hadoop

Apache Hadoop is the best-known open-source implementation of the MapReduce paradigm.

How Does Apache Hadoop Apply MapReduce?

Apache Hadoop performs MapReduce as follows:

  • Files are distributed evenly across its storage system (usually HDFS).
  • All nodes perform the same map task; there is no difference between the nodes, and they all know how to do the same job on the files they hold.
  • The results of the map operation on each node are sorted by key, using a process called shuffling. Key-value pairs with the same key are sent to the same node to apply the reduce operation.
  • The map results are sent continuously, giving rise to partitions that are processed by the reduce nodes.
  • Once the reduce nodes have finished, the output is stored on disk (a sketch of this flow follows below).
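
As a hedged sketch of what this looks like in practice, the oil-stock example could be written as a Hadoop Streaming job in Python: a mapper that emits tab-separated key-value pairs and a reducer that sums the values it receives already sorted by key. The script names and the input format (one "type, brand, units" record per line) are assumptions for illustration.

```python
#!/usr/bin/env python3
# mapper.py -- reads raw stock lines from stdin and emits "<oil type>\t<units>" pairs.
import sys

for line in sys.stdin:
    line = line.strip().rstrip(".")
    if not line:
        continue
    product, _brand, units = [field.strip() for field in line.split(",")]
    oil_type = product.split()[0]      # "Sunflower oil" -> "Sunflower"
    print(f"{oil_type}\t{units}")
```

```python
#!/usr/bin/env python3
# reducer.py -- receives "<key>\t<value>" lines sorted by key (the shuffle) and sums per key.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print(f"{current_key}\t{total}")
```

Hadoop Streaming would run mapper.py on the nodes holding the input blocks, sort the emitted pairs by key during the shuffle, and feed each key's pairs to reducer.py, whose output is finally written to HDFS.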

Disadvantages of Apache Hadoop

Although it was state of the art a few years ago, some drawbacks have been found: every solution must be adapted to the division between map and reduce tasks, which is not always easy, and its reliance on disk reads and writes means it is not very fast.

Hadoop working example

Figure: flowchart of how Apache Hadoop calculates the stock of oils in the supermarket example.

Apache Spark

Apache Spark is a horizontally scalable system for processing distributed data that does not require as much conceptual complexity as the MapReduce ecosystem. It was created to solve the problems seen in Apache Hadoop, although it is not always optimal for every use case.

RDD in Spark

The fundamental piece that Spark relies on for distributed processing is the RDD, or Resilient Distributed Dataset. It is a collection of immutable objects (that is, read-only once created), distributed and cacheable, and it contains a list of references to the partitions of the data across the system.
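
A minimal PySpark sketch of these properties (assuming a local Spark installation; the data is made up for illustration): transformations never modify an existing RDD, they return a new one, and cache() marks an RDD to be kept in memory.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-basics")

stock = sc.parallelize([("Sunflower", 8), ("Sunflower", 10), ("Olive", 5)])

# RDDs are immutable: map() does not change `stock`, it returns a new RDD.
doubled = stock.map(lambda kv: (kv[0], kv[1] * 2))

# RDDs are cacheable: keep the new RDD in memory once it has been computed.
doubled.cache()

print(stock.collect())    # [('Sunflower', 8), ('Sunflower', 10), ('Olive', 5)]
print(doubled.collect())  # [('Sunflower', 16), ('Sunflower', 20), ('Olive', 10)]
```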

How does Spark Partition?

Spark is in charge of making these partitions, and it does it in the following way:

  • There is a default block size, usually 64 or 128MB.
  • For larger files, Spark creates the partitions using the predefined block size, and they are assigned to nodes based on criteria such as the traffic between nodes or where the data is physically located (see the sketch below).
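
As a hedged PySpark illustration (local mode; the file path and partition counts are assumptions), the number of partitions of an RDD can be inspected and influenced like this:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "partitions-demo")

# For a collection, the number of partitions can be requested explicitly.
numbers = sc.parallelize(range(1_000_000), numSlices=8)
print(numbers.getNumPartitions())  # 8

# For files, Spark derives the partitions from the underlying block/split size;
# a minimum number of partitions can also be requested (hypothetical HDFS path).
lines = sc.textFile("hdfs:///data/oil_stock.csv", minPartitions=4)
```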

How can RDDs be Loaded in Apache Spark?

  • Using the parallelize function to create a distributed collection from a local one.
  • From external file sources, such as Amazon S3.
  • By transforming an existing RDD.
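
The three options above might look like this in PySpark (a sketch; the S3 bucket and file names are made-up placeholders, and reading from S3 requires the corresponding connector on the cluster):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-creation")

# 1. From a local collection, with parallelize.
stock_rdd = sc.parallelize([("Sunflower", 8), ("Sunflower", 10), ("Olive", 5)])

# 2. From an external file source, such as Amazon S3 (placeholder bucket and key).
lines_rdd = sc.textFile("s3a://my-bucket/supermarket/oil_stock.csv")

# 3. By transforming an existing RDD: every transformation returns a new RDD.
totals_rdd = stock_rdd.reduceByKey(lambda a, b: a + b)
```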

Operations for RDDs

There are two types of operations that can be applied to RDDs: transformations and actions.

  • On the one hand, transformations are used to obtain a new RDD from the original one, on which the necessary actions will later be carried out. Examples include map, flatMap, mapPartitions, filter, sample or union. All transformations are lazy; they are only performed when they need to be used.
  • On the other hand, actions are applied once the transformations have been defined, to obtain the desired results. Among others: reduce, collect, count, first or foreach (see the sketch below).
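
The sketch below (PySpark, local mode, made-up data) shows the difference: the transformations only record what has to be done, and the work happens when an action such as collect, count or reduce is invoked.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "transformations-and-actions")

stock = sc.parallelize([("Sunflower", 8), ("Sunflower", 10), ("Olive", 5), ("Olive", 96)])

# Transformations are lazy: nothing is computed yet, Spark only records the lineage.
sunflower = stock.filter(lambda kv: kv[0] == "Sunflower")
units = sunflower.map(lambda kv: kv[1])

# Actions trigger the actual computation.
print(units.collect())                   # [8, 10]
print(units.count())                     # 2
print(units.reduce(lambda a, b: a + b))  # 18
```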

These operations can be of two types, depending on whether or not they need data from more than one partition (see the sketch after this list):

  • Narrow: only data from the same partition needs to be used, for example filter, sample or flatMap.
  • Wide: data from several partitions needs to be used, so all the data involved must be brought together (shuffled) before operating, for example groupByKey and reduceByKey.
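
Continuing with the same toy data (a hedged PySpark sketch), filter is a narrow operation that works partition by partition, while reduceByKey is a wide operation that needs a shuffle:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "narrow-vs-wide")

stock = sc.parallelize(
    [("Sunflower", 8), ("Sunflower", 10), ("Olive", 5), ("Olive", 96)], numSlices=2
)

# Narrow: filter runs independently on each partition, no data moves between nodes.
non_zero = stock.filter(lambda kv: kv[1] > 0)

# Wide: reduceByKey must bring together all the values of each key,
# so a shuffle across partitions is required.
totals = non_zero.reduceByKey(lambda a, b: a + b)

print(totals.collect())  # e.g. [('Sunflower', 18), ('Olive', 101)]
```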

How Does Spark Perform Operations?

The order of the defined operations is determined through a DAG (Directed Acyclic Graph). This is because they will not always be executed in the order in which they were written, but in the order that is optimal for Spark.

Narrow operations (those that can be performed within the same partition) are performed first. Then the wide operations are performed, maintaining the order determined by Spark after analyzing the DAG.
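
One way to see the plan Spark has built is to print an RDD's lineage with toDebugString before running any action (a sketch; the exact output depends on the Spark version):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "dag-demo")

stock = sc.parallelize([("Sunflower", 8), ("Olive", 5), ("Olive", 96)])
plan = (
    stock.filter(lambda kv: kv[1] > 0)              # narrow
         .reduceByKey(lambda a, b: a + b)           # wide (shuffle)
         .map(lambda kv: f"{kv[0]}: {kv[1]}")       # narrow, after the shuffle
)

# Nothing has run yet; print the lineage (the DAG) that Spark has recorded.
print(plan.toDebugString().decode("utf-8"))

# Only an action triggers execution, in the stage order derived from the DAG.
print(plan.collect())
```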

Conclusion

This article has approached the analysis of massive data with Apache Hadoop and Apache Spark, two fundamental Big Data technologies today. Specifically, we have reviewed the concepts of parallel programming and the MapReduce paradigm, and we have dissected how Hadoop and Spark work through an example in which we calculated the stock of oils in a supermarket.