Big Data analysis with Map-Reduce
Technical tutorial presented at Data Science Conference (DSC 2023) in Belgrade
In Data Science, we often deal with big amounts of data. In those cases, many standard approaches won't work as expected, and to process big data, we need to apply a different technique called MapReduce. The main problem when dealing with big data is that the data size is so large that filesystem access times become a dominant factor in the execution time. Because of that, it is not efficient to process big data on a standard MPI cluster machines. With distributed computing solutions like Hadoop and Spark clusters, which rely on the MapReduce approach, big volumes of data are processed and created by diving work into independent tasks, performing the job in parallel. For the first time, the MapReduce approach was formalized by Google, in the paper MapReduce: Simplified Data Processing on Large Clusters when they encountered a problem in indexing all pages on the WWW.
map() and reduce() in python
Before we further explain the MapReduce approach, we will review two functions from Python, map()
and reduce()
, introduced in functional programming. In imperative programming, computation is carried through statements, where the execution of code changes the state of variables. Unlike in functional programming, the state is no longer necessary because functional programming works on immutable data types; functions create and work on new data sets while the original dataset is intact.
- map() applies the function to each element in the sequence and returns the resulting sequence. map() also returns the memory address of the returned map generator object and has to be called with a list() or through the loop.
- reduce() function returns a single value. It applies the function to the sequence elements from left to right.
Word Count in python
Counting the number of words in the document is the simplest example when learning the MapReduce technique. In this tutorial, we will work on the Moby Dick book. In python, we can use the following code:
import re
WORD_REGEX = re.compile(r"[\w]+")
# remove any non-words and split lines into separate words
# finally, convert all words to lowercase
def splitter(line):
line = WORD_REGEX.findall(line)
return map(str.lower, line)
sums = {}
try:
in_file = open('pg2701.txt', 'r')
for line in in_file:
for word in splitter(line):
sums[word] = sums.get(word, 0) + 1
in_file.close()
except IOError:
print("error performing file operation")
else:
M = max([x for x in sums], key=lambda k: sums[k])
print("max: %s = %d" % (M, sums[M]))
max: the = 14620
A program written like this runs only on one processor, and we expect that the time necessary to process the whole text is proportional to the size of the text. Also, as the size of the dictionary grows, the performance degrades. When the size of the dictionary reaches the size of RAM or even swap space, the program will be stopped.
In the map-reduce approach, we can avoid memory issues we may encounter. Those aproaches are scalable, and can be tested on the smaller datasets on local machine. When data becomes big, we'll need to process and store data in distributed frameworks like Hadoop or Spark or online computing services like AWS or Azure, where we can use the same principles and patterns we will learn in this tutorial.
Map-Reduce
MapReduce consists of 3 steps:
- Map step which produces the intermediate results
- Shuffle step, which groups intermediate results with the same output key
- Reducing step that processes groups of intermediate results with the same key
This approach works on data sets that consist of data records. The idea is to define the calculation in terms of two steps - a map step and a reduce step. The map operation is applied to every data record in the set and returns a list of key-value pairs. Those pairs are then collected, sorted by key, and passed into the reduce operation, which takes a key and a list of values associated with that key and then computes the final result for that key, returning it as a key-value pair.