Datasets and Packages Used
Datasets
- Moby Dick book: we will use the file pg2701.txt
- Tennis WTA matches can be downloaded from the GitHub repository as .csv files, with WTA matches from 1968 until 2023.
- Iris Flowers Dataset can be downloaded from many sources; in this tutorial I used one from Kaggle.
Requirements
- python3
- mrjob
pip install mrjob
- pyspark
pip install pyspark
To use pyspark you need Java installed beforehand.
- networkx
pip install networkx
- matplotlib
pip install matplotlib
- pandas
pip install pandas
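As a quick sanity check after installation, something like the following sketch can report which of the requirements above are available. The function name `check_environment` is hypothetical (it is not part of this repository), and the check assumes that a `java` executable on the PATH is enough for pyspark, which may not hold for every setup.

```python
import importlib.util
import shutil

# Package names taken from the requirements list above.
REQUIRED_PACKAGES = ["mrjob", "pyspark", "networkx", "matplotlib", "pandas"]


def check_environment(packages=REQUIRED_PACKAGES):
    """Return a dict mapping each requirement to True (found) or False (missing).

    Hypothetical helper for illustration only.
    """
    # find_spec returns None when a top-level package cannot be located.
    status = {name: importlib.util.find_spec(name) is not None for name in packages}
    # Assumption: pyspark needs a `java` executable reachable on the PATH.
    status["java"] = shutil.which("java") is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:<12} {'ok' if ok else 'MISSING'}")
```

Anything reported as MISSING can be installed with the matching pip command from the list; Java has to be installed separately.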
Literature
- MapReduce: Simplified Data Processing on Large Clusters
- PageRank paper: The PageRank Citation Ranking: Bringing Order to the Web
- Wolohan, J. (2020). Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code. United States: Manning.
- Radtka, Z., & Miner, D. (2015). Hadoop with Python. O'Reilly Media.
- Tutorial BigData with PySpark by New York University
- CornellEdu BigData Technologies course