peppelan/spark-course


To prepare the dev environment (a Docker-based Spark installation), first install Docker, then run from the top folder of this repository:

docker run --rm -it --name sparkdev -v `pwd`:/SparkCourse -p 4040:4040 gettyimages/spark:2.1.1-hadoop-2.7

(Port 4040 exposes the Spark web UI of the running application.)

To check that the installation works (as suggested in the course), open a PySpark shell by running: docker exec -it sparkdev /usr/spark-2.1.1/bin/pyspark. There, the following lines:

rdd = sc.textFile("README.md")
rdd.count()

will print the line count of that file (104 in this case).
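As a sanity check outside Spark, the same number can be computed in plain Python, since textFile() creates one RDD element per line of the input file. This is a sketch, not part of the course material; the file path is just an example:

```python
def count_lines(path):
    # Counts newline-delimited lines, which for a plain text file
    # should match what rdd.count() reports on the same file.
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)
```

Running count_lines("README.md") on the same file should print the same count as the PySpark shell.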

Datasets:

  • ml-100k from grouplens.org
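The ml-100k ratings file (u.data) is tab-separated: user id, item id, rating, and a Unix timestamp. A minimal parser for the course exercises might look like this (the function name and the namedtuple shape are my own choices, not from the course):

```python
from collections import namedtuple

# One parsed row of the ml-100k u.data file.
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])

def parse_rating(line):
    # u.data fields are tab-separated: user id, item id, rating, timestamp.
    user, movie, rating, ts = line.strip().split("\t")
    return Rating(int(user), int(movie), int(rating), int(ts))
```

In Spark this would typically be applied per line, e.g. sc.textFile("ml-100k/u.data").map(parse_rating).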

About

Playground for Udemy course "Taming Big Data with Apache Spark and Python - Hands on"
