Big Data Analytics Using Spark

This course is part of the Data Science MicroMasters program offered by the University of California San Diego. To earn the course certificate, I had to complete eight assignments and pass a proctored exam.

The course taught me the following:

- how to perform statistical analysis of very large datasets that do not fit on a single computer
- how to use some of the most popular tools for this type of analysis:
  - Apache Spark using PySpark
  - XGBoost
  - TensorFlow
- how to understand the underlying computer architecture and programming abstractions, and how to use Spark to minimize bottlenecks in massively parallel computation
- how to perform supervised and unsupervised machine learning on massive datasets using Spark's Machine Learning Library (MLlib)
- how to perform data loading and cleaning using Spark and Parquet
- how to model data through statistical and machine learning methods to:
  - perform large-scale analysis
  - identify statistically significant patterns
  - visualize statistical summaries
- how to use these tools through Jupyter Notebooks, combining narrative, code, and graphics to create convincing analytical documents
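
To give a rough idea of what analyzing data that does not fit on a single computer looks like, here is a minimal sketch of the partition-and-combine pattern that Spark-style computation relies on. It is written in plain Python rather than PySpark so it runs without a cluster, and it is my own illustration, not material from the course. The names `summarize`, `combine`, `mean_and_variance`, and the sample `partitions` are all hypothetical.

```python
from functools import reduce


def summarize(partition):
    """Map step: reduce one partition to (count, sum, sum of squares)."""
    n = len(partition)
    s = sum(partition)
    ss = sum(x * x for x in partition)
    return (n, s, ss)


def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(partitions):
    """Compute the mean and population variance from per-partition summaries."""
    n, s, ss = reduce(combine, map(summarize, partitions), (0, 0.0, 0.0))
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance


# Hypothetical data: each inner list stands in for a partition held on a different worker.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, variance = mean_and_variance(partitions)
print(mean, variance)
```

Only the small three-number summaries need to be moved between workers, never the raw data, which is one way such computations avoid communication bottlenecks. The sum-of-squares formula is kept here for brevity; it can lose precision on real data, so a numerically stable merge would be preferable in practice.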

Credentials:

Certificate