This course is part of the Data Science MicroMasters program offered by the University of California San Diego. To earn the course certificate, I had to complete eight assignments and pass the proctored exam.
The course taught me the following:
- how to perform statistical analysis of very large datasets that do not fit on a single computer
- some of the most popular tools for performing this type of analysis:
  - Apache Spark using PySpark
  - XGBoost
  - TensorFlow
- how to use Spark to minimize bottlenecks in massively parallel computation and to understand the underlying computer architecture and programming abstractions
- how to perform supervised and unsupervised machine learning on massive datasets using Spark's Machine Learning Library (MLlib)
- how to perform data loading and cleaning using Spark and Parquet
- how to model data through statistical and machine learning methods to:
  - perform large-scale analysis
  - identify statistically significant patterns
  - visualize statistical summaries
- how to use these tools through Jupyter notebooks, combining narrative, code, and graphics to create convincing analytical documents
Credentials: