In this analysis, I used Google Colab notebooks to run the entire ETL process in the cloud. PySpark extracted and transformed two large Amazon review datasets (“Kitchen purchase reviews” and “Home purchase reviews”) and loaded them into an AWS RDS PostgreSQL instance. SQL was then used to perform a statistical analysis of selected data.
Many Amazon shoppers depend on product reviews when deciding on a purchase. The Amazon Vine program is an invitation-only club for a small percentage of the most trusted reviewers, selected by Amazon. The program aims to give customers more information, including honest and unbiased reviews. The task was to investigate whether Vine reviews are free of bias and truly trustworthy. Amazon makes the review datasets publicly available, but they are large: a single dataset contains several million rows, which can exceed what an average local machine can handle.
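As a rough illustration of the kind of SQL comparison this task involves, the sketch below computes the share of five-star reviews for Vine and non-Vine reviewers. It is a minimal sketch under stated assumptions, not the project's actual query: it uses Python's built-in `sqlite3` as a local stand-in for the AWS RDS PostgreSQL instance, the table name `reviews` and the helper `five_star_share` are hypothetical, the columns `star_rating` and `vine` (`'Y'`/`'N'`) are assumed to match the review data, and the sample rows are invented purely for demonstration.

```python
import sqlite3

# Invented sample rows for illustration only: (review_id, star_rating, vine).
# These are NOT taken from the Amazon datasets.
SAMPLE_REVIEWS = [
    ("R1", 5, "Y"),
    ("R2", 4, "Y"),
    ("R3", 5, "N"),
    ("R4", 1, "N"),
    ("R5", 2, "N"),
    ("R6", 3, "N"),
]

# Assumed schema: a `reviews` table with `star_rating` and `vine` columns.
# Groups reviews by Vine status and reports the five-star percentage.
QUERY = """
SELECT vine,
       COUNT(*) AS total_reviews,
       SUM(CASE WHEN star_rating = 5 THEN 1 ELSE 0 END) AS five_star_reviews,
       ROUND(100.0 * SUM(CASE WHEN star_rating = 5 THEN 1 ELSE 0 END)
             / COUNT(*), 1) AS five_star_pct
FROM reviews
GROUP BY vine
ORDER BY vine
"""


def five_star_share(rows):
    """Load rows into an in-memory database and run the comparison query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE reviews ("
            "review_id TEXT PRIMARY KEY, star_rating INTEGER, vine TEXT)"
        )
        conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for vine, total, five_star, pct in five_star_share(SAMPLE_REVIEWS):
        print(f"vine={vine}: {five_star}/{total} five-star reviews ({pct}%)")
```

A large gap between the two percentages on the real data would be one possible sign of bias, although on its own it would not establish it.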
The Colab notebooks and SQL query files contain the detailed code and additional descriptions.
- Tools/techniques used: Apache Spark, PostgreSQL, Google Colab, AWS RDS