Abstract
One of the challenges our society faces is the ever-increasing amount of data, which requires systems to analyze large data sets without compromising their performance and forces people to navigate through a deluge of irrelevant material. Among the existing platforms that address the system requirements, Hadoop is a framework widely used to store and analyze Big Data. On the human side, recommendation systems are one of the principal aids to finding the things people really want. This thesis evaluates approaches to highly scalable parallel algorithms for recommendation systems with application to very large datasets. A particular goal is to evaluate MPJ Express, an open source Java message passing library for parallel computing that has been integrated with Hadoop. We also use MapReduce, a core component of Hadoop, to partition our data and implement a parallel distribution of our datasets. These datasets were acquired from well-known recommender systems such as MovieLens and Yahoo Music. Based on these datasets we generate our own synthetic dataset, aiming for a larger size in order to test the scalability of the model; we name this dataset "SyntheD". As a demonstration, we use MPJ Express to implement collaborative filtering on various datasets using the ALSWR algorithm (Alternating Least Squares with Weighted-λ-Regularization). We benchmark the performance and demonstrate parallel speedup on the MovieLens, Yahoo Music and SyntheD datasets. We then compare our results with other frameworks, namely Mahout, Spark and Giraph, and measure the accuracy of the program with a suitable error metric. Our results indicate that the MPJ Express implementation of ALSWR has very competitive performance and scalability in comparison with the other frameworks we evaluated.
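For reference, ALSWR alternates between solving regularized least-squares problems for the user-factor and item-factor matrices. Its standard objective, following Zhou et al.'s formulation (the notation below is ours, not taken from the thesis), is

$$
\min_{U,M}\; \sum_{(i,j)\in I} \left(r_{ij} - \mathbf{u}_i^{\top}\mathbf{m}_j\right)^2
\;+\; \lambda \left( \sum_i n_{u_i}\,\lVert \mathbf{u}_i \rVert^2 \;+\; \sum_j n_{m_j}\,\lVert \mathbf{m}_j \rVert^2 \right)
$$

where $r_{ij}$ is the known rating of user $i$ for item $j$, $\mathbf{u}_i$ and $\mathbf{m}_j$ are the latent feature vectors, and $n_{u_i}$, $n_{m_j}$ count the ratings associated with user $i$ and item $j$ respectively.

The sketch below illustrates, under our own assumptions rather than as the thesis code, how one half-iteration of a parallel ALSWR might be organised with MPJ Express: each rank solves the least-squares systems for the items it owns, then exchanges the updated item factors with all other ranks. All identifiers (`AlswrSketch`, `features`, `localItems`) are hypothetical.

```java
import mpi.MPI;

public class AlswrSketch {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        int features = 10;        // latent factors per user/item (assumed)
        int localItems = 1000;    // items owned by this rank (assumed, equal on every rank)

        // Item factors computed by this rank, and a buffer for the full item-factor matrix.
        double[] localItemFactors = new double[localItems * features];
        double[] allItemFactors = new double[size * localItems * features];

        // ... solve the weighted, lambda-regularized least-squares systems for this
        // rank's items against the current user factors here (omitted) ...

        // Share the updated item factors with every process before the user half-iteration.
        MPI.COMM_WORLD.Allgather(localItemFactors, 0, localItems * features, MPI.DOUBLE,
                                 allItemFactors, 0, localItems * features, MPI.DOUBLE);

        MPI.Finalize();
    }
}
```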
Date of Award | Sept 2019 |
---|---|
Original language | English |
Awarding Institution | |
Supervisor | Bryan Carpenter (Supervisor), Mohamed Bader-El-Den (Supervisor) & Mo Adda (Supervisor) |