Performance Evaluation of a Big Data Application on Apache Spark

Alcantara, Jeanne

doi:10.32920/ryerson.14651544.v1

thesis

posted on 2021-05-23, 09:18 authored by Jeanne Alcantara

Apache Spark enables a big data application (one that takes massive data as input and may produce massive data during its execution) to run in parallel on multiple nodes. Performance is therefore a vital concern for such an application. This project analyzes a WordCount application running on Apache Spark and assesses how its configuration affects execution time and average node utilization. To facilitate this assessment, the number of executor cores and the size of executor memory are varied across different sizes of input data and different numbers of nodes in the cluster that the application runs on. It is concluded that different pairs (data size, number of nodes in the cluster) require different numbers of executor cores and different sizes of executor memory to obtain optimum results for execution time and average node utilization.
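
The parameter sweep described in the abstract can be sketched as a grid of runs. The snippet below is only an illustration, not the author's setup: the input file names, node counts, core counts, memory sizes, and the `wordcount.py` script name are all hypothetical, since the abstract does not list the values used. It relies only on the `--executor-cores` and `--executor-memory` options of `spark-submit`; the number of nodes is treated as a property of the cluster that is provisioned separately, so it is recorded per run but not passed on the command line.

```python
from itertools import product

# Hypothetical sweep values; the abstract does not state the actual settings.
DATA_FILES = ["input_1gb.txt", "input_10gb.txt"]  # assumed input files
NODE_COUNTS = [2, 4]                               # assumed cluster sizes
EXECUTOR_CORES = [1, 2, 4]                         # assumed core counts
EXECUTOR_MEMORY = ["1g", "2g"]                     # assumed memory sizes


def build_command(data_file, cores, memory, app="wordcount.py"):
    """Return a spark-submit argument list for one run.

    The application script name is a placeholder, not the thesis code.
    """
    return [
        "spark-submit",
        "--executor-cores", str(cores),
        "--executor-memory", memory,
        app,
        data_file,
    ]


def sweep():
    """Yield (data_file, nodes, cores, memory, command) for every combination.

    The node count is only recorded here; resizing the cluster is assumed
    to happen outside this script.
    """
    for data_file, nodes, cores, memory in product(
        DATA_FILES, NODE_COUNTS, EXECUTOR_CORES, EXECUTOR_MEMORY
    ):
        yield data_file, nodes, cores, memory, build_command(data_file, cores, memory)


runs = list(sweep())
```

Each entry in `runs` pairs one (data size, number of nodes) setting with one executor configuration, so the execution time and node utilization measured for each run can later be compared within the same pair.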

History

Language

eng

Degree

Master of Engineering

Program

Electrical and Computer Engineering

Granting Institution

Ryerson University

LAC Thesis Type

Thesis

Year

2019

Keywords

SPARK (Electronic resource); Data mining -- Computer programs; Big data; Database Management

Licence

In Copyright
