Major
Data Science
Research Abstract
In this research, we explore the technical and computational merits of applying a machine learning algorithm to a large data set using distributed systems. Using 167 million (10 GB) energy consumption observations collected by smart meters from residential consumers in London, England, we predict future residential energy consumption with a Random Forest machine learning algorithm. We use distributed systems such as AWS S3 and EMR, MongoDB, and Apache Spark. We evaluate computational times and predictive accuracy. We conclude that distributed systems offer significant computational advantages when applying machine learning algorithms to large-scale data. We also observe that distributed systems can be computationally burdensome when the amount of data being processed falls below the threshold at which it can leverage the computational efficiencies they provide.
Faculty Mentor/Advisor
Diane Woodbridge
Forecasting Smart Meter Energy Usage Using Distributed Systems and Machine Learning
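The threshold effect noted in the abstract can be illustrated with a simple cost model. This is only a sketch under stated assumptions, not the method or the measurements of the study: it assumes a distributed job pays a fixed startup and coordination overhead and then divides the per-row work evenly across workers. The function names and every numeric value (`per_row_s`, `workers`, `overhead_s`) are hypothetical; only the row count of 167 million comes from the abstract.

```python
def single_node_seconds(n_rows: int, per_row_s: float) -> float:
    """Estimated run time on one machine: work grows linearly with rows."""
    return n_rows * per_row_s


def distributed_seconds(n_rows: int, per_row_s: float,
                        workers: int, overhead_s: float) -> float:
    """Estimated run time on a cluster: fixed overhead plus evenly split work."""
    return overhead_s + n_rows * per_row_s / workers


def break_even_rows(per_row_s: float, workers: int, overhead_s: float) -> float:
    """Row count at which both estimates are equal.

    Solves n * c = o + n * c / w for n, giving n = o / (c * (1 - 1 / w)).
    """
    if workers <= 1:
        raise ValueError("a speedup requires more than one worker")
    return overhead_s / (per_row_s * (1 - 1 / workers))


if __name__ == "__main__":
    # Hypothetical values chosen for illustration, not measured in the study.
    per_row_s, workers, overhead_s = 1e-5, 8, 60.0

    threshold = break_even_rows(per_row_s, workers, overhead_s)
    print(f"break-even at about {threshold:,.0f} rows")

    for n_rows in (1_000_000, 167_000_000):
        local = single_node_seconds(n_rows, per_row_s)
        cluster = distributed_seconds(n_rows, per_row_s, workers, overhead_s)
        faster = "distributed" if cluster < local else "single node"
        print(f"{n_rows:>11,} rows: local {local:.1f}s, "
              f"cluster {cluster:.1f}s, faster: {faster}")
```

Under these assumed values the fixed overhead dominates for the smaller input and the cluster wins for the full 167 million rows. Real workloads also depend on data skew, shuffle cost, and I/O, which this model ignores.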