NIPS 2011 Big Learning - Algorithms, Systems, & Tools Workshop: Spark: In-Memory Cluster... February 14, 2012

Big Learning Workshop: Algorithms, Systems, and Tools for Learning at Scale at NIPS 2011
Invited Talk: Spark: In-Memory Cluster Computing for Iterative and Interactive Applications by Matei Zaharia

Matei Zaharia is a fifth-year graduate student at UC Berkeley, working with Scott Shenker and Ion Stoica on topics in cloud computing, operating systems, and networking. He is also a committer on Apache Hadoop and is funded by a Google PhD fellowship. Before joining Berkeley, Matei earned his undergraduate degree at the University of Waterloo in Canada.

Abstract: MapReduce and its variants have been highly successful in supporting large-scale data-intensive cluster applications. However, these systems are inefficient for applications that share data among multiple computation stages, including many machine learning algorithms, because they are based on an acyclic data flow model. We present Spark, a new cluster computing framework that extends the data flow model with a set of in-memory storage abstractions to efficiently support these applications. Spark outperforms Hadoop by up to 30x in iterative machine learning algorithms while retaining MapReduce's scalability and fault tolerance. In addition, Spark makes programming jobs easy by integrating into the Scala programming language. Finally, Spark's ability to load a dataset into memory and query it repeatedly makes it especially suitable for interactive analysis of big data. We have modified the Scala interpreter to make it possible to use Spark interactively as a highly responsive data analytics tool.

At Berkeley, we have used Spark to implement several large-scale machine learning applications, including a Twitter spam classifier and a real-time automobile traffic estimation system based on expectation maximization. We will present lessons learned from these applications and optimizations we added to Spark as a result.
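
To make the abstract's central point concrete, here is a minimal sketch in plain Python of why an iterative algorithm benefits from keeping its working set in memory. It does not use Spark or its API. The `CountingSource` class, the `make_points`, `gradient`, `train_reloading`, and `train_cached` functions, and the toy one-dimensional logistic regression are all hypothetical names and choices made up for illustration. The only thing it models is how many times the input is read: once per iteration in the reload style, once in total in the cached style.

```python
import math
import random


class CountingSource:
    """Hypothetical stand-in for a dataset kept in distributed storage."""

    def __init__(self, points):
        self._points = points
        self.loads = 0

    def load(self):
        # Each call models one full read of the input from storage.
        self.loads += 1
        return list(self._points)


def make_points(n, seed=0):
    """Generate labelled 1-D points: the label is +1 when x > 0, else -1."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        points.append((x, 1.0 if x > 0 else -1.0))
    return points


def gradient(points, w):
    """Gradient of the logistic loss for a single weight w, summed over points."""
    total = 0.0
    for x, y in points:
        total += (1.0 / (1.0 + math.exp(-y * w * x)) - 1.0) * y * x
    return total


def train_reloading(source, iterations, rate=0.1):
    """Reload style (assumed model of an acyclic job chain): read input every iteration."""
    w = 0.0
    for _ in range(iterations):
        w -= rate * gradient(source.load(), w)
    return w


def train_cached(source, iterations, rate=0.1):
    """In-memory style: read the input once, then reuse the same working set."""
    cached = source.load()
    w = 0.0
    for _ in range(iterations):
        w -= rate * gradient(cached, w)
    return w


if __name__ == "__main__":
    data = make_points(200)
    reloading, cached = CountingSource(data), CountingSource(data)
    w_reload = train_reloading(reloading, iterations=10)
    w_cached = train_cached(cached, iterations=10)
    print("reload style:", reloading.loads, "loads, w =", w_reload)
    print("cached style:", cached.loads, "load(s), w =", w_cached)
```

Both functions perform the same arithmetic and reach the same weight; they differ only in the number of reads. In a real cluster each read would involve disk and network I/O, which is the cost the abstract says an in-memory abstraction avoids. The sketch says nothing about the size of the speedup, which depends on the workload and the system.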