Apache Spark Blueprint for self-managed CentOS Cloud Servers

A fast and general engine for large-scale data processing.


CenturyLink has integrated Spark into a Blueprint that will install and configure Apache Spark on an existing unmanaged CentOS Cloud Server.

Features

Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop or can be run as a standalone solution. Our Blueprint installs a standalone, unmanaged instance of Spark on a CentOS VM. Spark is designed to facilitate the development of fast, unified Big Data applications that combine batch, streaming, and interactive analytics on all your data. It is an ideal tool for analysts and data scientists who rely on iterative algorithms (such as clustering and classification). With processing speeds 10 to 100 times faster than MapReduce, Spark shortens time to insight on more data, resulting in better business decisions.

Ease of Use

Spark provides over 80 high-level operators for building parallel applications. Developers can quickly build and run applications in familiar programming languages such as Java, and can use Spark interactively from the Scala, Python and R shells.
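As a rough illustration of what these operators look like in practice, the sketch below counts words with three of them (flatMap, map and reduceByKey) from Python. It assumes PySpark is installed on the server and runs Spark in local mode; the helper names (tokenize, to_pair, add, word_count, run_example) and the sample sentences are made up for this example and are not part of the Blueprint.

```python
import re


def tokenize(line):
    """Split a line into lowercase words; passed to Spark's flatMap."""
    return re.findall(r"[a-z']+", line.lower())


def to_pair(word):
    """Turn a word into a (word, 1) pair; passed to Spark's map."""
    return (word, 1)


def add(a, b):
    """Sum two partial counts; passed to Spark's reduceByKey."""
    return a + b


def word_count(sc, lines):
    """Count words in `lines` using an existing SparkContext `sc`."""
    pairs = (
        sc.parallelize(lines)   # distribute the input lines
          .flatMap(tokenize)    # one record per word
          .map(to_pair)         # (word, 1)
          .reduceByKey(add)     # sum the counts for each word
          .collect()            # bring the results back to the driver
    )
    return dict(pairs)


def run_example():
    """Hypothetical entry point: start a local Spark context and count words."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCountSketch")
    try:
        print(word_count(sc, ["to be or not to be", "that is the question"]))
    finally:
        sc.stop()
```

Calling run_example() from a Python session starts a local context and prints the counts. Inside the interactive pyspark shell a context named sc already exists, so word_count(sc, lines) can be called directly.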

Sophisticated Analytics

Spark offers out-of-the-box support for SQL queries, streaming data and complex analytics. It powers a stack of libraries, including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. And with Spark you can combine these libraries seamlessly in the same application.

Fast Real Time Processing

Process and manipulate data in real time with Spark Streaming. Spark runs workloads up to 100x faster than MapReduce in memory and up to 10x faster on disk.

Contributors

Since 2009, more than 800 developers from over 200 companies have contributed to making Apache Spark what it is today. If you’d like to participate in Spark, or contribute to the libraries on top of it, learn how to contribute.

Community

Spark’s highly engaged community is distributed across a wide range of organizations. As of early 2015, over 500 companies were using Spark in production to process and analyze large datasets. Check out some real-life examples on Spark’s Powered By page. There are many ways to reach the community:

  • Use the mailing lists to pose questions to the community at large
  • Attend live events such as the annual Spark Summit and the Bay Area Spark Meetup

Resource

Self-Managed Apache Spark

free add-on for existing VMs

Included

A Managed Apache Spark solution is available on CenturyLink Cloud with Cloudera Enterprise Data Hub.
