By James Kobielus, IBM Data Science Evangelist
Open-source approaches continue to disrupt established markets and foster amazing innovations. In the IT world, open-source software is penetrating every pore of established vendor ecosystems and recrystallizing platforms, tools, and applications into newer, more agile configurations.
Open-source software platforms are successful not because they’re perfect. They’re successful because, as I stated here, they boost productivity throughout the economy by accelerating reuse, sharing, collaboration, and innovation within entire industry ecosystems.
The centerpiece of open data science is Spark, which has matured rapidly into an open analytics platform for robust in-memory machine learning, graph, and streaming analytics. Last year, I blogged about Spark in the larger context of open-source initiatives, including platforms, ecosystems, languages, tools, APIs, expertise, and data, and again in the context of Spark and the Hadoop Open Data Platform.
In 2016, we have seen Spark continue its trajectory toward mainstream adoption in diverse big data, advanced analytics, data science, Internet of Things, and other application domains. In addition to Spark, the core components of the deepening open analytics stack include Hadoop and the R programming language. Taken together, these open-source tools constitute the core workbenches being used by data scientists to craft innovative applications in every sector of the economy.
Data scientists are the core developers in this new era, and they have strong feelings about the open analytics stacks at their disposal. Their productivity depends directly on the ease of use, performance, and integration of their core open analytics development platforms, tools, and libraries, including Spark, R, and Hadoop.
For working data scientists, a key innovation came to market last summer with the release of Apache Spark 1.4. Here’s the overview of that announcement in Computerworld, which notes high up that its chief new feature is SparkR, a language binding for R programming in Spark projects. And here’s a related blog, written by yours truly at Spark Summit, summarizing the Apache Spark community’s plans to roll out additional language bindings beyond R, Python, Java, and Scala. The SparkR binding, based on the DataFrame API, is a very significant addition to the core Spark codebase. It lets R developers access the environment’s scale-out parallel runtime, leverage Spark’s input and output formats, and call directly into Spark SQL.
In this way, R, which was designed to work only on a single computer, can now run large jobs across multiple cores on a single machine or across massively parallel server clusters. As a result, R has become a full-blown big-data analytics development tool for the era of Spark-accelerated machine learning, in-memory, streaming, and graph analytics.
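To make this concrete, here is a minimal sketch of the Spark 1.4-era SparkR workflow: promoting an ordinary R data.frame to a distributed Spark DataFrame, filtering it on the Spark engine, and calling directly into Spark SQL. It assumes a local Spark 1.4+ installation with the bundled SparkR package on the library path; the app name and query are illustrative, not from the original post.

```r
# Hedged sketch, assuming Spark 1.4+ with its bundled SparkR package.
library(SparkR)

# Initialize the scale-out runtime (here, all local cores) and a SQL context.
sc <- sparkR.init(master = "local[*]", appName = "SparkRSketch")  # illustrative app name
sqlContext <- sparkRSQL.init(sc)

# Promote an ordinary R data.frame (the built-in 'faithful' dataset)
# to a distributed Spark DataFrame.
df <- createDataFrame(sqlContext, faithful)

# Familiar R-style column operations, executed by the Spark engine.
head(filter(df, df$waiting > 70))

# Call directly into Spark SQL against the same DataFrame.
registerTempTable(df, "faithful")
longWaits <- sql(sqlContext, "SELECT eruptions FROM faithful WHERE waiting > 70")
head(collect(longWaits))

sparkR.stop()
```

The same R code scales from a laptop (`master = "local[*]"`) to a cluster simply by pointing `master` at a Spark cluster manager, which is the portability the DataFrame-based binding was designed to deliver.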
On June 6 at Galvanize in San Francisco, IBM will be making important announcements aimed at making R, Spark, and open data science a business reality. At the Apache Spark Maker Community Event, IBM will host a stimulating evening of keen interest to data scientists, data application developers, and data engineers. The event will feature special announcements, a keynote, a panel, and a hall of innovation. Leading industry figures who have already committed to participate include John Akred, CTO, Silicon Valley Data Science; Ritika Gunnar, Vice President of Offering Management, IBM Analytics; Todd Holloway, Director of Content Science and Algorithms, Netflix; Matthew Conley, Data Scientist, Tesla Motors; and Siddha Ganju, Computational Data Scientist, Carnegie Mellon University.