Galvanize recently attended the Dato Data Science Summit in San Francisco, a gathering of more than 1,000 data scientists and researchers from industry and academia to discuss and learn about the most recent advances in data science, applied machine learning, and predictive applications.
Here are eight Python tools that our Data Science Immersive instructors think data scientists will be using in the coming months and years:
One of the biggest announcements out of the Dato Data Science Summit was that SFrame and SGraph are going open source under a BSD license. SFrame (short for Scalable Data Frame) is a disk-backed columnar data structure with a DataFrame-like interface, optimized for memory efficiency and performance. SGraph follows the same ethos, but for representing graphs efficiently. One of the biggest advantages of these two data structures is that they let a data scientist perform “out-of-core” analytics on datasets that don’t fit in memory.
This is a watershed moment for Dato and the Python data community, as the open sourcing of these two libraries signals Dato’s commitment to supporting an open source Python ecosystem around data. There has been a common misconception in the community that, because Dato has an enterprise version, using the free version will eventually lock you in and force you to pay. By moving to open source, Dato has made it clear that this sort of bait-and-switch is not its goal, and now that these two libraries are open source, we’ll hopefully see other developers adopt them in their own libraries (I’m looking at you, Pandas) to break away from memory limitations.
Bokeh is a Python interactive visualization library that lets you display elaborate, interactive graphics in your web browser, with or without a server. It can handle very large or even streaming datasets (such as a live spectrogram feed), is fast and embeddable, and supports novel interactions such as hover callbacks. It’s useful for anyone who wants to quickly and easily create interactive plots, dashboards, and data applications.
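As a minimal sketch of the no-server workflow (using Bokeh’s `bokeh.plotting` interface; the data and file name here are invented for illustration), a standalone interactive plot with a hover tooltip might look like this:

```python
# A minimal sketch of a standalone Bokeh plot with a hover tooltip.
# The data and output file name are invented for this example.
from bokeh.models import HoverTool
from bokeh.plotting import figure, output_file, save

output_file("scatter.html")  # standalone HTML output, no server needed

p = figure(title="Hover demo", tools="pan,wheel_zoom,reset")
p.scatter(x=[1, 2, 3, 4], y=[4, 7, 2, 5], size=12)
p.add_tools(HoverTool(tooltips=[("x", "@x"), ("y", "@y")]))

save(p)  # writes scatter.html, which any browser can open
```

The resulting HTML file is self-contained, which is what makes Bokeh plots easy to embed in reports or dashboards.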
Dask is an out-of-core scheduler for Python. It helps you do block-based parallelism on large computations by dividing your data up into chunks and scheduling the computation over however many cores you have. Dask is written in pure Python and leverages the Python ecosystem, primarily targeting parallel computations that run on a single machine.
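As an illustrative sketch of the chunking idea (the array shape and chunk size here are arbitrary), a Dask array computation builds a task graph first and only runs it when you ask for the result:

```python
import dask.array as da

# A 10,000 x 10,000 array of ones, divided into 100 blocks of
# 1,000 x 1,000; blocks are only materialized when needed.
x = da.ones((10000, 10000), chunks=(1000, 1000))

# Building the expression only constructs a task graph...
result = (x + x.T).mean()

# ...compute() hands that graph to the scheduler, which runs the
# per-block tasks across however many cores you have.
print(result.compute())  # 2.0
```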
There are two main ways to interact with dask. Dask users will primarily use dask collections, which are similar to popular libraries such as NumPy and Pandas, but generate graphs internally. Dask developers, on the other hand, will primarily be making graphs directly. Dask graphs encode algorithms using Python dicts, tuples, and functions, and can be used in isolation from the Dask collections.
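For instance, at the developer level a graph is just a plain Python dict (this toy graph is made up; `dask.threaded.get` is one of dask’s schedulers):

```python
from operator import add
from dask.threaded import get  # one of dask's schedulers

# A dask graph is a dict: keys are names, values are either
# literal data or (function, arg, ...) task tuples.
dsk = {
    "a": 1,
    "b": 2,
    "total": (add, "a", "b"),
}

print(get(dsk, "total"))  # 3
```

Because the graph format is ordinary Python data, other libraries can emit dask graphs without depending on the dask collections at all.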
There are currently a lot of libraries in the Python ecosystem, many of them coming out of Continuum, that may seem to do the same thing. But rather than being conflicting libraries, Blaze, Dask, and Numba are meant to work together at different levels of data processing. By analogy, Blaze is similar to the query optimizer in a relational database management system (RDBMS), while Dask is the execution engine: Blaze optimizes the symbolic expression of a query or command, and Dask optimizes its execution on your hardware.
If you’re a data scientist, chances are you use Python on a daily basis. But for everything it’s great at, Python has its limitations, and one of the biggest is that it doesn’t scale well. It’s great for small datasets, but larger data forces you into sampling or aggregation, and switching to distributed tools can compromise your results in various ways.
A new project from Cloudera Labs, Ibis is a data analysis framework that aims to give data scientists and engineers the Python experience they’re used to at any node count and data size. It mirrors the single-node Python experience without compromising functionality or usability, delivering the same interactive, full-fidelity analysis at big data scale.
Ibis allows for a 100% Python end-to-end user workflow and integrates with the existing Python data ecosystem (Pandas, scikit-learn, NumPy, etc.). A preview of Ibis is available for installation now, and it will expand to include more features, such as integration with advanced analytics, machine learning, and other high-performance computing tools, in the future.
Petuum is a distributed machine learning framework that aims to provide a generic algorithmic and systems interface to large-scale machine learning. It provides distributed programming tools that can assist with the challenges of running machine learning at scale. Petuum is designed specifically for machine learning, which means that it takes advantage of data correlation, staleness, and other statistical properties to maximize performance.
Petuum has two core components. Bösen is a bounded-asynchronous distributed key-value store for data-parallel machine learning programming; it uses the Stale Synchronous Parallel consistency model, which allows asynchronous-like performance without sacrificing algorithm correctness. Strads is a dynamic scheduler for model-parallel machine learning programming; it performs fine-grained scheduling of machine learning update operations, prioritizing computation on the parts of the program that need it most while avoiding unsafe parallel operations that could hurt performance.
Apache Flink is an open source platform for scalable batch and stream data processing. The core of Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. It’s very similar to Apache Spark, given that one of its primary goals is to serve as a replacement for MapReduce, the aging heart of Hadoop.
The APIs of Spark and Flink are rather similar, but they differ in how they process data. When Spark processes a stream, it actually uses micro-batching: it collects the incoming data over a short time interval and processes it as one small batch. This approximates stream processing, and is usually fine, but it can cause problems and slowdowns in low-latency situations. Flink, on the other hand, is primarily a stream processing framework that can also do batch processing. In other words, instead of doing the easy job (batch processing) and an approximation of the hard one (stream processing), Flink was built to do the more difficult job, and can also handle the easier task.
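To make the distinction concrete, here is a plain-Python sketch of the two processing models. This is not Spark’s or Flink’s API; both function names and the timing logic are invented purely to illustrate the idea:

```python
import time

def micro_batch(stream, interval=0.5):
    """Spark-style micro-batching: buffer records for a time
    interval, then process them together as one small batch."""
    batch, deadline = [], time.time() + interval
    for record in stream:
        batch.append(record)
        if time.time() >= deadline:
            yield batch  # per-record latency is at least `interval`
            batch, deadline = [], time.time() + interval
    if batch:
        yield batch  # flush whatever is left when the stream ends

def per_record(stream):
    """Flink-style true streaming: handle each record as soon as
    it arrives, so latency is one record's processing time."""
    for record in stream:
        yield record
```

With a long interval the micro-batcher groups everything into one batch; as the interval shrinks toward zero it degenerates into per-record processing, which is why micro-batching is best thought of as a tunable approximation of true streaming.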