Interview with Paco Nathan, Machine Learning Expert


Paco Nathan is a “player/coach” who has led innovative Data teams building large-scale apps for several years, with expertise in distributed systems, machine learning, functional programming, and cloud computing. Paco is a machine learning expert as well as an O’Reilly author, an Apache Spark open source evangelist with Databricks, an advisor for Amplify Partners, and a GalvanizeU academic advisor. He received his BS in Mathematical Sciences and MS in Computer Science from Stanford University, and has 30+ years of technology industry experience ranging from Bell Labs to early-stage start-ups. He recently joined the Galvanize team as an academic advisor for GalvanizeU.

What’s your role at Galvanize? What will you be working on?

I’ve just joined the advisory board, specifically as an academic advisor. I look forward to helping refine the curriculum, doing guest lectures, hosting meetups/conferences, and perhaps mentoring some students or companies. Also helping some with business development, introductions, etc. In terms of guest lectures, I’m looking forward to previewing some of the more advanced training material that’s in the works for Apache Spark: use cases and examples for machine learning at scale, natural language processing, graph algorithms, leveraging notebooks, etc.

Why Galvanize? What inspired you to join the team?

What inspired me most was the feedback after the launch party last October – especially talking with Mike, Peter, Nir, Charisse, Bruna, Ryan, plus Katie and others coming in through the Zipfian acquisition. I recognized that we had a great deal in common in terms of industry perspectives, priorities, and values. Their insights and approach to Data Science education and practice fit very closely with mine. I wanted to become part of this.

Galvanize’s future expansion into Seattle is an excellent move: I tend to beat a path between SF, Boulder, and Seattle, and really look forward to visiting those other campuses.

How did you get involved in machine learning and data science?

Machine Learning came first… my grad advisor at Stanford in the early 1980s was Douglas Lenat, a pioneer in machine learning. I’d been a student intern on an AI project at IBM Research along with a friend from Stanford, Janet Feigenbaum – daughter of AI pioneer Ed Feigenbaum. Janet talked my ear off about neural networks, and eventually went into neuroscience. After grad school, I dove into neural network research, then spent 7 years on NN commercialization projects. For one of those, I was CTO of a publicly traded firm that built video recognition for industrial robots. It’s great to see how much that field has advanced, now with Deep Learning use cases making a big impact in industry.

Data Science came later… right after the Dotcom crash, I was consulting for an electronics firm in Austin. One of their clients – a popular chip manufacturer – was competing against Intel, and the latter had arranged for key features to get deprecated by their mutual vendor of Verilog compilers. Our client had less than six months to identify and remove small RC components from their library of 10,000+ standard cell designs – or risk a major lawsuit. An internal team of engineers tried, but failed, given that the manual tasks would take years. A team at our consultancy tried and gave up. I got handed the contract as a long shot. Pairing with a friend who was an expert circuit designer, we tried a vastly different approach: data mining on the Verilog source code for the circuits, to automate the problem. It took about 10 lines of Perl code plus an Excel spreadsheet to identify where to make the necessary changes. When I turned in an invoice for 10 hours, the manager said “Go away for a few weeks, but keep billing – just so you, me, and the customer don’t all look foolish.” So I went to visit my cousin in Hawaii. Relaxing on a beach during a recession, I decided that this data thing had legs to it.
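To make that anecdote concrete: the trick was treating the Verilog source as data to be mined, rather than as circuits to be inspected by hand. The sketch below is a hypothetical Python reconstruction of that idea, not the original Perl; the directory layout, primitive names, and regex are all assumptions for illustration.

```python
# Hypothetical sketch of the approach described above: scan a library of
# Verilog standard-cell files for resistor/capacitor primitives that need to
# be removed, and report file/line locations. The regex and directory layout
# are assumptions for this sketch, not the original code.
import re
from pathlib import Path

RC_PATTERN = re.compile(r"^\s*(rres|cap|rcfilt)\b.*", re.IGNORECASE)  # assumed primitive names

hits = []
for cell_file in Path("cell_library").glob("*.v"):
    for lineno, line in enumerate(cell_file.read_text().splitlines(), start=1):
        if RC_PATTERN.match(line):
            hits.append((cell_file.name, lineno, line.strip()))

# Dump a simple CSV that can be reviewed in a spreadsheet, as in the story.
with open("rc_components.csv", "w") as out:
    out.write("cell,line,instance\n")
    for name, lineno, text in hits:
        out.write(f"{name},{lineno},{text}\n")
```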

After that, I spent 10 years leading data teams on large scale problems, mostly involving ML: network security, banking, NLP, social networks, ad-tech, custom search, ecommerce, anti-fraud, etc.

What do you think are some of the interesting trends and changes happening in education right now?

I taught more than 3000 people in 2014, through guest lectures, professional workshops, conference tutorials, webcasts, etc. Add another 40,000 through videos. Education is near and dear to my heart, and current trends are compelling.

In terms of technology, many people look to MOOCs – and my team runs two MOOCs through edX, University of California, and Databricks. However, MOOCs do not address the full scope of education. I’ve also had this conversation with Taylor Martin, who’s now at Stanford – highly recommended work. Analytics plays a key role. For an example in industry, take a look at what Pearson is doing with Spark, Kafka, Cassandra, GraphX, etc., to surface real-time data insights for their learning platform globally.

More long-term, our notion of “textbooks” is changing dramatically. Moving beyond the current toolset of books, videos, epubs, etc., there is a project at O’Reilly Media that leverages cloud-based notebooks, along with containers and microservices. I’ve been working with that project, and at Databricks we also have a notebook product that leverages similar elements of advanced cloud architectures.

The instrumentation and insights gained from that data are compelling. However, the challenge of teaching people to “think” in terms of notebooks is even more fascinating – more of a paradigm shift in some ways than the introduction of spreadsheets. Back to Taylor, et al., I’m leveraging an instructional rubric called Computational Thinking. That work began at CMU and was subsequently picked up by Google for its training, and is now gaining traction within O’Reilly. To illustrate where cloud-based notebooks and computational thinking intersect, I like this quote by artist David Hockney: “The way we depict space has a great deal to do with how we behave in it.”  In my guest lectures at Galvanize, I look forward to teaching how to accelerate data science work through notebooks.

Your background is in building scalable machine learning solutions – what are some of the most exciting projects you’ve worked on?

My current role is arguably the most exciting, as Director of Community Evangelism at Databricks, working on Apache Spark. It’s quite a rocket ride, given how much Spark has grown so quickly and how expert my colleagues at Databricks are in distributed systems and machine learning. There’s also the breadth of covering use cases for ETL, streaming, machine learning, and graph algorithms through Scala, Python, SQL, etc., plus the challenge of teaching people to think in terms of functional programming and cloud-based notebooks, then seeing the results deployed throughout industry.
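For readers who have not used Spark, here is a minimal sketch in Python of the kind of workflow described above: a bit of ETL with DataFrames and SQL, then a model fit with Spark’s ML pipeline API. The input path, column names, and model choice are assumptions for illustration, not an example from Databricks.

```python
# A minimal PySpark sketch: load data, do a little ETL with SQL, then fit a
# model with Spark's ML pipeline API. File and column names are made up.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sketch").getOrCreate()

# ETL: read raw events, register a view, and select a training set with SQL.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("events")
training = spark.sql("SELECT amount, num_items, label FROM events WHERE label IS NOT NULL")

# ML: assemble feature vectors and fit a logistic regression in one pipeline.
assembler = VectorAssembler(inputCols=["amount", "num_items"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(training)

model.transform(training).select("label", "prediction").show(5)
```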

Previously, I’d say one of the most exciting projects was Symbiot. As VP of R&D, I wrote the original SIMS prototype, then led an engineering team to build out the product. We were runner-up for an Apple Design Award in 2004, and got deployed for real-time network security at the US House of Representatives, Wildwell, etc. – all of which were seriously under attack 24/7. Our sales rep had to leave one time during a visit to Wildwell, because the company had an emergency deployment with the US Marines to the Persian Gulf. Running our automated product under those kinds of conditions was a big tech challenge. Large amounts of data, for the time, too.

Your recent book, “Just Enough Math,” covers math for businesses that want to solve problems with data. What skills are essential for data scientists to learn in order to work effectively with business people?

Let me approach that from a contrapositive position: the worst possible move is to buy the line that “data scientists are better at programming than statisticians and better at statistics than programmers.” Those skills are necessary, but not sufficient, in data science. We live in a large, rich, complex world, and the problems ahead require substantial understanding of how to model the physical world – exemplified by the rise of IoT. Meanwhile, the decisions get made by people who think in terms of business and governance.

My advice is to become fluent in those respective areas. Get some physics under your belt to learn the modeling. Also become adept at how to articulate and persuade based on data, quantitative analysis, and repeatable results. That implies learning how to visualize data expertly, how to write clearly, how to speak convincingly, and also having enough business experience and leadership skills to make your insights and guidance matter. It won’t all happen at once; it will take years – with lots of rewarding work throughout. Galvanize is an excellent place to start and gain the right foundations, plus real industry experience and connections. The curriculum is balanced, has excellent breadth, and provides a wealth of hands-on practice.

Data science and machine learning are often seen exclusively through the lens of Silicon Valley. What are some of the interesting applications for data science outside of the tech world?

Definitely, I’d rather not simply look through the lens of Silicon Valley. There are immense data problems to be resolved in industry: manufacturing, healthcare, education, agriculture. Near term, I have a hunch that manufacturing is well-positioned to benefit from the IoT advances, in terms of instrumenting factories and supply chains, then leveraging that data with machine learning at scale. Commodity clusters and some of the more advanced approaches available now as open source (e.g., streaming analytics, approximation algorithms, machine learning at scale) represent game changers for an industry where many big players are still using FORTRAN and mainframes.

Agriculture – and along with it a range of environmental and social action globally – is to me one of the most interesting long-term areas for data science applications. There are over half a billion small farms worldwide, and most are family-run farms that rely on rain-fed agriculture. The “Green Revolution” of the 1960s gave rise to Monsanto-esque approaches and our current dependencies: ill-conceived genomics hacks, dwindling sources of nitrogen (petrochemicals or wild-caught seafood) and phosphorus (limited mining), not to mention the available fresh water and topsoil (2-3% loss annually) – none of which will be sustainable for more than a few decades.

Keep in mind that this sector is responsible for more than $15T in annual GDP globally. We have technologies within reach for a very different kind of revolution in agriculture, this time by actual Greens. We will need that.

What are some of the best resources and tools for prospective data scientists?

Dive deeply into the math: especially linear algebra, graph theory, and optimization theory. Meanwhile, get a good footing in probability theory, Bayesian statistics, probabilistic data structures, approximation algorithms, etc.
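To make “probabilistic data structures” concrete, here is a toy Bloom filter in Python: approximate set membership with no false negatives and a tunable false-positive rate. The bit-array size and hash construction are illustrative choices, not production-tuned values.

```python
# A toy Bloom filter: membership queries never miss items that were added,
# but may occasionally report a false positive. Parameters are illustrative.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive several bit positions per item from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("spark")
print(bf.might_contain("spark"))   # True
print(bf.might_contain("hadoop"))  # False, with high probability
```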

Learn to visualize insights in the data. Learn the elements of effective design. Learn to write well. Learn to teach and mentor – those are essential for data science, because you will be mentoring executives. Lightning talks are great practice; use lots of illustrations, take risks, make mistakes. Write examples for blog posts. Present at meetups.

Meanwhile, in terms of tools, we’re a generation beyond Hadoop and MapReduce now – given that work was based on hardware from a decade ago. There’s so much available in open source projects based on Python, Scala, Clojure, Go, etc. Of course I’m a big fan of Spark, but also the other parts of the puzzle such as Cassandra, Kafka, Tachyon, Mesos, Weave, Elasticsearch. We could go on about containers and microservices.

I would add this caution: don’t get caught up in thinking like an application developer. When all you have is a hammer, too many things begin to look like nails. The reality is that APIs and IDEs don’t solve hard problems in data science. They help you build scalable pipelines – and delivering better-quality data is crucial; arguably, Google makes the point that having more data is core to its strategy. However, creating good features out of that data is much more important than the plumbing, the scale, or the machine learning algorithms that consume the featurized data. Feature engineering is hard. Perhaps we’ll get some automation to help, in terms of deep learning, symbolic regression, etc. Evaluating the results of a machine learning pipeline is even more important, and even harder to do well. Meanwhile, domain knowledge trumps all – which is to say that real value and expertise lie closest to the use cases, not the APIs.
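As a small, entirely hypothetical illustration of that point, consider the sketch below: the raw columns are just plumbing, while a derived feature that a domain expert would suggest carries most of the signal. The column names and values are made up.

```python
# Hypothetical feature-engineering sketch: the raw columns separate the two
# classes poorly, but a domain-motivated ratio separates them cleanly.
import pandas as pd

raw = pd.DataFrame({
    "bytes_sent": [120, 98_000, 150, 87_500],
    "session_seconds": [60, 70, 45, 50],
    "label": [0, 1, 0, 1],  # e.g. normal traffic vs. suspected exfiltration
})

# Engineered feature: throughput per second, which a network analyst would
# suggest; it carries far more signal than either raw column on its own.
raw["bytes_per_second"] = raw["bytes_sent"] / raw["session_seconds"]
print(raw.sort_values("bytes_per_second"))
```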

What’s the most exciting problem right now in data science?

Looking at firms like Planet Labs – as well as related start-ups acquired by Google, Facebook, Alibaba, etc. – high-resolution satellite data will soon become available as commodity data. A big challenge is to integrate that with sensors closer to the ground: IoT sensors in most vehicles, sensor arrays on major infrastructure such as roads, pipelines, etc. Companies such as Keen.io and Virdata come to mind… This work can address difficult problems in energy, transportation, environment, etc.; however, the data science requirements are daunting. Much higher data rates, along with “needle in a haystack” problems that are amplified way beyond what we’ve seen with social networks and ecommerce.

I also look toward very interesting advances in health care, with companies such as Enlitic, Lumiata, Ayasdi, and IBM Watson. Deep learning and cognitive computing are being applied to tackle problems in cancer detection, more targeted diabetes diagnosis, improved clinical analysis, etc. Lightning-fast personal genomics will likely play a big role there. Advanced wearables such as Spire, Misfit, Jawbone, etc., will likely begin to play an important role too. Silicon Valley tried to go after medical “killer apps” a few years ago – rather unsuccessfully – by attempting to re-engineer medicine based on techniques from social networks, mobile apps, etc. This time, actual domain experts are in the game, leveraging advanced work in machine learning with excellent results. Those are huge problem areas to tackle, but it is quite exciting to see the work that’s in progress.

I think the lesson is that domain expertise trumps all, but with a firm foundation in data science one can branch out into complex areas such as planetary science or medicine and work alongside domain experts to begin to resolve really hard problems.

Want more data science? Learn about the GalvanizeU curriculum and sign up for the Gradient, our weekly data science newsletter.