Data science is the hot topic of the moment, with its predictive modeling, machine learning, and data mining. But much of what data scientists do would not be possible, especially on a large scale, without data engineering. You could say that if data scientists are astronauts, data engineers built the rocket.
Data engineers design and build software to pull, clean, and normalize data, clearing the path for data scientists to explore that data and build models. In turn, data engineers deploy these models into production and apply them to live data.
Here are seven of the most important Big Data technologies that every data engineer should know:
The ideas behind Hadoop were first invented at Google, which published a series of papers in 2003 and 2004 describing how it stores and processes large amounts of data. In 2006, Doug Cutting and Mike Cafarella reverse-engineered Hadoop from Google's papers and made it open source. Cutting named the technology after his son's yellow toy elephant.
Hadoop is used when you have data in the terabyte or petabyte range—too large to fit on a single machine. It’s made up of HDFS, which lets you store data on a cluster of machines, and MapReduce, which lets you process data stored in HDFS. It lets you treat a cluster made up of hundreds or thousands of machines as a single machine. HDFS is the disk drive for this large machine, and MapReduce is the processor.
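The MapReduce model itself is simple enough to sketch in a few lines. The following is a single-machine simulation of its three phases (map, shuffle, reduce) in plain Python, not Hadoop code; on a real cluster each phase runs in parallel across many machines, with the shuffle moving data over the network.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big cluster"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts == {"big": 2, "data": 1, "cluster": 1}
```

Word count is the canonical MapReduce example because each phase is trivially parallel: any machine can count words in its own documents, and any machine can sum the counts for the keys assigned to it.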
Hadoop’s use is widespread for processing Big Data, though recently Spark has started replacing MapReduce.
Spark was created by Matei Zaharia at UC Berkeley's AMPLab in 2009 as a replacement for MapReduce. Like MapReduce, Spark lets you process data distributed across tens or hundreds of machines, but Spark keeps more of the data in memory, which produces results faster. Spark also has a simpler and cleaner API.
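Spark's chained, functional API is a large part of its appeal. The `MiniRDD` class below is a toy, single-machine imitation of that chaining style, not the real pyspark API, so it runs without a cluster; the word-count pipeline at the bottom has the same shape a real Spark job would.

```python
class MiniRDD:
    """A toy, single-machine stand-in for Spark's RDD chaining style.
    Not the real pyspark API; just the shape of it."""
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # Apply f to each element and flatten the results.
        return MiniRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        # Merge all values that share a key.
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return MiniRDD(acc.items())

    def collect(self):
        return self.data

counts = dict(
    MiniRDD(["big data", "big cluster"])
    .flatMap(str.split)
    .map(lambda w: (w, 1))
    .reduceByKey(lambda a, b: a + b)
    .collect()
)
# counts == {"big": 2, "data": 1, "cluster": 1}
```

Compare this five-line pipeline to the three separate phases of the MapReduce sketch above: the logic is the same, but Spark lets you express it as one chain.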
MapReduce survives mainly in two cases: when someone has an existing application they don't want to rewrite, or when Spark fails to scale to the job. Spark is the newer technology, and it can still fail on extremely large data sets.
Kafka handles the case of real-time data, meaning data that is arriving right now. Most other Big Data technologies handle the batch scenario, in which the data is already sitting in a cluster. Kafka represents a different way of looking at data: whereas Hadoop and HDFS treat data as stationary and at rest, Kafka treats data as being in motion.
Kafka is like TiVo for real-time data. It can buffer the data when it spikes so that the cluster can process it without becoming overwhelmed. If data is coming in faster than it can be processed, Kafka will store it.
In this way, Kafka is like other queuing systems, such as RabbitMQ and ActiveMQ. But Kafka can store far more data, Big Data volumes, because it is distributed across many machines. Kafka also provides fault tolerance: it retains data for a week by default, so if an application that was processing the data crashes, it can replay the messages from where it last stopped. Kafka can also be used as a multiplexer: when the same data needs to be consumed by different applications in the system, Kafka takes the incoming data and delivers it to every application that has subscribed.
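Conceptually, a Kafka topic is just an append-only log that each consumer reads at its own pace, tracked by an offset. The toy class below is not the real Kafka client API, but it illustrates the two properties just described: independent subscribers each see all the data, and a consumer resumes from its stored offset rather than losing messages.

```python
class ToyLog:
    """A minimal in-memory stand-in for a Kafka topic.
    Not the real Kafka client API; just the core idea."""
    def __init__(self):
        self.messages = []   # the append-only log
        self.offsets = {}    # how far each consumer has read

    def produce(self, msg):
        self.messages.append(msg)   # producers only ever append

    def consume(self, consumer):
        """Return every message this consumer hasn't seen yet."""
        start = self.offsets.get(consumer, 0)
        batch = self.messages[start:]
        self.offsets[consumer] = len(self.messages)
        return batch

log = ToyLog()
log.produce("txn-1")
log.produce("txn-2")

# Multiplexing: two independent subscribers each see every message.
a = log.consume("fraud-checker")   # ["txn-1", "txn-2"]
b = log.consume("dashboard")       # ["txn-1", "txn-2"]

# Resuming: a consumer picks up from its stored offset, so nothing
# already processed is re-read and nothing new is missed.
log.produce("txn-3")
c = log.consume("fraud-checker")   # ["txn-3"]
```

Real Kafka adds the parts this sketch leaves out: the log is partitioned and replicated across machines, and offsets are committed explicitly so a crashed consumer can choose where to replay from.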
Kafka was created by Jay Kreps and his team at LinkedIn, and was open sourced in 2011. The technology is relatively unique: there are other queuing systems, but none designed for the Big Data case, as they cannot handle the same volumes of data.
Storm is used for real-time processing. While Kafka stores real-time data and passes it onto systems that want to process it, Storm defines the logic to process events.
Storm processes records (Storm calls them tuples) one at a time, as they arrive in the system. For example, every time a credit card transaction is sent to a bank, a Storm application can analyze it and decide whether to approve or deny it.
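The per-event model is easy to sketch. The example below is plain Python rather than Storm's actual (Java-based) API, and the fixed-limit rule is a made-up placeholder for a real fraud model; the point is that each record is handled the moment it arrives rather than being accumulated into a batch first.

```python
def check_transaction(txn, limit=10_000):
    """Decide on one transaction as soon as it arrives.
    The fixed-limit rule is a placeholder for whatever model
    the bank actually applies."""
    return "deny" if txn["amount"] > limit else "approve"

# Events are handled one at a time, as a stream.
stream = [{"id": 1, "amount": 250}, {"id": 2, "amount": 50_000}]
decisions = [check_transaction(t) for t in stream]
# decisions == ["approve", "deny"]
```

In a real Storm topology, `check_transaction` would live inside a bolt, and Storm would run many copies of it in parallel across the cluster, feeding each one tuples from an incoming stream.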
Storm was the first widely used open-source system for real-time processing in the Hadoop ecosystem, but several competitors have since arisen. The primary one is Spark Streaming, which offers exactly-once semantics, meaning each message is processed exactly one time. Storm offers only at-least-once semantics, meaning a message may be processed more than once if a machine fails.
Storm is used instead of Spark Streaming when events must be processed the instant they arrive. Spark Streaming processes incoming events in small batches, so it can take a few seconds before an event is handled. When immediate processing is essential, Storm is superior to Spark Streaming. Other newer systems that provide real-time processing are Flink and Apex.
HBase is a NoSQL database that lets you store terabytes and petabytes of data. Like HDFS, HBase is intended for Big Data storage, but unlike HDFS, HBase lets you modify records after they are written.
This means HBase is used to store data that changes, such as a store's current inventory. It is not used for static data, such as the record of every past transaction; that sort of data is more likely to live in HDFS. Compared to HDFS, HBase offers very fast reads and writes of individual records.
HBase is based on the Bigtable architecture that Google described in its papers. Cassandra is another technology based on Bigtable, and the two frequently compete. HBase can scan faster than Cassandra because it keeps data sorted by row key; Cassandra, which does not maintain that global ordering, can write faster. Cassandra is also a standalone technology and does not require Hadoop. Because of this, HBase is often chosen when a company is already using Hadoop, whereas Cassandra is often preferred when a company needs a datastore that is easy to deploy without Hadoop.
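Two of the HBase properties above, mutable rows and fast range scans over sorted row keys, can be sketched with a toy table. This is plain Python, not the HBase API (real access goes through the Java client or wrappers like HappyBase), and the `sku-` row keys are invented for the inventory example.

```python
import bisect

class ToyTable:
    """A toy key-value table kept sorted by row key, HBase-style.
    Not the real HBase API; just two of its core properties."""
    def __init__(self):
        self.keys = []   # row keys, kept sorted at all times
        self.rows = {}

    def put(self, key, value):
        if key not in self.rows:
            bisect.insort(self.keys, key)   # insert in sorted position
        self.rows[key] = value              # rows are mutable: put overwrites

    def scan(self, start, stop):
        """Range scan: cheap because the keys are already sorted."""
        i = bisect.bisect_left(self.keys, start)
        j = bisect.bisect_left(self.keys, stop)
        return [(k, self.rows[k]) for k in self.keys[i:j]]

inv = ToyTable()
inv.put("sku-002", 7)
inv.put("sku-001", 3)
inv.put("sku-001", 2)   # update in place, unlike an HDFS file
items = inv.scan("sku-000", "sku-003")
# items == [("sku-001", 2), ("sku-002", 7)]
```

Keeping keys sorted is exactly why HBase scans quickly: a range scan touches one contiguous slice of the key space instead of the whole table.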
Hive is used for processing data stored in HDFS. It translates SQL into MapReduce, which makes the data far easier to query. Instead of waiting for Java programmers to write MapReduce jobs, data scientists can use Hive to run SQL directly against their Big Data.
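A Hive query looks like ordinary SQL. The example below uses Python's built-in sqlite3 as a stand-in engine, since HiveQL and SQLite overlap for a simple aggregate like this; the table and data are invented for illustration. On Hive, the same SELECT would be compiled into a MapReduce job over files in HDFS.

```python
import sqlite3

# sqlite3 stands in for Hive here; the SQL itself is the point.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)],
)

# On Hive, this query would be translated into a MapReduce job:
# the GROUP BY becomes the shuffle, the SUM becomes the reduce.
rows = conn.execute(
    "SELECT user, SUM(amount) FROM transactions"
    " GROUP BY user ORDER BY user"
).fetchall()
# rows == [("alice", 12.5), ("bob", 5.0)]
```

This is Hive's whole value proposition: a one-line GROUP BY replaces the map, shuffle, and reduce code a Java programmer would otherwise write by hand.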
Hive is now the primary way to run SQL against Big Data, but because querying is so central, it has many alternatives. MapReduce itself is still used when the algorithm is too low-level to express in SQL, while Pig is used when the data is highly unstructured. Spark SQL is another alternative: instead of running on MapReduce, it runs on Spark, and is therefore faster. Hive, however, is more reliable and has a richer SQL dialect, so it remains popular.
Yet another alternative is Impala, which also lets you query HDFS data using SQL. Impala skips MapReduce entirely and reads the data directly from HDFS, which makes it much faster than Hive, though again not as reliable.
Impala and Spark SQL are used for interactively exploring data, whereas Hive is used for batch processing, such as nightly batch jobs. Hive is reliable and fault tolerant, so a job won't stop if a machine crashes.
As mentioned above, Pig is similar to Hive because it lets data scientists write queries in a higher-level language instead of Java, enabling these queries to be much more concise. Pig translates a high-level scripting language called Pig Latin into MapReduce jobs.
Hive expects data to have more structure; for example, it expects all the data in a column to be the same type. Pig, on the other hand, does not require this kind of strictness. Pig's motto is "Pigs eat anything."
Pig is used when the data is unstructured or the records have different types. It's also popular with people who don't know SQL, such as developers, data engineers, and data administrators. Pig Latin is relatively similar to Perl or Bash, languages they are likely more comfortable in.
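The "eat anything" idea means processing records even when fields are missing or types disagree, where a strict, Hive-style schema would reject the input. The sketch below illustrates that tolerance in plain Python rather than Pig Latin, with invented records; a real Pig script would express the same thing as a LOAD, FOREACH, and GROUP dataflow.

```python
# Messy input: one amount is a string, one is missing entirely.
# A strict schema would reject this; Pig-style processing shrugs.
records = [
    {"user": "alice", "amount": 10},
    {"user": "bob", "amount": "5"},
    {"user": "carol"},
]

def to_number(value, default=0):
    """Coerce whatever showed up into a number, tolerating junk."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

total = sum(to_number(r.get("amount")) for r in records)
# total == 15.0
```

Hive would make you declare `amount` as a single type up front and clean the data first; Pig's schema-on-read approach lets you defer that decision until the script runs.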
As an added bonus, the Pig community has a great sense of humor, as seen in the terrifically bad puns used to name most Pig projects. The Pig shell is called Grunt, for example, and the Pig library website is called PiggyBank. Netflix also released a web UI for Pig called Lipstick.