If you work with data analytics, you have probably heard of, or are already using, Apache Hadoop*. By distributing work across the compute nodes of a cluster, Hadoop can analyze enormous volumes of data, both structured and unstructured, at high speed.
But did you know you can optimize Hadoop to deliver even better performance on Intel architecture? The key is to tune the underlying Java so that it takes advantage of capabilities in Intel hardware. When you do that, you can expect to see up to 70 percent faster performance on Hadoop sort operations. Keep reading to learn how.
Understanding Hadoop’s Java Foundation
A Hadoop cluster comprises a master node and multiple slave nodes, where data is stored and where analytics processing occurs. An incoming analytics request invokes several Java* services that enable efficient replication and large-scale analytics across nodes in a cluster.
Processing begins with two Java services on the master node: the NameNode, which tracks where data blocks reside, and the JobTracker, which schedules MapReduce tasks. These services then communicate with counterpart Java services on the slave nodes where the data is stored; see the image below for a simple view of what's going on.
When MapReduce* completes all the assigned analytics tasks, each slave node returns its results to the master node, which compiles the results from multiple nodes and returns them to the requester.
Important note: Hadoop spawns a new Java Virtual Machine* (JVM) for each MapReduce task on each slave node. This means that a large analytics job can result in the creation of thousands of individual JVMs. Because Hadoop does not share memory resources across nodes, each JVM and Java service must perform optimally. Reduced performance on any single node can hamper data analytics performance across the whole cluster.
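Because each task gets its own JVM, the options passed to those task JVMs matter. As an illustrative sketch only (the property names below follow Hadoop 1.x conventions; check the names for your Hadoop version), you can set per-task JVM flags and amortize JVM startup cost in mapred-site.xml:

```xml
<!-- mapred-site.xml: illustrative per-task JVM tuning (Hadoop 1.x property names) -->
<configuration>
  <!-- JVM flags passed to every spawned map/reduce task JVM -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1g -XX:+UseLargePages</value>
  </property>
  <!-- Reuse a task JVM for multiple tasks to amortize startup cost; -1 = unlimited reuse -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>
</configuration>
```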
Tune Java to Optimize Hadoop Performance
Given this foundation of Java services, it is easy to see why optimizing Java for Intel® architecture can deliver significant Hadoop performance gains. Better still, these optimizations are already built into Java and Intel architecture, which makes the improvements easy to achieve. When Intel releases a new microarchitecture and platform, Intel and Oracle software engineers work together to tune the JVM to take advantage of the new hardware features. These optimizations speed up Hadoop on each Intel-based node and, by extension, across the entire cluster.
How much faster? Up to 50 percent faster on a TeraSort* benchmark test and 70 percent faster on the Hadoop Sort benchmark. Read the white paper, Optimizing Java and Apache Hadoop for Intel Architecture, to see the details.
Since 2007, Oracle and Intel have improved Java performance by as much as 14 times by tying specific Java optimizations to advancements in the underlying hardware. Read the paper for a complete list of optimizations, but here's a taste of some that matter most to Hadoop performance:
- Fast CRC speeds up file checksum and compression/decompression checksum calculations, which improves Hadoop network and file system performance.
- Large-page memory usage improves the performance of large analytics jobs by reducing address-translation overhead.
- Intel® Advanced Vector Extensions improve the performance of MapReduce operations that involve array and string manipulation, such as sub-string or character searches; they also accelerate integer and floating-point calculations.
- Intel® Integrated Performance Primitives (Intel® IPP) provides optimized compression routines, which reduce network and disk input/output (I/O) across the Hadoop cluster.
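The Fast CRC optimization is visible even from plain Java. Hadoop checksums HDFS data with CRC-32, and recent HotSpot JVMs replace the java.util.zip.CRC32 computation with a hardware-accelerated intrinsic on supporting Intel processors; your code does not change at all. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class CrcDemo {
    // Compute a CRC-32 checksum the same way Hadoop verifies data blocks.
    // On recent JDKs running on supporting Intel hardware, HotSpot
    // intrinsifies this call into hardware-accelerated instructions.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        // 0xCBF43926 is the standard CRC-32 check value for "123456789"
        System.out.printf("crc32 = 0x%08X%n", checksum(data));
    }
}
```

The point is not the checksum itself but that the speedup is free: the same Java source runs faster on newer hardware because the JVM maps it onto the optimized instructions.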
If you're using Hadoop, or if analytics on large data sets is part of your job, check out the white paper for a full description of the Java optimizations available to enhance Hadoop performance on Intel architecture.
Follow me @TimIntel.
