In this tutorial, I will show you how to enable LZO compression on Hadoop, Pig and Spark. I assume that you already have a basic Hadoop installation working (if not, please refer to other tutorials for Hadoop installation).
You probably reached this page because you ran into the same problem I did, which usually starts with a Java exception:
Apache and Cloudera are two of the most popular Hadoop distributions, so the configuration is shown for both. The tutorial walks through three main steps.
The native-lzo library is required to install hadoop-lzo. You can install it manually or with a package manager (NOTE: make sure every node in the cluster has it installed):
- On Mac OS:
- On RHEL or CentOS:
- On Debian or Ubuntu:
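
The commands for the three platforms above can be sketched as follows. The package names (`lzo`, `lzo-devel`, `liblzo2-dev`) are assumptions based on the usual Homebrew, yum and apt repositories, and `lzo_install_cmd` is a hypothetical helper. It only prints the command so that you can review it before running it on each node.

```shell
# Print (do not run) the install command for the native LZO library.
# Package names are assumptions; check your repository if one is missing.
lzo_install_cmd() {
  case "$1" in
    brew) echo "brew install lzo" ;;                    # Mac OS (Homebrew)
    yum)  echo "sudo yum install -y lzo lzo-devel" ;;   # RHEL / CentOS
    apt)  echo "sudo apt-get install -y liblzo2-dev" ;; # Debian / Ubuntu
    *)    echo "unsupported package manager: $1" >&2; return 1 ;;
  esac
}

# Example: show the command for a CentOS node.
# lzo_install_cmd yum
```

Printing instead of executing makes it easy to feed the same command to whatever tool you use to reach all nodes.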
Because LZO is GPL-licensed, it is not shipped with the official Hadoop distribution, which uses the Apache Software License. I recommend the Twitter version, a fork of hadoop-gpl-compression with notable improvements. If you are running the official Hadoop, follow the installation guide in the documentation.
In Cloudera's CDH, hadoop-lzo is shipped to customers as a parcel, and you can download and distribute it conveniently using Cloudera Manager. By default, hadoop-lzo is installed in /opt/cloudera/parcels/HADOOP_LZO. Here we show the configuration on our cluster (Cloudera CDH 5, HADOOP_LZO version 0.4.15).
The basic configuration is for Apache Hadoop; Pig piggybacks on it. First, register the compression codec libraries in the Hadoop configuration:
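
As a rough sketch of the codec settings, the snippet below prints the two properties that the hadoop-lzo README lists for registering the LZO codecs. It assumes they belong inside the `<configuration>` element of core-site.xml; `print_core_site_lzo` is a hypothetical helper name, and the list of non-LZO codecs should be adjusted to whatever your cluster already uses.

```shell
# Print the <property> entries that register the LZO codecs.
# Assumption: they go inside <configuration> in core-site.xml.
print_core_site_lzo() {
  cat <<'EOF'
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
EOF
}

print_core_site_lzo
```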
Then set the MapReduce compression options:
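
A minimal sketch of the MapReduce side, assuming the settings live in mapred-site.xml and that the goal is to compress intermediate map output with LZO. The property names below are the Hadoop 2 ones; on Hadoop 1 the equivalent names are `mapred.compress.map.output` and `mapred.map.output.compression.codec`. `print_mapred_site_lzo` is a hypothetical helper name.

```shell
# Print the <property> entries that turn on LZO for map output.
# Assumption: they go inside <configuration> in mapred-site.xml.
print_mapred_site_lzo() {
  cat <<'EOF'
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
EOF
}

print_mapred_site_lzo
```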
You can use Cloudera Manager to apply the same settings through its GUI. For MapReduce, change the corresponding values, as above, in the configuration tab.
Finally, restart the dependent services in the right order and deploy the configuration to all nodes. That's it! You can then test the functionality with a command and should see success messages similar to the following:
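
One possible smoke test, not necessarily the command used originally, is to run the `LzoIndexer` class that ships in the hadoop-lzo jar against an existing `.lzo` file in HDFS. In the sketch below, `lzo_index` and the `HADOOP_BIN` override are hypothetical names, and the jar location inside the parcel is an assumption you should verify on your own cluster.

```shell
# Index one .lzo file; if the native library or the codec settings are
# wrong, this is where the failure usually shows up.
# HADOOP_BIN is a hypothetical override, e.g. HADOOP_BIN=echo for a dry run.
lzo_index() {
  jar=$1    # path to the hadoop-lzo jar
  file=$2   # HDFS path of an existing .lzo file
  "${HADOOP_BIN:-hadoop}" jar "$jar" com.hadoop.compression.lzo.LzoIndexer "$file"
}

# Example (both paths are placeholders):
# lzo_index /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/hadoop-lzo.jar /tmp/sample.lzo
```

Setting `HADOOP_BIN=echo` prints the full command line instead of running it, which is useful for checking the paths first.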
Getting LZO to work with Spark took me a long time because earlier posts contain little information about it, but with the experience above the solution is straightforward. Whether Spark is installed from a tarball or via Cloudera Manager, you only need to append two path values to the Spark configuration:
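
The exact target of the two path values is not recoverable from the text above, so the following is only a sketch under assumptions: that the two values are the directory holding the hadoop-lzo jar and the directory holding its native libraries, and that they are supplied through Spark's documented `extraClassPath` and `extraLibraryPath` properties in spark-defaults.conf. `spark_lzo_conf` is a hypothetical helper.

```shell
# Print spark-defaults.conf lines that expose hadoop-lzo to Spark.
# $1 = directory with the hadoop-lzo jar, $2 = directory with the native libs.
spark_lzo_conf() {
  jar_dir=$1
  native_dir=$2
  for role in driver executor; do
    printf 'spark.%s.extraClassPath %s/*\n' "$role" "$jar_dir"
    printf 'spark.%s.extraLibraryPath %s\n' "$role" "$native_dir"
  done
}

# Example (parcel sub-directories are assumptions, check your install):
# spark_lzo_conf /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib \
#                /opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native
```

Both the driver and the executors get the same two paths, because either side may need to load the codec classes and the native library.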
A comparison of LZO performance is given elsewhere. A related question has also been asked on StackOverflow, but it had no solution at the time this tutorial was finished. You may also be interested in how to use the LZO Parcel from Cloudera.