If you're reading this, you probably already know what Snappy is and can skip this next part.

Snappy is a compression/decompression library developed by Google. Unlike other libraries that work at the extremes of the tradeoff, focusing either on compression speed or on compression efficiency, Snappy works somewhere in between: it significantly improves compression speed while still maintaining a reasonable compression ratio, which is crucial for improving the throughput of Hadoop jobs. This makes it a good choice if you're not yet experienced enough to know which codec to use. Speaking of its uses, Snappy is used in MapReduce as well as in Bigtable, MongoDB, etc.

When talking about compression codecs, we talk about qualities like speed, compression ratio, and whether the codec is splittable. Splittability matters especially for MapReduce/Hadoop: it means that a file compressed with the codec can still be split into chunks for different Mappers, which enables parallel processing and can really affect throughput.

There are common misconceptions about whether Snappy is splittable. This blog by Cloudera states: "For MapReduce, if you need your compressed data to be splittable, BZip2, LZO, and Snappy formats are splittable, but GZip is not. Splittability is not relevant to HBase data." (Link to the documentation.) The book 'Hadoop: The Definitive Guide', on the other hand, states that it isn't splittable. Even this blog by IBM states that Snappy is splittable (link below).

But thankfully the newer documentation by CDH (link below) clarifies these statements and gives a precise answer to our question: "For MapReduce, if you need your compressed data to be splittable, BZip2 and LZO formats can be split." Snappy is missing from that list: a raw Snappy-compressed file is not splittable, although Snappy blocks stored inside a container format such as SequenceFile or Avro can be split.
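
To make "splittable" concrete, here is a minimal Python sketch, not Hadoop code. It assumes `zlib` from the standard library as a stand-in for Snappy (which is not in the standard library), and the helper names `compress_whole`, `compress_blocks` and `read_block` are made up for illustration. It contrasts one compressed stream, which can only be read from the start, with independently compressed blocks, which is roughly how container formats such as SequenceFile or Avro let separate Mappers work on separate chunks.

```python
import zlib  # stand-in codec; Snappy itself is not in the standard library


def compress_whole(records):
    """Compress all records as a single stream (reader must start at byte 0)."""
    return zlib.compress(b"\n".join(records))


def compress_blocks(records, records_per_block=2):
    """Compress groups of records independently, container-style."""
    blocks = []
    for i in range(0, len(records), records_per_block):
        chunk = b"\n".join(records[i:i + records_per_block])
        blocks.append(zlib.compress(chunk))
    return blocks


def read_block(blocks, index):
    """Decompress one block without touching the others, as one mapper would."""
    return zlib.decompress(blocks[index]).split(b"\n")


records = [f"row-{n}".encode() for n in range(6)]
whole = compress_whole(records)
blocks = compress_blocks(records)

# A "mapper" assigned the second block can decode it on its own.
second = read_block(blocks, 1)

# Starting in the middle of the single stream fails: there is no
# block boundary from which a decompressor could resume.
try:
    zlib.decompress(whole[len(whole) // 2:])
    mid_stream_ok = True
except zlib.error:
    mid_stream_ok = False
```

The block-wise layout costs a little compression ratio, since each block is compressed without knowledge of the others, but it is what allows the work to be divided among parallel readers.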