Hadoop: The Definitive Guide

Tom White

Publisher

O'Reilly Media

Publication date

2015-04-11

ISBN

9781491901632

Rating

★★★★★

Book description

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.

Learn fundamental components such as MapReduce, HDFS, and YARN

Explore MapReduce in depth, including steps for developing applications with it

Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN

Learn two data formats: Avro for data serialization and Parquet for nested data

Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)

Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop

Learn the HBase distributed database and the ZooKeeper distributed configuration service

Hadoop Fundamentals

Chapter 1 Meet Hadoop

Data!

Data Storage and Analysis

Querying All Your Data


User reviews

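Reading the first two parts is enough; there is no need to go deep on the related Pig, Hive, and Spark material unless you will use them in practice. I read the three Google papers in an undergraduate course, so skimming this book went quickly.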

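Finished it. This was my first exposure to big data. The book's coverage is quite comprehensive: the first part explains the principles, the middle introduces Hadoop-based projects in detail, and the end gives concrete application examples. There are many places I do not yet understand thoroughly and will need to read further.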

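2016 No. 4. Accessible yet deep, with the principles explained very thoroughly. The core is the two chapters Hadoop Fundamentals and MapReduce, but the later Related Projects material is also concise and brings out the key points. For example, the Flume chapter mentions some essentials that even the tutorial on the official Flume site does not cover.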

Luckily I do not need to write Java when I use it (

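Damn, it is long. It introduces most of the tools in the ecosystem and works well for summary and review. Readers without hands-on experience can read the first two parts, the MapReduce and YARN core, and then just skim what each of the later tools is for.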

No need to read it in great detail; people in the Hadoop ecosystem rarely program directly against MapReduce these days. Hadoop-Spark-Flink.

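I have only just started (less than halfway through Part I), and every aspect is written quite clearly, as you would expect from a foundation member describing a system he helped design... Even the instructions for installing and configuring a Hadoop cluster are more detailed than the garbage official documentation (...). The difference in quality is obvious 🤣 (Update Dec 5, 2020) I skipped the introductions to Pig, Hive, and the other Apache ecosystem components, as well as the Case Studies. I now have the illusion of having completely mastered Hadoop. Still, sorry for having learned something nobody uses (half annoyed)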

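I read Part I Hadoop Fundamentals carefully and got a lot out of it as a beginner. I skimmed Part IV Related Projects to get a rough picture of the surrounding ecosystem. I wish the theory went deeper; I really love System Design these days.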

Great

A classic