Hadoop in Practice

ALEX HOLMES

MANNING
SHELTER ISLAND


For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: orders@manning.com

©2012 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editor: Cynthia Kane
Copyeditors: Bob Herbstman, Tara Walsh
Proofreader: Katie Tennant
Typesetter: Gordan Salinovic
Illustrator: Martin Murtonen
Cover designer: Marija Tudor

ISBN 9781617290237
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 17 16 15 14 13 12


To Michal, Marie, Oliver, Ollie, Mish, and Anch



brief contents

PART 1  BACKGROUND AND FUNDAMENTALS ..............................1
        1  Hadoop in a heartbeat  3

PART 2  DATA LOGISTICS ..................................................25
        2  Moving data in and out of Hadoop  27
        3  Data serialization—working with text and beyond  83

PART 3  BIG DATA PATTERNS ............................................137
        4  Applying MapReduce patterns to big data  139
        5  Streamlining HDFS for big data  169
        6  Diagnosing and tuning performance problems  194

PART 4  DATA SCIENCE ...................................................251
        7  Utilizing data structures and algorithms  253
        8  Integrating R and Hadoop for statistics and more  285
        9  Predictive analytics with Mahout  305

PART 5  TAMING THE ELEPHANT ........................................333
       10  Hacking with Hive  335
       11  Programming pipelines with Pig  359
       12  Crunch and other technologies  394
       13  Testing and debugging  410


contents

preface xv
acknowledgments xvii
about this book xviii

PART 1  BACKGROUND AND FUNDAMENTALS ......................1

1  Hadoop in a heartbeat  3
   1.1  What is Hadoop?  4
   1.2  Running Hadoop  14
   1.3  Chapter summary  23

PART 2  DATA LOGISTICS .................................................25

2  Moving data in and out of Hadoop  27
   2.1  Key elements of ingress and egress  29
   2.2  Moving data into Hadoop  30
        TECHNIQUE 1  Pushing system log messages into HDFS with Flume  33
        TECHNIQUE 2  An automated mechanism to copy files into HDFS  43
        TECHNIQUE 3  Scheduling regular ingress activities with Oozie  48
        TECHNIQUE 4  Database ingress with MapReduce  53
        TECHNIQUE 5  Using Sqoop to import data from MySQL  58
        TECHNIQUE 6  HBase ingress into HDFS  68
        TECHNIQUE 7  MapReduce with HBase as a data source  70
   2.3  Moving data out of Hadoop  73
        TECHNIQUE 8  Automated file copying from HDFS  73
        TECHNIQUE 9  Using Sqoop to export data to MySQL  75
        TECHNIQUE 10  HDFS egress to HBase  78
        TECHNIQUE 11  Using HBase as a data sink in MapReduce  79
   2.4  Chapter summary  81

3  Data serialization—working with text and beyond  83
   3.1  Understanding inputs and outputs in MapReduce  84
   3.2  Processing common serialization formats  91
        TECHNIQUE 12  MapReduce and XML  91
        TECHNIQUE 13  MapReduce and JSON  95
   3.3  Big data serialization formats  99
        TECHNIQUE 14  Working with SequenceFiles  103
        TECHNIQUE 15  Integrating Protocol Buffers with MapReduce  110
        TECHNIQUE 16  Working with Thrift  117
        TECHNIQUE 17  Next-generation data serialization with MapReduce  120
   3.4  Custom file formats  127
        TECHNIQUE 18  Writing input and output formats for CSV  128
   3.5  Chapter summary  136

PART 3  BIG DATA PATTERNS ............................................137

4  Applying MapReduce patterns to big data  139
   4.1  Joining  140
        TECHNIQUE 19  Optimized repartition joins  142
        TECHNIQUE 20  Implementing a semi-join  148
   4.2  Sorting  155
        TECHNIQUE 21  Implementing a secondary sort  157
        TECHNIQUE 22  Sorting keys across multiple reducers  162
   4.3  Sampling  165
        TECHNIQUE 23  Reservoir sampling  165
   4.4  Chapter summary  168

5  Streamlining HDFS for big data  169
   5.1  Working with small files  170
        TECHNIQUE 24  Using Avro to store multiple small files  170
   5.2  Efficient storage with compression  178
        TECHNIQUE 25  Picking the right compression codec for your data  178
        TECHNIQUE 26  Compression with HDFS, MapReduce, Pig, and Hive  182
        TECHNIQUE 27  Splittable LZOP with MapReduce, Hive, and Pig  187
   5.3  Chapter summary  193

6  Diagnosing and tuning performance problems  194
   6.1  Measuring MapReduce and your environment  195
   6.2  Determining the cause of your performance woes  198
        TECHNIQUE 28  Investigating spikes in input data  200
        TECHNIQUE 29  Identifying map-side data skew problems  201
        TECHNIQUE 30  Determining if map tasks have an overall low throughput  203
        TECHNIQUE 31  Small files  204
        TECHNIQUE 32  Unsplittable files  206
        TECHNIQUE 33  Too few or too many reducers  208
        TECHNIQUE 34  Identifying reduce-side data skew problems  209
        TECHNIQUE 35  Determining if reduce tasks have an overall low throughput  211
        TECHNIQUE 36  Slow shuffle and sort  213
        TECHNIQUE 37  Competing jobs and scheduler throttling  215
        TECHNIQUE 38  Using stack dumps to discover unoptimized user code  216
        TECHNIQUE 39  Discovering hardware failures  218
        TECHNIQUE 40  CPU contention  219
        TECHNIQUE 41  Memory swapping  220
        TECHNIQUE 42  Disk health  222
        TECHNIQUE 43  Networking  224
   6.3  Visualization  226
        TECHNIQUE 44  Extracting and visualizing task execution times  227
   6.4  Tuning  229
        TECHNIQUE 45  Profiling your map and reduce tasks  230
        TECHNIQUE 46  Avoid the reducer  234
        TECHNIQUE 47  Filter and project  235
        TECHNIQUE 48  Using the combiner  236
        TECHNIQUE 49  Blazingly fast sorting with comparators  237
        TECHNIQUE 50  Collecting skewed data  242
        TECHNIQUE 51  Reduce skew mitigation  243
   6.5  Chapter summary  249

PART 4  DATA SCIENCE ...................................................251

7  Utilizing data structures and algorithms  253
   7.1  Modeling data and solving problems with graphs  254
        TECHNIQUE 52  Find the shortest distance between two users  256
        TECHNIQUE 53  Calculating FoFs  263
        TECHNIQUE 54  Calculate PageRank over a web graph  269
   7.2  Bloom filters  275
        TECHNIQUE 55  Parallelized Bloom filter creation in MapReduce  277
        TECHNIQUE 56  MapReduce semi-join with Bloom filters  281
   7.3  Chapter summary  284

8  Integrating R and Hadoop for statistics and more  285
   8.1  Comparing R and MapReduce integrations  286
   8.2  R fundamentals  288
   8.3  R and Streaming  290
        TECHNIQUE 57  Calculate the daily mean for stocks  290
        TECHNIQUE 58  Calculate the cumulative moving average for stocks  293
   8.4  Rhipe—Client-side R and Hadoop working together  297
        TECHNIQUE 59  Calculating the CMA using Rhipe  297
   8.5  RHadoop—a simpler integration of client-side R and Hadoop  301
        TECHNIQUE 60  Calculating CMA with RHadoop  302
   8.6  Chapter summary  304

9  Predictive analytics with Mahout  305
   9.1  Using recommenders to make product suggestions  306
        TECHNIQUE 61  Item-based recommenders using movie ratings  311
   9.2  Classification  314
        TECHNIQUE 62  Using Mahout to train and test a spam classifier  321
   9.3  Clustering with K-means  325
        TECHNIQUE 63  K-means with a synthetic 2D dataset  327
   9.4  Chapter summary  332

PART 5  TAMING THE ELEPHANT ........................................333

10  Hacking with Hive  335
    10.1  Hive fundamentals  336
    10.2  Data analytics with Hive  338
         TECHNIQUE 64  Loading log files  338
         TECHNIQUE 65  Writing UDFs and compressed partitioned tables  344
         TECHNIQUE 66  Tuning Hive joins  350
    10.3  Chapter summary  358

11  Programming pipelines with Pig  359
    11.1  Pig fundamentals  360
    11.2  Using Pig to find malicious actors in log data  362
         TECHNIQUE 67  Schema-rich Apache log loading  363
         TECHNIQUE 68  Reducing your data with filters and projection  368
         TECHNIQUE 69  Grouping and counting IP addresses  370
         TECHNIQUE 70  IP Geolocation using the distributed cache  375
         TECHNIQUE 71  Combining Pig with your scripts  378
         TECHNIQUE 72  Combining data in Pig  380
         TECHNIQUE 73  Sorting tuples  381
         TECHNIQUE 74  Storing data in SequenceFiles  382
    11.3  Optimizing user workflows with Pig  385
         TECHNIQUE 75  A four-step process to working rapidly with big data  385
    11.4  Performance  390
         TECHNIQUE 76  Pig optimizations  390
    11.5  Chapter summary  393

12  Crunch and other technologies  394
    12.1  What is Crunch?  395
    12.2  Finding the most popular URLs in your logs  401
         TECHNIQUE 77  Crunch log parsing and basic analytics  402
    12.3  Joins  405
         TECHNIQUE 78  Crunch’s repartition join  405
    12.4  Cascading  407
    12.5  Chapter summary  409

13  Testing and debugging  410
    13.1  Testing  410
         TECHNIQUE 79  Unit Testing MapReduce functions, jobs, and pipelines  413
         TECHNIQUE 80  Heavyweight job testing with the LocalJobRunner  421
    13.2  Debugging user space problems  424
         TECHNIQUE 81  Examining task logs  424
         TECHNIQUE 82  Pinpointing a problem Input Split  429
         TECHNIQUE 83  Figuring out the JVM startup arguments for a task  433
         TECHNIQUE 84  Debugging and error handling  433
    13.3  MapReduce gotchas  437
         TECHNIQUE 85  MapReduce anti-patterns  438
    13.4  Chapter summary  441

appendix A  Related technologies  443
appendix B  Hadoop built-in ingress and egress tools  471
appendix C  HDFS dissected  486
appendix D  Optimized MapReduce join frameworks  493
index  503


preface
I first encountered Hadoop in the fall of 2008 when I was working on an internet
crawl and analysis project at Verisign. My team was making discoveries similar to those
that Doug Cutting and others at Nutch had made several years earlier regarding how
to efficiently store and manage terabytes of crawled and analyzed data. At the time, we
were getting by with our home-grown distributed system, but the influx of a new data
stream and requirements to join that stream with our crawl data couldn’t be supported by our existing system in the required timelines.
After some research we came across the Hadoop project, which seemed to be a
perfect fit for our needs—it supported storing large volumes of data and provided a
mechanism to combine them. Within a few months we’d built and deployed a MapReduce application encompassing a number of MapReduce jobs, woven together with
our own MapReduce workflow management system onto a small cluster of 18 nodes. It
was a revelation to observe our MapReduce jobs crunching through our data in minutes. Of course we couldn’t anticipate the amount of time that we’d spend debugging
and performance-tuning our MapReduce jobs, not to mention the new roles we took
on as production administrators—the biggest surprise in this role was the number of
disk failures we encountered during those first few months supporting production!
As our experience and comfort level with Hadoop grew, we continued to build
more of our functionality using Hadoop to help with our scaling challenges. We also
started to evangelize the use of Hadoop within our organization and helped kick-start
other projects that were also facing big data challenges.
The greatest challenge we faced when working with Hadoop (and specifically
MapReduce) was relearning how to solve problems with it. MapReduce is its own
flavor of parallel programming, which is quite different from the in-JVM programming
that we were accustomed to. The biggest hurdle was the first one—training our brains
to think MapReduce, a topic which the book Hadoop in Action by Chuck Lam (Manning Publications, 2010) covers well.
After you’re used to thinking in MapReduce, the next challenge is typically related
to the logistics of working with Hadoop, such as how to move data in and out of HDFS,
and effective and efficient ways to work with data in Hadoop. These areas of Hadoop
haven’t received much coverage, and that’s what attracted me to the potential of this
book—that of going beyond the fundamental word-count Hadoop usages and covering some of the more tricky and dirty aspects of Hadoop.
As I’m sure many authors have experienced, I went into this project confidently
believing that writing this book was just a matter of transferring my experiences onto
paper. Boy, did I get a reality check, but not altogether an unpleasant one, because
writing introduced me to new approaches and tools that ultimately helped better my
own Hadoop abilities. I hope that you get as much out of reading this book as I did
writing it.



acknowledgments
First and foremost, I want to thank Michael Noll, who pushed me to write this book.
He also reviewed my early chapter drafts and helped mold the organization of the
book. I can’t express how much his support and encouragement has helped me
throughout the process.
I’m also indebted to Cynthia Kane, my development editor at Manning, who
coached me through writing this book and provided invaluable feedback on my work.
Among many notable “Aha!” moments I had while working with Cynthia, the biggest
one was when she steered me into leveraging visual aids to help explain some of the
complex concepts in this book.
I also want to say a big thank you to all the reviewers of this book: Aleksei Sergeevich, Alexander Luya, Asif Jan, Ayon Sinha, Bill Graham, Chris Nauroth, Eli Collins,
Ferdy Galema, Harsh Chouraria, Jeff Goldschrafe, Maha Alabduljalil, Mark Kemna,
Oleksey Gayduk, Peter Krey, Philipp K. Janert, Sam Ritchie, Soren Macbeth, Ted Dunning, Yunkai Zhang, and Zhenhua Guo.
Jonathan Seidman, the primary technical editor, did a great job reviewing the
entire book shortly before it went into production. Many thanks to Josh Wills, the creator of Crunch, who kindly looked over the chapter that covers that topic. And more
thanks go to Josh Patterson, who reviewed my Mahout chapter.
All of the Manning staff were a pleasure to work with, and a special shout-out goes
to Troy Mott, Katie Tennant, Nick Chase, Tara Walsh, Bob Herbstman, Michael Stephens, Marjan Bace, and Maureen Spencer.
Finally, a special thanks to my wife, Michal, who had to put up with a cranky husband
working crazy hours. She was a source of encouragement throughout the entire process.



about this book
Doug Cutting, Hadoop’s creator, likes to call Hadoop the kernel for big data, and I’d
tend to agree. With its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets. Hadoop, to me, provides a bridge between structured (RDBMS) and unstructured (log files, XML, text)
data, and allows these datasets to be easily joined together. This has evolved from traditional use cases, such as combining OLTP and log files, to more sophisticated uses,
such as using Hadoop for data warehousing (exemplified by Facebook) and the field
of data science, which studies and makes new discoveries about data.
This book collects a number of intermediary and advanced Hadoop examples and
presents them in a problem/solution format. Each of the 85 techniques addresses a
specific task you’ll face, like using Flume to move log files into Hadoop or using
Mahout for predictive analysis. Each problem is explored step by step and, as you work
through them, you’ll find yourself growing more comfortable with Hadoop and at
home in the world of big data.
This hands-on book targets users who have some practical experience with
Hadoop and understand the basic concepts of MapReduce and HDFS. Manning’s
Hadoop in Action by Chuck Lam contains the necessary prerequisites to understand
and apply the techniques covered in this book.
Many techniques in this book are Java-based, which means readers are expected to
possess an intermediate-level knowledge of Java. An excellent text for all levels of Java
users is Effective Java, Second Edition, by Joshua Bloch (Addison-Wesley, 2008).




Roadmap
This book has 13 chapters divided into five parts.
Part 1 contains a single chapter that’s the introduction to this book. It reviews
Hadoop basics and looks at how to get Hadoop up and running on a single host. It
wraps up with a walk-through on how to write and execute a MapReduce job.
Part 2, “Data logistics,” consists of two chapters that cover the techniques and
tools required to deal with data fundamentals, getting data in and out of Hadoop,
and how to work with various data formats. Getting data into Hadoop is one of the
first roadblocks commonly encountered when working with Hadoop, and chapter 2
is dedicated to looking at a variety of tools that work with common enterprise data
sources. Chapter 3 covers how to work with ubiquitous data formats such as XML
and JSON in MapReduce, before going on to look at data formats better suited to
working with big data.
Part 3 is called “Big data patterns,” and looks at techniques to help you work effectively with large volumes of data. Chapter 4 examines how to optimize MapReduce
join and sort operations, and chapter 5 covers working with a large number of small
files, and compression. Chapter 6 looks at how to debug MapReduce performance
issues, and also covers a number of techniques to help make your jobs run faster.
Part 4 is all about “Data science,” and delves into the tools and methods that help
you make sense of your data. Chapter 7 covers how to represent data such as graphs
for use with MapReduce, and looks at several algorithms that operate on graph data.
Chapter 8 describes how R, a popular statistical and data mining platform, can be integrated with Hadoop. Chapter 9 describes how Mahout can be used in conjunction
with MapReduce for massively scalable predictive analytics.
Part 5 is titled “Taming the elephant,” and examines a number of technologies
that make it easier to work with MapReduce. Chapters 10 and 11 cover Hive and Pig
respectively, both of which are MapReduce domain-specific languages (DSLs) geared
at providing high-level abstractions. Chapter 12 looks at Crunch and Cascading, which
are Java libraries that offer their own MapReduce abstractions, and chapter 13 covers
techniques to help write unit tests, and to debug MapReduce problems.
The appendixes start with appendix A, which covers instructions on installing both
Hadoop and all the other related technologies covered in the book. Appendix B covers low-level Hadoop ingress/egress mechanisms that the tools covered in chapter 2
leverage. Appendix C looks at how HDFS supports reads and writes, and appendix D
covers a couple of MapReduce join frameworks written by the author and utilized in
chapter 4.

Code conventions and downloads
All source code in listings or in text is in a fixed-width font like this to separate it
from ordinary text. Code annotations accompany many of the listings, highlighting
important concepts.


All of the text and examples in this book work with Hadoop 0.20.x (and 1.x), and
most of the code is written using the newer org.apache.hadoop.mapreduce MapReduce
APIs. The few examples that leverage the older org.apache.hadoop.mapred package are
usually the result of working with a third-party library or a utility that only works with
the old API.
All of the code used in this book is available on GitHub at https://github.com/
alexholmes/hadoop-book as well as from the publisher’s website at www.manning
.com/HadoopinPractice.
Building the code depends on Java 1.6 or newer, git, and Maven 3.0 or newer. Git is
a source control management system, and GitHub provides hosted git repository services. Maven is used for the build system.
You can clone (download) my GitHub repository with the following command:
$ git clone git://github.com/alexholmes/hadoop-book.git

After the sources are downloaded you can build the code:
$ cd hadoop-book
$ mvn package

This will create a Java JAR file, target/hadoop-book-1.0.0-SNAPSHOT-jar-with-dependencies.jar. Running the code is equally simple with the included bin/run.sh.
If you’re running on a CDH distribution, the scripts will run configuration-free. If
you’re running on any other distribution, you’ll need to set the HADOOP_HOME environment variable to point to your Hadoop installation directory.
The bin/run.sh script takes as the first argument the fully qualified Java class name
of the example, followed by any arguments expected by the example class. As an
example, to run the inverted index MapReduce code from chapter 1, you’d run the
following:
$ hadoop fs -mkdir /tmp
$ hadoop fs -put test-data/ch1/* /tmp/
# replace the path below with the location of your Hadoop installation
# this isn't required if you are running CDH3
$ export HADOOP_HOME=/usr/local/hadoop
$ bin/run.sh com.manning.hip.ch1.InvertedIndexMapReduce \
    /tmp/file1.txt /tmp/file2.txt output

The previous code won’t work if you don’t have Hadoop installed. Please refer to
chapter 1 for CDH installation instructions, or appendix A for Apache installation
instructions.


Third-party libraries
I use a number of third-party libraries for the sake of convenience. They’re included
in the Maven-built JAR so there’s no extra work required to work with these libraries.
The following table lists the libraries that are in prevalent use throughout the
code examples.

Common third-party libraries

Apache Commons IO (http://commons.apache.org/io/)
Helper functions for working with input and output streams in Java. You'll make
frequent use of IOUtils to close connections and to read the contents of files
into strings.

Apache Commons Lang (http://commons.apache.org/lang/)
Helper functions for working with strings, dates, and collections. You'll make
frequent use of the StringUtils class for tokenization.

Datasets
Throughout this book you’ll work with three datasets to provide some variety for the
examples. All the datasets are small to make them easy to work with. Copies of the
exact data used are available in the GitHub repository in the directory https://
github.com/alexholmes/hadoop-book/tree/master/test-data. I also sometimes have
data that’s specific to a chapter, which exists within chapter-specific subdirectories
under the same GitHub location.
NASDAQ FINANCIAL STOCKS

I downloaded the NASDAQ daily exchange data from Infochimps (see http://
mng.bz/xjwc). I filtered this huge dataset down to just five stocks and their start-of-year values from 2000 through 2009. The data used for this book is available on
GitHub at https://github.com/alexholmes/hadoop-book/blob/master/test-data/
stocks.txt.
The data is in CSV form, and the fields are in the following order:
Symbol,Date,Open,High,Low,Close,Volume,Adj Close
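To make the field layout concrete, here's a minimal sketch of parsing a single record in that format. The sample line below is illustrative only and is not taken from the actual stocks.txt file.

```java
public class StockRecordParser {
    public static void main(String[] args) {
        // An illustrative line in the stocks.txt layout (the values are made up):
        // Symbol,Date,Open,High,Low,Close,Volume,Adj Close
        String line = "AAPL,2008-01-02,199.27,200.26,192.55,194.84,38542100,194.84";
        String[] fields = line.split(",");

        String symbol = fields[0];                     // Symbol
        String date = fields[1];                       // Date
        double close = Double.parseDouble(fields[5]);  // Close
        long volume = Long.parseLong(fields[6]);       // Volume

        System.out.println(symbol + " " + date
                + " close=" + close + " volume=" + volume);
    }
}
```

The same split-and-convert pattern applies inside a mapper that receives each line of the file as its input value.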

APACHE LOG DATA

I created a sample log file in Apache Common Log Format (see http://mng.bz/
L4S3) with some fake Class E IP addresses and some dummy resources and response
codes. The file is available on GitHub at https://github.com/alexholmes/hadoop-book/blob/master/test-data/apachelog.txt.
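As a sketch of what working with this dataset looks like, the following parses one line of Apache Common Log Format (host, identity, user, timestamp, request, status, bytes) with a regular expression. The sample line is illustrative, not taken from the actual file; the 240.0.0.0/4 address is from the reserved Class E range mentioned above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommonLogParser {
    // Apache Common Log Format: host ident authuser [date] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    public static void main(String[] args) {
        // Illustrative line in the same shape as the dataset (not from the file):
        String line = "240.12.7.1 - - [19/Jun/2012:09:23:51 -0400] "
            + "\"GET /index.html HTTP/1.1\" 404 1234";
        Matcher m = CLF.matcher(line);
        if (m.matches()) {
            System.out.println("ip=" + m.group(1)
                + " request=" + m.group(5)
                + " status=" + m.group(6));
        }
    }
}
```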


NAMES

The names dataset was retrieved from the US government census (see http://mng.bz/LuFB) and is available at https://github.com/alexholmes/hadoop-book/blob/master/test-data/names.txt.

Getting help
You’ll no doubt have questions when working with Hadoop. Luckily, between the wikis
and a vibrant user community your needs should be well covered.
The main wiki is located at http://wiki.apache.org/hadoop/, and contains useful
presentations, setup instructions, and troubleshooting instructions.
The Hadoop Common, HDFS, and MapReduce mailing lists can all be found on
http://hadoop.apache.org/mailing_lists.html.
Search Hadoop is a useful website that indexes all of Hadoop and its ecosystem
projects, and it provides full-text search capabilities: http://search-hadoop.com/.
You’ll find many useful blogs you should subscribe to in order to keep on top of
current events in Hadoop. Here’s a selection of my favorites:

- Cloudera is a prolific writer of practical applications of Hadoop:
  http://www.cloudera.com/blog/.
- The Hortonworks blog is worth reading; it discusses application and future
  Hadoop roadmap items: http://hortonworks.com/blog/.
- Michael Noll is one of the first bloggers to provide detailed setup instructions
  for Hadoop, and he continues to write about real-life challenges and uses of
  Hadoop: http://www.michael-noll.com/blog/.
There are a plethora of active Hadoop Twitter users who you may want to follow,
including Arun Murthy (@acmurthy), Tom White (@tom_e_white), Eric Sammer
(@esammer), Doug Cutting (@cutting), and Todd Lipcon (@tlipcon). The Hadoop
project itself tweets on @hadoop.

Author Online
Purchase of Hadoop in Practice includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and other users. To access and subscribe to
the forum, point your web browser to www.manning.com/HadoopinPractice or
www.manning.com/holmes/. These pages provide information on how to get on the
forum after you are registered, what kind of help is available, and the rules of conduct
on the forum.
Manning’s commitment to our readers is to provide a venue where a meaningful
dialogue between individual readers and between readers and the author can take
place. It’s not a commitment to any specific amount of participation on the part of the
author, whose contribution to the book’s forum remains voluntary (and unpaid). We
suggest you try asking him some challenging questions, lest his interest stray!


The Author Online forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

About the author
ALEX HOLMES is a senior software engineer with over 15 years of experience developing large-scale distributed Java systems. For the last four years he has gained expertise
in Hadoop solving big data problems across a number of projects. He has presented at
JavaOne and Jazoon and is currently a technical lead at VeriSign.
Alex maintains a Hadoop-related blog at http://grepalex.com, and is on Twitter at
https://twitter.com/grep_alex.

About the cover illustration
The figure on the cover of Hadoop in Practice is captioned “A young man from Kistanja,
Dalmatia.” The illustration is taken from a reproduction of an album of Croatian traditional costumes from the mid-nineteenth century by Nikola Arsenovic, published by
the Ethnographic Museum in Split, Croatia, in 2003. The illustrations were obtained
from a helpful librarian at the Ethnographic Museum in Split, itself situated in the
Roman core of the medieval center of the town: the ruins of Emperor Diocletian’s
retirement palace from around AD 304. The book includes finely colored illustrations
of figures from different regions of Croatia, accompanied by descriptions of the costumes and of everyday life.
Kistanja is a small town located in Bukovica, a geographical region in Croatia. It is
situated in northern Dalmatia, an area rich in Roman and Venetian history. The word
mamok in Croatian means a bachelor, beau, or suitor—a single young man who is of
courting age—and the young man on the cover, looking dapper in a crisp, white linen
shirt and a colorful, embroidered vest, is clearly dressed in his finest clothes, which
would be worn to church and for festive occasions—or to go calling on a young lady.
Dress codes and lifestyles have changed over the last 200 years, and the diversity by
region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants
of different continents, let alone of different hamlets or towns separated by only a few
miles. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life.
Manning celebrates the inventiveness and initiative of the computer business with
book covers based on the rich diversity of regional life of two centuries ago, brought
back to life by illustrations from old books and collections like this one.


