2014. 10. 19. 00:28 - Phil lee

Hadoop Components: HBase, Pig, Hive, Avro

 Hadoop is an open-source implementation of the MapReduce paradigm.

 Components: higher-level abstractions built on top of Hadoop.


 HBase - a column-oriented database on top of Hadoop (HDFS).

 MapReduce is good for batch jobs, e.g. building an inverted index from scratch: read a huge set of web pages, extract the information, and invert it into an index.
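 The batch job described above can be sketched in plain Python. This is only an illustration of the map and reduce steps, not Hadoop code; the function names and the two sample pages are made up for the example, and a dictionary stands in for the shuffle phase.

```python
from collections import defaultdict


def map_page(url, text):
    """Map step: emit one (word, url) pair per distinct word on the page."""
    for word in set(text.lower().split()):
        yield word, url


def reduce_word(word, urls):
    """Reduce step: collect the sorted list of pages that contain the word."""
    return word, sorted(set(urls))


def build_inverted_index(pages):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for url, text in pages.items():
        for word, page_url in map_page(url, text):
            grouped[word].append(page_url)
    return dict(reduce_word(word, urls) for word, urls in grouped.items())


# Hypothetical input: page URL -> page text.
pages = {
    "a.html": "hadoop stores big data",
    "b.html": "hbase runs on hadoop",
}
index = build_inverted_index(pages)
```

 Note that the whole index is rebuilt from all pages on every run, which is exactly the batch behaviour these notes describe.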

 - In the real world, if even one web page changes, everything has to be rebuilt. Hadoop has no facility for updating.


 - HBase adds random-access read / write operations.

 - Based on Google Bigtable, for high scalability and flexibility.

 - It does not have a fixed schema, unlike an RDBMS.

 - Columns can be added freely inside a column family (the column families themselves are declared with the table).
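 A toy model of this data layout, as a rough sketch only. `ToyTable` is a hypothetical in-memory class, not the HBase client API. It assumes column families are fixed when the table is created, while column qualifiers inside a family appear on first write.

```python
class ToyTable:
    """Hypothetical in-memory stand-in for an HBase-style table."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}  # row key -> family -> qualifier -> value

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Random-access write: touches a single cell, creating it if needed.
        self.rows.setdefault(row, {}).setdefault(family, {})[qualifier] = value

    def get(self, row, family, qualifier):
        # Random-access read of a single cell; None if the cell does not exist.
        return self.rows.get(row, {}).get(family, {}).get(qualifier)


table = ToyTable(["content", "meta"])
table.put("com.example/a.html", "content", "html", "<p>v1</p>")
table.put("com.example/a.html", "meta", "lang", "en")              # new column, no schema change
table.put("com.example/a.html", "content", "html", "<p>v2</p>")    # update one page in place
```

 Only the changed page is rewritten here, in contrast to rebuilding everything in a batch job.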



 Pig - basic high-level processing


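 The problem: programming directly in MapReduce is hard.

 non-boilerplate -> not template code

 grunt work -> tedious chores

 boilerplate -> fixed, repetitive code, e.g. MapReduce job setup

 A real data-processing task often requires more than one MapReduce job.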

 ============ Pig's goal is to simplify this programming.

 Pig has richer data structures (multi-valued, nested).

 Pig Latin: the name of Pig's language -> unlike SQL, you build the query plan step by step.

 A schema describes the fields of a tuple and their data types (e.g. chararray).
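 The step-by-step style can be mimicked in Python, where each statement names an intermediate result. This is a Python stand-in, not Pig Latin; the comments only point at the kind of Pig operator each step resembles, and the sample records are hypothetical.

```python
from collections import defaultdict

# Each record is a tuple (user, url, seconds), i.e. (str, str, int).
records = [
    ("kim", "a.html", 30),
    ("lee", "a.html", 5),
    ("kim", "b.html", 12),
    ("park", "b.html", 45),
]

# Step 1 (like a FILTER): keep visits longer than 10 seconds.
long_visits = [r for r in records if r[2] > 10]

# Step 2 (like a GROUP BY url): collect the tuples for each url into a bag.
by_url = defaultdict(list)
for user, url, seconds in long_visits:
    by_url[url].append((user, url, seconds))

# Step 3 (like a FOREACH over the groups): one output tuple per url.
summary = sorted(
    (url, len(bag), sum(t[2] for t in bag)) for url, bag in by_url.items()
)
```

 The grouped value in step 2 is a nested, multi-valued structure (a bag of tuples per key), which is the kind of data structure the notes mention.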




 Hive - external tables vs. regular tables

 -> Similar to SQL.

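 External table: if you drop it -> the files still exist; only the metadata is deleted. So it is convenient for testing.

 Regular table: if you drop it -> the whole table is deleted, data and metadata.


 It means Hive has its own storage for regular tables.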
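 The difference in drop behaviour can be modelled with a small toy class. `ToyWarehouse` and the paths below are hypothetical and have nothing to do with Hive's real API; the sketch only shows which parts disappear on a drop.

```python
class ToyWarehouse:
    """Hypothetical model of table metadata plus the files behind it."""

    def __init__(self):
        self.metadata = {}  # table name -> (location, is_external)
        self.files = {}     # location -> file contents

    def create_table(self, name, location, external=False):
        self.files.setdefault(location, b"")
        self.metadata[name] = (location, external)

    def drop_table(self, name):
        location, external = self.metadata.pop(name)  # metadata always goes
        if not external:
            # Regular table: dropping removes the data as well.
            del self.files[location]


wh = ToyWarehouse()
wh.files["/data/logs"] = b"existing log data"
wh.create_table("logs_ext", "/data/logs", external=True)
wh.create_table("logs_regular", "/warehouse/logs_regular")
wh.drop_table("logs_ext")
wh.drop_table("logs_regular")
```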


 Tips) Avro is similar to Google Protocol Buffers (a language-independent way to encode structured data so it can be transferred over the network).
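 To make "encoding structured data" concrete, here is a hand-written sketch of how a two-field record could be turned into bytes, following Avro's binary encoding rules for `long` (zig-zag variable-length integer) and `string` (length, then UTF-8 bytes). It does not use the Avro library; the helper names and the list-of-pairs schema format are made up for the example.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zig-zag variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_record(schema, record):
    """Concatenate the encoded fields in schema order; field names are not written."""
    out = b""
    for name, kind in schema:
        value = record[name]
        if kind == "long":
            out += zigzag_varint(value)
        elif kind == "string":
            data = value.encode("utf-8")
            out += zigzag_varint(len(data)) + data
        else:
            raise ValueError(f"unsupported type in this sketch: {kind}")
    return out


# Hypothetical schema and record for illustration.
schema = [("a", "long"), ("b", "string")]
encoded = encode_record(schema, {"a": 27, "b": "foo"})
```

 Because only the values are written, the reader needs the same schema to decode the bytes, whatever language it is written in.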

Tips) Hadoop Archive (HAR)

To the user it behaves like a filesystem.

You can store files in a HAR and list what is inside it.


-> Packs many files into one archive

-> Stored as a directory on HDFS

Motivated by a weakness of HDFS -> the NameNode has to keep all file metadata in memory.

That's why HAR exists: a file archiving facility that packs files into HDFS blocks efficiently.

The HDFS NameNode has to keep the list of all files,

and main memory cannot hold metadata for that many files.

Enter HAR: files can be stored efficiently, like a zip file.

Unlike a zip file, whose contents are not visible through the filesystem, a HAR is stored as one unit

yet its contents can still be browsed. Packing multiple files into a single file avoids the memory problem.

We can look inside the archive without unpacking it.


1) It doesn't support compression, just packing.

2) It is immutable: files cannot be added or removed.

Regardless of the underlying file system,

you can specify a HAR path instead of an HDFS path.
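A back-of-the-envelope estimate of the memory problem. The 150 bytes per file or block entry is an assumed, commonly quoted rule of thumb rather than a measured value, and the file counts and sizes are hypothetical. The sketch also ignores the archive's own index files.

```python
import math

BYTES_PER_OBJECT = 150           # assumption: rough NameNode cost per file or block entry
BLOCK_SIZE = 128 * 1024 * 1024   # assumption: 128 MiB HDFS block size


def namenode_bytes(num_files, num_blocks):
    """Rough metadata footprint: one entry per file plus one per block."""
    return (num_files + num_blocks) * BYTES_PER_OBJECT


# Hypothetical workload: ten million files of 10 KiB each.
num_small_files = 10_000_000
file_size = 10 * 1024

# Stored individually: every small file occupies its own block.
unpacked = namenode_bytes(num_small_files, num_small_files)

# Packed into a single archive part file: blocks are filled completely.
packed_blocks = math.ceil(num_small_files * file_size / BLOCK_SIZE)
packed = namenode_bytes(1, packed_blocks)
```

Under these assumptions the individual files cost about 3 GB of NameNode memory, while the packed form costs on the order of a hundred kilobytes.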


1. Data integrity -> Checksum

2. Compression -> two benefits: 1) saves disk space 2) speeds up network transfer because of the reduced size
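Both points can be tried out with Python's standard `zlib` module. This is only a stand-in: Hadoop uses its own checksum and compression codec implementations, and the payload here is made-up sample data.

```python
import zlib

# Hypothetical, highly repetitive sample data.
payload = b"hadoop stores big data in blocks\n" * 1000

# 1. Data integrity: keep a checksum next to the data and recompute it on read.
stored_crc = zlib.crc32(payload)
corrupted = b"X" + payload[1:]                  # simulate one damaged byte
is_intact = zlib.crc32(payload) == stored_crc
is_corrupted_detected = zlib.crc32(corrupted) != stored_crc

# 2. Compression: fewer bytes to keep on disk and to send over the network.
compressed = zlib.compress(payload)
restored = zlib.decompress(compressed)
ratio = len(compressed) / len(payload)
```

How much is saved depends heavily on the data; repetitive text like this sample compresses far better than already-compressed or random data.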


