2014. 10. 19. 00:28 - Phil lee

Hadoop Components: HBase, Pig, Hive, Avro

 Hadoop is an open-source implementation of the MapReduce paradigm.

 The components below provide higher-level abstractions on top of it.

 =============================================================

 HBase - a column-oriented data store on top of Hadoop (HDFS).

 MapReduce is good for batch jobs. Ex) building an inverted index for the first time: read a huge set of web pages, extract the information, and invert it into an index.
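
 As a rough illustration of the batch job described above, here is a minimal single-process sketch of an inverted index written in map/reduce style. It is not Hadoop code: the function names (map_page, reduce_word, build_inverted_index) and the sample pages are made up for this note, and the "shuffle" is just a dictionary.

```python
from collections import defaultdict

def map_page(url, text):
    """Map step: emit one (word, url) pair per distinct word on the page."""
    for word in set(text.lower().split()):
        yield word, url

def reduce_word(word, urls):
    """Reduce step: collect every page that contains the word."""
    return word, sorted(set(urls))

def build_inverted_index(pages):
    """Run the whole batch: map every page, group by word, reduce each group."""
    grouped = defaultdict(list)          # stands in for the shuffle phase
    for url, text in pages.items():
        for word, page_url in map_page(url, text):
            grouped[word].append(page_url)
    return dict(reduce_word(w, urls) for w, urls in grouped.items())

# Hypothetical sample input, not from the original notes.
pages = {
    "a.html": "hadoop stores big data",
    "b.html": "hbase runs on hadoop",
}
index = build_inverted_index(pages)
```

 Note that the only entry point is build_inverted_index over all pages; nothing here updates a single entry, which is the batch-only character of the approach.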

 - In the real world, if even one web page changes, the whole index has to be rebuilt. Hadoop has no facility for updating!

 

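 - HBase adds random-access read / write operations!

 - Based on Google Bigtable for high scalability and flexibility

 - It does not have a fixed schema, unlike an RDBMS

 - New columns can be added freely within a column family (the column families themselves are declared in the table schema)!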
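
 The points above can be pictured with a toy in-memory model. This is only a sketch and not the HBase API: ToyTable, its methods, and the row and column names are invented for illustration, and it assumes (as in HBase) that column families are declared with the table while columns inside a family are free-form.

```python
class ToyTable:
    """Toy in-memory model of a column-family table (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)    # declared when the table is created
        self.rows = {}                   # row key -> {"family:qualifier": value}

    def put(self, row_key, family, qualifier, value):
        """Random-access write: touch one cell without rewriting anything else."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[f"{family}:{qualifier}"] = value

    def get(self, row_key, family, qualifier):
        """Random-access read of one cell; None if the cell does not exist."""
        return self.rows.get(row_key, {}).get(f"{family}:{qualifier}")

table = ToyTable(families=["contents", "anchor"])
table.put("com.example/index", "contents", "html", "<html>v1</html>")
table.put("com.example/index", "contents", "html", "<html>v2</html>")  # update one cell
table.put("com.example/index", "anchor", "other.org", "Example")       # new column, no schema change
```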

 

 ============================================================

  Pig - basic high-level data processing

 

 Programming directly in MapReduce is hard -> this is the problem

 non-boilerplate -> not template code

 gruntwork -> tedious work

 boilerplate -> fixed, repeated code ex) MapReduce

 A typical data processing task requires more than one MapReduce job

 ============ Pig's goal is to simplify programming

 Pig has rich data structures (multivalued, nested)

 Pig Latin: the name of Pig's language -> unlike SQL, you have to build the query plan step by step.

 A schema gives the fields of a tuple names and data types (ex) chararray)
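
 To show what "step by step" means, here is a small Python sketch that mimics a Pig-style data flow (filter, then group, then one output per group). It is not Pig Latin and does not run on Hadoop; the records and variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (user, url, seconds) tuples.
records = [
    ("alice", "a.html", 30),
    ("bob", "a.html", 5),
    ("alice", "b.html", 12),
    ("carol", "b.html", 45),
]

# Step 1 (like FILTER): keep visits longer than 10 seconds.
long_visits = [r for r in records if r[2] > 10]

# Step 2 (like GROUP): collect tuples by url; each group is a nested bag of tuples.
by_url = defaultdict(list)
for rec in long_visits:
    by_url[rec[1]].append(rec)

# Step 3 (like FOREACH ... GENERATE): produce one result per group.
visit_counts = {url: len(bag) for url, bag in by_url.items()}
```

 Each step names an intermediate relation, whereas a SQL query would state the whole result in one declarative statement.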

  

 Cascading

 

 Hive - external table -> like a temporary table

 -> Similar to SQL.

 External table: if you drop it -> the files still exist, only the metadata is deleted. So, only for testing

 Usual (managed) table: if you drop it -> the whole table is deleted, data included.

 

 This means Hive has its own storage!
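
 The drop behaviour can be simulated locally. The sketch below is a toy model and not the Hive API: ToyMetastore, make_data_dir, and the table names are invented, and a temporary local directory stands in for HDFS.

```python
import os
import shutil
import tempfile

class ToyMetastore:
    """Toy model of table metadata (not the Hive API)."""

    def __init__(self):
        self.tables = {}                 # name -> (data directory, is_external)

    def create_table(self, name, location, external):
        self.tables[name] = (location, external)

    def drop_table(self, name):
        location, external = self.tables.pop(name)   # metadata is always removed
        if not external:
            shutil.rmtree(location)                   # managed table: data is removed too

def make_data_dir(root, name):
    """Create a directory holding one small data file, as a stand-in for table data."""
    path = os.path.join(root, name)
    os.makedirs(path)
    with open(os.path.join(path, "part-0"), "w") as f:
        f.write("1\talice\n")
    return path

root = tempfile.mkdtemp()
store = ToyMetastore()
managed_dir = make_data_dir(root, "managed_tbl")
external_dir = make_data_dir(root, "external_tbl")
store.create_table("managed_tbl", managed_dir, external=False)
store.create_table("external_tbl", external_dir, external=True)

store.drop_table("managed_tbl")
store.drop_table("external_tbl")

managed_exists = os.path.exists(managed_dir)     # data deleted with the table
external_exists = os.path.exists(external_dir)   # data left in place
```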

=========================================================================

 Tips) Avro is similar to Google Protocol Buffers (how to encode structured data, such as a struct, to transfer it over the network, independent of language)
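
 To make "encoding structured data" concrete, here is a hand-written sketch of how Avro's binary encoding treats two primitive types: a long is zigzag-encoded and written as a variable-length integer, and a string is its byte length (as a long) followed by the UTF-8 bytes. This does not use the Avro library; the list-of-pairs schema, the record, and the function names are simplifications made up for this note (real Avro schemas are JSON).

```python
def encode_long(n):
    """Zigzag-encode a 64-bit signed integer, then write it as a varint."""
    z = (n << 1) ^ (n >> 63)             # zigzag: small magnitudes get small codes
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)    # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s):
    """Length of the UTF-8 bytes as a long, followed by the bytes themselves."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

def encode_record(schema, record):
    """Concatenate field encodings in schema order; field names are not written."""
    encoders = {"long": encode_long, "string": encode_string}
    return b"".join(encoders[ftype](record[fname]) for fname, ftype in schema)

# Hypothetical schema and record, for illustration only.
schema = [("id", "long"), ("name", "string")]
encoded = encode_record(schema, {"id": 64, "name": "pig"})
```

 Because only values are written, the reader needs the same schema to decode the bytes, which is what keeps the format compact and language-independent.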



Tips) Hadoop Archive (HAR)

A user-level filesystem feature

You can store files in a HAR and list its contents


HAR VS HDFS

-> An archive that packs many files together

-> Stored as a directory on HDFS

Because of a weakness of HDFS -> the namenode has to keep a large amount of metadata in memory.

That's why there is a file archiving facility that packs files into HDFS blocks efficiently.


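The HDFS namenode has to store the list of all files, so

with a very large number of files the metadata no longer fits in main memory.


Enter HAR: files can be stored efficiently, like a zip file!

But unlike a zip file, whose contents you cannot list directly, a HAR is stored as one directory, like a single file,

and you can still see what is inside it! Packing multiple files into a single file avoids the memory problem.


We can look inside the archive without unpacking it!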


Limitations

1) It does not support compression, only packing

2) Immutable: files cannot be added or removed after the archive is created


Just as with the usual filesystem,

you can specify a HAR file in place of an HDFS path!
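
The packing idea can be sketched in a few lines. This is not the HAR on-disk format or any Hadoop API: pack, list_archive, read_file, and the sample files are hypothetical, and the point is only that many small files become one blob plus an index that can be listed and read without unpacking.

```python
def pack(files):
    """Pack {name: bytes} into one data blob plus an index of (offset, length)."""
    index, parts, offset = {}, [], 0
    for name in sorted(files):
        data = files[name]
        index[name] = (offset, len(data))
        parts.append(data)
        offset += len(data)
    return index, b"".join(parts)

def list_archive(index):
    """List the archived file names using only the index."""
    return sorted(index)

def read_file(index, blob, name):
    """Read one file by slicing the blob; nothing else is unpacked."""
    offset, length = index[name]
    return blob[offset:offset + length]

# Hypothetical small files.
files = {"a.txt": b"alpha", "b.txt": b"bravo!", "c.txt": b"c"}
index, blob = pack(files)
```

In this picture the filesystem tracks one packed object instead of three separate files, and the bytes are stored as-is (packing only, no compression).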

===============================================================================================


1. Data integrity -> Checksum

2. Compression -> two benefits: 1) saves disk space 2) speeds up transfer over the network because of the reduced size
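
Both points can be tried with Python's standard zlib module. This is a sketch of the general idea, not how Hadoop implements it: store and load are hypothetical helpers, and CRC-32 over the whole payload is used here as a stand-in for whatever checksum scheme the filesystem actually applies.

```python
import zlib

def store(data):
    """Compress the data and record a CRC-32 of the stored bytes."""
    payload = zlib.compress(data)
    return payload, zlib.crc32(payload)

def load(payload, checksum):
    """Verify the checksum before decompressing; raise if the bytes changed."""
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch: data is corrupt")
    return zlib.decompress(payload)

original = b"hadoop " * 1000             # repetitive data compresses well
payload, checksum = store(original)
restored = load(payload, checksum)

# Simulate corruption by flipping one bit of the stored bytes.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
```

Calling load(corrupted, checksum) raises ValueError instead of returning bad data, and payload is much smaller than original, which is the saving on disk and on the network.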



Summary of Big Data File


Chapter 3 Big Data.pdf


Chapter 4 Big Data.pdf


Chapter 5 Big Data.pdf


Chapter 6 Big Data.pdf





 
