Hadoop: an open-source implementation of the MapReduce paradigm.
Components!
Higher-level abstractions!
=============================================================
HBase - column-oriented database on top of Hadoop.
MapReduce is good for batch jobs, e.g. building an inverted index from scratch: read a huge set of web pages, extract info, invert the index.
- In the real world, if even one web page changes, everything has to be rebuilt. Hadoop has no facility for updating.
- HBase adds random-access read/write operations!
- Based on Google Bigtable for high scalability and flexibility
- No fixed schema, unlike an RDBMS
- Columns can be added freely within a column family (the families themselves are declared on the table)
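
The inverted-index batch job mentioned above can be sketched in plain Python. This is only a single-process imitation of the map, shuffle and reduce steps, not Hadoop code; the function names and the sample pages are made up for illustration.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per distinct word in a page."""
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_phase(word, doc_ids):
    """Reduce: collect every page that contains the word."""
    return word, sorted(doc_ids)

def build_inverted_index(pages):
    # Shuffle step: group the mapper output by key (the word).
    grouped = defaultdict(list)
    for doc_id, text in pages.items():
        for word, d in map_phase(doc_id, text):
            grouped[word].append(d)
    return dict(reduce_phase(w, ids) for w, ids in grouped.items())

# Hypothetical sample input: page id -> page text.
pages = {"page1": "hadoop stores big data", "page2": "hbase runs on hadoop"}
index = build_inverted_index(pages)
print(index["hadoop"])  # ['page1', 'page2']
```

Note that if page1 changes, the only option in this sketch is to call build_inverted_index again over all pages, which is the batch-only limitation the notes describe.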
============================================================
Pig - high-level data processing
Programming directly in MapReduce is hard -> that is the problem
non-boilerplate -> non-template
gruntwork -> tedious work
Boilerplate -> fixed code, e.g. MapReduce
Real tasks often require more than one MR job
============ Pig's goal is to simplify programming
Pig has data structures (multi-valued, nested)
Pig Latin: the name of Pig's language -> unlike SQL, you build the query plan step by step.
A schema describes a tuple's fields and their data types (e.g. chararray)
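
To make the "step by step" idea concrete, here is a rough Python imitation of a Pig-style data flow (filter, then group, then count). It is not Pig Latin and does not use any Pig API; the records and field layout are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: each record is a tuple (user: str, url: str, seconds: int).
records = [
    ("alice", "/home", 12),
    ("bob", "/home", 3),
    ("alice", "/docs", 40),
    ("bob", "/docs", 25),
]

# Step 1: FILTER, keep visits longer than 10 seconds.
long_visits = [r for r in records if r[2] > 10]

# Step 2: GROUP by url; each group holds a nested "bag" of tuples.
grouped = defaultdict(list)
for rec in long_visits:
    grouped[rec[1]].append(rec)

# Step 3: FOREACH group, generate (url, count).
counts = {url: len(bag) for url, bag in grouped.items()}
print(counts)  # {'/home': 1, '/docs': 2}
```

Each step names an intermediate result, which is the contrast with SQL: the plan is written out explicitly instead of being declared in one statement.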
Cascading
Hive - external table -> like a temporary table
-> Similar to SQL.
External table: if you drop it -> the files still exist, only the metadata is deleted. So, mainly for testing
Usual (managed) table: if you drop it -> the whole table is deleted, data included.
It means Hive has its own storage!
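
The drop behaviour above can be imitated with a toy catalog in Python. This is only a model of the idea (metadata vs. data files), not Hive's real API; ToyMetastore and make_data_dir are hypothetical names.

```python
import os
import shutil
import tempfile

class ToyMetastore:
    """Toy model of a table catalog; not Hive's real API."""

    def __init__(self):
        self.tables = {}  # table name -> (data directory, is_external)

    def create_table(self, name, data_dir, external=False):
        self.tables[name] = (data_dir, external)

    def drop_table(self, name):
        data_dir, external = self.tables.pop(name)  # metadata is always removed
        if not external:
            shutil.rmtree(data_dir)  # managed table: the data files go too

def make_data_dir():
    """Create a temporary directory holding one small data file."""
    d = tempfile.mkdtemp()
    with open(os.path.join(d, "part-00000"), "w") as f:
        f.write("1\talice\n")
    return d

store = ToyMetastore()
managed_dir, external_dir = make_data_dir(), make_data_dir()
store.create_table("managed_t", managed_dir)
store.create_table("external_t", external_dir, external=True)
store.drop_table("managed_t")
store.drop_table("external_t")
print(os.path.exists(managed_dir), os.path.exists(external_dir))  # False True
```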
=========================================================================
Tips) Avro is comparable to Google Protocol Buffers (how to encode structured data, i.e. structs, to transfer over the network, language-independent)
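
As a rough illustration of schema-driven encoding, the sketch below turns a record into bytes and back using Python's struct module. It is NOT Avro's actual wire format or API; SCHEMA, encode and decode are hypothetical. The point is only that the field names live in the schema, so the payload carries values alone.

```python
import struct

# Hypothetical schema: (field name, struct format code); "s" means UTF-8 string.
SCHEMA = [("id", "i"), ("name", "s"), ("score", "d")]

def encode(record):
    """Serialize a dict to bytes in schema order (big-endian)."""
    out = b""
    for field, kind in SCHEMA:
        if kind == "s":
            raw = record[field].encode("utf-8")
            out += struct.pack(">I", len(raw)) + raw  # length prefix, then bytes
        else:
            out += struct.pack(">" + kind, record[field])
    return out

def decode(data):
    """Rebuild the dict by walking the bytes in schema order."""
    record, pos = {}, 0
    for field, kind in SCHEMA:
        if kind == "s":
            (length,) = struct.unpack_from(">I", data, pos)
            pos += 4
            record[field] = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            (record[field],) = struct.unpack_from(">" + kind, data, pos)
            pos += struct.calcsize(">" + kind)
    return record

rec = {"id": 7, "name": "hbase", "score": 0.5}
wire = encode(rec)
print(len(wire), decode(wire) == rec)  # 21 True
```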
Tips) Hadoop Archive (HAR)
A feature at the user filesystem level
You can store files in a HAR and list what is inside
HAR vs HDFS
-> An archive that packs many files together
-> Stored as a directory on HDFS
Because of a weakness of HDFS -> the NameNode has to keep all file metadata in memory.
That's why there is a file archiving facility that packs files into HDFS blocks efficiently.
The HDFS NameNode has to keep the list of every file,
so with a very large number of files the metadata no longer fits in main memory.
HAR was introduced so they can be stored efficiently, like a zip file!
But unlike a zip file, whose file list you cannot browse in place, a HAR is stored as one directory, as a single archive,
and you can still see its contents! Multiple files are packed into a single file, which avoids the memory problem.
Without decompressing, we can look inside the archive!
Limitations
1) It doesn't support compression, just packing
2) Immutable: no adding or removing files
Regardless of the usual file system,
you can specify a HAR file instead of an HDFS path!
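
The packing idea can be sketched in Python as one data blob plus an index of offsets. This is a toy model under my own assumptions (pack, list_files and read_file are hypothetical names) and does not reproduce the real HAR on-disk layout or tooling.

```python
import json

def pack(files):
    """Pack {name: bytes} into one blob plus an index of (offset, length)."""
    index, parts, offset = {}, [], 0
    for name, data in files.items():
        index[name] = (offset, len(data))
        parts.append(data)
        offset += len(data)
    # No compression here: the bytes are only concatenated.
    return {"index": json.dumps(index), "data": b"".join(parts)}

def list_files(archive):
    """List the archived names by reading the index only."""
    return sorted(json.loads(archive["index"]))

def read_file(archive, name):
    """Read one file by slicing the blob; nothing is unpacked."""
    offset, length = json.loads(archive["index"])[name]
    return archive["data"][offset:offset + length]

archive = pack({"a.txt": b"hello", "b.txt": b"hi"})
print(list_files(archive))          # ['a.txt', 'b.txt']
print(read_file(archive, "b.txt"))  # b'hi'
```

Many small files become a few large objects, so far fewer entries need tracking. Because the index stores fixed offsets, adding or removing a file would mean rebuilding the archive, which matches the immutability limitation.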
===============================================================================================
1. Data integrity -> checksum
2. Compression -> two benefits: 1) saves disk space 2) speeds up network transfer because of the reduced size
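
Both points can be tried out with Python's zlib module. This is a small sketch, not how HDFS implements them; write_block and read_block are hypothetical helpers, and CRC32 with zlib compression is chosen here only for illustration.

```python
import zlib

def write_block(payload):
    """Compress a payload and keep a CRC32 checksum of the original bytes."""
    return {"crc": zlib.crc32(payload), "body": zlib.compress(payload)}

def read_block(block):
    """Decompress and verify the checksum before returning the data."""
    payload = zlib.decompress(block["body"])
    if zlib.crc32(payload) != block["crc"]:
        raise ValueError("checksum mismatch: data is corrupted")
    return payload

data = b"hadoop " * 1000
block = write_block(data)
# Original size vs. compressed size: the second number is much smaller.
print(len(data), len(block["body"]))
assert read_block(block) == data
```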
Summary of Big Data File