2016. 10. 15. 19:20 - Phil lee

Zeppelin + Python Spark SQL + Hive + Hadoop

 

Well, after joining Kiwi Plus, I have a bunch of tasks to handle, as everyone expects.

In fact, I really want to work through these tasks to improve my skill set.

Who knows? I feel like I should have thrown myself into these distributed frameworks and data analysis a long time ago.

 

Pseudo Code in Zeppelin with PySpark

import datetime

from pyspark.sql import HiveContext
import mysql.connector

yesterday = datetime.date.today() - datetime.timedelta(days=1)

# 'from' and 'to' below are placeholder date bounds, as in the original sketch;
# count(id) per day needs a group by agg_date to be valid HiveQL
sql = ("select agg_date, count(id) from location_log "
       "where agg_date between from and to group by agg_date")

sqlContext = HiveContext(sc)
df = sqlContext.sql(sql).toPandas()
df = df.set_index('agg_date')  # set_index returns a new DataFrame

cnx = mysql.connector.connect(…)  # connection details omitted
for agg_date in df.index:
    …
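The pandas half of the pseudocode (everything after toPandas()) can be rehearsed locally without a cluster; the agg_date values and counts below are made-up sample data standing in for the Hive query result:

```python
import datetime

import pandas as pd

# Made-up sample of what sqlContext.sql(sql).toPandas() would return.
df = pd.DataFrame({
    "agg_date": [datetime.date(2016, 10, 13), datetime.date(2016, 10, 14)],
    "cnt": [120, 95],
})

# set_index returns a new DataFrame; reassign it (the pseudocode drops the result).
df = df.set_index("agg_date")

# Iterate the dates, as the final for-loop does before writing rows to MySQL.
for agg_date in df.index:
    print(agg_date, df.loc[agg_date, "cnt"])
```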

 

Zeppelin

  • Provides spark and jdbc interpreters to access Spark execution and the Hive warehouse
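For example, a note can mix both interpreters in separate paragraphs (interpreter names as in a default Zeppelin install; the table name is the one from the pseudocode above):

```
%pyspark
df = sqlContext.sql("select * from location_log limit 10")

%jdbc(hive)
select count(*) from location_log
```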

Spark-SQL

  • Targets a specific Hive version for HiveQL compatibility.
  • Supports HiveContext, with useful functions for working with tables
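The aggregation query from the pseudocode needs a group by to be valid HiveQL; its shape can be checked with sqlite3 as a stand-in for Hive (table and column names from the pseudocode, data made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table location_log (id integer, agg_date text)")
conn.executemany(
    "insert into location_log values (?, ?)",
    [(1, "2016-10-14"), (2, "2016-10-14"), (3, "2016-10-15")],
)

# Same shape as the HiveQL in the pseudocode: a date-range filter plus a
# per-day count, which requires grouping by agg_date.
rows = conn.execute(
    "select agg_date, count(id) from location_log "
    "where agg_date between '2016-10-14' and '2016-10-15' "
    "group by agg_date order by agg_date"
).fetchall()
print(rows)  # → [('2016-10-14', 2), ('2016-10-15', 1)]
```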

Hive

  • Provides HiveContext with efficient configuration and functions
  • After setting the warehouse location in 'hive-site.xml', HiveContext in Spark can access Hive tables
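A minimal sketch of the relevant 'hive-site.xml' entries; the path and host are placeholders, and the file goes into Spark's conf/ directory so HiveContext picks it up:

```xml
<configuration>
  <!-- Where Hive stores table data on HDFS (placeholder path). -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <!-- Remote metastore service to contact (placeholder host). -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
```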

Reference - http://tomining.tistory.com/89