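Analyzing user behavior from scribe logs with Hadoop and Hive
Hive is a data warehouse tool built on Hadoop. It maps structured data files onto database tables and provides an SQL-like query language whose statements are compiled into MapReduce jobs. 54chen's installation notes are at: http://www.54chen.com/_linux_/hive-hadoop-how-to-install.html
The rest of this post describes how to use Hive and scribe together:
Create a Hive table that matches the scribe log format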
bin/hive
> create table log(active string,uuid string,ip string,dt string) row format delimited fields terminated by ',' collection items terminated by "\n" stored as textfile;
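To make the column mapping concrete, here is a minimal Python sketch of how one comma-delimited log line would be split into the four columns declared above. The helper name parse_log_line and the sample line are invented for illustration; real scribe output may look different.

```python
# Hypothetical helper: mirrors "fields terminated by ','" for a single log line.
COLUMNS = ("active", "uuid", "ip", "dt")


def parse_log_line(line):
    """Split one comma-delimited log line into the table's four columns."""
    fields = line.rstrip("\n").split(",")
    # Assumption: pad short lines with None and drop extra fields, which is
    # roughly how Hive's default text format treats such rows.
    fields = (fields + [None] * len(COLUMNS))[:len(COLUMNS)]
    return dict(zip(COLUMNS, fields))


if __name__ == "__main__":
    # Invented sample line, not real data.
    sample = "user_login_succ,u-001,10.0.0.1,2011-04-13 10:00:00\n"
    print(parse_log_line(sample))
```

Running a few real log lines through a check like this before loading them is a cheap way to spot lines that contain stray commas.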
Load the data
>LOAD DATA LOCAL INPATH '/opt/soft/hadoop-0.20.2/hive-0.7.0/data/log-2011-04-13*' OVERWRITE INTO TABLE log;
Query
>select count(*) from log group by uuid;
Hive runs the query as a MapReduce job, and after a short while the result comes back.
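For checking the result on a small sample, the same aggregation can be sketched locally in Python. This is only an illustration of what the group-by computes, not how Hive executes it. The function name count_by_uuid and the sample rows are made up. Note that the query above prints only the counts, while this sketch keeps each uuid next to its count.

```python
from collections import Counter


def count_by_uuid(rows):
    """Return {uuid: number of rows}, like count(*) ... group by uuid."""
    return Counter(row["uuid"] for row in rows)


if __name__ == "__main__":
    # Invented sample rows, not real data.
    rows = [
        {"active": "user_login_succ", "uuid": "u-001"},
        {"active": "user_logout", "uuid": "u-001"},
        {"active": "user_login_succ", "uuid": "u-002"},
    ]
    for uuid, n in sorted(count_by_uuid(rows).items()):
        print(uuid, n)
```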
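Transform data that is already loaded
cutter.py is a custom transform script: it reads rows from standard input and writes the transformed rows to standard output.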
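The original cutter.py is not shown, so the following is only a sketch of what such a script could look like. It relies on the fact that Hive's TRANSFORM passes each row to the script as one tab-separated line and reads tab-separated lines back. The cleanup done in transform_row (trimming whitespace, lower-casing the action name) is an assumption chosen for illustration.

```python
import sys


def transform_row(fields):
    """Hypothetical cleanup: trim every field and lower-case the action name."""
    fields = [f.strip() for f in fields]
    if fields:
        fields[0] = fields[0].lower()
    return fields


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hive sends one row per line, with columns separated by tabs.
    for line in stdin:
        fields = line.rstrip("\n").split("\t")
        # Write the row back in the same tab-separated layout.
        stdout.write("\t".join(transform_row(fields)) + "\n")


if __name__ == "__main__":
    main()
```

A script like this can be tried outside Hive by piping a tab-separated line into it, which makes it easier to debug before registering it with add file.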
cd bin/
./hive
>add file /opt/soft/hadoop-0.20.2/hive-0.7.0/bin/hive-shell/cutter.py;
>select transform (active,uuid,ip,dt) using 'python cutter.py' as (active,uuid,ip,dt) from log limit 1;
This returns the formatted result. To keep it, create a new table and write the transformed rows into it:
>create table log_new(active string,uuid string,ip string,dt string) row format delimited fields terminated by ',' collection items terminated by "\n" stored as textfile;
>INSERT OVERWRITE TABLE log_new select transform (active,uuid,ip,dt) using 'python cutter.py' as (active,uuid,ip,dt) from log;
Run Hive as a server (Thrift server)
bin/hive --service hiveserver
By default the Thrift service listens on port 10000.
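Before pointing a client at the server, it can help to confirm that something is listening on that port. The sketch below is a plain TCP check and nothing more: it does not speak Thrift, so a True result only means the port accepts connections. The function name is_port_open and the 127.0.0.1 address are assumptions for illustration.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed host; replace with the machine running the Hive server.
    print(is_port_open("127.0.0.1", 10000))
```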
Connect to Hive through the standard Thrift-based JDBC driver
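import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

    /**
     * @param args
     * @throws SQLException
     */
    public static void main(String[] args) throws SQLException {
        try {
            // Load the Hive JDBC driver class.
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        // Connect to the Hive server on its default port, with empty user and password.
        Connection con = DriverManager.getConnection("jdbc:hive://192.168.100.52:10000/default", "", "");
        Statement stmt = con.createStatement();
        // Count the distinct users with a successful login.
        ResultSet res = stmt.executeQuery("select count(distinct uuid) from usage_new where active='user_login_succ'");
        if (res.next()) {
            System.out.println(res.getString(1));
        }
    }
}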
Required jars (Maven pom)
<dependency>
<groupId>hadoop</groupId>
<artifactId>hive-jdbc</artifactId>
<version>0.7.0</version>
</dependency>
<dependency>
<groupId>hadoop</groupId>
<artifactId>hive-metastore</artifactId>
<version>0.7.0</version>
</dependency>
<dependency>
<groupId>hadoop</groupId>
<artifactId>hive-exec</artifactId>
<version>0.7.0</version>
</dependency>
<dependency>
<groupId>hadoop</groupId>
<artifactId>hive-service</artifactId>
<version>0.7.0</version>
</dependency>
<dependency>
<groupId>org.apache.thrift</groupId>
<artifactId>thrift</artifactId>
<version>0.5.0-xiaomi</version>
</dependency>
<dependency>
<groupId>facebook</groupId>
<artifactId>thrift-fb303</artifactId>
<version>0.5.0</version>
</dependency>
<dependency>
<groupId>hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>0.20.2</version>
</dependency>
<dependency>
<groupId>xerces</groupId>
<artifactId>xercesImpl</artifactId>
<version>2.9.1</version>
</dependency>
<dependency>
<groupId>xalan</groupId>
<artifactId>xalan</artifactId>
<version>2.7.1</version>
</dependency>
- Author: 54chen. Source: 五四陈科学院 (tagline: believe in science, share technology)
- Tags: hive, scribe, hadoop, behavior analysis
- Published: 2011-06-02 23:10:53