An Introduction to Hive's Five Data Import Methods
Questions to guide your reading:
1. What are the five ways of importing data into Hive?
2. What is the difference between the import command with and without the LOCAL keyword?
3. What do the OVERWRITE and INTO keywords do, and how do they differ?
4. What restriction applies to the path in the INPATH clause?
5. What is dynamic partition insert?
6. What settings does dynamic partition insert require?
Overview:
Since Hive has no row-level insert, update, or delete operations, the only way to get data into a table is to use one of its "bulk" load operations. We will use a file in the following format to demonstrate the five ways of importing data into Hive:
Tom 24 NanJing Nanjing University
Jack 29 NanJing Southeast China University
Mary Kake 21 SuZhou Suzhou University
John Doe 24 YangZhou YangZhou University
Bill King 23 XuZhou Xuzhou Normal University
The fields are separated by tabs and represent, in order: name, age, address, and school.
Method 1: Importing Data from the Local File System
(1) Create the test1 table
hive> CREATE TABLE test1(name STRING, age INT, address STRING, school STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 0.078 seconds
(2) Load the data from the local file system
hive> LOAD DATA LOCAL INPATH '/home/hadoopUser/data/test1.txt'
    > INTO TABLE test1;
Copying data from file:/home/hadoopUser/data/test1.txt
Copying file: file:/home/hadoopUser/data/test1.txt
Loading data to table hive.test1
Table hive.test1 stats: [numFiles=1, numRows=0, totalSize=201, rawDataSize=0]
OK
Time taken: 0.364 seconds
hive> select * from test1;
OK
Tom 24 NanJing Nanjing University
Jack 29 NanJing Southeast China University
Mary Kake 21 SuZhou Suzhou University
John Doe 24 YangZhou YangZhou University
Bill King 23 XuZhou Xuzhou Normal University
Time taken: 0.533 seconds, Fetched: 5 row(s)
Method 2: Loading Data into Hive from HDFS
(1) Empty the table created above
hive> INSERT OVERWRITE TABLE test1 SELECT * FROM test1 WHERE 1=0; -- empties the table; generally not recommended
hive> LOAD DATA INPATH '/input/test1.txt'
    > OVERWRITE INTO TABLE test1;
Loading data to table hive.test1
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted hdfs://secondmgt:8020/hive/warehouse/hive.db/test1
Table hive.test1 stats: [numFiles=1, numRows=0, totalSize=201, rawDataSize=0]
OK
Time taken: 0.355 seconds
hive> select * from test1;
OK
Tom 24 NanJing Nanjing University
Jack 29 NanJing Southeast China University
Mary Kake 21 SuZhou Suzhou University
John Doe 24 YangZhou YangZhou University
Bill King 23 XuZhou Xuzhou Normal University
Time taken: 0.054 seconds, Fetched: 5 row(s)
Note: there is no LOCAL keyword here, so the path refers to the distributed file system; this is the main difference from the first method. Also, as the log shows, because the OVERWRITE keyword was used, a delete ran first: the previously stored data was removed before the load was performed.
In addition, the file path given in the INPATH clause carries one restriction: it may not contain any subdirectories.
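To make the contrast concrete, the sketch below spells out the INTO-versus-OVERWRITE behavior, reusing the table and path from the example above (these statements were not part of the original transcript):

```sql
-- INTO appends: running this load twice would leave 10 rows in test1.
LOAD DATA LOCAL INPATH '/home/hadoopUser/data/test1.txt' INTO TABLE test1;

-- OVERWRITE replaces: the table's directory is cleared first,
-- so the table always ends up with exactly the 5 loaded rows.
LOAD DATA LOCAL INPATH '/home/hadoopUser/data/test1.txt' OVERWRITE INTO TABLE test1;

-- Since Hive 0.11, TRUNCATE is a cleaner way to empty a managed table
-- than the INSERT OVERWRITE ... WHERE 1=0 trick used earlier.
TRUNCATE TABLE test1;
```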
Method 3: Inserting Data into a Table via a Query
(1) Create the test4 table
hive> CREATE TABLE test4(name STRING, age FLOAT, address STRING, school STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 0.251 seconds
(2) Import data from a query result
hive> INSERT INTO TABLE test4 SELECT * FROM test1;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419317102229_0032, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0032/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0032
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-29 10:49:25,409 Stage-1 map = 0%, reduce = 0%
2014-12-29 10:49:36,900 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.83 sec
MapReduce Total cumulative CPU time: 2 seconds 830 msec
Ended Job = job_1419317102229_0032
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://secondmgt:8020/hive/scratchdir/hive_2014-12-29_10-49-10_009_3039854442660229613-1/-ext-10000
Loading data to table hive.test4
Table hive.test4 stats: [numFiles=1, numRows=5, totalSize=211, rawDataSize=206]
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.83 sec HDFS Read: 415 HDFS Write: 278 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 830 msec
OK
Time taken: 28.495 seconds
Note: the number of columns in the new table must match the number of columns returned by the SELECT, and the data types must be compatible. For example, test4 has four columns (name, age, address, and school), so the SELECT must also produce exactly those four columns.
(3) Check the imported data
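When the source table does not naturally line up with the target, listing the columns explicitly in the SELECT avoids this class of error. A sketch (not from the original transcript):

```sql
-- Select exactly the four columns test4 expects; the INT age from
-- test1 is implicitly widened to FLOAT.
INSERT INTO TABLE test4
SELECT name, age, address, school FROM test1;

-- Hive also supports multi-table insert: scan the source once and
-- feed several INSERT clauses (targets must already exist).
FROM test1
INSERT INTO TABLE test4 SELECT name, age, address, school WHERE age >= 24
INSERT INTO TABLE test4 SELECT name, age, address, school WHERE age < 24;
```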
hive> select * from test4;
OK
Tom 24.0 NanJing Nanjing University
Jack 29.0 NanJing Southeast China University
Mary Kake 21.0 SuZhou Suzhou University
John Doe 24.0 YangZhou YangZhou University
Bill King 23.0 XuZhou Xuzhou Normal University
Time taken: 0.066 seconds, Fetched: 5 row(s)
Method 4: Partition Inserts
There are two kinds of partition insert: static and dynamic. When static and dynamic partitions are mixed, the static partition keys must come before the dynamic ones. The two kinds are introduced in turn.
(1) Static partition insert
① Create a partitioned table
hive> CREATE TABLE test2(name STRING, address STRING, school STRING)
    > PARTITIONED BY(age FLOAT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 0.144 seconds
This creates a table test2 partitioned by age.
② Import data from a query result
hive> INSERT INTO TABLE test2 PARTITION (age='24') SELECT * FROM test1;
FAILED: SemanticException [Error 10044]: Line 1:19 Cannot insert into target table because column number/types are different ''24'': Table insclause-0 has 3 columns, but query has 4 columns.
hive> INSERT INTO TABLE test2 PARTITION (age='24') SELECT name,address,school FROM test1;
hive> select * from test2;
OK
Tom NanJing Nanjing University 24.0
Jack NanJing Southeast China University 24.0
Mary Kake SuZhou Suzhou University 24.0
John Doe YangZhou YangZhou University 24.0
Bill King XuZhou Xuzhou Normal University 24.0
Time taken: 0.079 seconds, Fetched: 5 row(s)
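The rule that static partition keys must precede dynamic ones can be sketched with a hypothetical table partitioned on two keys (this table does not appear in the transcripts above):

```sql
-- Hypothetical table partitioned by (city, age).
CREATE TABLE test5(name STRING, school STRING)
PARTITIONED BY (city STRING, age FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Mixed insert: city is static (a literal), age is dynamic (taken
-- from the trailing column of the SELECT). The static key must be
-- listed first in the PARTITION clause.
INSERT INTO TABLE test5 PARTITION (city='NanJing', age)
SELECT name, school, age FROM test1 WHERE address = 'NanJing';
```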
(2) Dynamic partition insert
With static partitioning, creating a large number of partitions means writing a correspondingly large number of SQL statements. Hive therefore provides dynamic partitioning, which infers the names of the partitions to create from the query itself.
① Create a partitioned table; this is identical to the static-partition case, so it is omitted here.
② Set the parameters
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
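Besides the two switches above, dynamic partitioning is guarded by several limits that may need raising for inserts with a wide partition fan-out. The property names below are real Hive settings; the values are only illustrative:

```sql
-- Maximum dynamic partitions a single mapper/reducer may create.
set hive.exec.max.dynamic.partitions.pernode=100;
-- Maximum dynamic partitions one statement may create in total.
set hive.exec.max.dynamic.partitions=1000;
-- Maximum HDFS files one statement may create.
set hive.exec.max.created.files=100000;
```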
③ Insert the data dynamically
hive> insert into table test2 partition (age) select name,address,school,age from test1;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1419317102229_0029, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0029/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0029
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-12-28 20:45:07,996 Stage-1 map = 0%, reduce = 0%
2014-12-28 20:45:21,488 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.67 sec
2014-12-28 20:45:32,926 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.32 sec
MapReduce Total cumulative CPU time: 7 seconds 320 msec
Ended Job = job_1419317102229_0029
Loading data to table hive.test2 partition (age=null)
Loading partition {age=29.0}
Loading partition {age=23.0}
Loading partition {age=21.0}
Loading partition {age=24.0}
Partition hive.test2{age=21.0} stats: [numFiles=1, numRows=1, totalSize=35, rawDataSize=34]
Partition hive.test2{age=23.0} stats: [numFiles=1, numRows=1, totalSize=42, rawDataSize=41]
Partition hive.test2{age=24.0} stats: [numFiles=1, numRows=2, totalSize=69, rawDataSize=67]
Partition hive.test2{age=29.0} stats: [numFiles=1, numRows=1, totalSize=40, rawDataSize=39]
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 7.32 sec HDFS Read: 415 HDFS Write: 375 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 320 msec
OK
Time taken: 41.846 seconds
④ Check the result
hive> select * from test2;
OK
Mary Kake SuZhou Suzhou University 21.0
Bill King XuZhou Xuzhou Normal University 23.0
John Doe YangZhou YangZhou University 24.0
Tom NanJing Nanjing University 24.0
Jack NanJing Southeast China University 29.0
Method 5: Creating a Table and Loading Data in a Single Query (CTAS)
In practice, a table's output may be too large to display conveniently on the console. In that situation it is very handy to store Hive's query output directly in a new table. This is known as CTAS (CREATE TABLE ... AS SELECT).
(1) Create the table
hive> CREATE TABLE test3
    > AS
    > SELECT name,age FROM test1;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1419317102229_0030, Tracking URL = http://secondmgt:8088/proxy/application_1419317102229_0030/
Kill Command = /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/bin/hadoop job -kill job_1419317102229_0030
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-28 20:59:59,375 Stage-1 map = 0%, reduce = 0%
2014-12-28 21:00:10,795 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.68 sec
MapReduce Total cumulative CPU time: 2 seconds 680 msec
Ended Job = job_1419317102229_0030
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://secondmgt:8020/hive/scratchdir/hive_2014-12-28_20-59-45_494_6763514583931347886-1/-ext-10001
Moving data to: hdfs://secondmgt:8020/hive/warehouse/hive.db/test3
Table hive.test3 stats: [numFiles=1, numRows=5, totalSize=63, rawDataSize=58]
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.68 sec HDFS Read: 415 HDFS Write: 129 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
Time taken: 26.583 seconds
hive> select * from test3;
OK
Tom 24.0
Jack 29.0
Mary Kake 21.0
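When even a new table is inconvenient, a related option (a sketch, not shown in the transcripts above) is to write the query result straight to a directory:

```sql
-- Writes the result files under the given HDFS directory, overwriting it.
INSERT OVERWRITE DIRECTORY '/output/test1_name_age'
SELECT name, age FROM test1;

-- With LOCAL, the result is written to the local file system instead.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test1_name_age'
SELECT name, age FROM test1;
```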