I am working with a 2.5 GB database in which some tables have only 40 rows and others have 9 million rows.
Any query on a large table takes a long time.
I want results in less time.
Here is a small query on a table that has only 90 rows:
hive> select count(*) from cidade;
Time taken: 50.172 seconds
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.block.size</name>
<value>131072</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
Do these settings affect Hive's performance?
dfs.replication=3
dfs.block.size=131072
Can I set them from the hive prompt, like this?
hive> set dfs.replication=5;
Does this value remain in effect only for that particular session?
Or is it better to change it in the .xml file?
The important thing is that
select count(*) causes Hive to start a MapReduce job. You may expect it to be as fast as a MySQL query.
But even the simplest MapReduce job in Hadoop has fixed overhead: submitting the job to the JobTracker, assigning tasks to the TaskTrackers, and so on. So the total time is at least several tens of seconds.
Try
select count(*) on a big table. The time will not increase much. So you need to understand how Hive and Hadoop deal with big data.
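To see this for yourself, a session along the following lines should do. It is only a sketch: cidade is the table from the question, while big_table is a hypothetical name standing in for one of your 9-million-row tables. The last two statements touch on the session question: as far as I know, a value given with set at the hive prompt applies to the current session only and does not modify hdfs-site.xml.

```sql
-- Small table from the question (about 90 rows); this took about 50 s there.
select count(*) from cidade;

-- big_table is a placeholder name: substitute one of the large tables.
-- Compare the "Time taken" line with the one printed for the small table.
select count(*) from big_table;

-- Print the value currently in effect for this session.
set dfs.replication;

-- Override it for the current session only (assumption: later sessions
-- fall back to the value in the .xml file).
set dfs.replication=5;
```

If the two "Time taken" values are close, most of the 50 seconds is job startup overhead rather than the work of counting rows.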