I am using Hive for work. When I created some external tables today, I forgot to type the EXTERNAL keyword, so the HiveQL looked like this:
CREATE TABLE year_2012_main (
some BIGINT,
fields BIGINT,
should BIGINT,
beee BIGINT,
here STRING,
buttt STRING,
Iveee STRING,
decide STRING,
tohide STRING,
them BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ' '
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE
LOCATION '/data/content/year_2012_main';
Then I tried SELECT COUNT(*) FROM year_2012_main; and it worked fine.
So, just out of curiosity, what’s the difference between creating the table with and without EXTERNAL?
A Hive table that’s not external is called a managed table. One of the main differences between the two is what happens on DROP TABLE: when an external table is dropped, the data associated with it (in your case /data/content/year_2012_main) doesn’t get deleted; only the metadata (number of columns, column types, terminators, etc.) is removed from the Hive metastore. When a managed table is dropped, both the metadata and the data are deleted.
I have so far always preferred making tables external because if the schema of my Hive table changes, I can just drop the external table and create another external table over the same HDFS data with the new schema. However, most (if not all) schema changes can now be made through ALTER TABLE or similar commands, so my preference for external tables over managed ones might be more of a legacy concern than a contemporary one. You can learn more about the terminology here.
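If you want to check what kind of table you ended up with, or switch it to external without re-creating it, something along these lines should work. This is a sketch rather than a definitive recipe: the exact behavior can vary between Hive versions and distributions, and the column name extra_field is hypothetical, used only to illustrate an in-place schema change.

```sql
-- Check whether the table is managed or external:
-- look for "Table Type" (MANAGED_TABLE or EXTERNAL_TABLE) in the output.
DESCRIBE FORMATTED year_2012_main;

-- Convert the existing managed table to external without re-creating it.
-- The key and value are written in upper case because some Hive versions
-- treat this property case-sensitively.
ALTER TABLE year_2012_main SET TBLPROPERTIES ('EXTERNAL'='TRUE');

-- Example of a schema change made in place instead of drop-and-recreate.
-- extra_field is a hypothetical column, not part of the original table.
ALTER TABLE year_2012_main ADD COLUMNS (extra_field STRING);

-- Once the table is external, dropping it removes only the metastore entry;
-- the files under /data/content/year_2012_main stay on HDFS.
DROP TABLE year_2012_main;
```

Run the DESCRIBE FORMATTED step again after the ALTER TABLE to confirm the table type actually changed before you rely on DROP TABLE leaving the data alone.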