I’d like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly.
Where does Hive store its files in HDFS?
The location where they are stored on HDFS is fairly easy to figure out once you know where to look. 🙂
If you go to http://NAMENODE_MACHINE_NAME:50070/ in your browser, it should take you to a page with a Browse the filesystem link.
In the $HIVE_HOME/conf directory there is a hive-default.xml and/or hive-site.xml file which has the hive.metastore.warehouse.dir property. That value is the path you will want to navigate to after clicking the Browse the filesystem link.
In mine, it’s /usr/hive/warehouse. Once I navigate to that location, I see the names of my tables. Clicking on a table name (which is just a folder) will then expose the partitions of the table. In my case, I currently only have it partitioned on date. When I click on the folder at this level, I will then see files (more partitioning will have more levels). These files are where the data is actually stored on HDFS.
I have not attempted to access these files directly, but I’m assuming it can be done. I would take GREAT care if you are thinking about editing them. 🙂
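As a rough illustration of the lookup described above, here is a minimal Python sketch that reads hive.metastore.warehouse.dir out of the contents of a hive-site.xml and builds the directory you would expect for a table and its partitions. The helper names (warehouse_dir, table_path) and the table name in the usage example are hypothetical. The sketch assumes the table lives in the default database and was created without a custom LOCATION, and that partition folders follow the usual key=value naming.

```python
import posixpath
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def warehouse_dir(hive_site_xml: str) -> Optional[str]:
    """Return the hive.metastore.warehouse.dir value from hive-site.xml text."""
    root = ET.fromstring(hive_site_xml)
    # Hadoop-style config files are a flat list of <property><name/><value/></property>.
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == "hive.metastore.warehouse.dir":
            return (prop.findtext("value") or "").strip()
    return None


def table_path(warehouse: str, table: str,
               partitions: Optional[Dict[str, str]] = None) -> str:
    """Build the expected HDFS directory for a table (hypothetical helper).

    Assumes the default database and no custom LOCATION on the table.
    """
    # Each partition level is a sub-directory named key=value.
    levels = [f"{key}={value}" for key, value in (partitions or {}).items()]
    return posixpath.join(warehouse, table, *levels)


if __name__ == "__main__":
    sample = """<configuration>
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/usr/hive/warehouse</value>
      </property>
    </configuration>"""
    base = warehouse_dir(sample)
    print(table_path(base, "my_table", {"date": "2020-01-01"}))
```

If a table was created with an explicit location, the path built this way will be wrong; running DESCRIBE FORMATTED on the table in Hive shows the location that is actually in use.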
Personally, I’d figure out a way to do what I need to without direct access to the Hive data on disk. If you need access to the raw data, you can use a Hive query and output the result to a file. The output will have exactly the same structure (column delimiter, etc.) as the files on HDFS. I do queries like this all the time and convert the results to CSVs.
The section about how to write data from queries to disk is https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
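For the CSV conversion step, something along these lines may be enough. It is only a sketch and assumes the files use Hive's default text format: fields separated by the Ctrl-A character (\x01) and NULL written as \N. If your table or query output uses a different delimiter or file format, this will not apply as written. The function name hive_text_to_csv is made up for the example.

```python
import csv
import io
from typing import Iterable

# Assumption: Hive's default field delimiter for text output is Ctrl-A.
HIVE_DEFAULT_FIELD_DELIM = "\x01"
# Assumption: NULL values appear as the two characters backslash + N.
HIVE_NULL = r"\N"


def hive_text_to_csv(lines: Iterable[str],
                     delim: str = HIVE_DEFAULT_FIELD_DELIM) -> str:
    """Convert delimited Hive text rows into a CSV string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        fields = line.rstrip("\n").split(delim)
        # Write NULLs as empty cells; csv.writer quotes fields containing commas.
        writer.writerow(["" if field == HIVE_NULL else field for field in fields])
    return out.getvalue()


if __name__ == "__main__":
    rows = ["1\x01hello, world\x012020-01-01\n", "2\x01\\N\x012020-01-02\n"]
    print(hive_text_to_csv(rows), end="")
```

Using the csv module rather than a plain string replace means values that themselves contain commas are quoted properly.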
UPDATE
Since Hadoop 3.0.0-alpha1 the default port numbers have changed. NAMENODE_MACHINE_NAME:50070 becomes NAMENODE_MACHINE_NAME:9870. Use the latter if you are running Hadoop 3.x. The full list of port changes is described in HDFS-9427.
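If you script against the NameNode rather than using the browser, the port difference is easy to trip over. The sketch below picks the default web port from the Hadoop major version and builds a WebHDFS LISTSTATUS URL for a directory. It assumes the default ports mentioned above have not been overridden in your cluster configuration and that WebHDFS is enabled; the function name and host name are hypothetical.

```python
from urllib.parse import quote


def webhdfs_list_url(namenode_host: str, hdfs_path: str,
                     hadoop_major_version: int) -> str:
    """Build a WebHDFS LISTSTATUS URL for a directory (hypothetical helper).

    Assumes default NameNode web ports: 9870 on Hadoop 3.x, 50070 before that.
    """
    port = 9870 if hadoop_major_version >= 3 else 50070
    if not hdfs_path.startswith("/"):
        hdfs_path = "/" + hdfs_path
    # quote() keeps "/" intact and escapes characters that are unsafe in a URL path.
    return f"http://{namenode_host}:{port}/webhdfs/v1{quote(hdfs_path)}?op=LISTSTATUS"


if __name__ == "__main__":
    print(webhdfs_list_url("NAMENODE_MACHINE_NAME", "/usr/hive/warehouse", 3))
```

Fetching that URL should return a JSON listing of the directory, which is roughly the same information the Browse the filesystem page shows.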