Thursday, May 24, 2012

Hive - digging deeper into metastore

We usually use information_schema or database metadata tables to query the tables, columns, indexes ..in the case of traditional databases. How about the same in Hive?

In the case of Hive there is a metastore which acts as a metadata for the databases, Hive uses this database to store the tables, partitions, databases, serde in this database. Say you want to know the tables in a database or physical(HDFS) location of the tables or similar column names across the tables..

I have no idea about the default derby DB.. so letme talk about metastore in a traditional database like Mysql..( If you dont have your metastore in an external DB you can follow this link to do so.. https://ccp.cloudera.com/display/CDHDOC/Hive+Installation#HiveInstallation-ConfiguringtheHiveMetastore)

Hive metastore has minimal tables when compared to the metadata layer of a traditional database but Im sure the metadata schema will get bigger and complex in future.

Lets go through the most desirable tables ...

COLUMNS
DBS  -- the list of schemas
PARTITIONS
TBLS - tables

Querying these tables will give a basic idea and the datamodel of Hive metastore ( I couldnt find any over the internet), understanding of these tables are essential if you are seriously into Hive and have some production data.

for eg.. In a weird case we found some partitions of one database is being loaded into HDS location of another database (may be a code issue ) but using the following query Im able to narrow down to the affected tables..

select distinct a.tbl_id,a.tbl_name
    from TBLS a
        join PARTITIONS c on (a.TBL_ID=c.TBL_ID)
        join SDS b on (c.SD_ID=b.SD_ID)
 where b.location like '%sbx.db%' and a.DB_ID=1;