A Quick Introduction to Hadoop Hive on Azure and Querying Hive Using LINQ in C#

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.

Installing the libraries

 

To start with, fire up Visual Studio, create a console project, and install the Microsoft.Hadoop.Hive libraries via NuGet.

 

install-package Microsoft.Hadoop.Hive -pre

 

Also, head over to https://www.hadooponazure.com and create a new cluster. You are now set.

 

Creating the typed wrappers

 

To access Hive, you need to create a strongly typed wrapper. As of now, you need to roll your own, as there is no automated generation support. When you provision a Hadoop cluster, Hive comes pre-populated with a sample table (hivesampletable), and I'm using it in the example below for brevity. You can also connect to Hive via ODBC and view the Hive tables in Excel.

 

So, let us go ahead and create a Hive connection (much like an EF data context) and a typed representation of a row in the table. The HiveConnection, HiveTable and HiveRow types are in the Microsoft.Hadoop.Hive namespace.

 

    
    //Namespaces used by this sample
    using System.Linq;
    using Microsoft.Hadoop.Hive;

    //Our concrete Hive connection
    
    public class SampleHiveConnection : HiveConnection
    {
        public SampleHiveConnection(string hostName, int port) 
            : base(hostName, port, null, null) { }

        public SampleHiveConnection(string hostName, int port, 
                            string username, string password) 
            : base(hostName, port, username, password) { }

        public HiveTable<DeviceInfo> DeviceInfoTable
        {
            get
            {
                return this.GetTable<DeviceInfo>("hivesampletable");
            }
        }
    }

    //A typed row. Property names are based on the field names in hivesampletable
    
    public class DeviceInfo : HiveRow
    {
        public string DevicePlatform { get; set; }
        public string DeviceMake { get; set; }
        public int ClientId { get; set; }
    }

 

Querying the Hive using LINQ

 

Now you can run LINQ queries against your Hive context, thanks to the Hadoop SDK we installed via NuGet. Just make sure to substitute the host name, username and password with your own.

 

    class Program
    {
        static void Main(string[] args)
        {
            //Create a Hive connection
            //My cluster is hosted at https://www.hadooponazure.com
            var hive = new SampleHiveConnection(
                    "saintcluster.cloudapp.net", //your cluster host name
                    10000,                       //port
                    "user",                      //your username
                    "yourpass");                 //your password

            //Build the query
            //Make sure you go to the cluster dashboard and turn on the ODBC port
            var res = from d in hive.DeviceInfoTable
                      where d.ClientId < 100
                      select d;

            //Materialize the results; dump them to the console if you like
            var list = res.ToList();
        }
    }

That is cool. Your LINQ query is submitted to the Azure cluster via the ODBC driver, then compiled and executed in Hive. See more at: http://www.amazedsaint.com/2013/02/a-quick-introduction-to-hadoop-hive-on.html#sthash.7v7Kxu1E.dpuf
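If you want to try the query shape and the console dump without a cluster at hand, the same LINQ expression can be run over an in-memory list. This is only a sketch under assumptions: DeviceInfoStandIn and QueryShapeDemo are hypothetical names, the stand-in class does not derive from HiveRow, the rows are made up for illustration, and nothing here touches Hive or the Microsoft.Hadoop.Hive library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the HiveRow-derived DeviceInfo class;
// same three properties, but no dependency on Microsoft.Hadoop.Hive.
public class DeviceInfoStandIn
{
    public string DevicePlatform { get; set; }
    public string DeviceMake { get; set; }
    public int ClientId { get; set; }
}

public static class QueryShapeDemo
{
    // Same filter as the query in Main above, but evaluated by
    // LINQ to Objects in this process rather than by Hive.
    public static List<DeviceInfoStandIn> FilterByClientId(
        IEnumerable<DeviceInfoStandIn> rows, int limit)
    {
        var res = from d in rows
                  where d.ClientId < limit
                  select d;
        return res.ToList();
    }

    public static void Main()
    {
        // Made-up rows for illustration only; not real hivesampletable data.
        var rows = new List<DeviceInfoStandIn>
        {
            new DeviceInfoStandIn { DevicePlatform = "PlatformA", DeviceMake = "MakeX", ClientId = 8 },
            new DeviceInfoStandIn { DevicePlatform = "PlatformB", DeviceMake = "MakeY", ClientId = 100 },
            new DeviceInfoStandIn { DevicePlatform = "PlatformC", DeviceMake = "MakeZ", ClientId = 4321 }
        };

        // Dump the matching rows to the console, one tab-separated line per row.
        foreach (var d in FilterByClientId(rows, 100))
        {
            Console.WriteLine("{0}\t{1}\t{2}", d.ClientId, d.DevicePlatform, d.DeviceMake);
        }
    }
}
```

The foreach loop is the part that carries over: once the Hive query has been materialized with ToList(), the resulting list can be printed the same way.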

By Sriramjithendra Posted in Big Data
