How to Create MapReduce Jobs for Hadoop Using C#

This article introduces you to Big Data, Apache Hadoop, and MapReduce jobs. You will also learn how to create MapReduce jobs using C#.

What is Big Data?

The pace at which the IT industry is gathering data poses the huge challenge of storing and processing it, with sizes growing into zettabytes and yottabytes. Even a small product-based company wants to collect telemetry information from its products in order to do business analysis and make improvements. The data extends to production logs of various applications, telemetry information sent by a wide range of products, video streamed by monitoring cameras, and so on.

Big data is a huge amount of continuously increasing data, which includes structured, semi-structured, and complex types. Big data can be defined using three Vs:

1. Volume – The size of the data stored.

2. Velocity – The pace at which the data is streamed.

3. Variety – The different formats of data that are received.

Hadoop – A Solution for Big Data

Hadoop is a platform developed to handle big data, and most of the Hadoop framework is built using the Java programming language. It is an open-source platform whose core concepts were adopted from technologies published by Google. The Hadoop Distributed File System (HDFS) is built for storing big data, and the MapReduce model is used for processing it.

Hadoop runs as a clustered environment of many servers that do not share any memory. Incoming data is chunked and stored across the different cluster nodes, and Hadoop keeps track of which data is stored on which node. A Hadoop cluster is composed of a master node and multiple worker nodes; there can also be multi-level hierarchies of nodes.

Hadoop on Windows Azure – HDInsight

Microsoft provides a Hadoop environment on Windows Azure through its HDInsight service. HDInsight lets you create Hadoop clusters on Windows Azure quickly. As a developer, you don't have to worry much about simulating a cluster in your development environment, because Microsoft offers the HDInsight Emulator, which is a single-node cluster.

HDInsight Emulator can be downloaded from here.

MapReduce Jobs

MapReduce is a programming model originally developed at Google. It is the model followed when writing jobs that process big data stored on Hadoop clusters. The model basically consists of two steps: Map and Reduce.

Map – The map task is sent from the master node to the worker nodes, each of which processes the data available to it. If there is a multi-level hierarchy, the map task is propagated to the lower-level nodes in the cluster. The results are then returned from the worker nodes back to their master node.

Reduce – The master node gathers the query answers from its worker nodes, combines them, and returns the combined result as the output.
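To make the model concrete, the two steps can be simulated in plain C# with LINQ: a map function emits key/value pairs for each input line, the pairs are grouped by key (the "shuffle"), and a reduce function aggregates each group. This is only an in-memory sketch of the idea (the names MapReduceSketch, Map, Reduce, and Run are invented for illustration), not actual Hadoop code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class MapReduceSketch
{
    // Map: emit an ("error", 1) pair for every occurrence of the word in a line.
    static IEnumerable<KeyValuePair<string, int>> Map(string line) =>
        line.ToLowerInvariant()
            .Split(new[] { ' ', '\t', ':', ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => word == "error")
            .Select(word => new KeyValuePair<string, int>(word, 1));

    // Reduce: aggregate all values grouped under one key.
    static KeyValuePair<string, int> Reduce(string key, IEnumerable<int> values) =>
        new KeyValuePair<string, int>(key, values.Sum());

    public static Dictionary<string, int> Run(IEnumerable<string> lines) =>
        lines.SelectMany(Map)                      // map phase
             .GroupBy(p => p.Key, p => p.Value)    // shuffle: group pairs by key
             .Select(g => Reduce(g.Key, g))        // reduce phase
             .ToDictionary(p => p.Key, p => p.Value);
}
```

For example, `MapReduceSketch.Run(new[] { "error: disk full", "all good", "error error" })` produces a dictionary mapping "error" to 3. On a real cluster, the map and reduce phases run on different machines and the shuffle moves data over the network, but the data flow is the same.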

Sample MapReduce Job in .NET

Using the .NET Framework, you can write MapReduce jobs for HDInsight Hadoop clusters. This is achieved by including the NuGet package Microsoft .NET Map Reduce API for Hadoop (package ID Microsoft.Hadoop.MapReduce). In this section, let us create a sample MapReduce job.

In this sample program, let us create a simple MapReduce job that counts the number of times the word "Error" appears in the data stored on the Hadoop cluster. As a first step, create a console application and add the required NuGet package for the .NET MapReduce job.

Create a mapper class named ErrorTextMapper by deriving it from the base class MapperBase (from the Microsoft.Hadoop.MapReduce namespace). Following is the source code in C#.

using System;
using Microsoft.Hadoop.MapReduce;

public class ErrorTextMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Emit an ("error", "1") pair for every occurrence of the word in the line,
        // rather than matching only lines that consist of the word alone.
        foreach (string word in inputLine.ToLowerInvariant()
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (word == "error")
                context.EmitKeyValue("error", "1");
        }
    }
}

Now let's move on to the reduce step, which consolidates the results and returns the output. Because the class derives from ReducerCombinerBase, Hadoop can also run it as a combiner on the worker nodes.

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class ErrorTextReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Sum the values instead of counting them, so the result stays correct
        // when this class also runs as a combiner emitting partial sums.
        context.EmitKeyValue(key, values.Select(int.Parse).Sum().ToString());
    }
}

Finally, we will write the code for the entry point of the job, which interacts with the Hadoop environment to get the job done.

using System;
using Microsoft.Hadoop.MapReduce;

class Program
{
    static void Main(string[] args)
    {
        HadoopJobConfiguration hadoopConfiguration = new HadoopJobConfiguration();
        hadoopConfiguration.InputPath = "/input";
        hadoopConfiguration.OutputFolder = "/output";

        // Substitute the URI of your HDInsight cluster or local emulator,
        // along with valid credentials.
        Uri myUri = new Uri("DEV URL for Hadoop");
        IHadoop hadoop = Hadoop.Connect(myUri, "user_name", "password");

        // Run the job using our mapper and reducer/combiner types.
        hadoop.MapReduceJob.Execute<ErrorTextMapper, ErrorTextReducerCombiner>(hadoopConfiguration);

        Console.Read();
    }
}
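Before submitting the job to a cluster, the mapper and reducer can be exercised locally. The Hadoop .NET SDK includes a StreamingUnit helper for running a mapper/reducer pair in memory; the sketch below assumes the StreamingUnit.Execute overload and its Result collection behave as in the SDK samples, so treat it as illustrative rather than definitive:

```csharp
using System;
using Microsoft.Hadoop.MapReduce;

class LocalTest
{
    static void Main()
    {
        // Feed a few lines through the mapper and reducer without a cluster.
        var input = new[] { "error disk full", "all systems normal", "error error" };

        StreamingUnitOutput output =
            StreamingUnit.Execute<ErrorTextMapper, ErrorTextReducerCombiner>(input);

        // Each result line is a tab-separated key/value pair.
        foreach (string line in output.Result)
            Console.WriteLine(line);
    }
}
```

This kind of in-memory run is much faster than a job submission, making it convenient for iterating on the mapper and reducer logic before deploying to HDInsight.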

I have kept the example very simple for easy understanding. I hope this article provides a good introduction to Hadoop and to writing MapReduce jobs using the .NET Framework.

Happy Reading!


