
How to Create MapReduce Jobs for Hadoop Using C#


Written By

CodeGuru Staff

Mar 6, 2014



This article introduces you to Big Data, Apache Hadoop and MapReduce jobs. We will also learn how to create MapReduce jobs using C#.

What is Big Data?
Hadoop – A Solution for Big Data
Hadoop on Windows Azure – HDInsight
MapReduce Jobs
Sample MapReduce Job in .NET

What is Big Data?

The pace at which the IT industry gathers data makes storing and processing that data a huge challenge. Data volumes are growing toward zettabytes and yottabytes. Even a small product-based company wants to collect telemetry from its product in order to do business analysis and make improvements. The data includes production logs from various applications, telemetry sent by a wide range of products, video streamed by monitoring cameras, and so on.

Big data is a huge, continuously growing amount of data that includes structured, semi-structured and complex types. Big data can be characterized by three Vs.

1. Volume – The size of the data stored.

2. Velocity – The pace at which the data is streamed.

3. Variety – The different formats of data that are received.

Hadoop – A Solution for Big Data

Hadoop is a platform developed to handle big data, and most of the Hadoop framework is written in the Java programming language. It is an open source platform whose core concepts were adopted from work published by Google. The Hadoop Distributed File System (HDFS) stores the big data, and the MapReduce model processes it.

Hadoop runs as a cluster of many servers that do not share any memory. Incoming data is split into chunks that are stored on different cluster nodes, and Hadoop keeps track of which chunk is stored on which node. A Hadoop cluster is composed of a master node and multiple worker nodes. There can also be multi-level hierarchies of nodes.
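The chunk-and-track idea can be sketched in a few lines of C#. This is only a toy model: the `ToyCluster` class, the round-robin placement, the node names and the tiny block size are all assumptions made for illustration. Real HDFS uses much larger blocks, replicates them and applies its own placement policy.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Toy model only: this is not how HDFS is implemented.
public static class ToyCluster
{
    // Splits data into fixed-size blocks and assigns them to nodes round-robin.
    // Returns a map from block index to (node name, block bytes), which plays
    // the role of the bookkeeping the master node would hold.
    public static Dictionary<int, KeyValuePair<string, byte[]>> Store(
        byte[] data, int blockSize, string[] nodes)
    {
        var blockMap = new Dictionary<int, KeyValuePair<string, byte[]>>();
        int blockIndex = 0;
        for (int offset = 0; offset < data.Length; offset += blockSize)
        {
            // The last block may be shorter than blockSize.
            int length = Math.Min(blockSize, data.Length - offset);
            var block = new byte[length];
            Array.Copy(data, offset, block, 0, length);

            string node = nodes[blockIndex % nodes.Length];
            blockMap[blockIndex] = new KeyValuePair<string, byte[]>(node, block);
            blockIndex++;
        }
        return blockMap;
    }

    public static void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("0123456789");
        var map = Store(data, 4, new[] { "worker1", "worker2" });
        foreach (var entry in map)
            Console.WriteLine("block {0} -> {1} ({2} bytes)",
                entry.Key, entry.Value.Key, entry.Value.Value.Length);
    }
}
```

The returned dictionary is the point of the sketch: given a block index, the cluster can look up which node holds that block without scanning every node.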

Hadoop on Windows Azure – HDInsight

Microsoft provides a Hadoop environment on Windows Azure through its HDInsight service, which lets you create Hadoop clusters on Windows Azure quickly. As a developer you don’t have to worry much about simulating a cluster in your development environment, because Microsoft offers the HDInsight Emulator, which is a single-node cluster.

The HDInsight Emulator is available as a download from Microsoft.

MapReduce Jobs

MapReduce is a programming model originally developed and used by Google. It is the model used to write jobs that process big data stored on Hadoop clusters. The model basically has two steps: Map and Reduce.

Map – The master node sends map tasks to the worker nodes, and each worker processes the data available to it. If there is a multilevel hierarchy, the map task is propagated to the lower-level nodes in the cluster. Each worker node then returns its result to its master node.

Reduce – The master node gathers the answers from its worker nodes, combines them and returns the combined result as the output.
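The two steps can be tried in memory before any cluster is involved. The following is a minimal sketch and not Hadoop code: the `LocalMapReduce` class and its method names are hypothetical, and LINQ's `GroupBy` stands in for the grouping by key that happens between the Map and Reduce steps.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory illustration of the model; no Hadoop involved.
public static class LocalMapReduce
{
    // Map: turn one input line into zero or more (key, value) pairs.
    public static IEnumerable<KeyValuePair<string, int>> Map(string line)
    {
        foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            yield return new KeyValuePair<string, int>(word.ToLowerInvariant(), 1);
    }

    // Reduce: combine all values that share a key into a single result.
    public static int Reduce(string key, IEnumerable<int> values)
    {
        return values.Sum();
    }

    // Driver: run Map on every line, group the pairs by key, then Reduce each group.
    public static Dictionary<string, int> Run(IEnumerable<string> lines)
    {
        return lines
            .SelectMany(line => Map(line))
            .GroupBy(pair => pair.Key, pair => pair.Value)
            .ToDictionary(group => group.Key, group => Reduce(group.Key, group));
    }

    public static void Main()
    {
        var counts = Run(new[] { "Error in module A", "no error here", "Warning only" });
        foreach (var entry in counts)
            Console.WriteLine("{0}\t{1}", entry.Key, entry.Value);
    }
}
```

On a real cluster the Map calls run on different worker nodes and the grouped values reach the reducer over the network, but the shape of the two functions stays the same.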

Sample MapReduce Job in .NET

Using the .NET Framework you can write MapReduce jobs for HDInsight Hadoop clusters. This is done by including the NuGet package Microsoft .NET Map Reduce API for Hadoop.

In this section let us create a simple MapReduce job that counts the number of times the word “Error” appears in the data stored on the Hadoop cluster. As a first step, create a console application and add the required NuGet package for .NET MapReduce jobs.

Create a mapper class named ErrorTextMapper that derives from the base class MapperBase. Following is the source code in C#.

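    // Requires: using System; using Microsoft.Hadoop.MapReduce;
    public class ErrorTextMapper : MapperBase
    {
        private static readonly char[] Separators = { ' ', '\t', ',', '.', ';', ':' };

        public override void Map(string inputLine, MapperContext context)
        {
            // Emit one ("error", "1") pair per occurrence of the word, ignoring case,
            // so that every match reaches the reducer under the same key.
            foreach (string word in inputLine.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                if (word.Equals("error", StringComparison.OrdinalIgnoreCase))
                    context.EmitKeyValue("error", "1");
            }
        }
    }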
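The matching rule is easy to experiment with outside Hadoop. The sketch below uses a hypothetical helper class, `ErrorMatcher`, which is not part of the Hadoop SDK, and an assumed set of separator characters. It contrasts comparing the whole line against the word with counting the individual words in the line.

```csharp
using System;
using System.Linq;

// Hypothetical helper for local experiments; not part of the Hadoop SDK.
public static class ErrorMatcher
{
    // Assumed separators; extend the set to suit the actual log format.
    private static readonly char[] Separators = { ' ', '\t', ',', '.', ';', ':' };

    // True only when the entire line is the word, ignoring case.
    public static bool IsWholeLineMatch(string line, string word)
    {
        return line.Trim().Equals(word, StringComparison.OrdinalIgnoreCase);
    }

    // Counts how many times the word appears in the line, ignoring case.
    public static int CountOccurrences(string line, string word)
    {
        return line
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .Count(token => token.Equals(word, StringComparison.OrdinalIgnoreCase));
    }

    public static void Main()
    {
        string line = "Error: disk full. Retrying after error.";
        Console.WriteLine(IsWholeLineMatch(line, "error"));
        Console.WriteLine(CountOccurrences(line, "error"));
    }
}
```

A whole-line comparison only matches lines that consist of nothing but the word, so a typical log line is skipped. Splitting the line into words first is what makes a count of occurrences possible.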

Now let’s go to the reduce step, which consolidates the results and returns the output.

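// Requires: using System.Collections.Generic; using System.Linq; using Microsoft.Hadoop.MapReduce;
public class ErrorTextReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Each value received for a key stands for one mapper emission,
        // so the number of values is the number of matches.
        context.EmitKeyValue("errortextcount: ", values.Count().ToString());
    }
}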

Finally we will write the code for the entry point of the job, which will interact with the Hadoop environment to get the job done.

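// Requires: using System; using Microsoft.Hadoop.MapReduce;
class Program
{
    static void Main(string[] args)
    {
        // Where the job reads its input and writes its output on the cluster.
        HadoopJobConfiguration hadoopConfiguration = new HadoopJobConfiguration();
        hadoopConfiguration.InputPath = "/input";
        hadoopConfiguration.OutputFolder = "/output";

        // Placeholders: replace with the URL and credentials of your own cluster.
        Uri myUri = new Uri("DEV URL for Hadoop");
        IHadoop hadoop = Hadoop.Connect(myUri, "user_name", "password");

        // Run the job with the mapper and reducer defined above.
        hadoop.MapReduceJob.Execute<ErrorTextMapper, ErrorTextReducerCombiner>(hadoopConfiguration);

        // Keep the console window open until input is entered.
        Console.Read();
    }
}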

I have kept the example very simple so that it is easy to understand. I hope this article provides a good introduction to Hadoop and to writing MapReduce jobs using the .NET Framework.

Happy Reading!