How to Create MapReduce Jobs for Hadoop Using C#

This article introduces you to Big Data, Apache Hadoop and MapReduce jobs. We will also learn how to create MapReduce jobs using C#.

What is Big Data?

The current pace at which the IT industries are gathering data provides the huge challenge of storing and processing them. Their sizes are growing by zetta bytes and yotta bytes. Even a small product based company wants to collect the telemetry information of its product in order to do some business analysis and make improvements. The data extends to productions logs of various applications, telemetry information sent by a wide range of products, videos streamed by the monitoring cameras, etc.

Big data is huge amount of continuously increasing data, which includes structured, semi-structured and complex types. Big data can be defined using three Vs.

1. Volume – The size of the data stored.

2. Velocity – The pace at which the data is streamed.

3. Variety – The different formats of data that are received.

Hadoop – A Solution for Big Data

Hadoop is a platform developed to handle big data and most of the Hadoop framework is built using the java programming language. It is an open source platform and the core technology concept was something that was adopted from Google. The Hadoop File System (HDFS) is the one built for storing big data and the map reduce model is the one used for processing big data.

The Hadoop is a clustered environment running many servers that do not share any memory. The fed data is chunked and stored onto the different cluster nodes. Hadoop also keeps track of the information on which data is stored in which cluster. An Hadoop cluster will be composed of a master node and multiple worker nodes. There could also be multi-level hierarchical nodes.

Hadoop on Windows Azure – HDInsight

Microsoft provides the Hadoop environment on Windows Azure. The Hadoop service provided by Microsoft is HDInsight. The HDInsight service allows people to create Hadoop clusters on Windows Azure pretty quickly. As a developer you don’t have to worry much about simulating the cluster in your development environment as Microsoft readily offers HDInsight Emulator, which is a single node cluster.

HDInsight Emulator can be downloaded from here.

MapReduce Jobs

MapReduce is also a programming model originally developed and followed by Google. This is the model which is followed to write the jobs for processing big data stored on the Hadoop clusters. The programming model follows two steps basically which is Map and Reduce.

Map – It is the task sent to map the available data from the master node to the worker nodes. If there is a multilevel hierarchy then the map task is propagated to the lower level nodes in the cluster. The result is then returned from the worker node back to its master node.

Reduce – The master node gathers the query answers from its worker nodes, combines them together and returns it back as the output.

Sample MapReduce Job in .NET

Using the .NET framework you can write the MapReduce jobs for the HDInsight Hadoop clusters. It is achieved by including the NuGet package Microsoft .Net Map Reduce API for Hadoop. In this section let us create a sample Map Reduce job.

In this sample program let us create a simple map reduce job that counts the number of times the word “Error” is available on the data stored on the Hadoop clusters. As a first step create a console application and include the required NuGet package for the .NET map reduce job.

Create a Mapper class by deriving it from the base class MapperBase and name it as ErrorTextMapper. Following is the source code in C#.

    public class ErrorTextMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            if (inputLine.ToLowerInvariant().Equals("error"))
                context.EmitLine(inputLine);
        }
    }

Now let’s go to the reduce step, which consolidates the results and returns back the output.

public class ErrorTextReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        context.EmitKeyValue("errortextcount: ", values.Count().ToString());
    }
}

Finally we will write the code for the entry point of the job, which will interact with the Hadoop environment to get the job done.

class Program
{
    static void Main(string[] args)
    {
        HadoopJobConfiguration hadoopConfiguration = new HadoopJobConfiguration();
        hadoopConfiguration.InputPath = "/input";
        hadoopConfiguration.OutputFolder = "/output";
        Uri myUri = new Uri("DEV URL for Hadoop");
        IHadoop hadoop = Hadoop.Connect(myUri, "user_name", "pwn");
 
        hadoop.MapReduceJob.Execute<ErrorTextMapper, ErrorTextReducerCombiner>(hadoopConfiguration);
 
        Console.Read();
    }
}

I have kept the example very simple for the purpose of easy understanding. I hope this article provides a good introduction on Hadoop and writing map reduce jobs using .NET framework.

Happy Reading!



Comments

  • hadoop training

    Posted by Hadoop online Training on 10/17/2014 03:15am

    Hi, Thanks for nice data provided easy way to learn training with real time on hadoop online training through the experienced experts

    Reply
  • map_reduce_problem

    Posted by raheleh on 07/06/2014 02:50am

    Dear General Director : Thanks for your good tips in this web site. I want to use Hadoop MapReduce related with visual studio ( specially c#.net)for programming in c# with Map Reduce. so it is possible for you,please help me about Hadoop and Mapreduce Softwares that i need to install in my computer. Thanks With Best Regards R.Shahbazi

    Reply
  • Good heads-up to big data

    Posted by Muralitharan on 04/01/2014 12:35am

    This is a good heads-up; but should have been more details.

    Reply
Leave a Comment
  • Your email address will not be published. All fields are required.

Top White Papers and Webcasts

  • Protecting business operations means shifting the priorities around availability from disaster recovery to business continuity. Enterprises are shifting their focus from recovery from a disaster to preventing the disaster in the first place. With this change in mindset, disaster recovery is no longer the first line of defense; the organizations with a smarter business continuity practice are less impacted when disasters strike. This SmartSelect will provide insight to help guide your enterprise toward better …

  • Live Event Date: October 29, 2014 @ 11:00 a.m. ET / 8:00 a.m. PT Are you interested in building a cognitive application using the power of IBM Watson? Need a platform that provides speed and ease for rapidly deploying this application? Join Chris Madison, Watson Solution Architect, as he walks through the process of building a Watson powered application on IBM Bluemix. Chris will talk about the new Watson Services just released on IBM bluemix, but more importantly he will do a step by step cognitive …

Most Popular Programming Stories

More for Developers

Latest Developer Headlines

RSS Feeds