How to Create MapReduce Jobs for Hadoop Using C#

This article introduces you to Big Data, Apache Hadoop, and MapReduce jobs, and then shows how to create MapReduce jobs using C#.

What is Big Data?

The pace at which the IT industry gathers data poses a huge challenge for storing and processing it. Data volumes are growing into the zettabyte and yottabyte range. Even a small product-based company wants to collect telemetry from its products in order to do business analysis and make improvements. The data ranges from production logs of various applications and telemetry sent by a wide range of products to video streamed by monitoring cameras.

Big data is a huge, continuously growing amount of data, which includes structured, semi-structured, and complex types. Big data is commonly defined using three Vs.

1. Volume – The size of the data stored.

2. Velocity – The pace at which the data is streamed.

3. Variety – The different formats of data that are received.

Hadoop – A Solution for Big Data

Hadoop is a platform developed to handle big data; most of the framework is written in the Java programming language. It is an open-source platform whose core concepts were adapted from work published by Google. The Hadoop Distributed File System (HDFS) stores big data, and the MapReduce model is used to process it.

Hadoop runs as a clustered environment of many servers that share no memory. Incoming data is split into chunks and stored across the different cluster nodes, and Hadoop keeps track of which data is stored on which node. A Hadoop cluster is composed of a master node and multiple worker nodes; there can also be multi-level hierarchies of nodes.

Hadoop on Windows Azure – HDInsight

Microsoft provides a Hadoop environment on Windows Azure. The Hadoop service offered by Microsoft is HDInsight. The HDInsight service lets you create Hadoop clusters on Windows Azure quickly. As a developer, you don't have to worry much about simulating a cluster in your development environment, because Microsoft offers the HDInsight Emulator, a single-node cluster.

The HDInsight Emulator can be downloaded from Microsoft.

MapReduce Jobs

MapReduce is a programming model originally developed and used by Google. It is the model followed when writing jobs that process big data stored on Hadoop clusters. The model consists of two basic steps: Map and Reduce.

Map – The task sent from the master node to the worker nodes to process their local chunks of data. If there is a multi-level hierarchy, the map task is propagated to the lower-level nodes in the cluster. Each worker node then returns its result to its master node.

Reduce – The master node gathers the intermediate results from its worker nodes, combines them, and returns the combined result as the output.
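The two steps can be illustrated in memory with plain LINQ. This is only a sketch of the model, not the Hadoop API: the mapper emits (key, value) pairs and the reducer aggregates the values grouped by key.

```csharp
// A minimal, in-memory illustration of the map/reduce model using LINQ.
using System;
using System.Collections.Generic;
using System.Linq;

class MapReduceSketch
{
    static void Main()
    {
        string[] lines = { "error: disk full", "ok", "error: timeout", "ok" };

        // Map: emit a ("error", 1) pair for each line containing "error".
        var mapped = lines
            .Where(l => l.Contains("error"))
            .Select(l => new KeyValuePair<string, int>("error", 1));

        // Reduce: group the pairs by key and aggregate the values.
        var reduced = mapped
            .GroupBy(kv => kv.Key)
            .Select(g => new { g.Key, Count = g.Sum(kv => kv.Value) });

        foreach (var r in reduced)
            Console.WriteLine("{0}: {1}", r.Key, r.Count); // error: 2
    }
}
```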

Sample MapReduce Job in .NET

Using the .NET Framework, you can write MapReduce jobs for HDInsight Hadoop clusters. This is achieved by including the Microsoft .NET MapReduce API for Hadoop NuGet package. In this section, let us create a sample MapReduce job.

In this sample program, let us create a simple MapReduce job that counts the number of lines containing the word “Error” in the data stored on the Hadoop cluster. As a first step, create a console application and include the required NuGet package for the .NET MapReduce job.
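Assuming the package id is Microsoft.Hadoop.MapReduce (the id may vary by version), the package can be added from the NuGet Package Manager Console:

```
Install-Package Microsoft.Hadoop.MapReduce
```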

Create a mapper class named ErrorTextMapper by deriving it from the base class MapperBase. Following is the source code in C#.

public class ErrorTextMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Emit a (key, value) pair for every line containing the word "error",
        // so the reducer can count the values grouped under one key.
        // (Using Contains rather than Equals, since a line is rarely just "error".)
        if (inputLine.ToLowerInvariant().Contains("error"))
            context.EmitKeyValue("error", "1");
    }
}

Now let’s move to the reduce step, which consolidates the intermediate results and emits the output.

public class ErrorTextReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // "values" holds one entry per matching line emitted by the mapper.
        context.EmitKeyValue("errortextcount", values.Count().ToString());
    }
}
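Before submitting the job to a cluster, the mapper/reducer pair can be exercised locally. The Microsoft.Hadoop.MapReduce package has shipped a StreamingUnit helper for in-process testing; the exact type and member names here are an assumption and may vary by package version.

```csharp
using System;
using Microsoft.Hadoop.MapReduce;

class LocalTest
{
    static void Main()
    {
        var input = new[] { "Error: disk full", "all good", "another ERROR here" };

        // Runs the mapper and reducer in-process, without a Hadoop cluster.
        var output = StreamingUnit.Execute<ErrorTextMapper, ErrorTextReducerCombiner>(input);

        // Print whatever the reducer emitted (two of the three lines match).
        foreach (var line in output.Result)
            Console.WriteLine(line);
    }
}
```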

Finally, we will write the code for the entry point of the job, which interacts with the Hadoop environment to get the job done.

class Program
{
    static void Main(string[] args)
    {
        // Input and output locations on the cluster's file system.
        HadoopJobConfiguration hadoopConfiguration = new HadoopJobConfiguration();
        hadoopConfiguration.InputPath = "/input";
        hadoopConfiguration.OutputFolder = "/output";

        // Replace the URI and credentials with your own cluster's values.
        Uri myUri = new Uri("DEV URL for Hadoop");
        IHadoop hadoop = Hadoop.Connect(myUri, "user_name", "password");

        // Submit the job, pairing the mapper and reducer/combiner defined above.
        hadoop.MapReduceJob.Execute<ErrorTextMapper, ErrorTextReducerCombiner>(hadoopConfiguration);

        Console.Read();
    }
}

I have kept the example very simple for easy understanding. I hope this article provides a good introduction to Hadoop and to writing MapReduce jobs using the .NET Framework.

Happy Reading!


