How to Create MapReduce Jobs for Hadoop Using C#

This article introduces Big Data, Apache Hadoop, and the MapReduce programming model, and then walks through creating a MapReduce job in C#.

What is Big Data?

The pace at which the IT industry gathers data poses a huge challenge for storing and processing it, with volumes now measured in zettabytes and beyond. Even a small product-based company wants to collect telemetry from its products in order to do business analysis and make improvements. The data ranges from production logs of various applications to telemetry sent by a wide range of devices and video streamed by monitoring cameras.

Big data is a huge, continuously growing body of data that includes structured, semi-structured, and unstructured types. It is commonly defined by three Vs.

1. Volume – The size of the data stored.

2. Velocity – The pace at which the data is streamed.

3. Variety – The different formats of data that are received.

Hadoop – A Solution for Big Data

Hadoop is a platform developed to handle big data, and most of the framework is written in the Java programming language. It is an open-source platform whose core concepts were adopted from papers published by Google. The Hadoop Distributed File System (HDFS) stores big data, and the MapReduce model is used to process it.

Hadoop runs as a clustered environment of many servers that share no memory. Incoming data is split into chunks and stored across the different cluster nodes, and Hadoop keeps track of which node holds which chunk. A Hadoop cluster consists of a master node and multiple worker nodes, and nodes can also be arranged in a multi-level hierarchy.

Hadoop on Windows Azure – HDInsight

Microsoft provides a Hadoop environment on Windows Azure through its HDInsight service. HDInsight lets you create Hadoop clusters on Windows Azure quickly. As a developer, you do not have to simulate a cluster in your own development environment, because Microsoft offers the HDInsight Emulator, a single-node local cluster.

The HDInsight Emulator can be downloaded from Microsoft's download site.

MapReduce Jobs

MapReduce is a programming model originally developed and used by Google. It is the model followed to write jobs that process big data stored on Hadoop clusters. The model consists of two basic steps: Map and Reduce.

Map – The map task is distributed from the master node to the worker nodes, each of which processes its local chunk of the data. If the cluster has a multi-level hierarchy, the map task is propagated down to the lower-level nodes. Each worker node then returns its result to its master node.

Reduce – The master node gathers the intermediate results from its worker nodes, combines them, and returns the final output.
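To make the two steps concrete before involving a cluster, the model can be simulated in plain C# with LINQ. In this sketch the names MapReduceSketch and CountErrors are illustrative, not part of any Hadoop API: Map emits a key/value pair per matching line, grouping by key plays the role of the shuffle, and the reduce step counts the values per key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MapReduceSketch
{
    // Map step: emit an ("error", "1") pair for each line that reads "error".
    public static IEnumerable<KeyValuePair<string, string>> Map(string line)
    {
        if (line.ToLowerInvariant().Equals("error"))
            yield return new KeyValuePair<string, string>("error", "1");
    }

    // Shuffle + Reduce: group emitted pairs by key and count the values.
    public static int CountErrors(IEnumerable<string> lines) =>
        lines.SelectMany(Map)
             .GroupBy(p => p.Key)
             .Select(g => g.Count())
             .FirstOrDefault();

    public static void Main()
    {
        var input = new[] { "error", "ok", "ERROR", "warning", "Error" };
        Console.WriteLine(CountErrors(input)); // prints 3
    }
}
```

On a real cluster the grouping happens across machines, but the shape of the computation is the same.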

Sample MapReduce Job in .NET

Using the .NET Framework, you can write MapReduce jobs for HDInsight Hadoop clusters by including the NuGet package for the Microsoft .NET MapReduce API for Hadoop. In this section, let us create a sample MapReduce job.

In this sample program, let us create a simple MapReduce job that counts the number of times the word "error" appears in the data stored on the Hadoop cluster. As a first step, create a console application and include the required NuGet package for the .NET MapReduce job.
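The package can be added from the Visual Studio Package Manager Console. The package id below, Microsoft.Hadoop.MapReduce, is the name the .NET SDK for Hadoop shipped under at the time of writing; check the NuGet gallery if it has since changed:

```shell
Install-Package Microsoft.Hadoop.MapReduce
```

This pulls in the MapperBase, ReducerCombinerBase, and Hadoop connection types used in the snippets below.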

Create a mapper class named ErrorTextMapper by deriving it from the base class MapperBase. Following is the source code in C#.

using Microsoft.Hadoop.MapReduce;

public class ErrorTextMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Emit a count of 1 for every input line that reads "error".
        if (inputLine.ToLowerInvariant().Equals("error"))
            context.EmitKeyValue("error", "1");
    }
}

Now let us move to the reduce step, which consolidates the results and returns the output.

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class ErrorTextReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // Emit the total number of occurrences collected for this key.
        context.EmitKeyValue("errortextcount", values.Count().ToString());
    }
}

Finally we will write the code for the entry point of the job, which will interact with the Hadoop environment to get the job done.

using System;
using Microsoft.Hadoop.MapReduce;

class Program
{
    static void Main(string[] args)
    {
        var hadoopConfiguration = new HadoopJobConfiguration();
        hadoopConfiguration.InputPath = "/input";
        hadoopConfiguration.OutputFolder = "/output";

        // Replace with the URI of your HDInsight cluster or the local emulator.
        var myUri = new Uri("DEV URL for Hadoop");
        IHadoop hadoop = Hadoop.Connect(myUri, "user_name", "password");
        hadoop.MapReduceJob.Execute<ErrorTextMapper, ErrorTextReducerCombiner>(hadoopConfiguration);
    }
}
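After the job completes, the reducer's output lands under the configured output folder on HDFS. Assuming Hadoop's default part-file naming (the exact file name depends on the cluster configuration), it can be inspected from the cluster's command line:

```shell
hadoop fs -cat /output/part-00000
```

For this sample, the output would be a single line pairing the key errortextcount with the total number of "error" lines found.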

I have kept the example very simple for easy understanding. I hope this article provides a good introduction to Hadoop and to writing MapReduce jobs using the .NET Framework.

Happy Reading!

