Introduction to LINQ, Part 1: LINQ to Objects

Perhaps the most important new feature in the next version of Visual Studio, for now code-named 'Orcas,' is LINQ, which stands for Language INtegrated Query. At its core, LINQ is a set of operators, called standard query operators, that allow the querying of any .NET sequence. LINQ comes in three sub-sets:

  • LINQ to ADO.NET, which includes the following:
    • LINQ to Entities, for querying EDM entities
    • LINQ to DataSet, for querying objects in the DataSet family
    • LINQ to SQL, for querying relational databases
  • LINQ to XML, for querying XML data
  • LINQ to objects, for querying any sequence of objects

In a series of three articles, I will introduce you to LINQ to objects, LINQ to XML, and LINQ to SQL. This first article covers LINQ to objects.

Why LINQ?

You may wonder whether LINQ was necessary. Is it actually useful? A very large number of applications deal with data sources, and the two most common sources are XML files and relational databases. In an application that works with a relational database, you actually deal with two languages: the first is the language the application is written in (C#, C++, Java, and so forth), and the second is SQL. The SQL commands, however, are always written as strings, which means they lack two things:

  • Compile time support: You cannot know until run time whether the string is correct because the compiler treats it as a simple string.
  • Design time support: No tool offers IntelliSense support for these command strings.

LINQ addresses both these issues. Because the queries are integrated into the language, they are verified during compilation and you have IntelliSense support for them.

On the other hand, in procedural languages such as C#, C++, and Java, you have to specify not only "what" to do, but also "how" to do it. Often, that means a lot of code. If you could specify only "what" you want, and have the compiler or other tools decide how to do it, your work would be simpler and your productivity higher. That's where LINQ steps in: you don't specify how a query is carried out, only what you want to query.
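
To make the "what" versus "how" contrast concrete, here is a minimal sketch that collects the even numbers of an array twice: once with an explicit loop and once with a query. The class WhatVersusHow and its method names are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class WhatVersusHow
{
   // "how": walk the array, test each element, collect the matches
   public static List<int> EvensWithLoop(int[] numbers)
   {
      List<int> evens = new List<int>();
      foreach (int n in numbers)
      {
         if (n % 2 == 0)
         {
            evens.Add(n);
         }
      }
      return evens;
   }

   // "what": describe the wanted elements and let LINQ do the walking
   public static List<int> EvensWithQuery(int[] numbers)
   {
      var evens = from n in numbers
                  where n % 2 == 0
                  select n;
      return evens.ToList();
   }

   static void Main()
   {
      int[] numbers = { 1, 2, 3, 4, 5, 6 };

      // both calls print 2, 4 and 6, one value per line
      foreach (int n in EvensWithLoop(numbers)) Console.WriteLine(n);
      foreach (int n in EvensWithQuery(numbers)) Console.WriteLine(n);
   }
}
```

Both methods return the same list; the second one simply leaves the iteration to the query operators.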

Short Overview of C# 3.0 Features

LINQ is built on several new features in C# 2.0 and, especially, 3.0. These include:

  • Lambda expressions: Express the implementation of a method and the instantiation of a delegate from that method; they have the form:
    c => c + 1
    which means a function with one argument that returns the value of the argument incremented by one
  • Extension methods: Extend classes that you cannot modify. They are defined as static methods of static classes; they take at least one parameter, and the first parameter has the type of the class being extended, preceded by the keyword this.
  • Initialization of objects and collections in expression context: Initialization previously could be done only in a statement context. Now, it can be done in an expression context, such as when passing an argument to a function.
  • Local variable type inference: The contextual keyword var is used as a placeholder for the type, which is inferred at compile time from the expression on the right side; it can be used only for local variables. Classes cannot have members declared with var, and functions cannot return var.
  • Anonymous types: The expression new { ... } instantiates objects of types that are not explicitly defined, but created by the compiler. Because such a type has no name you can write, the object is stored in a variable declared with var. These anonymous types are limited to local scope.
  • Lazy evaluation: In C# 2.0, the contextual keyword yield was introduced; a yield return statement inside an iterator block delays the iteration of a source until the result is iterated. To understand what that means, I suggest reading more about it in my blog; lazy evaluation is essential for the performance of LINQ queries because it avoids the generation of unnecessary intermediate results by delaying the execution until the latest possible moment, when all information about what is wanted is known.
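
The features listed above can be seen side by side in a short sketch. Everything named here (Point, StringExtensions.Shout, FeatureTour) is hypothetical and exists only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// hypothetical type, used only for this illustration
class Point
{
   public int X { get; set; }
   public int Y { get; set; }
}

static class StringExtensions
{
   // extension method: the first parameter is marked with 'this',
   // so the method can be called as if it belonged to string
   public static string Shout(this string text)
   {
      return text.ToUpperInvariant() + "!";
   }
}

static class FeatureTour
{
   // iterator block: each value is produced only when it is requested
   public static IEnumerable<int> CountTo(int limit)
   {
      for (int i = 1; i <= limit; i++)
      {
         yield return i;
      }
   }

   public static int SumOfCoordinates(Point p)
   {
      return p.X + p.Y;
   }

   static void Main()
   {
      // lambda expression converted to a delegate
      Func<int, int> increment = c => c + 1;
      Console.WriteLine(increment(41));                    // prints 42

      Console.WriteLine("goal".Shout());                   // prints GOAL!

      // object initializer in expression context (as an argument)
      Console.WriteLine(SumOfCoordinates(new Point { X = 2, Y = 3 }));   // prints 5

      // collection initializer and local variable type inference
      var years = new List<int> { 2004, 2005, 2006 };
      Console.WriteLine(years.Count);                      // prints 3

      // anonymous type with the properties Name and Year
      var winner = new { Name = "Barcelona", Year = 2006 };
      Console.WriteLine("{0} {1}", winner.Name, winner.Year);

      // the loop pulls 1, 2 and 3 from the iterator, one at a time
      foreach (int n in CountTo(3))
      {
         Console.WriteLine(n);
      }
   }
}
```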

I have only enumerated these new features. If they are unknown to you, I suggest additional reading, such as the Preview of What's New in C# 3.0 by Sahil Malik.

Classic Approach

To understand how helpful LINQ can be, I will start with a problem approached in the classic procedural way in C#. Suppose you have a list of UEFA Champions League winners and you want to list the winners on a console. However, they should be grouped by the countries they represent, ordered descending by the number of winners from each country and, in case of the same number of winners, alphabetically by the name of the country.

Naturally, I would start by defining a class Winner, and a list of winners:

/// <summary>
/// Encapsulates information about a UEFA Champions League winner
/// </summary>
class Winner
{
   string _name;
   string _country;
   int _year;

   public string Name
   {
      get { return _name; }
      set { _name = value; }
   }

   public string Country
   {
      get { return _country; }
      set { _country = value; }
   }

   public int Year
   {
      get { return _year; }
      set { _year = value; }
   }

   public Winner(string name, string country, int year)
   {
      _name = name;
      _country = country;
      _year = year;
   }
}

/// <summary>
/// utility class with a single method returning the sequence of
/// winners
/// </summary>
class UCL
{
   /// <summary>
   /// returns a sequence of all UCL winners
   /// </summary>
   /// <returns>sequence of Winner</returns>
   public static IEnumerable<Winner> GetWinners()
   {
      Winner[] winners =  {
         new Winner("Barcelona", "Spain", 2006),
         new Winner("Liverpool", "England", 2005),
         new Winner("FC Porto", "Portugal", 2004),
         new Winner("AC Milan", "Italy", 2003),
         new Winner("Real Madrid", "Spain", 2002),
         new Winner("Bayern Munchen", "Germany", 2001),
         new Winner("Real Madrid", "Spain", 2000),
         new Winner("Manchester Utd.", "England", 1999),
         new Winner("Real Madrid", "Spain", 1998),
         new Winner("Borussia Dortmund", "Germany", 1997),
         new Winner("Juventus", "Italy", 1996),
         new Winner("AFC Ajax", "Netherlands", 1995),
         new Winner("AC Milan", "Italy", 1994),
         new Winner("Olympique de Marseille", "France", 1993)
      };

      return winners;
   }
}

To list the countries according to the specified criteria, I would start by creating a dictionary with the country name as the key and a list of Winners as the value. After filling in this dictionary, I would have to re-sort the entries, descending by the number of winners from each country. To do that, I would use a list with elements of the same type as the dictionary, sorted as specified above.

/// <summary>
/// orders the countries descending by the number of titles won
/// </summary>
/// <param name="g1">first element to compare</param>
/// <param name="g2">second element to compare</param>
/// <returns>-1, 0 or 1</returns>
private static int
   CompareCoutryGroups(KeyValuePair<string, List<Winner>> g1,
                       KeyValuePair<string, List<Winner>> g2)
{
   if (g1.Value.Count > g2.Value.Count) return -1;
   else if (g1.Value.Count == g2.Value.Count)
   {
      return g1.Key.CompareTo(g2.Key);
   }
   return 1;
}

/// <summary>
/// prints the list of UCL winners grouped by countries,
/// descending by the number of titles won by teams from that
/// country, and in case of same number of titles, alphabetically
/// by the country's name
/// </summary>
/// <param name="winners">sequence of Winner</param>
public void ListByCountriesClassic(IEnumerable<Winner> winners)
{
   // a dictionary is used to group the teams by the country
   // key is the country name
   // value is a list of winners from that country
   Dictionary<string, List<Winner>> dict =
      new Dictionary<string, List<Winner>>();

   // populate the dictionary with winners
   foreach (Winner w in winners)
   {
      // look up the list for this country, creating it on first use
      List<Winner> list;
      if (!dict.TryGetValue(w.Country, out list))
      {
         list = new List<Winner>();
         dict.Add(w.Country, list);
      }
      list.Add(w);
   }

   // create a list whose elements are the key-value pairs from the
   // dictionary
   // the list is necessary to order the country groups by the
   // number of winners
   List<KeyValuePair<string, List<Winner>>> orderedlist =
        new List<KeyValuePair<string, List<Winner>>>();

   // populate the list
   foreach (KeyValuePair<string, List<Winner>> group in dict)
   {
      orderedlist.Add(group);
   }

   // sort the list by the specified criteria
   orderedlist.Sort(CompareCoutryGroups);

   // print the list
   foreach (KeyValuePair<string, List<Winner>> item in orderedlist)
   {
      Console.WriteLine("{0}: {1}", item.Key, item.Value.Count);

      foreach (Winner w in item.Value)
      {
         Console.WriteLine("{0}\t{1}", w.Year, w.Name);
      }
   }
}

For sorting, I defined a function called CompareCoutryGroups that takes two objects of type KeyValuePair<string, List<Winner>>, and returns -1, 0, or 1 according to the required criteria.

Of course, using that is quite trivial:

ObjectsDemo p = new ObjectsDemo();

IEnumerable<Winner> winners = UCL.GetWinners();

p.ListByCountriesClassic(winners);

and the output is:

Spain: 4
2006    Barcelona
2002    Real Madrid
2000    Real Madrid
1998    Real Madrid
Italy: 3
2003    AC Milan
1996    Juventus
1994    AC Milan
England: 2
2005    Liverpool
1999    Manchester Utd.
Germany: 2
2001    Bayern Munchen
1997    Borussia Dortmund
France: 1
1993    Olympique de Marseille
Netherlands: 1
1995    AFC Ajax
Portugal: 1
2004    FC Porto

You can see that Spain is first with 4 wins, followed by Italy with 3, and then England and Germany, both with 2 wins, but sorted ascending, alphabetically. The same for France, Netherlands, and Portugal, each with 1 win.

The LINQ Approach

All that can be greatly simplified with a LINQ query:

/// <summary>
/// list the winners using a query on the sequence of winners
/// </summary>
/// <param name="winners">sequence of winners</param>
public void ListByCountriesLINQ(IEnumerable<Winner> winners)
{
   // select the winners,
   // grouped by countries,
   // ordered by the number of winners from each country, with the
   // name of the country as the second criterion
   var result = from w in winners
                group w by w.Country into groups
                orderby groups.Count() descending,
                        groups.Key
                select groups;

   // print the groups
   foreach (var item in result)
   {
      Console.WriteLine("{0}: {1}", item.Key, item.Count());

      foreach (Winner w in item)
      {
         Console.WriteLine("{0}\t{1}", w.Year, w.Name);
      }
   }
}

What you see in the code above is a query written in the declarative syntax (or query syntax), which resembles SQL; the exact keywords differ between C# and VB.NET, the two languages that currently support it. The query syntax is a language-specific layer on top of the .NET query operators, and IntelliSense fully supports it.

The query specifies that the winners are grouped by country name and that the groups are ordered first by the number of elements and, in case of equality, by the name of the country. The projection of the query is the sequence of groups, which is later iterated to list the winners on the console. Of course, running the code yields the same output.

What should be remarked here is the use of the keyword var as a placeholder for the type yielded by the query, which is actually IOrderedEnumerable<IGrouping<string, Winner>> (IOrderedSequence<IGrouping<string, Winner>> in the early Orcas previews). That is the sequence iterated in the foreach loop. I pointed out earlier that, due to the use of the yield keyword, the source is iterated only when the result is iterated. In the absence of the foreach statement that iterates over the result of the query, the query is never executed. Thus, result is actually only a description of a query.
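
A small experiment makes this deferred execution visible. The sketch below defines a query, changes the source afterwards, and only then enumerates the result; the class name DeferredExecutionDemo is an assumption made for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DeferredExecutionDemo
{
   public static List<int> Run()
   {
      List<int> source = new List<int> { 3, 1, 2 };

      // nothing is sorted here; 'query' only describes the work
      var query = from n in source
                  orderby n
                  select n;

      // the source changes after the query was defined
      source.Add(0);

      // enumeration happens here, so the new element is included
      return query.ToList();
   }

   static void Main()
   {
      // prints 0, 1, 2 and 3, one value per line
      foreach (int n in Run())
      {
         Console.WriteLine(n);
      }
   }
}
```

If the query had been executed when it was defined, the element added later could not appear in the output.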

Standard Query Operators

The standard query operators allow querying any array or collection, and most of them are extension methods (defined in System.Linq.Enumerable and other classes). They are compliant with the .NET 2.0 common language specification. The arguments of most operators are delegates, which means you can pass lambda expressions.

The same query shown with the query syntax can be written with the standard query operators like this:

/// <summary>
/// list the winners using a query with Standard Query Operators
/// (SQO) on the sequence of winners
/// </summary>
/// <param name="winners">sequence of winners</param>
public void ListByCountriesLINQwSQO(IEnumerable<Winner> winners)
{
   // select the winners,
   // grouped by countries,
   // ordered by the number of winners from each country, with the
   // name of the country as the second criterion
   var result = winners.
                GroupBy(w => w.Country).
                OrderByDescending(g => g.Count()).
                ThenBy(g => g.Key).
                Select(g => g);

   // list the winners
   foreach (var item in result)
   {
      Console.WriteLine("{0}: {1}", item.Key, item.Count());

      foreach (Winner w in item)
      {
         Console.WriteLine("{0}\t{1}", w.Year, w.Name);
      }
   }
}

The GroupBy operator is applied on the collection winners (of type IEnumerable<Winner>), and on the result OrderByDescending is applied; then on the result ThenBy, and finally Select for projecting the groups of winners. As you can see, this is a chain of function calls that normally would produce intermediate results. Due to the use of the yield keyword and the mechanism of lazy evaluation, these intermediate results are largely avoided, because nothing is executed until the latest possible moment, when the final result is iterated.
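
The effect of lazy evaluation on a chain of operators can be observed by counting how many source elements are actually inspected. The following is a sketch under my own assumptions: the class StreamingDemo and its Inspected counter are hypothetical, and the side effect inside the lambda is there only to make the behavior measurable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class StreamingDemo
{
   // counts how many source elements the Where predicate has seen
   public static int Inspected;

   public static List<int> FirstTwoEven(IEnumerable<int> source)
   {
      Inspected = 0;

      // Where and Take stream their input: elements flow through the
      // chain one by one, and the chain stops once Take has enough
      return source
         .Where(n => { Inspected++; return n % 2 == 0; })
         .Take(2)
         .ToList();
   }

   static void Main()
   {
      List<int> result = FirstTwoEven(Enumerable.Range(1, 10));

      // prints 2 and 4
      foreach (int n in result)
      {
         Console.WriteLine(n);
      }

      // far fewer than the 10 available elements were inspected
      Console.WriteLine("inspected {0} of 10", Inspected);
   }
}
```

Not every operator can stream like this. Ordering operators such as OrderBy have to read their whole input before they can yield the first element, although they still do so only when the result is iterated.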

Another important aspect is a small difference between the query syntax and SQL: in SQL, the SELECT clause always comes first. With LINQ, that is not possible and actually is not natural, because you cannot select something when you don't know the source. In LINQ, you first have to specify the source, and then what you project from the source. In the latter example, with the Standard Query Operators, that can be understood easily because GroupBy is applied on a source (winners).

There are several categories of operators:

  • restriction: Where
  • projection: Select, SelectMany
  • partitioning: Take, Skip, TakeWhile, SkipWhile
  • join: Join, GroupJoin
  • concatenation: Concat
  • ordering: OrderBy, ThenBy, OrderByDescending, ThenByDescending, Reverse
  • grouping: GroupBy
  • set: Distinct, Union, Intersect, Except
  • conversion: AsEnumerable (ToSequence in early previews), ToArray, ToList, ToDictionary, ToLookup, OfType, Cast
  • equality: SequenceEqual (EqualAll in early previews)
  • elements: First, FirstOrDefault, Last, LastOrDefault, Single, SingleOrDefault, ElementAt, ElementAtOrDefault, DefaultIfEmpty
  • generation: Range, Repeat, Empty
  • quantifiers: Any, All, Contains
  • aggregation: Count, LongCount, Sum, Min, Max, Average, Aggregate
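
To give a feel for a few of these categories, here is a minimal sketch that applies one operator from several of them to the same array of years. The class OperatorSampler and its method names are assumptions made for the illustration.

```csharp
using System;
using System.Linq;

static class OperatorSampler
{
   static readonly int[] Years = { 2006, 2005, 2004, 2003, 2002, 2001, 2000 };

   // restriction: keep only the elements that satisfy a predicate
   public static int[] EvenYears()
   {
      return Years.Where(y => y % 2 == 0).ToArray();
   }

   // partitioning: skip the first two elements, then take the next two
   public static int[] SecondPage()
   {
      return Years.Skip(2).Take(2).ToArray();
   }

   // quantifier: true only if every element satisfies the predicate
   public static bool AllAfter1999()
   {
      return Years.All(y => y > 1999);
   }

   // aggregation: reduce the sequence to a single value
   public static int Oldest()
   {
      return Years.Min();
   }

   // generation followed by projection: the squares of 1, 2 and 3
   public static int[] FirstThreeSquares()
   {
      return Enumerable.Range(1, 3).Select(n => n * n).ToArray();
   }

   static void Main()
   {
      Console.WriteLine(EvenYears().Length);      // prints 4
      Console.WriteLine(SecondPage()[0]);         // prints 2004
      Console.WriteLine(AllAfter1999());          // prints True
      Console.WriteLine(Oldest());                // prints 2000
      Console.WriteLine(FirstThreeSquares()[2]);  // prints 9
   }
}
```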

As has already been said, most of the operators take delegates as arguments, which allows you to pass lambda expressions, as seen in the query above. For instance, GroupBy(w => w.Country) shows a lambda expression that expresses the implementation of a method that takes an argument of type Winner and returns a string representing the name of the winner's country. That string is used as the key of the group in which the winner is placed. The declaration from System.Linq.Enumerable of the overload of GroupBy used here is:

public static IEnumerable<IGrouping<TKey, TSource>>
   GroupBy<TSource, TKey>(this IEnumerable<TSource> source,
                          Func<TSource, TKey> keySelector);

GroupBy is an extension method: the first parameter, preceded by the keyword this, is the IEnumerable<TSource> being extended, and the second is the delegate to which the lambda expression is converted.
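
Because the key selector is an ordinary Func<TSource, TKey> delegate and GroupBy is an ordinary static method, the same call can be written in several equivalent ways. The sketch below groups a list of country names by the name itself; the class GroupByCallForms and its helper methods are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class GroupByCallForms
{
   static readonly string[] Countries =
      { "Spain", "England", "Spain", "Italy", "Spain" };

   // a named method with the shape Func<string, string>
   static string Self(string country)
   {
      return country;
   }

   // joins the group keys in the order the groups are produced
   static string Keys(IEnumerable<IGrouping<string, string>> groups)
   {
      return string.Join(", ", groups.Select(g => g.Key).ToArray());
   }

   // key selector passed as a lambda expression
   public static string WithLambda()
   {
      return Keys(Countries.GroupBy(c => c));
   }

   // key selector passed as a delegate created from a named method
   public static string WithDelegate()
   {
      Func<string, string> keySelector = Self;
      return Keys(Countries.GroupBy(keySelector));
   }

   // the extension method called as the static method it really is
   public static string WithStaticCall()
   {
      return Keys(Enumerable.GroupBy(Countries, c => c));
   }

   static void Main()
   {
      // each line prints: Spain, England, Italy
      Console.WriteLine(WithLambda());
      Console.WriteLine(WithDelegate());
      Console.WriteLine(WithStaticCall());
   }
}
```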

More Examples

Say you want to list the winners, but print only the name of the winners and the year of the victory. In this case, instead of projecting a sequence of Winners, you could project a sequence of an anonymous type with two properties, one for year and one for name:

/// <summary>
/// runs a query on the sequence of Winner and projects a sequence
/// of objects of anonymous type
/// </summary>
/// <param name="winners">sequence of winners</param>
void LinqQuery1(IEnumerable<Winner> winners)
{
   // orders the winners ascending by the year of the winning
   // and projects an anonymous type with two properties:
   //  - Year of type int, inferred from the property Year of Winner
   //  - Team of type string, inferred from the property Name of
   //    Winner
   var result = from w in winners
                orderby w.Year
                select new { w.Year, Team = w.Name };

   // lists the winners
   foreach (var w in result)
   {
      // accesses the properties Year and Team of the anonymous type
      Console.WriteLine("{0} {1}", w.Year, w.Team);
   }
}

The compiler-created anonymous type has two properties, Year and Team. The type and name of Year are inferred from the property Year of Winner, while Team takes its type from the property Name of Winner and its name from the initializer. IntelliSense shows IEnumerable<AnonymousType> for the result; if you look into the MSIL, you can see the anonymous type.
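
The shape of such a compiler-generated type can also be checked at run time with reflection. This is only a sketch: the class AnonymousTypeDemo is hypothetical, and an anonymous object stands in for a Winner so that the example is self-contained.

```csharp
using System;
using System.Linq;

static class AnonymousTypeDemo
{
   public static string[] ProjectedPropertyNames()
   {
      // stand-in for a Winner object
      var w = new { Name = "Barcelona", Country = "Spain", Year = 2006 };

      // same projection as in the query: Year keeps its name,
      // Name is exposed under the new name Team
      var projected = new { w.Year, Team = w.Name };

      // list the public properties of the generated type, sorted by
      // name so that the result does not depend on reflection order
      return projected.GetType()
                      .GetProperties()
                      .Select(p => p.Name)
                      .OrderBy(name => name)
                      .ToArray();
   }

   static void Main()
   {
      // prints Team and Year
      foreach (string name in ProjectedPropertyNames())
      {
         Console.WriteLine(name);
      }
   }
}
```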

[anonymoustype.png]

In case you want to list only the distinct names of winners, you can use the operator Distinct like this:

/// <summary>
/// lists only the distinct winners
/// </summary>
/// <param name="winners">sequence of winners</param>
void LinqQuery2(IEnumerable<Winner> winners)
{
   // applies the set operator Distinct<> on a sequence that is the
   // result of a query
   // the variable result holds a sequence of distinct names of
   // winners
   var result = Enumerable.Distinct(
                   from w in winners
                   select w.Name);

   // lists the winners
   foreach (var w in result)
   {
      Console.WriteLine(w);
   }
}

The output in this case is:

Barcelona
Liverpool
FC Porto
AC Milan
Real Madrid
Bayern Munchen
Manchester Utd.
Borussia Dortmund
Juventus
AFC Ajax
Olympique de Marseille
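
Distinct is an extension method like the other operators, so it can also be chained directly onto a sequence, and an overload accepts an equality comparer. The sketch below shows both forms on a plain list of names; the class DistinctDemo and its methods are assumptions made for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DistinctDemo
{
   // Distinct called as an extension method, default equality
   public static List<string> DistinctNames(IEnumerable<string> names)
   {
      return names.Distinct().ToList();
   }

   // overload with a comparer: names differing only in case are equal
   public static List<string> DistinctNamesIgnoringCase(IEnumerable<string> names)
   {
      return names.Distinct(StringComparer.OrdinalIgnoreCase).ToList();
   }

   static void Main()
   {
      string[] names =
         { "Real Madrid", "AC Milan", "real madrid", "Real Madrid" };

      Console.WriteLine(DistinctNames(names).Count);              // prints 3
      Console.WriteLine(DistinctNamesIgnoringCase(names).Count);  // prints 2
   }
}
```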

A Last Exotic Example

As a final example in this article, I will show how you could sort an array of integers with a query:

/// <summary>
/// sorts an array of integers, ascending
/// </summary>
/// <param name="array">array of integers to sort</param>
void SortWithLinq(int[] array)
{
   // sorts the array
   var result = from a in array
                orderby a
                select a;

   // prints the sorted sequence
   foreach (int a in result)
      Console.WriteLine(a);
}

It can be called like this:

ObjectsDemo p = new ObjectsDemo();

int[] array = {1, 5, 3, 2, 9, 4, 8, 6, 7 };
p.SortWithLinq(array);
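
One detail worth knowing is that such a query produces a new sorted sequence and leaves the array itself untouched, unlike Array.Sort, which sorts in place. Here is a minimal sketch using the operator form with descending order; the class SortDemo is hypothetical.

```csharp
using System;
using System.Linq;

static class SortDemo
{
   // returns a sorted copy; the input array is not modified
   public static int[] SortedDescending(int[] array)
   {
      return array.OrderByDescending(a => a).ToArray();
   }

   static void Main()
   {
      int[] array = { 1, 5, 3, 2, 9, 4, 8, 6, 7 };
      int[] sorted = SortedDescending(array);

      Console.WriteLine(sorted[0]);   // prints 9, the largest value
      Console.WriteLine(array[0]);    // prints 1, the original order is kept
   }
}
```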

Conclusions

LINQ is a very powerful addition to .NET, making queries first-class citizens of the languages that support it (for now, C# and VB.NET, possibly IronPython too). It allows you to run queries on any .NET sequence, XML tree, or relational database (the latter two will be addressed in two separate articles). Ultimately, LINQ makes developers' lives easier because the code is simpler and has design-time support (IntelliSense) and compiler support.



About the Author

Marius Bancila

Marius Bancila is a Microsoft MVP for VC++. He works as a software developer for a Norwegian-based company. He is mainly focused on building desktop applications with MFC and VC#. He keeps a blog at www.mariusbancila.ro/blog, focused on Windows programming. He is the co-founder of codexpert.ro, a community for Romanian C++/VC++ programmers.
