Working with Large Memory-Mapped Files in VB 2010

Introduction

Truth be told, sometimes a technology comes along and I don't have a user story for it. The technology may not be that hard to describe, but a user story either doesn't come to me or doesn't seem compact enough to fit into an article-length column. Memory-mapped files are like that.

Generally, a file is read into memory, either all at once or sequentially, and then manipulated. (Traditional streaming libraries support random access, too.) The problem is that very large files can easily exhaust memory, especially in a 32-bit process with its two gigabyte limit. Memory-mapped files let you treat the file data as if it were all loaded in memory, so you can access any part of the file directly.

In a nutshell, a memory-mapped file allows you to treat a file as if it were entirely loaded in memory, and the file can be far larger than the memory available to your process. Without spilling the beans entirely, let's break the differences between file I/O and memory-mapped files into some meaty bits.

File I/O vs. Memory Mapped Files

A traditional use of the System.IO classes is to read a file into memory and manipulate it, or to seek to a point in a file and manipulate the file at that point. The challenges with traditional file I/O have to do with physical memory limitations (two gigabytes, or usually much less, for 32-bit systems) and the cost of seeking through very large files. A stream-based file access system would probably not work very well for a database server, large text documents, or anything file-based that is particularly large.

The memory-mapped file capabilities in .NET Framework 4 work with physical files and logical files. They logically map a file (or part of a file) and let you treat it as if it were all loaded in memory. The system's memory manager takes care of moving between the logical and physical mapping of a file. (There is also stream-based support for memory-mapped files.) What this means is that you can have an extremely large file and interact with it as if it were all loaded, while the memory manager handles moving the bits to and from their physical location.

There is still a two gigabyte limit for memory-mapped files on a 32-bit system, but it applies per chunk (a view, in .NET terms), and you can create multiple chunks. The idea is that you can grab all of the file, or a chunk of it defined by a starting offset and a size, and access just that chunk. Because you can split a file into chunks, you can work with more data than the two gigabyte upper limit (on a 32-bit system) by mapping one chunk at a time, and you can access the chunks from different processes. Remember that because you are logically working with memory, you can think of mapped files as you would any other memory, which includes dividing the work between processes.
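As a rough sketch of the chunking idea, the following console module walks a file one view at a time using the CreateViewAccessor(offset, size) overload. The file name, the 4 KB chunk size, and the module name are made up for illustration; real chunks would be far larger.

```vb
Imports System
Imports System.IO
Imports System.IO.MemoryMappedFiles

Module ViewChunksSketch

  ' Hypothetical file name and chunk size, chosen only for illustration.
  Private Const SamplePath As String = "chunks-demo.bin"
  Private Const ChunkSize As Long = 4096

  Sub Main()
    ' Write a small sample file (10,000 zero bytes) so the sketch runs on its own.
    File.WriteAllBytes(SamplePath, New Byte(9999) {})
    Dim length As Long = New FileInfo(SamplePath).Length

    Using mapped = MemoryMappedFile.CreateFromFile(SamplePath, FileMode.Open)
      Dim offset As Long = 0
      While offset < length
        ' Map only the current chunk; the last chunk may be shorter than the rest.
        Dim size As Long = Math.Min(ChunkSize, length - offset)
        Using view = mapped.CreateViewAccessor(offset, size)
          ' Positions passed to a view are relative to the start of that view.
          Dim firstByte As Byte = view.ReadByte(0)
          Console.WriteLine("Chunk at offset {0}, {1} bytes, first byte {2}", offset, size, firstByte)
        End Using
        offset += size
      End While
    End Using
  End Sub

End Module
```

Only one view is mapped at any moment, so the address space in use stays near the chunk size no matter how large the file is.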

Accessing a File Using File IO

I think a challenging scenario might be writing your own file server, database server, or maybe a document processing application, and exploring how memory-mapped files might help improve performance. (However, this is probably a pretty big task.)

The task I picked was to perform a word frequency count. The example simply reads a file and counts the frequency of words. Listing 1 demonstrates one way you might do this using standard File I/O.

  Imports System.IO
  Imports System.IO.MemoryMappedFiles
  
  Public Class Form1
  
    Private Const filename As String = "..\..\cicero.txt"
    Private splitChars() As Char =
    {",", ".", " ", ":", ";", "/", "\", "[", "]",
      "{", "}", "=", "+", "-", "*", "`", "'", "1", "2",
      "3", "4", "5", "6", "7", "8", "9", "0", "(", ")", "!"}
  
    Private Sub Form1_Load(ByVal sender As System.Object,
    ByVal e As System.EventArgs) Handles MyBase.Load
  
      Me.DoubleBuffered = True
    End Sub
  
    Private Sub CountWordsWithStreamToolStripMenuItem_Click(
      ByVal sender As System.Object,
      ByVal e As System.EventArgs) Handles CountWordsWithStreamToolStripMenuItem.Click
  
      hash.Clear()
      Dim lines = File.ReadLines(filename)
      Dim words() As String
  
      For Each line In lines
        words = line.Split(splitChars, StringSplitOptions.RemoveEmptyEntries)
        For Each word In words
          AddWordToHash(word)
          Application.DoEvents()
        Next
      Next
      Dump(hash)
    End Sub
  
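    Private Sub Dump(ByVal hash As Hashtable)
      ' Build the report once instead of appending to TextBox1.Text for every word.
      Dim report As New System.Text.StringBuilder()
      report.AppendFormat("Number of words: {0}", hash.Count)
      ' Enumerate every entry; a loop that starts at index 1 skips the first word.
      For Each entry As DictionaryEntry In hash
        report.AppendLine()
        report.AppendFormat("{0} occurs {1} times", entry.Key, entry.Value)
      Next
      TextBox1.Text = report.ToString()
    End Sub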
  
    
    Private Sub ClearAllToolStripMenuItem_Click(ByVal sender As System.Object,
    ByVal e As System.EventArgs) Handles ClearAllToolStripMenuItem.Click
      TextBox1.Clear()
    End Sub
  
    Private hash As Hashtable = New Hashtable()
    Private Sub AddWordToHash(ByVal word As String)
      word = word.ToLower()
      If (hash(word) Is Nothing) Then
        hash.Add(word, 1)
      Else
        hash(word) = CType(hash(word), Long) + 1
      End If
      word = ""
    End Sub
  
  End Class

Listing 1: Count the frequency of words using standard File I/O.

Listing 1 is pretty straightforward. Read all of the lines of a text file. Split each line into an array using an array of non-word characters, and stick each word in a Hashtable. The word is the key and the count is the value, so every time the same word is stuffed in the Hashtable, the value at that hash location is incremented. When you are all done, you have a word frequency count. You could use the same approach for tasks like search and replace, keyword highlighting, and spell checking (think word processing here). You get the idea.

The output from running the code in Listing 1 against some text from Cicero is shown in Figure 1. Cicero was a Roman philosopher, and the text was generated and extracted from http://www.blindtextgenerator.com.

Figure 1: The word frequency of some text from Cicero.

Accessing a File Using a MemoryMappedFile

Memory-mapped files can map to a logical file or a physical file. A MemoryMappedFile object represents the mapping itself, and you read and write its contents through a view: either a MemoryMappedViewAccessor or a MemoryMappedViewStream.

There are two kinds of memory-mapped files: persisted and non-persisted. Persisted memory-mapped files are associated with a file on the file system. When the last process wraps up, the data is saved to that file. Non-persisted memory-mapped files are not associated with a file in the file system, and when the last process is finished, the data is lost and the memory is reclaimed.
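To make the non-persisted case concrete, here is a minimal sketch that uses MemoryMappedFile.CreateNew and MemoryMappedFile.OpenExisting. The map name "DemoMap", the 1024-byte capacity, and the module name are assumptions made up for the example; both handles live in one process here, although OpenExisting is how a second process on the same machine would attach.

```vb
Imports System
Imports System.IO.MemoryMappedFiles

Module NonPersistedSketch

  Sub Main()
    ' "DemoMap" is a hypothetical map name; 1024 is an arbitrary capacity in bytes.
    Using mapped = MemoryMappedFile.CreateNew("DemoMap", 1024)
      Using writer = mapped.CreateViewAccessor()
        Dim answer As Integer = 42
        writer.Write(0, answer)      ' Stores a 4-byte Integer at offset 0.
      End Using

      ' A second handle opened by name, as another process could do.
      Using sameMap = MemoryMappedFile.OpenExisting("DemoMap")
        Using reader = sameMap.CreateViewAccessor()
          Console.WriteLine(reader.ReadInt32(0))   ' Reads back the stored Integer.
        End Using
      End Using
    End Using
    ' Once every handle is disposed the data is gone; nothing was written to disk.
  End Sub

End Module
```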

If you want to work with a memory-mapped file in the same way you work with file streams, call CreateViewStream to get a MemoryMappedViewStream. The methods in that class mirror the FileStream methods. If you want to work with persisted and non-persisted views in a non-stream way, call CreateViewAccessor to get a MemoryMappedViewAccessor. Stream-style usage has methods like Read, Write, and Seek. Accessor-style usage has methods that read and write native types like characters and numbers, as well as structures and arrays of structures.
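The two styles can be compared side by side in a short sketch. The Sample structure, the offsets, and the unnamed 4096-byte map are all hypothetical and chosen only to keep the example self-contained.

```vb
Imports System
Imports System.IO
Imports System.IO.MemoryMappedFiles
Imports System.Text

Module ViewStyleSketch

  ' A hypothetical fixed-size record used only for this illustration.
  Structure Sample
    Public Id As Integer
    Public Score As Double
  End Structure

  Sub Main()
    ' Passing Nothing as the name creates a map that is not shared by name.
    Using mapped = MemoryMappedFile.CreateNew(Nothing, 4096)

      ' Accessor style: read and write value types at explicit byte offsets.
      Using accessor = mapped.CreateViewAccessor()
        Dim record As Sample
        record.Id = 7
        record.Score = 9.5
        accessor.Write(Of Sample)(0, record)

        Dim copy As Sample
        accessor.Read(Of Sample)(0, copy)
        Console.WriteLine("{0} {1}", copy.Id, copy.Score)
      End Using

      ' Stream style: the familiar Seek, Write, and Read calls.
      Using viewStream = mapped.CreateViewStream()
        Dim payload() As Byte = Encoding.ASCII.GetBytes("cicero")
        viewStream.Seek(100, SeekOrigin.Begin)
        viewStream.Write(payload, 0, payload.Length)

        viewStream.Seek(100, SeekOrigin.Begin)
        Dim readBack(payload.Length - 1) As Byte
        viewStream.Read(readBack, 0, readBack.Length)
        Console.WriteLine(Encoding.ASCII.GetString(readBack))
      End Using
    End Using
  End Sub

End Module
```

The accessor suits fixed-layout records at known offsets, while the stream suits code that already expects a Stream, such as a StreamReader.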

Listing 2 demonstrates how to read all of the characters in a file using MemoryMappedFile and count the frequency of words. Notice that there is no single place where all of the file data is accessed at once.

  Imports System.IO
  Imports System.IO.MemoryMappedFiles
  
  Public Class Form1
  
    Private Const filename As String = "..\..\cicero.txt"
    Private splitChars() As Char =
    {",", ".", " ", ":", ";", "/", "\", "[", "]",
      "{", "}", "=", "+", "-", "*", "`", "'", "1", "2",
      "3", "4", "5", "6", "7", "8", "9", "0", "(", ")", "!"}
  
    Private Sub Form1_Load(ByVal sender As System.Object,
    ByVal e As System.EventArgs) Handles MyBase.Load
  
      Me.DoubleBuffered = True
    End Sub
  
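    Private Sub Dump(ByVal hash As Hashtable)
      ' Build the report once instead of appending to TextBox1.Text for every word.
      Dim report As New System.Text.StringBuilder()
      report.AppendFormat("Number of words: {0}", hash.Count)
      ' Enumerate every entry; a loop that starts at index 1 skips the first word.
      For Each entry As DictionaryEntry In hash
        report.AppendLine()
        report.AppendFormat("{0} occurs {1} times", entry.Key, entry.Value)
      Next
      TextBox1.Text = report.ToString()
    End Sub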
  
    
    Private Sub ClearAllToolStripMenuItem_Click(ByVal sender As System.Object,
    ByVal e As System.EventArgs) Handles ClearAllToolStripMenuItem.Click
      TextBox1.Clear()
    End Sub
  
    Private hash As Hashtable = New Hashtable()
    Private Sub AddWordToHash(ByVal word As String)
      word = word.ToLower()
      If (hash(word) Is Nothing) Then
        hash.Add(word, 1)
      Else
        hash(word) = CType(hash(word), Long) + 1
      End If
      word = ""
    End Sub
  
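    Private Sub CountWordsWithMappedFileToolStripMenuItem_Click(
    ByVal sender As System.Object,
    ByVal e As System.EventArgs) Handles CountWordsWithMappedFileToolStripMenuItem.Click
  
      hash.Clear()
      Dim word As String = ""
      Dim ch As Char
      Dim fullPath As String = Path.GetFullPath(filename)
      ' A view's Capacity is rounded up to a whole page, so stop at the real file length.
      Dim length As Long = New FileInfo(fullPath).Length
  
      Dim mappedFile As MemoryMappedFile =
        MemoryMappedFile.CreateFromFile(fullPath)
      Try
        Dim position As Long = 0
        Using accessor = mappedFile.CreateViewAccessor()
          While (position < length)
            ' One byte per character assumes a single-byte encoding such as ASCII.
            ch = Microsoft.VisualBasic.ChrW(accessor.ReadByte(position))
  
            If (Not splitChars.Contains(ch) AndAlso Not Char.IsWhiteSpace(ch)) Then
              word += ch
            ElseIf (word.Length > 0) Then
              ' Skip the empty strings produced by consecutive delimiters.
              AddWordToHash(word)
              word = ""
            End If
  
            position += 1
            Application.DoEvents()
          End While
        End Using
      Finally
        mappedFile.Dispose()
      End Try
  
      ' Count the last word when the file does not end with a delimiter.
      If (word.Length > 0) Then AddWordToHash(word)
      Dump(hash)
    End Sub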
  End Class

Listing 2: Counting word frequency using a MemoryMappedFile.

Listing 2 performs the same task but works differently. The first statement clears the storage Hashtable (no big deal). CreateFromFile is one of the methods that creates a MemoryMappedFile instance from a persisted file, that is, a file on disk. There are other methods for persisted and non-persisted files. MemoryMappedFile.CreateViewAccessor without parameters maps the entire file to memory and returns a MemoryMappedViewAccessor. Both MemoryMappedFile and MemoryMappedViewAccessor implement IDisposable, so either use a Try Finally block and explicitly call Dispose, or use a Using statement, which implicitly calls Dispose at the end of the block. (A Using statement is basically expanded into a Try Finally block at compile time.)

Because String is a variable-length reference type rather than a fixed-size value type, a MemoryMappedViewAccessor does not read strings directly, so my approach reads a byte at a time, breaking words on the characters I defined as word delimiters. Again, each unique word is inserted into the Hashtable and its occurrences are counted. Because the text file, word-finder concept is so simple, there aren't significant performance differences in this demo as written. When basic file I/O and streams are too slow or won't work, use a MemoryMappedFile.
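If you do want a String out of a view, one option is to copy a block of bytes with ReadArray and decode it yourself. The sketch below assumes the file was saved as ASCII; the file name, its contents, and the module name are invented so the example can run on its own.

```vb
Imports System
Imports System.IO
Imports System.IO.MemoryMappedFiles
Imports System.Text

Module ReadTextSketch

  Sub Main()
    ' Hypothetical sample file, written here so the sketch runs on its own.
    Dim samplePath As String = Path.GetFullPath("readtext-demo.txt")
    File.WriteAllText(samplePath, "O tempora, o mores!", Encoding.ASCII)
    Dim length As Integer = CInt(New FileInfo(samplePath).Length)

    Using mapped = MemoryMappedFile.CreateFromFile(samplePath, FileMode.Open)
      Using accessor = mapped.CreateViewAccessor(0, length)
        ' Copy a block of bytes out of the view in a single call.
        Dim block(length - 1) As Byte
        Dim bytesRead As Integer = accessor.ReadArray(Of Byte)(0, block, 0, block.Length)

        ' Decode the block. ASCII is an assumption about how the file was saved.
        Dim content As String = Encoding.ASCII.GetString(block, 0, bytesRead)
        Console.WriteLine(content)
      End Using
    End Using
  End Sub

End Module
```

For a genuinely large file you would read a fixed-size block per call rather than the whole file, and handle words that span two blocks.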

The example in Listing 3 splits the text file into a couple of chunks using a MemoryMappedFile and threads to illustrate that a MemoryMappedFile supports multiple, simultaneous workers against the same file.

  Imports System.IO
  Imports System.IO.MemoryMappedFiles
  Imports System.ComponentModel
  Imports System.Collections.Concurrent
  
  Module Module1
  
      Private Const filename As String = "..\..\cicero.txt"
      
      Private fullpath As String = Path.GetFullPath(filename)
      Private info As FileInfo = New FileInfo(fullpath)
      Private hash As ConcurrentDictionary(Of String, Long) = 
        New ConcurrentDictionary(Of String, Long)()
  
  
      Sub Main()
        Dim mapped As MemoryMappedFile = MemoryMappedFile.CreateFromFile(fullpath,
          FileMode.Open, "Mapped1")
        Dim worker1 As BackgroundWorker = New BackgroundWorker()
        Dim worker2 As BackgroundWorker = New BackgroundWorker()
        Try
          ' Use integer division (\) so the midpoint is a whole byte offset.
          Dim w1 As MyWorker = New MyWorker(0, info.Length \ 2, hash)
          Dim w2 As MyWorker = New MyWorker(info.Length \ 2, info.Length, hash)
  
          AddHandler worker1.DoWork, AddressOf w1.Work
          AddHandler worker2.DoWork, AddressOf w2.Work
          worker1.RunWorkerAsync(mapped)
          worker2.RunWorkerAsync(mapped)
  
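          ' Wait for both workers; sleep briefly instead of spinning at full speed.
          While (worker1.IsBusy OrElse worker2.IsBusy)
            System.Threading.Thread.Sleep(10)
          End While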
  
        Finally
          mapped.Dispose()
          worker1.Dispose()
          worker2.Dispose()
        End Try
  
        Dump(hash)
        Console.ReadLine()
  
      End Sub
  
      Sub Dump(ByVal hash As ConcurrentDictionary(Of String, Long))
        Console.WriteLine("Number of words: {0}", hash.Count)
  
        Dim ordered = From k In hash.Keys
                      Order By k
                      Select New With {.Word = k, .Count = hash(k)}
  
                      
        Array.ForEach(ordered.ToArray(), Sub(o)
          Console.WriteLine("{0} occurs {1} times", o.Word, o.Count)
        End Sub)
      End Sub
  
  End Module
  
  
  Public Class MyWorker
    Private splitChars() As Char =
      {",", ".", " ", ":", ";", "/", "\", "[", "]",
      "{", "}", "=", "+", "-", "*", "`", "'", "1", "2",
      "3", "4", "5", "6", "7", "8", "9", "0", "(", ")", "!"}
  
    Private Property start As Long
    Private Property finish As Long
    Private hash As ConcurrentDictionary(Of String, Long)
    Private worker As BackgroundWorker = New BackgroundWorker()
  
    ''' <summary>
    ''' Initializes a new instance of the MyWorker class.
    ''' </summary>
    ''' <param name="Hash"></param>
    Public Sub New(ByVal Start As Long, ByVal Finish As Long,
      ByVal Hash As ConcurrentDictionary(Of String, Long))
        Me.start = Start
        Me.finish = Finish
        Me.hash = Hash
    End Sub
  
  
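    Private Sub AddWordToHash(ByVal word As String)
      ' AddOrUpdate inserts or increments atomically; a separate ContainsKey
      ' check followed by an indexer update can lose counts between threads.
      hash.AddOrUpdate(word.ToLower(), 1L, Function(key, count) count + 1L)
    End Sub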
  
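    Public Sub Work(ByVal sender As Object, ByVal e As DoWorkEventArgs)
      Dim mapped As MemoryMappedFile = DirectCast(e.Argument, MemoryMappedFile)
      Dim position As Long = start
      Dim ch As Char
      Dim word As String = ""
      Using accessor = mapped.CreateViewAccessor()
        While (position < finish)
          ' One byte per character assumes a single-byte encoding such as ASCII.
          ch = Microsoft.VisualBasic.ChrW(accessor.ReadByte(position))
          If (Not splitChars.Contains(ch) AndAlso Not Char.IsWhiteSpace(ch)) Then
            word += ch
          ElseIf (word.Length > 0) Then
            ' Skip the empty strings produced by consecutive delimiters.
            AddWordToHash(word)
            word = ""
          End If
          position += 1
        End While
      End Using
      ' Count the last word when this range does not end on a delimiter.
      If (word.Length > 0) Then AddWordToHash(word)
    End Sub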
  End Class

Listing 3: Using the BackgroundWorker to split reading the mapped file into multiple threads.

The revised sample program uses a ConcurrentDictionary, which is thread-safe. The file is split in half and each half is processed on its own BackgroundWorker. Note that a word straddling the midpoint is counted as two fragments, one by each worker.
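One caveat is worth a small sketch: each ConcurrentDictionary call is thread-safe on its own, but a check followed by a separate update is not a single atomic step. The module below uses AddOrUpdate to increment a shared counter from two tasks. The key "lorem", the iteration count, and the use of Task instead of BackgroundWorker are choices made only for this illustration.

```vb
Imports System
Imports System.Collections.Concurrent
Imports System.Threading.Tasks

Module AtomicCountSketch

  Sub Main()
    Dim counts As New ConcurrentDictionary(Of String, Long)()

    ' Each task bumps the same key 100,000 times.
    Dim bump As Action =
      Sub()
        For i As Integer = 1 To 100000
          ' Insert 1 if the key is missing, otherwise add 1 to the current value.
          counts.AddOrUpdate("lorem", 1L, Function(key, current) current + 1L)
        Next
      End Sub

    Dim first As Task = Task.Factory.StartNew(bump)
    Dim second As Task = Task.Factory.StartNew(bump)
    Task.WaitAll(first, second)

    ' No increments are lost, so this prints 200000.
    Console.WriteLine(counts("lorem"))
  End Sub

End Module
```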

Summary

The MemoryMappedFile supports a stream mode that lets you perform seeks like the System.IO file stream classes, and it supports mapping a file to memory so you can access very large files as if they were an object in memory. For extremely large files, like those that exceed the 32-bit memory limit, you can split the file into multiple MemoryMappedViewAccessors and operate on chunks of the file up to the logical memory limit of each process.





About the Author

Paul Kimmel

Paul Kimmel is the VB Today columnist for CodeGuru and has written several books on object-oriented programming and .NET. Check out his book Professional DevExpress ASP.NET Controls (from Wiley), now available on Amazon.com and in fine bookstores everywhere. Look for his upcoming book Teach Yourself the ADO.NET Entity Framework in 24 Hours (from Sams). You may contact him with technology questions at pkimmel@softconcepts.com. Paul Kimmel is a Technical Evangelist for Developer Express, Inc., and you can ask him about Developer Express at paulk@devexpress.com and read his DX blog at http://community.devexpress.com/blogs/paulk.
