Working with Large Memory-Mapped Files in VB 2010

Introduction

Truth be told, sometimes a technology comes along and I don't have a user story for it. The technology may not be that hard to describe, but a user story that feels right, and is compact enough to fit into an article-length column, just doesn't come to me. Memory-mapped files are like that.

Generally, a file is read all at once or sequentially into memory and then manipulated. (Random access is supported by traditional streaming libraries, too.) The problem is that very large files can easily exhaust memory, especially on a 32-bit machine with its two-gigabyte limit. Memory-mapped files logically treat all file data as if it were loaded in memory, and you can access any part of the file as if it were.

In a nutshell, a memory-mapped file lets you treat a file as if it were entirely loaded in memory, with no logical upper limit on the file size. Without spilling the beans entirely, let's chunk the differences between file I/O and memory-mapped files into some meaty bits.

File I/O vs. Memory Mapped Files

A traditional use of System.IO is to read a file into memory and manipulate it, or to seek to a point in a file and manipulate it at that point. The challenges with traditional file I/O have to do with physical memory limitations (two gigabytes, and usually much less, on 32-bit systems) and the cost of seeking through very large files. A stream-based file access approach would probably not work very well for a database server, large text documents, or anything file-based that is particularly large.

The memory-mapped file capabilities in .NET Framework 4 work with both physical files and logical files. They map a file (or part of a file) and let you treat it as if it were entirely loaded in memory; the system's memory manager takes care of moving between the logical and physical mapping of the file. (There is also stream-based support for memory-mapped files.) What this means is that you can have an extremely large file and interact with it as if it were all loaded, while the memory manager handles moving the bits to and from their physical location.

There is still a two-gigabyte limit for memory-mapped files on a 32-bit system, but it is two gigabytes per chunk, and you can create multiple chunks. The idea is that you can map all of a file or just a chunk of it, defined by a starting offset and a size, and access that chunk. Because you can split a file into multiple chunks, you can work with files larger than the two-gigabyte limit (on a 32-bit system), and you can access the chunks from different processes. Remember, because you are logically working with memory, you can treat mapped files as you would any other memory; this includes sharing the bits between processes.
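To make the chunking concrete, here is a minimal sketch of mapping just a window of a larger file with CreateViewAccessor. The file name huge.dat and the offset and size are assumptions for illustration; substitute any large file you have on hand.

```vb
Imports System.IO
Imports System.IO.MemoryMappedFiles

Module MapDemo
  Sub Main()
    ' Hypothetical file name, chosen for this example only.
    Dim path As String = "huge.dat"

    ' Map a one-megabyte window of the file, starting 4 KB in.
    Using mapped = MemoryMappedFile.CreateFromFile(path, FileMode.Open)
      Using accessor = mapped.CreateViewAccessor(4096, 1024 * 1024)
        ' Read the first byte of the view; the memory manager pages
        ' the underlying file data in and out as needed.
        Dim b As Byte = accessor.ReadByte(0)
        Console.WriteLine("First byte of the view: {0}", b)
      End Using
    End Using
  End Sub
End Module
```

You could create additional view accessors at other offsets, or open the same map from another process, to cover a file much larger than any single two-gigabyte view.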

Accessing a File Using File I/O

I think a challenging scenario might be writing your own file server, database server, or perhaps a document-processing application, and exploring how memory-mapped files might improve performance. (However, that is probably a pretty big task.)

The task I picked was to perform a word frequency count. The example simply reads a file and counts the frequency of words. Listing 1 demonstrates one way you might do this using standard File I/O.

  Imports System.IO
  Imports System.IO.MemoryMappedFiles

  Public Class Form1

    Private Const filename As String = "..\..\cicero.txt"
    ' Characters that separate words; the "c" suffix makes each a Char literal.
    Private splitChars() As Char =
    {","c, "."c, " "c, ":"c, ";"c, "/"c, "\"c, "["c, "]"c,
      "{"c, "}"c, "="c, "+"c, "-"c, "*"c, "`"c, "'"c, "1"c, "2"c,
      "3"c, "4"c, "5"c, "6"c, "7"c, "8"c, "9"c, "0"c, "("c, ")"c, "!"c}

    Private Sub Form1_Load(ByVal sender As System.Object,
    ByVal e As System.EventArgs) Handles MyBase.Load

      Me.DoubleBuffered = True
    End Sub

    Private Sub CountWordsWithStreamToolStripMenuItem_Click(
      ByVal sender As System.Object,
      ByVal e As System.EventArgs) Handles CountWordsWithStreamToolStripMenuItem.Click

      hash.Clear()
      Dim lines = File.ReadLines(filename)
      Dim words() As String

      For Each line In lines
        words = line.Split(splitChars, StringSplitOptions.RemoveEmptyEntries)
        For Each word In words
          AddWordToHash(word)
        Next
        ' Keep the UI responsive; once per line is plenty.
        Application.DoEvents()
      Next
      Dump(hash)
    End Sub

    Private Sub Dump(ByVal hash As Hashtable)
      ' Hashtable.Keys has no indexer, so enumerate the keys directly,
      ' and build the text with a StringBuilder instead of repeated appends.
      Dim builder As New System.Text.StringBuilder()
      builder.AppendFormat("Number of words: {0}", hash.Count)
      For Each key In hash.Keys
        builder.AppendLine()
        builder.AppendFormat("{0} occurs {1} times", key, hash(key))
      Next
      TextBox1.Text = builder.ToString()
    End Sub

    
    Private Sub ClearAllToolStripMenuItem_Click(ByVal sender As System.Object,
    ByVal e As System.EventArgs) Handles ClearAllToolStripMenuItem.Click
      TextBox1.Clear()
    End Sub

    Private hash As New Hashtable()

    Private Sub AddWordToHash(ByVal word As String)
      word = word.ToLower()
      If hash(word) Is Nothing Then
        ' Store the count as a Long from the start so the increment below
        ' always converts from the same type.
        hash.Add(word, 1L)
      Else
        hash(word) = CType(hash(word), Long) + 1
      End If
    End Sub

  End Class

Listing 1: Count the frequency of words using standard File I/O.

Listing 1 is pretty straightforward. Read all of the lines of a text file, split each line into an array using an array of non-word characters, and stick each word into a hashtable. The word is the key and the count is the value; every time the same word is stuffed into the hashtable, the value at that location is incremented. When you are done, you have a word frequency count. You could use the same approach for tasks like search and replace, highlighting keywords, or spell checking (think word processing here). You get the idea.

The output from running the code in Listing 1 against some text from Cicero is shown in Figure 1. Cicero was a Roman philosopher, and the quote was generated and extracted from http://www.blindtextgenerator.com.

Figure 1: The word frequency of some text from Cicero.
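For comparison, the same count can be sketched with a memory-mapped file. This is a sketch under the assumption that filename, splitChars, hash, AddWordToHash, and Dump from Listing 1 are in scope; CreateViewStream exposes the mapped view as an ordinary stream, so a StreamReader can walk it line by line while the memory manager does the paging.

```vb
    ' Sketch only: assumes filename, splitChars, hash, AddWordToHash,
    ' and Dump from Listing 1 are in scope, plus
    ' Imports System.IO and System.IO.MemoryMappedFiles.
    Private Sub CountWordsWithMemoryMappedFile()
      hash.Clear()
      Using mapped = MemoryMappedFile.CreateFromFile(filename, FileMode.Open)
        ' With no arguments, CreateViewStream covers the entire file.
        Using reader As New StreamReader(mapped.CreateViewStream())
          Dim line As String = reader.ReadLine()
          While line IsNot Nothing
            For Each word In line.Split(splitChars,
                StringSplitOptions.RemoveEmptyEntries)
              AddWordToHash(word)
            Next
            line = reader.ReadLine()
          End While
        End Using
      End Using
      Dump(hash)
    End Sub
```

For a file this small the two approaches perform about the same; the mapped version pays off as files grow, because only the pages you touch need to be resident at once.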
