Three Powerful Yet Untapped Features of NTFS

The NTFS file system in Windows introduced several features that improve the performance, stability, and reliability of file storage. Three of these lend themselves to advanced methods of information storage and collation, yet few applications have made extensive use of them. This article introduces these three features and identifies some potential uses for each.

This article is intended for experienced programmers.

Introduction

While working on Windows NT, Microsoft realised the serious deficiencies that existed within the old FAT file system (FS). A server utilising FAT cannot function as reliably as is required in a commercial or industrial environment.

In an effort to address these deficiencies and the ubiquitous inefficiencies of FAT, Microsoft designed the new NTFS. This new FS bolsters security, performance, and reliability, and also supplies several advanced features not supported by FAT. Native support for file encryption and compression, disk quotas, and access permissions were all included in NTFS (although some have only been added subsequent to the original design of NTFS).

An example of the improvements of NTFS over FAT is transactional file processing. To improve reliability, NTFS volumes support transactional processing by means of a log file, which ensures that transactions complete successfully. When a transaction begins, an entry is first written to this log file; the transaction is then committed, and the end of the transaction is written to the log. Should a system fault occur while the transaction is being processed, its end will not be reflected in the log file, and the system can then perform a recovery on the volume at the next start-up.
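The logging scheme just described can be sketched in a few lines. This is a simplified illustration of the general write-ahead-logging technique, not NTFS's actual on-disk log format:

```python
# Minimal write-ahead-log sketch (illustration only, not NTFS's real format).
# Each transaction writes a "begin" record, does its work, then writes an
# "end" record. After a crash, any "begin" without a matching "end"
# identifies a transaction that needs recovery.

def incomplete_transactions(log):
    """Return IDs of transactions whose 'begin' has no matching 'end'."""
    open_tx = set()
    for kind, txid in log:
        if kind == "begin":
            open_tx.add(txid)
        elif kind == "end":
            open_tx.discard(txid)
    return open_tx

# Transaction 1 completed; transaction 2 was interrupted by a "crash"
# before its end record could be written.
log = [("begin", 1), ("end", 1), ("begin", 2)]
print(incomplete_transactions(log))  # {2}
```

At the next start-up, the recovery pass simply rolls back (or completes) every transaction identified this way, which is why an NTFS volume comes back to a consistent state after a fault.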

Hardlinks In NTFS

One of the features built into NTFS is the separation of the data and presentation layers. This means that data exists in one form on the disk, but when accessed through NTFS can be presented differently to the user. This is done through the use of links.

A link is very similar to the concept of a pointer as used in coding. While data exists in memory, it can be accessed via a pointer that simply stores the address of that data. Multiple pointers can be used to point to the same data. These pointers can be moved, copied, and deleted, leaving the original data intact. Moving or copying the pointers incurs a fraction of the overhead required to move or copy the data.

In the same way, because links are used to reference data on a volume, these links can be moved and copied to represent movement of the data to the user while the data remains at the same location.

The concept of a link is depicted in the following diagram:

Most Windows users are already familiar with this concept through shortcuts. Shortcuts in Windows are softlinks. However, a more deeply ingrained form of linking exists in NTFS: the hardlink. All files in NTFS are presented through hardlinks, so one may say that all Windows users have, in fact, used hardlinks. NTFS does not, however, restrict data to being presented through only one hardlink; the same data can be presented through multiple links.

Thus, the user is presented with two files containing the same data, although only one copy of that data exists on the volume.

These hardlinks may occur either in the same or in different folders, but must occur on the same volume.

Data is removed from the data layer only when the last remaining hardlink to it is deleted. Unlike Windows shortcuts, hardlinks can therefore be deleted safely without any need to manage dangling links.

By the way, you may have noticed that moving a large file from disk to disk can take a long time, whereas moving it between folders on the same volume is almost instantaneous. This is because only the link is transferred when a file moves within a volume.

An Example of Practical Uses for Hardlinks

So, what are the potential uses of hardlinks? Well, quite often one will define a folder structure and then attempt to place files into the correct folder. But, all too often, a file should exist in more than one place.

For example, consider a filing system for pictures and photos. One may have a folder for dogs and another for dwellings.

What would you do if you had a photo of a dog in front of a house? Where does it go? It contains a dog, so it should go into the dog folder... but it also contains a house and should perhaps go into the dwellings folder.

It would be possible for the picture file to be placed in one folder and then a copy to be made in the other, but this would waste space. One also could use a shortcut to provide a link to the original file. This method can lead to management problems, especially in large filing systems. A database that identifies each file with several keywords could be built up. A lookup in the database then could provide the necessary information to find the file. This solution is not practical for the average home user to implement. It also does not integrate well with the Windows file system, making opening files from within a third-party application difficult.

Here, a hardlink may be the ideal solution. The file can be made to exist in both folders at the same time. Should the file need to be edited, the change would be reflected in both "files." The system is fairly easy to set up, and no management is required to ensure that moved and deleted links are handled properly. Because the link exists at a native level in Windows, files can easily be accessed by third-party applications. The only potential downside to this technique is that, when one of the "files" is deleted, the data still remains on the volume and the other link remains valid. Whether this is an advantage or a disadvantage depends on the application.

Accessing and Managing Hardlinks

Hardlinks can be created from the command prompt by using the FSUTIL utility. The syntax for creating a hardlink is:

fsutil hardlink create FileName ExistingFileName

where FileName is the name of the new "file" to be created as a link to the original file, ExistingFileName.

From within code, hardlinks are created by utilising the CreateHardLink Windows API function. This function is defined as follows (according to the Visual Studio 6.0 MSDN):

BOOL CreateHardLink(
   LPCTSTR lpFileName,
   LPCTSTR lpExistingFileName,
   LPSECURITY_ATTRIBUTES lpSecurityAttributes
);

lpFileName is the name and path of the link to be created to the original file, given by lpExistingFileName. lpSecurityAttributes points to a security descriptor for the new link (may be NULL to use default security).
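The semantics described above can be demonstrated portably. The following sketch uses Python's os.link, which performs the same operation (on Windows, Python implements it with CreateHardLink); the file names are arbitrary examples:

```python
# Demonstrates hardlink semantics: two names, one copy of the data.
import os
import tempfile

d = tempfile.mkdtemp()
original = os.path.join(d, "original.txt")
link = os.path.join(d, "link.txt")

with open(original, "w") as f:
    f.write("shared data")

# Equivalent of: fsutil hardlink create link.txt original.txt
os.link(original, link)

# Both names refer to the same data on disk.
with open(link) as f:
    assert f.read() == "shared data"

# An edit through one name is visible through the other.
with open(link, "a") as f:
    f.write(" -- edited")
with open(original) as f:
    print(f.read())  # shared data -- edited

# Deleting one name leaves the data reachable through the remaining link.
os.remove(original)
with open(link) as f:
    assert f.read() == "shared data -- edited"
```

Note the final step: removing one link does not remove the data, exactly as described for the image-library example.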


Junction Points

It is not possible to create hardlinks for folders under Windows. However, the same effect can still be achieved by using a different technique. To provide transparency when working with distributed files, Microsoft introduced the Distributed File System (DFS). The MSDN, in a conference paper entitled "DFS Overview," describes it thus: "DFS allows administrators to build hierarchical directories that span multiple file servers and file shares."

An administrator could set up a DFS as follows (taken from the MSDN conference paper "Technical Overview of DFS"):

[JunctionsSetup.jpg]

This would appear to the user as:

[JunctionsUserView.jpg]

From this illustration, it can be seen that a junction is simply a link, on a hard drive, that points to another location. This location can be on the same drive or elsewhere on a network. The most common form of junction encountered by Windows users is a mounted network drive. (More information on the different types of links and junctions can be found in the MSDN.) By using junctions, one can replicate directories, just as one can replicate files with hardlinks.

It is also interesting to note the presence of alternate volumes. A single junction can contain up to 32 alternate volumes. This means that, should the normally accessed volume become inaccessible, an alternate volume will be used. This process is used extensively in clustering. DFS does not ensure that both volumes contain the same data; synchronisation of the volumes is left up to the user, and is usually handled by an automatic backup service.

An example of practical uses for junctions

Consider, again, the example of an image library. As the image library grows, it may become larger than the amount of disk space available. A new disk then must be made available, either on the same machine or on a separate machine in the network. To copy all the files to this new drive may not be practical, and often staff have to be retrained on where to find their required files.

This process can be simplified by adding a junction, which appears to users as an ordinary directory even though the data resides in a different location. Several image categories can be placed on the new drive, while others remain in their original location.

In addition, a backup server that contains a copy of the image library can be created. Junctions can be established to point to this backup as an alternate volume. Should maintenance be required on the normal file server, the secondary server can be made to perform the same task seamlessly, without any disruptions to the network or users.
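The core idea, a directory name that transparently points at another location, can be illustrated portably. The sketch below uses a directory symbolic link rather than a true NTFS junction (junctions are Windows-specific), but from the user's point of view the behaviour is the same; the directory names are invented for the example:

```python
# Portable illustration of the junction idea: a directory entry that
# transparently redirects to another location. A directory symlink stands
# in here for an NTFS junction; users browsing "Images" never see the
# redirection.
import os
import tempfile

root = tempfile.mkdtemp()
real_dir = os.path.join(root, "archive_drive")  # the new, larger drive
junction = os.path.join(root, "Images")         # what users browse
os.mkdir(real_dir)
os.symlink(real_dir, junction, target_is_directory=True)

# A file written to the real location...
with open(os.path.join(real_dir, "dog.jpg"), "w") as f:
    f.write("image bytes")

# ...is visible through the linked directory, as if it lived there.
print(os.listdir(junction))  # ['dog.jpg']
```

On an actual NTFS volume the redirection would be created with linkd (or the DFS APIs below) instead of os.symlink, and could additionally name alternate volumes for failover.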

Accessing and Managing Junctions

A search for "junction" in the Windows XP Help and Support Centre returns a result stating that the "linkd" command may be used to create junctions. This program is not natively available in XP, but can be obtained from the Windows 2003 Resource Kit. The operating systems supported are Windows 2003 and Windows XP.

A number of API functions are available to manage junctions. These are:

  • NetDfsAdd
  • NetDfsEnum
  • NetDfsGetInfo
  • NetDfsRemove
  • NetDfsSetInfo

They are defined as follows:

NetDfsAdd

NET_API_STATUS NET_API_FUNCTION NetDfsAdd(
   LPWSTR DfsEntryPath,
   LPWSTR ServerName,
   LPWSTR ShareName,
   LPWSTR Comment,
   DWORD Flags
);

This adds a new junction point to a DFS tree. DfsEntryPath is the entry path for the added junction (for example, from the above example, \\IIS\Root\Content); ServerName is the server exporting the storage; ShareName is the existing share name for the storage; Comment is a comment field; and Flags denotes various flags that can be used when creating the junction.

NetDfsEnum

NET_API_STATUS NET_API_FUNCTION NetDfsEnum(
   LPWSTR DfsName,
   DWORD Level,
   DWORD PrefMaxLen,
   LPBYTE *Buffer,
   LPDWORD EntriesRead,
   LPDWORD ResumeHandle
);

Used to list all junction points in a DFS tree. DfsName is the name of the DFS; Level specifies the amount of information required; PrefMaxLen is the maximum number of bytes to be returned; Buffer receives a pointer to the requested information; EntriesRead returns the number of entries read; ResumeHandle should be 0 on the first call and reused on subsequent calls.

NetDfsGetInfo

NET_API_STATUS NET_API_FUNCTION NetDfsGetInfo(
   LPWSTR DfsEntryPath,
   LPWSTR ServerName,
   LPWSTR ShareName,
   DWORD Level,
   LPBYTE *Buffer
);

Acquires information about a junction point. DfsEntryPath is the DFS entry path of the junction point; ServerName (optional) is the name of the server exporting the storage; ShareName (optional) is the name of the share exporting the storage; Level specifies the amount of information requested; Buffer receives the requested information.

NetDfsRemove

NET_API_STATUS NET_API_FUNCTION NetDfsRemove(
   LPWSTR DfsEntryPath,
   LPWSTR ServerName,
   LPWSTR ShareName
);

Removes a junction point from a DFS. DfsEntryPath is the path of the junction point to be removed; ServerName is the name of the server exporting the storage; ShareName is the name of the share exporting the storage.

NetDfsSetInfo

NET_API_STATUS NET_API_FUNCTION NetDfsSetInfo(
   LPWSTR DfsEntryPath,
   LPWSTR ServerName,
   LPWSTR ShareName,
   DWORD Level,
   LPBYTE Buffer
);

Sets information for a junction point. DfsEntryPath is the DFS entry path of the junction point; ServerName (optional) is the name of the server exporting the storage; ShareName (optional) is the name of the share exporting the storage; Level is the level of information to be set; Buffer is the memory buffer containing the information to be set.


Multiple Data Streams

To allow Windows to interoperate with the Macintosh file system, in which a file carries both a data fork and a resource fork, support for multiple data streams was introduced. Multiple data streams are, as the name suggests, multiple binary sequences all associated with a single file name.

One can think of multiple data streams as multiple files grouped together under one file name.

[MultipleStreams.jpg]

In a file with multiple data streams, only the first (default) stream is examined by most applications. Indeed, a Windows application has to be specifically written to comprehend multiple streams and to read the data from streams other than the default stream. Even Windows Explorer does not comprehend multiple streams. (You can prove this by creating a multiple stream file and placing a large amount of data in the second stream. Explorer will only report the size of the first stream.)

An example of practical uses for alternate streams

Often, it doesn't matter how good your filing system is; once the size gets too large, you just can't find what you are looking for. The above-mentioned image library is a perfect example.

All images within the library could contain multiple data streams. The first data stream (as seen by most applications) would be the picture. Keywords and a description or comment can be written into the alternate streams. An application then can be developed to allow these alternate streams to be displayed or perhaps searched. The user then could simply type in a keyword they are looking for (for example, house) and all images containing that keyword in their alternate streams could be returned.

This is all well and good, but why not simply use a database and store all the keywords in there, with links to the file's path (it could well be quicker)? First, link management. It would be a real pain to ensure that the database and file system are correctly synchronised. Should a file be moved or deleted, its link would become broken. Second, the file becomes self-describing. It can be copied to another machine with a similar setup and would automatically be correctly referenced.

As a second example, consider coding a small application... and you want to be sure you don't lose the source code. You can define the .exe as the default stream and save all source code in alternate streams. Or, if your program depends on other resources, .dlls, or binary files, these can also be stored in alternate streams, should you so desire. One could even implement a source control using these alternate streams.

Accessing alternate data streams

Alternate streams are accessed by utilising a colon (:) in the filename. For example, a file might have the name test.txt. The default stream is accessed by using test.txt. A second stream can be accessed by using the name test.txt:stream2. You may have a little bit of trouble trying to test this through the normal Windows interfaces (Explorer). But as a test, try the following:

Create a file named c:\test.txt containing the text "Default Stream". Then, from the command window, enter the command "echo Second Stream>c:\test.txt:stream2". Now, when you open test.txt, "Default Stream" will appear. Open the second stream (notepad c:\test.txt:stream2) and Notepad will display "Second Stream".

[MultipleStreamsExample.jpg]

When attempting this example, I encountered problems with Notepad wanting to create test.txt:stream2.txt when trying to view test.txt:stream2. If you also encounter this, simply view the alternate stream with the command "more < c:\test.txt:stream2".

By the way, you'll notice that, no matter how big you make the second stream, Explorer will still report the size as 14 bytes: the size of the default stream alone.

Managing alternate streams through code

There is very little documentation in the MSDN on how to manipulate alternate streams through code.

Simple functionality can be obtained by using some of the standard API functions. To create multiple streams from within code, the CreateFile Windows API function can be used, with the alternate stream name provided after a colon. Data from within alternate streams can be read using any of the ReadFile functions, and data can be written to a stream by using the WriteFile functions.
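Because the stream name is just part of the file name, the ordinary file-open functions suffice for simple cases. The sketch below shows the naming convention; the actual stream access only works on an NTFS volume under Windows, so that portion is guarded (the file and stream names are invented for the example):

```python
# Alternate data streams are addressed as "filename:streamname".
import os

def stream_path(filename, stream):
    """Build the name used to address an alternate stream of a file."""
    return f"{filename}:{stream}"

assert stream_path("test.txt", "stream2") == "test.txt:stream2"

# Streams only exist on NTFS, so the demonstration is Windows-guarded.
if os.name == "nt":
    with open("test.txt", "w") as f:  # default stream
        f.write("Default Stream")
    with open(stream_path("test.txt", "stream2"), "w") as f:
        f.write("Second Stream")      # same idea as CreateFile with ":stream2"
    with open(stream_path("test.txt", "stream2")) as f:
        print(f.read())               # Second Stream
```

This mirrors the command-prompt experiment shown earlier: the default stream and the alternate stream are written and read through the same open call, differing only in the name passed to it.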

However, testing whether a file contains any alternate streams, and subsequently manipulating them, is something of a chore. This requires accessing several of the functions exported by a DLL named ntdll.dll, found in the system32 directory. These functions are not made accessible through the Win32 SDK; instead, one has to look into the DDK. The most useful function for alternate stream information is NtQueryInformationFile.

Microsoft and alternate streams

If you are like me, the first thing you thought was... "I wonder what Microsoft has been hiding in there?" Well, guess what; the answer is... NOTHING! Microsoft hasn't made much use of this functionality. Well, at least not much. .exe files downloaded using Internet Explorer 6 and files saved from attachments in Outlook Express (after installing XP SP2) have had alternate streams attached. I suspect that Microsoft uses this method to inform the computer that a file originates from a non-trusted source. Please note: This is merely a suspicion; I have not yet been able to prove anything for or against it.

By the way, it should perhaps be noted that Windows NT 3 supported multiple streams inconsistently, but this problem was fixed in NT 4.

Conclusion

Hardlinks, junctions, and multiple data streams have been introduced, and the basic operation and uses of each have been discussed. By harnessing these features, better-formed and more robust filing systems can be built, providing users with extended usability and functionality. Complex file management systems can be created with relative ease thanks to these features.

Acknowledgements

Thanks go to Bjorn Liebenberg for supplying the idea for this article and providing much of the initial research.


