A lot of us use UDTs in our programming, but how are they really laid out in memory?
As my programming progressed, I went through a definite phase where I used UDTs a lot. They were a nice, organized way to clump together a bunch of settings/variables into one discrete package.
For our first example, let's look at a simple UDT defined with four numeric elements.
    Type test
        a As Long
        b As Long
        c As Long
        d As Integer
    End Type
In VB, the long datatype takes up four bytes of memory; the integer, two. This puts our total UDT size at 14 bytes of memory. We can confirm this with the following code:
    Dim i As Integer, l As Long, myUdt As test

    MsgBox "Integers are " & Len(i) & " bytes long"
    MsgBox "Longs are " & Len(l) & " bytes long"
    MsgBox "My UDT w/ 3 longs and an int is: " & Len(myUdt) & _
           " bytes long"
To really dig behind the scenes, we are going to need a way to look at the actual UDT held in memory. Unfortunately, VB does not provide a memory dump window for us, so we will have to enlist some outside help. There are many programs out there that will let you read a process's memory, everything from game cheats to system debuggers. What I will use for this demo is the debugger that comes with Visual C++ 6. (You can also use WinDbg, which is free and can be downloaded from the MS site.)
First, we create a simple VB project that loads our test UDT with some values. After the UDT is loaded, we will display some of its properties.
    Dim myUdt As test

    With myUdt
        ... 'load udt with values
    End With

    MsgBox "Address: " & Hex(VarPtr(myUdt)) & " ByteSize:" & _
           Len(myUdt)
When the MessageBox fires, it will tell us the memory address where the UDT resides, as well as the size in bytes of the structure.
The MessageBox also performs a second function for us. Because MessageBoxes are modal, they pause the execution of the code and give us time to attach the debugger and check out the memory location of the UDT.
To attach the debugger, first compile the exe, and then manually start it up. You will see the MessageBox with the info we need. Now, start up your debugger and attach it to the process. In Visual Studio, this is done through the Build, Debug, Attach to Process menu.
After our debugger is attached to the program, we want to probe the memory location of the UDT. Make sure the memory window is open (View, Debug Windows, Memory). Now, enter the hex value of the address of the UDT.
Here is a screen shot of the code that loaded the UDT and the actual UDT in memory. (The code is included in the sample project download.)
In the screen above, I have highlighted the 14 bytes that make up the UDT created by the code. I have also color-coded each member of the UDT, based on its variable type length. (Remember that Longs are four bytes long; integers, two, and so forth.)
One thing that looks kind of strange is the value of member A. In the code it was loaded with the value 258; however, in the memory we see it stored as 02 01 00 00. Is this right?
When numbers are stored in memory, they are stored in the little endian format. This is really just a fancy way of saying that the least significant byte comes first, so the bytes have to be read from right to left. 02 01 00 00 in memory is actually the number 00 00 01 02, which in hex is &h102 (258 decimal).
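If you would rather see the byte order without firing up a debugger, here is a minimal sketch that copies a Long into a byte array and prints the bytes in memory order. The function name LongBytesInMemory is made up for this illustration, and the code assumes it lives in a standard module on an ordinary x86 Windows box.

```vb
Option Explicit

' Usual declare for CopyMemory (an alias for RtlMoveMemory in kernel32).
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)

' Hypothetical helper: returns the four bytes of a Long in the order
' they sit in memory, as two-digit hex values separated by spaces.
Public Function LongBytesInMemory(ByVal value As Long) As String
    Dim b(0 To 3) As Byte
    Dim i As Long
    Dim s As String

    CopyMemory b(0), value, 4          ' raw copy of the Long's 4 bytes

    For i = 0 To 3
        s = s & Right$("0" & Hex$(b(i)), 2) & " "
    Next i

    LongBytesInMemory = Trim$(s)
End Function
```

Calling Debug.Print LongBytesInMemory(258) in the Immediate window prints 02 01 00 00, with the low byte first.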
Okay, a simple UDT is just a block of sequential memory, with its numbers stored in a funny format. How does that help me?
Well, I could have just made you the $10,000 jackpot winner on Jeopardy, but in more immediate terms, this background knowledge helps us understand UDTs better and lets us do some unconventional things with them.
Let's say we wanted to make a copy of a UDT. With the CopyMemory API, we could literally clone the structure to another one.
    Dim udt1 As test, udt2 As test
    CopyMemory udt2, udt1, Len(udt1)

Of course, this is not that helpful on its own. We could simply assign udt2 = udt1 instead, but the example proves our point: because we know how the UDT is laid out in memory, we can use that to our advantage and manipulate it through its raw memory layout.
Let's say we had another UDT structure that had the exact same layout of its first four elements. Because these are both just blocks of memory of a known size, we could actually load up this second UDT's first four members from our other UDT type!
That could be handy, handy to know anyway (we will get into just where later on).
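To make that idea concrete, here is a rough sketch. The second type, testEx, and its extra members are invented for the example; the only thing that matters is that its first four members match test exactly, and that neither type needs padding among those members.

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)

Private Type test
    a As Long
    b As Long
    c As Long
    d As Integer
End Type

' Hypothetical second type: same first four members, then extras.
Private Type testEx
    a As Long
    b As Long
    c As Long
    d As Integer
    flags As Integer
    extra As Long
End Type

' Loads a test UDT, copies its 14 bytes over the start of a testEx,
' and reports whether the shared members came across intact.
Public Function CopySharedMembers() As Boolean
    Dim src As test
    Dim dst As testEx

    src.a = 258: src.b = 2: src.c = 3: src.d = 4   ' sample values (assumed)
    dst.extra = 99                                 ' must survive the copy

    CopyMemory dst, src, Len(src)                  ' only the first 14 bytes

    CopySharedMembers = (dst.a = 258) And (dst.b = 2) And _
                        (dst.c = 3) And (dst.d = 4) And (dst.extra = 99)
End Function
```

The copy length comes from the smaller type, so the members that follow the shared block in testEx are left alone.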
How about if we wanted to store a UDT as an array of bytes? Now that is something handy. (Actually, that is what drove me to this research.)
We know the size of our UDT, we know how many bytes it is, and that they are all sequential. So, let's store them into a byte array.
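A minimal sketch of that step might look like the following. The helper names TestToBytes and DumpTest and the sample values are assumptions for illustration, and the code is meant for a standard module (the Type is Public so it can be passed as a parameter).

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)

Public Type test
    a As Long
    b As Long
    c As Long
    d As Integer
End Type

' Hypothetical helper: returns the UDT's bytes in a base 1 array.
Public Function TestToBytes(udt As test) As Byte()
    Dim b() As Byte

    ReDim b(1 To Len(udt))             ' 14 elements, indexed 1..14
    CopyMemory b(1), udt, Len(udt)     ' raw copy out of the UDT

    TestToBytes = b
End Function

' Prints each byte as two hex digits in the Immediate window.
Public Sub DumpTest()
    Dim myUdt As test
    Dim b() As Byte
    Dim i As Long

    myUdt.a = 258: myUdt.b = 2: myUdt.c = 3: myUdt.d = 4  ' sample values (assumed)
    b = TestToBytes(myUdt)

    For i = LBound(b) To UBound(b)
        Debug.Print Right$("0" & Hex$(b(i)), 2); " ";
    Next i
    Debug.Print
End Sub
```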
Note: I used a base 1 byte array because the length returned from len() is base 1. You could just as easily use a base 0 array, but would have to subtract 1 from the len(udt) return when you dimensioned your array.
If you look at the values displayed in the immediate window, you can see that the byte array now holds the same contents the UDT was shown to hold in memory from above.
From here, you can either save them in some alternative format, easily pass the byte array to other functions, store it in a database, reconstitute it with another call to CopyMemory, or even use your knowledge of the byte layout to perform functions on it by its byte values.
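Going the other way is just the same call with the arguments swapped. Here is a hedged sketch of the reconstitution step; BytesToTest and RoundTripOk are hypothetical names, and the size check is my own addition so a short array cannot cause a read past its end.

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)

Public Type test
    a As Long
    b As Long
    c As Long
    d As Integer
End Type

' Hypothetical helper: rebuilds a test UDT from a byte array.
Public Function BytesToTest(b() As Byte) As test
    Dim result As test

    ' Refuse arrays that are too small to hold the whole UDT.
    If UBound(b) - LBound(b) + 1 < Len(result) Then
        Err.Raise 5                    ' Invalid procedure call or argument
    End If

    CopyMemory result, b(LBound(b)), Len(result)
    BytesToTest = result
End Function

' Dumps a UDT to bytes, rebuilds it, and compares the members.
Public Function RoundTripOk() As Boolean
    Dim original As test, restored As test
    Dim b() As Byte

    original.a = 258: original.b = 2: original.c = 3: original.d = 4

    ReDim b(1 To Len(original))
    CopyMemory b(1), original, Len(original)

    restored = BytesToTest(b)

    RoundTripOk = (restored.a = original.a) And (restored.b = original.b) And _
                  (restored.c = original.c) And (restored.d = original.d)
End Function
```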
One such use of these byte-level manipulations is to use this technique to help you get the low and high bytes of an integer. Knowing that an integer is composed of two bytes, can you think of how to put this all together to extract the high and low bytes from it?
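One possible answer, sketched with the same CopyMemory trick: copy the Integer's two bytes into a byte array and remember that the low byte comes first. SplitInteger is a name made up for this example.

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)

' Hypothetical helper: splits an Integer into its low and high bytes.
Public Sub SplitInteger(ByVal value As Integer, _
                        ByRef lowByte As Byte, ByRef highByte As Byte)
    Dim b(0 To 1) As Byte

    CopyMemory b(0), value, 2          ' an Integer is exactly two bytes

    lowByte = b(0)                     ' little endian: low byte is stored first
    highByte = b(1)
End Sub
```

For example, SplitInteger &H1234, lo, hi leaves &H34 in lo and &H12 in hi. Because the bytes are copied rather than calculated, negative Integers work without any special handling.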
That all looks pretty straightforward, doesn't it? But I have noticed you have not included any examples with strings or arrays in the UDT. Does this same method hold true for more complex UDTs?
Unfortunately, no, it doesn't.
Let's try another experiment.
    Type test
        a As Long
        b As String
    End Type

    Dim t As test
    t.a = 1
    t.b = "this is my string"
    MsgBox Len(t)

This code tells us that the length of the test structure is 8! How can that be? Hmm, let's dig behind the scenes some more. Using the debugging and memory dumping techniques we went through at the top of the article, let's see what is happening to our structure that contains a string.
Eight bytes is the same as two longs. Looking at the memory dump window, we can see that the myString variable is actually a long pointer to a string. If we then look up the memory address it points to, we find our string. (This is confirmed by the MessageBox value of strPtr(mystring).)
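You can also check this from code instead of the memory window. The sketch below reads the four bytes that sit at offset 4 in the UDT and compares them with StrPtr of the string member. The function name is hypothetical, and the offset of 4 assumes the layout shown above (one Long followed by the String).

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)

Private Type test
    a As Long
    b As String
End Type

' Hypothetical check: is the String member's slot really just a pointer?
Public Function StringSlotMatchesStrPtr() As Boolean
    Dim t As test
    Dim slotAddress As Long
    Dim slotValue As Long

    t.a = 1
    t.b = "this is my string"

    slotAddress = VarPtr(t) + 4        ' skip past member a (a 4 byte Long)

    ' ByVal tells CopyMemory to treat slotAddress as the address to read.
    CopyMemory slotValue, ByVal slotAddress, 4

    StringSlotMatchesStrPtr = (slotValue = StrPtr(t.b))
End Function
```

Under those assumptions the function should come back True, which is the same thing the memory dump shows: the UDT holds a pointer, and the characters live somewhere else.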
Because of the way UDTs that contain strings and arrays are handled, this does knock out some of our previous bags of tricks on these more complex setups. Then again, knowing this limitation and how it all works, we are still better off than we were. So, we know what we have to work with anyway, right?
It's clear that our CopyMemory tricks cannot transfer the contents of the strings over. As soon as the UDT that owned the string goes out of scope, there goes our valid pointer to the string.
Is there any other way we can dump a more complex UDT type out and either save it as an array of bytes or reconstruct it from an array of bytes?
As luck would have it, there is. When I browsed through some old C documentation, I noticed a technique they were using to dump their structures straight to disk. This would allow you to save your configured object to a file with a single line of code. Curious to see if VB had implemented such a feature for its UDT handling, I whipped up the following tidbit of code.
This code is actually from an earlier paper I wrote, available here. The paper's main focus was on describing how you can dump even complex UDTs to disk and then easily reload them with only a couple lines of code.
That is pretty handy, and a good thing to know. It is also interesting that VB's Put and Get commands were built to be pretty smart. We know that complex UDTs aren't stored in memory as a continuous block; however, the VB Put command is kind enough to pack the whole structure and data into a new format for us so that it is complete when dumping it to disk.
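As a rough illustration of the Put and Get approach (a sketch under my own assumptions, not the listing from the earlier paper), the code below writes a UDT containing a string to a binary file and reads it back. The type settingsUdt, the procedure names, and the temp file name are all invented for the example.

```vb
Option Explicit

' Hypothetical UDT with a variable-length string member.
Public Type settingsUdt
    id As Long
    title As String
End Type

' Writes the whole UDT, string included, with a single Put.
Public Sub SaveSettings(ByVal fileName As String, udt As settingsUdt)
    Dim f As Integer

    ' Binary mode does not truncate, so remove any older, longer file.
    If Len(Dir$(fileName)) > 0 Then Kill fileName

    f = FreeFile
    Open fileName For Binary Access Write As #f
    Put #f, , udt
    Close #f
End Sub

' Reads the UDT back with a single Get.
Public Sub LoadSettings(ByVal fileName As String, udt As settingsUdt)
    Dim f As Integer

    f = FreeFile
    Open fileName For Binary Access Read As #f
    Get #f, , udt
    Close #f
End Sub

' Saves a UDT, loads it into a fresh one, and compares the members.
Public Function SettingsRoundTripOk() As Boolean
    Dim saved As settingsUdt, loaded As settingsUdt
    Dim fileName As String

    fileName = Environ$("TEMP") & "\udt_demo.bin"   ' assumed scratch location

    saved.id = 1
    saved.title = "this is my string"

    SaveSettings fileName, saved
    LoadSettings fileName, loaded

    SettingsRoundTripOk = (loaded.id = saved.id) And (loaded.title = saved.title)
    Kill fileName
End Function
```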
If you look at the file dump and read the other article, you will notice that preceding each string stored in the file there is a length counter for how many bytes are in the string. This reflects the fact that VB uses an OLE type, called a BSTR, to store all of its strings. BSTRs are Unicode strings prefixed by a long value (4 byte) length counter.
This is how VB knows where the string ends. Some other languages hold strings as null terminated. That is, they read the string until the first null character (byte 0). Because VB strings are Unicode, and every other character is usually a null, that just wouldn't work for us.
Two cool side effects of this are that our strings can contain embedded nulls without penalty, and it is very fast for VB to tell us the length of the string because it only has to read this prefix counter value. Some other languages actually have to loop through each and every character in the string, incrementing a counter variable until they hit that null terminator to find out how long a string is!
Now, if we look at the preceding memory dump of the string above, we see the purple highlighted area. This is actually the string length variable we were just discussing. It is the first four bytes just before the strptr(mystr).
Strings are just about always better off manipulated as strings. But with this little tidbit of trivia, now you know how you could locate the string in memory and determine its byte length from the raw UDT data.
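For the curious, reading that prefix from code could look something like this. It is only a sketch: BstrByteLength is a made-up name, and peeking at memory just below StrPtr is something to do for exploration rather than in production code.

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (Destination As Any, Source As Any, ByVal Length As Long)

' Hypothetical helper: reads the 4 byte length prefix stored
' immediately before the string data of a VB string (a BSTR).
Public Function BstrByteLength(ByVal s As String) As Long
    Dim dataAddress As Long
    Dim prefixAddress As Long
    Dim byteCount As Long

    dataAddress = StrPtr(s)
    If dataAddress = 0 Then Exit Function   ' vbNullString has no buffer at all

    prefixAddress = dataAddress - 4         ' the prefix sits just before the data
    CopyMemory byteCount, ByVal prefixAddress, 4

    BstrByteLength = byteCount
End Function
```

The value is a byte count, not a character count, so BstrByteLength("abc") returns 6, the same number LenB("abc") gives you.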
Okay, I guess that is enough for now. Hopefully, this paper was a good description of UDTs for you, and will give you some insight into some other tricks you can do with them, as well as the things you probably don't want to try to do with them!
Also, I hope you got a good idea how to use the debugger to probe through and do your own research to figure out your own burning questions for yourself. Once you have someone walk you through it once, I think you are going to find that knowledge never goes out of scope and will end up building like a snowball rolling downhill.