OT: The "Little Big End"


cjard
August 6th, 2004, 02:36 PM
following a private message in response to an old thread, i wrote some information about little and big endian data formats. the response ended up too long for a private message, so i've posted it here, mainly for member Kyoy, but some other people might find it useful:

I am very confused about the little endian and big endian concept. What is it exactly all about, and what's the difference when it comes to applying it in programming?

Data is arranged in memory and loaded into a CPU in order to be processed. CPUs are "endian", i.e. they care about which way round the bytes of a multi-byte value appear. As a throwback to the days of 8 bit processors, data is expressed in blocks of 8 bits, which is one byte. That gives 256 possible combinations of 0s and 1s, which isn't enough to represent some data. Unicode, for example, is a mechanism for allowing character data to live in computer memory, and the set is far larger than 256 symbols (256 wasn't enough, and the early hack was "Code Pages"). A 16 bit unicode value has 65536 possible combinations, enough for the commonly used symbols of the world's writing systems; rarer ones need a pair of 16 bit values, but that doesn't change the endian story.
However, remember that data is ordered into blocks of 8, and a unicode character stored this way requires 16 bits. This means we need two 8 bit blocks of data to store it.
Now comes the endian part. Here are 16 bits:

1111000001010101

There are two ends to this number: a big end and a small end. This number (in bits) represents two bytes. One byte is called the Most Significant Byte, and the other is the Least Significant Byte.
The binary above is the representation of the decimal number 61525. It is made up of a significant (large) part plus a less significant (small) part:

11110000 01010101

lets break it down to a sum:
11110000 00000000 plus
00000000 01010101

in decimal:
11110000 00000000 -> 61440
00000000 01010101 -> 85
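
The split above can be checked with a few lines of Python. This is only a sketch; the variable names (value, msb, lsb) are made up for the illustration and are not from any library:

```python
# The 16 bit number from the post, written out in binary.
value = 0b1111000001010101

# Shift the top 8 bits down to get the Most Significant Byte,
# and mask off the bottom 8 bits to get the Least Significant Byte.
msb = (value >> 8) & 0xFF   # 0b11110000
lsb = value & 0xFF          # 0b01010101

print(value)                      # 61525
print(msb << 8)                   # 61440 (the big part)
print(lsb)                        # 85    (the small part)
print((msb << 8) + lsb == value)  # True
```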

one number is more significant than the other..
So what has this to do with ends? Well, you and I read English, which is left-to-right, so we naturally feel that the digits on the left are more significant than the digits on the right:

123 456 = "1 hundred and twenty three thousand, 4 hundred and fifty six"

but what if there were a race that wrote everything right-to-left? they might read this number (from right to left) as "4 hundred fifty six thousand, 1 hundred and twenty three"

quite a difference, just by altering the reading order!

And the same for CPUs, our number:

11110000 01010101

which we know to be 61525, could actually be understood to be 22000, if the CPU read in the second block of bits, then the first:

01010101 11110000 (you should read it in left to right order; only the two bytes have swapped places, the bits inside each byte are unchanged)
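
To see both readings of the same two bytes, here is a small Python sketch. It uses the built-in int.from_bytes; the names raw, as_big and as_little are just for this example:

```python
# The two bytes, in the order they sit in memory or in a file.
raw = bytes([0b11110000, 0b01010101])

# Same bytes, two ways of deciding which one is the "big end".
as_big = int.from_bytes(raw, byteorder="big")        # first byte is most significant
as_little = int.from_bytes(raw, byteorder="little")  # first byte is least significant

print(as_big)     # 61525
print(as_little)  # 22000
```

Nothing about the bytes themselves changed between the two calls, only the rule for reading them.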

why would a cpu do this though? because that's the way it was designed, and there is usually a good reason for it (not just "i wanna be different") - it may be easier to work with right-to-left formatted data, due to some hardware design restriction

and this is where "endian"ness comes in.

Little endian means "the little end of the number comes first". Big endian means "the big end of the number comes first"
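
Going the other way (number to bytes) shows the two definitions directly. A minimal Python sketch, assuming a 2 byte value; sys.byteorder reports what the machine running the script uses natively:

```python
import sys

number = 61525  # 0xF055

# Big endian: the big end (0xF0) is written first.
big_first = number.to_bytes(2, byteorder="big")
# Little endian: the little end (0x55) is written first.
little_first = number.to_bytes(2, byteorder="little")

print(big_first.hex())     # f055
print(little_first.hex())  # 55f0
print(sys.byteorder)       # 'little' or 'big', depending on the machine
```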

In real life text terms:

The letter A is equal to number 65, or 0x41 in hex. In 16 bit unicode, this will take 2 bytes to store (even though one would do), so a zero-byte will be added in there somewhere.

Our number of 65 (0x41) is small.. it will fit into the least significant byte (values 0 to 255 / 0x00 to 0xFF) of a pair of bytes, but the order these will appear in memory or in a file depends on the way the system interprets pairs of bytes.

If the system is Big Endian, it means "the most significant end is first". Remember that 0x41 is small, so it's part of the least significant end. Our most significant end will hence be 00, and also be first:

00 41

That's a pair of bytes that represent the letter A in Big Endian format. The big end is first, the big end has a value of zero (the big end deals with numbers of 256 and over, our number is just 65)

In little endian format, it is:

41 00

The little end is first (numbers up to 255). The little end has a value of 0x41.

--

How does this affect your programming? Well, suppose you have a text file that contains unicode text, generated by a source that is big endian.
A sequence of five letter A (AAAAA) will look like:

0041 0041 0041 0041 0041 (i've split into 5 chunks for AAAAA, rather than splitting msb/lsb)

The file will be 10 bytes big.

Now suppose you feed this datastream into a system that understands things in little endian.. the stream of bytes comes out of the file the same as it went in:

0041 0041 0041 0041 0041

00 is read in as the "little end"
41 is read in as the "big end"
this makes a hexadecimal number of the form 0x4100 instead of 0x0041

While 0x0041 does mean 65, or letter A.. our little endian system will understand this as the number 0x4100, or 16640 in decimal (i don't know what letter this is.. some weird character, certainly not an A)
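
The mix-up can be reproduced in Python with the standard "utf-16-be" and "utf-16-le" codecs. Note that Python encodes the letter A as code point 65, which is 0x41 in hex. This is a sketch of the scenario, not of any particular file format; the names text, data and wrong are made up for the example:

```python
text = "AAAAA"

# The big endian source writes each character big end first, with no marker.
data = text.encode("utf-16-be")
print(data.hex())   # 00410041004100410041
print(len(data))    # 10

# A consumer that assumes little endian reads the very same bytes...
wrong = data.decode("utf-16-le")

# ...and gets five characters, none of which is an A.
print(len(wrong))          # 5
print(hex(ord(wrong[0])))  # 0x4100
print(ord(wrong[0]))       # 16640
print(wrong == text)       # False
```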

So little endian/big endian has ramifications for data transfer between systems.. Many datastreams encoded using 16 bit blocks have an indicator at the start that says which way round the data is (for unicode text this is the byte order mark, the value 0xFEFF written as the first character).. if the data lacks this indicator, then the computer won't be able to tell which way round the data should be without some sophisticated logic applied after decoding twice (in both endian modes), and even then, the data might not be able to be deciphered properly
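
For 16 bit unicode text the indicator is the byte order mark: the bytes FE FF at the start mean big endian, FF FE mean little endian. A rough Python sketch of checking for it follows. The function name guess_utf16_order is hypothetical; codecs.BOM_UTF16_BE and codecs.BOM_UTF16_LE are constants from the standard library:

```python
import codecs

def guess_utf16_order(data: bytes) -> str:
    """Return 'big', 'little' or 'unknown' based on a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return "unknown"  # no marker: the reader has to be told, or has to guess

print(guess_utf16_order(b"\xfe\xff\x00\x41"))  # big
print(guess_utf16_order(b"\xff\xfe\x41\x00"))  # little
print(guess_utf16_order(b"\x00\x41"))          # unknown

# Python's plain "utf-16" codec does the same check itself and drops the marker.
print(b"\xfe\xff\x00\x41".decode("utf-16"))    # A
print(b"\xff\xfe\x41\x00".decode("utf-16"))    # A
```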

In your programming, you will most often find issues with it when transferring data between systems using different endian formats; either instruct the source system to generate a format suitable for consumption by the end system, or tell the end system to consume in the way the source produced. Sometimes you may have no control over one end, so you need to adjust the other end to cope.
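
One way to do this in Python is to state the byte order explicitly on both sides instead of relying on whatever the CPU does natively. The sketch below uses the standard struct module, where ">" means big endian, "<" means little endian and "H" is an unsigned 16 bit value; the variable names are made up for the example:

```python
import struct

# Producer side: always write the 16 bit value big end first.
wire = struct.pack(">H", 61525)
print(wire.hex())  # f055

# Consumer side: read with the same explicit order, whatever the host CPU is.
(value,) = struct.unpack(">H", wire)
print(value)  # 61525

# If the source can't be changed and it produces little endian,
# adjust the reader instead.
from_little_source = bytes([0x55, 0xF0])
(value_le,) = struct.unpack("<H", from_little_source)
print(value_le)  # 61525
```

Either side can be the one that adapts; what matters is that both agree on the order.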

Drain
August 6th, 2004, 04:11 PM
Thanks cjard... a very clear and thorough explanation.