File name question


John E
September 6th, 2008, 10:04 AM
I was asked this simple question yesterday and had to admit that I didn't know the answer... :(

My computers are based in the UK and use the UK's familiar character set. In some other country (let's say Japan) they'd use different characters that wouldn't be recognizable to me. If I wanted to add multi-language support to a program that I'd written, I could do this by using Unicode. All my program's dialog boxes etc could then be adapted to display German or Greek or Japanese or whatever translations were being supported. So far, so good.

But what about the file system? Most file handlers that I've seen seem to have an inbuilt assumption that file and folder names are simple char arrays. Therefore, although some unusual characters will undoubtedly be supported, anything requiring 2 or more bytes per character wouldn't be allowed. In other words, on a Japanese PC, the file and folder names would still be (broadly speaking) in Anglo/American/European characters - or more correctly, in some format that only requires one byte per character.

I just wondered if that assumption is true or false. For example, is Unicode used to represent file & folder names on a Japanese PC?
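As a rough illustration of the "simple char array" point, the sketch below (Python, with a made-up Japanese file name and a hypothetical helper called describe_name) counts how many bytes the same name needs in UTF-8 and in UTF-16. It also checks that the UTF-8 form contains no zero byte, which is what would allow such a name to travel through byte-oriented, NUL-terminated string APIs. It says nothing about what any particular file system stores on disk.

```python
def describe_name(name):
    """Return size facts about a file name in two Unicode encodings."""
    utf8 = name.encode("utf-8")        # variable width: 1 to 4 bytes per character
    utf16 = name.encode("utf-16-le")   # 2 bytes per character for these characters
    return {
        "characters": len(name),
        "utf8_bytes": len(utf8),
        "utf16_bytes": len(utf16),
        # A zero byte would terminate a C-style char array early.
        "utf8_has_nul": 0 in utf8,
        "utf16_has_nul": 0 in utf16,
    }


if __name__ == "__main__":
    # Hypothetical example name: three kanji plus an ASCII extension.
    print(describe_name("日本語.txt"))
```

The ASCII part of the name (".txt") still takes one byte per character in UTF-8, while each kanji takes three, so the name is longer than its character count but still fits in an ordinary byte string.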

Arjay
September 13th, 2008, 11:07 PM
NT-based systems (NT 3.1 to Vista) are all Unicode under the covers, and so is the NTFS file system.

John E
September 14th, 2008, 03:33 AM
Thanks Arjay. I guess the thing that's really puzzling me is Linux. Linux root folders tend to have names like /usr, /var, /opt, /mnt etc and (AFAIK) it's almost unheard of for users to rename these. Likewise, Linux users rarely (if ever) get the option of choosing where their s/ware packages will get installed.

To me, this suggests that there's some kind of inbuilt assumption that /opt (for example) will still be called /opt even if the user is Russian, French, Japanese or whatever.

Bearing in mind that all file systems are networkable these days, wouldn't this lead to a 'lowest common denominator' situation? In other words, even if a file system (internally) uses Unicode, it would tend to keep its file and folder names within the standard ASCII character set. After all, if Unicode was also used on Linux or Mac, it's unlikely to be the same Unicode used by Microsoft...!

I've never encountered file or folder names with non-standard characters so I'm curious to know how common it is. What happens with (say) a Japanese or Chinese file system? Even though they might not use 'English' characters, I'm guessing that the names would be limited to the standard ASCII character set for each specific locale. Don't know for sure though.

Arjay
September 14th, 2008, 01:25 PM
From The Windows 2000 File System (http://www.informit.com/articles/article.aspx?p=26353). Dated but I'm sure still accurate.

11.7.1 Fundamental Concepts
Individual file names in NTFS are limited to 255 characters; full paths are limited to 32,767 characters. File names are in Unicode, allowing people in countries not using the Latin alphabet (e.g., Greece, Japan, India, Russia, and Israel) to write file names in their native language. For example, φιλε is a perfectly legal file name. NTFS fully supports case sensitive names (so foo is different from Foo and FOO). Unfortunately, the Win32 API does not fully support case-sensitivity for file names and not at all for directory names, so this advantage is lost to programs restricted to using Win32 (e.g., for Windows 98 compatibility).

boudino
September 15th, 2008, 04:13 AM
As far as I know, unusual characters in a file name can cause problems only if the file is transferred to a system which doesn't support the code page used to create the file name. If the file can be created, it can be read on the same system as well.

The problem with file names containing unusual characters is that an unrecognized character is represented as a question mark or a square, which are not valid characters in the name, so you cannot pass the name as a parameter to a file-handling routine.
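A small sketch of that failure mode, in Python. The helper name to_code_page and the sample file names are made up for illustration, and cp1252 (Western European) simply stands in for "a code page that lacks the characters". Converting a name to such a code page replaces every unsupported character with a question mark, so the converted name no longer identifies the original file, and two different names can even collapse into the same one.

```python
def to_code_page(name, code_page="cp1252"):
    """Convert a file name to a legacy code page, replacing what it cannot hold."""
    # errors="replace" substitutes b"?" for every character the code page lacks.
    return name.encode(code_page, errors="replace")


if __name__ == "__main__":
    print(to_code_page("日本語.txt"))   # b'???.txt'
    print(to_code_page("中文名.txt"))   # b'???.txt' as well: the two names collide
    print(to_code_page("report.txt"))  # b'report.txt' survives unchanged
```

The conversion is one-way: nothing in b'???.txt' records which characters were lost, which is why the mangled name is useless as an argument to a file routine.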

With standard locations (like /opt), on Windows there are functions which return the standard directory name appropriate to the current system installation (Environment.GetFolderPath() in .NET). On Linux, I think that the "special" folders are never localized, so you can consider them to be fixed and common on any installation.
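The same idea of asking the system for a location instead of hard-coding its name can be sketched in Python. This is not the .NET call mentioned above, only a loose analogue using two standard-library lookups; the helper name well_known_dirs is made up for the example.

```python
import os
import tempfile


def well_known_dirs():
    """Ask the platform for a couple of standard locations instead of hard-coding them."""
    return {
        # Expands to the current user's home directory when it can be determined.
        "home": os.path.expanduser("~"),
        # The directory this platform and environment use for temporary files.
        "temp": tempfile.gettempdir(),
    }


if __name__ == "__main__":
    for label, path in well_known_dirs().items():
        print(label, "->", path)
```

Whatever the paths turn out to be on a given machine, the program never needs to know how they are spelled.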

John E
September 15th, 2008, 05:33 AM
Thanks guys. An important consideration (I presume) is that different OS's favour different flavours of Unicode. For example, Windows uses wide char Unicode whereas Linux (I think) favours UTF-8. Don't know about Mac. This brings up the question of how Unicode file system names would work under Linux. Is there a UTF-8 version of fopen() for example? I've never come across one but then again, I've never looked for one..!
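On the fopen() question, one way to see what happens is to hand the ordinary byte-oriented open call a name that has already been encoded as UTF-8. The sketch below does that in Python rather than C, with a made-up file name and a hypothetical helper called roundtrip_utf8_name. The assumption is a system that accepts UTF-8 byte strings as file names, which is the usual case on Linux; on such a system no separate "UTF-8 version" of the call is involved, because the name is just a sequence of bytes.

```python
import os
import tempfile


def roundtrip_utf8_name(name, text):
    """Create a file whose name is passed as raw UTF-8 bytes, then read it back."""
    raw_name = name.encode("utf-8")
    with tempfile.TemporaryDirectory() as tmp:
        # Build the whole path as bytes so no implicit name conversion is involved.
        raw_path = os.path.join(os.fsencode(tmp), raw_name)
        with open(raw_path, "w", encoding="utf-8") as handle:
            handle.write(text)
        # Listing a bytes directory returns the stored names as bytes.
        listing = os.listdir(os.fsencode(tmp))
        with open(raw_path, "r", encoding="utf-8") as handle:
            return listing, handle.read()


if __name__ == "__main__":
    names, content = roundtrip_utf8_name("日本語.txt", "hello")
    print(names, content)
```

Note that the encoding argument to open() here applies only to the file's contents; the file name is handled separately, as the bytes built from raw_name.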

My guess is that even though modern OS's can (and do) offer support for Unicode file & folder names, Unicode characters - probably - aren't used very often. That's only a guess though.