Interactivity with _bstr_t and CString Classes
Conversion Operators and Methods
Loading Strings from Resources
Direct Access to the Internal String Buffer
BSTR Allocation Methods
Working with GUIDs
Spanning and Trimming
Case Conversion Methods
Other CString-Like Methods
New Methods Summary
Shortcomings of Extant String Classes
Two reasons urged me to create one more string class, even though so many of them already exist. The first is that none of the existing classes fits scenarios such as the following:
- TCHAR-based application extensively using COM, the functions/interface methods of which are WCHAR-based.
- TCHAR-based application working with network protocols, which normally make use of ASCII.
- Passing strings across module borders (between an executable and a DLL or between DLLs). In this situation, strings are required to use some global memory allocator (such as GlobalAlloc, CoTaskMemAlloc, or VirtualAlloc).
- The need for strings that are compared case-insensitively by default. Sometimes it is also desirable to compare case-sensitive strings in a case-insensitive manner, and vice versa.
- Any combination of the previous scenarios or any situation when different kinds of strings must be used jointly.
MFC's CString in VC++ 6.0 is hard-coded as a TCHAR-based class, has no provisions to change its own allocator (C++ new/delete operators), and does not allow you to set its default case-sensitivity state (to perform a case-insensitive comparison, it is necessary to use the CompareNoCase method explicitly). Thus, it does not fit any of the above scenarios.
The same problem exists with the ATL/MFC CStringT template class in VC++ 7.0/7.1 (.NET/.NET 2003): an instance of the CStringA class cannot be used together with a CStringW variable without an explicit cast.
Resolving this problem was the first reason to create one more string class. The other reason is that all the standard VC string classes lack some functionality. The string content manipulation and comparison methods of the CString class are noticeably weaker than their STL counterparts (find, rfind, replace, find_first_of, find_first_not_of, find_last_of, and find_last_not_of). On the other hand, the basic_string class lacks many very handy CString methods, such as LoadString, Format/FormatV, TrimLeft/TrimRight, SpanIncluding/SpanExcluding, Mid/Right/Left, and some others.
The BasicString template class presented here is derived from the STL's basic_string template, and fills the mentioned functionality gap by providing new constructors, CString-like methods, and some new methods, which are described in the following sections. Because the styles of CString and basic_string method names are starkly different, all the methods ported from CString, as well as some new ones, have both MFC- and STL-style names.
The second version of the BasicString class, in addition to the problems mentioned above, deals with performance issues caused by using the STL's basic_string class, which became more evident after I migrated from VC++ 6.0 to VC++ 7.1. See the Changes in Version 2 section for more detailed information.
- The BasicString template class is designed to provide full interactivity between objects of all string classes based on it.
The full interactivity means that any operator or method of any BasicString-based class accepts arguments of any other BasicString-based type. Besides, these arguments can be raw character pointers of any type (both ASCII and Unicode). Such interactivity extends over the global binary operators ( +, ==, !=, <, >, <=, >= ), too.
For more convenience, eight different string classes that cover the most frequently used string types are typedefed in the "BasicString.h" header. They are listed below; for each of them, the following information is given: a) the character type, b) the case-sensitivity of the compare method and of the relational and equality operators, c) the allocator type:
- AString—char, case-sensitive, standard C++ allocator;
- IAString—char, case-insensitive, standard C++ allocator;
- GAString—char, case-sensitive, global COM-compliant allocator;
- GIAString—char, case-insensitive, global COM-compliant allocator;
- WString—wchar_t, case-sensitive, standard C++ allocator;
- IWString—wchar_t, case-insensitive, standard C++ allocator;
- GWString—wchar_t, case-sensitive, global COM-compliant allocator;
- GIWString—wchar_t, case-insensitive, global COM-compliant allocator;
Their TCHAR-based analogs are typedef'ed as well. They are: String, IString, GString, and GIString.
Additionally, the BasicString-based classes are fully interactive with the basic_string-based classes derived from the same char trait and allocator templates. For convenience, the typedefs of eight such basic_string-based classes are supplied. They are named in the same manner as the BasicString-based ones:
AStdString (the same as STL's string class), IAStdString, GAStdString, GIAStdString,
WStdString (the same as STL's wstring class), IWStdString, GWStdString, GIWStdString,
and their TCHAR-based analogs:
StdString, IStdString, GStdString, GIStdString.
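The case-insensitive variants are described above as differing from the case-sensitive ones in their char traits. As a rough illustration of that general technique in portable C++ (this is not the BasicString source; `ci_char_traits` and `ci_string` are hypothetical names, and only ASCII case folding is assumed), a traits class can override the comparison primitives that basic_string relies on:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical traits class: identical to std::char_traits<char> except
// that all comparisons ignore ASCII case.
struct ci_char_traits : public std::char_traits<char> {
    static char fold(char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return fold(a) < fold(b); }
    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char c) {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return nullptr;
    }
};

// A string type whose operators and compare() are case-insensitive by default.
typedef std::basic_string<char, ci_char_traits> ci_string;
```

With such a typedef, `ci_string("String") == ci_string("string")` holds without any explicit call to a CompareNoCase-style method. Mixing it with a plain std::string still requires extra overloads, which is the gap the interactivity described above is meant to close.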
- The full interactivity covers the _bstr_t and CString/CStringT classes as well. The corresponding code is compiled only if the project uses them, so the "BasicString.h" header should be included after the <comdef.h> or <afx.h> headers. To support this functionality, the constructors, assignment operators, operators of conversion to _bstr_t and CString, and all binary operators (+, ==, !=, <, >, <=, >=) have been properly overloaded.
The only known inconvenience is that an object of a BasicString-based class cannot be assigned to an object of the _bstr_t or CString type. The reason for this behaviour is that _bstr_t and CString are unaware of the BasicString class's existence, and consequently do not supply proper assignment operators. This hitch can be easily worked around by explicitly casting to LPCWSTR or LPCTSTR, as in the following example:
    String str = _T("Test");
    _bstr_t bstr(str);       // OK
    bstr = (LPCWSTR)str;     // explicit conversion required
- Operators of conversion to LPCSTR and LPCWSTR (and hence to LPCTSTR), as well as the c_strA(), c_strW(), and c_strT() methods, have been provided. They all return a pointer either to the object's own character sequence (if the requested character type is the object's own) or to an internally allocated buffer.
Because the BasicString class derives from basic_string, whose destructor is non-virtual, BasicString's destructor may not be called in some well-known circumstances (for example, when an object is deleted through a pointer to basic_string). But because the BasicString class has no data members of its own and uses a special technique when allocating the buffers that hold converted strings, not calling its destructor does not cause any memory leaks.
Yet, it must be assumed that pointers returned by the conversion operators and methods are temporary and can be invalidated by any subsequent non-const operation with the object that produced them (including, of course, its destruction, both implicit and explicit).
Based on this functionality, the set of macros for in-place ASCII/Unicode conversion has been implemented:
    ToStr  (str)   // str -> TCHAR*
    ToWstr (str)   // str -> wchar_t*
    ToAstr (str)   // str -> char*
Here, the str argument can be char*, wchar_t*, or any kind of string class. The pointers returned by these macros are temporary, so their use is limited to function call argument lists or to the inside of the expressions they are used in. The second version of the BasicString class allows you to easily extend the functionality of these macros. See the Changes in Version 2 section for more detailed information.
- Loading strings from resources has been implemented. This is done either by means of a constructor accepting raw character string pointers, in the same manner as in the CString class:
    String str((LPCTSTR)IDS_TEST);
or by means of two STL-style methods and their two MFC-style analogs:
    bool load       ( HINSTANCE hInst, UINT nID );
    bool load       ( UINT nID );
    BOOL LoadString ( HINSTANCE hInst, UINT nID );
    BOOL LoadString ( UINT nID );
A method that globally changes the module handle used to locate resources is supplied as well. Based on this functionality, three macros for in-place conversion from a resource ID to raw character string pointers of different character types have been implemented:
    ResToStr  (nID)
    ResToWstr (nID)
    ResToAstr (nID)
The pointers returned by these macros are temporary, so their use is limited to function call argument lists or to the inside of the expressions they are used in.
- Direct access to the internal string buffer is supported by means of two STL-style methods and their two MFC-style analogs:
    _E*  get_buf       ( size_type _Len = -1 );
    void set_buf       ( size_type _Len = -1 );
    _E*  GetBuffer     ( int nBufLen = -1 );
    void ReleaseBuffer ( int nNewLen );
They behave like their CString counterparts, with the following two minor improvements:
- The get_buf and GetBuffer methods can be called without a length parameter, in which case the current string length is used as the buffer size;
- Debug versions of the set_buf and ReleaseBuffer methods check whether the boundaries of the allocated buffer have been overwritten. If such overwriting is detected, they assert.
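The get-buffer/release-buffer pattern described above can be approximated with a plain std::string. The following is only a sketch of the idea, not the BasicString implementation: the free functions `get_buf` and `set_buf` and the producer `read_greeting` are hypothetical, and the debug-time overwrite check mentioned above is not reproduced.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical analog of get_buf/GetBuffer: make sure the string owns at
// least `len` writable characters and return a pointer to them.
char* get_buf(std::string& s, std::size_t len) {
    if (s.size() < len) s.resize(len);
    return &s[0];
}

// Hypothetical analog of set_buf/ReleaseBuffer without a length argument:
// shrink the string to the NUL-terminated text written into the buffer.
void set_buf(std::string& s) {
    s.resize(std::strlen(s.c_str()));
}

// Example of a C-style producer that writes into a caller-supplied buffer.
std::string read_greeting() {
    std::string s;
    char* buf = get_buf(s, 32);
    std::strcpy(buf, "hello");   // stays within the 32 characters requested
    set_buf(s);                  // logical length becomes 5
    return s;
}
```

The caller is still responsible for not writing past the requested length, which is exactly the kind of mistake the debug assertion described above is meant to catch.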
- BSTR allocation methods analogous to those of the CString class have been added. They are:
    BSTR AllocSysString () const;
    BSTR SetSysString   ( BSTR* pbstr ) const;
and their STL-style analogs:
    BSTR bstr         () const;                // the same as AllocSysString
    BSTR realloc_bstr ( BSTR* pbstr ) const;   // the same as SetSysString
These methods are compiled only if the OLE automation headers have been included before the "BasicString.h" header file.
- Printf-like formatting has been implemented (using CRT library support). The following STL- and MFC-style methods do this:
    void __cdecl printf  ( const BasicString& _Fmt, ... );
    void __cdecl printf  ( unsigned int _FmtID, ... );
    void         vprintf ( const BasicString& _Fmt, va_list _args );
    void         vprintf ( unsigned int _FmtID, va_list _args );
    void __cdecl Format  ( const BasicString& strFormat, ... );
    void __cdecl Format  ( UINT nFormatID, ... );
    void         FormatV ( const BasicString& strFormat, va_list _args );
    void         FormatV ( UINT nFormatID, va_list _args );
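For readers who want to see how printf-like formatting into a string can be layered on the CRT, here is a minimal sketch using std::string and vsnprintf. It is an assumption about the general approach rather than the article's code: the free functions `vformat` and `format` are hypothetical, and they take a plain `const char*` format instead of a string object or a resource ID.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format into a std::string using the CRT's vsnprintf.
std::string vformat(const char* fmt, va_list args) {
    va_list copy;
    va_copy(copy, args);                           // the list is walked twice
    const int needed = std::vsnprintf(nullptr, 0, fmt, copy);
    va_end(copy);
    if (needed < 0) return std::string();          // formatting error
    std::string out(static_cast<std::size_t>(needed) + 1, '\0');
    std::vsnprintf(&out[0], out.size(), fmt, args);
    out.resize(static_cast<std::size_t>(needed));  // drop the terminating NUL
    return out;
}

// Variadic front end, in the spirit of Format(): forwards to vformat.
std::string format(const char* fmt, ...) {
    va_list args;
    va_start(args, fmt);
    std::string out = vformat(fmt, args);
    va_end(args);
    return out;
}
```

The first vsnprintf call only measures the output, so the result buffer is allocated once at the right size.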
- Constructing strings from GUIDs has been implemented by means of the following constructor:
    BasicString ( const GUID& );
as well as assigning a GUID to the string:
    BasicString& operator = ( const GUID& );
Additionally, a set of macros for in-place conversion from a GUID to raw character string pointers has been provided:
    GuidToStr  (guid)
    GuidToWstr (guid)
    GuidToAstr (guid)
The pointers returned by these macros are temporary, so their use is limited to function call argument lists or to the inside of the expressions they are used in. To retrieve a GUID from its text representation, the following methods can be used:
    GUID guid    ( size_type _Pos = 0 ) const;
    GUID GetGuid ( size_type _Pos = 0 ) const;   // the same as guid(_Pos)
The GUID representation must start at the position specified by the _Pos argument. If there is no GUID at this position, GUID_NULL is returned. If the string contains a GUID somewhere inside it, its start position can be found by using the following methods:
    size_type find_guid ( size_type _Pos = 0 ) const;
    size_type FindGuid  ( size_type _Pos = 0 ) const;
The _Pos argument specifies the search start position. If no GUID is found, the npos value (-1) is returned. To allow for in-place conversion from any kind of string to a GUID, the following macro is supplied:
    StrToGuid (str)   // accepts any kind of string or raw character pointer
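To make the search semantics concrete, the sketch below locates a GUID inside a std::string. It is illustrative only: the text form is assumed to be the braced registry format `{XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}` (the article does not state which forms the class accepts), and the free functions `is_guid_at` and `find_guid` are hypothetical stand-ins for the member functions listed above.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Assumed text form: {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}, 38 characters.
const std::size_t kGuidTextLen = 38;

// Returns true if a GUID in the assumed form starts exactly at `pos`.
bool is_guid_at(const std::string& s, std::size_t pos) {
    if (pos > s.size() || s.size() - pos < kGuidTextLen) return false;
    for (std::size_t i = 0; i < kGuidTextLen; ++i) {
        const char c = s[pos + i];
        if (i == 0) {
            if (c != '{') return false;
        } else if (i == kGuidTextLen - 1) {
            if (c != '}') return false;
        } else if (i == 9 || i == 14 || i == 19 || i == 24) {
            if (c != '-') return false;
        } else if (!std::isxdigit(static_cast<unsigned char>(c))) {
            return false;
        }
    }
    return true;
}

// Returns the index of the first GUID at or after `pos`, or npos if none.
std::size_t find_guid(const std::string& s, std::size_t pos = 0) {
    for (std::size_t i = s.find('{', pos); i != std::string::npos;
         i = s.find('{', i + 1)) {
        if (is_guid_at(s, i)) return i;
    }
    return std::string::npos;
}
```

Converting the matched text into an actual GUID structure is a separate step that is left out of this sketch.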
- STL's comparison capabilities have been extended in the following ways. First, as described above, there are different classes (such as String and IString) whose relational and equality operators (==, !=, <, >, <=, >=) and compare method are, by default, always case-sensitive or always case-insensitive. The MFC-style Compare method added to the BasicString class inherits the behaviour of the compare method:
    int Compare ( const BasicString& _S ) const;   // restricted version of STL's compare method
Sometimes, however, it is handy to force a comparison to be case-sensitive or case-insensitive regardless of the string class type. For this purpose, the following STL- and MFC-style methods have been added:
    int cmp_case      ( const BasicString& _S ) const;   // always case-sensitive comparison
    int cmp_no_case   ( const BasicString& _S ) const;   // always case-insensitive comparison
    int CompareCase   ( const BasicString& _S ) const;   // the same as cmp_case
    int CompareNoCase ( const BasicString& _S ) const;   // the same as cmp_no_case
Note: When the relational and equality operators (==, !=, <, >, <=, >=) are used with both operands being BasicString objects, the case-sensitivity of the operation is determined by the left operand. However, if one of the operands is a raw character pointer (char* or wchar_t*), a _bstr_t, or a CString, the case-sensitivity of the operation is determined by the BasicString object regardless of its position. Thus, in the following example:
    String  str  = _T("string");
    IString stri = _T("String");
    bool b1 = str == stri;
    bool b2 = stri == str;
    bool b3 = _T("String") == stri;
    bool b4 = stri == _T("String");
b1 becomes false, and b2, b3, and b4 become true.
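The idea of forcing case-sensitivity per call, independent of the string type's default, can be sketched with two free functions over std::string. These are hypothetical helpers that borrow the cmp_case and cmp_no_case names for readability; they assume ASCII-only case folding and are not the class's actual implementation.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical analog of cmp_case: always case-sensitive.
int cmp_case(const std::string& a, const std::string& b) {
    return a.compare(b);
}

// Hypothetical analog of cmp_no_case: always case-insensitive (ASCII only).
// Returns a negative value, zero, or a positive value, like compare().
int cmp_no_case(const std::string& a, const std::string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        const int ca = std::tolower(static_cast<unsigned char>(a[i]));
        const int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) return ca < cb ? -1 : 1;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;   // the shorter string sorts first
}
```

Applied to the strings from the example above, `cmp_no_case("string", "String")` yields 0 while `cmp_case("string", "String")` does not.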
- The CString-like Find, ReverseFind, and FindOneOf methods have been added. They have all been extended to accept a second parameter specifying the start position. In addition, the ReverseFind method has been overloaded to accept a string argument (CString supports TCHAR only). Finally, the FindOneNotOf, FindLastOf, and FindLastNotOf methods have been added. Here are their prototypes:
    size_type Find          ( _E _C, size_type _Pos = 0 ) const;
    size_type Find          ( const BasicString& _S, size_type _Pos = 0 ) const;
    size_type ReverseFind   ( _E _C, size_type _Pos = npos ) const;
    size_type ReverseFind   ( const BasicString& _S, size_type _Pos = npos ) const;
    size_type FindOneOf     ( const BasicString& _CharSet, size_type _Pos = 0 ) const;
    size_type FindOneNotOf  ( const BasicString& _CharSet, size_type _Pos = 0 ) const;
    size_type FindLastOf    ( const BasicString& _CharSet, size_type _Pos = npos ) const;
    size_type FindLastNotOf ( const BasicString& _CharSet, size_type _Pos = npos ) const;
Note, however, that the native STL methods used to implement the preceding ones have much broader functionality than their MFC counterparts. These methods are (for those unfamiliar with STL): find, rfind, find_first_of, find_first_not_of, find_last_of, and find_last_not_of.
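The correspondence between the MFC-style names and the STL methods named above can be shown with thin forwarding wrappers. This is a sketch over std::string with hypothetical free functions; which STL method each wrapper forwards to is inferred from the names and from the order in which the article lists them.

```cpp
#include <cstddef>
#include <string>

// Hypothetical free-function versions of the MFC-style names, each
// forwarding to the std::string member it appears to correspond to.
std::size_t Find(const std::string& s, const std::string& what,
                 std::size_t pos = 0) {
    return s.find(what, pos);
}
std::size_t ReverseFind(const std::string& s, const std::string& what,
                        std::size_t pos = std::string::npos) {
    return s.rfind(what, pos);
}
std::size_t FindOneOf(const std::string& s, const std::string& set,
                      std::size_t pos = 0) {
    return s.find_first_of(set, pos);
}
std::size_t FindOneNotOf(const std::string& s, const std::string& set,
                         std::size_t pos = 0) {
    return s.find_first_not_of(set, pos);
}
std::size_t FindLastOf(const std::string& s, const std::string& set,
                       std::size_t pos = std::string::npos) {
    return s.find_last_of(set, pos);
}
std::size_t FindLastNotOf(const std::string& s, const std::string& set,
                          std::size_t pos = std::string::npos) {
    return s.find_last_not_of(set, pos);
}
```

All of them return std::string::npos when nothing is found, whereas CString's own Find family returns -1.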
- The CString-like SpanIncluding, SpanExcluding, TrimLeft, TrimRight, and their STL-style counterparts have been implemented:
    BasicString substr_of     ( const BasicString& _CharSet ) const;
    BasicString substr_not_of ( const BasicString& _CharSet ) const;
    BasicString SpanIncluding ( const BasicString& _CharSet ) const;
    BasicString SpanExcluding ( const BasicString& _CharSet ) const;
    void trim      ( _E _C );
    void trim      ( const BasicString& _CharSet = sm_SpaceSym );
    void rtrim     ( _E _C );
    void rtrim     ( const BasicString& _CharSet = sm_SpaceSym );
    void TrimLeft  ( _E _C );
    void TrimLeft  ( const BasicString& _CharSet = sm_SpaceSym );
    void TrimRight ( _E _C );
    void TrimRight ( const BasicString& _CharSet = sm_SpaceSym );
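Spanning and trimming both reduce to the find_first_of and find_last_not_of family of std::string. The sketch below shows one way to write them; the function names are hypothetical, and the default trim set of a single space is an assumption, since the contents of sm_SpaceSym are not given in the article.

```cpp
#include <cstddef>
#include <string>

// Leading run of characters that all belong to `set` (SpanIncluding analog).
std::string span_including(const std::string& s, const std::string& set) {
    return s.substr(0, s.find_first_not_of(set));
}

// Leading run of characters none of which belong to `set` (SpanExcluding analog).
std::string span_excluding(const std::string& s, const std::string& set) {
    return s.substr(0, s.find_first_of(set));
}

// Remove leading characters found in `set` (TrimLeft analog).
void trim_left(std::string& s, const std::string& set = " ") {
    s.erase(0, s.find_first_not_of(set));   // npos erases the whole string
}

// Remove trailing characters found in `set` (TrimRight analog).
void trim_right(std::string& s, const std::string& set = " ") {
    const std::size_t last = s.find_last_not_of(set);
    s.erase(last == std::string::npos ? 0 : last + 1);
}
```

When every character belongs to the set, the find call returns npos, so the span functions return the whole string and the trim functions leave it empty.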
- As in CString, the FormatMessage API helpers have been included in the BasicString class. Compared to MFC, two new overloads of the FormatMessage method have been added. They accept a va_list argument list as their second parameter.
    void __cdecl format_msg     ( unsigned int _FmtID, ... );
    void __cdecl format_msg     ( const _E* _Fmt, ... );
    void         vformat_msg    ( unsigned int _FmtID, va_list _Args );
    void         vformat_msg    ( const _E* _Fmt, va_list _Args );
    void __cdecl FormatMessage  ( UINT _FmtID, ... );
    void __cdecl FormatMessage  ( const _E* _Fmt, ... );
    void         FormatMessageV ( UINT _FmtID, va_list _Args );
    void         FormatMessageV ( const _E* _Fmt, va_list _Args );
One more major improvement is the FormatSystemMessage and FormatModuleMessage methods, which format messages that reside in the system resource tables or in a module's resources, respectively. For example, the FormatSystemMessage method can be used to obtain text descriptions for the results of GetLastError or for error codes returned by COM objects.
    void format_sys_msg       ( DWORD _MsgCode, ... );
    void vformat_sys_msg      ( DWORD _MsgCode, va_list _Args );
    void format_mod_msg       ( HMODULE hInst, DWORD _MsgCode, ... );
    void vformat_mod_msg      ( HMODULE hInst, DWORD _MsgCode, va_list _Args );
    void FormatSystemMessage  ( DWORD _MsgCode, ... );
    void FormatSystemMessageV ( DWORD _MsgCode, va_list _Args );
    void FormatModuleMessage  ( HMODULE hInst, DWORD _MsgCode, ... );
    void FormatModuleMessageV ( HMODULE hInst, DWORD _MsgCode, va_list _Args );
- In-place character case conversion methods have been implemented. They are:
    void to_upper  ();
    void to_lower  ();
    void MakeUpper ();
    void MakeLower ();
- The CString-like methods Mid, Right, and Left, which simply map to the native STL substr method, have been added:
    BasicString Mid   ( size_type _Pos, size_type _N ) const;
    BasicString Right ( size_type _N ) const;
    BasicString Left  ( size_type _N ) const;
as well as GetLength, IsEmpty, Insert, Delete, and Replace:
    BasicString& Insert  ( size_type _Pos, _E _C );
    BasicString& Insert  ( size_type _Pos, const BasicString& _S );
    BasicString& Delete  ( size_type _Pos, size_type _N = 1 );
    BasicString& Replace ( _E chOld, _E chNew );
    BasicString& Replace ( const BasicString& strOld, const BasicString& strNew );
Note that, unlike their CString counterparts, the Insert, Delete, and Replace methods return a reference to the string itself rather than the new string length. This makes it possible to chain string modifications in a single expression, while the length of the resulting string is still available through the length method without any performance penalty.
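Both points above, the mapping onto substr and the benefit of returning a reference, can be illustrated with std::string alone. In this sketch the free functions `Mid`, `Left`, `Right`, and `bracketed` are hypothetical, and the clamping of out-of-range arguments is a design choice made here, not something the article specifies.

```cpp
#include <cstddef>
#include <string>

// Hypothetical substr-based analogs of Mid, Left, and Right.
std::string Mid(const std::string& s, std::size_t pos,
                std::size_t n = std::string::npos) {
    return pos < s.size() ? s.substr(pos, n) : std::string();
}
std::string Left(const std::string& s, std::size_t n) {
    return s.substr(0, n);                 // substr clamps n to the length
}
std::string Right(const std::string& s, std::size_t n) {
    return n >= s.size() ? s : s.substr(s.size() - n);
}

// std::string's erase, insert, and append each return a reference to the
// string, so several modifications can be chained in one expression.
std::string bracketed(std::string s) {
    s.erase(0, 1).insert(0, "[").append("]");
    return s;
}
```

Had these operations returned the new length instead, as the CString versions do, the chained call in `bracketed` would have to be split into three statements.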
Changes in Version 2
As I noted at the beginning of the article, after I migrated to VC++ 7.1 I was terrified by the inefficiency of Microsoft's basic_string implementation. The implementation in VC++ 6.0 is far from optimal as well, but it is much better than what VC++ 7.1 offers now. Let's briefly consider the flaws.
In VC++ 6.0, basic_string supports reference counting, but keeps a DWORD of string length and a DWORD of allocated string buffer size in each instance of a string object instead of keeping them only in the shared memory block together with reference count (as, for example, CString does).
Another waste of space is the allocator, which is a data member of basic_string. Therefore, even the standard STL allocator, which has no members, takes a DWORD in the string object's storage. If the allocator were a base class, the compiler could use the empty base optimization (EBO) and allot no memory for it at all.
As a net result, each string object in VC++ 6.0 takes four DWORDs of storage, while it could take only one DWORD (to keep a pointer to the shared memory block).
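The effect of the empty base optimization mentioned above is easy to observe in isolation. The following sketch uses hypothetical types (`EmptyAllocator`, `RepWithMember`, `RepWithBase`) rather than the real basic_string layout, and compares the size of an object that stores a stateless allocator as a member with one that inherits from it.

```cpp
// Stateless allocator stand-in: it has no data members at all.
struct EmptyAllocator {};

// Allocator stored as a data member: it occupies at least one byte, which
// alignment padding rounds up to a larger slot next to the pointer.
struct RepWithMember {
    char* data;
    EmptyAllocator alloc;
};

// Allocator used as a base class: the empty base optimization lets it
// share the address of the first member, so it adds nothing to the size.
struct RepWithBase : private EmptyAllocator {
    char* data;
};

static_assert(sizeof(RepWithBase) == sizeof(char*),
              "empty base adds no storage");
static_assert(sizeof(RepWithMember) > sizeof(RepWithBase),
              "empty member costs storage");
```

On a 32-bit target such as the ones discussed here, `RepWithMember` would typically be 8 bytes and `RepWithBase` 4 bytes, which corresponds to the one-DWORD saving described above.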
In VC++ 7.1, basic_string DOES NOT support reference counting at all, but instead each instance of the string object has a 16-byte member array for short strings. This increases the string object size up to seven DWORDs (28 bytes)!
As my tests have demonstrated, the absence of reference counting severely hurts performance when strings, or objects containing strings, are stored in STL containers, and this is the most frequent scenario of string usage! In addition, choosing between a member array and a dynamically allocated buffer depending on the string length requires one conditional operation for each access to the string. Such branches hurt the deep pipelines of modern processors, so the overall performance of string operations proves to be lower than if the string buffer were always dynamically allocated (because allocation is done once, whereas accesses to strings are normally frequent). Finally, when operating on large sets of string objects, the size of the object itself becomes important. For example, a vector of 1,000,000 strings (far from the biggest one for a large search system) requires 28 MB for the string objects alone in VC++ 7.1 (1,000,000 × 28 bytes), while it could take only 4 MB (1,000,000 × 4 bytes).
To improve performance, I have modified the basic_string class from VC++ 6.0 to correct the flaws mentioned above and to make it compatible with the VC++ 7.1 iterator model. Now the BasicString class can be based not only on the native STL basic_string, but also on its modified version contained in the "xstring.h" header. Unfortunately, a compiler bug in VC++ 6.0 SP5 prevents usage of the modified version with Visual Studio 6: the compiler generates incorrect code for the special inheritance graph that enables EBO. Therefore, usage of the optimized basic_string version is suppressed in VC++ 6.0.
In VC++ 7.1, the optimized version of basic_string is used by default. To force using native STL's implementation, define the USE_CURRENT_STL_IMPL flag before including the "BasicString.h" header.
The main improvements in the BasicString class provided by the optimized basic_string version are as follows:
- The compact object layout: a single DWORD vs. four DWORDs in VC++ 6 and seven DWORDs in VC++ 7.1 (!);
- Optimized copy/assignment implementation;
- Optimized GetBuffer/ReleaseBuffer performance;
- Support for maximally quick initialization of temporary string objects with a string literal via the family of _R... macros. For more information, see the comments for these macros declaration in the "BasicString.h" header.
Other improvements (not depending on the basic_string class) are:
- Template constructor that is used to construct a BasicString object from any type, which either can be converted to pointer to underlying string element type (const _E*), or to const char*, or to const wchar_t*, or has the c_str() method.
This constructor prevents ambiguity when the source type can be implicitly converted to both const char* and const wchar_t* (and in this case selects the most efficient way of initializing).
In effect, this allows you to create BasicString objects from any extant string class and, potentially, from future string classes without adding specialized constructors (for example, the first version of BasicString contained specialized conversion constructors for _bstr_t and CString, which have now become unnecessary).
The only side-effect of using this constructor is bulky compiler error messages when you try to create a string from the object of improper type.
- Extensible semantics of the ToStr/ToWstr/ToAstr macros. You can make any type convertible to the String/WString/AString type by means of a simple declaration using one of the DEFINE_TYPE_TO_STRING / DEFINE_TYPE_TO_STRING_x macros. See how this is done for the RECT, POINT, SIZE, CRect, CPoint, and CSize types at the end of the "BasicString.h" header.
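The template constructor described in the list above accepts any source that either converts to a character pointer or exposes c_str(). A much-simplified sketch of that kind of dispatch is shown below. It is an assumption about the general technique, not the article's code: it uses C++11 features, handles only `const char*`, and `make_string`, `HasCStr`, and `HasConversion` are hypothetical names.

```cpp
#include <string>

// Hypothetical stand-ins for third-party string classes.
struct HasCStr {                       // exposes c_str(), like basic_string
    const char* c_str() const { return "via c_str"; }
};
struct HasConversion {                 // converts implicitly to a raw pointer
    operator const char*() const { return "via conversion"; }
};

// Overload 1: raw pointers and anything implicitly convertible to const char*.
inline std::string make_string(const char* s) { return std::string(s); }

// Overload 2: any type with a c_str() member. The trailing return type
// removes this template from overload resolution when T has no c_str().
template <class T>
auto make_string(const T& s) -> decltype(std::string(s.c_str())) {
    return std::string(s.c_str());
}
```

A type that offers neither route fails to compile, and the resulting diagnostics tend to be long, which matches the side effect noted above for the real constructor.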
Incompatibilities with the first version include:
- Definition of the BasicString class has been moved from the std namespace to std_ex;
- Constructor taking UINT of a string resource ID is not supported anymore. Instead, cast UINT to LPCSTR or LPCWSTR explicitly before passing it to the constructor.
- The ToTstr, ResToTstr, and GuidToTstr macros have been renamed to ToStr, ResToStr, and GuidToStr, respectively.
- The set_bstr method has been renamed to realloc_bstr.
Note also that I have not tested usage of the BasicString objects based on the optimized version of the basic_string class with the iostream library.
New Methods Summary
The following table lists the new methods that the BasicString class adds to its STL predecessor, basic_string. Because almost all new methods have both MFC- and STL-like names, the table has three columns. The first two contain the MFC- and STL-style names of the new methods, and the third contains the names of the extant basic_string methods that correspond to the added CString equivalents.
Despite the fact that the BasicString class is STL-based, it violates the STL's principle of platform independence by using some NT platform-specific methods. If your target OS is not Windows, define the _NO_WINDOWS symbol before the first inclusion of this header. This will exclude the code from using Windows-specific APIs.
Additionally, find the calls to the WideCharToMultiByte and MultiByteToWideChar functions and replace them with the standard C-library function calls that are currently commented out. But note that these do not use the current system locale (at least with VC++), so if you are writing multilingual applications, you must manage locales manually using the proper C-library functions.