Wednesday 9 February 2011

A Not-So-Clever-Now String Implementation

Back around 1995 when I first started using C++ commercially[+] I picked up an excellent book by the late Paul DiLascia titled “Windows++”. It was about writing Windows applications in C++ and the book developed a framework along similar lines to the then fledgling MFC. I decided that writing my own C++ framework would be a great way to learn C++ and I (somewhat foolishly) decided that I would use a similar code style to MFC so that I could easily reuse any of my framework code in my day job (after making it production ready of course).

Reinventing the Wheel

The heart of any C++ class library back then was the String class. C++ was still evolving and my work involved using MSVC 1.52c (aka Visual C++ 1.5) which had no template[^] support so naturally I created my own ‘classic’ class that managed an ANSI char buffer:-

class CString
{
private: 
  int   bufsize;  
  char* buffer;      
  . . .
};

One of the problems with this class was that it was too easy to screw up[*] when using the printf() family of functions because it had a different memory layout to a plain char*, so if you forgot to invoke the moral equivalent of c_str() you could end up crashing the process because the bufsize member would be interpreted as a string pointer:-


  CString str = “chris”; 
  printf(“hello %s”, str); // BOOM!
}

Swapping the two members around would probably work for the single-string case above, but not for N arguments as the trailing bufsize member would become the next string:-


  CString first = “chris”, last = “oldwood”; 
  printf(“hello %s %s”, first, last); // Still BOOM!
}

Now, I felt that giving up the beauty of the printf() family was too much; I had just spent the last few years learning C and so I had the various format specifiers committed to memory. Windows also had its own variation, wsprintf(), that supported a few other useful extensions such as near/far pointers and the ANSI/Unicode build dependent variants of “%c” and “%s”.

Tempted by the Dark Side

One day whilst debugging, I stepped into some code that passed a MFC string directly into a printf() call and realised that it hadn’t gone bang. As I stepped into the printf() call I realised that the reason it had worked was because an MFC CString object had the same stack footprint as a raw char* pointer. When an object is passed into a varargs style function the object is passed by value, which in the case of a CString is effectively a single pointer. Also because the class only has a pointer based member it has no extra padding or alignment requirements – at least not on the x86 targets I was interested in.

I thought this was rather clever and set about changing my implementation so that it had similar behaviour. It hadn’t registered at first but I later realised the similarity with the implementation of the COM BSTR type which also stores the string length in an area preceding the string contents. And so I switched my implementation to this:-

class CString
{
private: 
  class CStringBuffer 
  {  
      int   bufsize; 
      char buffer[0]; 
  }; 

  char* string;  // Really CStringBuffer.buffer 
  . . .
};

So now I allocate a single chunk of memory for the header (CStringBuffer) + contents (N * sizeof(char)), but instead of storing a pointer to the CStringBuffer header as a member, I slide along the memory block and store a pointer to the first character of the string, i.e. &CStringBuffer.buffer[0]. I used an interesting Microsoft extension- the zero element array. This beast used to be quite common in the Windows headers and is a useful way to create a placeholder for a variable length array that follows a structure in a contiguous piece of memory. The alternative is the standards compliant “char buffer[1]”, but then that makes your sizeof() arithmetic a little more complicated as you have padding to account for along with the extra byte in the array. Of course to access the header structure without littering the code with casts I wrapped the necessaries in an inline method:-

CStringBuffer* CString::GetBuffer()

  return ((CStringBuffer*)string)-1;
}

All very ugly and switching from a C-style cast to a proper reinterpret_cast<> doesn’t make it any less unpleasant. Still, I happily lived in ignorant bliss for well over a decade until I decided to make a conscious effort to drop my CString class in favour of the standard string type std::string – at least for new libraries that is. The conduit was reading Matthew Wilson’s Imperfect C++ and coming across his STLsoft libraries. Suddenly I found a new impetus to deal with the impedance mismatch between MFC/ATL and Standard C++ (something I believe his new book may well be tackling).

A New Hope

The switchover went swimmingly well with my new Core and XML libraries showing the way forward whilst the old WCL, NCL and MDBL libraries became the epitome of the word ‘'legacy’. Where possible I added extra overloads to allow std:string’s to be used naturally but by-and-large the worlds were kept apart and the cost of the occasional implicit[#] conversion from a CString to a std::string was left for the profiler to find, should it ever show up.

Then I decided that where possible all new code, even in my legacy libraries, should adopt the proper string type and that I would slowly migrate code across. So I started making deeper and deeper refactorings (backed by my unit tests – where they existed) and still things looked good. Visual C++ 10 (aka VS2010) arrived on the scene and so a port was required to ensure my codebase was compatible and much to my surprise a unit test in my ANSI to Unicode conversion helper functions choked on an ASSERT inside the STL. The code was wrong, no doubt about it, but why hadn’t I see it before as I usually use STLport to keep me honest? I did a build with STLport 5.2 and it also whinged! So I fixed my iterator dereferencing bug and ran the entire suite to make sure I hadn’t entered into the murky waters of Undefined Behaviour anywhere else by accident and again it when BANG! One of my COM library tests was now failing! Running the test under the debugger revealed a more serious problem as the MS debug runtime started complaining about other matters…

Undefined Behaviour Strikes Back

It took a couple of debugging attempts but then I winced as I spotted my error. I had switched over a couple of helper functions from returning a CString to a std::string and the result of that function was in turn being passed to a varargs logging function like so:-

std::string SomeClass::toString()

  return . . .;
}
. . .
void Elsewhere::DoSomething()

  LOG(“Blah, blah, blah %s”, something.toString()); 
  . . .
}

Yup, there was no call to c_str()!

Then the true cost of my foolish design choice started to dawn on me as I realised that there was no obvious way to catch the problem at compile time because varargs functions are effectively type-less once you extend into the ‘variable’ part of the parameter list. Yes, it might be possible to do something with templates and macros but that would be non-trivial. Add in the fact that I have used varargs style functions to do all my string formatting, often in my exception handling, and you can guarantee that there are plenty of time bombs in the shape of rarely executed exception handlers just waiting to go off.

Another New Hope

Maybe it was fate, but I had always been meaning to get my Core library and its unit test suite to build and run under more than just Visual C++. So I started investigating the alternatives such as GCC and Digital Mars (see “GCC for the Visually C++ Impaired”) and soon I had both Core and XML working under GCC 3.x and 4.x via the Code::Blocks IDE – albeit only with the standard warning level.

After being distracted for some months I finally got back to playing with GCC as I had always intended on compiling my code with the highest warning settings available to see how much static code analysis GCC can do over-and-above what VC++ already does – an awful lot it turns out (a topic for another day). And, guess what one of the behaviours that it singles out? Yup, the passing of non-primitive types to printf() style functions! I’ve yet to dig deeper and see if it is varargs functions in general or whether it’s functions like printf(). Either way it spots code like this:-


  std::string text(“. . .”); 
  printf(“%s”, text);
}

Which is great because I now have a clear path forward for deprecating my CString type safely:-

  1. Add a c_str() method to my CString type to provide compatibility with std::string
  2. Start compiling[$] more of my codebase with GCC 4.x to identify and fix those bugs by appending a call to .c_str()
  3. When a suitable refactoring moment arises switch code from my CString type to the standard string type
  4. Test, test and test some more…

Epilogue

I often wonder whether it’s still worth maintaining my 15 year old C++ codebase of tools. Many of them were made obsolete years ago and so I have stopped fixing up the really old applications but the libraries live on – kept alive by each new utility that I write. I could start again from scratch but you don’t get that luxury in the real world so I’m inclined to try and refactor my way out. This way it remains a learning exercise – albeit one about refactoring legacy code – which sadly is what a lot of commercial C++ projects have become.

 

[+] I first used it at university some years earlier and at that time you used the ‘overload’ keyword to mark overloaded functions. I managed to get G++ (nay GCC) working on my Atari 1040 STE (with twin floppy drives) but it was not a pleasant experience. I already thought C was too bloated in comparison to Assembly Language and C++ was therefore even more pointless. Sooo naive…

[^] Not that I knew what a template was back then of course. I never really started to understand templates until the turn of the millennium and Template Meta-programming is still largely theory to me – that’s why there’s the Boost libraries!

[*] If you were lucky it was in a common code path and you got a UAE/GPF there and there; but often you weren’t and it became a time bomb if the code path was in an error handler that was called infrequently.

[#] Yes, I added a conversion operator to ‘const char*’ to avoid having two overloads on every text API method or littering my code with .c_str() style warts. Choose your poison…

[$] Sadly, although my Core and XML libraries link and I can therefore run the unit tests, I can’t do the same for my WCL, COM or MDBL libraries or projects dependent on them (i.e. everything). But that’s ok because it’s the static code analysis I’m after anyway.

2 comments:

  1. Oddly enough I have just this week implemented a string class remarkably similar to your 90s one!
    The class has a single char* pointer, with a refcount and string length stored *before* the address pointed to.
    In my case the goal was not for interop with varargs functions. I needed a ref counted, COW, implementation with a single allocation per instance - but for it to still be easy to see in the debugger.
    Doing so reduced our memory footprint by at least 60Mb (from 240mb) - so it seemed it was worthwhile :-)

    As I was reading your post I was thinking, "what's the big caveat that I'm missing that he's about to point out?" I was quite relieved that it was just the printf issue in the end :-)

    Obviously I still do have a latent issue with COW and atomic counts - but since we're 98% single threaded - and those parts of the system that do involve threads are sufficiently isolated - it's a risk we're prepared to take without using interlocked inc/ dec.

    ReplyDelete
  2. You're right of course there is nothing wrong with the CString class design per se - the problem is the reason why I implemented it that way and how I ended up using it.

    I also saw similar memory footprint gains in a MFC app I worked on due to exploiting the COW nature. A shared_ptr is clearly more explicit but so much uglier to see in interfaces :-).

    Are you really sure the interlocked calls will add an unecessary overhead? In the past I've heard that same argument with shared_ptr but found that they've been passed-by-value everywhere when pass-by-reference would have been perfectly safe and avoided the unecessary extra ref count twiddling around each call.

    ReplyDelete