Aussie AI

Chapter 19. Supportability

  • Book Excerpt from "Safe C++: Fixing Memory Safety Issues"
  • by David Spuler

What is Supportability?

Supportability refers to making it easier to support your customers in the field. This means making it easier for your customers to solve their problems, and also making it easier for your phone support staff whenever customers call in.

Hey! I have an idea: how about you build an AI chatbot that knows how to debug your software? Umm, sorry, rush of blood to the head.

Some of the areas where the software's design can help both customers and support staff include:

  • Easy method to print the program's basic configuration, version, and platform details (e.g., either an interactive method or they're logged to a file).
  • Printing important platform stats (e.g., what CPU/GPU acceleration was found by the program, what is sizeof int, and so on).
  • Self-check common issues. Don't just check for input file not found. You can also check if it was empty, blanks only, zero words, punctuation only, wrong character encoding, and so on.
  • Verbose and meaningful error messages. Assume every error message will be seen by customers.
  • Specific error messages. Lazy coders group two failures: “ERROR: File not found or empty.” Which is it?
  • Unique codes in error messages.
  • Documenting your error messages in public help pages or by making your online support database world-public (gasp!).
  • Retain copies of all shipped executables, with and without debug information, as part of your build and release process, so you can postmortem debug later.
  • Have a procedure whereby customers can upload core files to support.
  • Documentation for customers about how to run the support/debug features of the software, how to run self-diagnostics, or how to locate the tracing logs or other supportability features.
  • Documenting post-mortem procedures, such as gdb on a core file, as customers aren't necessarily engineers.
  • Not crashing in the first place. Fix this by writing perfect code, please.
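As a rough illustration of the self-check and specific-message items in the list above, here is a minimal sketch that tells apart a missing file, an empty file, and a blanks-only file. The names InputStatus, classify_input_text, check_input_file, and input_status_message are hypothetical, invented for this example, and the message wording is only a suggestion.

```cpp
#include <cctype>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical status values: one per distinct failure, never grouped.
enum class InputStatus { Ok, NotFound, Empty, BlanksOnly };

// Classify text that has already been read into memory.
InputStatus classify_input_text(const std::string& text)
{
    if (text.empty()) return InputStatus::Empty;
    for (unsigned char c : text) {
        if (!std::isspace(c)) return InputStatus::Ok;  // Found real content
    }
    return InputStatus::BlanksOnly;  // Only spaces, tabs, or newlines
}

// Open the file and classify its contents.
// Assumption: any open failure is reported as NotFound; a fuller version
// would also separate permission errors from missing files.
InputStatus check_input_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) return InputStatus::NotFound;
    std::ostringstream contents;
    contents << in.rdbuf();
    return classify_input_text(contents.str());
}

// One specific message per failure.
const char* input_status_message(InputStatus status)
{
    switch (status) {
    case InputStatus::Ok:         return "OK";
    case InputStatus::NotFound:   return "ERROR: Input file not found.";
    case InputStatus::Empty:      return "ERROR: Input file is empty.";
    case InputStatus::BlanksOnly: return "ERROR: Input file contains only whitespace.";
    }
    return "ERROR: Unknown input status.";
}
```

The same pattern extends to the other checks in the list (zero words, punctuation only, wrong character encoding) by adding one status value and one message for each.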

Why use unique message codes? Adding unique numeric or symbolic codes in your error messages and even in assertions can improve supportability in two ways: self-help and phone support call-ins. A unique code allows customers to find these error codes easily on the internet (e.g., via Google or Bing), either in your website's online help web pages, or on third-party websites (e.g., Stack Overflow and the like), where other customers have had the same problem.

Note that the codes don't really need to be completely unique, so don't worry if two messages have the same code, unless you're doing internationalization! And certainly, don't agonize over enforcing a huge corporate policy for all teams to use different numbers or prefixes. However, it does help for your unique code to have a prefix indicating which software application it's coming from, because the AI tech stack has quite a lot of components in production, so maybe you need a policy after all (sigh).
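One possible shape for such a code is an application prefix plus a zero-padded number. The sketch below is only an illustration: the function name make_error_message, the "AUSSIE" prefix, the "-E" separator, and the four-digit width are all assumptions rather than a recommended standard.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Build a searchable error message such as:
//   ERROR AUSSIE-E1042: Input file is empty.
// The prefix says which application the message came from.
std::string make_error_message(const std::string& app_prefix, int code,
                               const std::string& text)
{
    std::ostringstream out;
    out << "ERROR " << app_prefix << "-E"
        << std::setw(4) << std::setfill('0') << code  // Zero-pad to 4 digits
        << ": " << text;
    return out.str();
}
```

A customer who pastes "AUSSIE-E1042" into a search engine should then land on the matching help page, and a support engineer can grep the source for the same string.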

Note that supportability is at the tail end of the user experience. It's less important than first impressions: the user interface, installation and the on-boarding experience.

Graceful Core Dumps

Okay, so it's a bit of an oxymoron to say that. But here's a way to at least print a useful support message if your code crashes. Register a signal handler for SIGSEGV (segmentation fault), SIGILL (illegal instruction), SIGFPE (arithmetic error, such as integer division by zero), and any other fatal signals you can think of.

Here's an example of a shutdown signal handler:

    #include <stdio.h>   // fprintf
    #include <stdlib.h>  // abort

    void crash_gracefully(int sig)
    {
        static bool s_already = false;
        if (s_already) {
            // Already shutting down
            // Probably a re-raised signal
            // Too dangerous here to do anything
            return;  // Just finish
        }
        s_already = true;  // Avoid recursive calls
        // Note: fprintf is not formally async-signal-safe,
        // but we are about to crash anyway.
        fprintf(stderr, "Hi Customer, unfortunately the code is crashing.\n");
        fprintf(stderr, "Please call 1-800-DEV-NULL to report the crash.\n");
        fprintf(stderr, "You can also email the core file to devnull@<MYSITE>.com.\n");
        abort();  // Trigger the core dump
    }

This is how you install a signal handler, usually at the start of program execution, such as the top of main.

    #include <signal.h>
    // ...
    signal(SIGSEGV, crash_gracefully);
    signal(SIGILL, crash_gracefully);

I've successfully used this approach on Unix and Linux platforms, but I'm not sure about macOS (which is like BSD Unix) or Windows (which isn't). Note that you must be very careful about re-raised signals here. Otherwise, this function is going to spin for the customer, rather than core dump.

There's not much you can do other than print a message and core dump. You can't recover from a fatal signal. If you try to block a SIGSEGV signal and just return, hoping to keep going, it won't work. Instead, the faulting instruction runs again, the CPU raises SIGSEGV again, and the program spins.
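For POSIX platforms, a more conservative variant is sketched below. It uses sigaction rather than signal, prints with write (which is on the POSIX list of async-signal-safe functions, unlike fprintf), and then restores the default action and re-raises, so the core dump is attributed to the original signal. The names crash_handler_safe and install_crash_handlers are hypothetical, and the message text is a placeholder. This is a sketch under those assumptions, not a drop-in replacement.

```cpp
#include <signal.h>
#include <string.h>
#include <unistd.h>

// Handler that restricts itself to async-signal-safe calls.
void crash_handler_safe(int sig)
{
    static const char msg[] =
        "FATAL: the program has crashed. Please report this to support.\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;         // Nothing useful can be done if the write fails
    signal(sig, SIG_DFL);  // Make sure the default action is in place
    raise(sig);            // Re-raise: default action terminates (and may core dump)
}

// Install the handler for the common fatal signals.
// Returns true only if every sigaction call succeeded.
bool install_crash_handlers()
{
    const int fatal_signals[] = { SIGSEGV, SIGILL, SIGFPE, SIGBUS };
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = crash_handler_safe;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND;  // One-shot: a second fault cannot loop here
    bool ok = true;
    for (int sig : fatal_signals) {
        if (sigaction(sig, &sa, nullptr) != 0) ok = false;
    }
    return ok;
}
```

Calling install_crash_handlers() near the top of main is the intended usage. Because the handler is one-shot, a repeated fault falls through to the default action instead of spinning.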

You can test this code quite easily by writing bad code that crashes (intentionally, for a change). Feel free to try getting it to print out a stack trace with std::stacktrace (C++23), the GNU backtrace() function, or Boost.Stacktrace. I'm betting against it, because the stack will probably be too corrupted to read from inside a signal handler, but it might work.
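For the GNU route, a minimal sketch is shown below using backtrace() and backtrace_symbols_fd() from glibc's <execinfo.h>. The function name dump_stack_trace and the 64-frame limit are assumptions made for this example, and on platforms without <execinfo.h> the function simply does nothing. Whether the output is useful inside a crash handler depends on how badly the stack is damaged.

```cpp
#include <unistd.h>

#if defined(__has_include)
#  if __has_include(<execinfo.h>)
#    include <execinfo.h>
#    define HAVE_EXECINFO 1
#  endif
#endif

// Write the current call stack to the given file descriptor.
// Returns the number of frames captured (0 if unsupported).
int dump_stack_trace(int fd)
{
#ifdef HAVE_EXECINFO
    void* frames[64];                         // Assumed maximum depth
    int count = backtrace(frames, 64);        // Capture return addresses
    backtrace_symbols_fd(frames, count, fd);  // Print one line per frame
    return count;
#else
    (void)fd;
    return 0;
#endif
}
```

A call such as dump_stack_trace(STDERR_FILENO) can be tried from normal code first. Symbol names usually only appear if the program is linked with -rdynamic; otherwise the lines show raw addresses that can be resolved later against the retained debug build.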

Random Number Seeds

Neural network code often uses random numbers to improve accuracy via a stochastic algorithm. For example, top-k decoding uses randomness for creativity and to prevent the repetitive looping that can occur with greedy decoding. And you might use randomness to generate input tests when you’re trying to thrash the model with random prompt strings.

But that’s not good for debugging! We don’t want randomness when we’re trying to reproduce a bug!

Hence, we want it to be random for users, but not when we’re debugging. Random numbers need a “seed” to get started, so we can just save and re-use the seed for a debugging session. This idea can be applied to the old-style rand/srand functions or to the newer <random> library engines such as std::mt19937 (a “Mersenne Twister” generator).

Seeding the random number generator in old-style C++ is done via the “srand” function. The longstanding way to initialize the random number generator, so that every run gets a different sequence, is to use the current time:

    srand(time(NULL));

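Note that seeding with a guessable value such as the current time is a security risk. Extra arithmetic on the time value only obscures the seed a little. Where unpredictability really matters, take the seed from a less guessable source, such as std::random_device on platforms where it is backed by real entropy.

After seeding, the “rand” function returns a pseudo-random sequence that differs from run to run. The generator is efficient and adequate for many purposes, although it is not cryptographically secure. A generalized plan is to have a debugging or regression testing mode where the seed is fixed.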

    if (g_aussie_debug_srand_seed != 0) {
        // Debugging mode
        srand(g_aussie_debug_srand_seed);   // Non-random randomness!
    }
    else {  // Normal run
        srand(time(NULL));
    }

The test harness has to set the global debug variable “g_aussie_debug_srand_seed” whenever it’s needed for a regression test. For example, either it’s manually hard-coded into a testing function, or it could be set via a command-line argument to your test harness executable, so the program can be scripted to run with a known seed.

This is better, but if we have a bug in production, we won’t know the seed number. So, the better code also prints out the seed number (or logs it) in case you need to use it later to reproduce a bug that occurred live.

    if (g_aussie_debug_srand_seed != 0) {
        srand(g_aussie_debug_srand_seed);   // Debug mode
    }
    else {  // Normal run
        long int iseed = (long)time(NULL);
        fprintf(stderr, "INFO: Random number seed: %ld 0x%lx\n", iseed, iseed);
        srand(iseed);
    }

An extension would be to also print out the seed in error context information on assertion failures, self-test errors, or other internal errors.
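The same idea carries over to the <random> engines. The sketch below picks a seed (fixed in debug mode, otherwise from std::random_device) and formats it into an error-context string. The names choose_seed and format_error_context are hypothetical, and the message layout is only an assumption for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <random>
#include <string>

// Return the debug seed if one is set (non-zero), else a fresh seed.
std::uint32_t choose_seed(std::uint32_t debug_seed)
{
    if (debug_seed != 0) return debug_seed;  // Debug mode: reproducible
    std::random_device rd;                   // Normal run
    std::uint32_t seed = static_cast<std::uint32_t>(rd());
    return (seed != 0) ? seed : 1u;          // Keep 0 reserved for "not set"
}

// Attach the seed to an internal error report so the run can be replayed.
std::string format_error_context(const char* what, std::uint32_t seed)
{
    char buf[256];
    std::snprintf(buf, sizeof(buf), "INTERNAL ERROR: %s [seed=%lu 0x%lx]",
                  what,
                  static_cast<unsigned long>(seed),
                  static_cast<unsigned long>(seed));
    return std::string(buf);
}
```

Typical usage would be std::mt19937 rng(choose_seed(g_aussie_debug_srand_seed)), with the chosen seed kept in a variable so that assertion and self-test failure paths can pass it to format_error_context.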

There’s one practical problem with this for reproducibility: what if the bug occurs after a thousand queries? If there have been innumerable calls to our random number generator, there’s not really a way to reproduce the current state. One simple fix is to instantiate a new random number generator for every query, which really isn’t very expensive.
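A per-query generator could be sketched as follows, assuming each query has an index or ID that gets logged alongside the base seed. The name make_query_rng is hypothetical, and mixing the two values with std::seed_seq is one choice among several.

```cpp
#include <cstdint>
#include <random>

// Create a fresh generator for one query.
// The same (base_seed, query_index) pair always yields the same sequence,
// so a failing query can be replayed without replaying earlier queries.
std::mt19937 make_query_rng(std::uint32_t base_seed, std::uint32_t query_index)
{
    std::seed_seq seq{ base_seed, query_index };  // Mix both values
    return std::mt19937(seq);
}
```

To reproduce a failure on query 1000, only the logged base seed and the query index are needed: make_query_rng(seed, 1000) starts from the same state as it did in production.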

Adding Portability to Supportability

The basic best practice is to write portable code until you can't. Here are some suggestions to further finesse your portability coding practices for improved supportability:

    1. Self-test portability issues at startup.

    2. Print out platform settings into logs.

A good idea is to self-test that certain portability settings meet the minimum requirements of your application. It's necessary to check for the exact feature you want. And you probably should do these feature self-tests even in the production versions that users run, not just in the debugging versions. It's only a handful of lines of code that can save you a lot of headaches later.
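A startup self-test might look like the sketch below. The particular requirements (8-bit bytes, at least 32-bit int, 4-byte IEEE 754 float) are assumptions chosen for illustration, as is the function name portability_self_test; a real application would check whichever features it actually relies on.

```cpp
#include <climits>
#include <cstdio>
#include <limits>

// Check the platform assumptions this (hypothetical) application relies on.
// Returns the number of failed checks; 0 means all requirements are met.
int portability_self_test()
{
    int failures = 0;
    if (CHAR_BIT != 8) {
        std::fprintf(stderr, "PORTABILITY ERROR: char is %d bits, need 8\n",
                     (int)CHAR_BIT);
        ++failures;
    }
    if (sizeof(int) < 4) {
        std::fprintf(stderr, "PORTABILITY ERROR: int is %d bytes, need >= 4\n",
                     (int)sizeof(int));
        ++failures;
    }
    if (!std::numeric_limits<float>::is_iec559 || sizeof(float) != 4) {
        std::fprintf(stderr, "PORTABILITY ERROR: float is not 4-byte IEEE 754\n");
        ++failures;
    }
    return failures;
}
```

Calling this once at the top of main, and refusing to continue (or at least logging loudly) when it returns non-zero, costs almost nothing at runtime.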

Also, you should detect and print out the current portability settings as part of the program's output (or report), or at least to the logs. Ideally, you would actually summarize these settings in the user's output display, which helps the poor phone jockeys trying to answer callers who offer very useful problem summaries: “My AI doesn't work.”
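As a sketch of what such a summary could contain, the function below collects a few basic settings into a single log-friendly line. The name platform_summary and the choice of fields are assumptions; CPU/GPU acceleration details would need platform-specific code that is not shown here.

```cpp
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>

// Build a one-line summary of basic platform settings for logs or reports.
std::string platform_summary()
{
    std::ostringstream out;
    out << "sizeof(int)=" << sizeof(int)
        << " sizeof(long)=" << sizeof(long)
        << " sizeof(void*)=" << sizeof(void*);

    // Detect byte order by looking at the first byte of a known value.
    std::uint32_t probe = 1;
    unsigned char first_byte = 0;
    std::memcpy(&first_byte, &probe, 1);
    out << " endian=" << (first_byte == 1 ? "little" : "big");

    out << " __cplusplus=" << __cplusplus;

#if defined(_WIN32)
    out << " os=windows";
#elif defined(__APPLE__)
    out << " os=apple";
#elif defined(__linux__)
    out << " os=linux";
#else
    out << " os=unknown";
#endif
    return out.str();
}
```

Writing platform_summary() to the log at startup, and again in any crash or error report, gives support staff something concrete to paste into the incident log.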

If it's not a PEBKAC, then having the ability to get these platform settings to put into the incident log is very helpful in resolving production-level support issues. This is especially true if you have users running your software on different user interfaces, and, honestly, if you don't support multiple user interfaces, then what are you doing here?

You should also output backend portability settings for API or other backend software products. The idea works the same even if your “users” are programmers who are running your code on different hardware platforms or virtual machines, except that their issue summaries will be like: “My kernel fission optimizations of batch normalization core dump from a SIGILL whenever I pass it a Mersenne prime.”

 
