NAME

inbuf_scan -- regular-expression-based input parsing


SYNOPSIS

   #include <afblib/inbuf_scan.h>
   int inbuf_scan(inbuf* ibuf, const char* regexp, ...);
   typedef struct {
      const char* captured;
      size_t captured_len;
      int callout_number;
   } inbuf_scan_callout_block;
   typedef int (*inbuf_scan_callout_function)(inbuf_scan_callout_block*,
         void* callout_data);
   int inbuf_scan_with_callouts(inbuf* ibuf, const char* regexp,
         inbuf_scan_callout_function callout, void* callout_data);


DESCRIPTION

inbuf_scan scans multiple items from ibuf on the basis of the regular expression regexp. The syntax of the regular expression is expected to conform to that of Perl, as the Perl Compatible Regular Expressions library (libpcre) is used (-lpcre is required for linkage). For each capturing sub-expression, a pointer to an initialized stralloc object has to be passed. The behaviour is undefined if too few parameters are given.

If the regular expression contains n capturing sub-expressions, inbuf_scan returns, on a successful match, a value m <= n that gives the actual number of captures. The first m stralloc objects whose pointers have been passed after regexp are then filled with the contents of the individual captures. To suppress the copy-out of a capture, a null pointer may be passed instead. If a capturing sub-expression matches repeatedly, the last capture is stored in the corresponding stralloc object.

While regular expressions are usually applied to input that is already stored in a string, inbuf_scan automatically requests more input when the regular expression matches the current buffer contents only partially. (See the example below that returns the last input line.)

The regular expression is considered to be anchored, i.e. it must match the input right from the beginning. If, for example, there is possibly leading whitespace, ``\s*'' should be at the beginning of the regular expression.

The regular expression and the input can span multiple lines. As usual, ``.'' does not match newlines, but ``^'' and ``$'' match at the beginning and at the end of each input line. Note that ``\s'' matches all whitespace characters including line terminators. If trailing whitespace and a newline are to be removed from the input, do it non-greedily, i.e. use ``\s*?\n'' instead of a simple ``\s*\n'', which would also remove all subsequent empty lines.

inbuf_scan_with_callouts provides an alternative interface that makes the callout feature available. The callout function is called whenever a ``(?C)'' construct is reached within the regular expression. The construct also accepts an optional non-negative decimal integer as in ``(?C7)''.

Callout functions make it possible to record all occurrences of a repeating capture. Where inbuf_scan returns just the last capture, inbuf_scan_with_callouts makes all captures visible. Consider the following regular expression that captures all colon-separated fields within an input line:

   (([^:\n]*)(?C))(:([^:\n]*)(?C))*\n

In this case, inbuf_scan delivers at most two fields (i.e. the first and the last field). The callout handler of inbuf_scan_with_callouts gets called for each capture. The callout handler receives two parameters, the first of which points to a callout block with the following members:

const char* captured

Points to the beginning of the last capture. Is 0 if there was no capture yet. Note that the string is not null-terminated.

Be aware that the pointer is no longer valid when inbuf_scan_with_callouts returns. Hence, the captured content must be saved somewhere if it is to be used after the invocation of inbuf_scan_with_callouts.

size_t captured_len

Gives the length of the capture in bytes. Note that there is no null termination and that null bytes are permitted within the input.

int callout_number

Gives the callout number. By default this is 0, but if, for example, ``(?C7)'' is given, it is 7.

The second parameter to the callout function is the callout_data pointer that was passed as the last parameter to inbuf_scan_with_callouts.
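The members of the callout block and the role of callout_data can be illustrated with the following sketch. It does not link against afblib: the two typedefs are local copies of the declarations from the SYNOPSIS, and simulate_callouts is a hypothetical stand-in (not library code) that splits one line at colons and invokes the handler once per field, which is roughly the sequence of calls the ``(?C)'' pattern above is described to trigger. The names field_stats and count_field are illustrative as well.

```c
#include <stddef.h>

/* Local copies of the declarations from the SYNOPSIS so that the sketch
   compiles on its own; real code includes <afblib/inbuf_scan.h>. */
typedef struct {
   const char* captured;
   size_t captured_len;
   int callout_number;
} inbuf_scan_callout_block;
typedef int (*inbuf_scan_callout_function)(inbuf_scan_callout_block*,
      void* callout_data);

/* Hypothetical user data that reaches the handler via callout_data. */
typedef struct {
   size_t fields;      /* number of captures seen */
   size_t total_len;   /* sum of their lengths in bytes */
   size_t longest;     /* length of the longest capture */
} field_stats;

/* Callout handler: relies on captured_len only and never on strlen(),
   because the capture is not null-terminated. */
int count_field(inbuf_scan_callout_block* block, void* callout_data) {
   field_stats* stats = callout_data;
   if (!block->captured) return 0;   /* no capture yet */
   ++stats->fields;
   stats->total_len += block->captured_len;
   if (block->captured_len > stats->longest) {
      stats->longest = block->captured_len;
   }
   return 1;   /* contributes 1 to the overall result */
}

/* Hypothetical stand-in for the matcher (an assumption, NOT afblib code):
   calls the handler for each colon-separated field of line, sums the
   return values, and stops with -1 as soon as the handler returns -1. */
int simulate_callouts(const char* line, size_t len,
      inbuf_scan_callout_function callout, void* callout_data) {
   int sum = 0;
   size_t start = 0;
   for (size_t i = 0; i <= len; ++i) {
      if (i == len || line[i] == ':') {
         inbuf_scan_callout_block block = {line + start, i - start, 0};
         int rval = callout(&block, callout_data);
         if (rval == -1) return -1;
         sum += rval;
         start = i + 1;
      }
   }
   return sum;
}
```

Because the handler keeps only counters and never stores block->captured, nothing dangles once the scan has returned.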


DIAGNOSTICS

On success, inbuf_scan returns the number of captures.

Otherwise, if the input does not match or if the regular expression is invalid, -1 is returned. How much of the input is consumed when the regular expression fails to match depends on how often the input buffer had to be refilled. The contents of earlier buffer fillings are lost, but the most recent buffer filling is left untouched.

This ambiguity can be avoided by making trailing parts optional. For example, instead of

   "\\s*(\\d+)\\s*(\\d+)"

to capture two integers it is better to use

   "\\s*(\\d+)\\s*(\\d+)?"

and then to check whether the return value is 1 (one integer read) or 2 (both integers read). If -1 is returned, some leading whitespace is possibly lost but nothing else.
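After a return value of 1 or 2, the captured digits still have to be converted. The following sketch shows one way to do this without appending a null byte first. capture_to_ulong is a hypothetical helper, not part of afblib, and it assumes that a capture is available as a pointer plus a byte count (for a stralloc object, presumably its s and len members).

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of afblib): converts a capture of
   decimal digits, given as pointer and length without a terminating
   null byte, into an unsigned long. Returns false for an empty
   capture, for any non-digit, and on overflow; *val is only written
   on success. */
bool capture_to_ulong(const char* s, size_t len, unsigned long* val) {
   if (len == 0) return false;
   unsigned long result = 0;
   for (size_t i = 0; i < len; ++i) {
      if (s[i] < '0' || s[i] > '9') return false;
      unsigned long digit = (unsigned long) (s[i] - '0');
      /* result * 10 + digit must not exceed ULONG_MAX */
      if (result > (ULONG_MAX - digit) / 10) return false;
      result = result * 10 + digit;
   }
   *val = result;
   return true;
}
```

With the second pattern above, such a helper would be applied to the first capture when 1 is returned and to both captures when 2 is returned.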

inbuf_scan_with_callouts operates similarly but returns the sum of the integer values returned by the callout function. If any invocation of the callout function returns -1, further processing is aborted and -1 is returned.
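A callout function can use the -1 return value to stop a scan early. The sketch below is a minimal illustration under assumptions: the struct is a local copy of the callout block from the SYNOPSIS so that the code compiles without afblib, and field_limit and limit_fields are hypothetical names. The handler accepts a configurable number of captures and rejects any further one.

```c
#include <stddef.h>

/* Local copy of the callout block from the SYNOPSIS; real code
   includes <afblib/inbuf_scan.h> instead. */
typedef struct {
   const char* captured;
   size_t captured_len;
   int callout_number;
} inbuf_scan_callout_block;

/* Hypothetical user data passed via callout_data. */
typedef struct {
   size_t seen;   /* captures accepted so far */
   size_t max;    /* maximal number of captures to accept */
} field_limit;

/* Returns 1 for each accepted capture, so the overall result counts
   the accepted captures, and -1 once the limit is exceeded, which
   according to the text above aborts further processing. */
int limit_fields(inbuf_scan_callout_block* block, void* callout_data) {
   field_limit* limit = callout_data;
   if (!block->captured) return 0;             /* no capture yet */
   if (limit->seen == limit->max) return -1;   /* too many captures */
   ++limit->seen;
   return 1;
}
```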


EXAMPLES

Read an input line and discard the line terminator:

   stralloc line = {0};
   int count = inbuf_scan(&ibuf, "(.*)\n", &line);

Like before but discarding leading and trailing whitespace:

   int count = inbuf_scan(&ibuf, "[ \t]*(.*?)\\s*?\n", &line);

Read an arbitrary number of colon-separated fields until and including the line terminator:

   stralloc field = {0}; int count;
   do {
      // null pointer (cast for the variadic call): do not copy out the colon
      count = inbuf_scan(&ibuf, "([^:\n]*)(:)?", &field, (stralloc*) 0);
      if (count >= 1) {
         // process field
      }
   } while (count == 2);
   if (count == 1) inbuf_scan(&ibuf, "\n");

Collect all colon-separated fields in a list using inbuf_scan_with_callouts:

   int collect_field(inbuf_scan_callout_block* block, void* data) {
      strlist* list = (strlist*) data;
      if (!block->captured) return 0;   // no capture yet
      size_t len = block->captured_len;
      char* s = malloc(len + 1);
      if (!s) return -1;                // out of memory: abort the scan
      // the capture is not null-terminated, so copy and terminate it
      memcpy(s, block->captured, len); s[len] = 0;
      strlist_push(list, s);
      return 1;                         // count this field
   }
   // ...
   strlist captures = {0};
   int count = inbuf_scan_with_callouts(&ibuf,
      "(([^:\n]*)(?C))(:([^:\n]*)(?C))*\n",
      collect_field, &captures);

Read and convert decimal floating point values from an input buffer:

   bool scan_double(inbuf* ibuf, double* val) {
      const char fpre[] =
         "\\s*([+-]?(?:\\d+(?:\\.\\d*)?|\\.\\d+)(?:[eE][+-]?\\d+)?)";
      stralloc sa = {0};
      bool ok = inbuf_scan(ibuf, fpre, &sa) == 1 &&
                stralloc_0(&sa) && sscanf(sa.s, "%lg", val) == 1;
      stralloc_free(&sa);
      return ok;
   }

Read the last line from the input:

   stralloc line = {0};
   int count = inbuf_scan(&ibuf, "(?:.*\n)*(.*)\n", &line);


AUTHOR

Andreas F. Borchert