Sunday, September 11, 2011

spacef

Some time ago, I started working on a gopher server (gopher is the protocol behind gopherspace, an important part of the Internet) and I chose to write it in a modern, efficient language: C. The gopher protocol is mostly about sending tab-separated data from the server to the client in response to the requested path. So most of the parsing work is done on the client side, and the server just has to generate this data.
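
To make "tab-separated data" concrete, here is a minimal sketch of how a server could build one menu line. The layout (type character, display string, TAB, selector, TAB, host, TAB, port, CRLF) is the one from RFC 1436; the helper name format_menu_line and the sample values are made up for illustration and are not taken from gophrier.

```c
#include <stdio.h>

/* Hypothetical helper: writes one gopher menu item into out.
 * Layout per RFC 1436: <type><display>TAB<selector>TAB<host>TAB<port>CRLF.
 * Returns what snprintf returns: the length the full line needs. */
int format_menu_line(char *out, size_t size, char type, const char *display,
                     const char *selector, const char *host, int port)
{
    return snprintf(out, size, "%c%s\t%s\t%s\t%d\r\n",
                    type, display, selector, host, port);
}
```

Calling it with type '0' (a text file), "About this server", "/about.txt", "example.org" and port 70 produces a single line whose fields are separated by tabs, which the client then has to split.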

But recent implementations of gopher servers introduced a new concept, gophermaps: server-side files that describe what should be displayed to the client. That also means the server has some parsing work to do. But it should be OK: C comes with maybe one of the best string manipulation libraries ever made, with short, descriptive function names (for instance, strpbrk or strrchr) and a sane way to store string length (I really wonder why anyone would use an alternate implementation such as bstring).

So, the problem was quite simple: I wanted to read several pieces of non-tab data separated by tabs, and wrote something like this:
sscanf(buffer, "%s\t%s", first, second);
I expected it to read some non-space data (yes, there's already a problem here if there are spaces before the tab), then a tab, then some more non-space data. It doesn't.

After reading the man page and doing some tests, I understood a few things about my format string:
  • %s first skips leading whitespace, then reads non-whitespace data;
  • \t matches any amount of whitespace, including none; in fact, any whitespace character in a pattern does.
So my sscanf call was the same as this one:
sscanf(buffer, "%s%s", first, second);
and is equivalent to this regular expression:
"[[:space:]]*([^[:space:]]+)[[:space:]]*([^[:space:]]+)"
Not really what I wanted to read...
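
A small sketch to check that claim. The two helpers below (naive_split and plain_split, names invented here) differ only by the \t in the format string; the buffers are assumed to be 64 bytes, hence the %63s width.

```c
#include <stdio.h>

/* Hypothetical helper: the format with a literal tab between conversions. */
int naive_split(const char *input, char first[64], char second[64])
{
    return sscanf(input, "%63s\t%63s", first, second);
}

/* Hypothetical helper: the same format without the tab. */
int plain_split(const char *input, char first[64], char second[64])
{
    return sscanf(input, "%63s%63s", first, second);
}
```

Fed the line "About this server<TAB>/about.txt", both return 2 and both store "About" and "this": the split happens at the first space, and the tab plays no special role.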

First I needed to figure out how to read non-tab data only, and this part turned out to be simple enough. Square brackets in scanf patterns work much the same way as in regular expressions, so %[^\t] reads a sequence of non-tab characters (oh yeah... a \t between square brackets matches only a tab).
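
Here is a minimal sketch of %[^\t] on its own. The helper name read_field and the 64-byte buffer are assumptions for the example. Two details worth knowing: unlike %s, %[ does not skip leading whitespace, and it fails if it cannot match at least one character.

```c
#include <stdio.h>

/* Hypothetical helper: copies everything up to the first tab into field.
 * Returns 1 on success, 0 if the input starts with a tab (empty field),
 * EOF if the input is empty. */
int read_field(const char *input, char field[64])
{
    return sscanf(input, "%63[^\t]", field);
}
```

With "About this server<TAB>/about.txt" this stores "About this server", spaces included. With an input that starts with a tab, it returns 0 and leaves field untouched.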

Next I had to read the "only one tab" part, and this part was a lot more fun. %[\t] would match a sequence of tabs. To match at most a given number of tabs (and, with a width of 1, exactly one), you put a decimal field width between the % and the square bracket. The pattern is now %1[\t], but there's still a problem: each conversion specification (those %... things) needs to be stored in some output variable, and it would be stupid to store a tab each time I need to read one. Scanf provides the * modifier, which tells it to discard the output of the conversion specification. The final pattern for reading a single tab is then %*1[\t].
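
One way to see the difference between %*[\t] and %*1[\t] is to append %n, which stores the number of characters consumed so far. This is only a sketch; the helper names are invented for the example.

```c
#include <stdio.h>

/* Hypothetical helper: characters consumed by "%*1[\t]" at the start of input. */
int tabs_consumed_one(const char *input)
{
    int consumed = 0; /* stays 0 if the match fails before %n is reached */
    if (sscanf(input, "%*1[\t]%n", &consumed) == EOF)
        return 0;
    return consumed;
}

/* Hypothetical helper: characters consumed by "%*[\t]" (no width limit). */
int tabs_consumed_all(const char *input)
{
    int consumed = 0;
    if (sscanf(input, "%*[\t]%n", &consumed) == EOF)
        return 0;
    return consumed;
}
```

On an input starting with three tabs, the first helper reports 1 and the second reports 3. On an input that does not start with a tab, both report 0 because the conversion fails.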

And the correct version of my scanf is:
sscanf(buffer, "%[^\t]%*1[\t]%[^\t]", first, second);
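
For completeness, a slightly more defensive sketch of the same call. It assumes 256-byte buffers (hence the 255 widths, which keep sscanf from overflowing them) and checks that both fields were actually read. The name split_fields and the FIELD_MAX constant are made up here, not taken from gophrier.

```c
#include <stdio.h>

#define FIELD_MAX 255 /* assumed maximum field length; must match the widths below */

/* Hypothetical helper: splits "first<TAB>second".
 * Returns 1 if both fields were read, 0 otherwise. */
int split_fields(const char *buffer, char first[FIELD_MAX + 1],
                 char second[FIELD_MAX + 1])
{
    return sscanf(buffer, "%255[^\t]%*1[\t]%255[^\t]", first, second) == 2;
}
```

With "About this server<TAB>/about.txt" it returns 1 and fills in both fields. With two consecutive tabs, or no tab at all, it returns 0. Note that a trailing newline in the buffer would end up in the second field.
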
For those curious about my gopher server, the mercurial repository is here: http://hg.tuxfamily.org/mercurialroot/gophrier/gophrier/ and there's a mirror here: https://bitbucket.org/guillaume/gophrier
