Ruby strtok()


While developing a CSS analyzer, I needed a method for splitting CSS files into meaningful chunks. A 'meaningful' chunk is a sequence of characters that has semantic value in the CSS specification, such as a keyword (em, border), a selector (div, p + p), or a property value (bold, #773e1a).

C provides a function specifically crafted for this occasion: strtok(). strtok() is defined in the ISO C standard and is available in many C-based languages (C++, PHP, and Matlab). It splits a string into tokens based on a set of delimiters. Each call returns the next token, which is the run of characters found between one or more delimiters; the first call takes the string itself, and subsequent calls pass NULL to continue with the same string. Note that strtok() modifies the string it tokenizes. Here is a simplified example (in C) of the tokenizer to demonstrate strtok().

#include <stdio.h>
#include <string.h>

#define DELIM "{}:; "

int main (int argc, char **argv) {
  char str_to_tokenize[] = "p { font-size: 1.4em; font-weight: bold }";
  char *str_ptr;

  fprintf(stdout, "Split \"%s\" into tokens:\n", str_to_tokenize);

  /* The first call takes the string; later calls pass NULL to continue. */
  str_ptr = strtok(str_to_tokenize, DELIM);
  while (str_ptr != NULL) {
    fprintf(stdout, "%s\n", str_ptr);
    str_ptr = strtok(NULL, DELIM);
  }

  return 0;
}

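Ruby does not provide an interface to strtok(). However, the String#split method can perform the same task with more flexibility. split takes the delimiter as its first argument, which can be either a string or a regular expression (an optional second argument limits the number of fields returned). Below is the Ruby version of the same example. Its pattern leaves the space out of the delimiter set, so each token keeps its surrounding whitespace.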

DELIM = /[{}:;]+/
str_to_tokenize = "p { font-size: 1.4em; font-weight: bold }"
puts str_to_tokenize.split(DELIM)
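
The C example also treats the space as a delimiter. To reproduce its output exactly, the space can be added to the character class. This is a minimal sketch; the names DELIM_WITH_SPACE and c_style_tokens are illustrative and do not come from the original code.

```ruby
# Same delimiter characters as the C example's DELIM, including the space.
DELIM_WITH_SPACE = /[{}:; ]+/

css_rule = "p { font-size: 1.4em; font-weight: bold }"

# split drops the empty field left after the trailing " }".
c_style_tokens = css_rule.split(DELIM_WITH_SPACE)

puts c_style_tokens
```

This prints p, font-size, 1.4em, font-weight, and bold, one per line, which matches the tokens the C program prints.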

Because the delimiter is defined by a regular expression, we can include additional functionality. For example, to simplify harmony's parser, I opted to collect the delimiter that was found between each pair of tokens. When the pattern contains a capture group, String#split includes the captured text in the array it returns, so this tweak required only two additional characters, the parentheses: DELIM = /([{}:;]+)/.
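
To make the effect of the capture group concrete, here is a small sketch of what split returns with the parenthesized pattern. The variable names are illustrative, and the pairing with each_slice is just one possible way to consume the result, not necessarily how harmony does it.

```ruby
# Parenthesized pattern: the matched delimiters are kept in the result.
DELIM_CAPTURED = /([{}:;]+)/

rule = "p { font-size: 1.4em; font-weight: bold }"
parts = rule.split(DELIM_CAPTURED)

# parts alternates token, delimiter, token, delimiter, ...
# => ["p ", "{", " font-size", ":", " 1.4em", ";", " font-weight", ":", " bold ", "}"]

# Walk the array two elements at a time to see each token with the
# delimiter that followed it.
parts.each_slice(2) do |token, delim|
  puts "#{token.strip.inspect} followed by #{delim.inspect}"
end
```

Because the space is not part of this delimiter set, the tokens still carry their surrounding whitespace, which is why the sketch calls strip before printing.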

By using regular expressions with String#split, we can duplicate the behavior of strtok().
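
One edge case differs between the two: strtok() skips delimiters at the start of the string, while String#split returns an empty first field when the input begins with a delimiter. The sketch below wraps split to paper over that difference. strtok_tokens is a hypothetical helper name, not part of Ruby's standard library, and it returns all tokens at once rather than one per call.

```ruby
# Hypothetical helper: returns every token in str, treating each character
# of delims as a delimiter, in the spirit of C's strtok().
def strtok_tokens(str, delims)
  # Build a character class matching one or more delimiter characters.
  # Regexp.escape keeps characters such as "]" or "-" from being
  # interpreted as character class syntax.
  pattern = /[#{Regexp.escape(delims)}]+/

  # strtok() never yields an empty token, so drop the empty string that
  # split produces when the input starts with a delimiter.
  str.split(pattern).reject(&:empty?)
end

puts strtok_tokens(" { p { color: red } ", "{}:; ")
```

Without the reject call, the result for this input would begin with an empty string.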