27 August 2012

As we begin to focus on improving the content of our monthly publication, one obvious thing to track is its readability level. In theory, the publication has an optimal readability level, and we should stay within specific bounds as we write and publish. For an introduction to the various techniques, start with Automated Readability Index (Wikipedia). Most of these techniques are implemented by the classes in PHP Text Statistics.
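
To make the idea concrete, here is a minimal sketch of the Automated Readability Index formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The function name `ari_score()` is hypothetical and is not part of PHP Text Statistics; the word and sentence counting is deliberately naive, so expect the library to give somewhat different numbers on real text.

```php
<?php
/**
 * Rough Automated Readability Index for a plain-text string.
 * Hypothetical helper for illustration only; not the library's code.
 */
function ari_score($text) {
  // Characters: letters and digits only (punctuation and spaces ignored).
  $characters = preg_match_all('/[A-Za-z0-9]/', $text, $unused);
  // Words: runs of non-whitespace characters.
  $words = count(preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY));
  // Sentences: runs of ., ! or ? (naive; abbreviations will be miscounted).
  $sentences = max(1, preg_match_all('/[.!?]+/', $text, $unused));
  if ($words == 0) {
    return 0.0;
  }
  return 4.71 * ($characters / $words) + 0.5 * ($words / $sentences) - 21.43;
}
```

The result approximates a US school grade level, so very short, simple sentences can produce a score below zero.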

Since we maintain content using Drupal, the obvious starting place was the Drupal Readability Project, which uses the PHP class referenced above. However, I was unsatisfied with its data model. Because Drupal allows changes to a node without creating a new version id (vid), the module stores each readability score with a timestamp. We always create new versions, and we only need one index. That made a much lighter technique available, one that did not involve adding several extra modules.

Step 1: Create an integer field on the node type.

Step 2: Implement hook_form_alter() to hide the integer field from the edit form (optional).
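
A minimal sketch of Step 2, assuming Drupal 6, a custom module named `mymodule`, and a node type named `article` (both names are placeholders, not from our site). The field name `field_readability_smog` matches the one used in Step 3.

```php
<?php
/**
 * Implementation of hook_form_alter() (Drupal 6 signature).
 * "mymodule" and "article_node_form" are placeholder names.
 */
function mymodule_form_alter(&$form, &$form_state, $form_id) {
  // Only touch the edit form for the node type that carries the field.
  if ($form_id == 'article_node_form' && isset($form['field_readability_smog'])) {
    // Denying access removes the element from the rendered form,
    // so editors cannot overwrite the computed value.
    $form['field_readability_smog']['#access'] = FALSE;
  }
}
```

The `isset()` guard matters because the field is only present if the module that adds it to the form has already run; if the field does not disappear, module weight is a likely place to look.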

Step 3: Implement hook_nodeapi() to update the integer field in the "presave" operation.

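      // Runs inside hook_nodeapi() when $op is "presave".
      include_once "sites/all/libraries/text-statistics/TextStatistics.php";
      // Score plain text only, so markup does not skew the counts.
      $body_notags = trim(strip_tags($node->body));
      $stats = new TextStatistics();
      $smog = $stats->smog_index($body_notags);
      // The field is an integer, so round the score. CCK keys values by delta.
      $node->field_readability_smog = array(
          array('value' => (int) round($smog)),
      );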
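
For context, the snippet above sits inside a hook implementation. The following is a sketch of how the whole function might look under Drupal 6; the module name `mymodule`, the `article` node type, and the decision to skip nodes with an empty body are my assumptions, not requirements of the library.

```php
<?php
/**
 * Implementation of hook_nodeapi() (Drupal 6 signature).
 * "mymodule" and the "article" node type are placeholder names.
 */
function mymodule_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
  // Act only just before the node is saved, and only on our node type.
  if ($op != 'presave' || $node->type != 'article') {
    return;
  }
  // Load the library once; the path is the one used in this post.
  if (!class_exists('TextStatistics')) {
    include_once 'sites/all/libraries/text-statistics/TextStatistics.php';
  }
  $text = trim(strip_tags($node->body));
  // Assumption: leave the field untouched when there is nothing to score.
  if ($text === '') {
    return;
  }
  $stats = new TextStatistics();
  // Integer field, so round the score; values are keyed by delta.
  $node->field_readability_smog = array(
    array('value' => (int) round($stats->smog_index($text))),
  );
}
```

Because the score is written during "presave", it is stored with the same revision as the text it describes, which is why no separate timestamp is needed.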

Further enhancement of the analysis is also possible with minimal effort. There is a freely available web service to check grammar and readability. The web service returns various statistics that could allow us to track such specific items as cliches and repeated words (likely typos). I may bundle the two options into a module if we determine that the extra data will actually be actionable for us. At the moment, we need to establish a longer track record to decide how to use the data we collect.

It is an exciting step forward.
