Terminology

For the purposes of this article, Internet pages consist of three entities:

  1. The Site (A domain, sometimes with 'alias' domains)
  2. The URL (Can consist of a directory, sub-directories and a page)
  3. The Parameters (Additional parameterised values that usually appear after a ? or ; character)
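
As an illustration of the three entities, the sketch below splits a page address into site, URL and parameters. It is not Cognesia's own parser: the function name split_page and the example address are hypothetical, and it assumes the parameters begin at the first ? or ; after the site.

```python
import re

def split_page(address):
    """Split a page address into (site, url, parameters)."""
    # Drop an optional scheme such as "http://".
    without_scheme = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", address)
    # The site is everything up to the first "/".
    site, slash, rest = without_scheme.partition("/")
    rest = slash + rest
    # Assumption: parameters start at the first "?" or ";".
    match = re.search(r"[?;]", rest)
    if match is None:
        return site, rest, ""
    return site, rest[:match.start()], rest[match.start() + 1:]

site, url, parameters = split_page(
    "http://www.example.com/shop/search.html?keyword=shoes&page=2"
)
```

Here site is the domain, url is the directory and page, and parameters is the remaining string after the separator.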

General Principles

The general aim of data processing rules is to let the user define which data is useful or interesting and needs to be retained. Anything that is not defined is treated as not of interest and is not retained. The rules are applied to every URL that Cognesia records. The purpose of rules is therefore:

  • To remove unwanted or unhelpful characters or data from page data (for example: session ids)
  • To extract useful information and store it appropriately (for example: keyword search fields, site area, page information)

This is achieved by building a data model based on rules.

There are three parts to a rule:

  1. What page(s) the rule should run on
  2. What strings/parameters should be retrieved (or removed) and what should be done with them
  3. How the page URL should be reconstructed

Many sites require only simple rules, but highly complex data models can be built if necessary.

(1) Defining what page(s) the rule should run on 

  • All Pages: The user simply defines % in the URL textbox
  • Selected Pages: The user must enter a string that identifies the URL(s), where % acts as a wildcard, e.g.
    • search - A page that consists of exactly the string "search"
    • search% - All pages that start with the string "search"
    • %search% - Any page that has the string "search" anywhere in it
    • %sear%ch% - Any page that has "sear" followed later by "ch" anywhere in it
  • Pages with Selected Parameters: In addition to the above, the user can also define additional parameters (usually with values), e.g.
    • All pages where "action=search"
    • Pages that start with "search", where "no-of-results=0"
    • Pages that have "search" anywhere in them, where "keyword=" is present

In all the cases above, you can specify ALL or SPECIFIC sites for your rule. 

(2) Defining what strings/parameters should be retrieved (or removed)

In the vast majority of cases, rules should be set to RETAIN the defined parameters, i.e. the rule is set up to look for, retrieve and retain the strings that have been defined.*

The process is as follows. Each row within the rule is for a separate data parameter:

  • Define the string that denotes the "Beginning" of the value you wish to retain, and the string that denotes the "End", e.g.
    • Beginning: search=    End: &
    • Beginning: keyword=    End: &
    • Beginning: sessionid=    End: ;
  • (Optional) Select the tickbox if you wish this parameter and value to be left in the Page URL
  • (Optional) If you wish to rename the string (or ending), define alternative values (these can be left blank if not required)
  • (Optional) To record the value in its own results table, choose the results table in the dropdown at the end of the row

*A REMOVAL rule is possible, but is a dangerous approach because anything that is not defined will be left in the URL. This is usually unmanageable.
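
The Beginning/End matching used by a RETAIN row can be pictured with the minimal sketch below. It is an assumption-laden illustration, not the product's implementation: the function name extract_value is hypothetical, and it assumes that a value with no End string after it simply runs to the end of the text.

```python
def extract_value(text, beginning, end):
    """Return the value between beginning and end, or None if absent."""
    start = text.find(beginning)
    if start == -1:
        return None  # the Beginning string does not occur
    start += len(beginning)
    stop = text.find(end, start)
    if stop == -1:
        # Assumption: with no End string, take the rest of the text.
        stop = len(text)
    return text[start:stop]

keyword = extract_value("search=red+shoes&page=2", "search=", "&")
session = extract_value("sessionid=abc123;lang=en", "sessionid=", ";")
```

Note that a plain substring search like this would also find "search=" inside a longer name such as "mysearch=", which is one reason to choose Beginning strings carefully.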

(3) Defining how the page URL should be reconstructed

The way a rule reconstructs a URL can be controlled. Both the beginning and end of the URL can be adjusted.

With the beginning of the URL, there are two possibilities:

  1. The original URL up to:
    • Generally, this is used to ensure the page (URL) that was passed to Cognesia is not altered in any way within the reporting
      • However, in complex situations, this can be set up to 'truncate' part of the URL
    • It is necessary to define what constitutes the end of the page URL.
      • Usually, this value is "?" or sometimes ";" (depending on which character your site uses to separate the page URL from any parameters)
    • This is the most commonly used setting and, if in doubt, should be used
  2. The Following Text:
    • This is used to rewrite the URL, e.g. a rule could be set up to change the URL /search/results to /searchresults/
    • This is rarely used, but occasionally useful

In addition, it is possible to append values to the end of a URL:

  1. Simply define a string that should be appended.
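
The reconstruction step can be sketched as below, under stated assumptions. The function reconstruct_url and its arguments are hypothetical; in particular, re-attaching retained parameters after the separator and joining them with & is an assumption made for illustration, not documented Cognesia behaviour.

```python
def reconstruct_url(original, up_to="?", retained=(), suffix=""):
    """Rebuild a page URL from the original, cut at the up_to string."""
    cut = original.find(up_to)
    # "The original URL up to": keep everything before the separator.
    base = original if cut == -1 else original[:cut]
    if retained:
        # Assumption: parameters ticked to stay in the URL are re-added
        # after the separator, joined with "&".
        base += up_to + "&".join(retained)
    # Optional string appended to the end of the URL.
    return base + suffix

clean = reconstruct_url(
    "/search/results?keyword=shoes&sessionid=abc",
    up_to="?",
    retained=["keyword=shoes"],
)
```

In this example the session id is dropped because it was not retained, while the keyword stays in the reported URL.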

Limitations

  • The number of 'fields' that can be specified in a rule is a contractual/account setting. If you need more, please talk with your Account Manager
  • There are no limits on the number of rules, although it is not usually necessary to create a large number
  • There are no limits on the number of result tables, although it is not usually necessary to create a large number
  • There are no limits on the number of URLs that are defined in a rule. Using the % wildcard character should keep this number down

Tips & Tricks

  • Consider possible future site/URL changes when creating rules.
  • Broad rules are best. The more specific a rule is, the more likely it is to be 'tripped up' by site changes 
  • We recommend not "rewriting" parameter names unless absolutely necessary. It can be confusing for users of the reporting.
  • It is possible (and quite normal) to record a string both in the URL and also within a result table 
  • A rule will not allow you to write two different parameters into the same result table. (It is necessary to create additional rules)
  • Extracting a value from the first directory: Define the string "$BOP>" as denoting the beginning of the field (and usually "/" as the end)
  • Extracting a value from the first sub-directory: Define After "1" occurrences of the string "/" as denoting the beginning of the field (and usually "/" as the end)
  • Extracting a value from the second sub-directory: Define After "2" occurrences of the string "/" as denoting the beginning of the field (and usually "/" as the end), and so on
  • The URL reconstruction process can be used to rewrite a URL, if done carefully
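
The directory-extraction tips above can be illustrated with a short sketch. It assumes the page URL does not begin with "/" (so the first directory sits at the very start of the page, which is how "$BOP>" is read here) and it treats a count of 0 as equivalent to "$BOP>". The function name value_after_nth and the example URL are hypothetical.

```python
def value_after_nth(text, marker, n, end):
    """Return the value that follows the n-th occurrence of marker."""
    position = 0
    for _ in range(n):
        found = text.find(marker, position)
        if found == -1:
            return None  # fewer than n occurrences of marker
        position = found + len(marker)
    stop = text.find(end, position)
    if stop == -1:
        stop = len(text)  # assumption: value runs to the end of the text
    return text[position:stop]

page = "products/shoes/trainers/index.html"
first_directory = value_after_nth(page, "/", 0, "/")
first_sub_directory = value_after_nth(page, "/", 1, "/")
second_sub_directory = value_after_nth(page, "/", 2, "/")
```

Under these assumptions the three calls pick out "products", "shoes" and "trainers" respectively.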