A Unique Problem With Link Normalisation? (Canonical URLs!)

So we have hit a serious problem with rbutr, one we need to overcome if it is going to reliably deliver rebuttals for arguments which have been rebutted. It has to do with link normalisation, and with covering all of the links that can be used for a single page.

An example is the best way to clarify this:

  • This page:
    http://online.wsj.com/article/SB10001424052970204301404577171531838421366.html
  • Is the same page as this one:
    http://online.wsj.com/article/SB10001424052970204301404577171531838421366.html?mod=WSJ_article_comments
  • which is the same page as this one:
    http://online.wsj.com/article/SB10001424052970204301404577171531838421366.html?mod=WSJ_article_comments#articleTabs%3Dcomments

Let’s call these links Shorty, Comments and Tabs, respectively.

Currently, rbutr is set up to make rebuttal links between URLs, so when someone goes to URL ‘Comments’ and submits a rebuttal to the article they find there, that rebuttal will be logged against Comments and only Comments. Anyone who visits Comments in the future will see that there is a rebuttal, but anyone who visits Shorty or Tabs will see zero rebuttals listed.
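To make the failure concrete, here is a minimal sketch of exact-match lookup. The in-memory dictionary, the function names and the example.com rebuttal link are all made up for illustration; this is not rbutr’s actual code or storage.

```python
# Hypothetical in-memory store: rebuttals keyed by the exact URL string.
# This only illustrates exact-match lookup; it is not rbutr's real code.
rebuttals = {}

SHORTY = "http://online.wsj.com/article/SB10001424052970204301404577171531838421366.html"
COMMENTS = SHORTY + "?mod=WSJ_article_comments"
TABS = COMMENTS + "#articleTabs%3Dcomments"


def submit_rebuttal(page_url, rebuttal_url):
    """Log a rebuttal against the exact URL it was submitted from."""
    rebuttals.setdefault(page_url, []).append(rebuttal_url)


def lookup_rebuttals(page_url):
    """Return rebuttals stored under this exact URL string, if any."""
    return rebuttals.get(page_url, [])


# Someone submits a rebuttal while on the Comments URL
# (the rebuttal link is a placeholder).
submit_rebuttal(COMMENTS, "http://example.com/some-rebuttal")

print(len(lookup_rebuttals(COMMENTS)))  # found under Comments
print(len(lookup_rebuttals(SHORTY)))    # nothing under Shorty
print(len(lookup_rebuttals(TABS)))      # nothing under Tabs
```

The three strings point at the same article, but as dictionary keys they have nothing in common, which is exactly the problem.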

So how do we solve this problem? How do we cover all possible versions of URLs which might be used to link to a single article?

Straightforward Normalisation and Reduction

The first answer that came to mind was to attempt to find the simplest URL that can be used for the page. This involves accessing the page content, and then deconstructing or reconstructing the URL a bit at a time until we get the smallest version of the URL which still returns the same page content as the submitted link.

That would work fine for submitting …but it doesn’t solve the exact same problem on the other side: what about all the people who land at Comments and Tabs? The system took Comments and turned it into Shorty, and even if the system stored Comments as well as Shorty, when people land on Tabs they still get no indication of a rebuttal…. And in some cases there can be tens or hundreds of real URLs to a single article. In reality, there are effectively infinite possible URLs to any page: simply sticking a ? on the end will usually still give you the same page, but the different URL will stop rbutr from recognising the page…
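For reference, the purely syntactic part of that reduction could look something like the sketch below. It assumes the query string and fragment never select different content, which holds for the WSJ example but is not true of sites in general (think index.php?id=1 versus index.php?id=2), and it skips the fetch-and-compare-content step described above. The function name naive_normalise is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def naive_normalise(url):
    """Reduce a URL to scheme + host + path.

    Assumption: the query string and fragment never change which article
    is shown. That is true for the WSJ example, but not for every site.
    """
    parts = urlsplit(url)
    # Rebuild the URL with an empty query and an empty fragment.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, "", ""))


shorty = "http://online.wsj.com/article/SB10001424052970204301404577171531838421366.html"
tabs = shorty + "?mod=WSJ_article_comments#articleTabs%3Dcomments"

print(naive_normalise(tabs) == shorty)          # True
print(naive_normalise(shorty + "?") == shorty)  # True: the stray "?" is dropped
```

A rule like this cannot tell a tracking parameter from a parameter that actually picks the article, so it can only ever be a guess made from outside the website.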

So that doesn’t actually solve our problem at all 🙁

Breakthrough!

After writing the first half of this post – all prior to this heading – I went to bed for the night. While I was sleeping, Craig managed to find what we were looking for. Turns out it is Canonical URLs, something which I had seen for years in WordPress but never really understood the significance of. I’m not sure it is 100% our solution, but it is definitely built to provide the solution we are looking for. The only problem is how many websites actually use it, because it relies on the website owners using it in order for us to take advantage of it…
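As a rough idea of how this could be used, a page that declares a canonical URL does so with a link tag with rel="canonical" in its head, and that tag can be read out of the HTML. The sketch below uses only Python’s standard library; the class and function names are made up, the sample HTML is hypothetical (we have not checked what the WSJ page actually declares), and falling back to the visited URL when no tag is present is our own assumption about what to do.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated values; compare case-insensitively.
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical_url(html, visited_url):
    """Return the declared canonical URL, or the visited URL if none is declared."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical or visited_url


# Hypothetical page source, for illustration only.
shorty = "http://online.wsj.com/article/SB10001424052970204301404577171531838421366.html"
sample_html = '<html><head><link rel="canonical" href="' + shorty + '"></head><body></body></html>'

visited = shorty + "?mod=WSJ_article_comments"
print(find_canonical_url(sample_html, visited) == shorty)            # True
print(find_canonical_url("<html></html>", visited) == visited)       # True: no tag, so fall back
```

If rebuttals were stored and looked up under the canonical URL, Shorty, Comments and Tabs would all resolve to the same record, but only on sites that bother to declare one.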

Anyway, Craig found a heap of information on it. Have a look at some if you want:

  • http://www.leancrew.com/all-this/2011/11/redundant-urls/
    This one describes the same problem outlined above, as experienced on another website
  • http://www.redirectchecker.com/canonical.htm
    A short explanation of what Canonical URLs are
  • http://www2007.org/papers/paper194.pdf
    An academic paper about “DUST” – Different URLs with Similar Text.

So we have a way forward now, and will be exploring canonical URLs and keeping an eye out for how to apply the concept effectively, so that our user experience is kept at the highest possible level (delivering rebuttals whenever we have them!)

 

 
