Using SEO for Good

    15 October, 2018 | General Hackery

    One of the things I've been quite good at as a data wrangler is turning SEO to my own purposes. I was really, really happy when Google started encouraging the use of schema.org, as it silently pushed us one more step towards the machine-readable web I long for. But I've always seen SEO as a bit of a dark art: the job of a search engine is to return the most relevant results for a particular search query, while the job of SEO is to ensure your page is returned in as many result sets as possible. So, as a web user, I kinda see the more aggressive forms of SEO as 'breaking' Google, making the web a worse place for selfish commercial gain. Despite this, I've found many ways to 'abuse' SEO practices in return to make my life better, and here is one example.

    I use RSS. I don't know how many people still do, but I do and I think it's great. It's far better than any Facebook or Twitter news feed. You can subscribe to or unsubscribe from feeds as often as you like without incurring the wrath of some automatic exit survey, you can ignore articles without anyone knowing you've even read the headline, and, in this world of pay-per-click advertising, the articles you actually read are the ones that get the ad revenue from your clicks. You can't realistically game RSS because, since Google Reader closed down, there's no de facto client that people use. I imagine it's a bloody nightmare for web analytics people, but for the user it's awesome and I hope it never goes away. Indeed, pretty much all news sites still have RSS feeds, as does this blog (see the footer).

    I say you can't game it - but you can in one small way. It's the same way as with every other web digest: clickbait headlines. My local paper (which shall remain nameless) regularly does this. I'll get a headline in my RSS reader like "local business destroyed in car crash horror". Often, I'll click the headline and be taken to an article about a local business I've never heard of, miles away from where I live, that I don't really care about. It has had to close its doors because its owner was in a minor car crash and lost his no-claims bonus, and the rise in his premium was a contributing cause of the business failing, or some such twaddle. Remember when Obi-Wan told Luke that what he told him about his father was true "from a certain point of view"? That.

    But in a rather ironic twist of fate, cunning clickbait headlines in RSS are often rendered useless by SEO practices. A closer look at the (fictional) article in the last paragraph will reveal that the actual title of the page - the text within the <title></title> tags - is much more detailed and descriptive than the title shown in the RSS feed. Even if the headline is "local business destroyed in car crash horror", the title in the HTML will be something like "Bob's Flowers in Tittington, Chobleigh goes into liquidation, bankruptcy due to high car insurance premium". The reason is obvious - this is the bit Google sees. The paper wants anyone searching for Bob's Flowers, Tittington to see this article and hopefully click the link. However, anyone subscribing to the RSS feed or already on the site reading the latest headlines is more likely to click a link that doesn't mention Bob's Flowers, if only to find out which local business the headline refers to.
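
    To make the mismatch concrete, here is a minimal sketch that reads both titles side by side. The feed item and the page are the fictional examples from above, inlined as strings so it runs without touching the network; the example.com link and the variable names are placeholders of mine, not anything the paper actually serves.

```php
<?php
// Illustrative data only: the fictional headline and page title from the text.
// The example.com link is a placeholder.
$rss = <<<XML
<rss version="2.0"><channel><item>
<title>local business destroyed in car crash horror</title>
<link>https://example.com/news/12345</link>
</item></channel></rss>
XML;

$html = <<<HTML
<html><head>
<title>Bob's Flowers in Tittington, Chobleigh goes into liquidation, bankruptcy due to high car insurance premium</title>
</head><body><h1>local business destroyed in car crash horror</h1></body></html>
HTML;

// The headline as an RSS reader sees it.
$feed = simplexml_load_string($rss);
$feedTitle = (string) $feed->channel->item[0]->title;

// The title as a search engine sees it: the text of the <title> element.
$dom = new DOMDocument();
@$dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$pageTitle = trim($dom->getElementsByTagName("title")->item(0)->textContent);

echo "Feed title: " . $feedTitle . "\n";
echo "Page title: " . $pageTitle . "\n";
```

    The same article yields two quite different strings, and only the second tells you whether it is worth a click.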

    So all you need is a single 'wrapper' script. All it needs to do is read the real RSS feed, then go through each entry in the feed, fetching the article's HTML and replacing the RSS item title with the one in the HTML. It's good for efficiency and stealth if you only fetch each article once; I do this by simply caching the HTML in a local directory. An example of my code in PHP is below.

    <?php

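    // Fetch an article (or read it from the local cache) and return the text of
    // its <title> element, minus any trailing "| Site Name" suffix.
    function get_real_page_title($url)
    {
        // Cache each page under an MD5 of its URL so every article is fetched once.
        $cache_dir = dirname(dirname(dirname(__FILE__))) . "/tmp/rss";
        $cache = $cache_dir . "/" . md5($url) . ".html";
        $html = "";
        if(file_exists($cache))
        {
            $html = (string) file_get_contents($cache);
        }
        if(strlen($html) == 0)
        {
            // file_get_contents() returns false on failure; the cast turns that into "".
            $html = (string) file_get_contents($url);
            if(strlen($html) > 0)
            {
                // Create the cache directory on first use so the write cannot fail on a missing path.
                if(!is_dir($cache_dir)) { mkdir($cache_dir, 0755, true); }
                file_put_contents($cache, $html);
            }
        }

        // Nothing to parse: loadHTML() rejects an empty string.
        if(strlen($html) == 0) { return(""); }

        $dom = new DOMDocument;
        $label = "";
        // libxml flags are combined with bitwise OR; "&&" would collapse them to a boolean.
        @$dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
        foreach($dom->getElementsByTagName("title") as $t)
        {
            $title = "" . $t->textContent;
            // Drop everything after the last "|" (usually the site name).
            $title = trim(preg_replace("/^(.+)\\|.+$/", "$1", $title));
            if(strlen($title) > 0) { $label = $title; }
        }

        return($label);
    }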

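    // Read the feed, swap each item's headline for the page's <title>, and
    // return the modified feed as an XML string ("" if the feed is unusable).
    function process_localpaper_feed($uri)
    {
        $feed = file_get_contents($uri);
        if($feed === false) { return(""); }
        $xml = simplexml_load_string($feed);
        if(!($xml)) { return(""); }
        if(!($xml->channel)) { return(""); }
        if(!($xml->channel->item)) { return(""); }
        foreach($xml->channel->item as $item)
        {
            // utf8_decode() converts UTF-8 to ISO-8859-1; it is deprecated as of PHP 8.2.
            $title = utf8_decode(get_real_page_title($item->link . ""));
            if(strlen($title) > 0) { $item->title = $title; }

            // Strip the query string from the link.
            $item->link = preg_replace("/^([^\\?]+)\\?(.*)$/", "$1", $item->link . "");
        }
        return($xml->asXML());
    }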

    echo (process_localpaper_feed("[local paper's RSS feed URL here]"));
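
    The script leans on two regular expressions, so it may help to see them in isolation. This is only a sketch: the sample title, the "Example Gazette" suffix, the example.com link and its tracking parameters are all made up for illustration.

```php
<?php
// Hypothetical inputs, chosen only to exercise the two patterns.
$rawTitle = "Bob's Flowers goes into liquidation | Example Gazette";
$rawLink  = "https://example.com/news/bobs-flowers?utm_source=rss&utm_medium=feed";

// The greedy (.+) keeps everything up to the LAST "|" and drops what follows,
// which is typically the site name appended to every page title.
$title = trim(preg_replace("/^(.+)\\|.+$/", "$1", $rawTitle));

// Keep everything before the first "?" so the query string is dropped.
$link = preg_replace("/^([^\\?]+)\\?(.*)$/", "$1", $rawLink);

echo $title . "\n";
echo $link . "\n";
```

    A title without a "|" passes through unchanged, as does a link without a "?", because preg_replace() returns its input untouched when the pattern does not match.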

    Then all you need to do is put the wrapper script somewhere it can be called via a URL and point your RSS reader at it rather than at the real feed. Hey presto, never click on an article in which you're not interested again!