MODX 360

the personal website of garry nutting

Generating a Google Sitemap when you have thousands of resources

So, you have thousands of resources on your site and you need a Google Sitemap generating. With a few hundred resources, you can use some of the available Extras for MODX Revo to alleviate the stress and pain and not have any noticeable performance dips. But what happens if you have 15,500 resources?


The answer is simple, the time required to generate the sitemap excalates greatly and you start to push both the time and memory limits of PHP. On a site I worked on recently, this was exactly the problem I had to solve. And the solution couldn't really be simpler!

As I went through the initial processes of optimization, I quickly realized that retrieving that number of resources and iterating over them to parse out the <url> nodes was far too expensive - two and a half minutes too expensive. So, I set about a way of removing the need to loop over the resources altogether and this was achievable by actually having my query generate the XML node output for me. The completed snippet is as follows:

    <?php
ini_set('max_execution_time', 0);

$priority = $modx->getOption('priorityId', $scriptProperties, 1);
$changeFreq = $modx->getOption('changeFreqId', $scriptProperties, 1);

$options = array(
  xPDO::OPT_CACHE_KEY => 'sitemap',
);
$output = $modx->cacheManager->get('sitemap', $options);

if ($output == null) {
    $output = '<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    ';

    $stmt = $modx->query("
        SELECT
    	    GROUP_CONCAT(
			    '<url>',        
			    CONCAT('<loc>" . MODX_SITE_URL . "',uri,'</loc>'),
			    CONCAT('<lastmod>',FROM_UNIXTIME(editedon, '%Y-%m-%d'),'</lastmod>'),
			    IFNULL(
				    CONCAT('<priority>',(
					    SELECT value
					    FROM modx_site_tmplvar_contentvalues
					    USE INDEX (tv_cnt)
					    WHERE contentid=id AND tmplvarid={$priority}
				    ),'</priority>'),''),
			    IFNULL(
				    CONCAT('<changefreq>',(
					    SELECT value
					    FROM modx_site_tmplvar_contentvalues
					    USE INDEX (tv_cnt)
					    WHERE contentid=id AND tmplvarid={$changeFreq}
				    ), '</changefreq>'), ''),
			    '</url>'
			    SEPARATOR ''
		    ) AS node
	    FROM modx_site_content AS s
	    WHERE s.deleted = 0 AND s.published = 1 AND s.searchable = 1 AND context_key='web'
	    GROUP BY s.id
	    ORDER BY s.id ASC
    ");
  
    if ($stmt) {
	    $rows = $stmt->fetchAll(PDO::FETCH_COLUMN);
	    $output .= implode('', $rows);
    }

    $output .= '</urlset>';

    $modx->cacheManager->set('sitemap', $output, 86400, $options);
}

return $output;
    

As you can see, the query is set up to retrieve the priority and change frequency from two TVs - this is similar to how many of the Sitemap Extras handle being able to set custom priorities and change frequencies per resource. If the values aren't set specifically for a resource, it doesn't output the relevant part of the node and will use Google's defaults.

Of particular note is the USE INDEX(tv_cnt) in the subqueries, this is telling the subqueries to use a specific index tailored for filtering on the TV ID and the content ID - without that, the query takes a considerable amount of time to run.

There are a couple of points to note in addition:

  • The query restricts the resources fetched to the web context. If you decide to use this for a different context, remember to update the context key.
  • The output is cached for 24 hours. If you want a different cache TTL, change the 86400 value to the amount of seconds you would like to cache the output for.

Feel free to use and tailor this to your own needs and I would love to get any feedback that you may have.


Comments (0)


Add a Comment





Allowed tags: <b><i><br>Add a new comment: