The Sitemap protocol format consists of XML tags. All data values in a Sitemap must be entity-escaped, and the file itself must be UTF-8 encoded.
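As a rough illustration of the escaping and encoding rules, here is a minimal Python sketch. The function name `build_sitemap` and the example URL are made up for this page; only the `urlset`, `url` and `loc` tags and the namespace come from the protocol.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a UTF-8 encoded sitemap document for the given URLs."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             f'<urlset xmlns="{SITEMAP_NS}">']
    for url in urls:
        # escape() handles &, < and >; quotes are added explicitly
        # because the protocol asks for them to be escaped as well.
        loc = escape(url, {"'": "&apos;", '"': "&quot;"})
        lines.append(f"  <url><loc>{loc}</loc></url>")
    lines.append("</urlset>")
    # Encode the whole document as UTF-8 bytes before writing it out.
    return "\n".join(lines).encode("utf-8")

# Hypothetical URL containing a character that must be escaped.
xml_bytes = build_sitemap(["http://example.com/view?a=1&b=2"])
```

The ampersand in the query string comes out as `&amp;` inside the `loc` element.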

One question here is why we do not use these sitemaps for fedwiki pages. A related question is why there is not yet a JSON-based version of Sitemaps.

If we were to change the way we store the Fedwiki sitemap and use the standard XML format specified in the protocol, extended with the features we need for the JSON client, then we would gain the benefit of SEO and also a nice index of pages.
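One way this might look is sketched below in Python. It is not an existing implementation and makes several assumptions: that each entry of the Fedwiki sitemap carries `slug`, `title`, `date` (milliseconds since the epoch) and `synopsis`, that pages are reachable at `/<slug>.html`, and that the extra fields live in a namespace of our own. The namespace URI, the `fw` prefix and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Hypothetical namespace for the extra fields the JSON client needs.
FEDWIKI_NS = "http://example.com/schemas/fedwiki-sitemap"

def fedwiki_to_sitemap(site, entries):
    """Convert Fedwiki sitemap entries into a standard XML sitemap."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("fw", FEDWIKI_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # Assumed page address; adjust to however the site serves pages.
        loc = ET.SubElement(url, f"{{{SITEMAP_NS}}}loc")
        loc.text = f"http://{site}/{entry['slug']}.html"
        # The entry date is assumed to be in milliseconds.
        modified = datetime.fromtimestamp(entry["date"] / 1000, tz=timezone.utc)
        lastmod = ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod")
        lastmod.text = modified.strftime("%Y-%m-%dT%H:%M:%SZ")
        # Extension elements; ElementTree entity-escapes the text for us.
        ET.SubElement(url, f"{{{FEDWIKI_NS}}}title").text = entry["title"]
        ET.SubElement(url, f"{{{FEDWIKI_NS}}}synopsis").text = entry.get("synopsis", "")
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)

# Made-up entry and site name, for illustration only.
entries = [{"slug": "tom-and-jerry", "title": "Tom & Jerry",
            "date": 1400000000000, "synopsis": "A cat and a mouse."}]
sitemap_xml = fedwiki_to_sitemap("wiki.example.com", entries)
```

Search engines would read the standard `loc` and `lastmod` elements and should ignore the namespaced ones, while a wiki client could read the title and synopsis from the same file.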

Here is an online tool to create sitemaps -

We collect various counts while scraping sitemaps and report them as a text file. Here we plot the most recent counts available.

This is a rewrite of the first Ruby scraper. It saves text files useful for searching instead of whole sites in export format, and it incrementally refetches pages that have changed based on dates in the sitemaps. github
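The incremental step could work roughly as in the Python sketch below. This is not the scraper's actual code: the function name, the shape of the stored dates and the sample data are assumptions, and sitemap entries are assumed to carry a `slug` and a numeric `date`.

```python
def changed_slugs(previous_dates, sitemap):
    """Return slugs whose sitemap date is newer than the one we last saw.

    previous_dates maps slug -> date recorded at the last fetch.
    Pages we have never seen count as changed.
    """
    changed = []
    for entry in sitemap:
        last_seen = previous_dates.get(entry["slug"], 0)
        # Entries without a date are treated as unchanged here.
        if entry.get("date", 0) > last_seen:
            changed.append(entry["slug"])
    return changed

# Made-up data: one page unchanged, one updated, one new.
previous = {"welcome-visitors": 100, "sitemap-protocol": 200}
current = [
    {"slug": "welcome-visitors", "date": 100},
    {"slug": "sitemap-protocol", "date": 250},
    {"slug": "new-page", "date": 300},
]
to_fetch = changed_slugs(previous, current)
```

Only the slugs in `to_fetch` need to be fetched again; after a successful fetch the stored date for each would be updated to the sitemap's value.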

We should get sitemaps quickly and keep up with their changes. This is especially true of our own.