Clean URLs with Hakyll

HTTP uses the Content-Type header to tell web browsers what kind of content they are getting back, so a URL does not need a file extension to identify its content. If we control that header, we can use clean URLs instead of URLs with an extension. In other words, HTML pages need not have .html in their URLs.

However, by default Hakyll will always include the .html extension with the HTML pages it generates. It is actually not that hard to remove it, though, and here I describe how I did it for my personal web site.

Setup in Hakyll

Some approaches rely on the fact that web servers typically serve the index.html file in a directory when the URL for the directory is requested. However, I wanted a more direct approach that avoids including .html in the file names in the first place.

With a vanilla installation of a Hakyll site, you will see code such as the following, which sets the extension of the output file to .html.

match "about.markdown" $ do
  route $ setExtension "html"
  ...

It is easy to remove the extension instead: simply set the extension to the empty string rather than html.

match "about.markdown" $ do
  route $ setExtension ""
  ...
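
To see what this route does to a path, here is a small sketch using replaceExtension from the filepath package, which is, as far as I can tell, what setExtension applies to the path of each item. The name routeWith is invented for this sketch and is not part of Hakyll.

```haskell
import System.FilePath (replaceExtension)

-- | Hypothetical stand-in for the path rewrite done by 'setExtension'.
-- Assumption: Hakyll delegates to 'replaceExtension' from filepath.
routeWith :: String -> FilePath -> FilePath
routeWith ext path = replaceExtension path ext

-- Output paths for the default route and for the extension-less route.
defaultOutput, cleanOutput :: FilePath
defaultOutput = routeWith "html" "about.markdown"
cleanOutput = routeWith "" "about.markdown"
```

Under that assumption, defaultOutput is about.html and cleanOutput is about. Only the last extension is replaced, so a source file with several dots in its name would still produce an output name containing a dot.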

Directory URLs

If you would like to use directory URLs, i.e., URLs which end with a slash and whose content is actually contained in an index.html file, then some more work will need to be done if you have any links automatically generated by Hakyll, which will likely be the case. Hakyll will by default include the file name in the links it generates for index.html files, so we would like to remove the index.html from these links.

A common way is to generate pages as usual and to clean up the URLs afterwards. The disadvantage is that it is easy to miss a place where URLs should be cleaned up. You may also have to clean up URLs differently for different cases. For example, when I previously used this approach, I had to clean up URLs in sitemap.xml differently from those in HTML pages. I also did not realize that the usual way of cleaning up URLs does not work with Hakyll’s built-in method of generating feeds.

An alternative approach, which I now use, is to not include the index.html part in the generated links in the first place. The default context provided by Hakyll generates URLs by translating them from the route using the toUrl function. So I define another context, which I call siteContext, that generates the URL the same way, cleans it up, and overrides the "url" metadata field with the result. Because this field comes before defaultContext in the combined context, it takes precedence. I then use siteContext everywhere that I would usually use defaultContext.

siteContext :: Context String
siteContext = field "url" clean <> defaultContext
  where
    -- Clean up "index.html" from URLs.
    clean item = do
      path <- getRoute (itemIdentifier item)
      case path of
        Nothing -> noResult "no route for identifier"
        Just s -> pure . cleanupIndexUrl . toUrl $ s

The actual cleaning of index.html from URLs is done with cleanupIndexUrl, which strips index.html from local URLs.

cleanupIndexUrl :: String -> String
cleanupIndexUrl url@('/' : _)  -- only clean up local URLs
  | Nothing <- prefix = url  -- does not end with index.html
  | Just s <- prefix = s  -- clean up index.html from URL
  where
    prefix = needlePrefix "index.html" url
cleanupIndexUrl url = url
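
One thing to be aware of is that needlePrefix returns the text before the first occurrence of the needle, wherever it appears, so the version above also truncates a local URL that merely contains index.html somewhere in the middle. If that matters, a stricter variant could strip only a trailing /index.html segment. The following is a sketch using only base; the names stripIndexSegment and cleanupIndexUrlStrict are invented for illustration and are not part of Hakyll.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)

-- | Hypothetical helper: if the URL ends with "/index.html", return the URL
-- up to and including that final slash; otherwise return Nothing.
stripIndexSegment :: String -> Maybe String
stripIndexSegment url = do
  -- A suffix of the URL is a prefix of the reversed URL.
  rest <- stripPrefix (reverse "/index.html") (reverse url)
  pure (reverse rest ++ "/")

-- | Suffix-only variant of the cleanup; non-local URLs are left untouched.
cleanupIndexUrlStrict :: String -> String
cleanupIndexUrlStrict url@('/' : _) = fromMaybe url (stripIndexSegment url)
cleanupIndexUrlStrict url = url
```

With this variant, /blog/index.html becomes /blog/ and /index.html becomes /, while /myindex.html and absolute URLs such as https://example.com/index.html are returned unchanged.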

To prevent using defaultContext by mistake instead of siteContext, I use a custom HLint hint.

- warning: {lhs: defaultContext, rhs: siteContext}

By overriding the "url" metadata field this way, Hakyll will use the clean version of a directory URL in the first place, and I do not have to worry about forgetting to clean up URLs in site maps or feeds.

Setup in Apache

Using file names with no extension is all well and good, but it would be counterproductive if web browsers treated the content as plain text or a blob of binary bytes. In other words, we need the HTTP server to actually set the Content-Type to text/html for the HTML pages.

My web site is served using the Apache HTTP server on a shared host. Since I cannot change the main configuration for the server, I put the following in .htaccess:

<FilesMatch "^[^.]+$">
    ForceType text/html
</FilesMatch>

This will force the HTTP server to set the Content-Type to text/html if the file name has no extension. Obviously, this would not work as intended if I had dots in the names of files containing HTML, but this is fine for me because my file naming convention avoids such files.
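
To make the effect of the pattern concrete, here is a small sketch of the same check as a Haskell predicate. It assumes that Apache matches FilesMatch against the last component of the path; the name servedAsHtml and the sample paths are invented for illustration.

```haskell
import System.FilePath (takeFileName)

-- | Hypothetical predicate mirroring the pattern ^[^.]+$ applied to the
-- last path component: the name is non-empty and contains no dot.
servedAsHtml :: FilePath -> Bool
servedAsHtml path = not (null name) && '.' `notElem` name
  where
    name = takeFileName path
```

Under this check, about and posts/hello-world would be served as text/html, while css/site.css, .htaccess, and a page named notes/v1.2-release would not, which is exactly the dotted-name case to avoid.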

In fact, I have Hakyll generate my .htaccess file as well, so I don’t have to worry about copying or editing it separately.

See site/server/htaccess.

Custom server

There is nothing more to do if all one wants is to serve HTML pages without including the extension in the URL. However, I would like to preview my site without standing up my own Apache HTTP server.

Hakyll uses the warp HTTP server for previewing a site locally. It does not know to serve files without an extension as HTML, so I made my own customizations to warp so that it would set Content-Type to text/html for files without an extension.

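{-# LANGUAGE OverloadedStrings #-}
-- Imports assumed for this excerpt; the site rules are defined elsewhere.
import qualified Data.Text as Text
import Hakyll
import qualified Network.Wai.Application.Static as Static
import WaiAppStatic.Types (File (fileName), StaticSettings (ssGetMimeType), fromPiece)

main :: IO ()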
main = hakyllWith config rules
  where
    config = defaultConfiguration { previewSettings = serverSettings }

serverSettings :: FilePath -> Static.StaticSettings
serverSettings path = baseSettings {ssGetMimeType = getMimeType}
  where
    baseSettings = Static.defaultFileServerSettings path
    defaultGetMimeType = ssGetMimeType baseSettings

    -- Overrides MIME type for files with no extension
    -- so that HTML pages need no extension.
    getMimeType file =
      if Text.elem '.' (fromPiece $ fileName file)
        then defaultGetMimeType file
        else return "text/html"

Caveats

A caveat with the way clean URLs are implemented here is that HTML files must not have a dot in their file names. This is not a problem for me because my file naming convention avoids this.

See also