Designing good 404 pages

There is one essential thing that you have to understand about people ending up in your 404 page – they want to be on your site. Once you accept that fact, it stops to matter why they got there and you start to treat the 404 page like any other page on your site.

In any page on your site the next goal after showing quality content is to retain the surfer on your site and directing him towards pages which can produce value for you (get a sale, get a lead, sign a petition, etc…). The goal in the 404 page is exactly the same, it is just the lack of context that makes it more difficult then in other pages.

But do you really lack context? no one tries to get to a 404 page for the fun of it, and everybody which ends up there had a URL which he assumed is valid. What we need to do is parse the URL and try to extract as much information from it about where the surfer tried to get to.

How can we extract context from a URL? For example lets look at the recommended URL structure for WordPress sites which use pretty permalinks (virtual directories).

  • A Post URL has the structure of /year/month/day/title. Possible way to handle 404 which have similar structure is
    • If the title is of an existing post in different location display a link to that post
    • If there is a title but no post matches it suggest making a search (or show possible search results) using the wordpress search feature and google search limited to your site
    • If the month is invalid – suggest going to the archive page which lists post from that year
    • If the day is invalid – suggest going to the archive page which lists post from that month
  • An image or other file types URL has the structure of /wp-content/uploads/year/month/file. Possible way to handle 404 which have similar structure is
    • If the file is of an existing file in different location display a link to that file (or the image if it is an image)
    • If file can’t be found at all suggest making a search in the site’s media library (or show possible search results) using the wordpress search feature and google search limited to your site
  •  A category URL has the structure of /category/category name1/optional category name2. Possible way to handle 404 which have similar structure is
    • If category name 2 exists but not under category name 1 then display a link to it
    • If category name 2 don’t exist but category name 1 does, then display all the subcategories of category name 1
    • If niether category name1 nor category name 2 exist display a list of the categories

And so on, you probably got the drift by now. My examples are very generic, but knowing the content in your site you might come up with even better ways to seduce the random 404 visitor to stay in your site.

It is important to return HTTP 404 when there is no content at the URL, redirecting to another URL is evil

I got depressed when I look for a wordpress plugin which will let me control what is display on my site’s 404 page, as too many plugins suggest to redirect to another page automatically or based on some manual configuration.

This is so wrong, 404 (and all of the 4.xx codes) indicates a client error, and you should not pretend that there was no error otherwise the client will never learn and improve.

It is not only about following the letter of the standard, there are  practical implications to not following it:

  • A person that bookmarked (maybe the URL was copied from mail or configured in a smartphone) will get to an unexpected content without getting any indication whether it was his fault (used wrong URL) or the content had disappeared from the site.
  • Owners of sites that link to your content (an HTML link or reference to a img,video or other resource on your site) will never get an indication from link checkers that their links are bad and will never fix it to point to the right content.
    The worst scenario is that you, or someone affiliated with you, owning that site, and it might even be a bad kink on the same site.
  • If you use javascript based analytics software (like google analytics) they will fail to report 404 pages as the scripts will never be loaded for them
  • It effectively creates several valid URL for leading to the same content which might be good or not depending on the content in the source of the URLs and the latest google ranking algorithm

Yes, 404 pages should be more then some pretty and amusing page, but whatever it displays to the user, it has to keep sending the content with a 404 code.

whatsapp? not much

A friend of mine is going abroad and because he uses blackberry he has to have a data plan to BBM even when he can connect via wifi, and there is no skype for BB for the network he uses. This sent him looking for another communication app and he found whatsapp and I decided to pay the 1$ to get it on my IPhone.

Right now from my limited usage I am disappointed as I assumed the hype had a merit but I don’t see a reason to use it when I don’t have too.

Nice touch to the wordpress iphone app, it opens the admin at the right page if it detects that XML-RPC is disabled. Which makes me wonder why it doesn’t go all the way and submit the changed setting, or at least highlights/explains what shoud be changed.

You can’t disable pingbacks in wordpress :(

Yeh, the best that you can do is not to display them. All the site wide and per post options control whether the pingback/trackback will be stored in the DB, but the piece of code handling pingbacks from reception till checking if to put them in DB will always be executed.

This is probably not a big thing as I don’t remember exploits utilizing pingbacks, but it annoys me esthetically that an option  described “Allow link notifications from other blogs (pingbacks and trackbacks.) ” doesn’t do what can be interpreted from the description (what it does is to set the default of the ping option on the post edit page).

The decision to enable xml-rpc remote publishing support by defualt in wordpress 3.5 is good, but the execution is lacking

xml-rpc protocol is basically used to expose a set of API implemented by a site/server which enables other software to interact with the site/server in a way which is easier to program then trying to mimic user interaction. WordPress currently (version 3.5) expose API for publishing pingbacks, and content. Software like windows live writer by Microsoft uses the content publishing API supplied by WordPress to create a non browser editing environment, but the main users of the protocol right now are smartphones because the small screen size makes the WordPress web interface almost unusable.

Upto WordPress version 3.4 the remote publishing by XML-RPC was disabled by default, and the text explaining the option in the admin was technical and said nothing about smartphones.

With the rise of smartphone use, and the number of smartphone apps that use XML-RPC to publish content to wordpress, it is only a logical move to enable XML-RPC by default, but the development moto of “decisions not options” was taken too far as in this case the option has enough importance to justify having it.

The reasons are mainly security related

  1. It doesn’t matter how robust is WordPress code there is always a chance of a security bug that might relate only to the XML-RPC code
  2. Plugin authors will probably start supporting XML-RPC opening more attack vectors, and user will not even know about it because it will not have any GUI indication, and you will not know that unless you read the plugins documentation.
  3. There is no knowledge base on how to defend against brute force/dictionary attacks from XML-RPC. Current plugins might work,but will they give you a notice like “You failed to login 3 times, please wait 5 minutes till the next attempt” on the XML-RPC layer, and how the app will display that notice?

It might be that core developers are right and there is no risk added by having XML-RPC on all the time, but I think that a more conservative two step approach like

  1. Make it default to on, leave the option in the admin
  2. In two releases look at the experience of running WordPress that way and decided whether to eliminate the option as well

The reason it should work that way is that most user just leave the default setting on, so there will be a big enough user base to field test the feature even when the option to turn it off exist.

WordPress settings API is PITA

The WordPress settings API is there to “help authors manage their plugin custom options“, but does it? Many lengthy tutorials pointed to from the codex hints that the answer is probably “not really”. To quote Olly Benson from yet another settings API tutorial/example (emphasize mine)

WordPress does make a habit of creating mountains out of molehills sometimes, and the Settings API seems to be a fantastic example of this.  Trying to follow the instructions on the initial page got me hopelessly lost, and it was only when I went through Otto’s WordPress Settings API tutorial that I begin to understand how to implement it.

And the problem is not with the codex, the problem is in the structure of the API itself. Instead of having a simple fast and dirty code like

add_action('admin_init','my_init');

function my_init() {
add_page('title','title',10,'my_options_page');
}

function my_options_page() {
if (isset($_POST['my_value'])) {
validate_value();
update_option('my_option',$_POST['my_value']);
}
<form>
Enter value <input type="text" name="my_value" value="<?echo get_option('my_option')?>
</form>
}

Where all the logic of handling the change of values is placed in the same place as the presentation, the my_options_page function, which makes it much easier to understand and debug.

The settings API basically moves your options handling away from your presentation code. To use it you need to call at least 3 initialization functions to which you have to supply 3 callback functions, and all the handling is done in a “black box” that doesn’t give you any hint for misconfiguration and it is hard to debug.

When trying to use the API I end spending more time to make myself feel good about following coding best practices then needed to code the same functionality in an equally accessible and secure way.

The time has come to clean the net from the prehistoric junk called breadcrums

One guy on the wordpress stackexchange asked how have to have breadcrumbs that show a different “path” the one he currently gets, and this reminded me that I hate breadcrumbs.

Breadcrumbs where created by Jacob Nielsen, and he describes the motivation in the article “Breadcrumb Navigation Increasingly Useful“. I was using the net on 2007 and before so I totally agree that at that point in time adding breadcrumbs to sites had improved their usability. But times had changed and we understand better how users use the internet the the importance of “in site” navigation system, and a catch all default like breadcrumbs have no place anymore on the net.

The problem with breadcrumbs is that they are based on the assumption that content is hierarchical and there is an hierarchical path from the home page to any page in the site. There are several problems with that assumption:

  • A lot of the content on the internet is not hierarchical. This post is categorized as “Rant”, but it is mainly because I don’t like to have an “uncategorized” category. Many blogs and small sites owners categorize because the CMS tools give them (and some times force them) the option to categorize and the end result is that there are totally unrelated content being categorized at the same way.
    Take a look at Jacob Neilsen article itself, what is its breadcrumbs? “useit.com > Alertbox > April 2007 Breadcrumbs “. Right now the alertbox contains 457 links to artcles. If I want to find more information on an article I read there, it is better to leave the site and go for google to make a search then trying to guess which of the links on that page might actually contain the information I’m looking for.
  • For many pages (like in the question) there might be multiple hierarchical paths so which one should you use?
    For this post there is the path of the category “rant”, the tag “Breadcrumbs”, and the date it was published “2012” > “12” > “12”. For personal blog the date hierarchy is the most appropriate, but for technical hierarchy by categories makes more sense.
  • Site owners don’t want users to navigate by hierarchy, they prefer to direct user traffic to a more profitable pages, or pages that they think better serve your needs instead of overwhelming the user with information at an higher hierarchy level.
  • Most sites that care recognized that navigating by hierarchy is frustrating to the user as it requires at least 2 clicks to reach a destination (one up and one down) and improved their in-site search so the user will get to its destination in at most 2 clicks.

In 1995, when the idea of breadcrumbs was created, web sites followed the structure of file systems. The breadcrumbs solution is actually used in the windows explorer in windows 7 but it is just not good enough because yes it takes me 5 levels up in the hierarchy if I need to in one click, but it doesn’t help me to get down the hierarchy to where I need.
But modern OSs are trying to get away from the hierarchical organization of data, Iphone hides the file system from the users, and other OSs trying to analyze user behavior and provide built in search facilities. Sites, being more flexible should lead this trend instead of stick with the old ways.

But wait, what about SEO, isn’t breadcrumbs good for SEO? Maybe it was true in 2007 before search engines standardized on using site maps. Without site maps breadcrumbs where a relatively easy way to provide inter site connectivity that was required to help search engines discover all of the content in the site. Today if your site doesn’t have a site map you either don’t care much about SEO, or you are sure that your site is inter connected.

The only SEO related reason to use breadcrumbs today is that google might use them instead of the full URL in the search results when the URL is very long. This for sure improves the usability of the google results, but does it add value to the site owner? Even if it does, I would add bread crumbs either as RDF (ugly) or use the microdata way but just hide it with CSS. This way google has the data, and you have more free screen space on your site.

The possible impact of changing wordpress (and php) max memory settings on site performance

In the  last several days there where several questions in wordpress answers on stackexchange related to out of memory errors. This was mostly related to some plugin which required more memory to function, and people asked what is the way to change/overcome the default memory limit of a PHP process.

My impression from the questions and answers was that people fail to understand why there is a limit at all and treat the limit as some bizarre PHP thing that you need to overcome instead of trying to understand it. There is even a plugin “Change memory limit” that its description says

Update the WordPress default memory limit. Never run into the dreaded “allowed memory size of 33554432 bytes exhausted” error again!

To understand why there is a limit you need to understand the most hidden secrets of linux and windows that will surprise most developers – After an application had allocated memory from the OS it can not free it back.Yes, when a program call the free() function, an object destructor or any other dealoocation method, the memory is returned to the free memory pool of the application from which it might allocate its next memory, but it will never be returned to the OS as long as the software is running*.

Since software doesn’t really deallocate, a server software, that is supposed to run all the time, once reached its pick memory usage will stay there.This has to be taken into account when you want to ensure specific performance with the way apache works.

Appache in prefork mode basically run itself several times, where each instance can handle one request. If no instances are free to handle the request, the request has to wait in a queue. The maximal number of concurrent requests the server can process is the number of instances we can run at the same time. Assuming we don’t do any heavily CPU bound process, our limitation is the memory that can be allocated to each instance.

And how can we calculate the amount of memory an apache needs? The naive approach is to try and use average memory consumption, but once a software passed its “average” allocation, the memory will not be released. potentially an apache instance running one memory hungry process can take control over all the available memory leaving no memory available for the other instances which will probably lead to them failing in handling request. you might think that you configured your server to enable 10 request to be handled but 9 of them fail.

It is important to understand that once the memory was allocated it is of no importance that the instance never need again all of that memory and handles only small request. The memory is attached to the instance forever.

And this is why the memory  limit exists, to protect the whole server from one faulty piece of code. If you set the limit to 128KB then you can be assured that atleast the rest of the memory is available to the other instances.

So basically the number of apache instances we can run safely without the fear of the server suddenly breaking down for no apparent reason, is (amount of memory available on the server) / (max memory limit). The higher the limit the less requests your server can process in the same time which potentially leads to less responsive server.

Apache actually can be configure to kill instances after serving a certain amount of requests and by that actually free memory. This will improve the server performance on average but it also has a cost, the cost of running a new instance. You should probably always plan for the worst case scenario and experiment very carefully with relaxing the memory restriction.

Prefork is not the only way to configure apache to run, and there are also the worker and event configurations, but they require that the PHP library you use will be thread safe. Some people claim it actually works for them but the PHP developers don’t recommend running that way.

And then, if you use fastcgi to execute php instead of mod_php, you basically change it from being an apache problem to fastcgi problem which might actually be better since while fastcgi might hurt the performance of pages generated with PHP, apache itself will be able to serve static files.

* Mainly because memory from the OS is being allocated in big chunks and it is very likely that when you dynamically allocate and free memory from that chunk some allocated “live”  memory will be in every chunk.