Building an open source clone of search.cpan.org using the MetaCPAN API


This is an attempt to recreate the look-and-feel of search.cpan.org using the MetaCPAN API as a back-end. This is the same back-end that powers the fancier MetaCPAN.

The result can be found on sco.perlmaven.com.

The source code can be found on GitHub.

YouTube playlist

Objectives

This project has a number of objectives:

Some people complain that MetaCPAN.org is too fancy for them and they prefer the look-and-feel of search.cpan.org. On the other hand some of them also complain about bugs and missing features in search.cpan.org. This project can provide them with an alternative.
It can provide feedback to the MetaCPAN authors about features that might still be missing in MetaCPAN or in the MetaCPAN API.
This can be an interesting exercise on rebuilding an existing service when you don't have access to its source code, or when you cannot read and understand the source code. (The former being a special case of the latter.)
This can be an interesting project to follow and explain with articles and screencasts. It is like many other rewrite projects where you already have a working website and the database behind it, but for some reason you cannot read the source code of the application, either because it is not available to you or because it is so unreadable that analyzing it would take a lot of time.
A nice exercise in writing tests for an existing project.
The code behind this project will be open source, hosted on GitHub.
The project should be an almost exact replica of search.cpan.org. The places where it can differ might be minor bug-fixes and maybe some additional configurable flexibility.
The code of the project should be extensible so people who might want to create an "improved version of search.cpan.org" can use this as a base without modifying the source code.
Like the project itself, this document may evolve as well. The source code can be found on GitHub.

A couple of use-cases I can think of:

If the people running search.cpan.org think this is a good replica of the UI and the service, we might configure this installation to be one of the search.cpan.org servers. If that works, then maybe all the servers can be replaced with this code using the MetaCPAN API.
Alternatively, we might be able to add some code to MetaCPAN.org that will allow people to opt in to this interface. In that case MetaCPAN could automatically show this interface to the people who want it.
Having an open source version of a search.cpan.org clone will allow people to fix bugs they encounter in search.cpan.org and then use this improved version, even if only on their own computer.

Existing UI and features

The very first thing to do was to create a rough list of pages by type, and then try to evaluate the complexity of each page. This list can be later updated and can be used as a sort-of specification.

The home page - mostly static with some numbers at the bottom left corner. Later it turned out that these numbers appear on every page.
Authors main index - static. Just a list of the letters of the alphabet with links to the same URL with ?X at the end (where X is the specific letter).
Authors starting with a single letter - dynamic (PAUSE ID and full name). Could be cached for quite a long time. It could also be updated based on the "recent" pages of MetaCPAN. (Is there a way to fetch the recently registered PAUSE authors from MetaCPAN?) The ultimate source of this list is probably the 00whois.xml file generated by PAUSE.
The selected letter is displayed in red.
If the user provides more than a single letter (e.g. ABC), sco still only takes the first letter into account. If we supply a lower-case letter, sco shows the same data as if we had provided the upper-case letter. (IMHO both of these cases should redirect to the single-letter URL. Later the ?A could also be replaced by a fixed URL such as /author/A.) If we supply an invalid character (e.g. ?1), it just shows the list of letters.
Individual author page - e.g. that of AADLER. Dynamic. Some generic information about the specific author (name, PAUSE ID, email, home page, avatar). A list of distributions released by the author (distribution, abstract, date).
This list contains the latest of each distribution that was released by the author that is still on CPAN. Specifically: if there is a more recent version released by someone else the older version will still show up as long as it is still on CPAN. If it was removed and can only be found on BackPan, then it will disappear from this list as well.
This list also includes distributions that are "unauthorized", where the author did not have the co-maintainer bit when the distribution was uploaded.
There are certain authors who have not uploaded any distribution, and some authors who have set their e-mail to be "not visible". For example Quinn Murphy. This information is taken from the 00whois.xml file, and MetaCPAN won't provide it to us.
The homepage value is also taken from the 00whois.xml file, while MetaCPAN supplies a different value that PAUSE users can set in their MetaCPAN account. There is now an explanation about the relationship between some information on MetaCPAN and what PAUSE supplies.
If the user has ever uploaded anything, then there is going to be a CPAN directory where the files currently on CPAN are listed. There is also a link to the "Archive" that leads to the directory of this user on BackPAN, where all the files ever uploaded by this user can be found, even files the user has already deleted from CPAN.
Links to the CPAN Testers web site.
Avatar of the author.
List of Releases (Distributions) each one linking to the "home page" of that distribution.
Distribution - specific version (e.g. CGI-Simple-1.113). Shows various meta data of the distribution, including the list of modules included in it. If there is a newer version on CPAN, a link to that newer version is also displayed. Big red "UNAUTHORIZED RELEASE" text if the author did not have the right to release one or more of the modules in this distribution. For example Text-MediawikiFormat-1.01.
A permalink to a page that always shows the latest authorized release of a distribution. Module names link to the modules in the specific version of the distribution. For example MediawikiFormat.pm.
Distribution - canonical link (Text-MediawikiFormat) always showing the latest authorized release. It looks exactly the same as the page of a specific version. The module names link to the canonical pages of the modules.
Module - specific version: http://search.cpan.org/~szabgab/Text-MediawikiFormat-1.01/lib/Text/MediawikiFormat.pm The POD in simple HTML. Numbers from RT: the number of 'New' and 'Open' requests. A syntax highlighting selector at the bottom right. A permalink.
Module - canonical link (permalink) http://search.cpan.org/dist/CGI-Simple/lib/CGI/Simple.pm
Recent - List of recently uploaded distributions (distribution-version abstract). Showing the releases of the last week, grouped by day, sorted by date. There are two arrows at the top of the page. One leads to the week before, the other leads to the next week. The date in the URL can be manually changed and then sco will show the week ending on that day.
Mirrors - list of available CPAN mirrors. The user can select a mirror, and then the 'download' links will point to that mirror instead of cpan.org.
FAQ - Frequently Asked Questions about search.cpan.org (static).
Feedback - (static) - how to send feedback to the search.cpan.org developers, and where to ask questions about CPAN modules.
Search - every page has a search box, and a selector (All/Modules/Distributions/Authors). As far as I can tell each one will restrict the search to substrings in the names of the Modules/Distributions/Authors, and 'All' will somehow combine the 3 result sets and even search elsewhere, but I don't understand this perfectly. For example searching for sz among authors shows only "Arpad Szasz" and does not show "Gabor Szabo", and searching for sza among authors does not return anything. On the other hand, sz in all also returns VTKCommon where I think sz only appears in the text.
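
Two of the behaviours listed above are precise enough to write down as code, which also makes them a convenient starting point for tests: how the author index treats the letter in the URL, and which days the "recent" page covers. The following is only a minimal sketch in Python, used here for illustration and not taken from the source of this project. The function names normalize_author_letter and recent_week are hypothetical, and listing the newest day first is an assumption.

```python
import string
from datetime import date, timedelta


def normalize_author_letter(query):
    """Map the text after '?' to the letter whose authors should be listed.

    Only the first character counts, lower case is treated as upper case,
    and anything that is not an ASCII letter yields None, meaning
    "just show the list of letters".
    """
    if not query:
        return None
    first = query[0]
    if first in string.ascii_letters:
        return first.upper()
    return None


def recent_week(end_day):
    """Describe the week that ends on end_day.

    Returns the seven days to display (newest first, which is an
    assumption) and the end days the two arrows would link to.
    """
    days = [end_day - timedelta(days=n) for n in range(7)]
    return {
        "days": days,
        "previous": end_day - timedelta(days=7),
        "next": end_day + timedelta(days=7),
    }
```

For example, normalize_author_letter("abc") gives "A", while normalize_author_letter("1") gives None. A clone that wants to redirect multi-letter or lower-case requests to the single-letter URL could compare the normalized letter with the original query and redirect whenever they differ.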

Notes

While having multiple interfaces to the data on CPAN can be a good idea, letting search engines get confused about which URL is the canonical one is probably not. Hence we are going to set the robots.txt to disallow every well-behaved user-agent.
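
What such a robots.txt could look like, and a quick way to check that it really blocks everything, is sketched below. The two-line rule set is the usual way to disallow all paths for all user-agents. The helper name is_allowed and the example paths are only illustrative, and the file actually deployed on the site may differ.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay away from every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

With these rules is_allowed returns False for any user-agent and any path, so a well-behaved crawler would index neither the home page nor the individual author, distribution, and module pages.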

Follow the development

To follow the development of the project, you can look at the GitHub repository, but if you'd also like to get a detailed explanation of each step, then check out the following articles and screencasts:

Written by
Gabor Szabo

Published on 2014-10-22

If you have any comments or questions, feel free to post them on the source of this page in GitHub.