Wikos.org: an experiment in progress
Wikos.org will be the public face of an ongoing R&D project seeking to apply Peter Ossorio's Judgment Simulation technology to the problem of Internet-scale conceptual information retrieval. But the Internet is really big, so we are starting with Wikipedia. And Wikipedia is big, so we are starting with subsets of Wikipedia, which we extract from the Freebase.com WEX downloads.
Judgment simulation uses numerical models to replicate human judgments. The typical model represents concepts as dimensions of a vector space. For instance, color can be represented by the dimensions red, blue, and green. In such a color space any particular color can be classified by its red, blue, and green coordinates, and colors can be compared by how close they are in the space. For information retrieval purposes the dimensions represent the concepts of interest, and the space is populated with the words, phrases, and texts to be retrieved.
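As a rough illustration of the color example, the sketch below places a few colors in a three-dimensional space and finds the nearest neighbor of one of them. The color names, the coordinate values, and the choice of Euclidean distance as the measure of closeness are all assumptions made for illustration.

```python
import math

# Each color is a point whose coordinates are its red, blue, and green
# components (same order as in the text), on an assumed 0.0 to 1.0 scale.
colors = {
    "orange": (1.0, 0.0, 0.5),
    "yellow": (1.0, 0.0, 1.0),
    "navy":   (0.0, 0.5, 0.0),
}

def distance(a, b):
    """Euclidean distance between two points in the color space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(name):
    """Return the name of the other color closest to the named color."""
    target = colors[name]
    others = [n for n in colors if n != name]
    return min(others, key=lambda n: distance(target, colors[n]))

print(nearest("orange"))
```

Here orange lands closer to yellow than to navy, which matches the comparison most people would make by eye.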
In his 1960s work, Ossorio asked human judges to identify and rate relevant words and phrases against various subject matters. These ratings served as coordinates to place documents in the space based on their words and phrases. Given this subject-matter space, information retrieval is simply a matter of placing a query in the space and retrieving the documents in order of closeness to the query, thus replicating human judgments of subject-matter relevance. To this day Ossorio's relevance-ranking results stand unbeaten in the literature.
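The retrieval step can be sketched in a few lines. This is only a toy under stated assumptions: the subjects, the ratings, and the documents are invented, a text is placed by averaging the ratings of its rated terms, and closeness is taken to be Euclidean distance. None of these choices is claimed to be Ossorio's actual procedure.

```python
import math

SUBJECTS = ("astronomy", "cooking")

# Hypothetical judge ratings: relevance of each term to each subject.
RATINGS = {
    "telescope": (8.0, 0.0),
    "planet":    (7.0, 0.0),
    "star":      (6.0, 1.0),
    "oven":      (0.0, 8.0),
    "recipe":    (0.0, 7.0),
}

def place(text):
    """Place a text in the space by averaging the ratings of its rated terms."""
    vectors = [RATINGS[w] for w in text.lower().split() if w in RATINGS]
    if not vectors:
        return tuple(0.0 for _ in SUBJECTS)
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))

def distance(a, b):
    """Euclidean distance between two points in the subject-matter space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank(query, documents):
    """Return document names ordered from closest to farthest from the query."""
    q = place(query)
    return sorted(documents, key=lambda name: distance(q, place(documents[name])))

documents = {
    "doc_kitchen": "heat the oven and follow the recipe",
    "doc_mixed":   "a star recipe",
    "doc_sky":     "a telescope shows each planet and star",
}

print(rank("planet telescope", documents))
```

The query about planets and telescopes retrieves the astronomy document first, the mixed document second, and the cooking document last.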
Collecting enough ratings for a Wikipedia-scale space would be expensive, so for this demo we are working with the existing judgments of the Wikipedia editors in an attempt to reverse-engineer the ratings we would prefer to have. We are starting with the classifications of Wikipedia articles into categories, from which we statistically identify and rate relevant words and phrases to build a category space.
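The statistic used to derive ratings is not specified here. Purely as an illustration of the idea, the sketch below assumes one simple choice: rate a term against a category by how much more often it appears in that category's articles than in the corpus as a whole. The toy articles, the function name term_ratings, and the scoring formula are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy corpus: article title -> (categories, text).
ARTICLES = {
    "Mars":     ({"Planets"}, "mars is a red planet with two moons"),
    "Venus":    ({"Planets"}, "venus is a hot planet with thick clouds"),
    "Baguette": ({"Breads"},  "a baguette is a long bread with a crisp crust"),
    "Naan":     ({"Breads"},  "naan is a soft bread baked in a hot oven"),
}

def term_ratings(articles):
    """Rate terms against categories (assumed statistic, for illustration).

    rating = share of the category's articles containing the term
             minus share of all articles containing the term.
    Only positive ratings are kept.
    """
    total = len(articles)
    overall = defaultdict(int)                      # term -> article count
    in_cat = defaultdict(lambda: defaultdict(int))  # category -> term -> count
    cat_size = defaultdict(int)                     # category -> article count
    for cats, text in articles.values():
        terms = set(text.split())
        for term in terms:
            overall[term] += 1
        for cat in cats:
            cat_size[cat] += 1
            for term in terms:
                in_cat[cat][term] += 1
    ratings = {}
    for cat, counts in in_cat.items():
        for term, n in counts.items():
            score = n / cat_size[cat] - overall[term] / total
            if score > 0:
                ratings[(term, cat)] = score
    return ratings

ratings = term_ratings(ARTICLES)
print(ratings[("planet", "Planets")], ratings[("bread", "Breads")])
```

Words that appear everywhere, such as "is", get no rating under this scheme, while words concentrated in one category are rated against it. A realistic version would need a much larger corpus and a more careful statistic.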
When you submit a query, we parse out words and phrases, strip suffixes with the standard Porter stemmer, map the stemmed terms into the category space, and retrieve the categories and articles that are closest to your query.
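The query steps can be sketched as follows. This is not the project's C++ code: the category space is a hypothetical toy, crude_stem is a deliberately simplified stand-in that is not the Porter algorithm, and closeness is approximated by summing the ratings of the recognized terms per category.

```python
import re

# Hypothetical category space: stemmed term or phrase -> rating per category.
CATEGORY_SPACE = {
    "planet":    {"Planets": 0.9, "Astronomy": 0.6},
    "orbit":     {"Planets": 0.5, "Astronomy": 0.7},
    "red giant": {"Stars": 0.9, "Astronomy": 0.5},
    "bread":     {"Breads": 0.9},
}

def crude_stem(word):
    """Stand-in for the Porter stemmer: strips a few common suffixes only."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def recognize(query):
    """Return the stemmed words and two-word phrases found in the space."""
    stems = [crude_stem(w) for w in re.findall(r"[a-z]+", query.lower())]
    found, i = [], 0
    while i < len(stems):
        phrase = " ".join(stems[i:i + 2])
        if i + 1 < len(stems) and phrase in CATEGORY_SPACE:
            found.append(phrase)   # prefer a known phrase over single words
            i += 2
        elif stems[i] in CATEGORY_SPACE:
            found.append(stems[i])
            i += 1
        else:
            i += 1                 # unrecognized word: skip it
    return found

def score_categories(query):
    """Sum the ratings of recognized terms per category, best first."""
    scores = {}
    for term in recognize(query):
        for category, rating in CATEGORY_SPACE[term].items():
            scores[category] = scores.get(category, 0.0) + rating
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

query = "Planets orbiting red giants"
print(recognize(query))
print([category for category, _ in score_categories(query)])
```

For this query the recognized terms are "planet", "orbit", and the phrase "red giant", and the categories come back in the order Astronomy, Planets, Stars.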
Query results are currently organized as follows:
the recognized words and phrases in the query
how much a category relates to the query
how much an article relates to the query
how much a term or category contributes to the article score
Wikos is an open-source project. We are not yet ready to provide SVN access, but some of our prototype source code is available for browsing. Please note that this code is not offered as a sterling example of C++ style or optimal design, or even promised to build: it was hacked together quickly under extreme pressure. All of this code remains copyright Gregory Colvin. We intend to eventually release our code under a suitable open license; please get in touch if you don't want to wait.
In 2008 we had a business plan to commercialize this technology via a modestly capitalized startup venture. This plan is on hold as we await a more favorable economic climate, and will likely need substantial rework when the time comes. We are pursuing a National Science Foundation SBIR grant as a more appropriate funding source than venture capital at this stage of our R&D.