Last week, I read an article about Latent Semantic Indexing (javelina.cet.middlebury.edu/lsa/out/lsa_intro.htm)
which has been stewing in my mind ever since. I thought I’d sit down
and make a link to it here, and as I started the search for the article I
realized that I should have bookmarked it. So I thought, "Maybe I
did bookmark it, but just forgot." I looked in my bookmarks
list, and there were five or six unrecognizable links there. So, I headed
back to Google and restarted the longer-than-it-needed-to-be process of
relocating the article.
I don’t know why I’ve never taught myself to use bookmarks
effectively. If I remember correctly, they’ve been around since the
first version of Mosaic that I used several years ago. It’s probably
largely a result of laziness, though I don’t think that’s the
whole story. I think it’s mostly driven by the way I use the web.
It’s very rare that I load up my browser with the specific intent of
visiting one page on the internet. My typical information-addicted browsing
scenario (I optimistically and euphemistically refer to it as "checking
my email") starts with scanning the messages from the various mailing
lists I’m subscribed to. Usually one or more messages will catch my
eye. Some will contain URLs that I will follow. Some contain a subject I
want to follow up on, which leads me to a search engine. Some just get me
thinking about something that may or may not even be related.
More often than not, the initial scan of my inbox leads to the spawning of
multiple unrelated threads. For example, I might have several browser
windows dedicated to a linguistics topic, a couple loading up BusinessWeek
and Forbes, and a third set pointing to something related to Ruby
programming. And, from here, the same behaviour that got me started comes
into play. Each new page has the potential of forking off several new
threads on potentially unrelated topics. It’s a potentially infinite
If I tried to bookmark all this stuff, I would spend as much time
bookmarking and organizing as I do reading the content. Interestingly, the
topic of this post, which has led me into this explanation of why I
don’t use bookmarks, is itself a potential solution to the problem.
Latent Semantic Indexing (LSI) is a technique used for indexing and
relating information based on its semantic "closeness". I’m
certainly not an expert, having only read a few articles on it now, but
here’s a probably-bad explanation of how it works: For each document
in a set, it generates a vector showing the document’s inclusion or
non-inclusion of the words in a master word list. It then uses a technique
called Singular Value Decomposition (kwon3d.com/theory/jkinem/svd.html)
to compress the many relationships in this "term/document grid"
into fewer relationships—basically converging some of the things that
were different into sameness. At this point, you can calculate any
vector’s closeness to another (as in physical closeness, using math
I never learned in school), giving you an indication of how
semantically close the two are.
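As a rough illustration of the first and last steps described above (building a word-inclusion vector per document and measuring how close two vectors are), here is a minimal Python sketch. It skips the Singular Value Decomposition step entirely, so it is plain vector matching rather than real LSI. The sample documents and the function names are made up for the example, and cosine similarity is assumed as the closeness measure; the post itself does not name one.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Return the sorted master word list for a collection of documents."""
    return sorted({word for doc in documents for word in tokenize(doc)})

def to_vector(text, vocabulary):
    """Mark each vocabulary word as present (1.0) or absent (0.0) in the text."""
    words = set(tokenize(text))
    return [1.0 if term in words else 0.0 for term in vocabulary]

def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical sample documents, invented for this sketch.
documents = [
    "ruby programming with objects and classes",
    "objects and classes in ruby",
    "business news from forbes",
]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(doc, vocabulary) for doc in documents]

for i in range(1, len(documents)):
    print(documents[i], "->", round(cosine(vectors[0], vectors[i]), 2))
```

The two Ruby documents share most of their words and score high, while the business document shares none and scores zero. That zero is exactly the weakness of pure word matching that the decomposition step is meant to soften.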
For example, imagine you have a collection of documents about computer
programming topics. As a human, you could read an article about object
oriented programming, another about UML, and another about Martin
Fowler’s book "UML Distilled". Even if the last article
didn’t specifically contain the text "Object Oriented
Programming", you would be able to make the mental connection between
Martin Fowler and object oriented programming, because you know that UML is
related to OOP and that Martin Fowler is a UML guru. With a typical,
keyword-based search engine, the last article I mentioned wouldn’t
appear in a set of search results for "object oriented
programming", because it didn’t contain the text you were
searching for. A search engine using Latent Semantic Indexing would be
able to determine that Martin Fowler is semantically close to object
oriented programming and would still return the UML article as a match for
"object oriented programming".
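The Martin Fowler example can be played out on a toy scale. The Python sketch below is only an illustration under several assumptions: the five-term vocabulary and three documents are invented, the leading singular directions are approximated with simple power iteration on the term/term matrix (a stand-in for a proper SVD routine from a linear algebra library), and queries and documents are compared after projecting both onto those directions, which is one of several scaling conventions in use.

```python
import math

# Hypothetical toy data: a tiny master word list and three documents.
TERMS = ["oop", "uml", "fowler", "cooking", "recipe"]
DOCS = {
    "oop_and_uml": {"oop", "uml"},
    "uml_distilled": {"uml", "fowler"},  # never mentions "oop"
    "dinner": {"cooking", "recipe"},
}

def term_vector(words):
    """Inclusion (1.0) or non-inclusion (0.0) of each term in TERMS."""
    return [1.0 if t in words else 0.0 for t in TERMS]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm > 1e-12 else 0.0

def top_term_directions(doc_vectors, k, iterations=200):
    """Approximate the k leading left singular vectors of the term/document
    matrix A by power iteration with deflation on A * A^T."""
    n = len(TERMS)
    # gram[i][j] sums doc[i] * doc[j] over all documents, i.e. A * A^T.
    gram = [[sum(d[i] * d[j] for d in doc_vectors) for j in range(n)]
            for i in range(n)]
    directions = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # fixed, asymmetric start vector
        for _ in range(iterations):
            w = [dot(row, v) for row in gram]
            length = math.sqrt(dot(w, w))
            v = [x / length for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in gram])
        directions.append(v)
        # Deflate: subtract the direction just found so the next pass
        # converges to the next-largest one.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return directions

def project(vector, directions):
    """Coordinates of a term vector in the reduced space."""
    return [dot(vector, d) for d in directions]

doc_vectors = {name: term_vector(words) for name, words in DOCS.items()}
directions = top_term_directions(list(doc_vectors.values()), k=2)

query = term_vector({"oop"})
raw_similarity = cosine(query, doc_vectors["uml_distilled"])
reduced_similarity = cosine(project(query, directions),
                            project(doc_vectors["uml_distilled"], directions))
unrelated_similarity = cosine(project(query, directions),
                              project(doc_vectors["dinner"], directions))

print("keyword match:", raw_similarity)
print("reduced space:", round(reduced_similarity, 3))
print("unrelated doc:", round(unrelated_similarity, 3))
```

On the raw vectors, the "oop" query and the "UML Distilled" document share no terms, so their similarity is zero. After reduction to two dimensions they land on the same direction, because "uml" co-occurs with both "oop" and "fowler". With a vocabulary this small the effect is exaggerated; everything in the programming cluster collapses onto a single axis, which a realistic collection would not do so cleanly.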
So what does this have to do with bookmarks? I was lucky in that one of my
multiple browsing threads on the day I came across LSI led me to Agent
Frank.
Agent Frank is a personal web proxy. You run it on your PC, and it tracks,
archives, and indexes the sites you visit. It can also do things that
I’m not as excited about, like blocking banner ads. The really
exciting thing here is that it is a step toward smart bookmarks. If Agent
Frank can track everything I look at, I don’t have to worry about
remembering which sites I’ve seen. Now, imagine coupling this
capability with LSI. A personal web proxy that archives the content you
visit and then indexes it with LSI could remind you of related
pages you’ve seen as you browse. It could, of course, as Agent Frank
already does, include a search engine so that you could find pages you had
seen before—but without the hassle of having to remember keywords
that appeared in the pages. It could also use the semantic closeness of the
documents you view to build an index of what you are interested in. After a
browsing session, it could present you with an intelligently constructed
list of the categories of the information you have been browsing.
I really like the idea of reading a new page, thinking "This page
reminds me of something…", and clicking a browser button
labeled "Recall Similar".
I’ve been playing around with my own implementation of almost-LSI,
inspired by a nice tutorial (www.perl.com/pub/a/2003/02/19/engine.html)
by Maciej Ceglowski (www.idlewords.com). My version is, as
you might guess, in Ruby. It also integrates with my weblog software (www.sourceforge.net/projects/rublog).
As is the case with many of the Ruby projects I’ve worked on, I was
surprised to discover that shortly after I started working on it, I had a
functional piece of software that could calculate semantic sameness given a
search query or an existing document. If I can find the time to clean it up
over the next week, you might find a "What’s Related" link
for each story on my website.
Hopefully, though I’m still learning this stuff, it will be a little
less braindead than Google’s "What’s Related" (www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=related%3Awww.chadfowler.com&btnG=Google+Search),
which returns a mix of pages that I have linked to, that have linked to me,
or that seem to have no logical relationship whatsoever. There are at least a
couple that I’d really not like to think of as "Similar",
as Google labels them.
I also started today on my own Agent Frank replacement, with built-in LSI.
I’ve got a stupid-simple HTTP proxy working and archiving everything
I view. If all goes well and I don’t lose focus (AttentionSpanChallenged),
I might have something basic working in the next week or so. Exciting
stuff. I’m glad I don’t do this for a living…it might not
be as fun.