Latent Semantic Indexing

What is Latent Semantic Indexing or LSI?

Latent Semantic Indexing, or LSI, changed the world of search engine optimization. One fine morning, SEO experts found that many of their best-ranking sites on Google were in jeopardy: Google had simply updated its crawler to accommodate LSI and moved towards more relevant result rankings!

LSI is a methodology involving statistical probability and correlation that helps deduce the semantic distance between words. The underlying mathematics is complex, but it can readily be applied to understand the relationships between the words in a paragraph or a document. Search engines use this methodology when indexing a page in their databases.

Delving deeper, LSI is concerned not only with scanning a single document for keywords and listing it in the database, but also with studying a collection of documents and identifying the words they have in common. From this it can draw conclusions about the semantic relations between the words used in those documents. The process then finds which other documents use these semantically close words, and those documents are indexed as related, or closely relevant, to the same context.

LSI regards documents that share a significant proportion of frequently used words as semantically close. If documents have fewer words in common, they are considered semantically distant. LSI thus introduces a measure of interdependence, rating the relevance of any document on a scale of 0 to 1. Unlike a plain keyword search, LSI can quantify how close one document is to another, or how relevant a document is to a particular context.
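The core of this measure can be sketched in a few lines. The sketch below builds a toy term-document matrix (the vocabulary and counts are invented purely for illustration), applies a truncated singular value decomposition, which is the linear-algebra technique underlying LSI, and then compares documents by cosine similarity in the reduced "latent" space:

```python
import numpy as np

# Toy term-document matrix: one row per term, one column per document.
# Vocabulary and counts are invented purely for illustration.
terms = ["habits", "effective", "synergy", "recipe", "baking"]
docs = ["covey_review", "covey_summary", "cookbook"]
A = np.array([
    [3, 2, 0],   # "habits"
    [2, 3, 0],   # "effective"
    [1, 2, 0],   # "synergy"
    [0, 0, 4],   # "recipe"
    [0, 1, 3],   # "baking"
], dtype=float)

# Truncated SVD: keep only the k strongest latent "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one latent vector per document

def similarity(a, b):
    """Cosine similarity between two latent document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity(doc_vecs[0], doc_vecs[1]))  # the two Covey documents: close to 1
print(similarity(doc_vecs[0], doc_vecs[2]))  # Covey vs. cookbook: close to 0
```

The interesting property of the SVD step is that, in the reduced space, documents can score as similar even when they share few exact words, because related terms collapse into the same latent dimensions.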

Let’s consider an example. A document that discusses Stephen Covey and his teachings would frequently contain words such as ‘effective’, ‘habits’, ‘interdependence’, ‘independence’, ‘synergize’, ‘paradigm’, ‘continuum’, ‘public victory’, ‘private victory’, ‘circle of influence’ and so on. Once an indexing tool that uses the LSI technique recognizes these commonly used words in a given set of documents, it can find other documents or web pages on the net that contain the same set of keywords at similar frequencies, and index them in the database under the context they point to (Stephen Covey and his teachings).

Now compare this simple method with how a human brain approaches the same task. If you were given a set of documents and asked to locate the ones that discuss a particular context, what would you do? Most likely, you would look for what the sample documents have in common and use those observations to compare and classify the rest. LSI adds this kind of intelligence to otherwise lifeless crawler software.

Quite obviously, the LSI algorithm doesn’t understand the meaning of any word in a document. It simply reads the patterns in which particular words are used and calculates the correlations among their occurrences, and hence their correlation with a particular context. Let’s get into the practical side of it, that is, how it is applied in a search engine.
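To make the practical side concrete, here is a hedged sketch of how a search engine could answer a query against an LSI-style index. The vocabulary, document names, and counts are invented for illustration: a query is treated as a tiny document, "folded in" to the latent space through the term matrix U, and scored against each indexed document by cosine similarity:

```python
import numpy as np

# Toy index: rows are terms, columns are documents (counts invented for illustration).
terms = ["habits", "effective", "synergy", "recipe", "baking"]
docs = ["covey_review", "covey_summary", "cookbook"]
A = np.array([
    [3, 2, 0],
    [2, 3, 0],
    [1, 2, 0],
    [0, 0, 4],
    [0, 1, 3],
], dtype=float)

# Build the latent index with a truncated SVD (the technique underlying LSI).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk = U[:, :k]
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one latent vector per document

def search(query):
    """Fold a query into the latent space and rank documents by cosine similarity."""
    counts = np.array([query.split().count(t) for t in terms], dtype=float)
    q = Uk.T @ counts  # project the query as if it were a small document
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return sorted(zip(docs, scores), key=lambda pair: -pair[1])

for name, score in search("effective habits"):
    print(f"{name}: {score:.3f}")  # both Covey documents rank well above the cookbook
```

The query "effective habits" never mentions "synergy", yet the document containing it still ranks highly, because those terms share a latent dimension; that is the behavior the keyword-matching approach described above cannot reproduce.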
