Google Sets (http://labs.google.com/sets) got me thinking today. It takes a few words as input and tries to find as many other words as possible that belong to the same set. To give a general feel for it, here is an example: I gave it "cell phone" (mobile phone) as input, and it returned the following:
Cell phone
Pager
Telephone
Fax
Work Phone
Home Phone
Email
PDA
Blankets
Others
Palm OS
Closure
clothes
Smart Card
VISA
I think it is pretty cool, though I don't know how VISA is related to a cell phone. For the moment, I am not aware of any practical problem that this "Set Maker" would solve, though I am sure there is a good reason why it exists (if any of you can think of a practical use for it, please comment). That apart, what got me thinking was the algorithm it probably uses. Three things stood out in the above result set:
Clothes
Blankets
Others
How in the world are these things related to a cell phone? On the other hand, one can immediately see a relation between "Clothes" and "Blankets". So I initially thought that this could be a recursive algorithm: having somehow got to "Blankets", it would then try to find all the words similar to "Blankets". But if that were the case, running the "Set Maker" on the word "Blankets" should have given me the word "Clothes", and it does not show up. (The word "Clothing" does show up, but I prefer to treat them as two different words.) So my initial premise that this is recursive in nature is, for the moment, wrong!
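To make the recursive idea concrete, here is a minimal Python sketch of what I had in mind. Everything in it is an assumption on my part: the RELATED table is made-up toy data, and expand is a hypothetical helper, not anything Google has published. It only shows why the "Blankets" test is a fair check of the premise.

```python
from collections import deque

# Hypothetical toy "related words" table, invented for illustration.
# The real data behind Google Sets is unknown to me.
RELATED = {
    "cell phone": ["pager", "telephone", "blankets"],
    "blankets": ["clothes"],
    "pager": ["fax"],
}

def expand(seed, related, max_depth=2):
    """Breadth-first expansion: take the words related to the seed,
    then the words related to those, up to max_depth hops away."""
    seen = [seed]
    queue = deque([(seed, 0)])
    while queue:
        word, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in related.get(word, []):
            if neighbour not in seen:
                seen.append(neighbour)
                queue.append((neighbour, depth + 1))
    return seen

print(expand("cell phone", RELATED))
print(expand("blankets", RELATED))
```

If "Clothes" only reached the "cell phone" result by way of "Blankets", then running the same expansion starting from "Blankets" must also produce "Clothes". That is exactly the check that failed on the real thing.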
So, my second postulate was that the algorithm searches its database for pages containing the words the user entered. It then picks out the words that are common to the pages it has shortlisted and displays those as the result. To test this postulate, I ran the "Set Maker" for the words "Vikram" and "Madhavan" (these guys are South Indian film stars).
Here is the result:
Madhavan
Vikram
Abbas
Smita Jayekar
Maya Alagh
Anupam Kher
Diya Mirza
Navin Nischol
Now, the result might be a fair call. But the interesting part here is what doesn't turn up in the result. I do know that all the people in the list are actors. But of the above lot, only Madhavan, Vikram and Abbas belong to the South Indian film industry. Why did the algorithm not include names like "Rajni", "Kamal Hasan", "Shivaji" and the like? These people are definitely more closely related to "Madhavan" and "Vikram" than "Navin Nischol" is. If there are pages that contain references to Madhavan and Vikram, they are more likely to also mention Rajni and Kamal Hasan than Smita Jayekar and Diya Mirza. So I think I can safely say that my second postulate is also out of the window.
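For the record, the second postulate amounts to something like the Python sketch below. Again, this is only my guess written out as code: PAGES is a tiny made-up index (each page reduced to the set of names on it), and cooccurring is a hypothetical helper. Nothing here is taken from the actual "Set Maker".

```python
from collections import Counter

# Hypothetical index: each "page" is just the set of names found on it.
# These pages are invented for illustration, not real search data.
PAGES = [
    {"Madhavan", "Vikram", "Abbas"},
    {"Madhavan", "Vikram", "Kamal Hasan", "Abbas"},
    {"Vikram", "Anupam Kher"},
]

def cooccurring(seeds, pages):
    """Shortlist the pages that contain every seed word, then rank the
    other words on those pages by how many shortlisted pages mention them."""
    seeds = set(seeds)
    counts = Counter()
    for page in pages:
        if seeds <= page:  # the page mentions all of the seed words
            counts.update(page - seeds)
    return [word for word, _ in counts.most_common()]

print(cooccurring(["Madhavan", "Vikram"], PAGES))
```

Under this scheme, the names that share the most pages with both seeds come out on top, which is why I expected the South Indian actors to outrank everyone else.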
I still don't have a third postulate, and I am currently chatting with Nithya from Delhi (my Orkut friend) about how this might work! Let's see if we are able to solve the "puzzle". In the meanwhile, if you guys know how this might work, do comment. I would love to know your take on the algorithm!