Name : Fiat Sienna

Type : Car

Description : 1.6 HP, not bad for a sedan, relatively good performance for the price, best sedan i've bought

Tags : Cars Fiat Silver



Remove punctuation and non alphanumeric characters (replace them with spaces)

Collapse spaces and split the text on them

Match the generated list of terms to a stop word list to remove them (words like "on", "the" should not be considered in the index)

The remaining terms are converted to lower case and then converted to their stem representation (we use a snowball stemmer for now)



Attribute : Value

===================

Name : fiat sienna

Type : car

Description : hp sedan performance price sedan buy

Tags : car fiat silver

Term : Frequency

=======================

fiat : 2

car : 2

sedan : 2

sienna : 1

performance : 1

silver : 1

buy : 1

hp : 1

Term Vector A vs. Term Vector B = cos a = (A . B) / (|A|.|B|)

A.B = dot product for the two vectors of TF-IDF values

|A|.|B| = scalar product of the magnitudes of the two vectors

# Monkey patch string to be able to extract terms from any string

Class String

def to_terms(boost = 1, terms = {})

# remove all non letters and reject stop words

terms_list = self.gsub(/(\s|\d|\W)+/u,' ').rstrip.strip.split(' ').reject{|term|$stop_words.include?(term)}

# transform to a hash with a frequency * boost value

terms_list.each do|term|

if terms[term]

terms[term] = terms[term] + boost

else

terms[term] = boost

end

end

terms

end

end



#our item class, which we match upon

Class Item

def to_terms(terms = {})

#a hash of attributes to serialize

{:name => 10,

:description => 3,

:type_name => 1,

:tag_names => 1}.each_pair do |field, boost|

terms = send(field).to_terms(boost, terms)

end

terms

end

end

Class Item

def to_tf_idf

#assume we have a method that returns the df value for any term

terms = self.to_terms

total_frequency = terms.values.inject(0){|a,b|a+b}

terms.each do |term,freq|

terms[term]= (freq / total_frequency) * self.df(term)

magnitude = magnitude + terms[term]**2

end

terms, magnitude

end

def match(item)

my_tf_idf, my_magnitude = self.to_tf_idf

his_tf_idf, his_magnitude = item.to_tf_idf

dot_product = 0

my_tf_idf.each do |term,tf_idf|

dot_product = dot_product + tf_idf * his_tf_idf[term] if his_tf_idf[term]

end

cosine_similarity = dot_product / (my_magnitude * his_magnitude)

end

end

With the debut of the new version of meOwns we have introduced several features that are concerned with how objects are related to each other. In the user profile page, you get a list of other user that are similar to him/her. And you are faced with the chemistry meter which tells you how much you are related to this user (if you are logged in of course). If you happen to be viewing your own profile page you will get a list of recommended items that you might be interested in. Last but not least, when you view an item, you get a list of similar items.In this article I will be talking about the features from an implementation point of view. Naturally the first hurdle was to define the problem. What we needed was to find a way to match items and users. So first we needed to represent them in a way that can be matched. We started from the items first and looked at how to match two items together.What is an item? In meOwns an item is simply a name, a description, a type and some tags. We decided to ignore photos and comments from the item. Each of those fields gets a weight which affects the value of terms found in it. Here is a sample product (encoded in Yaml):The above fields are then processed to extract relevant terms from them, this is done in the following mannerThe above can be represented as follows:After doing so, we create a list of terms with their frequencies, each field has a frequency multiplier according to its significance. Assumming a multiplier value of 1 for all the fieldsThis Term-Frequency vector is the basis of doing item matching in meowns. Several different approaches can be implemented to reach an item representation. Even the details of a given approach can vary significantly. I have chosen to stick to the easiest approach in the initial implementation. Those willing to dig further are free to lookup more into document representation and indexing strategies.User representation is just an aggregation of their item (and wished item) representations. This way user and items share the same term vector structure and hence we can match users to items as much as we can match items to items and users to users.The matching process involves further encoding of the term frequency vector. Those in the academia refer to this as TF-IDF (Term Frequency, Inverse Document Frequency) representation. In lay man's terms this is a representation of how significant a term is to the certain document.It is composed of the product of two parts. This first which is TF (Term Frequency) is simply the frequency of the term in the document divided by the total sum of frequency of terms in the same document.The second (IDF) is the total number of documents in the corpse (the document store) divided by the number of documents which contain this term. If the term is found in all documents then the IDF will be equal to 1 and hence won't have an effect on the final product of the two parts. On the other hand, if we have a corpus of 1,000,000 documents and only one with the given term then it will multiply the TF value by 1,000,000 which is significant.A slight variation (widely used) is to multiply the TF with the Logarithmic value of the IDF. After we are done calculating TF-IDF values for all the terms in term vectors matching can be done as followsThe returned value is called the cosine similarity between the two items. It ranges between 0 and 1 where zero means no correlation and one means exact match. We are still experimenting with threshold values but these are essentially the figures you get when you see another user's chemistry meter for example. In another installment we will discuss how are we implementing behind the scenes matching of users and items in an efficient way.Just to justify calling the post document matching in Ruby, here's a Ruby code to implement the aboveThe above methods enable us to extract the terms from different items. The first method refers to a global (bad me) stop word list which should be present.Now that we have the items represented as term frequencies we can generate their tf-idf vectors (we will keep them as hashes though) and we can use them to do the matchingPretty easy, now you get a value between 0 and 1 that represents how similar those two items are.