Tagging with Redis

Its been a long time since my last post and this one is about Redis. So I’ve been working on this project for a bit now and I had this story to implement where I had to associate posts with other posts which were identified to be similar based on a tagging system that already existed. The trouble here was that the existing tagging system was closely tied to a lot of other functionalities and couldn’t be easily (quickly) re-written. The feature needed to be rolled out quickly for several reasons which I won’t get into.

The Problem

The tags associated to each posts were poorly organized where one record in the Tag model would hold all the tags associated to the post as a whitespace separated string (ouch!)


posts.tags.first.name # business money finance investment

So to find posts that had 2 tags in common from a database of approximately 2000 posts with at least 4 tags each took a significant amount of time. Here is a basic benchmark of just finding the matches on each request.


Benchmark.measure do
 similar_posts = []
 Post.tagged_posts.each do |post|
   similar_tags = (post.tags.first.name.split & current_post.tags.first.name.split)
   similar_posts << post if similar_tags.size >= 2
  end
end

Here is what benchmark returns


 => #Benchmark::Tms:0x111fd8cb8 @cstime=0.0, @total=2.25, 
@cutime=0.0, @label="", @stime=0.19, @real=4.79089307785034, @utime=2.06

Not Great.

So the next option was to pre-populate the related posts for every post and store it in the database as similar_posts. So all that was required was to fetch the post with its ‘similar_posts’ at runtime. This seemed like an acceptable solution considering the tags were not changed on a regular basis but if changed it would require the similar_posts table to be rebuilt again(which took a long time). Here is the benchmark for fetching the pre-populated data from the database


Benchmark.measure { p.similar_posts }
=> #Benchmark::Tms:0x104d153f8 @cstime=0.0, @total=0.0100000000000007, 
@cutime=0.0, @label="", @stime=0.0, @real=0.0333230495452881, @utime=0.0100000000000007

Nice! But this came at the cost of having to rebuild the similar_videos every time something had to be changed with tags or videos.

Redis

Redis is this awesome in memory key-value, data-structure store which is really fast. Though it would be wrong to think of it as a silver bullet it does a lot things which are really awesome. One is the ability to store data structures and perform operations on them, PubSub and a lot of other cool stuff. It even allows you to persist the information using a snapshotting technique which takes periodic snapshots of the dataset or by “append only file” where it appends to a file all the operations that take place and recreates the dataset in case of failure.

In my case I didn’t need to persist the data but maintain a snapshot of it in memory. So assuming some basic understanding of Redis and the redis gem I’ll describe the approach.

We created a SET with each tag name as key in REDIS so that every tag contains a set of the post_ids of all posts that have that tag. Inorder to identify a post having two tags in common all that was needed was the intersection of tags sets and REDIS provides built methods for these operations. Thats it!

 
{"communication" => ['12','219', .. , '1027']}  #sample SET

Fetching the similar posts


def find_similar_posts 
  similar_posts = []
  if self.tags.present? && self.tags.first.present?
    tags = self.tags.first.name.split
    tags_clone = tags.clone
    tags_clone.shift
    tags.each do |tag| 
      tags_clone.each do |tag_clone|  
        similar_posts << REDIS_API.sinter(tag.to_s, tag_clone.to_s)
      end
      tags_clone.shift        
     end
  else
    puts "No matching posts."
   end                          
 similar_posts -= [self.id.to_s]
 Post.find(similar_posts)  

Benchmark


 >> Benchmark.measure { p.find_similar_posts }
Benchmark::Tms:0x1100636f8 @cstime=0.0, @total=0.0100000000000007, @cutime=0.0, @label="",
@stime=0.0, @real=0.163993120193481, @utime=0.0100000000000007
 

Which is pretty good considering that we do all the computation within this request and nothing is pre-populated.