Its been a long time since my last post and this one is about Redis. So I’ve been working on this project for a bit now and I had this story to implement where I had to associate posts with other posts which were identified to be similar based on a tagging system that already existed. The trouble here was that the existing tagging system was closely tied to a lot of other functionalities and couldn’t be easily (quickly) re-written. The feature needed to be rolled out quickly for several reasons which I won’t get into.
The Problem
The tags associated to each posts were poorly organized where one record in the Tag model would hold all the tags associated to the post as a whitespace separated string (ouch!)
posts.tags.first.name # business money finance investment
So to find posts that had 2 tags in common from a database of approximately 2000 posts with at least 4 tags each took a significant amount of time. Here is a basic benchmark of just finding the matches on each request.
Benchmark.measure do
similar_posts = []
Post.tagged_posts.each do |post|
similar_tags = (post.tags.first.name.split & current_post.tags.first.name.split)
similar_posts << post if similar_tags.size >= 2
end
end
Here is what benchmark returns
=> #Benchmark::Tms:0x111fd8cb8 @cstime=0.0, @total=2.25,
@cutime=0.0, @label="", @stime=0.19, @real=4.79089307785034, @utime=2.06
Not Great.
So the next option was to pre-populate the related posts for every post and store it in the database as similar_posts. So all that was required was to fetch the post with its ‘similar_posts’ at runtime. This seemed like an acceptable solution considering the tags were not changed on a regular basis but if changed it would require the similar_posts table to be rebuilt again(which took a long time). Here is the benchmark for fetching the pre-populated data from the database
Benchmark.measure { p.similar_posts }
=> #Benchmark::Tms:0x104d153f8 @cstime=0.0, @total=0.0100000000000007,
@cutime=0.0, @label="", @stime=0.0, @real=0.0333230495452881, @utime=0.0100000000000007
Nice! But this came at the cost of having to rebuild the similar_videos every time something had to be changed with tags or videos.
Redis
Redis is this awesome in memory key-value, data-structure store which is really fast. Though it would be wrong to think of it as a silver bullet it does a lot things which are really awesome. One is the ability to store data structures and perform operations on them, PubSub and a lot of other cool stuff. It even allows you to persist the information using a snapshotting technique which takes periodic snapshots of the dataset or by “append only file” where it appends to a file all the operations that take place and recreates the dataset in case of failure.
In my case I didn’t need to persist the data but maintain a snapshot of it in memory. So assuming some basic understanding of Redis and the redis gem I’ll describe the approach.
We created a SET with each tag name as key in REDIS so that every tag contains a set of the post_ids of all posts that have that tag. Inorder to identify a post having two tags in common all that was needed was the intersection of tags sets and REDIS provides built methods for these operations. Thats it!
{"communication" => ['12','219', .. , '1027']} #sample SET
Fetching the similar posts
def find_similar_posts
similar_posts = []
if self.tags.present? && self.tags.first.present?
tags = self.tags.first.name.split
tags_clone = tags.clone
tags_clone.shift
tags.each do |tag|
tags_clone.each do |tag_clone|
similar_posts << REDIS_API.sinter(tag.to_s, tag_clone.to_s)
end
tags_clone.shift
end
else
puts "No matching posts."
end
similar_posts -= [self.id.to_s]
Post.find(similar_posts)
Benchmark
>> Benchmark.measure { p.find_similar_posts }
Benchmark::Tms:0x1100636f8 @cstime=0.0, @total=0.0100000000000007, @cutime=0.0, @label="",
@stime=0.0, @real=0.163993120193481, @utime=0.0100000000000007
Which is pretty good considering that we do all the computation within this request and nothing is pre-populated.