Image management that will scale
Posted by Craig Ambrose on November 27, 2007 at 02:59 AM
There is a lot of conflicting information around about handling user uploaded images in rails applications. I’ve done it a number of different ways, and the good news is that it’s not too hard to move from one system to another. However, dealing with scaling issues is a pain and it’s nice to get it right first go. So, here are some problems that I’ve encountered recently, along with some solutions.
Files Per Directory Limit
Depending on which OS you use for hosting, you’ve probably got a limit to the number of files (or directories) you can put inside a given directory. It’s usually about 32,000; ext3, for example, allows roughly that many subdirectories in a single directory. While this seems like a long way off, if your site accepts user content then hopefully this will eventually become a problem for you. There have been various talks and articles written about different hashing systems for file names, but it’s worth mentioning that this is basically a solved problem, and you shouldn’t have to tackle it yourself.
If you’re still using file_column, as I am for a few things, then this one might bite you. The simplest solution, I think, is to migrate to attachment_fu. The file system store for attachment_fu implements file name based hashing, and the s3 and database stores don’t suffer from the problem at all. Also, the way in which attachment_fu handles pluggable storage classes means that you could also slip in your own custom storage system later without having to change the way that you use attachment_fu in your models.
If you’re thinking of making the switch, here’s an article I wrote on migrating from file column to attachment_fu.
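To give a feel for what such a scheme does, here is a minimal sketch in plain Ruby. It is not attachment_fu's actual code: the method names hashed_path and partitioned_path and the "uploads" directory are made up for illustration, and the second method assumes integer ids below 100 million.

```ruby
require "digest/md5"

# Hypothetical helper: derive two directory levels from a hash of the file
# name. Each level has at most 256 entries, so files are spread over up to
# 65,536 leaf directories instead of piling up in one.
def hashed_path(root, filename)
  digest = Digest::MD5.hexdigest(filename)
  File.join(root, digest[0, 2], digest[2, 2], filename)
end

# Hypothetical alternative when every file belongs to a record with an
# integer id: zero-pad the id to 8 digits and split it into two groups of 4.
# Assumes id < 100_000_000.
def partitioned_path(root, id, filename)
  File.join(root, *("%08d" % id).scan(/..../), filename)
end

puts partitioned_path("uploads", 12345, "me.jpg")  # uploads/0001/2345/me.jpg
puts hashed_path("uploads", "avatar.png")          # uploads/<2 hex>/<2 hex>/avatar.png
```

Either way the path is computed from data you already have, so nothing extra needs to be stored to find the file again.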
RMagick Memory Leaks
RMagick is really handy, and so just about every rails image handling tutorial on the internet recommends its use. I’m using it all over the place. My advice to you is: don’t ever do this. It turns out that RMagick leaks memory every time it manipulates an image. I haven’t measured the amount myself, but I’m told it’s quite a bit. Certainly I’ve been having resource consumption problems with scripts using RMagick heavily. So, say goodbye to it.
DHH recommended just using the ImageMagick binaries manually. That’s basically a good idea, but a slightly easier way of doing that is to use the mini_magick gem. MiniMagick provides a ruby API, but under the hood it just calls the ImageMagick command line tools. Attachment_fu comes with a MiniMagick processor, so you can just add ":processor => :MiniMagick" to your call to has_attachment and you’re in business. Khamsouk Souvanlasy wrote a good tutorial on using mini_magick with acts_as_attachment.
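For the "call the binaries manually" route, a minimal sketch might look like the following. The helper names resize_command and resize_image are hypothetical, and the sketch assumes ImageMagick's convert tool is installed and on the PATH. The point is that the image work happens in a short-lived child process, so whatever memory it uses is handed back to the OS when that process exits.

```ruby
# Hypothetical helper: build the argument list for ImageMagick's `convert`.
# geometry is an ImageMagick geometry string such as "100x100".
def resize_command(source, destination, geometry)
  ["convert", source, "-resize", geometry, destination]
end

# Hypothetical helper: run the command in a child process.
# Passing the arguments separately (rather than as one string) skips the
# shell, so unusual characters in file names are not interpreted.
def resize_image(source, destination, geometry)
  system(*resize_command(source, destination, geometry)) or
    raise "convert failed for #{source}"
end

# Usage (needs ImageMagick installed, so it is left commented out):
# resize_image("original.jpg", "thumb.jpg", "100x100")
```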
Cropping
The one thing I noticed in using attachment_fu instead of file column is that file column resizes images and crops them nicely, the way that you would expect. By default, attachment_fu tends to stretch them. This has been covered better by other people, so I just want to mention it because cropping is almost certainly what you want, and until Rick fixes it, I’d suggest making a small change to the plugin yourself. There are a number of articles on the subject, but I think the best one is probably over at toolman tim’s blog.
Don’t just go with Tim’s solution though, have a look at the comments, and you will find options for the different image processors. I used “labrat’s” suggested fix for mini magick (paste here)
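The arithmetic behind "resize and crop" rather than "stretch" is simple enough to sketch. This is not the patch from Tim's post or its comments, just an illustration with a hypothetical method name, crop_to_fill: scale the image until it covers the target box in both dimensions, then trim the overflow evenly from both sides.

```ruby
# Hypothetical helper. Returns [resized_width, resized_height, offset_x, offset_y]:
# the size to scale the source to (aspect ratio preserved), and the top-left
# corner of a centred target_w x target_h crop within that scaled image.
def crop_to_fill(src_w, src_h, target_w, target_h)
  # Use the larger of the two ratios so the scaled image covers the whole box.
  scale = [target_w.to_f / src_w, target_h.to_f / src_h].max
  # Guard against float rounding leaving the image one pixel too small.
  resized_w = [(src_w * scale).round, target_w].max
  resized_h = [(src_h * scale).round, target_h].max
  offset_x = (resized_w - target_w) / 2
  offset_y = (resized_h - target_h) / 2
  [resized_w, resized_h, offset_x, offset_y]
end

p crop_to_fill(400, 200, 100, 100)  # => [200, 100, 50, 0]
p crop_to_fill(200, 400, 100, 100)  # => [100, 200, 0, 50]
```

A landscape 400x200 source is scaled to 200x100 and loses 50 pixels from each side, instead of being squashed into a square.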
Amazon S3
Amazon S3 appears to be a great solution for handling user generated images, and I’m starting to use it a fair bit. One word of warning however, is that I’ve already started to encounter an occasional communication error with amazon, as discussed in this thread and I don’t yet know how serious it is or how easily fixed. I’ll post some more on this subject when I’m better informed.
Caching Makes Your Brain Explode
Posted by Craig Ambrose on November 13, 2007 at 04:20 AM
I’ve been spending a lot of time recently trying to make boxedup.com scale. Before I started, I’d watched the right screen-casts, read the right books, and I thought I knew what had to be done to speed up rails applications when the need arose.
Boy, was I wrong.
A quick look at the three methods of caching rails pages reveals that page caching is of no use to a site which insists on displaying the current user on all pages (as most of them seem to). Next up is action caching, which does let me execute before and after filters, allowing me to handle the logged in user, but caches the entire rendered action, including the layout, so once again I can’t display the currently logged in user. There are possibly some ways around this, but since action caching is really just a specialised form of fragment caching, let’s talk about that.
Fragment Caching
Fragment caching does work. In fact, my first attempts at it benchmarked so well in my simplistic “load this page 100 times in httperf” tests that I dived in head first. The books on this subject, particularly the pragprog one, give the impression that this is pretty straightforward. It’s not. There are some massive gotchas that will bring even a fairly low traffic site to its knees if you don’t watch out for them.
It’s All About Expiry
You can’t consider caching without thinking about cache expiry. In rails, this is typically done with cache sweepers. For fragment caching, the sweepers call the expire_fragment method. This can take a string, which must match the fragment name exactly, or it can take a regular expression.
Gotcha #1, Don’t Use expire_fragment With A Regex
First up, this doesn’t work with memcache anyway, it only works with the file system cache. There’s nothing that wrong with the file system cache. Reading from it is faster than rendering a template. Expiring from it, however, is pretty slow. Expiring from it using a regex is absolutely appalling. The reason why is better explained in this article by Adam Doppelt.
So, if you can’t expire it with a regex, that leaves you the following options for expiry:
- Time based expiry. There are some plugins that add this feature to the file system store. Memcache gives it to you for free, and if you’re relying on this heavily, I’d use memcache.
- Being in one of those good situations where the number of possible fragments is known, and you can expire them each explicitly. This didn’t work for me in some of the critical areas that I needed to cache.
- Storing a list (in the database) of the caches that you built up which need to be expired if a certain thing is changed.
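The last option can be sketched in a few lines. This is an in-memory stand-in with made-up names (FragmentRegistry, the "user:1" style dependency labels); a real version would keep the list in a database table as described above, and the cache would be the fragment store rather than a plain Hash.

```ruby
# Hypothetical registry: remembers which fragment keys depend on which thing,
# so they can later be expired one by one, by exact name, with no regex scan.
class FragmentRegistry
  # cache can be anything that responds to #delete(key).
  def initialize(cache)
    @cache = cache
    @keys_by_dependency = Hash.new { |hash, dep| hash[dep] = [] }
  end

  # Record that `key` must be thrown away when `dependency` changes.
  def register(dependency, key)
    keys = @keys_by_dependency[dependency]
    keys << key unless keys.include?(key)
  end

  # Expire every fragment recorded against `dependency`, then forget them.
  # Unknown dependencies are a no-op.
  def expire(dependency)
    @keys_by_dependency.delete(dependency).to_a.each { |key| @cache.delete(key) }
  end
end

cache = { "profile/1" => "<div>", "sidebar/1" => "<ul>", "profile/2" => "<p>" }
registry = FragmentRegistry.new(cache)
registry.register("user:1", "profile/1")
registry.register("user:1", "sidebar/1")
registry.register("user:2", "profile/2")

registry.expire("user:1")
p cache.keys  # => ["profile/2"]
```

The cost is an extra write every time a fragment is built, which is part of why this option is less attractive than it first looks.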
Don’t Expire, Just Render it Obsolete
This article so far doesn’t really capture how much pain this stuff has caused me, and I’ll try and cover some other points in other articles. For now, let’s jump straight to the good bit.
I’ve read a lot of articles on caching, but this is the best, go read it:
The Secret To Memcached – by Tobias Lütke
Tobias also struggled with expiry, and his solution is to take advantage of the fact that if you’re using memcache, then you can never cache too many items. The oldest ones get pushed out when you run out of space.
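To see why stale entries are harmless in a store like that, here is a toy cache that evicts the least recently used entry once it is full. It is only an illustration of the idea, not how memcached is implemented; the class name TinyLruCache is made up, and it relies on Ruby hashes keeping insertion order (Ruby 1.9 and later).

```ruby
# Toy least-recently-used cache. The Hash keeps keys in insertion order,
# so the first key is always the one that was touched longest ago.
class TinyLruCache
  def initialize(max_entries)
    @max_entries = max_entries
    @store = {}
  end

  # Returns the cached value (or nil) and marks the key as recently used.
  def read(key)
    return nil unless @store.key?(key)
    value = @store.delete(key)
    @store[key] = value  # re-insert at the end: now the newest entry
  end

  # Stores the value, evicting the oldest entries if the cache is over size.
  def write(key, value)
    @store.delete(key)
    @store[key] = value
    @store.shift while @store.size > @max_entries
    value
  end

  def keys
    @store.keys
  end
end

lru = TinyLruCache.new(2)
lru.write("a", 1)
lru.write("b", 2)
lru.read("a")      # "a" is now fresher than "b"
lru.write("c", 3)  # cache is full, so "b" is pushed out
p lru.keys         # => ["a", "c"]
```

Fragments nobody asks for any more simply drift to the front of the queue and fall off, which is all the "expiry" they need.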
So, here’s my first bit of advice. If you’re building a real site, go straight to memcache. If you’re not building a site for big traffic, don’t cache, just optimise any really stupid queries that are giving you trouble. If you’re using memcache, be sure to run monit too.
So, we’re running memcache, and we don’t want to expire our fragments. Instead, try and find fragment keys that don’t need to be expired, because they will be replaced if the data changes.
The one I’ve just implemented was a stream of recent activity, much like facebook. Each little type of activity had a different template, and the rendering of this took up a lot of time. Fetching the data was also non-trivial. However, if I wanted to expire a cache of the activity stream, then I’d need to do so anytime something occurred on the site that triggered an activity for this user.
<%# join with a separator so that ids 1 and 23 can't produce the same key as 12 and 3 %>
<% cache ["activity_stream", @latest_activity.id, @user.id].join("/") do %>
... render the activities
<% end %>
There’s the code. The real example had a few more parameters, but you can see here that the magic is in the fact that I used @latest_activity.id as part of the key. I’m still having to query that from the database, but it’s pretty simple to do, and all I really need is one little integer, instead of all the activities and their associated objects. If a new activity is created for this user, then this id will have changed, and so I’ll be asking for a different cache key.
Benchmarks for this are looking really good. I’ll let you know how it goes in the wild, but I’m not expecting too many problems as most of my previous troubles have been to do with expiry, and this code doesn’t need any expiry. No sweepers, no regular expressions. It’s simple, and it scales.
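Outside of a rails view, the same idea can be demonstrated with a few lines of plain Ruby. Everything here is a stand-in: FragmentCache is a made-up in-memory store rather than memcache, activity_stream_key is a hypothetical helper, and the number of activities plays the role of the latest activity id.

```ruby
# Made-up in-memory fragment store that counts how often it has to render.
class FragmentCache
  attr_reader :renders

  def initialize
    @store = {}
    @renders = 0
  end

  # Returns the cached fragment for key, rendering via the block on a miss.
  def fetch(key)
    return @store[key] if @store.key?(key)
    @renders += 1
    @store[key] = yield
  end
end

# Hypothetical helper: the key changes whenever the latest activity changes.
def activity_stream_key(latest_activity_id, user_id)
  ["activity_stream", latest_activity_id, user_id].join("/")
end

cache = FragmentCache.new
activities = ["joined", "posted a photo"]
render = lambda { |items| items.join(", ") }  # stands in for the slow template
user_id = 7

first  = cache.fetch(activity_stream_key(activities.size, user_id)) { render.call(activities) }
second = cache.fetch(activity_stream_key(activities.size, user_id)) { render.call(activities) }

activities << "left a comment"  # new activity, so a new key, so a cache miss
third = cache.fetch(activity_stream_key(activities.size, user_id)) { render.call(activities) }

puts cache.renders  # 2: the second fetch was served from the cache
puts third          # joined, posted a photo, left a comment
```

Nothing ever deletes the old "activity_stream/2/7" entry. It just stops being asked for, and a store with eviction will drop it eventually.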
