How I spent two weeks hunting a memory leak in Ruby

(This post was translated, the original version is in my Russian blog.)

Foreword

This is a story about hunting a memory leak. A long story, because I go into much detail.

Why describe my adventures? Not only because I wanted to preserve all those tiny code pieces and scripts. It also occurred to me that what I had been following was the UNIX way: every step relied on yet another small utility, or a library, that solves its own task well. And in the end I succeeded.

It was an interesting journey, too! Some days I went to bed and got up thinking about what was happening to the memory.

I am grateful to my colleagues at Shuttlerock who did the actual work (as in “being productive for customers”) while I was playing memory detective. Thanks to the company itself, too!

Introduction

So we found ourselves in an unpleasant situation: our Rails app on Heroku platform had a memory leak. One can see it on this graph:

Now, every deploy or restart makes the memory line climb fast (see around 3 AM and 8 AM). That’s totally normal, because Ruby is a dynamic language: require calls are still being executed, code is still loading. When everything is loaded, memory consumption should become quasi-constant. Requests come in, objects get created, and the line goes up; the garbage collector does its job, and it goes down. The app should be able to stay in such a dynamic balance forever.

But in our case there’s an apparent upward trend. A leak! When our Heroku 2X dyno (we use puma in cluster mode with 2 worker processes) reaches 1 GB, it becomes slow as hell and unresponsive, and needs to be restarted.

Reproduce the thing locally

Thou shalt not experiment on the production server. So the first thing I did was download the latest production database dump and run the app in the production environment on my local Linux server, which resides in the closet (you should have one too!). Our app is a SaaS product and has a special middleware which detects customer sites by domain, so I had to make small changes to the app code to be able to issue requests like curl http://localhost:3000/…. Environment variables are best for this purpose.
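A sketch of the tweak (the class and key names here are my own, not necessarily the app's real ones):

```ruby
require "cgi"

# Hypothetical reconstruction of our site-detecting middleware: when the
# QUERY_SITE_ID environment variable is set, the site ID is taken from
# the site_id query parameter instead of the request's host name.
class SiteDetector
  def initialize(app)
    @app = app
  end

  def call(env)
    params = CGI.parse(env["QUERY_STRING"].to_s)
    env["app.site_id"] =
      if ENV["QUERY_SITE_ID"] && (id = (params["site_id"] || []).first)
        id
      else
        site_id_by_host(env["HTTP_HOST"]) # the normal production lookup
      end
    @app.call(env)
  end

  private

  # Placeholder for the real domain-based detection.
  def site_id_by_host(host)
    nil
  end
end
```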

As you can see, if environment variable QUERY_SITE_ID is set, site ID gets taken from query parameter site_id:

curl http://localhost:3000/?site_id=123

In most cases you’ll also have to specify config.force_ssl = false in config/environments/production.rb, set the DEVISE_SECRET_KEY variable (if you use devise), and maybe something else. In the end, the curl command must work.

So the server works locally; what’s next? You have to supply incoming requests. The wonderful siege utility allows you to put load on web servers in different modes and records statistics.

I decided not to hit a single route, but rather to collect real URLs used by web clients. Easy-peasy: run heroku logs -t | tee log/production.log for a while, then extract URLs from the log. I wrote a small utility which parsed the log, collected paths and the site_id values reported by our middleware, and saved everything into urls.txt like this:
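A reconstruction of the file's shape (siege's format, one URL per line; the paths below are invented, the real ones came from our production logs):

```
# hypothetical example entries; real ones were extracted from heroku logs
http://localhost:3000/?site_id=101
http://localhost:3000/widgets?site_id=101
http://localhost:3000/?site_id=102
http://localhost:3000/feed?site_id=103
```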

You can create such a file by hand, or resort to awk, grep, sed.

Let’s run siege:

siege -v -c 15 --log=/tmp/siege.log -f urls.txt

Here siege will create 15 parallel clients and use urls.txt as its input.

If you do everything right, you’ll experience the memory leak. It can be seen with top and ps utilities — the number to look for is called RSS (Resident Set Size). To save myself from running them, though, I’ve added the following code to the app:
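A reconstruction (not the exact code) of the addition:

```ruby
# Log RSS and the live object count after each request. Reading VmRSS
# from /proc is Linux-specific; in the app this line went to the Rails
# logger from a small middleware.
module MemoryStats
  def self.rss_kb
    File.read("/proc/#{Process.pid}/status")[/VmRSS:\s*(\d+)/, 1].to_i
  end

  def self.line
    "RSS: #{rss_kb} KB, heap_live_slots: #{GC.stat[:heap_live_slots]}"
  end
end
```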

Accurate records with increasing RSS started to appear in the log. More on GC.stat[:heap_live_slots] later.

After the first experiments I switched puma into single mode, because it leaked too and a single process is easier to deal with.

Hunting leak in Ruby code

Now certain that the leak existed, I started playing the Sherlock Holmes part.

Some words need to be said about how MRI works with memory in general. Objects are stored in a heap controlled by the interpreter. The heap consists of separate pages, each 16 KB in size, with every object slot using 40 bytes. When an object is created, MRI searches for a free slot; if there is none, an extra page is added to the heap. Not every object fits into 40 bytes, though: if it needs more, additional memory is allocated (via malloc).
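These numbers are visible from Ruby itself; a quick way to look at them (the key names below are those of MRI 2.2+):

```ruby
# Page and slot statistics straight from the interpreter.
stat = GC.stat
puts "heap pages:   #{stat[:heap_allocated_pages]}"
puts "live objects: #{stat[:heap_live_slots]}"
puts "free slots:   #{stat[:heap_free_slots]}"
```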

Memory is freed automatically when the garbage collector (GC) runs. Modern MRIs have a quite effective incremental, generational GC with two phases: minor and major. Based on the heuristic that “most objects die young”, the minor GC looks for unneeded objects only among the newly created ones. This allows the major GC, which performs the classic mark-and-sweep algorithm over ALL objects, to run less often.

It must be noted that the intricacies of generations have nothing to do with hunting memory leaks. There’s only one important question: are all the objects created while processing a web request eventually freed, or not? In a web server context, all objects can be divided into three groups:

1. Statics. All loaded gems, esp. Rails, and the app code. In the production environment all of this gets loaded once and doesn’t change.
2. Slow dynamics. There’s a certain amount of long-lived objects, e.g. the cache of prepared SQL statements in ActiveRecord. This cache is per DB connection and holds at most 1000 statements by default. It grows slowly, and the total number of objects increases until the cache reaches its full size (2000 strings × the number of DB connections).
3. Fast dynamics. Objects created during request processing and response generation. Once the response is ready, all of these objects can be freed.

If any object from the third group doesn’t get freed, you have a leak. Here’s an illustration:
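```ruby
# Reconstructed from the description below: a controller action that
# pushes a string into a constant on every request. (Rendering details
# omitted; the point is the ever-growing FOO.)
FOO = []

class MyController
  def index
    FOO << "haha"  # each call adds a string that can never be collected
  end
end
```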

Constants are not garbage-collected, so consecutive MyController#index calls will inflate FOO. The heap will grow to accommodate more and more "haha" strings.

If there are no leaks, the heap size will oscillate. The minimum corresponds to objects from groups 1 and 2 (see above). For example, in our app this minimum is slightly above 500,000 objects, while an empty app created with rails new app instantiates roughly 300,000. The maximum heap size depends on how often the major GC runs. But after every major GC the number of objects always comes back to the minimum. A leak makes this lower boundary creep up over time. The GC.stat[:heap_live_slots] figure reflects the current heap size.

The most convenient way to explore GC-related things is the gc_tracer gem by Koichi Sasada, a Ruby core team member and the author of the incremental GC implementation in Ruby 2.1 and 2.2. Having added
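```ruby
# A sketch of the addition (gc_tracer's logging API; the exact options
# may vary by gem version). rusage data is included so that RSS lands
# in the log alongside the GC stats.
if ENV["GC_TRACER_LOG"]
  require "gc_tracer"
  GC::Tracer.start_logging("log/gc.log", rusage: true)
end
```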

to config/application.rb, we get a log/gc.log file filled with GC stats and also getrusage results (the latter is useful because one of its fields contains the RSS figure we are so interested in!).

Every line of this log has over 50 numbers, but simple UNIX magic comes to the rescue. The command I ran along with puma and siege was:

tail -f log/gc.log | cut -f 2,4,7,8,9,15,16,19,21,22,25,26,27,28,29,30,36

which gives this:

The first number in each line is a timestamp in milliseconds. The second is the number of heap pages, the third the heap size (in objects). The 11th (with values in the 581634…613199 range) is the number of “old” objects, i.e. objects which are not inspected during a minor GC run. The last number in each line is RSS in kilobytes.

Still so many numbers! Let’s plot them. We could load this log directly into Excel (sorry, LibreOffice Calc), but that’s not classy. Let’s use gnuplot instead, which is able to plot directly from files.

Unfortunately, gnuplot doesn’t support timestamps in milliseconds, so I had to write a small script for timestamp conversion:
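My reconstruction of it (the gc_tracer log is tab-separated, with the event type in the first column):

```ruby
# Prepare gc_tracer output for gnuplot: keep only the "end_sweep" lines
# (one per finished GC cycle) and turn the millisecond timestamp in the
# next column into seconds.
def convert(line)
  fields = line.chomp.split("\t")
  return nil unless fields.first == "end_sweep"
  fields.shift                                 # drop the event-type column
  fields[0] = (fields[0].to_i / 1000.0).to_s   # ms -> s
  fields.join("\t")
end

if __FILE__ == $0
  ARGF.each_line do |line|
    out = convert(line)
    puts out if out
  end
end
```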

Along with turning milliseconds into seconds some extra information is discarded here. gc_tracer generates data on all stages of GC but we are interested only in the final data (first column contains “end_sweep”).

This gnuplot script
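```gnuplot
# (Reconstructed; my original script is lost. The column numbers depend
# on which fields the cut/convert step kept and are illustrative.)
set terminal png size 900,450
set output "gc.png"
set ytics nomirror
set y2tics
plot "gc.dat" using 1:10 axes x1y1 with lines lc rgb "red"  title "old objects", \
     "gc.dat" using 1:17 axes x1y2 with lines lc rgb "blue" title "RSS"
```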

gets us this picture:

Red curve (left scale) displays the number of “old” objects. It behaves like there’s no leak. Blue curve (right scale) is RSS, which never stops rising.

The conclusion: there are no leaks in Ruby code. But I wasn’t sharp enough to grasp that at the time and spent more time pursuing a false path.

False path: messing with heap dumps

Modern versions of MRI are equipped with powerful means for memory analysis. For instance, you can enable object creation tracing, and for each newly created object the location where it was instantiated (source file, line) will be saved and accessible later:
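```ruby
require "objspace"

# Once tracing is enabled, every object created afterwards remembers
# its birthplace.
ObjectSpace.trace_object_allocations_start

obj = Object.new
p ObjectSpace.allocation_sourcefile(obj)  # the file we are in
p ObjectSpace.allocation_sourceline(obj)
```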

Now we can dump the heap into a file:
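```ruby
require "objspace"

# Dump every live object to a file, one JSON document per line.
File.open("/tmp/heap.dump", "w") do |f|
  ObjectSpace.dump_all(output: f)
end
```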

Location information will be dumped too, if available. The dump itself is a JSON Lines file: the data for each object is in JSON format and occupies one line of the file.

There are gems which make use of allocation tracing, like memory_profiler and derailed. I decided to investigate what happens with the heap in our app as well. Having had no luck with memory_profiler, I went for generating and analyzing dumps myself.

First I wrote my own benchmark like derailed does, but later switched to generating dumps from live app, using rbtrace. If you have gem 'rbtrace' in your Gemfile, here’s a way to generate dump:
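```shell
# Ask the live process (e.g. puma's PID) to dump its heap; the code in
# quotes is evaluated inside the target process. (A sketch; adjust the
# PID and output path to taste.)
bundle exec rbtrace -p $PUMA_PID -e '
  Thread.new do
    GC.start
    require "objspace"
    File.open("/tmp/dump1.json", "w") { |f| ObjectSpace.dump_all(output: f) }
  end.join
'
```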

Now let’s assume we have three dumps (1, 2, 3), generated at different moments in time. How can we spot a leak? The following scheme has been suggested: let’s take dump 2 and remove all objects which are present in dump 1. Then let’s also remove all objects which are missing from dump 3. What’s left is objects which were possibly leaked during the time between dump 1 and 2 creation.
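A minimal sketch of that scheme (objects are matched by address, which is a simplification, since MRI can reuse addresses):

```ruby
require "json"
require "set"

# Address set for one dump; ROOT records have no address and are skipped.
def addresses(path)
  File.foreach(path)
      .map { |line| JSON.parse(line)["address"] }
      .compact.to_set
end

# Present in dump2, absent from dump1, still present in dump3.
def suspected_leaks(dump1, dump2, dump3)
  (addresses(dump2) - addresses(dump1)) & addresses(dump3)
end
```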

I even wrote my own utility for this differential analysis procedure. It’s implemented in… Clojure, because I like Clojure.

But everything I could find was the prepared SQL statement cache mentioned above. Its contents, the SQL strings, weren’t leaked; they just lived long enough to appear in dump 3.

Finally I had to admit that there are no leaks in Ruby code and had to look for them elsewhere.

Learning jemalloc

I formed the following hypothesis: if there are no leaks on the Ruby side but memory is still leaking, there must be a leak in C code. It could be either some gem’s native code or MRI itself. In that case the C heap must be growing.

But how to detect leaks in C code? I tried valgrind on Linux and leaks on OS X. They didn’t give me anything interesting enough. Still looking around, I stumbled upon jemalloc.

Jemalloc is a custom implementation of malloc, free, and realloc that tries to be more efficient than the standard system implementation (except on FreeBSD, where jemalloc is the system implementation). It uses a bag full of tricks to achieve this goal. It has its own page system, with pages allocated via mmap, and the allocator uses independent “arenas” with thread affinity, allowing for less synchronization between threads. Allocated blocks are divided into three size classes: small (< 3584 bytes), large (< 4 MB), and huge; each class is handled differently. But most importantly, jemalloc has statistics and profiling. And MRI 2.2.0 supports jemalloc natively! The LD_PRELOAD hack is no longer needed (by the way, I couldn’t make it work anyway).

I quickly set about installing jemalloc and then building MRI with jemalloc enabled. Not without trouble. The jemalloc packages provided by Ubuntu and Homebrew are built without profiling. Also, certain Ruby gems won’t build with the most recent jemalloc, 4.0.0 (released in August 2015): e.g. the pg gem doesn’t like the C99 <stdbool.h> included by <jemalloc/jemalloc.h>. But everything is smooth with jemalloc 3.6.0.

You can follow this instruction to build MRI with rbenv; in the end, however, I opted for building it myself:
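```shell
# Roughly what I ran, from an unpacked MRI source tree; versions and
# prefixes are illustrative. jemalloc 3.6.0 was built with profiling
# enabled and installed under its own prefix beforehand.
./configure --prefix="$HOME/ruby-2.2-jemalloc" \
            --with-jemalloc \
            --with-opt-dir=/usr/local/jemalloc-3.6.0
make -j4
make install
```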

You’ll also need --with-openssl-dir=/usr/local/opt/openssl for Homebrew.

This Ruby works as usual, but it reacts to the MALLOC_CONF environment variable:
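```shell
# Print full jemalloc statistics to stderr when the process exits
# (works only with the jemalloc-enabled Ruby built above):
MALLOC_CONF=stats_print:true ruby -e 'Array.new(100_000) { "x" * 50 }'
```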

When Ruby exits, you get quite detailed memory allocation stats printed to stderr. I used this setting to run puma overnight (with siege attacking it). By morning RSS had grown to 2 GB. Hitting Ctrl-C brought me to the stats, with the allocated-memory figure close to the same 2 GB. Eureka! The C code leak hypothesis was confirmed!

Profiling and more detailed stats

The next question was: where in the C code are those 2 GB allocated? Profiling helped here. The jemalloc profiler stores the addresses which called malloc, calculates how much memory has been allocated from each address, and writes everything to a dump. You enable it with the same MALLOC_CONF, specifying the prof:true flag; the final dump is then generated when the process exits. The dump is analyzed with the pprof program.
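The invocation was along these lines (paths are illustrative; this requires a jemalloc built with --enable-prof):

```shell
# A profile dump is written when the process exits.
MALLOC_CONF=prof:true,prof_final:true,prof_prefix:jeprof ruby app.rb

# Analyze the dump against the exact binary that produced it:
pprof --text /path/to/jemalloc-enabled/ruby jeprof.*.heap
```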

Unfortunately, pprof couldn’t decode addresses:

I had to subtract the start of the code segment (this information is printed when the process exits) from these numbers and use the info symbol command in gdb (e.g. info symbol 0x2b5f9). The address turned out to belong to the objspace_xmalloc function (it’s declared static; maybe that’s why it didn’t show up). A more representative profile, with puma being hit by siege for 2 hours, showed this function responsible for 97.9% of all allocated memory. So the leak does have something to do with Ruby after all!

Now more certain of my search area, I decided to investigate the statistical patterns of the allocated blocks. Not inclined to parse jemalloc’s textual stats output, I wrote my own gem, jemal. Its main functionality is the Jemal.stats method, which returns all the statistics of interest as one big hash.

All that was left was to add a small piece of code to the app:
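```ruby
# A sketch of that piece: a background thread appends Jemal.stats as
# one JSON document per line every few seconds. The ts field is used
# for plotting below.
if ENV["JEMALLOC_STATS"]
  require "jemal"
  require "json"

  Thread.new do
    File.open("log/jemalloc.log", "a") do |log|
      loop do
        stats = Jemal.stats                    # one big hash
        stats[:ts] = (Time.now.to_f * 1000).to_i
        log.puts(stats.to_json)
        log.flush
        sleep 5
      end
    end
  end
end
```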

…and run puma and siege overnight again.

By morning log/jemalloc.log was big enough to be analyzed. The jq tool proved extremely helpful here. First I decided to see how memory grows:
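```shell
# Assuming top-level ts and allocated keys in each logged JSON line
# (allocated is jemalloc's total of allocated bytes):
jq '.ts, .allocated' log/jemalloc.log | paste - - > allocated.dat
```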

Look how UNIX way works here! jq parses JSON in every line and outputs ts values and allocated values in turn:

Then paste - - joins its input line by line, separating two values with TAB:

Such a file can be fed to gnuplot:
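```gnuplot
# column 1 = timestamp (ms), column 2 = bytes allocated by jemalloc
set terminal png size 800,400
set output "allocated.png"
plot "allocated.dat" using 1:2 with lines title "allocated bytes"
```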

Linear growth! What’s with block sizes?

jq can extract data even from deeply nested structures.

The plot testifies that it’s small objects that are being leaked. But that’s not all: jemalloc keeps separate stats for the different block sizes within a class! In the small class, every allocated block has one of 28 fixed sizes (the requested size is simply rounded up): 8, 16, 32, 48, 64, 80, …, 256, 320, 384, …, 3584, and stats for every size are kept separately. Staring at the log, I noticed an anomaly with size 320. Let’s plot it as well:

Wow! Memory is consumed by objects of a single size. Everything else is just constant, which is evident from the lines being parallel. But what’s up with size 320? Apart from the total memory allocated, jemalloc tracks 8 more metrics per size, including the numbers of allocations (basically, malloc calls) and deallocations. Let’s plot them too:

The same metrics for blocks of the neighboring size, 256, are included for comparison. It’s apparent that the blue and orange curves blend together, which means that the number of deallocations of 256-byte blocks is approximately equal to the number of allocations (that’s what we call healthy!). Compare this to size 320, where the number of allocations (magenta curve) runs away from the number of deallocations (green curve). Which finally proves the existence of the memory leak.

Where’s the leak?

We’ve squeezed everything we could from stats. Now the leak source is to be found.

I had no better idea than adding some debug prints to gc.c:
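I no longer have the exact patch; it was in this spirit (the MAL/FREE tags and the size filter are my reconstruction):

```c
/* In objspace_xmalloc, just before returning the freshly allocated
 * block. Only the suspicious size class is logged to keep the noise
 * down: jemalloc's 320-byte bin serves requests of 257-320 bytes. */
if (size > 256 && size <= 320)
    fprintf(stderr, "MAL %p %zu\n", mem, size);

/* In objspace_xfree, before the block is released: */
if (size > 256 && size <= 320)
    fprintf(stderr, "FREE %p %zu\n", ptr, size);
```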

Not wanting to drown in information, I started running curl by hand instead of using siege. I was interested in how many blocks got leaked during the processing of one request. Yet another middleware was injected into the app:
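A sketch of that middleware (the real one read the counters through my jemal gem; here the stats source is injectable so the snippet is self-contained, and nmalloc/ndalloc stand for jemalloc's cumulative malloc/free counters):

```ruby
# Prints jemalloc's allocation/deallocation counters around every
# request, producing the "MAM BEFORE ..." lines mentioned below.
class MallocStatsLogger
  def initialize(app, io: $stderr, stats: -> { Jemal.stats })
    @app, @io, @stats = app, io, stats
  end

  def call(env)
    report("MAM BEFORE")
    response = @app.call(env)
    report("MAM AFTER")
    response
  end

  private

  def report(tag)
    s = @stats.call
    @io.puts "#{tag} nmalloc=#{s[:nmalloc]} ndalloc=#{s[:ndalloc]}"
  end
end
```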

I used an initializer (not config/application.rb) here to ensure that the middleware sits exactly at the top of the middleware stack.

Having run curl a few times, I saw that the M-D value increased by 80-90 with every request. The allocations and deallocations also appeared on stderr (and in the log, by virtue of tee). Cutting the last portion of the log between ------------------------- and MAM BEFORE ..., I ran it through this simple script:
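My reconstruction of it, assuming the debug prints produced lines like "MAL 0x7f1139dee000 312" and "FREE 0x7f1139dee000 312":

```ruby
# Record every allocation by address, drop it when that address is
# freed, and report what remains: the possibly leaked blocks.
def unmatched_allocations(lines)
  live = {}
  lines.each do |line|
    case line
    when /\AMAL (\S+) (\d+)/ then live[$1] = $2.to_i
    when /\AFREE (\S+)/      then live.delete($1)
    end
  end
  live
end

if __FILE__ == $0
  unmatched_allocations(ARGF).each do |addr, size|
    puts "#{addr}\t#{size}"
  end
end
```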

Now here they are, the possible leak addresses:

There are quite a lot of blocks of size 312. What does that mean? Nothing by itself! I wanted to look at the memory contents, so I used gdb to connect to the live process, took one of the addresses and looked at what was there:

Looks like a bunch of pointers. Intel is little-endian, so the most significant byte comes last, and the first line represents the number 0x7f1139dee134 (the 0x7f113… prefix made me believe it was an address). Helpful? Not much.

Then I wanted to see the backtraces of calls which allocated those blocks. Lazy googling revealed the following code which works with gcc:
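```c
/* The usual glibc/gcc recipe, close to what I found: print the current
 * call stack to stderr via the backtrace API from execinfo.h. Compile
 * with -rdynamic to see function names instead of bare addresses. */
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

static void
print_backtrace(void)
{
    void *frames[32];
    int i, n = backtrace(frames, 32);
    char **symbols = backtrace_symbols(frames, n);
    if (symbols == NULL)
        return;
    for (i = 0; i < n; i++)
        fprintf(stderr, "  %s\n", symbols[i]);
    free(symbols);
}
```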

I’ve added the call of this function to objspace_xmalloc:
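```c
/* (The function is named print_backtrace in my reconstruction.)
 * Again filtered by size, so only suspicious blocks produce traces: */
if (size > 256 && size <= 320)
    print_backtrace();
```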

Then I repeated everything again, ran the possible-leak detection script and looked up the found addresses in the log, where backtraces were now neatly printed next to them.

And what I saw was…

More than a dozen addresses related to redcarpet.so showed up. Now that’s who ate all our memory!!! It was the redcarpet gem, a Markdown-to-HTML renderer.

Verification and fix

It was easy after the offender was found. I ran 10,000 renderings in the console: leak confirmed. I ditched Rails, loaded the gem on its own and repeated the same thing: the leak was still there!
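The console check was roughly this (requires the redcarpet gem; every renderer instance leaks its C struct in the affected versions):

```ruby
require "redcarpet"

def rss_kb
  File.read("/proc/#{Process.pid}/status")[/VmRSS:\s*(\d+)/, 1].to_i
end

puts "before: #{rss_kb} KB"
10_000.times do
  Redcarpet::Markdown.new(Redcarpet::Render::HTML).render("# hello")
end
GC.start
puts "after:  #{rss_kb} KB"  # noticeably higher with the leaky version
```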

The only place where redcarpet’s native code allocates memory through the Ruby interface is the rb_redcarpet_rbase_alloc function, which is basically a C constructor for Redcarpet::Render::Base. The memory allocated there was never freed during garbage collection. Quick googling turned up an example of how to write such a constructor correctly in tenderlove’s blog, and the fix was simple.
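The essence of the fix, as a sketch rather than the literal patch: let Ruby own the struct and know how to free it.

```c
/* Before (simplified): the struct was malloc'ed but wrapped with no
 * free function, so the GC never released it. */
rndr = ALLOC(rb_rndr_t);
return Data_Wrap_Struct(klass, NULL, NULL, rndr);

/* After: give Ruby a free function, so the memory dies with the
 * object that wraps it. */
return Data_Make_Struct(klass, rb_rndr_t, NULL, xfree, rndr);
```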

Bingo!

Conclusion

1. It probably shouldn’t have taken 2 weeks. I’d spend much less time now.
2. False paths take you out of your way, and you have to find your way back. Apart from the dumps, I also spent some time trying to tune the garbage collector with environment variables. The only thing I kept was RUBY_GC_HEAP_INIT_SLOTS=1000000: with this setting our app fits into the initial heap and doesn’t need more slots.
3. It seems you can debug anything these days. The number of helpful tools and libraries is incredible. If you don’t succeed, just try more of them.

P.S.

The redcarpet maintainers have done nothing about my pull request so far (as of Sep 21, 2015), but we deployed the fix to production long ago. Compare the leaky picture (the same as at the beginning of this post)

with normal memory consumption graph:

There’s no trend at all. As it should be.

Update on 2015/09/29. Redcarpet fix has been released.
