rperf guide
A safepoint-based sampling performance profiler for Ruby.
rperf profiles your Ruby programs by sampling at safepoints and using actual time deltas as weights to correct safepoint bias. It provides a perf-like CLI, a Ruby API, and outputs pprof, collapsed stacks, and text report formats.
Key features:
- CPU and wall-clock profiling modes
- Time-weighted sampling that corrects safepoint bias
- GVL contention and GC phase tracking
- pprof, collapsed stacks, and text output formats
- Low overhead (sampling callback cost < 0.2% at default 1000 Hz)
- Requires Ruby >= 3.4.0 on POSIX systems (Linux, macOS)
Getting Started
This part covers what rperf is, why it exists, and how to install it on your system.
1. Introduction
1.1 What is rperf?
rperf is a safepoint-based sampling performance profiler for Ruby. It helps you find where your Ruby program spends its time — whether that’s CPU computation, I/O waits, GVL contention, or garbage collection.
Unlike traditional sampling profilers that count samples uniformly, rperf uses actual time deltas (in nanoseconds) as weights for each sample. This corrects the safepoint bias problem inherent in postponed job-based sampling, producing more accurate results.
rperf is inspired by Linux perf[Arnaldo, 2010], providing a familiar CLI with subcommands like record, stat, and report.
1.1.1 Advantages
- Accurate profiling: Time-delta weighting corrects safepoint bias, producing results closer to real time distribution than traditional count-based profilers.
- GVL / GC visibility: In wall mode, tracks off-GVL blocking, GVL contention, and GC marking/sweeping as distinct frames — no separate tool needed.
- Standard output formats: Outputs pprof protobuf (compatible with go tool pprof), collapsed stacks (for flame graphs / speedscope), and human-readable text.
- Low overhead: Default 1000 Hz sampling callback cost is < 0.2%, suitable for production.
- Simple CLI: rperf stat for a quick overview, rperf record + rperf report for detailed analysis.
1.1.2 Limitations
- Ruby 3.4+ only: Requires APIs introduced in Ruby 3.4.
- POSIX only: Linux and macOS. Windows is not supported.
- Method-level granularity: No line-number resolution — profiles show method names only.
- Single session: Only one profiling session can be active at a time (global state in the C extension).
- Safepoint latency: Samples are still deferred to safepoints. The time-delta weighting corrects the bias but cannot recover the exact interrupted instruction pointer.
1.2 Why another profiler?
Ruby already has profiling tools like stackprof[Aman, 2014]. So why rperf?
1.2.1 The safepoint bias problem
Most Ruby sampling profilers collect backtraces by calling rb_profile_frames directly in the signal handler. This approach yields backtraces at the actual signal timing, but relies on undocumented internal VM state — rb_profile_frames is not guaranteed to be async-signal-safe, and the results can be unreliable if the VM is in the middle of updating its internal structures.
rperf takes a different approach: it uses the postponed job mechanism (rb_postponed_job), which is the Ruby VM’s official API for safely deferring work from signal handlers. Backtrace collection is deferred to the next safepoint — a point where the VM is in a consistent state and rb_profile_frames can return reliable results. The trade-off is that when a timer fires between safepoints, the actual sample is delayed until the next safepoint.
If each sample were counted equally (weight = 1), a sample delayed by a long-running C method would get the same weight as one taken immediately. This is the safepoint bias[Todd, 2010] problem: functions that happen to be running when the thread reaches a safepoint appear more often than they should, while functions between safepoints are under-represented.
mermaid source
sequenceDiagram
participant Timer
participant VM
participant Profiler
Timer->>VM: Signal (t=0ms)
Note over VM: Running C method...<br/>cannot stop yet
VM->>Profiler: Safepoint reached (t=3ms)
Note over Profiler: Traditional: weight = 1<br/>rperf: weight = 3ms
Timer->>VM: Signal (t=1ms)
VM->>Profiler: Safepoint reached (t=1ms)
Note over Profiler: Traditional: weight = 1<br/>rperf: weight = 1ms
rperf solves this by recording clock_now - clock_prev as the weight of each sample. A sample delayed by 3ms gets 3x the weight of a 1ms sample, accurately reflecting where time was actually spent.
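The effect of this weighting can be sketched without rperf itself. The snippet below simulates three safepoint samples as [leaf_method, delta_ns] pairs — an illustrative shape, not rperf's internal representation:

```ruby
# Gem-free illustration of time-delta weighting (not rperf internals).
samples = [
  ["slow_c_method", 3_000_000], # safepoint reached 3 ms after the timer fired
  ["fast_method",   1_000_000], # safepoint reached on schedule
  ["fast_method",   1_000_000],
]

counts  = Hash.new(0) # traditional profiler: every sample weighs 1
weights = Hash.new(0) # rperf-style: weight = elapsed nanoseconds

samples.each do |leaf, delta_ns|
  counts[leaf]  += 1
  weights[leaf] += delta_ns
end

# Count-based, slow_c_method gets 1 of 3 samples (33%); time-weighted,
# it gets 3 of 5 ms (60%) — matching where wall time was actually spent.
```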
1.2.2 Other advantages
- GVL and GC awareness: In wall mode, rperf tracks time spent blocked off the GVL, waiting to reacquire the GVL, and in GC marking/sweeping phases — each as distinct synthetic frames.
- perf-like CLI: The rperf stat command gives you a quick performance overview (like perf stat), while rperf record + rperf report gives you detailed profiling.
- Standard output: rperf outputs pprof protobuf format, compatible with Go’s pprof tool ecosystem, as well as collapsed stacks for flame graphs[Brendan, 2016] and speedscope.
- Low overhead: Default 1000 Hz sampling callback cost is < 0.2%, suitable for production use.
1.3 Requirements
- Ruby >= 3.4.0
- POSIX system (Linux or macOS)
- Go (optional, for rperf report and rperf diff subcommands)
1.4 Quick start
Profile a Ruby script and view the results:
# Install rperf
gem install rperf
# Quick performance overview
rperf stat ruby my_app.rb
# Record a profile and view it interactively
rperf record ruby my_app.rb
rperf report
Or use rperf from Ruby code:
require "rperf"
Rperf.start(output: "profile.pb.gz") do
# code to profile
end
2. Installation
2.1 Installing the gem
rperf is distributed as a Ruby gem with a C extension:
gem install rperf
Or add it to your Gemfile:
gem "rperf"
Then run:
bundle install
2.2 Verifying the installation
Check that rperf is installed correctly:
rperf --help
You should see:
Usage: rperf record [options] command [args...]
rperf stat [options] command [args...]
rperf report [options] [file]
rperf diff [options] base.pb.gz target.pb.gz
rperf help
Run 'rperf help' for full documentation
2.3 Platform support
rperf supports POSIX systems:
| Platform | Timer implementation | Notes |
|---|---|---|
| Linux | timer_create + signal (default) | Best precision (~1000us at 1000Hz) |
| Linux | nanosleep thread (with signal: false) | Fallback, ~100us drift/tick |
| macOS | nanosleep thread | Signal-based timer not available |
On Linux, rperf uses timer_create with SIGEV_SIGNAL and a sigaction handler by default. This provides precise interval timing with no extra thread. The signal number defaults to SIGRTMIN+8 and can be changed via the signal: keyword argument to Rperf.start in the Ruby API.
On macOS (and when signal: false is set on Linux), rperf falls back to a dedicated pthread with a nanosleep loop.
2.4 Optional: Go toolchain
The rperf report and rperf diff subcommands are thin wrappers around go tool pprof. If you want to use these commands, install Go from go.dev.
Without Go, you can still use all other rperf features. You can view pprof files with other tools like speedscope or generate text/collapsed output directly.
How to Use
This part explains how to profile Ruby programs with rperf — from the command line and as a library — and describes the available output formats.
3. CLI Usage
rperf provides a perf-like command-line interface with five main subcommands: record, stat, exec, report, and diff.
3.1 rperf stat
rperf stat is the quickest way to get a performance overview. It runs your command with wall-mode profiling and prints a summary to stderr.
rperf stat ruby my_app.rb
3.1.1 Example: CPU-bound program
Here is a simple Fibonacci computation:
# fib.rb
def fib(n)
return n if n <= 1
fib(n - 1) + fib(n - 2)
end
fib(35)
Running rperf stat:
rperf stat ruby fib.rb
Performance stats for 'ruby fib.rb':
744.4 ms user
32.0 ms sys
491.0 ms real
481.0 ms 100.0% CPU execution
12.0 ms [Ruby] GC time (9 count: 6 minor, 3 major)
154,468 [Ruby] allocated objects
66,596 [Ruby] freed objects
25 MB [OS] peak memory (maxrss)
31 [OS] context switches (10 voluntary, 21 involuntary)
0 MB [OS] disk I/O (0 MB read, 0 MB write)
481 samples / 481 triggers, 0.1% profiler overhead
The output tells you:
- user/sys/real: Standard timing (like the time command)
- Time breakdown: Where wall time was spent — CPU execution, GVL blocked (I/O/sleep), GVL wait (contention), GC marking, GC sweeping
- Ruby stats: GC count, allocated/freed objects, YJIT ratio (if enabled)
- OS stats: Peak memory, context switches, disk I/O
Use --report to add flat and cumulative top-50 function tables to the output.
3.1.2 Example: Mixed CPU and I/O
# mixed.rb
def cpu_work(n)
sum = 0
n.times { |i| sum += i * i }
sum
end
def io_work
sleep(0.05)
end
5.times do
cpu_work(500_000)
io_work
end
Running rperf stat:
rperf stat ruby mixed.rb
Because stat always uses wall mode, you can see how time is divided between CPU and I/O. The [Ruby] GVL blocked line shows time spent sleeping/in I/O, while CPU execution shows compute time.
3.1.3 stat options
rperf stat [options] command [args...]
| Option | Description |
|---|---|
| -o PATH | Also save profile to file (default: none) |
| -f HZ | Sampling frequency in Hz (default: 1000) |
| -m MODE | cpu or wall (default: wall) |
| --report | Include flat/cumulative profile tables in output |
| -v | Print additional sampling statistics |
3.2 rperf exec
rperf exec runs a command with profiling and prints a full performance report to stderr — equivalent to rperf stat --report. It uses wall mode by default and does not save to a file.
rperf exec ruby my_app.rb
This prints everything stat shows (timing, time breakdown, GC/memory/OS stats) plus flat and cumulative top-50 function tables.
3.2.1 exec options
rperf exec [options] command [args...]
| Option | Description |
|---|---|
-o PATH |
Also save profile to file (default: none) |
-f HZ |
Sampling frequency in Hz (default: 1000) |
-m MODE |
cpu or wall (default: wall) |
-v |
Print additional sampling statistics |
3.3 rperf record
rperf record profiles a command and saves the result to a file. This is the primary way to capture profiles for detailed analysis.
rperf record ruby my_app.rb
By default, it saves to rperf.data in pprof format with CPU mode.
3.3.1 Example: Recording a profile
rperf record ruby fib.rb
This creates rperf.data. You can then analyze it with rperf report or other pprof-compatible tools.
3.3.2 Choosing a profiling mode
rperf supports two profiling modes:
- cpu (default): Measures per-thread CPU time. Best for finding functions that consume CPU cycles. Ignores time spent sleeping, in I/O, or waiting for the GVL.
- wall: Measures wall-clock time. Best for finding where wall time goes, including I/O, sleep, and GVL contention.
# CPU mode (default)
rperf record ruby my_app.rb
# Wall mode
rperf record -m wall ruby my_app.rb
3.3.3 Choosing an output format
rperf auto-detects the format from the file extension:
# pprof format (default)
rperf record -o profile.pb.gz ruby my_app.rb
# Collapsed stacks (for FlameGraph / speedscope)
rperf record -o profile.collapsed ruby my_app.rb
# Human-readable text
rperf record -o profile.txt ruby my_app.rb
You can also force the format explicitly:
rperf record --format text -o profile.dat ruby my_app.rb
3.3.4 Example: Text output
rperf record -o profile.txt ruby fib.rb
The text output looks like this:
Total: 509.5ms (cpu)
Samples: 509, Frequency: 1000Hz
Flat:
509.5 ms 100.0% Object#fib (fib.rb)
Cumulative:
509.5 ms 100.0% Object#fib (fib.rb)
509.5 ms 100.0% <main> (fib.rb)
3.3.5 Example: Wall mode text output
rperf record -m wall -o wall_profile.txt ruby mixed.rb
Total: 311.8ms (wall)
Samples: 80, Frequency: 1000Hz
Flat:
250.6 ms 80.4% [GVL blocked] (<GVL>)
44.1 ms 14.1% Object#cpu_work (mixed.rb)
13.9 ms 4.5% Integer#times (<internal:numeric>)
3.2 ms 1.0% Kernel#sleep (<C method>)
0.0 ms 0.0% [GVL wait] (<GVL>)
Cumulative:
311.8 ms 100.0% Integer#times (<internal:numeric>)
311.8 ms 100.0% block in <main> (mixed.rb)
311.8 ms 100.0% <main> (mixed.rb)
253.8 ms 81.4% Kernel#sleep (<C method>)
253.8 ms 81.4% Object#io_work (mixed.rb)
250.6 ms 80.4% [GVL blocked] (<GVL>)
58.0 ms 18.6% Object#cpu_work (mixed.rb)
0.0 ms 0.0% [GVL wait] (<GVL>)
In wall mode, [GVL blocked] appears as the dominant cost — this is the sleep time in io_work. The CPU time for cpu_work is clearly separated.
3.3.6 Verbose output
The -v flag prints sampling statistics to stderr during profiling:
rperf record -v ruby my_app.rb
[rperf] mode=cpu frequency=1000Hz
[rperf] sampling: 98 calls, 0.11ms total, 1.1us/call avg
[rperf] samples recorded: 904
[rperf] top 10 by flat:
[rperf] 53.4ms 50.1% Object#cpu_work (-e)
[rperf] 17.0ms 15.9% Integer#times (<internal:numeric>)
...
3.3.7 record options
rperf record [options] command [args...]
| Option | Description |
|---|---|
| -o PATH | Output file (default: rperf.data) |
| -f HZ | Sampling frequency in Hz (default: 1000) |
| -m MODE | cpu or wall (default: cpu) |
| --format FMT | pprof, collapsed, or text (default: auto from extension) |
| -p, --print | Print text profile to stdout (same as --format=text --output=/dev/stdout) |
| -v | Print sampling statistics to stderr |
3.4 rperf report
rperf report opens a pprof profile for analysis. It wraps go tool pprof and requires Go to be installed.
# Open interactive web UI (default)
rperf report
# Open a specific file
rperf report profile.pb.gz
# Print top functions
rperf report --top
# Print pprof text summary
rperf report --text
3.4.1 Example: Top and text output
Using the fib.rb profile recorded earlier:
rperf report --top rperf.data
Type: cpu
Showing nodes accounting for 577.31ms, 100% of 577.31ms total
flat flat% sum% cum cum%
577.31ms 100% 100% 577.31ms 100% Object#fib
0 0% 100% 577.31ms 100% <main>
The default behavior (without --top or --text) opens an interactive web UI in your browser with flame graphs, top function views, and call graph visualizations powered by pprof[Gang, 2010].
3.4.2 report options
| Option | Description |
|---|---|
| --top | Print top functions by flat time |
| --text | Print pprof text summary |
| (default) | Open interactive web UI in browser |
3.5 rperf diff
rperf diff compares two pprof profiles, showing the difference (target - base). This is useful for measuring the impact of optimizations.
# Open diff in browser
rperf diff before.pb.gz after.pb.gz
# Print top functions by diff
rperf diff --top before.pb.gz after.pb.gz
# Print text diff
rperf diff --text before.pb.gz after.pb.gz
3.5.1 Workflow example
# Profile the baseline
rperf record -o before.pb.gz ruby my_app.rb
# Make your optimization changes...
# Profile again
rperf record -o after.pb.gz ruby my_app.rb
# Compare
rperf diff before.pb.gz after.pb.gz
3.6 rperf help
rperf help prints the full reference documentation, including profiling modes, output formats, synthetic frames, and diagnostic tips.
rperf help
This outputs detailed documentation suitable for both human reading and AI-assisted analysis.
4. Ruby API
rperf provides a Ruby API for programmatic profiling. This is useful when you want to profile specific sections of code, integrate profiling into test suites, or build custom profiling workflows.
4.1 Basic usage
4.1.1 Block form (recommended)
The simplest way to use rperf is with the block form of Rperf.start. It profiles the block and returns the profiling data:
require "rperf"
data = Rperf.start(output: "profile.pb.gz", frequency: 1000, mode: :cpu) do
# code to profile
end
When output: is specified, the profile is automatically written to the file when the block finishes. The method also returns the raw data hash for further processing.
4.1.2 Example: Profiling a Fibonacci function
require "rperf"
def fib(n)
return n if n <= 1
fib(n - 1) + fib(n - 2)
end
data = Rperf.start(frequency: 1000, mode: :cpu) do
fib(33)
end
Rperf.save("profile.txt", data)
Running this produces:
Total: 192.7ms (cpu)
Samples: 192, Frequency: 1000Hz
Flat:
192.7ms 100.0% Object#fib (example.rb)
Cumulative:
192.7ms 100.0% Object#fib (example.rb)
192.7ms 100.0% block in <main> (example.rb)
192.7ms 100.0% Rperf.start (lib/rperf.rb)
192.7ms 100.0% <main> (example.rb)
4.1.3 Manual start/stop
For cases where block form is awkward, you can manually start and stop profiling:
require "rperf"
Rperf.start(frequency: 1000, mode: :wall)
# ... code to profile ...
data = Rperf.stop
Rperf.stop returns the data hash, or nil if the profiler was not running.
4.2 Rperf.start parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| frequency: | Integer | 1000 | Sampling frequency in Hz |
| mode: | Symbol | :cpu | :cpu or :wall |
| output: | String | nil | File path to write on stop |
| verbose: | Boolean | false | Print statistics to stderr |
| format: | Symbol | nil | :pprof, :collapsed, :text, or nil (auto-detect from output extension) |
| signal: | Integer/Boolean | nil | Linux only: nil = timer signal (default), false = nanosleep thread, positive integer = specific RT signal number |
| aggregate: | Boolean | true | Aggregate identical stacks during profiling to reduce memory. false returns raw per-sample data |
4.3 Rperf.stop return value
Rperf.stop returns nil if the profiler was not running. Otherwise it returns a Hash:
{
mode: :cpu, # or :wall
frequency: 1000,
sampling_count: 1234, # number of timer callbacks
sampling_time_ns: 56789, # total time spent sampling (overhead)
trigger_count: 1234, # number of timer triggers
detected_thread_count: 4, # threads seen during profiling
start_time_ns: 17740..., # CLOCK_REALTIME epoch nanos
duration_ns: 10000000, # profiling duration in nanos
unique_frames: 42, # unique frame count (aggregate: true only)
unique_stacks: 120, # unique stack count (aggregate: true only)
samples: [ # Array of [frames, weight, thread_seq, label_set_id]
[frames, weight, seq, lsi], # frames: [[path, label], ...] deepest-first
... # weight: Integer (nanoseconds)
], # seq: Integer (thread sequence, 1-based)
# lsi: Integer (label set ID, 0 = no labels)
label_sets: [{}, {request: "abc"}], # label set table (index = label_set_id, present when labels used)
}
Each sample has:
- frames: An array of [path, label] pairs, ordered deepest-first (leaf frame at index 0)
- weight: Time in nanoseconds attributed to this sample
- thread_seq: Thread sequence number (1-based, assigned per profiling session)
- label_set_id: Label set ID (0 = no labels). Index into the label_sets array
When aggregate: true (default), identical stacks are merged and their weights summed. The samples array contains one entry per unique (stack, thread_seq, label_set_id) combination. When aggregate: false, every raw sample is returned individually.
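As a sketch of consuming this shape, the hash below is hand-written to mirror the documented layout (it is not output from a real session):

```ruby
# Hand-built hash mirroring the documented Rperf.stop shape (illustrative).
data = {
  label_sets: [{}, { request: "abc" }],
  samples: [
    # [frames, weight_ns, thread_seq, label_set_id]; frames deepest-first
    [[["app.rb", "Object#handle"], ["app.rb", "<main>"]], 5_000_000, 1, 1],
    [[["app.rb", "<main>"]],                              1_000_000, 1, 0],
  ],
}

data[:samples].each do |frames, weight_ns, seq, lsi|
  labels = data[:label_sets][lsi] # resolve the label set ID to a Hash
  leaf   = frames.first.last      # deepest frame's method label
  puts "#{leaf} thread=#{seq} #{weight_ns / 1_000_000.0}ms labels=#{labels.inspect}"
end
```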
4.4 Rperf.save
Rperf.save writes profiling data to a file in any supported format:
Rperf.save("profile.pb.gz", data) # pprof format
Rperf.save("profile.collapsed", data) # collapsed stacks
Rperf.save("profile.txt", data) # text report
The format is auto-detected from the file extension. You can override it with the format: keyword:
Rperf.save("output.dat", data, format: :text)
4.5 Rperf.snapshot
Rperf.snapshot returns the current profiling data without stopping. Only works in aggregate mode (the default). Returns nil if not profiling.
Rperf.start(frequency: 1000)
# ... work ...
snap = Rperf.snapshot
Rperf.save("snap.pb.gz", snap)
# ... more work (profiling continues) ...
data = Rperf.stop
With clear: true, the aggregated data is reset after taking the snapshot. This enables interval-based profiling where each snapshot covers only the period since the last clear:
Rperf.start(frequency: 1000)
loop do
sleep 10
snap = Rperf.snapshot(clear: true)
Rperf.save("profile-#{Time.now.to_i}.pb.gz", snap)
end
4.6 Sample labels
Rperf.label attaches key-value labels to the current thread’s samples. Labels appear in pprof sample labels, enabling per-context filtering (e.g., per-request profiling). If profiling is not running, label is silently ignored — safe to call unconditionally (e.g., from Rack middleware).
4.6.1 Block form
With a block, labels are automatically restored when the block exits — even if an exception is raised:
Rperf.label(request: "abc-123", endpoint: "/api/users") do
handle_request # samples inside get these labels
end
# labels are restored to previous state here
4.6.2 Without block
Without a block, labels persist on the current thread until changed:
Rperf.label(request: "abc-123")
# all subsequent samples on this thread have request="abc-123"
4.6.3 Merging and deleting
New labels merge with existing ones. Set a value to nil to remove a key:
Rperf.label(request: "abc")
Rperf.label(phase: "db") # adds phase, keeps request
Rperf.labels #=> {request: "abc", phase: "db"}
Rperf.label(request: nil) # removes request
Rperf.labels #=> {phase: "db"}
4.6.4 Nested blocks
Each block creates a scope. When it exits, the labels are restored to the state before the block — regardless of what happened inside:
Rperf.label(request: "abc") do
Rperf.label(phase: "db") do
Rperf.labels #=> {request: "abc", phase: "db"}
end
Rperf.labels #=> {request: "abc"}
end
Rperf.labels #=> {}
4.6.5 Filtering by label in pprof
Labels are written into pprof sample labels. Use go tool pprof to filter:
# Filter to specific label value
go tool pprof -tagfocus=request=abc-123 profile.pb.gz
# Group by label at stack root ("which requests are slow?")
go tool pprof -tagroot=request profile.pb.gz
# Group by label at stack leaf ("who calls this function?")
go tool pprof -tagleaf=request profile.pb.gz
# Exclude specific label value
go tool pprof -tagignore=request=healthcheck profile.pb.gz
4.6.6 Reading labels
Rperf.labels returns the current thread’s labels as a Hash:
Rperf.labels #=> {request: "abc", phase: "db"}
Returns an empty Hash if no labels are set.
4.7 Practical examples
4.7.1 Profiling a web request handler
require "rperf"
class ApplicationController
def profile_action
data = Rperf.start(mode: :wall) do
# Simulate a typical request
users = User.where(active: true).limit(100)
result = users.map { |u| serialize_user(u) }
render json: result
end
Rperf.save("request_profile.txt", data)
end
end
Using wall mode here captures not just CPU time but also database I/O and any GVL contention.
4.7.2 Comparing CPU and wall profiles
require "rperf"
def workload
# Mix of CPU and I/O
100.times do
compute_something
sleep(0.001)
end
end
# CPU profile: shows where CPU cycles go
cpu_data = Rperf.start(mode: :cpu) { workload }
Rperf.save("cpu.txt", cpu_data)
# Wall profile: shows where wall time goes
wall_data = Rperf.start(mode: :wall) { workload }
Rperf.save("wall.txt", wall_data)
The CPU profile will focus on compute_something, while the wall profile will show the sleep calls as [GVL blocked] time.
4.7.3 Processing samples
You can work with the sample data programmatically. By default, samples are aggregated (identical stacks merged):
require "rperf"
data = Rperf.start(mode: :cpu) { workload }
# data[:aggregated_samples] contains aggregated entries (one per unique stack)
# Find the hottest function
flat = Hash.new(0)
data[:aggregated_samples].each do |frames, weight, thread_seq|
leaf_label = frames.first&.last # frames[0] is the leaf
flat[leaf_label] += weight
end
top = flat.sort_by { |_, w| -w }.first(5)
top.each do |label, weight_ns|
puts "#{label}: #{weight_ns / 1_000_000.0}ms"
end
To get raw (non-aggregated) per-sample data, pass aggregate: false. Each timer tick produces a separate entry:
data = Rperf.start(mode: :cpu, aggregate: false) { workload }
# data[:raw_samples] contains one entry per timer sample
data[:raw_samples].each do |frames, weight, thread_seq|
puts "thread=#{thread_seq} weight=#{weight}ns depth=#{frames.size}"
end
4.7.4 Generating collapsed stacks for FlameGraph
require "rperf"
data = Rperf.start(mode: :cpu) { workload }
Rperf.save("profile.collapsed", data)
The collapsed format is one line per unique stack, compatible with Brendan Gregg’s FlameGraph[Brendan, 2016] tools and speedscope:
frame1;frame2;...;leaf weight_ns
You can generate a flame graph SVG:
flamegraph.pl profile.collapsed > flamegraph.svg
Or open the .collapsed file directly in speedscope.
5. Interpreting Results
5.1 Flat vs. cumulative time
Understanding the difference between flat and cumulative time is essential for effective profiling; the distinction goes back to classic profiling tools such as gprof[Susan, 1982].
- Flat time: Time attributed directly to a function — it was the leaf (deepest frame) in the sample. High flat time means the function itself is doing expensive work.
- Cumulative time: Time for all samples where the function appears anywhere in the stack. High cumulative time means the function (or something it calls) is expensive.
mermaid source
graph TD
A["main()"] -->|calls| B["process()"]
B -->|calls| C["compute()"]
B -->|calls| D["format()"]
style C fill:#ff6b6b,color:#fff
style D fill:#ffa07a,color:#fff
If compute() has high flat time, optimize compute() itself. If process() has high cumulative but low flat time, the cost is in its children (compute() and format()).
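The distinction can be computed directly from weighted samples. A minimal sketch with made-up method names and nanosecond weights (frames deepest-first, as in rperf's sample data):

```ruby
samples = [
  # [frames (deepest-first), weight_ns] — illustrative stacks
  [["compute", "process", "main"], 7_000_000],
  [["format",  "process", "main"], 2_000_000],
  [["process", "main"],            1_000_000],
]

flat = Hash.new(0)
cum  = Hash.new(0)

samples.each do |frames, weight|
  flat[frames.first] += weight              # only the leaf gets flat time
  frames.uniq.each { |f| cum[f] += weight } # every frame gets cumulative time
end

flat # => {"compute"=>7000000, "format"=>2000000, "process"=>1000000}
cum["process"] # => 10000000 — high cumulative, low flat: cost is in its children
```

The uniq guards against double-counting a method that appears twice in one stack (recursion).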
5.2 Weight unit
All weights in rperf are in nanoseconds, regardless of profiling mode:
- 1,000 ns = 1 us (microsecond)
- 1,000,000 ns = 1 ms (millisecond)
- 1,000,000,000 ns = 1 s (second)
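A small helper (not part of rperf's API) for turning raw nanosecond weights into readable strings:

```ruby
# Hypothetical convenience helper — rperf itself reports in these units,
# but this formatter is our own.
def format_ns(ns)
  if    ns >= 1_000_000_000 then "%.1f s"  % (ns / 1_000_000_000.0)
  elsif ns >= 1_000_000     then "%.1f ms" % (ns / 1_000_000.0)
  elsif ns >= 1_000         then "%.1f us" % (ns / 1_000.0)
  else                           "#{ns} ns"
  end
end

format_ns(3_000_000) # => "3.0 ms"
format_ns(1_500)     # => "1.5 us"
```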
5.3 Synthetic frames
In addition to normal Ruby frames, rperf records synthetic frames that represent non-CPU activity. These always appear as the leaf (deepest) frame in a sample.
5.3.1 [GVL blocked]
Mode: wall only
Time the thread spent off the GVL — during I/O operations, sleep, or C extensions that release the GVL. This time is attributed to the stack captured at the SUSPENDED event (when the thread released the GVL).
High [GVL blocked] time indicates your program is I/O bound. Look at the cumulative view to find which functions are triggering the I/O.
5.3.2 [GVL wait]
Mode: wall only
Time the thread spent waiting to reacquire the GVL after becoming ready. This indicates GVL contention — another thread is holding the GVL while this thread wants to run.
High [GVL wait] time means your threads are serialized on the GVL. Consider reducing GVL-holding work, using Ractors, or moving work to child processes.
5.3.3 [GC marking]
Mode: cpu and wall
Time spent in the GC marking phase. Always measured in wall time. Attributed to the stack that triggered GC.
High [GC marking] time means too many live objects. Reduce the number of long-lived allocations.
5.3.4 [GC sweeping]
Mode: cpu and wall
Time spent in the GC sweeping phase. Always measured in wall time. Attributed to the stack that triggered GC.
High [GC sweeping] time means too many short-lived objects. Consider reusing objects or using object pools.
5.4 Diagnosing common problems
5.4.1 High CPU usage
Mode: cpu
Look for functions with high flat CPU time. These are the functions consuming CPU cycles.
Action: Optimize the hot function (better algorithm, caching) or call it less frequently.
5.4.2 Slow requests / high latency
Mode: wall
Look for functions with high cumulative wall time.
- If [GVL blocked] is dominant: I/O or sleep is the bottleneck. Check database queries, HTTP calls, file I/O.
- If [GVL wait] is dominant: GVL contention. Reduce GVL-holding work or move to Ractors/child processes.
5.4.3 GC pressure
Mode: cpu or wall
Look for [GC marking] and [GC sweeping] samples.
- High [GC marking]: Too many live objects. Reduce allocations of long-lived objects.
- High [GC sweeping]: Too many short-lived objects. Reuse or pool objects.
The rperf stat output also shows GC counts and allocated/freed object counts, which can help diagnose allocation-heavy code.
5.4.4 Multithreaded app slower than expected
Mode: wall
Look for [GVL wait] time across threads.
rperf stat ruby threaded_app.rb
Example output for a GVL-contended workload:
Performance stats for 'ruby threaded_app.rb':
89.9 ms user
14.0 ms sys
41.2 ms real
0.5 ms 1.2% CPU execution
9.8 ms 23.8% [Ruby] GVL blocked (I/O, sleep)
30.8 ms 75.0% [Ruby] GVL wait (contention)
Here, 75% of wall time is GVL contention. The four threads are fighting for the GVL, serializing their CPU work.
5.5 Tips for effective profiling
- Default frequency (1000 Hz) works well for most cases. For long-running production profiling, 100-500 Hz reduces overhead further.
- Profile representative workloads, not micro-benchmarks. Profiling a real request or batch job gives more actionable results than profiling a tight loop.
- Compare cpu and wall profiles to distinguish CPU-bound from I/O-bound code. If a function is hot in wall mode but not in CPU mode, it’s blocked on I/O.
- Use rperf diff to measure the impact of optimizations. Profile before and after, then compare.
- Use the verbose flag (-v) to check profiler overhead and see top functions immediately.
6. Output Formats
rperf supports three output formats. The format is auto-detected from the file extension, or can be set explicitly with the --format flag (CLI) or format: parameter (API).
6.1 pprof (default)
The pprof format is a gzip-compressed Protocol Buffers binary. This is the standard format used by Go’s pprof tooling and is the default output of rperf.
Extension convention: .pb.gz
How to view:
# Interactive web UI (requires Go)
rperf report profile.pb.gz
# Top functions
rperf report --top profile.pb.gz
# Text report
rperf report --text profile.pb.gz
# Or use go tool pprof directly
go tool pprof -http=:8080 profile.pb.gz
go tool pprof -top profile.pb.gz
You can also import pprof files into speedscope via its web interface.
Advantages: Standard format supported by a wide ecosystem of tools. Supports diff comparison between two profiles. Interactive exploration with flame graphs, call graphs, and source annotations.
6.1.1 Embedded metadata
rperf embeds the following metadata in each pprof profile:
| Field | Description |
|---|---|
| comment | rperf version, profiling mode, frequency, Ruby version |
| time_nanos | Profile collection start time (epoch nanoseconds) |
| duration_nanos | Profile duration (nanoseconds) |
| doc_url | Link to rperf documentation |
View comments with: go tool pprof -comments profile.pb.gz
6.1.2 Sample labels
Each sample carries a thread_seq numeric label — a thread sequence number (1-based) assigned when rperf first sees each thread during a profiling session. When Rperf.label is used, custom key-value string labels are also attached to samples.
# Group flame graph by thread
go tool pprof -tagroot=thread_seq profile.pb.gz
# Filter by custom label
go tool pprof -tagfocus=request=abc-123 profile.pb.gz
# Group by label at root ("which requests are slow?")
go tool pprof -tagroot=request profile.pb.gz
# Group by label at leaf ("who calls this function?")
go tool pprof -tagleaf=request profile.pb.gz
# Exclude by label
go tool pprof -tagignore=request=healthcheck profile.pb.gz
6.2 Collapsed stacks
The collapsed stacks format is a plain text format with one line per unique stack trace. Each line contains a semicolon-separated stack (bottom-to-top) followed by a space and the weight in nanoseconds.
Extension convention: .collapsed
Format:
bottom_frame;...;top_frame weight_ns
Example output:
<main>;Integer#times;block in <main>;Object#cpu_work;Integer#times;Object#cpu_work 53419170
<main>;Integer#times;block in <main>;Object#cpu_work;Integer#times 16962309
<main>;Integer#times;block in <main>;Object#io_work;Kernel#sleep 2335151
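Going the other way — from the documented samples shape to collapsed lines — is a reverse-and-join, since frames are stored deepest-first but collapsed stacks read bottom-to-top. A sketch with one hand-built sample:

```ruby
# One sample in the documented [frames, weight_ns, thread_seq, label_set_id]
# shape (hand-written for illustration); frames are deepest-first.
samples = [
  [[["app.rb", "Object#work"], ["app.rb", "<main>"]], 53_419_170, 1, 0],
]

lines = samples.map do |frames, weight_ns, _seq, _lsi|
  stack = frames.reverse.map { |_path, label| label }.join(";")
  "#{stack} #{weight_ns}"
end

lines.first # => "<main>;Object#work 53419170"
```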
How to use:
# Generate a FlameGraph SVG
rperf record -o profile.collapsed ruby my_app.rb
flamegraph.pl profile.collapsed > flamegraph.svg
# Open in speedscope (drag-and-drop the .collapsed file)
# macOS: open https://www.speedscope.app/
# Linux: xdg-open https://www.speedscope.app/
Advantages: Simple text format, easy to process with command-line tools. Compatible with Brendan Gregg’s FlameGraph[Brendan, 2016] tools and speedscope.
6.2.1 Parsing collapsed stacks programmatically
File.readlines("profile.collapsed").each do |line|
stack, weight = line.rpartition(" ").then { |s, _, w| [s, w.to_i] }
frames = stack.split(";")
# frames[0] is bottom (main), frames[-1] is leaf (hot)
puts "#{frames.last}: #{weight / 1_000_000.0}ms"
end
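The per-line loop above extends naturally to an aggregate view. As a sketch (the `leaf_self_times` helper is a name invented here, not part of rperf), this sums self time per leaf frame across all stacks:

```ruby
# Sum self time per leaf frame across collapsed-stack lines.
# Each line: "bottom;...;leaf weight_ns"
def leaf_self_times(lines)
  totals = Hash.new(0)
  lines.each do |line|
    stack, _, weight = line.strip.rpartition(" ")
    totals[stack.split(";").last] += weight.to_i
  end
  totals.sort_by { |_, ns| -ns }  # hottest leaves first
end

lines = [
  "<main>;Object#cpu_work 53419170",
  "<main>;Object#io_work;Kernel#sleep 2335151",
  "<main>;Object#cpu_work 1000000",
]
leaf_self_times(lines).each do |frame, ns|
  puts format("%10.2f ms  %s", ns / 1_000_000.0, frame)
end
```

With real data, pass `File.foreach("profile.collapsed")` instead of the inline array.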
6.3 Text report
The text format is a human-readable (and AI-readable) report showing flat and cumulative top-N tables.
Extension convention: .txt
Example output:
Total: 509.5ms (cpu)
Samples: 509, Frequency: 1000Hz
Flat:
509.5 ms 100.0% Object#fib (fib.rb)
Cumulative:
509.5 ms 100.0% Object#fib (fib.rb)
509.5 ms 100.0% <main> (fib.rb)
Sections:
- Header: Total profiled time, sample count, and frequency
- Flat table: Functions sorted by self time (the function was the leaf/deepest frame)
- Cumulative table: Functions sorted by total time (the function appeared anywhere in the stack)
Advantages: No tooling required — readable with cat. Top 50 entries per table by default. Good for quick analysis, sharing in issue reports, or feeding to AI assistants.
6.4 Format comparison
| Feature | pprof | collapsed | text |
|---|---|---|---|
| File size | Small (binary + gzip) | Medium (text) | Small (text) |
| Flame graph | Yes (via pprof web UI) | Yes (via flamegraph.pl) | No |
| Call graph | Yes | No | No |
| Diff comparison | Yes (rperf diff) | No | No |
| No tools needed | No (requires Go) | No (requires flamegraph.pl) | Yes |
| Programmatic parsing | Complex (protobuf) | Simple | Simple |
| AI-friendly | No | Yes | Yes |
6.5 Auto-detection rules
| File extension | Format |
|---|---|
| .collapsed | Collapsed stacks |
| .txt | Text report |
| Anything else | pprof |
The default output file (rperf.data) uses pprof format.
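These rules amount to a small dispatch on the file extension. A sketch (the `detect_format` name is invented for illustration):

```ruby
# Mirror rperf's output-format auto-detection: the extension decides,
# with pprof as the fallback (including the default rperf.data).
def detect_format(path)
  case File.extname(path)
  when ".collapsed" then :collapsed
  when ".txt"       then :text
  else                   :pprof
  end
end

detect_format("rperf.data")        # => :pprof
detect_format("profile.collapsed") # => :collapsed
detect_format("report.txt")        # => :text
```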
7. Framework Integration
rperf provides optional integrations that automatically label samples with context from web frameworks and job processors. These integrations only set labels — they do not start or stop profiling. Start profiling separately (e.g., in an initializer).
7.1 Rack middleware
Rperf::Middleware labels each request with its endpoint (METHOD /path).
require "rperf/middleware"
7.1.1 Rails
# config/initializers/rperf.rb
require "rperf/middleware"
Rperf.start(mode: :wall, frequency: 99)
Rails.application.config.middleware.use Rperf::Middleware
at_exit do
data = Rperf.stop
Rperf.save("tmp/profile.pb.gz", data) if data
end
Then filter the profile by endpoint:
go tool pprof -tagfocus=endpoint="GET /api/users" tmp/profile.pb.gz
go tool pprof -tagroot=endpoint tmp/profile.pb.gz # group by endpoint
7.1.2 Sinatra
require "sinatra"
require "rperf/middleware"
Rperf.start(mode: :wall, frequency: 99)
use Rperf::Middleware
at_exit do
data = Rperf.stop
Rperf.save("profile.pb.gz", data) if data
end
get "/hello" do
"Hello, world!"
end
7.1.3 Customizing the label key
By default the middleware uses the label key :endpoint. You can change it:
use Rperf::Middleware, label_key: :route
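Conceptually, such a middleware just wraps each request in a label scope. The sketch below is a simplified stand-in, not rperf's actual class: `EndpointLabeler` and the `sink` hash model what `Rperf::Middleware` does with `Rperf.label`:

```ruby
# A simplified model of an endpoint-labeling Rack middleware.
# "sink" stands in for the profiler's label store; the key name is
# configurable, mirroring the label_key option.
class EndpointLabeler
  def initialize(app, sink, label_key: :endpoint)
    @app, @sink, @label_key = app, sink, label_key
  end

  def call(env)
    # Label the request as "METHOD /path" for the duration of the call
    @sink[@label_key] = "#{env["REQUEST_METHOD"]} #{env["PATH_INFO"]}"
    @app.call(env)
  ensure
    @sink.delete(@label_key)  # labels must not leak into the next request
  end
end

sink = {}
app  = ->(env) { [200, {}, [sink[:endpoint]]] }  # sees the label while active
status, _headers, body = EndpointLabeler.new(app, sink).call(
  "REQUEST_METHOD" => "GET", "PATH_INFO" => "/hello"
)
```

The `ensure` block is the important part: the label is scoped to the request, so samples taken between requests carry no stale endpoint.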
7.2 Active Job
Rperf::ActiveJobMiddleware labels each job with its class name (e.g., SendEmailJob). Works with any Active Job backend — Sidekiq, GoodJob, Solid Queue, etc.
require "rperf/active_job"
Include it in your base job class:
# app/jobs/application_job.rb
class ApplicationJob < ActiveJob::Base
include Rperf::ActiveJobMiddleware
end
All subclasses inherit the label automatically:
class SendEmailJob < ApplicationJob
def perform(user)
# samples here get job="SendEmailJob"
end
end
Filter by job:
go tool pprof -tagfocus=job=SendEmailJob profile.pb.gz
go tool pprof -tagroot=job profile.pb.gz # group by job class
7.3 Sidekiq
Rperf::SidekiqMiddleware labels each job with its worker class name. This covers both Active Job-backed workers and plain Sidekiq workers.
require "rperf/sidekiq"
Register it as a Sidekiq server middleware:
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
config.server_middleware do |chain|
chain.add Rperf::SidekiqMiddleware
end
end
Note
If you use Active Job with Sidekiq, choose one or the other — using both will result in duplicate labels. The Sidekiq middleware is more general (covers non-Active Job workers too).
7.4 Full Rails example
A typical Rails setup with both web and job profiling:
# config/initializers/rperf.rb
require "rperf/middleware"
require "rperf/sidekiq"
Rperf.start(mode: :wall, frequency: 99)
# Label web requests
Rails.application.config.middleware.use Rperf::Middleware
# Label Sidekiq jobs
Sidekiq.configure_server do |config|
config.server_middleware do |chain|
chain.add Rperf::SidekiqMiddleware
end
end
# Export profiles periodically
Thread.new do
loop do
sleep 60
snap = Rperf.snapshot(clear: true)
Rperf.save("tmp/profile-#{Time.now.to_i}.pb.gz", snap) if snap
end
end
Then compare where time goes across endpoints and jobs:
go tool pprof -tagroot=endpoint tmp/profile-*.pb.gz # web breakdown
go tool pprof -tagroot=job tmp/profile-*.pb.gz # job breakdown
Deep Dive
This part explores how to read and act on profiling results, and explains how rperf works under the hood.
8. Architecture Overview
This chapter describes rperf’s high-level architecture and the core data structures that underpin profiling. Understanding these internals helps you interpret edge cases in profiling results and appreciate the design trade-offs.
8.1 System diagram
rperf consists of a C extension and a Ruby wrapper:
mermaid source
graph TD
subgraph "C Extension (ext/rperf/rperf.c)"
Timer["Timer<br/>(signal or pthread_cond_timedwait)"]
PJ["rb_postponed_job_trigger"]
Sample["Sample callback<br/>(rperf_sample_job)"]
ThreadHook["Thread event hook<br/>(SUSPENDED/READY/RESUMED/EXITED)"]
GCHook["GC event hook<br/>(ENTER/EXIT/START/END_MARK/END_SWEEP)"]
BufA["Sample buffer A"]
BufB["Sample buffer B"]
Worker["Worker thread<br/>(timer + aggregation)"]
FrameTable["Frame table<br/>(VALUE → frame_id)"]
AggTable["Aggregation table<br/>(stack → weight)"]
end
subgraph "Ruby (lib/rperf.rb)"
API["Rperf.start / Rperf.stop"]
Encoders["Encoders<br/>(PProf / Collapsed / Text)"]
StatOut["stat output"]
end
Timer -->|triggers| PJ
PJ -->|at safepoint| Sample
Sample -->|writes| BufA
ThreadHook -->|writes| BufA
GCHook -->|writes| BufA
BufA -->|swap when full| BufB
BufB -->|aggregated by| Worker
Worker -->|inserts| FrameTable
Worker -->|inserts| AggTable
API -->|calls| Sample
AggTable -->|read at stop| Encoders
FrameTable -->|resolved at stop| Encoders
Encoders --> StatOut
The C extension handles all performance-critical operations: timer management, sample recording, and aggregation. The Ruby layer provides the user-facing API, output encoding, and statistics formatting.
8.2 Global profiler state
rperf uses a single global rperf_profiler_t struct. Only one profiling session can be active at a time (single session limitation). The struct holds:
- Timer configuration (frequency, mode, signal number)
- Double-buffered sample storage (two sample buffers + frame pools, swapped when full)
- Frame table (VALUE → uint32 frame_id, deduplicates frame references)
- Aggregation table (unique stack → accumulated weight)
- Worker thread handle (unified timer + aggregation thread)
- Thread-specific key for per-thread data
- GC phase tracking state
- Sampling overhead counters
8.3 Per-thread data
Each thread gets a rperf_thread_data_t struct stored via Ruby’s thread-specific data API (rb_internal_thread_specific_set). This tracks:
- prev_time_ns: Previous time reading (for computing weight)
- prev_wall_ns: Previous wall time reading
- suspended_at_ns: Wall timestamp when thread was suspended
- ready_at_ns: Wall timestamp when thread became ready
- suspended_frame_start/depth: Saved backtrace from SUSPENDED event
- label_set_id: Current label set ID (0 = no labels)
Thread data is created lazily on first encounter and freed on the EXITED event or at profiler stop.
8.4 GC safety
Frame VALUEs must be protected from garbage collection. rperf wraps the profiler struct in a TypedData object with a custom dmark function that marks three regions:
- Both frame pools (active and standby buffers)
- Frame table keys (unique frame VALUEs, excluding synthetic frame slots)
The frame table keys array starts at 4,096 entries and grows by 2× when full. Growth allocates a new array, copies existing data, and swaps the pointer atomically (memory_order_release). The old array is kept alive until stop to prevent use-after-free if GC’s mark phase is reading it concurrently. The dmark function loads both the keys pointer and the count with memory_order_acquire to ensure a consistent view.
8.5 Fork safety
rperf registers a pthread_atfork child handler that silently stops profiling in the forked child process (fork safety):
- Clears the timer/signal state
- Removes event hooks
- Frees sample buffers, frame table, and aggregation table
The parent process continues profiling unaffected. The child can start a fresh profiling session if needed.
8.6 Known limitations
8.6.1 Running EC race
There is a known race condition in the Ruby VM where rb_postponed_job_trigger from the timer thread may set the interrupt flag on the wrong thread’s execution context. This happens when a new thread’s native thread starts before acquiring the GVL. The result is that timer samples may miss threads doing C busy-wait, with their CPU time leaking into the next SUSPENDED event’s stack.
This is a Ruby VM bug, not a rperf bug, and affects all postponed-job-based profilers.
8.6.2 Single session
Only one profiling session can be active at a time due to the global profiler state. Calling Rperf.start while already profiling is not supported.
8.6.3 Method-level granularity
rperf profiles at the method level, not the line level. Frame labels use rb_profile_frame_full_label for qualified names (e.g., Integer#times, MyClass#method_name). Line numbers are not included.
9. Sampling
This chapter explains how rperf collects samples: the timer mechanism that triggers sampling, the sampling callback itself, and the event hooks that capture GVL and GC activity.
9.1 Timer and worker thread
rperf uses a single worker thread that handles both timer triggering and periodic sample aggregation. The timer mechanism depends on the platform.
9.1.1 Linux: Signal-based timer (default)
On Linux, rperf uses timer_create with SIGEV_THREAD_ID to deliver a real-time signal (default: SIGRTMIN+8) exclusively to the worker thread at the configured frequency. The signal handler calls rb_postponed_job_trigger to schedule the sampling callback.
Using SIGEV_THREAD_ID ensures the timer signal only targets the worker thread, preventing it from interrupting nanosleep, read, or other blocking syscalls in Ruby threads (e.g., inside rb_thread_call_without_gvl).
mermaid source
sequenceDiagram
participant Kernel as Linux Kernel
participant Worker as Worker Thread
participant VM as Ruby VM
participant Job as Postponed Job
Kernel->>Worker: SIGRTMIN+8 (every 1ms at 1000Hz)
Worker->>VM: rb_postponed_job_trigger()
Note over VM: Continues until safepoint...
VM->>Job: rperf_sample_job()
Job->>Job: rb_profile_frames() + record sample
Note over Worker: Also aggregates samples<br/>when swap_ready is set
This approach provides precise timing (~1000us median interval at 1000Hz).
9.1.2 Fallback: pthread_cond_timedwait
On macOS, or when signal: false is set on Linux, the worker thread uses pthread_cond_timedwait with an absolute deadline as the timer:
- Timeout (deadline reached): trigger rb_postponed_job_trigger and advance the deadline
- Signal (swap_ready set): aggregate the standby buffer immediately
The deadline-based approach avoids drift when aggregation takes time. This mode has ~100us timing imprecision compared to the signal-based approach.
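The difference between relative sleeps and absolute deadlines is plain arithmetic (illustrative numbers, not rperf code): with relative sleeps, a 300µs overrun in one iteration permanently shifts every later tick; with absolute deadlines it does not accumulate.

```ruby
period   = 1_000_000           # ns between ticks at 1000 Hz
overruns = [0, 300_000, 0, 0]  # one wakeup delayed 300us by aggregation

# Relative sleep: each overrun pushes every subsequent tick later
rel = overruns.each_with_object([0]) { |d, t| t << t.last + period + d }.drop(1)

# Absolute deadline (pthread_cond_timedwait style): overruns don't accumulate
abs = (1..overruns.size).map { |i| i * period }

rel  # => [1000000, 2300000, 3300000, 4300000]
abs  # => [1000000, 2000000, 3000000, 4000000]
```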
9.2 Sampling callback
When the postponed job fires, rperf_sample_job runs on whatever thread currently holds the GVL. It only samples that thread using rb_thread_current().
This is a deliberate design choice:
- rb_profile_frames can only capture the current thread’s stack
- There’s no need to iterate Thread.list — combined with GVL event hooks, rperf gets broad coverage of all threads (though a known race in the Ruby VM can cause occasional missed samples)
The sampling callback:
- Gets or creates per-thread data (rperf_thread_data_t)
- Reads the current clock (CLOCK_THREAD_CPUTIME_ID for CPU mode, CLOCK_MONOTONIC for wall mode)
- Computes weight as time_now - prev_time
- Captures the backtrace with rb_profile_frames directly into the frame pool
- Records the sample (frame start index, depth, weight, type)
- Updates prev_time
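The weight computation is what corrects safepoint bias: each sample is weighted by the real elapsed time since the previous sample on that thread, so a safepoint delayed by a long C call produces one proportionally heavy sample rather than several lost ones. In sketch form (illustrative, not rperf's C code):

```ruby
# Weight each sample by the time elapsed since the previous one.
def delta_weights(timestamps_ns)
  timestamps_ns.each_cons(2).map { |prev_t, now| now - prev_t }
end

# At 1000 Hz the ideal spacing is 1ms, but suppose a long-running C call
# delays the third safepoint by 5ms; the late sample absorbs the whole gap:
ts = [0, 1_000_000, 2_000_000, 7_000_000]
delta_weights(ts)      # => [1000000, 1000000, 5000000]
delta_weights(ts).sum  # => 7000000 (total time is preserved)
```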
9.3 GVL event tracking (wall mode)
In wall mode, rperf hooks into Ruby’s thread event API to track GVL transitions. This captures time that sampling alone would miss — time spent off the GVL.
mermaid source
stateDiagram-v2
[*] --> Running: Thread holds GVL
Running --> Suspended: SUSPENDED event<br/>(capture backtrace + timestamp)
Suspended --> Ready: READY event<br/>(record wall timestamp)
Ready --> Running: RESUMED event<br/>(record GVL blocked + wait samples)
Running --> [*]: EXITED event<br/>(cleanup thread data)
9.3.1 SUSPENDED
When a thread releases the GVL (e.g., before I/O):
- Capture the current backtrace into the frame pool
- Record a normal sample (time since last sample)
- Save the backtrace and wall timestamp for later use
9.3.2 READY
When a thread becomes ready to run (e.g., I/O completed):
- Record the wall timestamp (no GVL needed — only simple C operations)
9.3.3 RESUMED
When a thread reacquires the GVL:
- Record a [GVL blocked] sample: weight = ready_at - suspended_at (off-GVL time)
- Record a [GVL wait] sample: weight = resumed_at - ready_at (GVL contention time)
- Both samples reuse the backtrace captured at SUSPENDED
This way, off-GVL time and GVL contention are accurately attributed to the code that triggered them, even though no timer-based sampling can occur while the thread is off the GVL.
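The two weights are plain timestamp arithmetic over the three events. A sketch with illustrative names:

```ruby
# Split a thread's off-CPU interval into off-GVL time and GVL-wait time.
def gvl_samples(suspended_at, ready_at, resumed_at)
  {
    gvl_blocked: ready_at - suspended_at,  # e.g. waiting on I/O
    gvl_wait:    resumed_at - ready_at,    # runnable, but GVL held elsewhere
  }
end

# Released the GVL at t=0, I/O completed at 3ms, GVL reacquired at 3.5ms:
gvl_samples(0, 3_000_000, 3_500_000)
# => { gvl_blocked: 3000000, gvl_wait: 500000 }
```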
9.4 GC phase tracking
rperf hooks into Ruby’s internal GC events to track garbage collection time:
| Event | Action |
|---|---|
| GC_START | Set phase to marking |
| GC_END_MARK | Set phase to sweeping |
| GC_END_SWEEP | Clear phase |
| GC_ENTER | Capture backtrace + wall timestamp |
| GC_EXIT | Record [GC marking] or [GC sweeping] sample |
GC samples always use wall time regardless of the profiling mode, because GC time is real elapsed time that affects application latency.
9.5 Deferred string resolution
During sampling, rperf stores raw frame VALUEs (Ruby internal object references) in the frame pool — not strings. This deferred string resolution keeps the hot path allocation-free and fast.
String resolution happens at stop time:
- Rperf.stop calls _c_stop
- The frame table maps each unique frame VALUE to a [path, label] string pair via rb_profile_frame_full_label and rb_profile_frame_path
- These resolved strings are passed to the Ruby encoders
This means sampling only writes integers (VALUE pointers and timestamps) to pre-allocated buffers. No Ruby objects are created, no GC pressure is added during profiling.
10. Aggregation
This chapter explains how rperf processes raw samples into compact, deduplicated data structures. Aggregation keeps memory usage bounded during long profiling sessions.
10.1 Overview
By default (aggregate: true), rperf periodically aggregates samples in the background. Raw samples — each containing a stack trace, weight, and metadata — are merged into two hash tables: a frame table and an aggregation table. Identical stacks are combined by summing their weights.
10.2 Double buffering
Two sample buffers alternate roles to allow concurrent sampling and aggregation:
- The active buffer receives new samples from sampling callbacks
- When the active buffer reaches 10,000 samples, the buffers swap
- The standby buffer is processed by the worker thread in the background
mermaid source
graph LR
subgraph "Active"
A["Sample buffer A<br/>(receiving samples)"]
end
subgraph "Standby"
B["Sample buffer B<br/>(being aggregated)"]
end
A -->|"swap when<br/>10K samples"| B
B -->|"swap back<br/>after aggregation"| A
If the worker thread hasn’t finished processing the standby buffer, the swap is skipped and the active buffer continues growing (fallback to unbounded mode).
Each buffer has its own frame pool. The active buffer’s frame pool receives raw VALUE references from rb_profile_frames. When buffers swap, the standby buffer’s frame pool is processed and then cleared for reuse.
10.3 Frame table
The frame table (VALUE → uint32_t frame_id) deduplicates frame references. Each unique frame VALUE gets a small integer ID.
- Keys: raw frame VALUEs (the only GC mark target for aggregated data)
- Values: compact uint32_t IDs used in the aggregation table
- Initial capacity: 4,096 entries, grows by 2× when full
- Growth uses atomic pointer swaps for GC dmark safety (see GC safety)
Using uint32_t frame IDs instead of full VALUEs halves the memory needed for stack storage in the aggregation table.
10.4 Aggregation table
The aggregation table merges identical stacks by summing their weights:
- Key: (frame_ids[], thread_seq, label_set_id) — the stack (as frame IDs), thread sequence number, and label set
- Value: accumulated weight in nanoseconds
Frame IDs are stored contiguously in a separate stack pool. Each aggregation table entry points into this pool (start index + depth).
Synthetic frames ([GVL blocked], [GVL wait], [GC marking], [GC sweeping]) are converted to reserved frame IDs during aggregation, eliminating the type field from the sample representation. This unifies the treatment of real and synthetic frames.
The label_set_id is part of the key, so samples with the same stack but different labels are kept separate.
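A pure-Ruby model of the frame table plus aggregation table (the real implementation is C hash tables; the `Aggregator` class here is invented for illustration):

```ruby
# Intern frames to small integer IDs, then merge identical
# (stack, thread_seq, label_set_id) keys by summing weights.
class Aggregator
  attr_reader :frame_ids, :weights

  def initialize
    @frame_ids = Hash.new { |h, frame| h[frame] = h.size }  # frame table
    @weights   = Hash.new(0)                                # aggregation table
  end

  def add(stack, weight_ns, thread_seq: 1, label_set_id: 0)
    key = [stack.map { |f| @frame_ids[f] }, thread_seq, label_set_id]
    @weights[key] += weight_ns
  end
end

agg = Aggregator.new
agg.add(%w[<main> work], 100)
agg.add(%w[<main> work], 50)                   # identical stack: merged
agg.add(%w[<main> work], 25, label_set_id: 1)  # same stack, new label: separate
```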
10.5 Memory usage
| Buffer | Initial size | Element size | Initial memory |
|---|---|---|---|
| Sample buffer (×2) | 16,384 | 32B | 512KB × 2 |
| Frame pool (×2) | 131,072 | 8B (VALUE) | 1MB × 2 |
| Frame table keys | 4,096 | 8B (VALUE) | 32KB |
| Frame table buckets | 8,192 | 4B (uint32) | 32KB |
| Agg table buckets | 2,048 | 28B | 56KB |
| Stack pool | 4,096 | 4B (uint32) | 16KB |
Total: ~3.6MB with aggregate: true, ~1.5MB with aggregate: false (single buffer only). Frame table and aggregation table grow dynamically as needed.
With aggregation enabled, memory usage stays roughly constant regardless of profiling duration — only the number of unique stacks matters, not the total number of samples collected.
11. Output Encoding
This chapter describes how rperf converts its internal aggregated data into output formats. Encoding runs once at stop time, so it is not performance-critical.
11.1 From aggregated data to output
When Rperf.stop is called, the C extension returns the aggregated data to Ruby:
- Frame table: An array of [path, label] string pairs, indexed by frame ID. Strings are resolved from raw VALUEs via rb_profile_frame_full_label and rb_profile_frame_path.
- Aggregation table: An array of [frame_ids, weight, thread_seq, label_set_id] entries.
- Label sets: An array of frozen Hashes mapping label keys to values.
The Ruby encoders (Rperf::PProf, Rperf::Collapsed, Rperf::Text) consume these arrays to produce the final output.
11.2 pprof encoder
rperf encodes the pprof[Gang, 2010] protobuf format entirely in Ruby, with no protobuf gem dependency. The encoder in Rperf::PProf.encode:
- Builds a string table (index 0 is always the empty string)
- Converts string frames to index frames and merges identical stacks
- Builds location and function tables
- Encodes the Profile protobuf message field by field
- Compresses the result with gzip
This hand-written encoder is simple (~100 lines) and handles only the subset of the pprof protobuf schema that rperf needs.
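The two primitives such an encoder needs are base-128 varints and length-delimited fields. A minimal sketch of both (not rperf's actual code):

```ruby
# Encode a non-negative integer as a protobuf base-128 varint.
def varint(n)
  bytes = []
  loop do
    byte = n & 0x7F
    n >>= 7
    bytes << (n.zero? ? byte : byte | 0x80)  # high bit = "more bytes follow"
    break if n.zero?
  end
  bytes.pack("C*")
end

# A length-delimited field (wire type 2), e.g. one string-table entry.
def string_field(field_number, str)
  varint((field_number << 3) | 2) + varint(str.bytesize) + str.b
end

varint(300).bytes       # => [0xAC, 0x02]
string_field(6, "cpu")  # tag byte, length byte, then "cpu"
```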
11.2.1 Embedded metadata
Each pprof profile includes metadata as comments: rperf version, profiling mode, frequency, and Ruby version. The time_nanos and duration_nanos fields record when and how long the profile was collected.
11.2.2 Thread sequence labels
Every sample carries a thread_seq numeric label — a 1-based thread sequence number assigned when rperf first sees each thread. This allows grouping flame graphs by thread in pprof tools.
11.3 Sample labels
Rperf.label enables per-context annotation of samples. The implementation is split between Ruby and C to keep the hot path minimal:
- Ruby side: Manages label sets as frozen Hash objects in an array (@label_set_table). A deduplication index (@label_set_index) maps each unique Hash to an integer ID. Rperf.label(key: value) merges the new labels with the current set, interns the result, and passes the integer ID to C.
- C side: Each rperf_thread_data_t stores a label_set_id (integer). When a sample is recorded, the current thread’s label_set_id is copied into the sample — a single integer, adding zero allocation overhead to the hot path. The aggregation table includes label_set_id in its hash key, so identical stacks with different labels remain separate entries.
- Encoding: At encode time, the Ruby PProf encoder reads the label_sets array and writes each label as a pprof Sample.Label (key-value string pair). go tool pprof -tagfocus and -tagroot can filter and group by label.
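The Ruby-side interning can be modeled in a few lines (names here are illustrative, not rperf's internals):

```ruby
# Deduplicate label Hashes into integer IDs; ID 0 is the empty label set.
class LabelSetTable
  attr_reader :table

  def initialize
    @table = [{}.freeze]
    @index = { @table.first => 0 }
  end

  # Merge new labels into the current set and return its stable integer ID.
  def intern(current, new_labels)
    merged = current.merge(new_labels).freeze
    @index[merged] ||= (@table << merged).size - 1
  end
end

labels = LabelSetTable.new
a = labels.intern({}, request: "abc")               # => 1
b = labels.intern({}, request: "abc")               # same set, same ID
c = labels.intern({ request: "abc" }, job: "Mail")  # nested labels => new ID
```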
11.4 Collapsed stacks encoder
The collapsed stacks encoder produces one line per unique stack: frames joined by semicolons (bottom-to-top), followed by a space and the weight in nanoseconds. This format is consumed by FlameGraph tools and speedscope.
When label sets are present, labels are not included in the collapsed output — use pprof format for label-aware analysis.
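Encoding is the inverse of the parsing shown in chapter 6: join frames with semicolons and append the weight. A sketch:

```ruby
# Encode aggregated (stack, weight_ns) pairs as collapsed-stack lines.
# Stacks are ordered bottom-to-top; weights are nanoseconds.
def encode_collapsed(entries)
  entries.map { |frames, weight_ns| "#{frames.join(";")} #{weight_ns}" }.join("\n")
end

encode_collapsed([
  [["<main>", "Object#cpu_work"], 53_419_170],
  [["<main>", "Kernel#sleep"], 2_335_151],
])
# => "<main>;Object#cpu_work 53419170\n<main>;Kernel#sleep 2335151"
```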
11.5 Text report encoder
The text report encoder produces a human-readable summary with:
- Header: Total profiled time, sample count, and frequency
- Flat table: Top 50 functions sorted by self time (the function was the leaf/deepest frame)
- Cumulative table: Top 50 functions sorted by total time (the function appeared anywhere in the stack)
This format requires no external tools and is suitable for quick analysis, issue reports, or AI-assisted analysis.
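Both tables fall out of a single pass over the aggregated stacks. A sketch (counting each function once per stack for the cumulative table — the usual convention, so recursion is not double-counted):

```ruby
# Compute self (flat) and total (cumulative) nanoseconds per function.
def flat_and_cumulative(entries)
  flat = Hash.new(0)
  cum  = Hash.new(0)
  entries.each do |frames, weight_ns|
    flat[frames.last] += weight_ns                # leaf owns the self time
    frames.uniq.each { |f| cum[f] += weight_ns }  # once per stack
  end
  [flat, cum]
end

flat, cum = flat_and_cumulative([
  [["<main>", "Object#fib", "Object#fib"], 400],  # recursive call
  [["<main>", "Object#fib"], 100],
])
flat  # => {"Object#fib" => 500}
cum   # => {"<main>" => 500, "Object#fib" => 500}
```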
A. Quick Reference
A.1 CLI cheat sheet
# Quick performance overview
rperf stat ruby my_app.rb
# Performance overview with profile tables
rperf stat --report ruby my_app.rb
# Full performance report (same as stat --report)
rperf exec ruby my_app.rb
# Record to default file (rperf.data, pprof format, cpu mode)
rperf record ruby my_app.rb
# Record with options
rperf record -m wall -f 500 -o profile.pb.gz ruby my_app.rb
# Record and print text profile to stdout
rperf record -p ruby my_app.rb
# Record to text format
rperf record -o profile.txt ruby my_app.rb
# Record to collapsed stacks
rperf record -o profile.collapsed ruby my_app.rb
# View profile in browser (requires Go)
rperf report
# Print top functions
rperf report --top profile.pb.gz
# Compare two profiles
rperf diff before.pb.gz after.pb.gz
# Full documentation
rperf help
A.2 Ruby API cheat sheet
require "rperf"
# Block form
data = Rperf.start(output: "profile.pb.gz", mode: :cpu) do
# code to profile
end
# Manual form
Rperf.start(frequency: 1000, mode: :wall)
# ...
data = Rperf.stop
# Save to file
Rperf.save("profile.pb.gz", data)
Rperf.save("profile.collapsed", data)
Rperf.save("profile.txt", data)
# Snapshot (read data without stopping)
snap = Rperf.snapshot
Rperf.save("snap.pb.gz", snap)
snap = Rperf.snapshot(clear: true) # reset after snapshot
# Labels (annotate samples with context)
Rperf.label(request: "abc") do
# samples inside get request="abc" label
end
Rperf.labels # get current labels
# Rack middleware (labels requests by endpoint)
require "rperf/middleware"
use Rperf::Middleware # Rails: config.middleware.use Rperf::Middleware
use Rperf::Middleware, label_key: :route # custom label key
# Active Job (labels jobs by class name)
require "rperf/active_job"
class ApplicationJob < ActiveJob::Base
include Rperf::ActiveJobMiddleware
end
# Sidekiq (labels jobs by worker class name)
require "rperf/sidekiq"
Sidekiq.configure_server do |config|
config.server_middleware do |chain|
chain.add Rperf::SidekiqMiddleware
end
end
A.3 Environment variables
These are used internally by the CLI to configure the auto-started profiler:
| Variable | Values | Description |
|---|---|---|
| RPERF_ENABLED | 1 | Enable auto-start on require |
| RPERF_OUTPUT | path | Output file path |
| RPERF_FREQUENCY | integer | Sampling frequency in Hz |
| RPERF_MODE | cpu, wall | Profiling mode |
| RPERF_FORMAT | pprof, collapsed, text | Output format |
| RPERF_VERBOSE | 1 | Print statistics to stderr |
| RPERF_STAT | 1 | Enable stat mode output |
| RPERF_STAT_REPORT | 1 | Include profile tables in stat output |
| RPERF_STAT_COMMAND | string | Command string shown in stat output header |
| RPERF_AGGREGATE | 0 | Disable sample aggregation (return raw samples) |
A.4 Profiling mode comparison
| Aspect | cpu | wall |
|---|---|---|
| Clock | CLOCK_THREAD_CPUTIME_ID | CLOCK_MONOTONIC |
| I/O time | Not measured | [GVL blocked] |
| Sleep time | Not measured | [GVL blocked] |
| GVL contention | Not measured | [GVL wait] |
| GC time | [GC marking], [GC sweeping] | [GC marking], [GC sweeping] |
| Best for | CPU hotspots | Latency analysis |
A.5 Output format comparison
| Extension | Format | Tooling required |
|---|---|---|
| .pb.gz (default) | pprof protobuf | Go (rperf report) |
| .collapsed | Collapsed stacks | flamegraph.pl or speedscope |
| .txt | Text report | None |
A.6 Synthetic frames
| Frame | Mode | Meaning |
|---|---|---|
| [GVL blocked] | wall | Thread off-GVL (I/O, sleep, C ext) |
| [GVL wait] | wall | Thread waiting for GVL (contention) |
| [GC marking] | both | GC marking phase (wall time) |
| [GC sweeping] | both | GC sweeping phase (wall time) |
Bibliography
- [Aman, 2014] Aman Gupta. stackprof: A sampling call-stack profiler for Ruby 2.2+. 2014.
- [Arnaldo, 2010] Arnaldo Carvalho de Melo. The New Linux perf Tools. 2010.
- [Brendan, 2016] Brendan Gregg. "The Flame Graph". Communications of the ACM, 59(6), pp. 48--57. 2016. DOI
- [Gang, 2010] Gang Ren and Eric Tune and Tipp Moseley and Yixin Shi and Silvius Rus and Robert Hundt. "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers". IEEE Micro, 30(4), pp. 65--79. 2010. DOI
- [Susan, 1982] Susan L. Graham and Peter B. Kessler and Marshall K. McKusick. "gprof: A Call Graph Execution Profiler". In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, pp. 120--126. 1982. DOI
- [Todd, 2010] Todd Mytkowicz and Amer Diwan and Matthias Hauswirth and Peter F. Sweeney. "Evaluating the Accuracy of Java Profilers". In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '10), pp. 187--197. 2010. DOI
Index
C
- collapsed stacks
- 1. Introduction
- 1. Introduction
- 6. Output Formats
- cpu mode
- 3. CLI Usage
- cumulative time
- 5. Interpreting Results
D
- deferred string resolution
- 9. Sampling
F
- flat time
- 5. Interpreting Results
- fork safety
- 8. Architecture Overview
P
- postponed job
- 1. Introduction
- 1. Introduction
- pprof
- 1. Introduction
- 1. Introduction
- 4. Ruby API
- 6. Output Formats
R
- Rperf.label
- 4. Ruby API
- 6. Output Formats
- 7. Framework Integration
- Rperf.labels
- 4. Ruby API
- Rperf.save
- 4. Ruby API
- Rperf.snapshot
- 4. Ruby API
- Rperf.start
- 4. Ruby API
- Rperf.stop
- 4. Ruby API
- rperf
- 1. Introduction
- rperf diff
- 3. CLI Usage
- rperf exec
- 3. CLI Usage
- rperf record
- 1. Introduction
- 3. CLI Usage
- rperf report
- 1. Introduction
- 3. CLI Usage
- rperf stat
- 1. Introduction
- 3. CLI Usage
S
- safepoint
- 1. Introduction
- 1. Introduction
- safepoint bias
- 1. Introduction
- 1. Introduction
- sampling
- 1. Introduction
- single session
- 8. Architecture Overview
- synthetic frames
- 1. Introduction
- 5. Interpreting Results
W
- wall mode
- 3. CLI Usage
- 3. CLI Usage
- 3. CLI Usage
- weight
- 1. Introduction
\
- \[GC marking\]
- 5. Interpreting Results
- \[GC sweeping\]
- 5. Interpreting Results
- \[GVL blocked\]
- 5. Interpreting Results
- \[GVL wait\]
- 5. Interpreting Results