rperf guide
A safepoint-based sampling performance profiler for Ruby.
rperf profiles your Ruby programs by sampling at safepoints and using actual time deltas as weights to correct safepoint bias. It provides a perf-like CLI, a Ruby API, and outputs pprof, collapsed stacks, and text report formats.
Key features:
- CPU and wall-clock profiling modes
- Time-weighted sampling that corrects safepoint bias
- GVL contention and GC phase tracking
- pprof, collapsed stacks, and text output formats
- Low overhead (sampling callback cost < 0.2% at default 1000 Hz)
- Requires Ruby >= 3.4.0 on POSIX systems (Linux, macOS)
Getting Started
This part covers what rperf is, why it exists, and how to install it on your system.
1. Introduction
1.1 What is rperf?
rperf is a safepoint-based sampling performance profiler for Ruby. It helps you find where your Ruby program spends its time — whether that’s CPU computation, I/O waits, GVL contention, or garbage collection.
Unlike traditional sampling profilers that count samples uniformly, rperf uses actual time deltas (in nanoseconds) as weights for each sample. This corrects the safepoint bias problem inherent in postponed job-based sampling, producing more accurate results.
rperf is inspired by Linux perf[Arnaldo, 2010], providing a familiar CLI with subcommands like record, stat, and report.
1.1.1 Advantages
- Accurate profiling: Time-delta weighting corrects safepoint bias, producing results closer to real time distribution than traditional count-based profilers.
- GVL / GC visibility: In wall mode, tracks off-GVL blocking, GVL contention, and GC marking/sweeping as distinct frames — no separate tool needed.
- Standard output formats: Outputs pprof protobuf (compatible with go tool pprof), collapsed stacks (for flame graphs / speedscope), and human-readable text.
- Low overhead: Default 1000 Hz sampling callback cost is < 0.2%, suitable for production.
- Simple CLI: rperf stat for a quick overview, rperf record + rperf report for detailed analysis.
1.1.2 Limitations
- Ruby 3.4+ only: Requires APIs introduced in Ruby 3.4.
- POSIX only: Linux and macOS. Windows is not supported.
- Method-level granularity: No line-number resolution — profiles show method names only.
- Single session: Only one profiling session can be active at a time (global state in the C extension).
- Safepoint latency: Samples are still deferred to safepoints. The time-delta weighting corrects the bias but cannot recover the exact interrupted instruction pointer.
1.2 Why another profiler?
Ruby already has profiling tools like stackprof[Aman, 2014]. So why rperf?
1.2.1 The safepoint bias problem
Most Ruby sampling profilers collect backtraces by calling rb_profile_frames directly in the signal handler. This approach yields backtraces at the actual signal timing, but relies on undocumented internal VM state — rb_profile_frames is not guaranteed to be async-signal-safe, and the results can be unreliable if the VM is in the middle of updating its internal structures.
rperf takes a different approach: it uses the postponed job mechanism (rb_postponed_job), which is the Ruby VM’s official API for safely deferring work from signal handlers. Backtrace collection is deferred to the next safepoint — a point where the VM is in a consistent state and rb_profile_frames can return reliable results. The trade-off is that when a timer fires between safepoints, the actual sample is delayed until the next safepoint.
If each sample were counted equally (weight = 1), a sample delayed by a long-running C method would get the same weight as one taken immediately. This is the safepoint bias[Todd, 2010] problem: functions that happen to be running when the thread reaches a safepoint appear more often than they should, while functions between safepoints are under-represented.
mermaid source
sequenceDiagram
participant Timer
participant VM
participant Profiler
Timer->>VM: Signal (t=0ms)
Note over VM: Running C method...<br/>cannot stop yet
VM->>Profiler: Safepoint reached (t=3ms)
Note over Profiler: Traditional: weight = 1<br/>rperf: weight = 3ms
Timer->>VM: Signal (t=1ms)
VM->>Profiler: Safepoint reached (t=1ms)
Note over Profiler: Traditional: weight = 1<br/>rperf: weight = 1ms
rperf solves this by recording clock_now - clock_prev as the weight of each sample. A sample delayed by 3ms gets 3x the weight of a 1ms sample, accurately reflecting where time was actually spent.
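The effect of this weighting can be sketched without rperf itself. The snippet below simulates three safepoint samples as [leaf_method, delta_ns] pairs — an illustrative shape, not rperf's internal representation:

```ruby
# Gem-free illustration of time-delta weighting (not rperf internals).
samples = [
  ["slow_c_method", 3_000_000], # safepoint reached 3 ms after the timer fired
  ["fast_method",   1_000_000], # safepoint reached on schedule
  ["fast_method",   1_000_000],
]

counts  = Hash.new(0) # traditional profiler: every sample weighs 1
weights = Hash.new(0) # rperf-style: weight = elapsed nanoseconds

samples.each do |leaf, delta_ns|
  counts[leaf]  += 1
  weights[leaf] += delta_ns
end

# Count-based, slow_c_method gets 1 of 3 samples (33%); time-weighted,
# it gets 3 of 5 ms (60%) — matching where wall time was actually spent.
```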
1.2.2 Other advantages
- GVL and GC awareness: In wall mode, rperf tracks time spent blocked off the GVL, waiting to reacquire the GVL, and in GC marking/sweeping phases — each as distinct synthetic frames.
- perf-like CLI: The rperf stat command gives you a quick performance overview (like perf stat), while rperf record + rperf report gives you detailed profiling.
- Standard output: rperf outputs pprof protobuf format, compatible with Go’s pprof tool ecosystem, as well as collapsed stacks for flame graphs[Brendan, 2016] and speedscope.
- Low overhead: Default 1000 Hz sampling callback cost is < 0.2%, suitable for production use.
1.3 Requirements
- Ruby >= 3.4.0
- POSIX system (Linux or macOS)
- Go (optional, for rperf report and rperf diff subcommands)
1.4 Quick start
Profile a Ruby script and view the results:
# Install rperf
gem install rperf
# Quick performance overview
rperf stat ruby my_app.rb
# Record a profile and view it interactively
rperf record ruby my_app.rb
rperf report
Or use rperf from Ruby code:
require "rperf"
Rperf.start(output: "profile.pb.gz") do
# code to profile
end
2. Installation
2.1 Installing the gem
rperf is distributed as a Ruby gem with a C extension:
gem install rperf
Or add it to your Gemfile:
gem "rperf"
Then run:
bundle install
2.2 Verifying the installation
Check that rperf is installed correctly:
rperf --help
You should see:
Usage: rperf record [options] command [args...]
rperf stat [options] command [args...]
rperf report [options] [file]
rperf diff [options] base.pb.gz target.pb.gz
rperf help
Run 'rperf help' for full documentation
2.3 Platform support
rperf supports POSIX systems:
| Platform | Timer implementation | Notes |
|---|---|---|
| Linux | timer_create + signal (default) | Best precision (~1000us at 1000Hz) |
| Linux | nanosleep thread (with signal: false) | Fallback, ~100us drift/tick |
| macOS | nanosleep thread | Signal-based timer not available |
On Linux, rperf uses timer_create with SIGEV_SIGNAL and a sigaction handler by default. This provides precise interval timing with no extra thread. The signal number defaults to SIGRTMIN+8 and can be changed via the signal: keyword argument to Rperf.start in the Ruby API.
On macOS (and when signal: false is set on Linux), rperf falls back to a dedicated pthread with a nanosleep loop.
2.4 Optional: Go toolchain
The rperf report and rperf diff subcommands are thin wrappers around go tool pprof. If you want to use these commands, install Go from go.dev.
Without Go, you can still use all other rperf features. You can view pprof files with other tools like speedscope or generate text/collapsed output directly.
How to Use
This part explains how to profile Ruby programs with rperf — from the command line and as a library — and describes the available output formats.
3. CLI Usage
rperf provides a perf-like command-line interface with five main subcommands: record, stat, exec, report, and diff.
3.1 rperf stat
rperf stat is the quickest way to get a performance overview. It runs your command with wall-mode profiling and prints a summary to stderr.
rperf stat ruby my_app.rb
3.1.1 Example: CPU-bound program
Here is a simple Fibonacci computation:
# fib.rb
def fib(n)
return n if n <= 1
fib(n - 1) + fib(n - 2)
end
fib(35)
Running rperf stat:
rperf stat ruby fib.rb
Performance stats for 'ruby fib.rb':
744.4 ms user
32.0 ms sys
491.0 ms real
481.0 ms 100.0% CPU execution
12.0 ms [Ruby] GC time (9 count: 6 minor, 3 major)
154,468 [Ruby] allocated objects
66,596 [Ruby] freed objects
25 MB [OS] peak memory (maxrss)
31 [OS] context switches (10 voluntary, 21 involuntary)
0 MB [OS] disk I/O (0 MB read, 0 MB write)
481 samples / 481 triggers, 0.1% profiler overhead
The output tells you:
- user/sys/real: Standard timing (like the time command)
- Time breakdown: Where wall time was spent — CPU execution, GVL blocked (I/O/sleep), GVL wait (contention), GC marking, GC sweeping
- Ruby stats: GC count, allocated/freed objects, YJIT ratio (if enabled)
- OS stats: Peak memory, context switches, disk I/O
Use --report to add flat and cumulative top-50 function tables to the output.
3.1.2 Example: Mixed CPU and I/O
# mixed.rb
def cpu_work(n)
sum = 0
n.times { |i| sum += i * i }
sum
end
def io_work
sleep(0.05)
end
5.times do
cpu_work(500_000)
io_work
end
Running rperf stat:
rperf stat ruby mixed.rb
Because stat always uses wall mode, you can see how time is divided between CPU and I/O. The [Ruby] GVL blocked line shows time spent sleeping/in I/O, while CPU execution shows compute time.
3.1.3 stat options
rperf stat [options] command [args...]
| Option | Description |
|---|---|
| -o PATH | Also save profile to file (default: none) |
| -f HZ | Sampling frequency in Hz (default: 1000) |
| -m MODE | cpu or wall (default: wall) |
| --report | Include flat/cumulative profile tables in output |
| -v | Print additional sampling statistics |
3.2 rperf exec
rperf exec runs a command with profiling and prints a full performance report to stderr — equivalent to rperf stat --report. It uses wall mode by default and does not save to a file.
rperf exec ruby my_app.rb
This prints everything stat shows (timing, time breakdown, GC/memory/OS stats) plus flat and cumulative top-50 function tables.
3.2.1 exec options
rperf exec [options] command [args...]
| Option | Description |
|---|---|
-o PATH |
Also save profile to file (default: none) |
-f HZ |
Sampling frequency in Hz (default: 1000) |
-m MODE |
cpu or wall (default: wall) |
-v |
Print additional sampling statistics |
3.3 rperf record
rperf record profiles a command and saves the result to a file. This is the primary way to capture profiles for detailed analysis.
rperf record ruby my_app.rb
By default, it saves to rperf.data in pprof format with CPU mode.
3.3.1 Example: Recording a profile
rperf record ruby fib.rb
This creates rperf.data. You can then analyze it with rperf report or other pprof-compatible tools.
3.3.2 Choosing a profiling mode
rperf supports two profiling modes:
- cpu (default): Measures per-thread CPU time. Best for finding functions that consume CPU cycles. Ignores time spent sleeping, in I/O, or waiting for the GVL.
- wall: Measures wall-clock time. Best for finding where wall time goes, including I/O, sleep, and GVL contention.
# CPU mode (default)
rperf record ruby my_app.rb
# Wall mode
rperf record -m wall ruby my_app.rb
3.3.3 Choosing an output format
rperf auto-detects the format from the file extension:
# pprof format (default)
rperf record -o profile.pb.gz ruby my_app.rb
# Collapsed stacks (for FlameGraph / speedscope)
rperf record -o profile.collapsed ruby my_app.rb
# Human-readable text
rperf record -o profile.txt ruby my_app.rb
You can also force the format explicitly:
rperf record --format text -o profile.dat ruby my_app.rb
3.3.4 Example: Text output
rperf record -o profile.txt ruby fib.rb
The text output looks like this:
Total: 509.5ms (cpu)
Samples: 509, Frequency: 1000Hz
Flat:
509.5 ms 100.0% Object#fib (fib.rb)
Cumulative:
509.5 ms 100.0% Object#fib (fib.rb)
509.5 ms 100.0% <main> (fib.rb)
3.3.5 Example: Wall mode text output
rperf record -m wall -o wall_profile.txt ruby mixed.rb
Total: 311.8ms (wall)
Samples: 80, Frequency: 1000Hz
Flat:
250.6 ms 80.4% [GVL blocked] (<GVL>)
44.1 ms 14.1% Object#cpu_work (mixed.rb)
13.9 ms 4.5% Integer#times (<internal:numeric>)
3.2 ms 1.0% Kernel#sleep (<C method>)
0.0 ms 0.0% [GVL wait] (<GVL>)
Cumulative:
311.8 ms 100.0% Integer#times (<internal:numeric>)
311.8 ms 100.0% block in <main> (mixed.rb)
311.8 ms 100.0% <main> (mixed.rb)
253.8 ms 81.4% Kernel#sleep (<C method>)
253.8 ms 81.4% Object#io_work (mixed.rb)
250.6 ms 80.4% [GVL blocked] (<GVL>)
58.0 ms 18.6% Object#cpu_work (mixed.rb)
0.0 ms 0.0% [GVL wait] (<GVL>)
In wall mode, [GVL blocked] appears as the dominant cost — this is the sleep time in io_work. The CPU time for cpu_work is clearly separated.
3.3.6 Verbose output
The -v flag prints sampling statistics to stderr during profiling:
rperf record -v ruby my_app.rb
[rperf] mode=cpu frequency=1000Hz
[rperf] sampling: 98 calls, 0.11ms total, 1.1us/call avg
[rperf] samples recorded: 904
[rperf] top 10 by flat:
[rperf] 53.4ms 50.1% Object#cpu_work (-e)
[rperf] 17.0ms 15.9% Integer#times (<internal:numeric>)
...
3.3.7 record options
rperf record [options] command [args...]
| Option | Description |
|---|---|
| -o PATH | Output file (default: rperf.data) |
| -f HZ | Sampling frequency in Hz (default: 1000) |
| -m MODE | cpu or wall (default: cpu) |
| --format FMT | pprof, collapsed, or text (default: auto from extension) |
| -p, --print | Print text profile to stdout (same as --format=text --output=/dev/stdout) |
| -v | Print sampling statistics to stderr |
3.4 rperf report
rperf report opens a pprof profile for analysis. It wraps go tool pprof and requires Go to be installed.
# Open interactive web UI (default)
rperf report
# Open a specific file
rperf report profile.pb.gz
# Print top functions
rperf report --top
# Print pprof text summary
rperf report --text
3.4.1 Example: Top and text output
Using the fib.rb profile recorded earlier:
rperf report --top rperf.data
Type: cpu
Showing nodes accounting for 577.31ms, 100% of 577.31ms total
flat flat% sum% cum cum%
577.31ms 100% 100% 577.31ms 100% Object#fib
0 0% 100% 577.31ms 100% <main>
The default behavior (without --top or --text) opens an interactive web UI in your browser with flame graphs, top function views, and call graph visualizations powered by pprof[Gang, 2010].
3.4.2 report options
| Option | Description |
|---|---|
| --top | Print top functions by flat time |
| --text | Print pprof text summary |
| (default) | Open interactive web UI in browser |
3.5 rperf diff
rperf diff compares two pprof profiles, showing the difference (target - base). This is useful for measuring the impact of optimizations.
# Open diff in browser
rperf diff before.pb.gz after.pb.gz
# Print top functions by diff
rperf diff --top before.pb.gz after.pb.gz
# Print text diff
rperf diff --text before.pb.gz after.pb.gz
3.5.1 Workflow example
# Profile the baseline
rperf record -o before.pb.gz ruby my_app.rb
# Make your optimization changes...
# Profile again
rperf record -o after.pb.gz ruby my_app.rb
# Compare
rperf diff before.pb.gz after.pb.gz
3.6 rperf help
rperf help prints the full reference documentation, including profiling modes, output formats, synthetic frames, and diagnostic tips.
rperf help
This outputs detailed documentation suitable for both human reading and AI-assisted analysis.
4. Ruby API
rperf provides a Ruby API for programmatic profiling. This is useful when you want to profile specific sections of code, integrate profiling into test suites, or build custom profiling workflows.
4.1 Basic usage
4.1.1 Block form (recommended)
The simplest way to use rperf is with the block form of Rperf.start. It profiles the block and returns the profiling data:
require "rperf"
data = Rperf.start(output: "profile.pb.gz", frequency: 1000, mode: :cpu) do
# code to profile
end
When output: is specified, the profile is automatically written to the file when the block finishes. The method also returns the raw data hash for further processing.
4.1.2 Example: Profiling a Fibonacci function
require "rperf"
def fib(n)
return n if n <= 1
fib(n - 1) + fib(n - 2)
end
data = Rperf.start(frequency: 1000, mode: :cpu) do
fib(33)
end
Rperf.save("profile.txt", data)
Running this produces:
Total: 192.7ms (cpu)
Samples: 192, Frequency: 1000Hz
Flat:
192.7ms 100.0% Object#fib (example.rb)
Cumulative:
192.7ms 100.0% Object#fib (example.rb)
192.7ms 100.0% block in <main> (example.rb)
192.7ms 100.0% Rperf.start (lib/rperf.rb)
192.7ms 100.0% <main> (example.rb)
4.1.3 Manual start/stop
For cases where block form is awkward, you can manually start and stop profiling:
require "rperf"
Rperf.start(frequency: 1000, mode: :wall)
# ... code to profile ...
data = Rperf.stop
Rperf.stop returns the data hash, or nil if the profiler was not running.
4.2 Rperf.start parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| frequency: | Integer | 1000 | Sampling frequency in Hz |
| mode: | Symbol | :cpu | :cpu or :wall |
| output: | String | nil | File path to write on stop |
| verbose: | Boolean | false | Print statistics to stderr |
| format: | Symbol | nil | :pprof, :collapsed, :text, or nil (auto-detect from output extension) |
| signal: | Integer/Boolean | nil | Linux only: nil = timer signal (default), false = nanosleep thread, positive integer = specific RT signal number |
| aggregate: | Boolean | true | Aggregate identical stacks during profiling to reduce memory. false returns raw per-sample data |
4.3 Rperf.stop return value
Rperf.stop returns nil if the profiler was not running. Otherwise it returns a Hash:
{
mode: :cpu, # or :wall
frequency: 1000,
sampling_count: 1234, # number of timer callbacks
sampling_time_ns: 56789, # total time spent sampling (overhead)
trigger_count: 1234, # number of timer triggers
detected_thread_count: 4, # threads seen during profiling
start_time_ns: 17740..., # CLOCK_REALTIME epoch nanos
duration_ns: 10000000, # profiling duration in nanos
unique_frames: 42, # unique frame count (aggregate: true only)
unique_stacks: 120, # unique stack count (aggregate: true only)
samples: [ # Array of [frames, weight, thread_seq, label_set_id]
[frames, weight, seq, lsi], # frames: [[path, label], ...] deepest-first
... # weight: Integer (nanoseconds)
], # seq: Integer (thread sequence, 1-based)
# lsi: Integer (label set ID, 0 = no labels)
label_sets: [{}, {request: "abc"}], # label set table (index = label_set_id, present when labels used)
}
Each sample has:
- frames: An array of [path, label] pairs, ordered deepest-first (leaf frame at index 0)
- weight: Time in nanoseconds attributed to this sample
- thread_seq: Thread sequence number (1-based, assigned per profiling session)
- label_set_id: Label set ID (0 = no labels). Index into the label_sets array
When aggregate: true (default), identical stacks are merged and their weights summed. The samples array contains one entry per unique (stack, thread_seq, label_set_id) combination. When aggregate: false, every raw sample is returned individually.
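As a sketch of consuming this shape, the hash below is hand-written to mirror the documented layout (it is not output from a real session):

```ruby
# Hand-built hash mirroring the documented Rperf.stop shape (illustrative).
data = {
  label_sets: [{}, { request: "abc" }],
  samples: [
    # [frames, weight_ns, thread_seq, label_set_id]; frames deepest-first
    [[["app.rb", "Object#handle"], ["app.rb", "<main>"]], 5_000_000, 1, 1],
    [[["app.rb", "<main>"]],                              1_000_000, 1, 0],
  ],
}

data[:samples].each do |frames, weight_ns, seq, lsi|
  labels = data[:label_sets][lsi] # resolve the label set ID to a Hash
  leaf   = frames.first.last      # deepest frame's method label
  puts "#{leaf} thread=#{seq} #{weight_ns / 1_000_000.0}ms labels=#{labels.inspect}"
end
```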
4.4 Rperf.save
Rperf.save writes profiling data to a file in any supported format:
Rperf.save("profile.pb.gz", data) # pprof format
Rperf.save("profile.collapsed", data) # collapsed stacks
Rperf.save("profile.txt", data) # text report
The format is auto-detected from the file extension. You can override it with the format: keyword:
Rperf.save("output.dat", data, format: :text)
4.5 Rperf.snapshot
Rperf.snapshot returns the current profiling data without stopping. Only works in aggregate mode (the default). Returns nil if not profiling.
Rperf.start(frequency: 1000)
# ... work ...
snap = Rperf.snapshot
Rperf.save("snap.pb.gz", snap)
# ... more work (profiling continues) ...
data = Rperf.stop
With clear: true, the aggregated data is reset after taking the snapshot. This enables interval-based profiling where each snapshot covers only the period since the last clear:
Rperf.start(frequency: 1000)
loop do
sleep 10
snap = Rperf.snapshot(clear: true)
Rperf.save("profile-#{Time.now.to_i}.pb.gz", snap)
end
4.6 Sample labels
Rperf.label attaches key-value labels to the current thread’s samples. Labels appear in pprof sample labels, enabling per-context filtering (e.g., per-request profiling). If profiling is not running, label is silently ignored — safe to call unconditionally (e.g., from Rack middleware).
4.6.1 Block form
With a block, labels are automatically restored when the block exits — even if an exception is raised:
Rperf.label(request: "abc-123", endpoint: "/api/users") do
handle_request # samples inside get these labels
end
# labels are restored to previous state here
4.6.2 Without block
Without a block, labels persist on the current thread until changed:
Rperf.label(request: "abc-123")
# all subsequent samples on this thread have request="abc-123"
4.6.3 Merging and deleting
New labels merge with existing ones. Set a value to nil to remove a key:
Rperf.label(request: "abc")
Rperf.label(phase: "db") # adds phase, keeps request
Rperf.labels #=> {request: "abc", phase: "db"}
Rperf.label(request: nil) # removes request
Rperf.labels #=> {phase: "db"}
4.6.4 Nested blocks
Each block creates a scope. When it exits, the labels are restored to the state before the block — regardless of what happened inside:
Rperf.label(request: "abc") do
Rperf.label(phase: "db") do
Rperf.labels #=> {request: "abc", phase: "db"}
end
Rperf.labels #=> {request: "abc"}
end
Rperf.labels #=> {}
4.6.5 Filtering by label in pprof
Labels are written into pprof sample labels. Use go tool pprof to filter:
# Filter to specific label value
go tool pprof -tagfocus=request=abc-123 profile.pb.gz
# Group by label at stack root ("which requests are slow?")
go tool pprof -tagroot=request profile.pb.gz
# Group by label at stack leaf ("who calls this function?")
go tool pprof -tagleaf=request profile.pb.gz
# Exclude specific label value
go tool pprof -tagignore=request=healthcheck profile.pb.gz
4.6.6 Reading labels
Rperf.labels returns the current thread’s labels as a Hash:
Rperf.labels #=> {request: "abc", phase: "db"}
Returns an empty Hash if no labels are set.
4.7 Practical examples
4.7.1 Profiling a web request handler
require "rperf"
class ApplicationController
def profile_action
data = Rperf.start(mode: :wall) do
# Simulate a typical request
users = User.where(active: true).limit(100)
result = users.map { |u| serialize_user(u) }
render json: result
end
Rperf.save("request_profile.txt", data)
end
end
Using wall mode here captures not just CPU time but also database I/O and any GVL contention.
4.7.2 Comparing CPU and wall profiles
require "rperf"
def workload
# Mix of CPU and I/O
100.times do
compute_something
sleep(0.001)
end
end
# CPU profile: shows where CPU cycles go
cpu_data = Rperf.start(mode: :cpu) { workload }
Rperf.save("cpu.txt", cpu_data)
# Wall profile: shows where wall time goes
wall_data = Rperf.start(mode: :wall) { workload }
Rperf.save("wall.txt", wall_data)
The CPU profile will focus on compute_something, while the wall profile will show the sleep calls as [GVL blocked] time.
4.7.3 Processing samples
You can work with the sample data programmatically. By default, samples are aggregated (identical stacks merged):
require "rperf"
data = Rperf.start(mode: :cpu) { workload }
# data[:aggregated_samples] contains aggregated entries (one per unique stack)
# Find the hottest function
flat = Hash.new(0)
data[:aggregated_samples].each do |frames, weight, thread_seq|
leaf_label = frames.first&.last # frames[0] is the leaf
flat[leaf_label] += weight
end
top = flat.sort_by { |_, w| -w }.first(5)
top.each do |label, weight_ns|
puts "#{label}: #{weight_ns / 1_000_000.0}ms"
end
To get raw (non-aggregated) per-sample data, pass aggregate: false. Each timer tick produces a separate entry:
data = Rperf.start(mode: :cpu, aggregate: false) { workload }
# data[:raw_samples] contains one entry per timer sample
data[:raw_samples].each do |frames, weight, thread_seq|
puts "thread=#{thread_seq} weight=#{weight}ns depth=#{frames.size}"
end
4.7.4 Generating collapsed stacks for FlameGraph
require "rperf"
data = Rperf.start(mode: :cpu) { workload }
Rperf.save("profile.collapsed", data)
The collapsed format is one line per unique stack, compatible with Brendan Gregg’s FlameGraph[Brendan, 2016] tools and speedscope:
frame1;frame2;...;leaf weight_ns
You can generate a flame graph SVG:
flamegraph.pl profile.collapsed > flamegraph.svg
Or open the .collapsed file directly in speedscope.
5. Interpreting Results
5.1 Flat vs. cumulative time
Understanding the difference between flat and cumulative time is essential for effective profiling; the distinction goes back to classic profiling tools such as gprof[Susan, 1982].
- Flat time: Time attributed directly to a function — it was the leaf (deepest frame) in the sample. High flat time means the function itself is doing expensive work.
- Cumulative time: Time for all samples where the function appears anywhere in the stack. High cumulative time means the function (or something it calls) is expensive.
mermaid source
graph TD
A["main()"] -->|calls| B["process()"]
B -->|calls| C["compute()"]
B -->|calls| D["format()"]
style C fill:#ff6b6b,color:#fff
style D fill:#ffa07a,color:#fff
If compute() has high flat time, optimize compute() itself. If process() has high cumulative but low flat time, the cost is in its children (compute() and format()).
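The distinction can be computed directly from weighted samples. A minimal sketch with made-up method names and nanosecond weights (frames deepest-first, as in rperf's sample data):

```ruby
samples = [
  # [frames (deepest-first), weight_ns] — illustrative stacks
  [["compute", "process", "main"], 7_000_000],
  [["format",  "process", "main"], 2_000_000],
  [["process", "main"],            1_000_000],
]

flat = Hash.new(0)
cum  = Hash.new(0)

samples.each do |frames, weight|
  flat[frames.first] += weight              # only the leaf gets flat time
  frames.uniq.each { |f| cum[f] += weight } # every frame gets cumulative time
end

flat # => {"compute"=>7000000, "format"=>2000000, "process"=>1000000}
cum["process"] # => 10000000 — high cumulative, low flat: cost is in its children
```

The uniq guards against double-counting a method that appears twice in one stack (recursion).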
5.2 Weight unit
All weights in rperf are in nanoseconds, regardless of profiling mode:
- 1,000 ns = 1 us (microsecond)
- 1,000,000 ns = 1 ms (millisecond)
- 1,000,000,000 ns = 1 s (second)
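A small helper (not part of rperf's API) for turning raw nanosecond weights into readable strings:

```ruby
# Hypothetical convenience helper — rperf itself reports in these units,
# but this formatter is our own.
def format_ns(ns)
  if    ns >= 1_000_000_000 then "%.1f s"  % (ns / 1_000_000_000.0)
  elsif ns >= 1_000_000     then "%.1f ms" % (ns / 1_000_000.0)
  elsif ns >= 1_000         then "%.1f us" % (ns / 1_000.0)
  else                           "#{ns} ns"
  end
end

format_ns(3_000_000) # => "3.0 ms"
format_ns(1_500)     # => "1.5 us"
```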
5.3 Synthetic frames
In addition to normal Ruby frames, rperf records synthetic frames that represent non-CPU activity. These always appear as the leaf (deepest) frame in a sample.
5.3.1 [GVL blocked]
Mode: wall only
Time the thread spent off the GVL — during I/O operations, sleep, or C extensions that release the GVL. This time is attributed to the stack captured at the SUSPENDED event (when the thread released the GVL).
High [GVL blocked] time indicates your program is I/O bound. Look at the cumulative view to find which functions are triggering the I/O.
5.3.2 [GVL wait]
Mode: wall only
Time the thread spent waiting to reacquire the GVL after becoming ready. This indicates GVL contention — another thread is holding the GVL while this thread wants to run.
High [GVL wait] time means your threads are serialized on the GVL. Consider reducing GVL-holding work, using Ractors, or moving work to child processes.
5.3.3 [GC marking]
Mode: cpu and wall
Time spent in the GC marking phase. Always measured in wall time. Attributed to the stack that triggered GC.
High [GC marking] time means too many live objects. Reduce the number of long-lived allocations.
5.3.4 [GC sweeping]
Mode: cpu and wall
Time spent in the GC sweeping phase. Always measured in wall time. Attributed to the stack that triggered GC.
High [GC sweeping] time means too many short-lived objects. Consider reusing objects or using object pools.
5.4 Diagnosing common problems
5.4.1 High CPU usage
Mode: cpu
Look for functions with high flat CPU time. These are the functions consuming CPU cycles.
Action: Optimize the hot function (better algorithm, caching) or call it less frequently.
5.4.2 Slow requests / high latency
Mode: wall
Look for functions with high cumulative wall time.
- If [GVL blocked] is dominant: I/O or sleep is the bottleneck. Check database queries, HTTP calls, file I/O.
- If [GVL wait] is dominant: GVL contention. Reduce GVL-holding work or move to Ractors/child processes.
5.4.3 GC pressure
Mode: cpu or wall
Look for [GC marking] and [GC sweeping] samples.
- High [GC marking]: Too many live objects. Reduce allocations of long-lived objects.
- High [GC sweeping]: Too many short-lived objects. Reuse or pool objects.
The rperf stat output also shows GC counts and allocated/freed object counts, which can help diagnose allocation-heavy code.
5.4.4 Multithreaded app slower than expected
Mode: wall
Look for [GVL wait] time across threads.
rperf stat ruby threaded_app.rb
Example output for a GVL-contended workload:
Performance stats for 'ruby threaded_app.rb':
89.9 ms user
14.0 ms sys
41.2 ms real
0.5 ms 1.2% CPU execution
9.8 ms 23.8% [Ruby] GVL blocked (I/O, sleep)
30.8 ms 75.0% [Ruby] GVL wait (contention)
Here, 75% of wall time is GVL contention. The four threads are fighting for the GVL, serializing their CPU work.
5.5 Tips for effective profiling
- Default frequency (1000 Hz) works well for most cases. For long-running production profiling, 100-500 Hz reduces overhead further.
- Profile representative workloads, not micro-benchmarks. Profiling a real request or batch job gives more actionable results than profiling a tight loop.
- Compare cpu and wall profiles to distinguish CPU-bound from I/O-bound code. If a function is hot in wall mode but not in CPU mode, it’s blocked on I/O.
- Use rperf diff to measure the impact of optimizations. Profile before and after, then compare.
- Use the verbose flag (-v) to check profiler overhead and see top functions immediately.
6. Output Formats
rperf supports three output formats. The format is auto-detected from the file extension, or can be set explicitly with the --format flag (CLI) or format: parameter (API).
6.1 pprof (default)
The pprof format is a gzip-compressed Protocol Buffers binary. This is the standard format used by Go’s pprof tooling and is the default output of rperf.
Extension convention: .pb.gz
How to view:
# Interactive web UI (requires Go)
rperf report profile.pb.gz
# Top functions
rperf report --top profile.pb.gz
# Text report
rperf report --text profile.pb.gz
# Or use go tool pprof directly
go tool pprof -http=:8080 profile.pb.gz
go tool pprof -top profile.pb.gz
You can also import pprof files into speedscope via its web interface.
Advantages: Standard format supported by a wide ecosystem of tools. Supports diff comparison between two profiles. Interactive exploration with flame graphs, call graphs, and source annotations.
6.1.1 Embedded metadata
rperf embeds the following metadata in each pprof profile:
| Field | Description |
|---|---|
| comment | rperf version, profiling mode, frequency, Ruby version |
| time_nanos | Profile collection start time (epoch nanoseconds) |
| duration_nanos | Profile duration (nanoseconds) |
| doc_url | Link to rperf documentation |
View comments with: go tool pprof -comments profile.pb.gz
6.1.2 Sample labels
Each sample carries a thread_seq numeric label — a thread sequence number (1-based) assigned when rperf first sees each thread during a profiling session. When Rperf.label is used, custom key-value string labels are also attached to samples.
# Group flame graph by thread
go tool pprof -tagroot=thread_seq profile.pb.gz
# Filter by custom label
go tool pprof -tagfocus=request=abc-123 profile.pb.gz
# Group by label at root ("which requests are slow?")
go tool pprof -tagroot=request profile.pb.gz
# Group by label at leaf ("who calls this function?")
go tool pprof -tagleaf=request profile.pb.gz
# Exclude by label
go tool pprof -tagignore=request=healthcheck profile.pb.gz
6.2 Collapsed stacks
The collapsed stacks format is a plain text format with one line per unique stack trace. Each line contains a semicolon-separated stack (bottom-to-top) followed by a space and the weight in nanoseconds.
Extension convention: .collapsed
Format:
bottom_frame;...;top_frame weight_ns
Example output:
<main>;Integer#times;block in <main>;Object#cpu_work;Integer#times;Object#cpu_work 53419170
<main>;Integer#times;block in <main>;Object#cpu_work;Integer#times 16962309
<main>;Integer#times;block in <main>;Object#io_work;Kernel#sleep 2335151
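Going the other way — from the documented samples shape to collapsed lines — is a reverse-and-join, since frames are stored deepest-first but collapsed stacks read bottom-to-top. A sketch with one hand-built sample:

```ruby
# One sample in the documented [frames, weight_ns, thread_seq, label_set_id]
# shape (hand-written for illustration); frames are deepest-first.
samples = [
  [[["app.rb", "Object#work"], ["app.rb", "<main>"]], 53_419_170, 1, 0],
]

lines = samples.map do |frames, weight_ns, _seq, _lsi|
  stack = frames.reverse.map { |_path, label| label }.join(";")
  "#{stack} #{weight_ns}"
end

lines.first # => "<main>;Object#work 53419170"
```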
How to use:
# Generate a FlameGraph SVG
rperf record -o profile.collapsed ruby my_app.rb
flamegraph.pl profile.collapsed > flamegraph.svg
# Open in speedscope (drag-and-drop the .collapsed file)
# macOS: open https://www.speedscope.app/
# Linux: xdg-open https://www.speedscope.app/
Advantages: Simple text format, easy to process with command-line tools. Compatible with Brendan Gregg’s FlameGraph[Brendan, 2016] tools and speedscope.
6.2.1 Parsing collapsed stacks programmatically
File.readlines("profile.collapsed").each do |line|
stack, weight = line.rpartition(" ").then { |s, _, w| [s, w.to_i] }
frames = stack.split(";")
# frames[0] is bottom (main), frames[-1] is leaf (hot)
puts "#{frames.last}: #{weight / 1_000_000.0}ms"
end
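The per-line loop above extends naturally to an aggregate view. As a sketch (the `leaf_self_times` helper is a name invented here, not part of rperf), this sums self time per leaf frame across all stacks:

```ruby
# Sum self time per leaf frame across collapsed-stack lines.
# Each line: "bottom;...;leaf weight_ns"
def leaf_self_times(lines)
  totals = Hash.new(0)
  lines.each do |line|
    stack, _, weight = line.strip.rpartition(" ")
    totals[stack.split(";").last] += weight.to_i
  end
  totals.sort_by { |_, ns| -ns }  # hottest leaves first
end

lines = [
  "<main>;Object#cpu_work 53419170",
  "<main>;Object#io_work;Kernel#sleep 2335151",
  "<main>;Object#cpu_work 1000000",
]
leaf_self_times(lines).each do |frame, ns|
  puts format("%10.2f ms  %s", ns / 1_000_000.0, frame)
end
```

With real data, pass `File.foreach("profile.collapsed")` instead of the inline array.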
6.3 Text report
The text format is a human-readable (and AI-readable) report showing flat and cumulative top-N tables.
Extension convention: .txt
Example output:
Total: 509.5ms (cpu)
Samples: 509, Frequency: 1000Hz
Flat:
509.5 ms 100.0% Object#fib (fib.rb)
Cumulative:
509.5 ms 100.0% Object#fib (fib.rb)
509.5 ms 100.0% <main> (fib.rb)
Sections:
- Header: Total profiled time, sample count, and frequency
- Flat table: Functions sorted by self time (the function was the leaf/deepest frame)
- Cumulative table: Functions sorted by total time (the function appeared anywhere in the stack)
Advantages: No tooling required — readable with cat. Top 50 entries per table by default. Good for quick analysis, sharing in issue reports, or feeding to AI assistants.
6.4 Format comparison
| Feature | pprof | collapsed | text |
|---|---|---|---|
| File size | Small (binary + gzip) | Medium (text) | Small (text) |
| Flame graph | Yes (via pprof web UI) | Yes (via flamegraph.pl) | No |
| Call graph | Yes | No | No |
| Diff comparison | Yes (rperf diff) | No | No |
| No tools needed | No (requires Go) | No (requires flamegraph.pl) | Yes |
| Programmatic parsing | Complex (protobuf) | Simple | Simple |
| AI-friendly | No | Yes | Yes |
6.5 Auto-detection rules
| File extension | Format |
|---|---|
| .collapsed | Collapsed stacks |
| .txt | Text report |
| Anything else | pprof |
The default output file (rperf.data) uses pprof format.
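These rules amount to a small dispatch on the file extension. A sketch (the `detect_format` name is invented for illustration):

```ruby
# Mirror rperf's output-format auto-detection: the extension decides,
# with pprof as the fallback (including the default rperf.data).
def detect_format(path)
  case File.extname(path)
  when ".collapsed" then :collapsed
  when ".txt"       then :text
  else                   :pprof
  end
end

detect_format("rperf.data")        # => :pprof
detect_format("profile.collapsed") # => :collapsed
detect_format("report.txt")        # => :text
```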
7. Framework Integration
rperf provides optional integrations that automatically label samples with context from web frameworks and job processors. These integrations only set labels — they do not start or stop profiling. Start profiling separately (e.g., in an initializer).
7.1 Rack middleware
Rperf::Middleware labels each request with its endpoint (METHOD /path).
require "rperf/middleware"
7.1.1 Rails
# config/initializers/rperf.rb
require "rperf/middleware"
Rperf.start(mode: :wall, frequency: 99)
Rails.application.config.middleware.use Rperf::Middleware
at_exit do
data = Rperf.stop
Rperf.save("tmp/profile.pb.gz", data) if data
end
Then filter the profile by endpoint:
go tool pprof -tagfocus=endpoint="GET /api/users" tmp/profile.pb.gz
go tool pprof -tagroot=endpoint tmp/profile.pb.gz # group by endpoint
7.1.2 Sinatra
require "sinatra"
require "rperf/middleware"
Rperf.start(mode: :wall, frequency: 99)
use Rperf::Middleware
at_exit do
data = Rperf.stop
Rperf.save("profile.pb.gz", data) if data
end
get "/hello" do
"Hello, world!"
end
7.1.3 Customizing the label key
By default the middleware uses the label key :endpoint. You can change it:
use Rperf::Middleware, label_key: :route
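Conceptually, such a middleware just wraps each request in a label scope. The sketch below is a simplified stand-in, not rperf's actual class: `EndpointLabeler` and the `sink` hash model what `Rperf::Middleware` does with `Rperf.label`:

```ruby
# A simplified model of an endpoint-labeling Rack middleware.
# "sink" stands in for the profiler's label store; the key name is
# configurable, mirroring the label_key option.
class EndpointLabeler
  def initialize(app, sink, label_key: :endpoint)
    @app, @sink, @label_key = app, sink, label_key
  end

  def call(env)
    # Label the request as "METHOD /path" for the duration of the call
    @sink[@label_key] = "#{env["REQUEST_METHOD"]} #{env["PATH_INFO"]}"
    @app.call(env)
  ensure
    @sink.delete(@label_key)  # labels must not leak into the next request
  end
end

sink = {}
app  = ->(env) { [200, {}, [sink[:endpoint]]] }  # sees the label while active
status, _headers, body = EndpointLabeler.new(app, sink).call(
  "REQUEST_METHOD" => "GET", "PATH_INFO" => "/hello"
)
```

The `ensure` block is the important part: the label is scoped to the request, so samples taken between requests carry no stale endpoint.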
7.2 Active Job
Rperf::ActiveJobMiddleware labels each job with its class name (e.g., SendEmailJob). Works with any Active Job backend — Sidekiq, GoodJob, Solid Queue, etc.
require "rperf/active_job"
Include it in your base job class:
# app/jobs/application_job.rb
class ApplicationJob < ActiveJob::Base
include Rperf::ActiveJobMiddleware
end
All subclasses inherit the label automatically:
class SendEmailJob < ApplicationJob
def perform(user)
# samples here get job="SendEmailJob"
end
end
Filter by job:
go tool pprof -tagfocus=job=SendEmailJob profile.pb.gz
go tool pprof -tagroot=job profile.pb.gz # group by job class
7.3 Sidekiq
Rperf::SidekiqMiddleware labels each job with its worker class name. This covers both Active Job-backed workers and plain Sidekiq workers.
require "rperf/sidekiq"
Register it as a Sidekiq server middleware:
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
config.server_middleware do |chain|
chain.add Rperf::SidekiqMiddleware
end
end
Note
If you use Active Job with Sidekiq, choose one or the other — using both will result in duplicate labels. The Sidekiq middleware is more general (covers non-Active Job workers too).
7.4 Full Rails example
A typical Rails setup with both web and job profiling:
# config/initializers/rperf.rb
require "rperf/middleware"
require "rperf/sidekiq"
Rperf.start(mode: :wall, frequency: 99)
# Label web requests
Rails.application.config.middleware.use Rperf::Middleware
# Label Sidekiq jobs
Sidekiq.configure_server do |config|
config.server_middleware do |chain|
chain.add Rperf::SidekiqMiddleware
end
end
# Export profiles periodically
Thread.new do
loop do
sleep 60
snap = Rperf.snapshot(clear: true)
Rperf.save("tmp/profile-#{Time.now.to_i}.pb.gz", snap) if snap
end
end
Then compare where time goes across endpoints and jobs:
go tool pprof -tagroot=endpoint tmp/profile-*.pb.gz # web breakdown
go tool pprof -tagroot=job tmp/profile-*.pb.gz # job breakdown
Deep Dive
This part explores how to read and act on profiling results, and explains how rperf works under the hood.
8. Architecture Overview
This chapter describes rperf’s high-level architecture and the core data structures that underpin profiling. Understanding these internals helps you interpret edge cases in profiling results and appreciate the design trade-offs.
8.1 System diagram
rperf consists of a C extension and a Ruby wrapper:
mermaid source
graph TD
subgraph "C Extension (ext/rperf/rperf.c)"
Timer["Timer<br/>(signal or pthread_cond_timedwait)"]
PJ["rb_postponed_job_trigger"]
Sample["Sample callback<br/>(rperf_sample_job)"]
ThreadHook["Thread event hook<br/>(SUSPENDED/READY/RESUMED/EXITED)"]
GCHook["GC event hook<br/>(ENTER/EXIT/START/END_MARK/END_SWEEP)"]
BufA["Sample buffer A"]
BufB["Sample buffer B"]
Worker["Worker thread<br/>(timer + aggregation)"]
FrameTable["Frame table<br/>(VALUE → frame_id)"]
AggTable["Aggregation table<br/>(stack → weight)"]
end
subgraph "Ruby (lib/rperf.rb)"
API["Rperf.start / Rperf.stop"]
Encoders["Encoders<br/>(PProf / Collapsed / Text)"]
StatOut["stat output"]
end
Timer -->|triggers| PJ
PJ -->|at safepoint| Sample
Sample -->|writes| BufA
ThreadHook -->|writes| BufA
GCHook -->|writes| BufA
BufA -->|swap when full| BufB
BufB -->|aggregated by| Worker
Worker -->|inserts| FrameTable
Worker -->|inserts| AggTable
API -->|calls| Sample
AggTable -->|read at stop| Encoders
FrameTable -->|resolved at stop| Encoders
Encoders --> StatOut
The C extension handles all performance-critical operations: timer management, sample recording, and aggregation. The Ruby layer provides the user-facing API, output encoding, and statistics formatting.
8.2 Global profiler state
rperf uses a single global rperf_profiler_t struct. Only one profiling session can be active at a time (single session limitation). The struct holds:
- Timer configuration (frequency, mode, signal number)
- Double-buffered sample storage (two sample buffers + frame pools, swapped when full)
- Frame table (VALUE → uint32 frame_id, deduplicates frame references)
- Aggregation table (unique stack → accumulated weight)
- Worker thread handle (unified timer + aggregation thread)
- Thread-specific key for per-thread data
- GC phase tracking state
- Sampling overhead counters
8.3 Per-thread data
Each thread gets a rperf_thread_data_t struct stored via Ruby’s thread-specific data API (rb_internal_thread_specific_set). This tracks:
- prev_time_ns: Previous time reading (for computing weight)
- prev_wall_ns: Previous wall time reading
- suspended_at_ns: Wall timestamp when thread was suspended
- ready_at_ns: Wall timestamp when thread became ready
- suspended_frame_start/depth: Saved backtrace from SUSPENDED event
- label_set_id: Current label set ID (0 = no labels)
Thread data is created lazily on first encounter and freed on the EXITED event or at profiler stop.
8.4 GC safety
Frame VALUEs must be protected from garbage collection. rperf wraps the profiler struct in a TypedData object with a custom dmark function that marks three regions:
- Both frame pools (active and standby buffers)
- Frame table keys (unique frame VALUEs, excluding synthetic frame slots)
The frame table keys array starts at 4,096 entries and grows by 2× when full. Growth allocates a new array, copies existing data, and swaps the pointer atomically (memory_order_release). The old array is kept alive until stop to prevent use-after-free if GC’s mark phase is reading it concurrently. The dmark function loads both the keys pointer and the count with memory_order_acquire to ensure a consistent view.
8.5 Fork safety
rperf registers a pthread_atfork child handler that silently stops profiling in the forked child process (fork safety):
- Clears the timer/signal state
- Removes event hooks
- Frees sample buffers, frame table, and aggregation table
The parent process continues profiling unaffected. The child can start a fresh profiling session if needed.
8.6 Known limitations
8.6.1 Running EC race
There is a known race condition in the Ruby VM where rb_postponed_job_trigger from the timer thread may set the interrupt flag on the wrong thread’s execution context. This happens when a new thread’s native thread starts before acquiring the GVL. The result is that timer samples may miss threads doing C busy-wait, with their CPU time leaking into the next SUSPENDED event’s stack.
This is a Ruby VM bug, not a rperf bug, and affects all postponed-job-based profilers.
8.6.2 Single session
Only one profiling session can be active at a time due to the global profiler state. Calling Rperf.start while already profiling is not supported.
8.6.3 Method-level granularity
rperf profiles at the method level, not the line level. Frame labels use rb_profile_frame_full_label for qualified names (e.g., Integer#times, MyClass#method_name). Line numbers are not included.
9. Sampling
This chapter explains how rperf collects samples: the timer mechanism that triggers sampling, the sampling callback itself, and the event hooks that capture GVL and GC activity.
9.1 Timer and worker thread
rperf uses a single worker thread that handles both timer triggering and periodic sample aggregation. The timer mechanism depends on the platform.
9.1.1 Linux: Signal-based timer (default)
On Linux, rperf uses timer_create with SIGEV_THREAD_ID to deliver a real-time signal (default: SIGRTMIN+8) exclusively to the worker thread at the configured frequency. The signal handler calls rb_postponed_job_trigger to schedule the sampling callback.
Using SIGEV_THREAD_ID ensures the timer signal only targets the worker thread, preventing it from interrupting nanosleep, read, or other blocking syscalls in Ruby threads (e.g., inside rb_thread_call_without_gvl).
mermaid source
sequenceDiagram
participant Kernel as Linux Kernel
participant Worker as Worker Thread
participant VM as Ruby VM
participant Job as Postponed Job
Kernel->>Worker: SIGRTMIN+8 (every 1ms at 1000Hz)
Worker->>VM: rb_postponed_job_trigger()
Note over VM: Continues until safepoint...
VM->>Job: rperf_sample_job()
Job->>Job: rb_profile_frames() + record sample
Note over Worker: Also aggregates samples<br/>when swap_ready is set
This approach provides precise timing (~1000us median interval at 1000Hz).
9.1.2 Fallback: pthread_cond_timedwait
On macOS, or when signal: false is set on Linux, the worker thread uses pthread_cond_timedwait with an absolute deadline as the timer:
- Timeout (deadline reached): trigger rb_postponed_job_trigger and advance the deadline
- Signal (swap_ready set): aggregate the standby buffer immediately
The deadline-based approach avoids drift when aggregation takes time. This mode has ~100us timing imprecision compared to the signal-based approach.
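The difference between relative sleeps and absolute deadlines is plain arithmetic (illustrative numbers, not rperf code): with relative sleeps, a 300µs overrun in one iteration permanently shifts every later tick; with absolute deadlines it does not accumulate.

```ruby
period   = 1_000_000           # ns between ticks at 1000 Hz
overruns = [0, 300_000, 0, 0]  # one wakeup delayed 300us by aggregation

# Relative sleep: each overrun pushes every subsequent tick later
rel = overruns.each_with_object([0]) { |d, t| t << t.last + period + d }.drop(1)

# Absolute deadline (pthread_cond_timedwait style): overruns don't accumulate
abs = (1..overruns.size).map { |i| i * period }

rel  # => [1000000, 2300000, 3300000, 4300000]
abs  # => [1000000, 2000000, 3000000, 4000000]
```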
9.2 Sampling callback
When the postponed job fires, rperf_sample_job runs on whatever thread currently holds the GVL. It only samples that thread using rb_thread_current().
This is a deliberate design choice:
- rb_profile_frames can only capture the current thread’s stack
- There’s no need to iterate Thread.list — combined with GVL event hooks, rperf gets broad coverage of all threads (though a known race in the Ruby VM can cause occasional missed samples)
The sampling callback:
- Gets or creates per-thread data (rperf_thread_data_t)
- Reads the current clock (CLOCK_THREAD_CPUTIME_ID for CPU mode, CLOCK_MONOTONIC for wall mode)
- Computes weight as time_now - prev_time
- Captures the backtrace with rb_profile_frames directly into the frame pool
- Records the sample (frame start index, depth, weight, type)
- Updates prev_time
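The weight computation is what corrects safepoint bias: each sample is weighted by the real elapsed time since the previous sample on that thread, so a safepoint delayed by a long C call produces one proportionally heavy sample rather than several lost ones. In sketch form (illustrative, not rperf's C code):

```ruby
# Weight each sample by the time elapsed since the previous one.
def delta_weights(timestamps_ns)
  timestamps_ns.each_cons(2).map { |prev_t, now| now - prev_t }
end

# At 1000 Hz the ideal spacing is 1ms, but suppose a long-running C call
# delays the third safepoint by 5ms; the late sample absorbs the whole gap:
ts = [0, 1_000_000, 2_000_000, 7_000_000]
delta_weights(ts)      # => [1000000, 1000000, 5000000]
delta_weights(ts).sum  # => 7000000 (total time is preserved)
```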
9.3 GVL event tracking (wall mode)
In wall mode, rperf hooks into Ruby’s thread event API to track GVL transitions. This captures time that sampling alone would miss — time spent off the GVL.
mermaid source
stateDiagram-v2
[*] --> Running: Thread holds GVL
Running --> Suspended: SUSPENDED event<br/>(capture backtrace + timestamp)
Suspended --> Ready: READY event<br/>(record wall timestamp)
Ready --> Running: RESUMED event<br/>(record GVL blocked + wait samples)
Running --> [*]: EXITED event<br/>(cleanup thread data)
9.3.1 SUSPENDED
When a thread releases the GVL (e.g., before I/O):
- Capture the current backtrace into the frame pool
- Record a normal sample (time since last sample)
- Save the backtrace and wall timestamp for later use
9.3.2 READY
When a thread becomes ready to run (e.g., I/O completed):
- Record the wall timestamp (no GVL needed — only simple C operations)
9.3.3 RESUMED
When a thread reacquires the GVL:
- Record a [GVL blocked] sample: weight = ready_at - suspended_at (off-GVL time)
- Record a [GVL wait] sample: weight = resumed_at - ready_at (GVL contention time)
- Both samples reuse the backtrace captured at SUSPENDED
This way, off-GVL time and GVL contention are accurately attributed to the code that triggered them, even though no timer-based sampling can occur while the thread is off the GVL.
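The two weights are plain timestamp arithmetic over the three events. A sketch with illustrative names:

```ruby
# Split a thread's off-CPU interval into off-GVL time and GVL-wait time.
def gvl_samples(suspended_at, ready_at, resumed_at)
  {
    gvl_blocked: ready_at - suspended_at,  # e.g. waiting on I/O
    gvl_wait:    resumed_at - ready_at,    # runnable, but GVL held elsewhere
  }
end

# Released the GVL at t=0, I/O completed at 3ms, GVL reacquired at 3.5ms:
gvl_samples(0, 3_000_000, 3_500_000)
# => { gvl_blocked: 3000000, gvl_wait: 500000 }
```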
9.4 GC phase tracking
rperf hooks into Ruby’s internal GC events to track garbage collection time:
| Event | Action |
|---|---|
| GC_START | Set phase to marking |
| GC_END_MARK | Set phase to sweeping |
| GC_END_SWEEP | Clear phase |
| GC_ENTER | Capture backtrace + wall timestamp |
| GC_EXIT | Record [GC marking] or [GC sweeping] sample |
GC samples always use wall time regardless of the profiling mode, because GC time is real elapsed time that affects application latency.
9.5 Deferred string resolution
During sampling, rperf stores raw frame VALUEs (Ruby internal object references) in the frame pool — not strings. This deferred string resolution keeps the hot path allocation-free and fast.
String resolution happens at stop time:
- Rperf.stop calls _c_stop
- The frame table maps each unique frame VALUE to a [path, label] string pair via rb_profile_frame_full_label and rb_profile_frame_path
- These resolved strings are passed to the Ruby encoders
This means sampling only writes integers (VALUE pointers and timestamps) to pre-allocated buffers. No Ruby objects are created, no GC pressure is added during profiling.
10. Aggregation
This chapter explains how rperf processes raw samples into compact, deduplicated data structures. Aggregation keeps memory usage bounded during long profiling sessions.
10.1 Overview
By default (aggregate: true), rperf periodically aggregates samples in the background. Raw samples — each containing a stack trace, weight, and metadata — are merged into two hash tables: a frame table and an aggregation table. Identical stacks are combined by summing their weights.
10.2 Double buffering
Two sample buffers alternate roles to allow concurrent sampling and aggregation:
- The active buffer receives new samples from sampling callbacks
- When the active buffer reaches 10,000 samples, the buffers swap
- The standby buffer is processed by the worker thread in the background
mermaid source
graph LR
subgraph "Active"
A["Sample buffer A<br/>(receiving samples)"]
end
subgraph "Standby"
B["Sample buffer B<br/>(being aggregated)"]
end
A -->|"swap when<br/>10K samples"| B
B -->|"swap back<br/>after aggregation"| A
If the worker thread hasn’t finished processing the standby buffer, the swap is skipped and the active buffer continues growing (fallback to unbounded mode).
Each buffer has its own frame pool. The active buffer’s frame pool receives raw VALUE references from rb_profile_frames. When buffers swap, the standby buffer’s frame pool is processed and then cleared for reuse.
10.3 Frame table
The frame table (VALUE → uint32_t frame_id) deduplicates frame references. Each unique frame VALUE gets a small integer ID.
- Keys: raw frame VALUEs (the only GC mark target for aggregated data)
- Values: compact uint32_t IDs used in the aggregation table
- Initial capacity: 4,096 entries, grows by 2× when full
- Growth uses atomic pointer swaps for GC dmark safety (see GC safety)
Using uint32_t frame IDs instead of full VALUEs halves the memory needed for stack storage in the aggregation table.
10.4 Aggregation table
The aggregation table merges identical stacks by summing their weights:
- Key: (frame_ids[], thread_seq, label_set_id) — the stack (as frame IDs), thread sequence number, and label set
- Value: accumulated weight in nanoseconds
Frame IDs are stored contiguously in a separate stack pool. Each aggregation table entry points into this pool (start index + depth).
Synthetic frames ([GVL blocked], [GVL wait], [GC marking], [GC sweeping]) are converted to reserved frame IDs during aggregation, eliminating the type field from the sample representation. This unifies the treatment of real and synthetic frames.
The label_set_id is part of the key, so samples with the same stack but different labels are kept separate.
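A pure-Ruby model of the frame table plus aggregation table (the real implementation is C hash tables; the `Aggregator` class here is invented for illustration):

```ruby
# Intern frames to small integer IDs, then merge identical
# (stack, thread_seq, label_set_id) keys by summing weights.
class Aggregator
  attr_reader :frame_ids, :weights

  def initialize
    @frame_ids = Hash.new { |h, frame| h[frame] = h.size }  # frame table
    @weights   = Hash.new(0)                                # aggregation table
  end

  def add(stack, weight_ns, thread_seq: 1, label_set_id: 0)
    key = [stack.map { |f| @frame_ids[f] }, thread_seq, label_set_id]
    @weights[key] += weight_ns
  end
end

agg = Aggregator.new
agg.add(%w[<main> work], 100)
agg.add(%w[<main> work], 50)                   # identical stack: merged
agg.add(%w[<main> work], 25, label_set_id: 1)  # same stack, new label: separate
```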
10.5 Memory usage
| Buffer | Initial size | Element size | Initial memory |
|---|---|---|---|
| Sample buffer (×2) | 16,384 | 32B | 512KB × 2 |
| Frame pool (×2) | 131,072 | 8B (VALUE) | 1MB × 2 |
| Frame table keys | 4,096 | 8B (VALUE) | 32KB |
| Frame table buckets | 8,192 | 4B (uint32) | 32KB |
| Agg table buckets | 2,048 | 28B | 56KB |
| Stack pool | 4,096 | 4B (uint32) | 16KB |
Total: ~3.6MB with aggregate: true, ~1.5MB with aggregate: false (single buffer only). Frame table and aggregation table grow dynamically as needed.
With aggregation enabled, memory usage stays roughly constant regardless of profiling duration — only the number of unique stacks matters, not the total number of samples collected.
11. Output Encoding
This chapter describes how rperf converts its internal aggregated data into output formats. Encoding runs once at stop time, so it is not performance-critical.
11.1 From aggregated data to output
When Rperf.stop is called, the C extension returns the aggregated data to Ruby:
- Frame table: An array of [path, label] string pairs, indexed by frame ID. Strings are resolved from raw VALUEs via rb_profile_frame_full_label and rb_profile_frame_path.
- Aggregation table: An array of [frame_ids, weight, thread_seq, label_set_id] entries.
- Label sets: An array of frozen Hashes mapping label keys to values.
The Ruby encoders (Rperf::PProf, Rperf::Collapsed, Rperf::Text) consume these arrays to produce the final output.
11.2 pprof encoder
rperf encodes the pprof[Gang, 2010] protobuf format entirely in Ruby, with no protobuf gem dependency. The encoder in Rperf::PProf.encode:
- Builds a string table (index 0 is always the empty string)
- Converts string frames to index frames and merges identical stacks
- Builds location and function tables
- Encodes the Profile protobuf message field by field
- Compresses the result with gzip
This hand-written encoder is simple (~100 lines) and handles only the subset of the pprof protobuf schema that rperf needs.
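The two primitives such an encoder needs are base-128 varints and length-delimited fields. A minimal sketch of both (not rperf's actual code):

```ruby
# Encode a non-negative integer as a protobuf base-128 varint.
def varint(n)
  bytes = []
  loop do
    byte = n & 0x7F
    n >>= 7
    bytes << (n.zero? ? byte : byte | 0x80)  # high bit = "more bytes follow"
    break if n.zero?
  end
  bytes.pack("C*")
end

# A length-delimited field (wire type 2), e.g. one string-table entry.
def string_field(field_number, str)
  varint((field_number << 3) | 2) + varint(str.bytesize) + str.b
end

varint(300).bytes       # => [0xAC, 0x02]
string_field(6, "cpu")  # tag byte, length byte, then "cpu"
```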
11.2.1 Embedded metadata
Each pprof profile includes metadata as comments: rperf version, profiling mode, frequency, and Ruby version. The time_nanos and duration_nanos fields record when and how long the profile was collected.
11.2.2 Thread sequence labels
Every sample carries a thread_seq numeric label — a 1-based thread sequence number assigned when rperf first sees each thread. This allows grouping flame graphs by thread in pprof tools.
11.3 Sample labels
Rperf.label enables per-context annotation of samples. The implementation is split between Ruby and C to keep the hot path minimal:
- Ruby side: Manages label sets as frozen Hash objects in an array (@label_set_table). A deduplication index (@label_set_index) maps each unique Hash to an integer ID. Rperf.label(key: value) merges the new labels with the current set, interns the result, and passes the integer ID to C.
- C side: Each rperf_thread_data_t stores a label_set_id (integer). When a sample is recorded, the current thread’s label_set_id is copied into the sample — a single integer, adding zero allocation overhead to the hot path. The aggregation table includes label_set_id in its hash key, so identical stacks with different labels remain separate entries.
- Encoding: At encode time, the Ruby PProf encoder reads the label_sets array and writes each label as a pprof Sample.Label (key-value string pair). go tool pprof -tagfocus and -tagroot can filter and group by label.
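The Ruby-side interning can be modeled in a few lines (names here are illustrative, not rperf's internals):

```ruby
# Deduplicate label Hashes into integer IDs; ID 0 is the empty label set.
class LabelSetTable
  attr_reader :table

  def initialize
    @table = [{}.freeze]
    @index = { @table.first => 0 }
  end

  # Merge new labels into the current set and return its stable integer ID.
  def intern(current, new_labels)
    merged = current.merge(new_labels).freeze
    @index[merged] ||= (@table << merged).size - 1
  end
end

labels = LabelSetTable.new
a = labels.intern({}, request: "abc")               # => 1
b = labels.intern({}, request: "abc")               # same set, same ID
c = labels.intern({ request: "abc" }, job: "Mail")  # nested labels => new ID
```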
11.4 Collapsed stacks encoder
The collapsed stacks encoder produces one line per unique stack: frames joined by semicolons (bottom-to-top), followed by a space and the weight in nanoseconds. This format is consumed by FlameGraph tools and speedscope.
When label sets are present, labels are not included in the collapsed output — use pprof format for label-aware analysis.
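Encoding is the inverse of the parsing shown in chapter 6: join frames with semicolons and append the weight. A sketch:

```ruby
# Encode aggregated (stack, weight_ns) pairs as collapsed-stack lines.
# Stacks are ordered bottom-to-top; weights are nanoseconds.
def encode_collapsed(entries)
  entries.map { |frames, weight_ns| "#{frames.join(";")} #{weight_ns}" }.join("\n")
end

encode_collapsed([
  [["<main>", "Object#cpu_work"], 53_419_170],
  [["<main>", "Kernel#sleep"], 2_335_151],
])
# => "<main>;Object#cpu_work 53419170\n<main>;Kernel#sleep 2335151"
```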
11.5 Text report encoder
The text report encoder produces a human-readable summary with:
- Header: Total profiled time, sample count, and frequency
- Flat table: Top 50 functions sorted by self time (the function was the leaf/deepest frame)
- Cumulative table: Top 50 functions sorted by total time (the function appeared anywhere in the stack)
This format requires no external tools and is suitable for quick analysis, issue reports, or AI-assisted analysis.
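Both tables fall out of a single pass over the aggregated stacks. A sketch (counting each function once per stack for the cumulative table — the usual convention, so recursion is not double-counted):

```ruby
# Compute self (flat) and total (cumulative) nanoseconds per function.
def flat_and_cumulative(entries)
  flat = Hash.new(0)
  cum  = Hash.new(0)
  entries.each do |frames, weight_ns|
    flat[frames.last] += weight_ns                # leaf owns the self time
    frames.uniq.each { |f| cum[f] += weight_ns }  # once per stack
  end
  [flat, cum]
end

flat, cum = flat_and_cumulative([
  [["<main>", "Object#fib", "Object#fib"], 400],  # recursive call
  [["<main>", "Object#fib"], 100],
])
flat  # => {"Object#fib" => 500}
cum   # => {"<main>" => 500, "Object#fib" => 500}
```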
A. Quick Reference
A.1 CLI cheat sheet
# Quick performance overview
rperf stat ruby my_app.rb
# Performance overview with profile tables
rperf stat --report ruby my_app.rb
# Full performance report (same as stat --report)
rperf exec ruby my_app.rb
# Record to default file (rperf.data, pprof format, cpu mode)
rperf record ruby my_app.rb
# Record with options
rperf record -m wall -f 500 -o profile.pb.gz ruby my_app.rb
# Record and print text profile to stdout
rperf record -p ruby my_app.rb
# Record to text format
rperf record -o profile.txt ruby my_app.rb
# Record to collapsed stacks
rperf record -o profile.collapsed ruby my_app.rb
# View profile in browser (requires Go)
rperf report
# Print top functions
rperf report --top profile.pb.gz
# Compare two profiles
rperf diff before.pb.gz after.pb.gz
# Full documentation
rperf help
A.2 Ruby API cheat sheet
require "rperf"
# Block form
data = Rperf.start(output: "profile.pb.gz", mode: :cpu) do
# code to profile
end
# Manual form
Rperf.start(frequency: 1000, mode: :wall)
# ...
data = Rperf.stop
# Save to file
Rperf.save("profile.pb.gz", data)
Rperf.save("profile.collapsed", data)
Rperf.save("profile.txt", data)
# Snapshot (read data without stopping)
snap = Rperf.snapshot
Rperf.save("snap.pb.gz", snap)
snap = Rperf.snapshot(clear: true) # reset after snapshot
# Labels (annotate samples with context)
Rperf.label(request: "abc") do
# samples inside get request="abc" label
end
Rperf.labels # get current labels
# Rack middleware (labels requests by endpoint)
require "rperf/middleware"
use Rperf::Middleware # Rails: config.middleware.use Rperf::Middleware
use Rperf::Middleware, label_key: :route # custom label key
# Active Job (labels jobs by class name)
require "rperf/active_job"
class ApplicationJob < ActiveJob::Base
include Rperf::ActiveJobMiddleware
end
# Sidekiq (labels jobs by worker class name)
require "rperf/sidekiq"
Sidekiq.configure_server do |config|
config.server_middleware do |chain|
chain.add Rperf::SidekiqMiddleware
end
end
A.3 Environment variables
These are used internally by the CLI to configure the auto-started profiler:
| Variable | Values | Description |
|---|---|---|
| RPERF_ENABLED | 1 | Enable auto-start on require |
| RPERF_OUTPUT | path | Output file path |
| RPERF_FREQUENCY | integer | Sampling frequency in Hz |
| RPERF_MODE | cpu, wall | Profiling mode |
| RPERF_FORMAT | pprof, collapsed, text | Output format |
| RPERF_VERBOSE | 1 | Print statistics to stderr |
| RPERF_STAT | 1 | Enable stat mode output |
| RPERF_STAT_REPORT | 1 | Include profile tables in stat output |
| RPERF_STAT_COMMAND | string | Command string shown in stat output header |
| RPERF_AGGREGATE | 0 | Disable sample aggregation (return raw samples) |
A.4 Profiling mode comparison
| Aspect | cpu | wall |
|---|---|---|
| Clock | CLOCK_THREAD_CPUTIME_ID | CLOCK_MONOTONIC |
| I/O time | Not measured | [GVL blocked] |
| Sleep time | Not measured | [GVL blocked] |
| GVL contention | Not measured | [GVL wait] |
| GC time | [GC marking], [GC sweeping] | [GC marking], [GC sweeping] |
| Best for | CPU hotspots | Latency analysis |
A.5 Output format comparison
| Extension | Format | Tooling required |
|---|---|---|
| .pb.gz (default) | pprof protobuf | Go (rperf report) |
| .collapsed | Collapsed stacks | flamegraph.pl or speedscope |
| .txt | Text report | None |
A.6 Synthetic frames
| Frame | Mode | Meaning |
|---|---|---|
| [GVL blocked] | wall | Thread off-GVL (I/O, sleep, C ext) |
| [GVL wait] | wall | Thread waiting for GVL (contention) |
| [GC marking] | both | GC marking phase (wall time) |
| [GC sweeping] | both | GC sweeping phase (wall time) |
Bibliography
- [Aman, 2014] Aman Gupta. stackprof: A sampling call-stack profiler for Ruby 2.2+. 2014.
- [Arnaldo, 2010] Arnaldo Carvalho de Melo. The New Linux perf Tools. 2010.
- [Brendan, 2016] Brendan Gregg. "The Flame Graph". Communications of the ACM, 59(6), pp. 48--57. 2016. DOI
- [Gang, 2010] Gang Ren and Eric Tune and Tipp Moseley and Yixin Shi and Silvius Rus and Robert Hundt. "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers". IEEE Micro, 30(4), pp. 65--79. 2010. DOI
- [Susan, 1982] Susan L. Graham and Peter B. Kessler and Marshall K. McKusick. "gprof: A Call Graph Execution Profiler". In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, pp. 120--126. 1982. DOI
- [Todd, 2010] Todd Mytkowicz and Amer Diwan and Matthias Hauswirth and Peter F. Sweeney. "Evaluating the Accuracy of Java Profilers". In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '10), pp. 187--197. 2010. DOI
Index
C
- collapsed stacks
- 1. Introduction
- 1. Introduction
- 6. Output Formats
- cpu mode
- 3. CLI Usage
- cumulative time
- 5. Interpreting Results
D
- deferred string resolution
- 9. Sampling
F
- flat time
- 5. Interpreting Results
- fork safety
- 8. Architecture Overview
P
- postponed job
- 1. Introduction
- 1. Introduction
- pprof
- 1. Introduction
- 1. Introduction
- 4. Ruby API
- 6. Output Formats
R
- Rperf.label
- 4. Ruby API
- 6. Output Formats
- 7. Framework Integration
- Rperf.labels
- 4. Ruby API
- Rperf.save
- 4. Ruby API
- Rperf.snapshot
- 4. Ruby API
- Rperf.start
- 4. Ruby API
- Rperf.stop
- 4. Ruby API
- rperf
- 1. Introduction
- rperf diff
- 3. CLI Usage
- rperf exec
- 3. CLI Usage
- rperf record
- 1. Introduction
- 3. CLI Usage
- rperf report
- 1. Introduction
- 3. CLI Usage
- rperf stat
- 1. Introduction
- 3. CLI Usage
S
- safepoint
- 1. Introduction
- 1. Introduction
- safepoint bias
- 1. Introduction
- 1. Introduction
- sampling
- 1. Introduction
- single session
- 8. Architecture Overview
- synthetic frames
- 1. Introduction
- 5. Interpreting Results
W
- wall mode
- 3. CLI Usage
- 3. CLI Usage
- 3. CLI Usage
- weight
- 1. Introduction
\
- \[GC marking\]
- 5. Interpreting Results
- \[GC sweeping\]
- 5. Interpreting Results
- \[GVL blocked\]
- 5. Interpreting Results
- \[GVL wait\]
- 5. Interpreting Results