rperf guide

A safepoint-based sampling performance profiler for Ruby.

rperf profiles your Ruby programs by sampling at safepoints and using actual time deltas as weights to correct safepoint bias. It provides a perf-like CLI, a Ruby API, and outputs JSON, pprof, collapsed stacks, and text report formats.

Key features:

CPU and wall-clock profiling modes
Time-weighted sampling that corrects safepoint bias
GVL contention and GC phase tracking
JSON (native), pprof, collapsed stacks, and text output formats
Low overhead (sampling callback cost < 0.2% at default 1000 Hz)
Requires Ruby >= 3.4.0 on POSIX systems (Linux, macOS)

Getting Started

This part covers what rperf is, why it exists, and how to install it on your system.

1. Introduction

1.1 What is rperf?

rperf is a safepoint-based sampling performance profiler for Ruby. It helps you find where your Ruby program spends its time — whether that’s CPU computation, I/O waits, GVL contention, or garbage collection.

Unlike traditional sampling profilers that count samples uniformly, rperf uses actual time deltas (in nanoseconds) as weights for each sample. This corrects the safepoint bias problem inherent in postponed job-based sampling, producing more accurate results.

rperf is inspired by Linux perf^{[Arnaldo, 2010]}, providing a familiar CLI interface with subcommands like record, stat, and report.

1.1.1 Advantages

Accurate profiling: Time-delta weighting corrects safepoint bias, producing results closer to real time distribution than traditional count-based profilers.
GVL / GC visibility: Tracks GVL blocking and contention (wall mode) and GC marking/sweeping (both modes) as sample labels — no separate tool needed.
Standard output formats: Outputs JSON (rperf native, default), pprof protobuf (compatible with go tool pprof), collapsed stacks (for flame graphs / speedscope), and human-readable text.
Built-in viewer: Includes an in-browser flamegraph viewer (Rperf::Viewer) with tag filtering. --html output loads d3 from CDN.
Low overhead: Default 1000 Hz sampling callback cost is < 0.2%, suitable for production.
Simple CLI: rperf stat for a quick overview, rperf record + rperf report for detailed analysis.

1.1.2 Limitations

Ruby 3.4+ only: Requires APIs introduced in Ruby 3.4.
POSIX only: Linux and macOS. Windows is not supported.
Method-level granularity: No line-number resolution — profiles show method names only.
Single session: Only one profiling session can be active at a time (global state in the C extension).
Safepoint latency: Samples are still deferred to safepoints. The time-delta weighting corrects the bias but cannot recover the exact interrupted instruction pointer.

1.2 Why another profiler?

Ruby already has profiling tools like stackprof^{[Aman, 2014]}. So why rperf?

1.2.1 The safepoint bias problem

Most Ruby sampling profilers collect backtraces by calling rb_profile_frames directly in the signal handler. This approach yields backtraces at the actual signal timing, but relies on undocumented internal VM state — rb_profile_frames is not guaranteed to be async-signal-safe, and the results can be unreliable if the VM is in the middle of updating its internal structures.

rperf takes a different approach: it uses the postponed job mechanism (rb_postponed_job), which is the Ruby VM’s official API for safely deferring work from signal handlers. Backtrace collection is deferred to the next safepoint — a point where the VM is in a consistent state and rb_profile_frames can return reliable results. The trade-off is that when a timer fires between safepoints, the actual sample is delayed until the next safepoint.

If each sample were counted equally (weight = 1), a sample delayed by a long-running C method would get the same weight as one taken immediately. This is the safepoint bias^{[Todd, 2010]} problem: functions that happen to be running when the thread reaches a safepoint appear more often than they should, while functions between safepoints are under-represented.

sequenceDiagram participant Timer participant VM participant Profiler Timer->>VM: Signal (t=0ms) Note over VM: Running C method...
cannot stop yet VM->>Profiler: Safepoint reached (t=3ms) Note over Profiler: Traditional: weight = 1
rperf: weight = 3ms Timer->>VM: Signal (t=1ms) VM->>Profiler: Safepoint reached (t=1ms) Note over Profiler: Traditional: weight = 1
rperf: weight = 1ms

mermaid source

sequenceDiagram
    participant Timer
    participant VM
    participant Profiler

    Timer->>VM: Signal (t=0ms)
    Note over VM: Running C method...<br/>cannot stop yet
    VM->>Profiler: Safepoint reached (t=3ms)
    Note over Profiler: Traditional: weight = 1<br/>rperf: weight = 3ms
    Timer->>VM: Signal (t=1ms)
    VM->>Profiler: Safepoint reached (t=1ms)
    Note over Profiler: Traditional: weight = 1<br/>rperf: weight = 1ms

rperf solves this by recording clock_now - clock_prev as the weight of each sample. A sample delayed by 3ms gets 3x the weight of a 1ms sample, accurately reflecting where time was actually spent.

1.2.2 Other advantages

GVL and GC awareness: In wall mode, rperf tracks time spent blocked off the GVL and waiting to reacquire the GVL (%GVL labels). GC marking/sweeping time is tracked in both CPU and wall modes (%GC labels, always using wall time as the weight). These labels can be filtered with pprof’s -tagfocus and -tagroot.
perf-like CLI: The rperf stat command gives you a quick performance overview (like perf stat), while rperf record + rperf report gives you detailed profiling.
Standard output: rperf outputs JSON (native format, default), pprof protobuf format (compatible with Go’s pprof tool ecosystem), collapsed stacks (for flame graphs^{[Brendan, 2016]} and speedscope), and human-readable text.
Low overhead: Default 1000 Hz sampling callback cost is < 0.2%, suitable for production use.

1.3 Requirements

Ruby >= 3.4.0
POSIX system (Linux or macOS)
Go (optional, for rperf report with .pb.gz files and rperf diff)

1.4 Quick start

Profile a Ruby script and view the results:

# Install rperf
gem install rperf

# Quick performance overview
rperf stat ruby my_app.rb

# Record a profile and view it interactively
rperf record ruby my_app.rb
rperf report

Or use rperf from Ruby code:

require "rperf"

Rperf.start(output: "profile.pb.gz") do
  # code to profile
end

View on GitHub

2. Installation

2.1 Installing the gem

rperf is distributed as a Ruby gem with a C extension:

gem install rperf

Or add it to your Gemfile:

gem "rperf"

Then run:

bundle install

2.2 Verifying the installation

Check that rperf is installed correctly:

rperf --help

You should see:

Usage: rperf record [options] command [args...]
       rperf stat [options] command [args...]
       rperf exec [options] command [args...]
       rperf report [options] [file]
       rperf diff [options] base.pb.gz target.pb.gz
       rperf help

Run 'rperf help' for full documentation

2.3 Platform support

rperf supports POSIX systems:

Platform	Timer implementation	Notes
Linux	`timer_create` + signal (default)	Best precision (~1000us at 1000Hz)
Linux	`nanosleep` thread (with `signal: false`)	Fallback, ~100us drift/tick
macOS	`nanosleep` thread	Signal-based timer not available

On Linux, rperf uses timer_create with SIGEV_THREAD_ID to deliver timer signals to a dedicated worker thread by default. This provides precise interval timing without interrupting Ruby threads. The signal number defaults to SIGRTMIN+8 and can be changed via the signal: keyword argument to Rperf.start in the Ruby API.

On macOS (and when signal: false is set on Linux), rperf falls back to a dedicated pthread with a nanosleep loop.

2.4 Optional: Go toolchain

Go is only required for rperf report with pprof (.pb.gz) files and for the rperf diff subcommand, both of which wrap go tool pprof. If you need these features, install Go from go.dev.

Without Go, you can still use all rperf features including rperf report with the default JSON (.json.gz) format, which opens the built-in rperf viewer. You can also view pprof files with other tools like speedscope or generate text/collapsed output directly.

View on GitHub

How to Use

This part explains how to profile Ruby programs with rperf — from the command line and as a library — and describes the available output formats.

3. CLI Usage

rperf provides a perf-like command-line interface with five main subcommands: record, stat, exec, report, and diff.

3.1 rperf stat

rperf stat is the quickest way to get a performance overview. It runs your command with wall-mode profiling and prints a summary to stderr.

rperf stat ruby my_app.rb

3.1.1 Example: CPU-bound program

Here is a simple Fibonacci computation:

# fib.rb
def fib(n)
  return n if n <= 1
  fib(n - 1) + fib(n - 2)
end
fib(35)

Running rperf stat:

rperf stat ruby fib.rb

 Performance stats for 'ruby fib.rb':

           744.4 ms   user
            32.0 ms   sys
           491.0 ms   real

           481.0 ms 100.0%  [Rperf] CPU execution
            12.0 ms         [Ruby ] GC time (9 count: 6 minor, 3 major)
         154,468            [Ruby ] allocated objects
          66,596            [Ruby ] freed objects
              25 MB         [OS   ] peak memory (maxrss)
              31            [OS   ] context switches (10 voluntary, 21 involuntary)
               0 MB         [OS   ] disk I/O (0 MB read, 0 MB write)

  481 samples / 481 triggers, 0.1% profiler overhead

The output tells you:

user/sys/real: Standard timing (like the time command)
[Rperf] lines: Sampling-derived time breakdown — CPU execution, GVL blocked (I/O/sleep), GVL wait (contention), GC marking, GC sweeping
[Ruby ] lines: Runtime statistics — GC count, allocated/freed objects, YJIT ratio (if enabled)
[OS ] lines: OS statistics — peak memory, context switches, disk I/O

Use --report to add flat and cumulative top-50 function tables to the output.

3.1.2 Example: Mixed CPU and I/O

# mixed.rb
def cpu_work(n)
  sum = 0
  n.times { |i| sum += i * i }
  sum
end

def io_work
  sleep(0.05)
end

5.times do
  cpu_work(500_000)
  io_work
end

Running rperf stat:

rperf stat ruby mixed.rb

Because stat always uses wall mode, you can see how time is divided between CPU and I/O. The [Rperf] GVL blocked line shows time spent sleeping/in I/O, while [Rperf] CPU execution shows compute time.

3.1.3 stat options

rperf stat [options] command [args...]

Option	Description
`-o PATH`	Also save profile to file (default: none)
`-f HZ`	Sampling frequency in Hz (default: 1000)
`-m MODE`	`cpu` or `wall` (default: `wall`)
`--report`	Include flat/cumulative profile tables in output
`--signal VALUE`	Timer signal (Linux only): signal number, or `false` for nanosleep thread
`--no-aggregate`	Disable sample aggregation (keep raw samples)
`--no-inherit`	Do not profile forked/spawned child processes (default: inherit)
`-v`	Print additional sampling statistics

3.2 rperf exec

rperf exec runs a command with profiling and prints a full performance report to stderr — equivalent to rperf stat --report. It uses wall mode by default and does not save to a file.

rperf exec ruby my_app.rb

This prints everything stat shows (timing, time breakdown, GC/memory/OS stats) plus flat and cumulative top-50 function tables.

3.2.1 exec options

rperf exec [options] command [args...]

Option	Description
`-o PATH`	Also save profile to file (default: none)
`-f HZ`	Sampling frequency in Hz (default: 1000)
`-m MODE`	`cpu` or `wall` (default: `wall`)
`--signal VALUE`	Timer signal (Linux only): signal number, or `false` for nanosleep thread
`--no-aggregate`	Disable sample aggregation (keep raw samples)
`--no-inherit`	Do not profile forked/spawned child processes (default: inherit)
`-v`	Print additional sampling statistics

3.3 rperf record

rperf record profiles a command and saves the result to a file. This is the primary way to capture profiles for detailed analysis.

rperf record ruby my_app.rb

By default, it saves to rperf.json.gz in JSON format with CPU mode.

3.3.1 Example: Recording a profile

rperf record ruby fib.rb

This creates rperf.json.gz. You can then analyze it with rperf report or convert to other formats.

3.3.2 Choosing a profiling mode

rperf supports two profiling modes:

cpu (default): Measures per-thread CPU time. Best for finding functions that consume CPU cycles. Ignores time spent sleeping, in I/O, or waiting for the GVL.
wall: Measures wall-clock time. Best for finding where wall time goes, including I/O, sleep, and GVL contention.

# CPU mode (default)
rperf record ruby my_app.rb

# Wall mode
rperf record -m wall ruby my_app.rb

3.3.3 Choosing an output format

rperf auto-detects the format from the file extension:

# JSON format (default)
rperf record -o profile.json.gz ruby my_app.rb

# pprof format
rperf record -o profile.pb.gz ruby my_app.rb

# Collapsed stacks (for FlameGraph / speedscope)
rperf record -o profile.collapsed ruby my_app.rb

# Human-readable text
rperf record -o profile.txt ruby my_app.rb

You can also force the format explicitly:

rperf record --format text -o profile.dat ruby my_app.rb

3.3.4 Example: Text output

rperf record -o profile.txt ruby fib.rb

The text output looks like this:

Total: 509.5ms (cpu)
Samples: 509, Frequency: 1000Hz

 Flat:
           509.5 ms 100.0%  Object#fib (fib.rb)

 Cumulative:
           509.5 ms 100.0%  Object#fib (fib.rb)
           509.5 ms 100.0%  <main> (fib.rb)

3.3.5 Example: Wall mode text output

rperf record -m wall -o wall_profile.txt ruby mixed.rb

Total: 311.8ms (wall)
Samples: 80, Frequency: 1000Hz

 Flat:
            44.1 ms  14.1%  Object#cpu_work (mixed.rb)
            13.9 ms   4.5%  Integer#times (<internal:numeric>)
             3.2 ms   1.0%  Kernel#sleep (<C method>)

 Cumulative:
           311.8 ms 100.0%  Integer#times (<internal:numeric>)
           311.8 ms 100.0%  block in <main> (mixed.rb)
           311.8 ms 100.0%  <main> (mixed.rb)
           253.8 ms  81.4%  Kernel#sleep (<C method>)
           253.8 ms  81.4%  Object#io_work (mixed.rb)
            58.0 ms  18.6%  Object#cpu_work (mixed.rb)

In wall mode, the %GVL: blocked label accounts for the dominant cost — this is the sleep time in io_work. The CPU time for cpu_work is clearly separated. GVL and GC activity appear as labels on samples rather than as stack frames, and can be filtered with pprof’s -tagfocus flag (e.g., -tagfocus=%GVL=blocked). Use rperf stat to see the time breakdown by category (CPU execution, GVL blocked, GVL wait, GC marking, GC sweeping).

3.3.6 Verbose output

The -v flag prints sampling statistics to stderr during profiling:

rperf record -v ruby my_app.rb

[Rperf] mode=cpu frequency=1000Hz
[Rperf] sampling: 98 calls, 0.11ms total, 1.1us/call avg
[Rperf] samples recorded: 904
[Rperf] top 10 by flat:
[Rperf]       53.4ms  50.1%  Object#cpu_work (-e)
[rperf]       17.0ms  15.9%  Integer#times (<internal:numeric>)
...

3.3.7 record options

rperf record [options] command [args...]

Option	Description
`-o PATH`	Output file (default: `rperf.json.gz`)
`-f HZ`	Sampling frequency in Hz (default: 1000)
`-m MODE`	`cpu` or `wall` (default: `cpu`)
`--format FMT`	`json`, `pprof`, `collapsed`, or `text` (default: auto from extension)
`-p, --print`	Print text profile to stdout (same as `--format=text --output=/dev/stdout`)
`--signal VALUE`	Timer signal (Linux only): signal number, or `false` for nanosleep thread
`--no-aggregate`	Disable sample aggregation (keep raw samples)
`--no-inherit`	Do not profile forked/spawned child processes (default: inherit)
`-v`	Print sampling statistics to stderr

3.4 rperf report

rperf report opens a profile for analysis. For JSON (.json.gz) files, it opens the rperf viewer (no external tools required). For pprof (.pb.gz) files, it wraps go tool pprof and requires Go to be installed.

# Open rperf viewer (default, JSON format)
rperf report

# Open a specific file
rperf report profile.json.gz

# Open a pprof file (requires Go)
rperf report profile.pb.gz

# Print top functions
rperf report --top

# Print pprof text summary
rperf report --text

# Output static HTML viewer (no server needed)
rperf report --html profile.json.gz > report.html

The --html flag generates a single HTML file with the profile data embedded inline. It loads d3 and d3-flamegraph from CDN, so an internet connection is required to view the flamegraph — just open the file in a browser. This is useful for sharing profiles or hosting them on static sites like GitHub Pages.

3.4.1 Example: Top and text output

Using the fib.rb profile recorded earlier:

rperf report --top rperf.json.gz

 Flat:
        577.3 ms 100.0%  Object#fib (fib.rb)
          0.0 ms   0.0%  <main> (-e)

 Cumulative:
        577.3 ms 100.0%  Object#fib (fib.rb)
        577.3 ms 100.0%  <main> (-e)

For .pb.gz files, --top and --text use go tool pprof and produce pprof-formatted output.

The default behavior (without --top or --text) opens the rperf viewer for JSON files, or an interactive web UI powered by pprof^{[Gang, 2010]} for .pb.gz files.

3.4.2 report options

Option	Description
`--top`	Print top functions by flat time
`--text`	Print text report
`--html`	Output static HTML viewer to stdout (`.json.gz` only)
(default)	Open interactive web UI in browser

3.5 rperf diff

rperf diff compares two pprof profiles, showing the difference (target - base). This is useful for measuring the impact of optimizations.

# Open diff in browser
rperf diff before.pb.gz after.pb.gz

# Print top functions by diff
rperf diff --top before.pb.gz after.pb.gz

# Print text diff
rperf diff --text before.pb.gz after.pb.gz

3.5.1 Workflow example

# Profile the baseline
rperf record -o before.pb.gz ruby my_app.rb

# Make your optimization changes...

# Profile again
rperf record -o after.pb.gz ruby my_app.rb

# Compare
rperf diff before.pb.gz after.pb.gz

3.6 Multi-process profiling

By default, all CLI subcommands (stat, exec, record) automatically profile forked and spawned Ruby child processes. Profiles from all processes are merged into a single output — the stat report shows aggregated data, and record produces a single merged file.

This is particularly useful for preforking servers like Unicorn and Puma, where the master process forks worker processes:

# Profile a Unicorn server with all its workers
rperf stat -m wall bundle exec unicorn -c config.rb

# Record a merged profile from a Puma server
rperf record -m wall -o profile.json.gz bundle exec puma

3.6.1 How it works

When child process tracking is enabled (the default), rperf:

Sets up a session directory for collecting per-process profiles
Installs a Process._fork hook that restarts profiling in forked children
Spawned Ruby children (via spawn, system, backticks) inherit RUBYLIB (pointing to rperf’s lib directory) and RUBYOPT=-rrperf, and auto-start profiling
Each child writes its profile to the session directory on exit
The root process aggregates all profiles and produces the final output

Each child process’s samples are tagged with a %pid label containing the process ID. This allows per-process filtering in the viewer (tagfocus by PID) or in pprof (-tagfocus=%pid=1234).

3.6.2 Disabling child tracking

Use --no-inherit to profile only the root process:

rperf stat --no-inherit ruby my_app.rb

3.6.3 Example: Fork with distinct work

rperf stat ruby -e '
  def parent_work = 10_000_000.times { 1 + 1 }
  def child_work  = 10_000_000.times { 1 + 1 }
  fork { child_work }
  Process.waitall
  parent_work
'

The stat output shows aggregated data from both processes, with a “Ruby processes profiled” count:

               2            [Rperf] Ruby processes profiled

3.6.4 Limitations

Daemon children: Processes that call Process.daemon and outlive the parent will have their profiles lost — the parent aggregates and cleans up the session directory at exit.
Cross-process snapshots: Rperf.snapshot only covers the current process. There is no way to snapshot all children simultaneously.
Non-Ruby children: Only Ruby child processes are profiled. Shell scripts, Python processes, etc. are not affected by rperf.
Independent rperf users: Child processes that use rperf independently (Rperf.start in their own code) will conflict with the inherited auto-start session. Such programs should clear RPERF_ENABLED from their environment before requiring rperf.

3.7 rperf help

rperf help prints the full reference documentation, including profiling modes, output formats, VM state labels, and diagnostic tips.

rperf help

This outputs detailed documentation suitable for both human reading and AI-assisted analysis.

View on GitHub

4. Ruby API

rperf provides a Ruby API for programmatic profiling. This is useful when you want to profile specific sections of code, integrate profiling into test suites, or build custom profiling workflows.

4.1 Basic usage

4.1.1 Block form (recommended)

The simplest way to use rperf is with the block form of Rperf.start. It profiles the block and returns the profiling data:

require "rperf"

data = Rperf.start(output: "profile.json.gz", frequency: 1000, mode: :cpu) do
  # code to profile
end

When output: is specified, the profile is automatically written to the file when the block finishes. The method also returns the raw data hash for further processing.

4.1.2 Example: Profiling a Fibonacci function

require "rperf"

def fib(n)
  return n if n <= 1
  fib(n - 1) + fib(n - 2)
end

data = Rperf.start(frequency: 1000, mode: :cpu) do
  fib(33)
end

Rperf.save("profile.txt", data)

Running this produces:

Total: 192.7ms (cpu)
Samples: 192, Frequency: 1000Hz

Flat:
     192.7ms 100.0%  Object#fib (example.rb)

Cumulative:
     192.7ms 100.0%  Object#fib (example.rb)
     192.7ms 100.0%  block in <main> (example.rb)
     192.7ms 100.0%  Rperf.start (lib/rperf.rb)
     192.7ms 100.0%  <main> (example.rb)

4.1.3 Manual start/stop

For cases where block form is awkward, you can manually start and stop profiling:

require "rperf"

Rperf.start(frequency: 1000, mode: :wall)

# ... code to profile ...

data = Rperf.stop

Rperf.stop returns the data hash, or nil if the profiler was not running.

4.2 Rperf.start parameters

Parameter	Type	Default	Description
`frequency:`	Integer	1000	Sampling frequency in Hz
`mode:`	Symbol	`:cpu`	`:cpu` or `:wall`
`output:`	String	`nil`	File path to write on stop
`verbose:`	Boolean	`false`	Print statistics to stderr
`format:`	Symbol	`nil`	`:json`, `:pprof`, `:collapsed`, `:text`, or `nil` (auto-detect from output extension)
`signal:`	Integer/Boolean	`nil`	Linux only: `nil` = timer signal (default), `false` = nanosleep thread, positive integer = specific RT signal number
`aggregate:`	Boolean	`true`	Aggregate identical stacks during profiling to reduce memory. `false` returns raw per-sample data
`defer:`	Boolean	`false`	Start with timer paused. Use `Rperf.profile` blocks to activate sampling for specific sections
`inherit:`	`:fork`, `true`, or `false`	`:fork`	Controls child process tracking. `:fork` tracks forked children via `Process._fork` hook. `true` also tracks spawned Ruby children (sets `RUBYLIB` and `RUBYOPT=-rrperf`). `false` disables child tracking

4.3 Rperf.stop return value

Rperf.stop returns nil if the profiler was not running. Otherwise it returns a Hash:

{
  mode: :cpu,                # or :wall
  frequency: 1000,
  trigger_count: 1234,       # number of timer triggers
  sampling_count: 1234,      # number of timer callbacks
  sampling_time_ns: 56789,   # total time spent sampling (overhead)
  detected_thread_count: 4,  # threads seen during profiling
  start_time_ns: 17740...,   # CLOCK_REALTIME epoch nanos
  duration_ns: 10000000,     # profiling duration in nanos

  # aggregate: true (default) — present only in this mode
  unique_frames: 42,         # unique frame count
  unique_stacks: 120,        # unique stack count
  aggregated_samples: [                   # Array of [frames, weight, thread_seq, label_set_id]
    [frames, weight, seq, lsi],           #   frames: [[path, label], ...] deepest-first
    ...                                   #   weight: Integer (nanoseconds)
  ],                                      #   seq: Integer (thread sequence, 1-based)
                                          #   lsi: Integer (label set ID, 0 = no labels)

  # aggregate: false — present only in this mode (C returns raw_samples;
  # Ruby's stop also builds aggregated_samples from them for encoder use)
  raw_samples: [                          # same element format as aggregated_samples
    [frames, weight, seq, lsi],
    ...
  ],

  label_sets: [{}, {request: "abc"}],     # label set table (present when labels used)
}

Each sample (in both aggregated_samples and raw_samples) has:

frames: An array of [path, label] pairs, ordered deepest-first (leaf frame at index 0)
weight: Time in nanoseconds attributed to this sample
thread_seq: Thread sequence number (1-based, assigned per profiling session)
label_set_id: Label set ID (0 = no labels). Index into the label_sets array

When aggregate: true (default), identical stacks are merged and their weights summed. The aggregated_samples array contains one entry per unique (stack, thread_seq, label_set_id) combination. When aggregate: false, the C extension returns raw_samples with every individual timer sample; Ruby’s Rperf.stop also builds aggregated_samples from them so encoders always work.

4.4 Rperf.save

Rperf.save writes profiling data to a file in any supported format:

Rperf.save("profile.json.gz", data)      # JSON format (default)
Rperf.save("profile.pb.gz", data)        # pprof format
Rperf.save("profile.collapsed", data)    # collapsed stacks
Rperf.save("profile.txt", data)          # text report

The format is auto-detected from the file extension. You can override it with the format: keyword:

Rperf.save("output.dat", data, format: :text)

4.5 Rperf.snapshot

Rperf.snapshot returns the current profiling data without stopping. Only works in aggregate mode (the default). Returns nil if not profiling.

Rperf.start(frequency: 1000)
# ... work ...
snap = Rperf.snapshot
Rperf.save("snap.json.gz", snap)
# ... more work (profiling continues) ...
data = Rperf.stop

With clear: true, the aggregated data is reset after taking the snapshot. This enables interval-based profiling where each snapshot covers only the period since the last clear:

Rperf.start(frequency: 1000)
loop do
  sleep 10
  snap = Rperf.snapshot(clear: true)
  Rperf.save("profile-#{Time.now.to_i}.pb.gz", snap)
end

4.6 Sample labels

Rperf.label attaches key-value labels to the current thread’s samples. Labels appear in pprof sample labels, enabling per-context filtering (e.g., per-request profiling). If profiling is not running, label is silently ignored — safe to call unconditionally (e.g., from Rack middleware).

4.6.1 Block form

With a block, labels are automatically restored when the block exits — even if an exception is raised:

Rperf.label(request: "abc-123", endpoint: "/api/users") do
  handle_request   # samples inside get these labels
end
# labels are restored to previous state here

4.6.2 Without block

Without a block, labels persist on the current thread until changed:

Rperf.label(request: "abc-123")
# all subsequent samples on this thread have request="abc-123"

4.6.3 Merging and deleting

New labels merge with existing ones. Set a value to nil to remove a key:

Rperf.label(request: "abc")
Rperf.label(phase: "db")       # adds phase, keeps request
Rperf.labels                   #=> {request: "abc", phase: "db"}
Rperf.label(request: nil)      # removes request
Rperf.labels                   #=> {phase: "db"}

4.6.4 Nested blocks

Each block creates a scope. When it exits, the labels are restored to the state before the block — regardless of what happened inside:

Rperf.label(request: "abc") do
  Rperf.label(phase: "db") do
    Rperf.labels  #=> {request: "abc", phase: "db"}
  end
  Rperf.labels    #=> {request: "abc"}
end
Rperf.labels      #=> {}

4.6.5 Filtering by label

Labels are preserved in all output formats. The rperf viewer (rperf report) and Rperf::Viewer provide interactive filtering:

tagfocus: Enter a regex to keep only samples matching a label value (e.g., abc-123)
tagroot: Prepend a label key as a root frame to group the flamegraph (e.g., by request)
tagleaf: Append a label key as a leaf frame
tagignore: Exclude samples matching specific label values

When using pprof format, go tool pprof -tagfocus, -tagroot, -tagleaf, and -tagignore provide equivalent functionality.

4.6.6 Reading labels

Rperf.labels returns the current thread’s labels as a Hash:

Rperf.labels  #=> {request: "abc", phase: "db"}

Returns an empty Hash if no labels are set.

4.7 Deferred start and Rperf.profile

4.7.1 Why defer?

Normally, Rperf.start immediately begins firing the sampling timer. Every timer tick interrupts the application to capture a stack trace — this is the profiling overhead. For long-running servers, you may not want to pay this cost for all code at all times. You might only care about specific endpoints, jobs, or code paths.

defer: true solves this. It sets up the profiler infrastructure (buffers, hooks, worker thread) but does not start the timer. The timer only fires inside Rperf.profile blocks. Outside those blocks, overhead is zero — no signals, no interrupts, no stack captures.

start(defer: false)     start(defer: true)
┌─────────────────┐     ┌─────────────────┐
│ start            │     │ start            │
│ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │     │                  │  ← timer not firing
│ ▓ sampling ▓▓▓▓ │     │ ┌──profile──┐   │
│ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │     │ │▓▓sampling▓│   │  ← timer active
│ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │     │ └───────────┘   │
│ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │     │                  │  ← timer not firing
│ stop             │     │ stop             │
└─────────────────┘     └─────────────────┘

This is especially useful with framework integrations: the middleware wraps each request/job in a profile block, so only actual request processing is sampled.

4.7.2 Rperf.profile

Rperf.profile activates the timer for the duration of the block and optionally applies labels. It combines timer control with label assignment in a single call.

Note

The timer is process-wide. While any thread’s profile block is active, all threads are sampled — not just the thread that called profile. Each thread’s samples carry its own labels, so they are distinguishable. When the last active profile block exits (refcount reaches 0), the timer is paused and all threads stop being sampled.

require "rperf"

Rperf.start(defer: true, mode: :wall)

# Timer activates here, samples get endpoint="/users" label
Rperf.profile(endpoint: "/users") do
  handle_request
end
# Timer paused — zero overhead

Rperf.profile(endpoint: "/health") do
  check_health
end

data = Rperf.stop
Rperf.save("profile.json.gz", data)

4.7.3 Nesting

profile blocks can be nested. The timer stays active until the outermost block exits. Labels merge just like Rperf.label:

Rperf.profile(endpoint: "/users") do
  Rperf.profile(phase: "db") do
    # sampled with {endpoint: "/users", phase: "db"}
    query_db
  end
  # sampled with {endpoint: "/users"}
  render_response
end
# timer paused again

4.7.4 Combining with Rperf.label

profile controls the timer; label only adds tags. They can be used together:

Rperf.start(defer: true, mode: :wall)

Rperf.profile(endpoint: "/users") do
  Rperf.label(phase: "auth") do
    authenticate     # sampled, labeled with endpoint + phase
  end
  Rperf.label(phase: "db") do
    query_db         # sampled, labeled with endpoint + phase
  end
end

4.7.5 Without defer

profile also works with a normal (non-deferred) start. In that case, the timer is already running and profile only applies labels — equivalent to Rperf.label with a block:

Rperf.start(mode: :wall)
Rperf.profile(endpoint: "/users") do
  handle_request   # sampled (timer was already running)
end

4.7.6 Error handling

profile raises RuntimeError if profiling is not started, and ArgumentError if no block is given. Labels and timer state are properly restored even if an exception is raised inside the block.

4.8 Practical examples

4.8.1 Profiling a web request handler

require "rperf"

class ApplicationController
  def profile_action
    data = Rperf.start(mode: :wall) do
      # Simulate a typical request
      users = User.where(active: true).limit(100)
      result = users.map { |u| serialize_user(u) }
      render json: result
    end

    Rperf.save("request_profile.txt", data)
  end
end

Using wall mode here captures not just CPU time but also database I/O and any GVL contention.

4.8.2 Comparing CPU and wall profiles

require "rperf"

def workload
  # Mix of CPU and I/O
  100.times do
    compute_something
    sleep(0.001)
  end
end

# CPU profile: shows where CPU cycles go
cpu_data = Rperf.start(mode: :cpu) { workload }
Rperf.save("cpu.txt", cpu_data)

# Wall profile: shows where wall time goes
wall_data = Rperf.start(mode: :wall) { workload }
Rperf.save("wall.txt", wall_data)

The CPU profile will focus on compute_something, while the wall profile will show the sleep calls with a %GVL: blocked label.

4.8.3 Processing samples

You can work with the sample data programmatically. By default, samples are aggregated (identical stacks merged):

require "rperf"

data = Rperf.start(mode: :cpu) { workload }
# data[:aggregated_samples] contains aggregated entries (one per unique stack)

# Find the hottest function
flat = Hash.new(0)
data[:aggregated_samples].each do |frames, weight, thread_seq|
  leaf_label = frames.first&.last  # frames[0] is the leaf
  flat[leaf_label] += weight
end

top = flat.sort_by { |_, w| -w }.first(5)
top.each do |label, weight_ns|
  puts "#{label}: #{weight_ns / 1_000_000.0}ms"
end

To get raw (non-aggregated) per-sample data, pass aggregate: false. Each timer tick produces a separate entry:

data = Rperf.start(mode: :cpu, aggregate: false) { workload }
# data[:raw_samples] contains one entry per timer sample
data[:raw_samples].each do |frames, weight, thread_seq|
  puts "thread=#{thread_seq} weight=#{weight}ns depth=#{frames.size}"
end

4.8.4 Generating collapsed stacks for FlameGraph

require "rperf"

data = Rperf.start(mode: :cpu) { workload }
Rperf.save("profile.collapsed", data)

The collapsed format is one line per unique stack, compatible with Brendan Gregg’s FlameGraph^{[Brendan, 2016]} tools and speedscope:

frame1;frame2;...;leaf weight_ns

You can generate a flame graph SVG:

flamegraph.pl profile.collapsed > flamegraph.svg

Or open the .collapsed file directly in speedscope.

View on GitHub

5. Interpreting Results

5.1 Flat vs. cumulative time

Understanding the difference between flat and cumulative time is essential for effective profiling, as described in classic profiling literature gprof^{[Susan, 1982]}.

Flat time: Time attributed directly to a function — it was the leaf (deepest frame) in the sample. High flat time means the function itself is doing expensive work.
Cumulative time: Time for all samples where the function appears anywhere in the stack. High cumulative time means the function (or something it calls) is expensive.

mermaid source

graph TD
    A["main()"] -->|calls| B["process()"]
    B -->|calls| C["compute()"]
    B -->|calls| D["format()"]

    style C fill:#ff6b6b,color:#fff
    style D fill:#ffa07a,color:#fff

If compute() has high flat time, optimize compute() itself. If process() has high cumulative but low flat time, the cost is in its children (compute() and format()).

5.2 Weight unit

All weights in rperf are in nanoseconds, regardless of profiling mode:

1,000 ns = 1 us (microsecond)
1,000,000 ns = 1 ms (millisecond)
1,000,000,000 ns = 1 s (second)

5.3 VM state labels

In addition to normal Ruby frames, rperf tracks non-CPU activity as labels (tags) on samples. The C extension records a vm_state for each sample, and Ruby converts these to labels with keys %GVL and %GC before encoding. These labels appear in label_sets alongside user labels (like endpoint), and can be filtered with the viewer’s tagfocus, tagroot, and tagleaf controls.

5.3.1 %GVL: blocked

Mode: wall only

Time the thread spent off the GVL — during I/O operations, sleep, or C extensions that release the GVL. This time is attributed to the stack captured at the SUSPENDED event (when the thread released the GVL).

High %GVL: blocked time indicates your program is I/O bound. Look at the cumulative view to find which functions are triggering the I/O.

In the viewer, use tagfocus with blocked to isolate these samples, or tagroot with %GVL to group by GVL state.

5.3.2 %GVL: wait

Mode: wall only

Time the thread spent waiting to reacquire the GVL after becoming ready. This indicates GVL contention — another thread is holding the GVL while this thread wants to run.

High %GVL: wait time means your threads are serialized on the GVL. Consider reducing GVL-holding work, using Ractors, or moving work to child processes.

In the viewer, use tagfocus with wait to isolate these samples.

5.3.3 %GC: mark

Mode: cpu and wall

Time spent in the GC marking phase. Always measured in wall time. Attributed to the stack that triggered GC.

High %GC: mark time means too many live objects. Reduce the number of long-lived allocations.

5.3.4 %GC: sweep

Mode: cpu and wall

Time spent in the GC sweeping phase. Always measured in wall time. Attributed to the stack that triggered GC.

High %GC: sweep time means too many short-lived objects. Consider reusing objects or using object pools.

5.3.5 Filtering VM state labels

In the rperf viewer or Rperf::Viewer:

tagfocus: Enter blocked, wait, mark, or sweep to isolate specific VM states
tagignore: Check %GVL: blocked to exclude off-GVL samples
tagroot: Check %GVL or %GC to group the flamegraph by VM state

When using pprof format, go tool pprof -tagfocus=%GVL=blocked profile.pb.gz and similar flags provide equivalent filtering.

5.4 Diagnosing common problems

5.4.1 High CPU usage

Mode: cpu

Look for functions with high flat CPU time. These are the functions consuming CPU cycles.

Action: Optimize the hot function (better algorithm, caching) or call it less frequently.

5.4.2 Slow requests / high latency

Mode: wall

Look for functions with high cumulative wall time.

If %GVL: blocked time is dominant: I/O or sleep is the bottleneck. Check database queries, HTTP calls, file I/O.
If %GVL: wait time is dominant: GVL contention. Reduce GVL-holding work or move to Ractors/child processes.

5.4.3 GC pressure

Mode: cpu or wall

Look for samples with %GC: mark and %GC: sweep labels.

High %GC: mark time: Too many live objects. Reduce allocations of long-lived objects.
High %GC: sweep time: Too many short-lived objects. Reuse or pool objects.

The rperf stat output also shows GC counts and allocated/freed object counts, which can help diagnose allocation-heavy code.

5.4.4 Multithreaded app slower than expected

Mode: wall

Look for %GVL: wait time across threads.

rperf stat ruby threaded_app.rb

Example output for a GVL-contended workload:

 Performance stats for 'ruby threaded_app.rb':

            89.9 ms   user
            14.0 ms   sys
            41.2 ms   real

             0.5 ms   1.2%  [Rperf] CPU execution
             9.8 ms  23.8%  [Rperf] GVL blocked (I/O, sleep)
            30.8 ms  75.0%  [Rperf] GVL wait (contention)

Here, 75% of wall time is GVL contention. The four threads are fighting for the GVL, serializing their CPU work.

5.5 Tips for effective profiling

Default frequency (1000 Hz) works well for most cases. For long-running production profiling, 100-500 Hz reduces overhead further.
Profile representative workloads, not micro-benchmarks. Profiling a real request or batch job gives more actionable results than profiling a tight loop.
Compare cpu and wall profiles to distinguish CPU-bound from I/O-bound code. If a function is hot in wall mode but not in CPU mode, it’s blocked on I/O.
Use rperf diff to measure the impact of optimizations. Profile before and after, then compare.
Use the verbose flag (-v) to check profiler overhead and see top functions immediately.

View on GitHub

6. Output Formats

rperf supports four output formats. The format is auto-detected from the file extension, or can be set explicitly with the --format flag (CLI) or format: parameter (API).

6.1 JSON (default)

The JSON format is rperf’s native format and the default output. It comes in two variants:

.json.gz (default) — gzip-compressed JSON. Compact, recommended for storage and sharing.
.json — plain text JSON. Readable by jq, text editors, and any JSON tool without decompression.

Both variants are auto-detected by Rperf.load and rperf report.

How to view:

# Open rperf viewer
rperf report profile.json.gz
rperf report profile.json      # also works

How to load in Ruby:

data = Rperf.load("profile.json.gz")  # gzip
data = Rperf.load("profile.json")     # plain text

Advantages: Native rperf format, no external tools required to view. Portable format. Can be loaded back into Ruby or processed by any JSON-capable tool. Use .json when you want to inspect or pipe through jq.

6.2 pprof

The pprof format is a gzip-compressed Protocol Buffers binary. This is the standard format used by Go’s pprof tooling.

Extension convention: .pb.gz

How to view:

# Interactive web UI (requires Go)
rperf report profile.pb.gz

# Top functions
rperf report --top profile.pb.gz

# Text report
rperf report --text profile.pb.gz

# Or use go tool pprof directly
go tool pprof -http=:8080 profile.pb.gz
go tool pprof -top profile.pb.gz

You can also import pprof files into speedscope via its web interface.

Advantages: Standard format supported by a wide ecosystem of tools. Supports diff comparison between two profiles. Interactive exploration with flame graphs, call graphs, and source annotations.

6.2.1 Embedded metadata

rperf embeds the following metadata in each pprof profile:

Field	Description
`comment`	rperf version, profiling mode, frequency, Ruby version
`time_nanos`	Profile collection start time (epoch nanoseconds)
`duration_nanos`	Profile duration (nanoseconds)
`doc_url`	Link to rperf documentation

View comments with: go tool pprof -comments profile.pb.gz

6.2.2 Sample labels

Each sample carries a thread_seq numeric label — a thread sequence number (1-based) assigned when rperf first sees each thread during a profiling session. When Rperf.label is used, custom key-value string labels are also attached to samples.

These labels can be filtered with go tool pprof:

go tool pprof -tagroot=thread_seq profile.pb.gz     # group by thread
go tool pprof -tagfocus=request=abc-123 profile.pb.gz  # filter by label
go tool pprof -tagroot=request profile.pb.gz         # group by label

Note

The rperf viewer (JSON format) supports the same tag operations — tagfocus, tagignore, tagroot, tagleaf — without requiring Go. See In-browser viewer for details.

6.3 Collapsed stacks

The collapsed stacks format is a plain text format with one line per unique stack trace. Each line contains a semicolon-separated stack (bottom-to-top) followed by a space and the weight in nanoseconds.

Extension convention: .collapsed

Format:

bottom_frame;...;top_frame weight_ns

Example output:

<main>;Integer#times;block in <main>;Object#cpu_work;Integer#times;Object#cpu_work 53419170
<main>;Integer#times;block in <main>;Object#cpu_work;Integer#times 16962309
<main>;Integer#times;block in <main>;Object#io_work;Kernel#sleep 2335151

How to use:

# Generate a FlameGraph SVG
rperf record -o profile.collapsed ruby my_app.rb
flamegraph.pl profile.collapsed > flamegraph.svg

# Open in speedscope (drag-and-drop the .collapsed file)
# macOS: open https://www.speedscope.app/
# Linux: xdg-open https://www.speedscope.app/

Advantages: Simple text format, easy to process with command-line tools. Compatible with Brendan Gregg’s FlameGraph^{[Brendan, 2016]} tools and speedscope.

6.3.1 Parsing collapsed stacks programmatically

File.readlines("profile.collapsed").each do |line|
  stack, weight = line.rpartition(" ").then { |s, _, w| [s, w.to_i] }
  frames = stack.split(";")
  # frames[0] is bottom (main), frames[-1] is leaf (hot)
  puts "#{frames.last}: #{weight / 1_000_000.0}ms"
end

6.4 Text report

The text format is a human-readable (and AI-readable) report showing flat and cumulative top-N tables.

Extension convention: .txt

Example output:

Total: 509.5ms (cpu)
Samples: 509, Frequency: 1000Hz

 Flat:
           509.5 ms 100.0%  Object#fib (fib.rb)

 Cumulative:
           509.5 ms 100.0%  Object#fib (fib.rb)
           509.5 ms 100.0%  <main> (fib.rb)

Sections:

Header: Total profiled time, sample count, and frequency
Flat table: Functions sorted by self time (the function was the leaf/deepest frame)
Cumulative table: Functions sorted by total time (the function appeared anywhere in the stack)

Advantages: No tooling required — readable with cat. Top 50 entries per table by default. Good for quick analysis, sharing in issue reports, or feeding to AI assistants.

6.5 Format comparison

Feature	json	pprof	collapsed	text
File size	Medium (json + gzip)	Small (binary + gzip)	Medium (text)	Small (text)
Flame graph	Yes (via rperf viewer)	Yes (via pprof web UI)	Yes (via flamegraph.pl)	No
Call graph	No	Yes	No	No
Diff comparison	Yes (`rperf diff`, requires Go)	Yes (`rperf diff`)	No	No
No tools needed	Yes	No (requires Go)	No (requires flamegraph.pl)	Yes
Load back into Ruby	Yes (`Rperf.load`)	No	No	No
Programmatic parsing	Easy (JSON)	Complex (protobuf)	Simple	Simple
AI-friendly	Yes	No	Yes	Yes

6.6 Auto-detection rules

File extension	Format
`.json.gz`	JSON (default)
`.pb.gz`	pprof
`.collapsed`	Collapsed stacks
`.txt`	Text report

The default output file (rperf.json.gz) uses JSON format.

View on GitHub

7. Framework Integration

rperf provides optional integrations that automatically profile and label samples with context from web frameworks and job processors. They use Rperf.profile, which both activates the timer and sets labels. This works seamlessly with start(defer: true) — the timer only fires while at least one profile block is active. Start profiling separately (e.g., in an initializer).

Note

The timer activated by Rperf.profile is process-wide. While any thread’s profile block is active, all threads are sampled. Each thread’s samples carry its own labels, so they remain distinguishable. This design also makes GVL contention and background processing visible within the profiled interval.

7.1 Rack middleware

Rperf::RackMiddleware profiles each request and labels it with its endpoint (METHOD /path). By default, dynamic segments are normalized (numeric IDs → :id, UUIDs → :uuid) to keep label cardinality low. Use label: :raw for the original PATH_INFO, or pass a custom label: proc for framework-specific route normalization.

7.1.1 Rails

# config/initializers/rperf.rb
require "rperf/rack"
require "rperf/viewer"

Rperf.start(defer: true, mode: :wall, frequency: 99)

Rails.application.config.middleware.use Rperf::Viewer
Rails.application.config.middleware.use Rperf::RackMiddleware

# Take a snapshot every 60 minutes
Thread.new do
  loop do
    sleep 60 * 60
    Rperf::Viewer.instance&.take_snapshot!
  end
end

Visit /rperf/ to open the viewer. Use tagfocus to filter by endpoint (e.g., GET /api/users), or tagroot to group the flamegraph by endpoint.

7.1.2 Sinatra

require "sinatra"
require "rperf/rack"
require "rperf/viewer"

Rperf.start(defer: true, mode: :wall, frequency: 99)
use Rperf::Viewer
use Rperf::RackMiddleware

Thread.new { loop { sleep 3600; Rperf::Viewer.instance&.take_snapshot! } }

get "/hello" do
  "Hello, world!"
end

7.1.3 Customizing the label key

By default the middleware uses the label key :endpoint. You can change it:

use Rperf::RackMiddleware, label_key: :route

7.2 Active Job

Rperf::ActiveJobMiddleware profiles each job and labels it with its class name (e.g., SendEmailJob). Works with any Active Job backend — Sidekiq, GoodJob, Solid Queue, etc.

Start profiling in an initializer, then include it in your base job class:

# config/initializers/rperf.rb
Rperf.start(defer: true, mode: :wall, frequency: 99)

# app/jobs/application_job.rb
class ApplicationJob < ActiveJob::Base
  include Rperf::ActiveJobMiddleware
end

All subclasses inherit the label automatically:

class SendEmailJob < ApplicationJob
  def perform(user)
    # samples here get job="SendEmailJob"
  end
end

In the viewer, use tagfocus to filter by job name, or tagroot to group the flamegraph by job class.

7.3 Sidekiq

Rperf::SidekiqMiddleware profiles each job and labels it with its worker class name. This covers both Active Job-backed workers and plain Sidekiq workers.

Register it as a Sidekiq server middleware:

# config/initializers/sidekiq.rb
Rperf.start(defer: true, mode: :wall, frequency: 99)

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add Rperf::SidekiqMiddleware
  end
end

Note

If you use Active Job with Sidekiq, choose one or the other — using both will result in duplicate labels. The Sidekiq middleware is more general (covers non-Active Job workers too).

7.4 In-browser viewer

Rperf::Viewer is a Rack middleware that serves an interactive profiling UI at a configurable mount path. It stores snapshots in memory and renders them in the browser using d3-flame-graph. No gem dependencies or build tools are required. The viewer loads d3.js and d3-flame-graph from CDNs (cdnjs.cloudflare.com, cdn.jsdelivr.net) at runtime, so an internet connection is needed on first access.

Warning

Rperf::Viewer has no built-in authentication. Profiling data — including stack traces and label values — is exposed to anyone who can access the endpoint. In production, always restrict access using your framework’s authentication (see “Access control” below).

7.4.1 Setup

# config.ru (or Rails initializer)
require "rperf/viewer"
require "rperf/rack"

Rperf.start(defer: true, mode: :wall, frequency: 999)

use Rperf::Viewer                           # serves UI at /rperf/
use Rperf::RackMiddleware                   # labels each request
run MyApp

# Take a snapshot every 60 minutes
Thread.new do
  loop do
    sleep 60 * 60
    Rperf::Viewer.instance&.take_snapshot!
  end
end

Visit /rperf/ in a browser after snapshots have been taken.

7.4.2 Options

Option	Default	Description
`path:`	`"/rperf"`	URL prefix for the viewer
`max_snapshots:`	`24`	Maximum snapshots kept in memory (oldest discarded)

7.4.3 Taking snapshots

# Programmatically (e.g., from a controller, background thread, or console)
Rperf::Viewer.instance.take_snapshot!

# Or add pre-taken data
data = Rperf.snapshot(clear: true)
Rperf::Viewer.instance.add_snapshot(data)

7.4.4 UI tabs

The viewer has three tabs:

Flamegraph — Interactive flamegraph powered by d3-flame-graph. Click a frame to zoom in, click the root to zoom out.
Top — Flat and cumulative weight table (top 50 functions). Click column headers (Flat, Cum, Function) to sort.
Tags — Shows each label key with a breakdown of values by weight and percentage. Click a value row to set tagfocus and jump to the Flamegraph tab.

7.4.5 Filtering

The controls bar at the top provides four filters:

tagfocus — Text input. Enter a regex to keep only samples whose label values match. Press Enter to apply.
tagignore — Dropdown with checkboxes. Check items to exclude matching samples. Each label key also has a (none) entry to exclude samples that do not have that key — useful for filtering out background threads that have no endpoint label.
tagroot — Dropdown with checkboxes for label keys. Checked keys are prepended as root frames in the flamegraph (e.g., [endpoint: GET /users] appears at the top of the stack).
tagleaf — Same as tagroot, but appended as leaf frames.

Label keys are sorted alphabetically. The %-prefixed VM state keys (%GC, %GVL) appear first, making it easy to add GC or GVL state as leaf/root frames.

7.4.6 Access control

Rperf::Viewer has no built-in authentication. Restrict access using your framework’s existing mechanisms:

# Rails: route constraint (admin-only)
# config/routes.rb
require "rperf/viewer"
constraints ->(req) { req.session[:admin] } do
  mount Rperf::Viewer.new(nil, path: ""), at: "/rperf"
end

7.5 On-demand profiling with Rperf.profile

If you want to profile only specific endpoints or jobs — with zero overhead elsewhere — use Rperf.start(defer: true) and Rperf.profile:

# config/initializers/rperf.rb
require "rperf"

Rperf.start(defer: true, mode: :wall, frequency: 99)

Then wrap specific code paths with profile:

class UsersController < ApplicationController
  def index
    Rperf.profile(endpoint: "GET /users") do
      @users = User.all
    end
  end
end

Outside profile blocks, the timer is paused and overhead is zero. Inside profile blocks, all threads in the process are sampled (each thread’s samples carry its own labels).

Combine with Rperf::Viewer to browse results, or save to a file with Rperf.snapshot + Rperf.save for offline analysis with rperf report.

7.6 Full Rails example

A typical Rails setup with both web and job profiling:

# config/initializers/rperf.rb
require "rperf/rack"
require "rperf/viewer"
require "rperf/sidekiq"

Rperf.start(defer: true, mode: :wall, frequency: 99)

# Viewer and request labeling
Rails.application.config.middleware.use Rperf::Viewer
Rails.application.config.middleware.use Rperf::RackMiddleware

# Label Sidekiq jobs
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add Rperf::SidekiqMiddleware
  end
end

# Take a snapshot every 60 minutes
Thread.new do
  loop do
    sleep 60 * 60
    Rperf::Viewer.instance&.take_snapshot!
  end
end

Visit /rperf/ and use tagroot to group the flamegraph by endpoint or job class.

View on GitHub

Deep Dive

This part explores how to read and act on profiling results, and explains how rperf works under the hood.

8. Architecture Overview

This chapter describes rperf’s high-level architecture and the core data structures that underpin profiling. Understanding these internals helps you interpret edge cases in profiling results and appreciate the design trade-offs.

8.1 System diagram

rperf consists of a C extension and a Ruby wrapper:

mermaid source

graph TD
    subgraph "C Extension (ext/rperf/rperf.c)"
        Timer["Timer<br/>(signal or pthread_cond_timedwait)"]
        PJ["rb_postponed_job_trigger"]
        Sample["Sample callback<br/>(rperf_sample_job)"]
        ThreadHook["Thread event hook<br/>(SUSPENDED/READY/RESUMED/EXITED)"]
        GCHook["GC event hook<br/>(ENTER/EXIT/START/END_MARK/END_SWEEP)"]
        BufA["Sample buffer A"]
        BufB["Sample buffer B"]
        Worker["Worker thread<br/>(timer + aggregation)"]
        FrameTable["Frame table<br/>(VALUE → frame_id)"]
        AggTable["Aggregation table<br/>(stack → weight)"]
    end

    subgraph "Ruby (lib/rperf.rb)"
        API["Rperf.start / Rperf.stop"]
        Encoders["Encoders<br/>(PProf / Collapsed / Text)"]
        StatOut["stat output"]
    end

    Timer -->|triggers| PJ
    PJ -->|at safepoint| Sample
    Sample -->|writes| BufA
    ThreadHook -->|writes| BufA
    GCHook -->|writes| BufA
    BufA -->|swap when full| BufB
    BufB -->|aggregated by| Worker
    Worker -->|inserts| FrameTable
    Worker -->|inserts| AggTable
    API -->|calls| Sample
    AggTable -->|read at stop| Encoders
    FrameTable -->|resolved at stop| Encoders
    Encoders --> StatOut

The C extension handles all performance-critical operations: timer management, sample recording, and aggregation. The Ruby layer provides the user-facing API, output encoding, and statistics formatting.

8.2 Global profiler state

rperf uses a single global rperf_profiler_t struct. Only one profiling session can be active at a time (single session limitation). The struct holds:

Timer configuration (frequency, mode, signal number)
Double-buffered sample storage (two sample buffers + frame pools, swapped when full)
Frame table (VALUE → uint32 frame_id, deduplicates frame references)
Aggregation table (unique stack → accumulated weight)
Worker thread handle (unified timer + aggregation thread)
Thread-specific key for per-thread data
GC phase tracking state
Sampling overhead counters

8.3 Per-thread data

Each thread gets a rperf_thread_data_t struct stored via Ruby’s thread-specific data API (rb_internal_thread_specific_set). This tracks:

prev_time_ns: Previous time reading (for computing weight)
prev_wall_ns: Previous wall time reading
suspended_at_ns: Wall timestamp when thread was suspended
ready_at_ns: Wall timestamp when thread became ready
(Backtrace is not saved at SUSPENDED; it is re-captured at RESUMED)
label_set_id: Current label set ID (0 = no labels)

Thread data is created lazily on first encounter and freed on the EXITED event or at profiler stop.

8.4 GC safety

Frame VALUEs must be protected from garbage collection. rperf wraps the profiler struct in a TypedData object with a custom dmark function that marks three regions:

Both frame pools (active and standby buffers)
Frame table keys (unique frame VALUEs)

The frame table keys array starts at 4,096 entries and grows by 2× when full. Growth allocates a new array, copies existing data, and swaps the pointer atomically (memory_order_release). The old array is kept alive until stop to prevent use-after-free if GC’s mark phase is reading it concurrently. The dmark function loads the keys pointer with memory_order_acquire and the count with memory_order_acquire to ensure a consistent view.

8.5 Fork safety and multi-process profiling

8.5.1 Fork safety (C level)

rperf registers a pthread_atfork child handler that silently stops profiling in the forked child process (fork safety):

Clears the timer/signal state
Re-initializes mutex/condvar (may have been locked by parent’s worker thread at fork time)
Removes event hooks (thread events, GC events)
Frees sample buffers, frame table, and aggregation table
Resets GC state, stats, and profile refcount

The parent process continues profiling unaffected. The child can start a fresh profiling session if needed.

8.5.2 Multi-process profiling (Ruby level)

When multi-process profiling is enabled (the default for CLI usage), rperf uses Ruby’s Process._fork hook (available since Ruby 3.1) to automatically restart profiling in forked child processes. The flow is:

Before fork: The _fork hook creates a session directory (once, on first fork) for collecting per-process profiles. The root process’s output is redirected to the session directory.
In child: After pthread_atfork cleans up C state, _restart_in_child starts a new profiling session with output directed to the session directory. A %pid label is set for per-process identification.
On child exit: The inherited at_exit hook calls Rperf.stop, writing the child’s profile to the session directory as a .json.gz file.
On root exit: The root writes its own profile, then aggregates all .json.gz files in the session directory into a single merged output (stat report or file). The session directory is then deleted.

For spawned children (spawn, system), the child inherits RUBYLIB (pointing to rperf’s lib directory) and RUBYOPT=-rrperf, plus environment variables (RPERF_SESSION_DIR, RPERF_ROOT_PROCESS). When the child Ruby process loads rperf, the auto-start block detects it is not the root process and writes its profile directly to the session directory.

8.5.3 Session directory

The session directory is created under $RPERF_TMPDIR, $XDG_RUNTIME_DIR, or the system temp directory (in that priority order), inside a per-user subdirectory (rperf-$UID/, mode 0700). Each session gets a unique directory name: rperf-$PID-$RANDOM.

Stale session directories from crashed processes are cleaned up automatically: when a new CLI session starts, it checks for session directories whose root PID is no longer alive and removes them.

8.5.4 Label merging

When aggregating profiles from multiple processes, label sets are remapped to ensure consistency. If two child processes use the same label values (e.g., both have endpoint: "GET /users"), they share the same label set ID in the merged output. The %pid label ensures samples from different processes remain distinguishable even after merging.

8.6 Known limitations

8.6.1 Running EC race

There is a known race condition in the Ruby VM where rb_postponed_job_trigger from the timer thread may set the interrupt flag on the wrong thread’s execution context. This happens when a new thread’s native thread starts before acquiring the GVL. The result is that timer samples may miss threads doing C busy-wait, with their CPU time leaking into the next SUSPENDED event’s stack.

This is a Ruby VM bug, not a rperf bug, and affects all postponed-job-based profilers.

However, rperf mitigates this on Linux by using SIGEV_THREAD_ID to deliver the timer signal exclusively to the dedicated worker thread, rather than to an arbitrary Ruby thread. Since the worker thread calls rb_postponed_job_trigger itself — rather than having a signal handler fire on a Ruby thread that may have a stale running_ec — the race window is significantly narrower. The nanosleep fallback (macOS and signal: false) behaves similarly, as the worker thread is the sole caller of rb_postponed_job_trigger.

8.6.2 Single session

Only one profiling session can be active at a time due to the global profiler state. Calling Rperf.start while already profiling is not supported.

8.6.3 Method-level granularity

rperf profiles at the method level, not the line level. Frame labels use rb_profile_frame_full_label for qualified names (e.g., Integer#times, MyClass#method_name). Line numbers are not included.

View on GitHub

9. Sampling

This chapter explains how rperf collects samples: the timer mechanism that triggers sampling, the sampling callback itself, and the event hooks that capture GVL and GC activity.

9.1 Timer and worker thread

rperf uses a single worker thread that handles both timer triggering and periodic sample aggregation. The timer mechanism depends on the platform.

9.1.1 Linux: Signal-based timer (default)

On Linux, rperf uses timer_create with SIGEV_THREAD_ID to deliver a real-time signal (default: SIGRTMIN+8) exclusively to the worker thread at the configured frequency. The signal handler calls rb_postponed_job_trigger to schedule the sampling callback.

Using SIGEV_THREAD_ID ensures the timer signal only targets the worker thread, preventing it from interrupting nanosleep, read, or other blocking syscalls in Ruby threads (e.g., inside rb_thread_call_without_gvl).

sequenceDiagram participant Kernel as Linux Kernel participant Worker as Worker Thread participant VM as Ruby VM participant Job as Postponed Job Kernel->>Worker: SIGRTMIN+8 (every 1ms at 1000Hz) Worker->>VM: rb_postponed_job_trigger() Note over VM: Continues until safepoint... VM->>Job: rperf_sample_job() Job->>Job: rb_profile_frames() + record sample Note over Worker: Also aggregates samples
when swap_ready is set

mermaid source

sequenceDiagram
    participant Kernel as Linux Kernel
    participant Worker as Worker Thread
    participant VM as Ruby VM
    participant Job as Postponed Job

    Kernel->>Worker: SIGRTMIN+8 (every 1ms at 1000Hz)
    Worker->>VM: rb_postponed_job_trigger()
    Note over VM: Continues until safepoint...
    VM->>Job: rperf_sample_job()
    Job->>Job: rb_profile_frames() + record sample
    Note over Worker: Also aggregates samples<br/>when swap_ready is set

This approach provides precise timing (~1000us median interval at 1000Hz).

9.1.2 Fallback: pthread_cond_timedwait

On macOS, or when signal: false is set on Linux, the worker thread uses pthread_cond_timedwait with an absolute deadline as the timer:

Timeout (deadline reached): trigger rb_postponed_job_trigger and advance deadline
Signal (swap_ready set): aggregate the standby buffer immediately

The deadline-based approach avoids drift when aggregation takes time. This mode has ~100us timing imprecision compared to the signal-based approach.

9.2 Sampling callback

When the postponed job fires, rperf_sample_job runs on whatever thread currently holds the GVL. It only samples that thread using rb_thread_current().

This is a deliberate design choice:

rb_profile_frames can only capture the current thread’s stack
There’s no need to iterate Thread.list — combined with GVL event hooks, rperf gets broad coverage of all threads (though a known race in the Ruby VM can cause occasional missed samples)

The sampling callback:

Gets or creates per-thread data (rperf_thread_data_t)
Reads the current clock (CLOCK_THREAD_CPUTIME_ID for CPU mode, CLOCK_MONOTONIC for wall mode)
Computes weight as time_now - prev_time
Captures the backtrace with rb_profile_frames directly into the frame pool
Records the sample (frame start index, depth, weight, type)
Updates prev_time

9.3 GVL event tracking (wall mode)

In wall mode, rperf hooks into Ruby’s thread event API to track GVL transitions. This captures time that sampling alone would miss — time spent off the GVL.

stateDiagram-v2 [*] --> Running: Thread holds GVL Running --> Suspended: SUSPENDED event
(record sample + save timestamp) Suspended --> Ready: READY event
(record wall timestamp) Ready --> Running: RESUMED event
(record GVL blocked + wait samples) Running --> [*]: EXITED event
(cleanup thread data)

mermaid source

stateDiagram-v2
    [*] --> Running: Thread holds GVL
    Running --> Suspended: SUSPENDED event<br/>(record sample + save timestamp)
    Suspended --> Ready: READY event<br/>(record wall timestamp)
    Ready --> Running: RESUMED event<br/>(record GVL blocked + wait samples)
    Running --> [*]: EXITED event<br/>(cleanup thread data)

9.3.1 SUSPENDED

When a thread releases the GVL (e.g., before I/O):

Capture the current backtrace into the frame pool
Record a normal sample (time since last sample)
Save the wall timestamp for later use (backtrace is re-captured at RESUMED)

9.3.2 READY

When a thread becomes ready to run (e.g., I/O completed):

Record the wall timestamp (no GVL needed — only simple C operations)

9.3.3 RESUMED

When a thread reacquires the GVL:

Capture the current backtrace into the frame pool (captured here, not at SUSPENDED, to avoid buffer mismatch after a double-buffer swap; the Ruby stack is unchanged while off-GVL)
Record a sample with vm_state = GVL_BLOCKED: weight = ready_at - suspended_at (off-GVL time)
Record a sample with vm_state = GVL_WAIT: weight = resumed_at - ready_at (GVL contention time)

These vm_state values are later converted to labels (%GVL: blocked and %GVL: wait) by the Ruby layer at encoding time. This way, off-GVL time and GVL contention are accurately attributed to the code that triggered them, even though no timer-based sampling can occur while the thread is off the GVL.

9.4 GC phase tracking

rperf hooks into Ruby’s internal GC events to track garbage collection time:

Event	Action
`GC_START`	Set phase to marking
`GC_END_MARK`	Set phase to sweeping
`GC_END_SWEEP`	Clear phase
`GC_ENTER`	Capture wall timestamp and thread info
`GC_EXIT`	Capture backtrace, record sample with `vm_state = GC_MARK` or `vm_state = GC_SWEEP`

GC samples always use wall time regardless of the profiling mode, because GC time is real elapsed time that affects application latency.

9.5 Deferred string resolution

During sampling, rperf stores raw frame VALUEs (Ruby internal object references) in the frame pool — not strings. This deferred string resolution keeps the hot path allocation-free and fast.

String resolution happens at stop time:

Rperf.stop calls _c_stop
The frame table maps each unique frame VALUE to a [path, label] string pair via rb_profile_frame_full_label and rb_profile_frame_path
These resolved strings are passed to the Ruby encoders

This means sampling only writes integers (VALUE pointers and timestamps) to pre-allocated buffers. No Ruby objects are created, no GC pressure is added during profiling.

View on GitHub

10. Aggregation

This chapter explains how rperf processes raw samples into compact, deduplicated data structures. Aggregation keeps memory usage bounded during long profiling sessions.

10.1 Overview

By default (aggregate: true), rperf periodically aggregates samples in the background. Raw samples — each containing a stack trace, weight, and metadata — are merged into two hash tables: a frame table and an aggregation table. Identical stacks are combined by summing their weights.

10.2 Double buffering

Two sample buffers alternate roles to allow concurrent sampling and aggregation:

The active buffer receives new samples from sampling callbacks
When the active buffer reaches 10,000 samples, the buffers swap
The standby buffer is processed by the worker thread in the background

graph LR subgraph "Active" A["Sample buffer A
(receiving samples)"] end subgraph "Standby" B["Sample buffer B
(being aggregated)"] end A -->|"swap when
10K samples"| B B -->|"swap back
after aggregation"| A

mermaid source

graph LR
    subgraph "Active"
        A["Sample buffer A<br/>(receiving samples)"]
    end
    subgraph "Standby"
        B["Sample buffer B<br/>(being aggregated)"]
    end
    A -->|"swap when<br/>10K samples"| B
    B -->|"swap back<br/>after aggregation"| A

If the worker thread hasn’t finished processing the standby buffer, the swap is skipped and the active buffer continues growing (fallback to unbounded mode).

Each buffer has its own frame pool. The active buffer’s frame pool receives raw VALUE references from rb_profile_frames. When buffers swap, the standby buffer’s frame pool is processed and then cleared for reuse.

10.3 Frame table

The frame table (VALUE → uint32_t frame_id) deduplicates frame references. Each unique frame VALUE gets a small integer ID.

Keys: raw frame VALUEs (the only GC mark target for aggregated data)
Values: compact uint32_t IDs used in the aggregation table
Initial capacity: 4,096 entries, grows by 2× when full
Growth uses atomic pointer swaps for GC dmark safety (see GC safety)

Using uint32_t frame IDs instead of full VALUEs halves the memory needed for stack storage in the aggregation table.

10.4 Aggregation table

The aggregation table merges identical stacks by summing their weights:

Key: (frame_ids[], thread_seq, label_set_id, vm_state) — the stack (as frame IDs), thread sequence number, label set, and VM state
Value: accumulated weight in nanoseconds

Frame IDs are stored contiguously in a separate stack pool. Each aggregation table entry points into this pool (start index + depth).

The vm_state field records the VM activity at sample time (e.g., GVL_BLOCKED, GVL_WAIT, GC_MARK, GC_SWEEP, or NORMAL). It is part of the aggregation key so that samples with the same stack but different VM states are kept separate. At encoding time, the Ruby layer converts vm_state to labels (%GVL and %GC) via merge_vm_state_labels!.

The label_set_id is also part of the key, so samples with the same stack but different user labels are kept separate.

10.5 Memory usage

Buffer	Initial size	Element size	Initial memory
Sample buffer (×2)	16,384	32B	512KB × 2
Frame pool (×2)	131,072	8B (VALUE)	1MB × 2
Frame table keys	4,096	8B (VALUE)	32KB
Frame table buckets	8,192	4B (uint32)	32KB
Agg table buckets	2,048	28B	56KB
Stack pool	4,096	4B (uint32)	16KB

Total: ~3.6MB with aggregate: true, ~1.5MB with aggregate: false (single buffer only). Frame table and aggregation table grow dynamically as needed.

With aggregation enabled, memory usage stays roughly constant regardless of profiling duration — only the number of unique stacks matters, not the total number of samples collected.

View on GitHub

11. Output Encoding

This chapter describes how rperf converts its internal aggregated data into output formats. Encoding runs once at stop time, so it is not performance-critical.

11.1 From aggregated data to output

When Rperf.stop is called, the C extension returns the aggregated data to Ruby:

Frame table: An array of [path, label] string pairs, indexed by frame ID. Strings are resolved from raw VALUEs via rb_profile_frame_full_label and rb_profile_frame_path.
Aggregation table: An array of [frame_ids, weight, thread_seq, label_set_id, vm_state] entries.
Label sets: An array of frozen Hashes mapping label keys to values.

Before passing the data to encoders, the Ruby layer calls merge_vm_state_labels! to convert each sample’s vm_state into labels: vm_state = GVL_BLOCKED becomes {"%GVL" => "blocked"}, GVL_WAIT becomes {"%GVL" => "wait"}, GC_MARK becomes {"%GC" => "mark"}, and GC_SWEEP becomes {"%GC" => "sweep"}. These labels are merged into the sample’s existing label_set_id, so they appear alongside user labels like endpoint in the output.

The Ruby encoders consume these arrays to produce the final output.

11.2 JSON format (default)

rperf’s native output format is JSON. The default extension is .json.gz (gzip-compressed); .json writes plain text JSON. Both are auto-detected by Rperf.load. The JSON file preserves all internal data: aggregated samples with frame stacks, weights, thread sequence numbers, label set IDs, and the full label sets array. It also includes profiling metadata (mode, frequency, duration, sample counts).

JSON is the recommended format for use with the rperf viewer (rperf report opens it directly in the browser) and for programmatic analysis in Ruby. Unlike pprof, no external tools are needed. Use .json when you want to inspect the output with jq or a text editor.

11.3 pprof encoder

rperf encodes the pprof^{[Gang, 2010]} protobuf format entirely in Ruby, with no protobuf gem dependency. The encoder in Rperf::PProf.encode:

Builds a string table (index 0 is always the empty string)
Converts string frames to index frames and merges identical stacks
Builds location and function tables
Encodes the Profile protobuf message field by field
Compresses the result with gzip

This hand-written encoder is simple (~100 lines) and handles only the subset of the pprof protobuf schema that rperf needs.

11.3.1 Embedded metadata

Each pprof profile includes metadata as comments: rperf version, profiling mode, frequency, and Ruby version. The time_nanos and duration_nanos fields record when and how long the profile was collected.

11.3.2 Thread sequence labels

Every sample carries a thread_seq numeric label — a 1-based thread sequence number assigned when rperf first sees each thread. This allows grouping flame graphs by thread in pprof tools.

11.4 Sample labels

Rperf.label enables per-context annotation of samples. The implementation is split between Ruby and C to keep the hot path minimal:

Ruby side: Manages label sets as frozen Hash objects in an array (@label_set_table). A deduplication index (@label_set_index) maps each unique Hash to an integer ID. Rperf.label(key: value) merges the new labels with the current set, interns the result, and passes the integer ID to C.
C side: Each rperf_thread_data_t stores a label_set_id (integer). When a sample is recorded, the current thread’s label_set_id is copied into the sample — a single integer, adding zero allocation overhead to the hot path. The aggregation table includes label_set_id in its hash key, so identical stacks with different labels remain separate entries.
Encoding: At encode time, encoders read the label_sets array. The JSON encoder preserves labels as-is; the PProf encoder writes each label as a pprof Sample.Label (key-value string pair). The rperf viewer’s tagfocus, tagroot, tagleaf, and tagignore controls can filter and group by label in both formats.

11.5 Collapsed stacks encoder

The collapsed stacks encoder produces one line per unique stack: frames joined by semicolons (bottom-to-top), followed by a space and the weight in nanoseconds. This format is consumed by FlameGraph tools and speedscope.

When label sets are present, labels are not included in the collapsed output — use pprof format for label-aware analysis.

11.6 Text report encoder

The text report encoder produces a human-readable summary with:

Header: Total profiled time, sample count, and frequency
Flat table: Top 50 functions sorted by self time (the function was the leaf/deepest frame)
Cumulative table: Top 50 functions sorted by total time (the function appeared anywhere in the stack)

This format requires no external tools and is suitable for quick analysis, issue reports, or AI-assisted analysis.

View on GitHub

A. Quick Reference

A.1 CLI cheat sheet

# Quick performance overview
rperf stat ruby my_app.rb

# Performance overview with profile tables
rperf stat --report ruby my_app.rb

# Full performance report (same as stat --report)
rperf exec ruby my_app.rb

# Record to default file (rperf.json.gz, json format, cpu mode)
rperf record ruby my_app.rb

# Record with options
rperf record -m wall -f 500 -o profile.pb.gz ruby my_app.rb

# Record and print text profile to stdout
rperf record -p ruby my_app.rb

# Record to text format
rperf record -o profile.txt ruby my_app.rb

# Record to collapsed stacks
rperf record -o profile.collapsed ruby my_app.rb

# Profile a preforking server (all workers)
rperf stat -m wall bundle exec unicorn

# Disable child process tracking
rperf stat --no-inherit ruby my_app.rb

# View profile in rperf viewer
rperf report

# Print top functions
rperf report --top profile.pb.gz

# Compare two profiles
rperf diff before.pb.gz after.pb.gz

# Full documentation
rperf help

A.2 Ruby API cheat sheet

require "rperf"

# Block form
data = Rperf.start(output: "profile.pb.gz", mode: :cpu) do
  # code to profile
end

# Manual form
Rperf.start(frequency: 1000, mode: :wall)
# ...
data = Rperf.stop

# Save to file
Rperf.save("profile.pb.gz", data)
Rperf.save("profile.collapsed", data)
Rperf.save("profile.txt", data)

# Snapshot (read data without stopping)
snap = Rperf.snapshot
Rperf.save("snap.pb.gz", snap)
snap = Rperf.snapshot(clear: true)  # reset after snapshot

# Deferred start + targeted profiling
Rperf.start(defer: true, mode: :wall)
Rperf.profile(endpoint: "/users") do
  # only this block is sampled
end
data = Rperf.stop

# Labels (annotate samples with context)
Rperf.label(request: "abc") do
  # samples inside get request="abc" label
end
Rperf.labels       # get current labels

# Rack middleware (labels requests by endpoint)
require "rperf/rack"
use Rperf::RackMiddleware                    # Rails: config.middleware.use Rperf::RackMiddleware
use Rperf::RackMiddleware, label_key: :route # custom label key

# Active Job (labels jobs by class name)
require "rperf/active_job"
class ApplicationJob < ActiveJob::Base
  include Rperf::ActiveJobMiddleware
end

# Sidekiq (labels jobs by worker class name)
require "rperf/sidekiq"
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add Rperf::SidekiqMiddleware
  end
end

A.3 Environment variables

These are used internally by the CLI to configure the auto-started profiler:

Variable	Values	Description
`RPERF_ENABLED`	`1`	Enable auto-start on require
`RPERF_OUTPUT`	path	Output file path
`RPERF_FREQUENCY`	integer	Sampling frequency in Hz
`RPERF_MODE`	`cpu`, `wall`	Profiling mode
`RPERF_FORMAT`	`json`, `pprof`, `collapsed`, `text`	Output format
`RPERF_VERBOSE`	`1`	Print statistics to stderr
`RPERF_STAT`	`1`	Enable stat mode output
`RPERF_STAT_REPORT`	`1`	Include profile tables in stat output
`RPERF_STAT_COMMAND`	string	Command string shown in stat output header
`RPERF_AGGREGATE`	`0`	Disable sample aggregation (return raw samples)
`RPERF_SIGNAL`	integer or `false`	Timer signal (Linux only): signal number, or `false` for nanosleep thread
`RPERF_DEFER`	`1`	Start with timer paused; use `Rperf.profile` to activate
`RPERF_ROOT_PROCESS`	PID string	Root process PID for multi-process session (set by CLI)
`RPERF_SESSION_DIR`	path	Session directory for multi-process profile collection (set by CLI)
`RPERF_TMPDIR`	path	Override base directory for session directories (default: `$XDG_RUNTIME_DIR` or system tmpdir)

A.4 Profiling mode comparison

Aspect	cpu	wall
Clock	`CLOCK_THREAD_CPUTIME_ID`	`CLOCK_MONOTONIC`
I/O time	Not measured	`%GVL: blocked` label
Sleep time	Not measured	`%GVL: blocked` label
GVL contention	Not measured	`%GVL: wait` label
GC time	`%GC: mark`, `%GC: sweep` labels	`%GC: mark`, `%GC: sweep` labels
Best for	CPU hotspots	Latency analysis

A.5 Output format comparison

Extension	Format	Tooling required
`.json.gz` (default)	JSON (rperf native)	None (`rperf report`)
`.pb.gz`	pprof protobuf	Go (`rperf report`)
`.collapsed`	Collapsed stacks	flamegraph.pl or speedscope
`.txt`	Text report	None

A.6 VM state labels

Label	Value	Mode	Meaning
`%GVL`	`blocked`	wall	Thread off-GVL (I/O, sleep, C ext)
`%GVL`	`wait`	wall	Thread waiting for GVL (contention)
`%GC`	`mark`	both	GC marking phase (wall time)
`%GC`	`sweep`	both	GC sweeping phase (wall time)
`%pid`	PID string	both	Child process ID (multi-process mode only)

These labels are attached to samples alongside user labels. In pprof output, filter with -tagfocus=%GVL=blocked, -tagroot=%GC, -tagfocus=%pid=1234, etc.

View on GitHub

Bibliography

[Aman, 2014] Aman Gupta. stackprof: A sampling call-stack profiler for Ruby 2.2+. 2014.
[Arnaldo, 2010] Arnaldo Carvalho de Melo. The New Linux perf Tools. 2010.
[Brendan, 2016] Brendan Gregg. "The Flame Graph". Communications of the ACM, 59(6), pp. 48--57. 2016. DOI
[Gang, 2010] Gang Ren and Eric Tune and Tipp Moseley and Yixin Shi and Silvius Rus and Robert Hundt. "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers". IEEE Micro, 30(4), pp. 65--79. 2010. DOI
[Susan, 1982] Susan L. Graham and Peter B. Kessler and Marshall K. McKusick. "gprof: A Call Graph Execution Profiler". In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, pp. 120--126. 1982. DOI
[Todd, 2010] Todd Mytkowicz and Amer Diwan and Matthias Hauswirth and Peter F. Sweeney. "Evaluating the Accuracy of Java Profilers". In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '10), pp. 187--197. 2010. DOI

Index

%

%pid label: 3. CLI Usage

C

collapsed stacks: 1. Introduction; 1. Introduction; 6. Output Formats
cpu mode: 3. CLI Usage
cumulative time: 5. Interpreting Results

D

deferred string resolution: 9. Sampling

F

Framework Integration: 4. Ruby API
flat time: 5. Interpreting Results
fork: 3. CLI Usage
fork safety: 8. Architecture Overview

G

GVL: 1. Introduction; 5. Interpreting Results

J

JSON: 1. Introduction; 1. Introduction
json: 6. Output Formats

L

label set: 8. Architecture Overview

M

multi-process profiling: 8. Architecture Overview

P

postponed job: 1. Introduction; 1. Introduction
pprof: 1. Introduction; 1. Introduction; 4. Ruby API; 6. Output Formats
preforking server: 3. CLI Usage

R

Rperf.label: 4. Ruby API; 4. Ruby API; 6. Output Formats
Rperf.labels: 4. Ruby API
Rperf.profile: 4. Ruby API; 4. Ruby API; 4. Ruby API; 7. Framework Integration; 7. Framework Integration
Rperf.save: 4. Ruby API
Rperf.snapshot: 4. Ruby API
Rperf.start: 4. Ruby API; 7. Framework Integration
Rperf.stop: 4. Ruby API
Rperf::Viewer: 1. Introduction; 6. Output Formats
rperf: 1. Introduction
rperf diff: 3. CLI Usage
rperf exec: 3. CLI Usage
rperf record: 1. Introduction; 3. CLI Usage
rperf report: 1. Introduction; 3. CLI Usage
rperf stat: 1. Introduction; 3. CLI Usage

S

safepoint: 1. Introduction; 1. Introduction
safepoint bias: 1. Introduction; 1. Introduction
sampling: 1. Introduction
session directory: 8. Architecture Overview
single session: 8. Architecture Overview

W

wall mode: 3. CLI Usage; 3. CLI Usage; 3. CLI Usage
weight: 1. Introduction

rperf ガイド

Ruby 向けのセーフポイントベースのサンプリング性能プロファイラ。

rperf は、セーフポイントでサンプリングを行い、実際の時間差分を重みとして使用することでセーフポイントバイアスを補正する Ruby プログラムのプロファイラです。perf ライクな CLI、Ruby API を提供し、JSON、pprof、collapsed stacks、テキストレポート形式で出力します。

主な特徴:

CPU モードとウォールクロックモード
セーフポイントバイアスを補正する時間重み付きサンプリング
GVL 競合と GC フェーズの追跡
JSON（ネイティブ）、pprof、collapsed stacks、テキスト出力形式
低オーバーヘッド (デフォルト 1000 Hz でのサンプリングコールバックコスト < 0.2%)
Ruby >= 3.4.0、POSIX システム (Linux, macOS) が必要

はじめに

このパートでは、rperf とは何か、なぜ作られたのか、そしてシステムへのインストール方法を説明します。

1. イントロダクション

1.1 rperf とは？

rperf は、セーフポイントベースのサンプリング性能プロファイラです。Ruby プログラムがどこで時間を使っているかを特定します。CPU 計算、I/O 待ち、GVL 競合、ガベージコレクションのいずれであっても対応できます。

従来のサンプリングプロファイラがサンプルを均等にカウントするのに対し、rperf は各サンプルの重みとして実際の時間差分（ナノ秒単位）を使用します。これにより、postponed job ベースのサンプリングに固有のセーフポイントバイアス問題を補正し、より正確な結果を生成します。

rperf は Linux の perf^{[Arnaldo, 2010]} に着想を得ており、record、stat、report などのサブコマンドを持つ馴染みのある CLI インターフェースを提供します。

1.1.1 利点

正確なプロファイリング: 時間差分重み付けがセーフポイントバイアスを補正し、従来のカウントベースのプロファイラよりも実際の時間分布に近い結果を生成します。
GVL / GC の可視化: GVL ブロッキングと競合（wall モード）および GC marking/sweeping（両モード）をサンプルラベル（%GVL、%GC）として追跡します。別途ツールは不要です。
標準出力形式: JSON（rperf ネイティブ、デフォルト）、pprof protobuf（go tool pprof と互換）、collapsed stacks（フレームグラフ / speedscope 向け）、人間が読めるテキスト形式で出力します。
ビューア内蔵: ブラウザ内フレームグラフビューア（Rperf::Viewer）をタグフィルタリング付きで内蔵。--html 出力は CDN から d3 を読み込みます。
低オーバーヘッド: デフォルト 1000 Hz でのサンプリングコールバックコストは < 0.2% で、本番環境での使用に適しています。
シンプルな CLI: rperf stat で概要を素早く確認し、rperf record + rperf report で詳細な分析が可能です。

1.1.2 制限事項

Ruby 3.4 以降のみ: Ruby 3.4 で導入された API を必要とします。
POSIX のみ: Linux と macOS。Windows はサポートされていません。
メソッドレベルの粒度: 行番号の解像度はありません。プロファイルはメソッド名のみを表示します。
単一セッション: C 拡張のグローバル状態のため、同時に 1 つのプロファイリングセッションのみ有効です。
セーフポイントレイテンシ: サンプルは依然としてセーフポイントまで遅延されます。時間差分重み付けはバイアスを補正しますが、正確な中断された命令ポインタは復元できません。

1.2 なぜ別のプロファイラが必要なのか？

Ruby には stackprof^{[Aman, 2014]} などのプロファイリングツールが既にあります。では、なぜ rperf が必要なのでしょうか？

1.2.1 セーフポイントバイアス問題

多くの Ruby サンプリングプロファイラは、シグナルハンドラ内で直接 rb_profile_frames を呼び出してバックトレースを収集します。このアプローチはシグナルの実際のタイミングでバックトレースを取得できますが、文書化されていない VM 内部の状態に依存しています。rb_profile_frames は async-signal-safe であることが保証されておらず、VM が内部構造を更新中の場合、結果が不正確になる可能性があります。

rperf は異なるアプローチを取ります。シグナルハンドラから安全に処理を遅延させるための Ruby VM の公式 API である postponed job メカニズム（rb_postponed_job）を使用します。バックトレースの収集は次のセーフポイントまで遅延されます。セーフポイントとは、VM が一貫した状態にあり、rb_profile_frames が信頼性の高い結果を返せるポイントです。タイマーがセーフポイント間で発火した場合、実際のサンプルは次のセーフポイントまで遅延されるというトレードオフがあります。

各サンプルが均等にカウントされる場合（重み = 1）、長時間実行される C メソッドによって遅延されたサンプルは、即座に取得されたサンプルと同じ重みを持つことになります。これがセーフポイントバイアス^{[Todd, 2010]}問題です。スレッドがセーフポイントに到達したときに実行中の関数がそうでない場合よりも多く出現し、セーフポイント間の関数は過小評価されます。

sequenceDiagram participant Timer participant VM participant Profiler Timer->>VM: シグナル (t=0ms) Note over VM: C メソッド実行中...
まだ停止できない VM->>Profiler: セーフポイント到達 (t=3ms) Note over Profiler: 従来: weight = 1
rperf: weight = 3ms Timer->>VM: シグナル (t=1ms) VM->>Profiler: セーフポイント到達 (t=1ms) Note over Profiler: 従来: weight = 1
rperf: weight = 1ms

mermaid source

sequenceDiagram
    participant Timer
    participant VM
    participant Profiler

    Timer->>VM: シグナル (t=0ms)
    Note over VM: C メソッド実行中...<br/>まだ停止できない
    VM->>Profiler: セーフポイント到達 (t=3ms)
    Note over Profiler: 従来: weight = 1<br/>rperf: weight = 3ms
    Timer->>VM: シグナル (t=1ms)
    VM->>Profiler: セーフポイント到達 (t=1ms)
    Note over Profiler: 従来: weight = 1<br/>rperf: weight = 1ms

rperf は各サンプルの重みとして clock_now - clock_prev を記録することでこの問題を解決します。3ms 遅延されたサンプルは 1ms のサンプルの 3 倍の重みを持ち、実際に時間が費やされた場所を正確に反映します。

1.2.2 その他の利点

GVL と GC の認識: wall モードでは、rperf は GVL 外でのブロック時間と GVL の再取得待ち時間を追跡します（%GVL ラベル）。GC の marking/sweeping フェーズの時間は CPU・wall 両モードで追跡されます（%GC ラベル、重みは常に wall time）。これらはユーザーラベルと同様に label_sets に格納され、pprof で -tagfocus 等を使ってフィルタリングできます。
perf ライクな CLI: rperf stat コマンドで性能の概要を素早く確認でき（perf stat のように）、rperf record + rperf report で詳細なプロファイリングが可能です。
標準出力: rperf は JSON（ネイティブ形式、デフォルト）、pprof protobuf 形式（Go の pprof ツールエコシステムと互換）、collapsed stacks（フレームグラフ^{[Brendan, 2016]}や speedscope 向け）、人間が読めるテキスト形式で出力します。
低オーバーヘッド: デフォルト 1000 Hz でのサンプリングコールバックコストは < 0.2% で、本番環境での使用に適しています。

1.3 要件

Ruby >= 3.4.0
POSIX システム (Linux または macOS)
Go (オプション、rperf report で .pb.gz ファイルと rperf diff サブコマンド用)

1.4 クイックスタート

Ruby スクリプトをプロファイルして結果を表示:

# rperf をインストール
gem install rperf

# 性能の概要を表示
rperf stat ruby my_app.rb

# プロファイルを記録してインタラクティブに表示
rperf record ruby my_app.rb
rperf report

Ruby コードから rperf を使用する場合:

require "rperf"

Rperf.start(output: "profile.pb.gz") do
  # プロファイルしたいコード
end

View on GitHub

2. インストール

2.1 gem のインストール

rperf は C 拡張を持つ Ruby gem として配布されています:

gem install rperf

または Gemfile に追加:

gem "rperf"

その後:

bundle install

2.2 インストールの確認

rperf が正しくインストールされていることを確認します:

rperf --help

以下のように表示されるはずです:

Usage: rperf record [options] command [args...]
       rperf stat [options] command [args...]
       rperf exec [options] command [args...]
       rperf report [options] [file]
       rperf diff [options] base.pb.gz target.pb.gz
       rperf help

Run 'rperf help' for full documentation

2.3 プラットフォームサポート

rperf は POSIX システムをサポートしています:

プラットフォーム	タイマー実装	備考
Linux	`timer_create` + シグナル (デフォルト)	最高精度 (1000Hz で ~1000us)
Linux	`nanosleep` スレッド (`signal: false` 使用時)	フォールバック、~100us のドリフト/tick
macOS	`nanosleep` スレッド	シグナルベースのタイマーは利用不可

Linux では、rperf はデフォルトで timer_create と SIGEV_THREAD_ID を使い、専用のワーカースレッドにタイマーシグナルを送ります。これにより、Ruby スレッドを中断することなく正確なインターバルタイミングが実現されます。シグナル番号はデフォルトで SIGRTMIN+8 で、Ruby API の Rperf.start に対する signal: キーワード引数で変更できます。

macOS では（および Linux で signal: false を設定した場合）、rperf は nanosleep ループを持つ専用の pthread にフォールバックします。

2.4 オプション: Go ツールチェーン

Go は rperf report で pprof (.pb.gz) ファイルを表示する場合と、rperf diff サブコマンドでのみ必要です。どちらも go tool pprof のラッパーです。これらの機能が必要な場合は、go.dev から Go をインストールしてください。

Go がなくても、デフォルトの JSON (.json.gz) 形式での rperf report を含む rperf のすべての機能を使用できます。この形式では rperf 組み込みビューアが開きます。pprof ファイルも speedscope などの他のツールで表示したり、テキスト/collapsed 形式で直接生成したりできます。

View on GitHub

使い方

このパートでは、rperf を使って Ruby プログラムをプロファイルする方法を説明します。コマンドラインからの使い方、ライブラリとしての使い方、そして利用可能な出力形式について解説します。

3. CLI の使い方

rperf は perf ライクなコマンドラインインターフェースを提供し、5 つの主要サブコマンドがあります: record、stat、exec、report、diff。

3.1 rperf stat

rperf stat は性能の概要を最も手軽に確認する方法です。コマンドを wall モードのプロファイリングで実行し、サマリーを stderr に出力します。

rperf stat ruby my_app.rb

3.1.1 例: CPU バウンドなプログラム

シンプルなフィボナッチ計算:

# fib.rb
def fib(n)
  return n if n <= 1
  fib(n - 1) + fib(n - 2)
end
fib(35)

rperf stat の実行:

rperf stat ruby fib.rb

 Performance stats for 'ruby fib.rb':

           744.4 ms   user
            32.0 ms   sys
           491.0 ms   real

           481.0 ms 100.0%  [Rperf] CPU execution
            12.0 ms         [Ruby ] GC time (9 count: 6 minor, 3 major)
         154,468            [Ruby ] allocated objects
          66,596            [Ruby ] freed objects
              25 MB         [OS   ] peak memory (maxrss)
              31            [OS   ] context switches (10 voluntary, 21 involuntary)
               0 MB         [OS   ] disk I/O (0 MB read, 0 MB write)

  481 samples / 481 triggers, 0.1% profiler overhead

出力の意味:

user/sys/real: 標準的な時間計測（time コマンドと同様）
時間の内訳: [Rperf] プレフィックスの行はサンプリングから導出された時間内訳を示します — CPU 実行、GVL ブロック（I/O/sleep）、GVL 待ち（競合）、GC marking、GC sweeping
Ruby の統計: [Ruby ] プレフィックスの行は Ruby ランタイム情報を示します — GC 回数、割り当て済み/解放済みオブジェクト、YJIT 比率（有効な場合）
OS の統計: [OS ] プレフィックスの行は OS レベルの統計を示します — ピークメモリ、コンテキストスイッチ、ディスク I/O

--report を使用すると、フラットおよびキュムレイティブな上位 50 関数テーブルが出力に追加されます。

3.1.2 例: CPU と I/O の混合

# mixed.rb
def cpu_work(n)
  sum = 0
  n.times { |i| sum += i * i }
  sum
end

def io_work
  sleep(0.05)
end

5.times do
  cpu_work(500_000)
  io_work
end

rperf stat の実行:

rperf stat ruby mixed.rb

stat は常に wall モードを使用するため、CPU と I/O の間の時間配分が確認できます。[Rperf] GVL blocked の行は sleep/I/O に費やされた時間を示し、CPU execution は計算時間を示します。

3.1.3 stat のオプション

rperf stat [options] command [args...]

オプション	説明
`-o PATH`	プロファイルをファイルにも保存 (デフォルト: なし)
`-f HZ`	サンプリング周波数 (Hz) (デフォルト: 1000)
`-m MODE`	`cpu` または `wall` (デフォルト: `wall`)
`--report`	フラット/キュムレイティブなプロファイルテーブルを出力に含める
`--signal VALUE`	タイマーシグナル (Linux のみ): シグナル番号、または `false` で nanosleep スレッド
`--no-aggregate`	サンプル集約を無効化（生サンプルを保持）
`--no-inherit`	fork/spawn した子プロセスをプロファイルしない (デフォルト: inherit)
`-v`	追加のサンプリング統計を出力

3.2 rperf exec

rperf exec はコマンドをプロファイリング付きで実行し、完全な性能レポートを stderr に出力します。rperf stat --report と同等です。デフォルトで wall モードを使用し、ファイルには保存しません。

rperf exec ruby my_app.rb

stat が表示するすべて（時間計測、時間内訳、GC/メモリ/OS 統計）に加えて、フラットおよびキュムレイティブな上位 50 関数テーブルが出力されます。

3.2.1 exec のオプション

rperf exec [options] command [args...]

オプション	説明
`-o PATH`	プロファイルをファイルにも保存 (デフォルト: なし)
`-f HZ`	サンプリング周波数 (Hz) (デフォルト: 1000)
`-m MODE`	`cpu` または `wall` (デフォルト: `wall`)
`--signal VALUE`	タイマーシグナル (Linux のみ): シグナル番号、または `false` で nanosleep スレッド
`--no-aggregate`	サンプル集約を無効化（生サンプルを保持）
`--no-inherit`	fork/spawn した子プロセスをプロファイルしない (デフォルト: inherit)
`-v`	追加のサンプリング統計を出力

3.3 rperf record

rperf record はコマンドをプロファイルし、結果をファイルに保存します。詳細な分析のためにプロファイルをキャプチャする主要な方法です。

rperf record ruby my_app.rb

デフォルトでは、CPU モードの JSON 形式で rperf.json.gz に保存します。

3.3.1 例: プロファイルの記録

rperf record ruby fib.rb

これにより rperf.json.gz が作成されます。その後 rperf report で分析したり、他の形式に変換したりできます。

3.3.2 プロファイリングモードの選択

rperf は 2 つのプロファイリングモードをサポートしています:

cpu (デフォルト): スレッドごとの CPU 時間を計測します。CPU サイクルを消費する関数を見つけるのに最適です。sleep、I/O、GVL 待ちの時間は無視されます。
wall: ウォールクロック時間を計測します。I/O、sleep、GVL 競合を含む、wall time の使われ方を見つけるのに最適です。

# CPU モード (デフォルト)
rperf record ruby my_app.rb

# wall モード
rperf record -m wall ruby my_app.rb

3.3.3 出力形式の選択

rperf はファイル拡張子から形式を自動検出します:

# JSON 形式 (デフォルト)
rperf record -o profile.json.gz ruby my_app.rb

# pprof 形式
rperf record -o profile.pb.gz ruby my_app.rb

# collapsed stacks (FlameGraph / speedscope 向け)
rperf record -o profile.collapsed ruby my_app.rb

# 人間が読めるテキスト
rperf record -o profile.txt ruby my_app.rb

形式を明示的に指定することもできます:

rperf record --format text -o profile.dat ruby my_app.rb

3.3.4 例: テキスト出力

rperf record -o profile.txt ruby fib.rb

テキスト出力は以下のようになります:

Total: 509.5ms (cpu)
Samples: 509, Frequency: 1000Hz

 Flat:
           509.5 ms 100.0%  Object#fib (fib.rb)

 Cumulative:
           509.5 ms 100.0%  Object#fib (fib.rb)
           509.5 ms 100.0%  <main> (fib.rb)

3.3.5 例: wall モードのテキスト出力

rperf record -m wall -o wall_profile.txt ruby mixed.rb

Total: 311.8ms (wall)
Samples: 80, Frequency: 1000Hz

 Flat:
            44.1 ms  14.1%  Object#cpu_work (mixed.rb)
            13.9 ms   4.5%  Integer#times (<internal:numeric>)
             3.2 ms   1.0%  Kernel#sleep (<C method>)

 Cumulative:
           311.8 ms 100.0%  Integer#times (<internal:numeric>)
           311.8 ms 100.0%  block in <main> (mixed.rb)
           311.8 ms 100.0%  <main> (mixed.rb)
           253.8 ms  81.4%  Kernel#sleep (<C method>)
           253.8 ms  81.4%  Object#io_work (mixed.rb)
            58.0 ms  18.6%  Object#cpu_work (mixed.rb)

wall モードでは、%GVL: blocked ラベルが支配的なコストを示しています – これは io_work の sleep 時間です。cpu_work の CPU 時間は明確に分離されています。GVL や GC のアクティビティはスタックフレームではなくサンプルのラベルとして記録され、pprof の -tagfocus フラグ（例: -tagfocus=%GVL=blocked）でフィルタリングできます。カテゴリごとの時間内訳（CPU 実行、GVL ブロック、GVL 待機、GC マーキング、GC スイーピング）は rperf stat で確認できます。

3.3.6 Verbose 出力

-v フラグはプロファイリング中にサンプリング統計を stderr に出力します:

rperf record -v ruby my_app.rb

[Rperf] mode=cpu frequency=1000Hz
[Rperf] sampling: 98 calls, 0.11ms total, 1.1us/call avg
[Rperf] samples recorded: 904
[Rperf] top 10 by flat:
[Rperf]       53.4ms  50.1%  Object#cpu_work (-e)
[Rperf]       17.0ms  15.9%  Integer#times (<internal:numeric>)
...

3.3.7 record のオプション

rperf record [options] command [args...]

オプション	説明
`-o PATH`	出力ファイル (デフォルト: `rperf.json.gz`)
`-f HZ`	サンプリング周波数 (Hz) (デフォルト: 1000)
`-m MODE`	`cpu` または `wall` (デフォルト: `cpu`)
`--format FMT`	`json`、`pprof`、`collapsed`、または `text` (デフォルト: 拡張子から自動検出)
`-p, --print`	テキストプロファイルを stdout に出力 (`--format=text --output=/dev/stdout` と同等)
`--signal VALUE`	タイマーシグナル (Linux のみ): シグナル番号、または `false` で nanosleep スレッド
`--no-aggregate`	サンプル集約を無効化（生サンプルを保持）
`-v`	サンプリング統計を stderr に出力

3.4 rperf report

rperf report はプロファイルを分析用に開きます。JSON 形式 (.json.gz) のファイルには rperf の組み込みビューアを使用し（Go 不要）、pprof 形式 (.pb.gz) のファイルには go tool pprof を使用します（Go が必要）。

# インタラクティブな Web UI を開く (デフォルト)
rperf report

# 特定のファイルを開く
rperf report profile.json.gz

# 上位の関数を出力
rperf report --top

# pprof テキストサマリーを出力
rperf report --text

# 静的 HTML ビューアを出力（サーバー不要）
rperf report --html profile.json.gz > report.html

--html フラグはプロファイルデータをインラインで埋め込んだ単一の HTML ファイルを生成します。d3 と d3-flamegraph は CDN から読み込むため、表示にはインターネット接続が必要ですが、サーバーは不要です。プロファイルの共有や GitHub Pages などの静的サイトでのホスティングに便利です。

3.4.1 例: top とテキスト出力

先ほど記録した fib.rb のプロファイルを使用:

rperf report --top rperf.json.gz

 Flat:
        577.3 ms 100.0%  Object#fib (fib.rb)
          0.0 ms   0.0%  <main> (-e)

 Cumulative:
        577.3 ms 100.0%  Object#fib (fib.rb)
        577.3 ms 100.0%  <main> (-e)

.pb.gz ファイルの場合、--top と --text は go tool pprof を使用し、pprof 形式の出力を生成します。

デフォルト動作（--top や --text なし）では、フレームグラフ、上位関数ビュー、コールグラフの可視化を備えたインタラクティブな Web UI がブラウザで開きます。JSON 形式では rperf 組み込みビューア、pprof 形式では pprof^{[Gang, 2010]} を利用します。

3.4.2 report のオプション

オプション	説明
`--top`	フラットタイムによる上位の関数を出力
`--text`	テキストレポートを出力
`--html`	静的 HTML ビューアを標準出力に出力（`.json.gz` のみ）
(デフォルト)	ブラウザでインタラクティブな Web UI を開く

3.5 rperf diff

rperf diff は 2 つの pprof プロファイルを比較し、差分（target - base）を表示します。最適化の効果を測定するのに便利です。

# ブラウザで差分を開く
rperf diff before.pb.gz after.pb.gz

# 差分による上位の関数を出力
rperf diff --top before.pb.gz after.pb.gz

# テキスト差分を出力
rperf diff --text before.pb.gz after.pb.gz

3.5.1 ワークフロー例

# ベースラインをプロファイル
rperf record -o before.pb.gz ruby my_app.rb

# 最適化を実施...

# 再度プロファイル
rperf record -o after.pb.gz ruby my_app.rb

# 比較
rperf diff before.pb.gz after.pb.gz

3.6 マルチプロセスプロファイリング

デフォルトで、すべての CLI サブコマンド（stat、exec、record）は fork や spawn された Ruby 子プロセスも自動的にプロファイルします。全プロセスのプロファイルが 1 つの出力にマージされます。stat レポートには集約データが表示され、record は単一のマージ済みファイルを生成します。

これは Unicorn や Puma のようなプリフォーキングサーバーで特に有用です:

# Unicorn サーバーの全ワーカーをプロファイル
rperf stat -m wall bundle exec unicorn -c config.rb

# Puma サーバーのマージ済みプロファイルを記録
rperf record -m wall -o profile.json.gz bundle exec puma

3.6.1 仕組み

子プロセストラッキングが有効な場合（デフォルト）、rperf は:

プロセスごとのプロファイルを収集するためのセッションディレクトリを設定
fork された子プロセスでプロファイリングを再開する Process._fork フックをインストール
spawn された Ruby 子プロセス（spawn、system、バッククォート経由）は RUBYLIB（rperf の lib ディレクトリを指す）と RUBYOPT=-rrperf を継承して自動的にプロファイリングを開始
各子プロセスが終了時にプロファイルをセッションディレクトリに書き込み
ルートプロセスが全プロファイルを集約して最終出力を生成

各子プロセスのサンプルにはプロセス ID を含む %pid ラベルが付与されます。ビューアーで PID によるフィルタリング（tagfocus）や pprof での -tagfocus=%pid=1234 が可能です。

3.6.2 子プロセストラッキングの無効化

--no-inherit でルートプロセスのみをプロファイルします:

rperf stat --no-inherit ruby my_app.rb

3.6.3 例: fork で異なる処理を実行

rperf stat ruby -e '
  def parent_work = 10_000_000.times { 1 + 1 }
  def child_work  = 10_000_000.times { 1 + 1 }
  fork { child_work }
  Process.waitall
  parent_work
'

stat 出力には両プロセスの集約データが表示され、「Ruby processes profiled」のカウントが含まれます:

               2            [Rperf] Ruby processes profiled

3.6.4 制限事項

デーモン化した子プロセス: Process.daemon を呼び出して親より長く生きるプロセスのプロファイルは失われます。親が終了時にセッションディレクトリを集約・削除するためです。
クロスプロセススナップショット: Rperf.snapshot は現在のプロセスのみを対象とします。全子プロセスを同時にスナップショットする方法はありません。
非 Ruby 子プロセス: Ruby 子プロセスのみがプロファイルされます。シェルスクリプトや Python プロセスなどは影響を受けません。
独立した rperf 利用者: 子プロセスが独自に rperf を使用する場合（自身のコードで Rperf.start）、継承されたオートスタートセッションと競合します。そのようなプログラムは rperf を require する前に RPERF_ENABLED を環境変数から削除する必要があります。

3.7 rperf help

rperf help はプロファイリングモード、出力形式、VM 状態ラベル、診断のヒントを含む完全なリファレンスドキュメントを出力します。

rperf help

人間が読むのにも AI による分析にも適した詳細なドキュメントが出力されます。

View on GitHub

4. Ruby API

rperf はプログラマティックなプロファイリングのための Ruby API を提供します。特定のコードセクションをプロファイルしたり、テストスイートにプロファイリングを統合したり、カスタムプロファイリングワークフローを構築したりする場合に便利です。

4.1 基本的な使い方

4.1.1 ブロック形式（推奨）

rperf を使用する最もシンプルな方法は、Rperf.start のブロック形式です。ブロックをプロファイルし、プロファイリングデータを返します:

require "rperf"

data = Rperf.start(output: "profile.json.gz", frequency: 1000, mode: :cpu) do
  # プロファイルしたいコード
end

output: を指定すると、ブロックの終了時にプロファイルが自動的にファイルに書き込まれます。このメソッドは、さらなる処理のために生データのハッシュも返します。

4.1.2 例: フィボナッチ関数のプロファイリング

require "rperf"

def fib(n)
  return n if n <= 1
  fib(n - 1) + fib(n - 2)
end

data = Rperf.start(frequency: 1000, mode: :cpu) do
  fib(33)
end

Rperf.save("profile.txt", data)

実行すると以下のような出力が得られます:

Total: 192.7ms (cpu)
Samples: 192, Frequency: 1000Hz

Flat:
     192.7ms 100.0%  Object#fib (example.rb)

Cumulative:
     192.7ms 100.0%  Object#fib (example.rb)
     192.7ms 100.0%  block in <main> (example.rb)
     192.7ms 100.0%  Rperf.start (lib/rperf.rb)
     192.7ms 100.0%  <main> (example.rb)

4.1.3 手動での開始/停止

ブロック形式が使いにくい場合は、手動でプロファイリングを開始・停止できます:

require "rperf"

Rperf.start(frequency: 1000, mode: :wall)

# ... プロファイルしたいコード ...

data = Rperf.stop

Rperf.stop はデータハッシュを返します。プロファイラが動作していなかった場合は nil を返します。

4.2 Rperf.start のパラメータ

パラメータ	型	デフォルト	説明
`frequency:`	Integer	1000	サンプリング周波数 (Hz)
`mode:`	Symbol	`:cpu`	`:cpu` または `:wall`
`output:`	String	`nil`	停止時に書き込むファイルパス
`verbose:`	Boolean	`false`	統計を stderr に出力
`format:`	Symbol	`nil`	`:json`、`:pprof`、`:collapsed`、`:text`、または `nil`（output 拡張子から自動検出）
`signal:`	Integer/Boolean	`nil`	Linux のみ: `nil` = タイマーシグナル（デフォルト）、`false` = nanosleep スレッド、正の整数 = 特定の RT シグナル番号
`aggregate:`	Boolean	`true`	同一スタックをプロファイリング中に集約してメモリを削減。`false` は生のサンプルごとのデータを返す
`defer:`	Boolean	`false`	タイマーを一時停止した状態で開始。特定のセクションのサンプリングを有効にするには `Rperf.profile` ブロックを使用
`inherit:`	`:fork`, `true`, or `false`	`:fork`	子プロセスの追跡を制御。`:fork` は `Process._fork` フックで fork された子を追跡。`true` は spawn された Ruby 子も追跡（`RUBYLIB` と `RUBYOPT=-rrperf` を設定）。`false` は子プロセス追跡を無効化

4.3 Rperf.stop の戻り値

Rperf.stop はプロファイラが動作していなかった場合は nil を返します。それ以外の場合は Hash を返します:

{
  mode: :cpu,                # または :wall
  frequency: 1000,
  trigger_count: 1234,       # タイマートリガーの回数
  sampling_count: 1234,      # タイマーコールバックの回数
  sampling_time_ns: 56789,   # サンプリングに費やした合計時間（オーバーヘッド）
  detected_thread_count: 4,  # プロファイリング中に検出されたスレッド数
  start_time_ns: 17740...,   # CLOCK_REALTIME エポック（ナノ秒）
  duration_ns: 10000000,     # プロファイリング時間（ナノ秒）

  # aggregate: true（デフォルト）— このモードでのみ存在
  unique_frames: 42,         # ユニークフレーム数
  unique_stacks: 120,        # ユニークスタック数
  aggregated_samples: [                   # [frames, weight, thread_seq, label_set_id] の配列
    [frames, weight, seq, lsi],           #   frames: [[path, label], ...] 最深部が先頭
    ...                                   #   weight: Integer（ナノ秒）
  ],                                      #   seq: Integer（スレッド連番、1 始まり）
                                          #   lsi: Integer（ラベルセット ID、0 = ラベルなし）

  # aggregate: false — このモードでのみ存在（C は raw_samples を返す。
  # Ruby の stop はエンコーダー用に aggregated_samples も構築する）
  raw_samples: [                          # aggregated_samples と同じ要素形式
    [frames, weight, seq, lsi],
    ...
  ],

  label_sets: [{}, {request: "abc"},       # ラベルセットテーブル（ラベル使用時に存在）
              {"%GVL" => "blocked"},       # VM 状態ラベルも含まれる
              {request: "abc", "%GC" => "mark"}],
}

各サンプル（aggregated_samples と raw_samples の両方）には以下が含まれます:

frames: [path, label] ペアの配列、最深部が先頭（リーフフレームがインデックス 0）
weight: このサンプルに帰属する時間（ナノ秒）
thread_seq: スレッド連番（1 始まり、プロファイリングセッションごとに割り当て）
label_set_id: ラベルセット ID（0 = ラベルなし）。label_sets 配列へのインデックス

aggregate: true（デフォルト）の場合、同一スタックはマージされ、重みが合計されます。aggregated_samples 配列にはユニークな (stack, thread_seq, label_set_id) の組み合わせごとに 1 エントリが含まれます。aggregate: false の場合、C 拡張は個々のタイマーサンプルすべてを raw_samples として返します。Ruby の Rperf.stop はエンコーダーが常に動作するように aggregated_samples も構築します。

GVL/GC の状態は label_sets にラベルとして格納されます。C 拡張は内部的に vm_state として記録しますが、Rperf.stop が merge_vm_state_labels! で %GVL/%GC ラベルに変換してから返します。例えば、GVL ブロック中のサンプルの label_sets エントリには {"%GVL" => "blocked"} が含まれます。ユーザーラベルと VM 状態ラベルは同じ label_sets で管理されるため、{request: "abc", "%GVL" => "blocked"} のように組み合わせて使用できます。

4.4 Rperf.save

Rperf.save はプロファイリングデータをサポートされている任意の形式でファイルに書き込みます:

Rperf.save("profile.json.gz", data)      # JSON 形式（デフォルト）
Rperf.save("profile.pb.gz", data)        # pprof 形式
Rperf.save("profile.collapsed", data)    # collapsed stacks
Rperf.save("profile.txt", data)          # テキストレポート

形式はファイル拡張子から自動検出されます。format: キーワードでオーバーライドできます:

Rperf.save("output.dat", data, format: :text)

4.5 Rperf.snapshot

Rperf.snapshot は停止せずに現在のプロファイリングデータを返します。集約モード（デフォルト）でのみ動作します。プロファイリング中でない場合は nil を返します。

Rperf.start(frequency: 1000)
# ... 処理 ...
snap = Rperf.snapshot
Rperf.save("snap.json.gz", snap)
# ... さらに処理（プロファイリングは継続） ...
data = Rperf.stop

clear: true を指定すると、スナップショット取得後に集約データがリセットされます。これにより、各スナップショットが前回のクリア以降の期間のみをカバーするインターバルベースのプロファイリングが可能になります:

Rperf.start(frequency: 1000)
loop do
  sleep 10
  snap = Rperf.snapshot(clear: true)
  Rperf.save("profile-#{Time.now.to_i}.pb.gz", snap)
end

4.6 サンプルラベル

Rperf.label は現在のスレッドのサンプルにキーバリューラベルを付与します。ラベルは pprof のサンプルラベルに表示され、コンテキストごとのフィルタリング（例: リクエストごとのプロファイリング）が可能になります。プロファイリングが動作していない場合、label は暗黙的に無視されます。無条件に呼び出しても安全です（例: Rack ミドルウェアから）。

4.6.1 ブロック形式

ブロックを使用すると、ブロックの終了時にラベルが自動的に復元されます。例外が発生した場合も同様です:

Rperf.label(request: "abc-123", endpoint: "/api/users") do
  handle_request   # 内部のサンプルにこれらのラベルが付く
end
# ここでラベルは以前の状態に復元される

4.6.2 ブロックなし

ブロックなしの場合、ラベルは変更されるまで現在のスレッドに残ります:

Rperf.label(request: "abc-123")
# このスレッドの以降のすべてのサンプルに request="abc-123" が付く

4.6.3 マージと削除

新しいラベルは既存のラベルとマージされます。値を nil に設定するとキーが削除されます:

Rperf.label(request: "abc")
Rperf.label(phase: "db")       # phase を追加、request は保持
Rperf.labels                   #=> {request: "abc", phase: "db"}
Rperf.label(request: nil)      # request を削除
Rperf.labels                   #=> {phase: "db"}

4.6.4 ネストされたブロック

各ブロックはスコープを作成します。終了時に、ブロック内で何が起こったかに関係なく、ブロック前の状態にラベルが復元されます:

Rperf.label(request: "abc") do
  Rperf.label(phase: "db") do
    Rperf.labels  #=> {request: "abc", phase: "db"}
  end
  Rperf.labels    #=> {request: "abc"}
end
Rperf.labels      #=> {}

4.6.5 ラベルによるフィルタリング

ラベルはすべての出力形式で保持されます。rperf ビューア（rperf report）や Rperf::Viewer でインタラクティブにフィルタリングできます:

tagfocus: ラベル値にマッチする正規表現を入力してサンプルを絞り込み（例: abc-123）
tagroot: ラベルキーをルートフレームとして追加しフレームグラフをグループ化（例: request ごと）
tagleaf: ラベルキーをリーフフレームとして追加
tagignore: 特定のラベル値に一致するサンプルを除外

pprof 形式の場合、go tool pprof -tagfocus、-tagroot、-tagleaf、-tagignore で同等の操作が可能です。

4.6.6 ラベルの読み取り

Rperf.labels は現在のスレッドのラベルを Hash として返します:

Rperf.labels  #=> {request: "abc", phase: "db"}

ラベルが設定されていない場合は空の Hash を返します。

4.7 遅延開始と Rperf.profile

4.7.1 なぜ遅延するのか？

通常、Rperf.start はサンプリングタイマーを即座に開始します。タイマーティックごとにスタックトレースをキャプチャするためにアプリケーションが中断されます。これがプロファイリングのオーバーヘッドです。長時間実行されるサーバーでは、すべてのコードに対して常にこのコストを払いたくないかもしれません。特定のエンドポイント、ジョブ、またはコードパスのみを対象にしたい場合があります。

defer: true はこの問題を解決します。プロファイラのインフラストラクチャ（バッファ、フック、ワーカースレッド）をセットアップしますが、タイマーは開始しません。タイマーは Rperf.profile ブロック内でのみ発火します。ブロックの外ではオーバーヘッドはゼロです。シグナルも、割り込みも、スタックキャプチャもありません。

start(defer: false)     start(defer: true)
┌─────────────────┐     ┌─────────────────┐
│ start            │     │ start            │
│ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │     │                  │  ← タイマー非発火
│ ▓ sampling ▓▓▓▓ │     │ ┌──profile──┐   │
│ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │     │ │▓▓sampling▓│   │  ← タイマー有効
│ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │     │ └───────────┘   │
│ ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ │     │                  │  ← タイマー非発火
│ stop             │     │ stop             │
└─────────────────┘     └─────────────────┘

これはフレームワーク統合と組み合わせると特に便利です。ミドルウェアが各リクエスト/ジョブを profile ブロックでラップするため、実際のリクエスト処理のみがサンプリングされます。

4.7.2 Rperf.profile

Rperf.profile はブロックの間タイマーを有効化し、オプションでラベルを適用します。タイマー制御とラベル割り当てを 1 つの呼び出しで組み合わせます。

require "rperf"

Rperf.start(defer: true, mode: :wall)

# ここでタイマーが有効化され、サンプルに endpoint="/users" ラベルが付く
Rperf.profile(endpoint: "/users") do
  handle_request
end
# タイマー一時停止 — オーバーヘッドゼロ

Rperf.profile(endpoint: "/health") do
  check_health
end

data = Rperf.stop
Rperf.save("profile.json.gz", data)

4.7.3 ネスト

profile ブロックはネストできます。タイマーは最も外側のブロックが終了するまで有効です。ラベルは Rperf.label と同様にマージされます:

Rperf.profile(endpoint: "/users") do
  Rperf.profile(phase: "db") do
    # {endpoint: "/users", phase: "db"} でサンプリング
    query_db
  end
  # {endpoint: "/users"} でサンプリング
  render_response
end
# タイマー再び一時停止

4.7.4 Rperf.label との組み合わせ

profile はタイマーを制御し、label はタグのみを追加します。両方を一緒に使用できます:

Rperf.start(defer: true, mode: :wall)

Rperf.profile(endpoint: "/users") do
  Rperf.label(phase: "auth") do
    authenticate     # サンプリングされ、endpoint + phase のラベル付き
  end
  Rperf.label(phase: "db") do
    query_db         # サンプリングされ、endpoint + phase のラベル付き
  end
end

4.7.5 defer なしの場合

profile は通常の（非遅延）start でも動作します。この場合、タイマーは既に動作しており、profile はラベルの適用のみを行います。ブロック付きの Rperf.label と同等です:

Rperf.start(mode: :wall)
Rperf.profile(endpoint: "/users") do
  handle_request   # サンプリングされる（タイマーは既に動作中）
end

4.7.6 エラー処理

profile はプロファイリングが開始されていない場合は RuntimeError を、ブロックが与えられない場合は ArgumentError を発生させます。ブロック内で例外が発生しても、ラベルとタイマーの状態は適切に復元されます。

4.8 実践的な例

4.8.1 Web リクエストハンドラのプロファイリング

require "rperf"

class ApplicationController
  def profile_action
    data = Rperf.start(mode: :wall) do
      # 典型的なリクエストのシミュレーション
      users = User.where(active: true).limit(100)
      result = users.map { |u| serialize_user(u) }
      render json: result
    end

    Rperf.save("request_profile.txt", data)
  end
end

ここで wall モードを使用すると、CPU 時間だけでなくデータベース I/O や GVL 競合も捕捉できます。

4.8.2 CPU プロファイルと wall プロファイルの比較

require "rperf"

def workload
  # CPU と I/O の混合
  100.times do
    compute_something
    sleep(0.001)
  end
end

# CPU プロファイル: CPU サイクルの使われ方を表示
cpu_data = Rperf.start(mode: :cpu) { workload }
Rperf.save("cpu.txt", cpu_data)

# wall プロファイル: wall time の使われ方を表示
wall_data = Rperf.start(mode: :wall) { workload }
Rperf.save("wall.txt", wall_data)

CPU プロファイルは compute_something に集中し、wall プロファイルは sleep 呼び出しを %GVL=blocked ラベル付きのサンプルとして表示します。

4.8.3 サンプルの処理

サンプルデータをプログラマティックに処理できます。デフォルトでは、サンプルは集約されます（同一スタックがマージ）:

require "rperf"

data = Rperf.start(mode: :cpu) { workload }
# data[:aggregated_samples] に集約されたエントリが含まれる（ユニークスタックごとに 1 つ）

# 最もホットな関数を見つける
flat = Hash.new(0)
data[:aggregated_samples].each do |frames, weight, thread_seq|
  leaf_label = frames.first&.last  # frames[0] がリーフ
  flat[leaf_label] += weight
end

top = flat.sort_by { |_, w| -w }.first(5)
top.each do |label, weight_ns|
  puts "#{label}: #{weight_ns / 1_000_000.0}ms"
end

生の（非集約の）サンプルごとのデータを取得するには、aggregate: false を渡します。各タイマーティックが個別のエントリを生成します:

data = Rperf.start(mode: :cpu, aggregate: false) { workload }
# data[:raw_samples] にタイマーサンプルごとに 1 エントリが含まれる
data[:raw_samples].each do |frames, weight, thread_seq|
  puts "thread=#{thread_seq} weight=#{weight}ns depth=#{frames.size}"
end

4.8.4 FlameGraph 用の collapsed stacks の生成

require "rperf"

data = Rperf.start(mode: :cpu) { workload }
Rperf.save("profile.collapsed", data)

collapsed 形式はユニークスタックごとに 1 行で、Brendan Gregg の FlameGraph^{[Brendan, 2016]} ツールや speedscope と互換性があります:

frame1;frame2;...;leaf weight_ns

フレームグラフの SVG を生成できます:

flamegraph.pl profile.collapsed > flamegraph.svg

または .collapsed ファイルを speedscope で直接開くことができます。

View on GitHub

5. 結果の解釈

5.1 フラットタイムとキュムレイティブタイム

フラットタイムとキュムレイティブタイムの違いを理解することは、効果的なプロファイリングに不可欠です。これは古典的なプロファイリング文献 gprof^{[Susan, 1982]} でも述べられています。

フラットタイム: 関数に直接帰属する時間 — サンプル内でリーフ（最深部のフレーム）だった時間です。フラットタイムが高い場合、その関数自体が高コストな処理を行っています。
キュムレイティブタイム: 関数がスタックのどこかに出現するすべてのサンプルの時間です。キュムレイティブタイムが高い場合、その関数（またはそれが呼び出すもの）が高コストです。

mermaid source

graph TD
    A["main()"] -->|calls| B["process()"]
    B -->|calls| C["compute()"]
    B -->|calls| D["format()"]

    style C fill:#ff6b6b,color:#fff
    style D fill:#ffa07a,color:#fff

compute() のフラットタイムが高い場合、compute() 自体を最適化します。process() のキュムレイティブタイムが高くフラットタイムが低い場合、コストはその子関数（compute() と format()）にあります。

5.2 重みの単位

rperf のすべての重みは、プロファイリングモードに関係なくナノ秒です:

1,000 ns = 1 us (マイクロ秒)
1,000,000 ns = 1 ms (ミリ秒)
1,000,000,000 ns = 1 s (秒)

5.3 VM 状態ラベル

通常のユーザー定義ラベルに加えて、rperf は GVL や GC の状態を表す特別なラベルをサンプルに付与します。これらは合成フレームではなく、label_sets に格納されるラベルです。ビューアの tagfocus、tagroot、tagleaf でフィルタリングやグループ化が可能です。

C 拡張は各サンプルに vm_state を記録し、Ruby の Rperf.stop で merge_vm_state_labels! によってラベルに変換されます。

5.3.1 %GVL=blocked

モード: wall のみ

スレッドが GVL の外にいた時間 — I/O 操作、sleep、GVL を解放する C 拡張の実行中。サンプルには %GVL キーに blocked の値を持つラベルが付与されます。この時間は SUSPENDED イベント（スレッドが GVL を解放した時点）でキャプチャされたスタックに帰属されます。

%GVL=blocked のサンプルが多い場合、プログラムは I/O バウンドです。キュムレイティブビューで I/O をトリガーしている関数を確認してください。

ビューアで tagfocus に blocked を入力するか、tagroot で %GVL をチェックして GVL 状態ごとにグループ化できます。

5.3.2 %GVL=wait

モード: wall のみ

スレッドが準備完了後に GVL を再取得するまで待った時間。これは GVL 競合を示します。別のスレッドがこのスレッドの実行を妨げて GVL を保持しています。

%GVL=wait のサンプルが多い場合、スレッドが GVL 上でシリアライズされています。GVL を保持する処理を減らすか、Ractor を使用するか、子プロセスに処理を移動することを検討してください。

ビューアで tagfocus に wait を入力してこれらのサンプルを絞り込めます。

5.3.3 %GC=mark

モード: cpu および wall

GC の marking フェーズに費やされた時間。常に wall time で計測されます。GC をトリガーしたスタックに帰属されます。

%GC=mark のサンプルが多い場合、生存オブジェクトが多すぎます。長寿命のアロケーションを減らしてください。

5.3.4 %GC=sweep

モード: cpu および wall

GC の sweeping フェーズに費やされた時間。常に wall time で計測されます。GC をトリガーしたスタックに帰属されます。

%GC=sweep のサンプルが多い場合、短寿命のオブジェクトが多すぎます。オブジェクトの再利用やオブジェクトプールの使用を検討してください。

ビューアで tagfocus に mark や sweep を入力して絞り込むか、tagroot で %GC をチェックしてフェーズごとにグループ化できます。

pprof 形式の場合、go tool pprof -tagfocus=%GC profile.pb.gz や -tagroot=%GC で同等の操作が可能です。

5.4 よくある問題の診断

5.4.1 高い CPU 使用率

モード: cpu

フラット CPU 時間が高い関数を探します。これらが CPU サイクルを消費している関数です。

対策: ホットな関数を最適化するか（より良いアルゴリズム、キャッシュ）、呼び出し頻度を減らします。

5.4.2 遅いリクエスト / 高レイテンシ

モード: wall

キュムレイティブな wall time が高い関数を探します。

%GVL=blocked が支配的な場合: I/O または sleep がボトルネックです。データベースクエリ、HTTP 呼び出し、ファイル I/O を確認してください。
%GVL=wait が支配的な場合: GVL 競合です。GVL を保持する処理を減らすか、Ractor/子プロセスに移行してください。

5.4.3 GC プレッシャー

モード: cpu または wall

%GC=mark と %GC=sweep のサンプルを探します。

%GC=mark が多い場合: 生存オブジェクトが多すぎます。長寿命オブジェクトのアロケーションを減らしてください。
%GC=sweep が多い場合: 短寿命オブジェクトが多すぎます。オブジェクトの再利用やプーリングを行ってください。

rperf stat の出力には GC 回数やアロケーション済み/解放済みオブジェクト数も表示されるため、アロケーションの多いコードの診断に役立ちます。

5.4.4 マルチスレッドアプリが想定より遅い

モード: wall

スレッド間の %GVL=wait ラベルを持つサンプルの時間を探します。

rperf stat ruby threaded_app.rb

GVL 競合が発生しているワークロードの出力例:

 Performance stats for 'ruby threaded_app.rb':

            89.9 ms   user
            14.0 ms   sys
            41.2 ms   real

             0.5 ms   1.2%  [Rperf] CPU execution
             9.8 ms  23.8%  [Rperf] GVL blocked (I/O, sleep)
            30.8 ms  75.0%  [Rperf] GVL wait (contention)

ここでは wall time の 75% が GVL 競合です。4 つのスレッドが GVL を奪い合い、CPU 処理がシリアライズされています。

5.5 効果的なプロファイリングのヒント

デフォルト周波数 (1000 Hz) はほとんどの場合で適切です。長時間の本番プロファイリングでは、100-500 Hz でオーバーヘッドをさらに削減できます。
代表的なワークロードをプロファイルしてください。マイクロベンチマークではなく、実際のリクエストやバッチジョブをプロファイルすると、よりアクション可能な結果が得られます。
cpu と wall のプロファイルを比較して、CPU バウンドと I/O バウンドのコードを区別します。wall モードでホットだが CPU モードではない関数は、I/O でブロックされています。
rperf diff を使用して最適化の効果を測定します。前後でプロファイルし、比較します。
verbose フラグ (-v) でプロファイラのオーバーヘッドを確認し、上位の関数をすぐに確認できます。

View on GitHub

6. 出力形式

rperf は 4 つの出力形式をサポートしています。形式はファイル拡張子から自動検出されるか、--format フラグ（CLI）または format: パラメータ（API）で明示的に設定できます。

6.1 JSON (デフォルト)

JSON 形式は rperf のネイティブ形式で、デフォルトの出力形式です。2 つのバリエーションがあります:

.json.gz（デフォルト）— gzip 圧縮 JSON。コンパクトで、保存・共有に推奨。
.json — プレーンテキスト JSON。jq、テキストエディタ、その他の JSON ツールで展開なしに読める。

両方のバリエーションは Rperf.load と rperf report で自動検出されます。

表示方法:

# rperf ビューアで開く
rperf report profile.json.gz
rperf report profile.json      # こちらも動作

Ruby でロード:

data = Rperf.load("profile.json.gz")  # gzip
data = Rperf.load("profile.json")     # プレーンテキスト

利点: rperf のネイティブ形式。表示に外部ツール不要。ポータブルな形式。Ruby にロードし直したり、JSON 対応の任意のツールで処理可能。jq でパイプ処理したい場合は .json を使用。

6.2 pprof

pprof 形式は gzip 圧縮された Protocol Buffers バイナリです。これは Go の pprof ツールで使用される標準形式です。

拡張子の規約: .pb.gz

表示方法:

# インタラクティブな Web UI（Go が必要）
rperf report profile.pb.gz

# 上位の関数
rperf report --top profile.pb.gz

# テキストレポート
rperf report --text profile.pb.gz

# go tool pprof を直接使用する場合
go tool pprof -http=:8080 profile.pb.gz
go tool pprof -top profile.pb.gz

speedscope の Web インターフェースから pprof ファイルをインポートすることもできます。

利点: 幅広いツールエコシステムでサポートされている標準形式。2 つのプロファイル間の差分比較が可能。フレームグラフ、コールグラフ、ソースアノテーション付きのインタラクティブな探索。

6.2.1 埋め込みメタデータ

rperf は各 pprof プロファイルに以下のメタデータを埋め込みます:

フィールド	説明
`comment`	rperf バージョン、プロファイリングモード、周波数、Ruby バージョン
`time_nanos`	プロファイル収集開始時刻（エポックナノ秒）
`duration_nanos`	プロファイル期間（ナノ秒）
`doc_url`	rperf ドキュメントへのリンク

コメントの表示: go tool pprof -comments profile.pb.gz

6.2.2 サンプルラベル

各サンプルには thread_seq 数値ラベルが付きます。これはプロファイリングセッション中に rperf が各スレッドを初めて検出したときに割り当てられるスレッド連番（1 始まり）です。Rperf.label を使用すると、カスタムのキーバリュー文字列ラベルもサンプルに付与されます。

go tool pprof でこれらのラベルをフィルタリングできます:

go tool pprof -tagroot=thread_seq profile.pb.gz     # スレッドごとにグループ化
go tool pprof -tagfocus=request=abc-123 profile.pb.gz  # ラベルでフィルタ
go tool pprof -tagroot=request profile.pb.gz         # ラベルでグループ化

Note

rperf ビューア（JSON 形式）は Go 不要で同じタグ操作（tagfocus、tagignore、tagroot、tagleaf）をサポートしています。詳細はブラウザ内ビューアを参照してください。

6.3 Collapsed stacks

collapsed stacks 形式は、ユニークなスタックトレースごとに 1 行のプレーンテキスト形式です。各行にはセミコロン区切りのスタック（ボトムからトップ）の後にスペースとナノ秒単位の重みが続きます。

拡張子の規約: .collapsed

形式:

bottom_frame;...;top_frame weight_ns

出力例:

<main>;Integer#times;block in <main>;Object#cpu_work;Integer#times;Object#cpu_work 53419170
<main>;Integer#times;block in <main>;Object#cpu_work;Integer#times 16962309
<main>;Integer#times;block in <main>;Object#io_work;Kernel#sleep 2335151

使用方法:

# FlameGraph SVG を生成
rperf record -o profile.collapsed ruby my_app.rb
flamegraph.pl profile.collapsed > flamegraph.svg

# speedscope で開く（.collapsed ファイルをドラッグ＆ドロップ）
# macOS: open https://www.speedscope.app/
# Linux: xdg-open https://www.speedscope.app/

利点: シンプルなテキスト形式で、コマンドラインツールで処理しやすい。Brendan Gregg の FlameGraph^{[Brendan, 2016]} ツールや speedscope と互換性があります。

6.3.1 collapsed stacks のプログラマティックなパース

File.readlines("profile.collapsed").each do |line|
  stack, weight = line.rpartition(" ").then { |s, _, w| [s, w.to_i] }
  frames = stack.split(";")
  # frames[0] がボトム（main）、frames[-1] がリーフ（ホット）
  puts "#{frames.last}: #{weight / 1_000_000.0}ms"
end

6.4 テキストレポート

テキスト形式は、フラットおよびキュムレイティブな上位 N テーブルを含む人間が読める（AI にも読める）レポートです。

拡張子の規約: .txt

出力例:

Total: 509.5ms (cpu)
Samples: 509, Frequency: 1000Hz

 Flat:
           509.5 ms 100.0%  Object#fib (fib.rb)

 Cumulative:
           509.5 ms 100.0%  Object#fib (fib.rb)
           509.5 ms 100.0%  <main> (fib.rb)

セクション:

ヘッダー: プロファイルされた合計時間、サンプル数、周波数
フラットテーブル: セルフタイム（関数がリーフ/最深部フレームだった時間）でソートされた関数
キュムレイティブテーブル: トータルタイム（関数がスタックのどこかに出現した時間）でソートされた関数

利点: ツール不要 — cat で読める。テーブルごとにデフォルトで上位 50 エントリ。クイック分析、イシューレポートでの共有、AI アシスタントへの入力に適しています。

6.5 形式の比較

機能	json	pprof	collapsed	text
ファイルサイズ	中 (json + gzip)	小 (バイナリ + gzip)	中 (テキスト)	小 (テキスト)
フレームグラフ	あり (rperf ビューア)	あり (pprof Web UI)	あり (flamegraph.pl)	なし
コールグラフ	なし	あり	なし	なし
差分比較	あり (`rperf diff`、Go 必要)	あり (`rperf diff`)	なし	なし
ツール不要	はい	いいえ (Go 必要)	いいえ (flamegraph.pl 必要)	はい
Ruby にロード	あり (`Rperf.load`)	なし	なし	なし
プログラマティックなパース	容易 (JSON)	複雑 (protobuf)	シンプル	シンプル
AI フレンドリー	はい	いいえ	はい	はい

6.6 自動検出ルール

ファイル拡張子	形式
`.json.gz`	JSON (デフォルト)
`.pb.gz`	pprof
`.collapsed`	Collapsed stacks
`.txt`	テキストレポート

デフォルトの出力ファイル（rperf.json.gz）は JSON 形式を使用します。

View on GitHub

7. フレームワーク統合

rperf は、Web フレームワークやジョブプロセッサからのコンテキストで自動的にプロファイルおよびラベル付けするオプションの統合機能を提供します。これらは Rperf.profile を使用し、タイマーの有効化とラベルの設定を同時に行います。start(defer: true) とシームレスに連携します。プロファイリングの開始は別途行ってください（例: イニシャライザで）。

Note

Rperf.profile が有効化するタイマーはプロセス全体に作用します。あるスレッドの profile ブロック実行中、同時に動いている他のスレッドもサンプリングされます。各スレッドのサンプルにはそのスレッド自身のラベルが付くため、プロファイル上で区別できます。この設計により、プロファイル対象区間でのGVL競合やバックグラウンド処理の影響も可視化できます。

7.1 Rack ミドルウェア

Rperf::RackMiddleware は各リクエストをプロファイルし、エンドポイント（METHOD /path）でラベル付けします。デフォルトで動的セグメントが正規化されます（数値 ID → :id、UUID → :uuid）。label: :raw で元の PATH_INFO をそのまま使うか、label: にカスタム proc を渡してフレームワーク固有のルート正規化も可能です。

7.1.1 Rails

# config/initializers/rperf.rb
require "rperf/rack"
require "rperf/viewer"

Rperf.start(defer: true, mode: :wall, frequency: 99)

Rails.application.config.middleware.use Rperf::Viewer
Rails.application.config.middleware.use Rperf::RackMiddleware

# 60分ごとにスナップショットを取得
Thread.new do
  loop do
    sleep 60 * 60
    Rperf::Viewer.instance&.take_snapshot!
  end
end

/rperf/ にアクセスしてビューアを開きます。tagfocus でエンドポイントをフィルタリング（例: GET /api/users）したり、tagroot でエンドポイントごとにフレームグラフをグループ化できます。

7.1.2 Sinatra

require "sinatra"
require "rperf/rack"
require "rperf/viewer"

Rperf.start(defer: true, mode: :wall, frequency: 99)
use Rperf::Viewer
use Rperf::RackMiddleware

Thread.new { loop { sleep 3600; Rperf::Viewer.instance&.take_snapshot! } }

get "/hello" do
  "Hello, world!"
end

7.1.3 ラベルキーのカスタマイズ

デフォルトではミドルウェアはラベルキー :endpoint を使用します。変更できます:

use Rperf::RackMiddleware, label_key: :route

7.2 Active Job

Rperf::ActiveJobMiddleware は各ジョブをプロファイルし、クラス名（例: SendEmailJob）でラベル付けします。任意の Active Job バックエンド（Sidekiq、GoodJob、Solid Queue など）で動作します。

イニシャライザでプロファイリングを開始し、ベースジョブクラスにインクルードします:

# config/initializers/rperf.rb
Rperf.start(defer: true, mode: :wall, frequency: 99)

# app/jobs/application_job.rb
class ApplicationJob < ActiveJob::Base
  include Rperf::ActiveJobMiddleware
end

すべてのサブクラスが自動的にラベルを継承します:

class SendEmailJob < ApplicationJob
  def perform(user)
    # ここのサンプルに job="SendEmailJob" が付く
  end
end

ビューアで tagfocus を使ってジョブ名でフィルタリングしたり、tagroot でジョブクラスごとにフレームグラフをグループ化できます。

7.3 Sidekiq

Rperf::SidekiqMiddleware は各ジョブをプロファイルし、ワーカークラス名でラベル付けします。Active Job ベースのワーカーとプレーンな Sidekiq ワーカーの両方をカバーします。

Sidekiq のサーバーミドルウェアとして登録します:

# config/initializers/sidekiq.rb
Rperf.start(defer: true, mode: :wall, frequency: 99)

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add Rperf::SidekiqMiddleware
  end
end

Note

Active Job と Sidekiq を併用する場合は、どちらか一方を選んでください。両方を使用するとラベルが重複します。Sidekiq ミドルウェアの方がより汎用的です（非 Active Job ワーカーもカバー）。

7.4 ブラウザ内ビューア

Rperf::Viewer は、設定可能なマウントパスでインタラクティブなプロファイリング UI を提供する Rack ミドルウェアです。スナップショットをメモリに保持し、d3-flame-graph を使ってブラウザ内で描画します。gem の依存やビルドツールは不要です。ビューアは実行時に CDN（cdnjs.cloudflare.com, cdn.jsdelivr.net）から d3.js と d3-flame-graph を読み込むため、初回アクセス時にインターネット接続が必要です。

Warning

Rperf::Viewer には組み込みの認証機能がなく、スタックトレースやラベル値を含むプロファイリングデータがエンドポイントにアクセスできる全員に公開されます。本番環境では必ずフレームワークの認証でアクセスを制限してください（下記「アクセス制御」参照）。

7.4.1 セットアップ

# config.ru（または Rails イニシャライザ）
require "rperf/viewer"
require "rperf/rack"

Rperf.start(defer: true, mode: :wall, frequency: 999)

use Rperf::Viewer                           # /rperf/ で UI を提供
use Rperf::RackMiddleware                   # 各リクエストにラベルを付与
run MyApp

# 60分ごとにスナップショットを取得
Thread.new do
  loop do
    sleep 60 * 60
    Rperf::Viewer.instance&.take_snapshot!
  end
end

スナップショットが取得された後、ブラウザで /rperf/ にアクセスしてください。

7.4.2 オプション

オプション	デフォルト	説明
`path:`	`"/rperf"`	ビューアの URL プレフィックス
`max_snapshots:`	`24`	メモリに保持するスナップショットの最大数（古いものから破棄）

7.4.3 スナップショットの取得

# プログラムから（コントローラ、バックグラウンドスレッド、コンソール等）
Rperf::Viewer.instance.take_snapshot!

# または事前に取得したデータを追加
data = Rperf.snapshot(clear: true)
Rperf::Viewer.instance.add_snapshot(data)

7.4.4 UI タブ

ビューアには 3 つのタブがあります:

Flamegraph — d3-flame-graph によるインタラクティブなフレームグラフ。フレームをクリックでズームイン、ルートをクリックでズームアウト。
Top — Flat（リーフ）と Cumulative（累積）の重み付けテーブル（上位 50 関数）。カラムヘッダー（Flat、Cum、Function）をクリックでソート。
Tags — 各ラベルキーについて、値ごとの重みとパーセンテージの内訳を表示。値の行をクリックすると tagfocus を設定して Flamegraph タブに遷移。

7.4.5 フィルタリング

上部のコントロールバーに 4 つのフィルタがあります:

tagfocus — テキスト入力。ラベル値にマッチする正規表現を入力。Enter で適用。
tagignore — ドロップダウン + チェックボックス。チェックした項目に一致するサンプルを除外。各ラベルキーには (none) エントリがあり、そのキーを持たないサンプルを除外できます — endpoint ラベルのないバックグラウンドスレッドを除外する際に便利です。
tagroot — ラベルキーのドロップダウン + チェックボックス。チェックしたキーがフレームグラフのルートフレームとして先頭に追加されます（例: [endpoint: GET /users]）。
tagleaf — tagroot と同様ですが、リーフフレームとして末尾に追加されます。

ラベルキーはアルファベット順にソートされます。% プレフィックスの VM 状態キー（%GC、%GVL）が先頭に来るため、GC や GVL の状態を leaf/root フレームとして追加しやすくなっています。

7.4.6 アクセス制御

Rperf::Viewer には組み込みの認証機能はありません。フレームワークの既存の仕組みでアクセスを制限してください:

# Rails: ルート制約（管理者のみ）
# config/routes.rb
require "rperf/viewer"
constraints ->(req) { req.session[:admin] } do
  mount Rperf::Viewer.new(nil, path: ""), at: "/rperf"
end

7.5 Rperf.profile によるオンデマンドプロファイリング

特定のエンドポイントやジョブのみをプロファイルし、他の部分ではオーバーヘッドをゼロにしたい場合は、Rperf.start(defer: true) と Rperf.profile を使用します:

# config/initializers/rperf.rb
require "rperf"

Rperf.start(defer: true, mode: :wall, frequency: 99)

その後、特定のコードパスを profile でラップします:

class UsersController < ApplicationController
  def index
    Rperf.profile(endpoint: "GET /users") do
      @users = User.all
    end
  end
end

profile ブロックの外ではタイマーが停止し、オーバーヘッドはゼロです。profile ブロック内では、プロセス全体のスレッドがサンプリング対象になります（各スレッドのサンプルにはそのスレッド自身のラベルが付きます）。

Rperf::Viewer と組み合わせて結果をブラウザで確認するか、Rperf.snapshot + Rperf.save でファイルに保存して rperf report でオフライン分析もできます。

7.6 Rails の完全な設定例

Web とジョブの両方をプロファイリングする典型的な Rails 設定:

# config/initializers/rperf.rb
require "rperf/rack"
require "rperf/viewer"
require "rperf/sidekiq"

Rperf.start(defer: true, mode: :wall, frequency: 99)

# ビューアとリクエストラベリング
Rails.application.config.middleware.use Rperf::Viewer
Rails.application.config.middleware.use Rperf::RackMiddleware

# Sidekiq ジョブにラベル付け
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add Rperf::SidekiqMiddleware
  end
end

# 60分ごとにスナップショットを取得
Thread.new do
  loop do
    sleep 60 * 60
    Rperf::Viewer.instance&.take_snapshot!
  end
end

/rperf/ にアクセスし、tagroot でエンドポイントやジョブクラスごとにフレームグラフをグループ化して確認できます。

View on GitHub

詳細解説

このパートでは、プロファイリング結果の読み方と活用方法を説明し、rperf の内部動作を解説します。

8. アーキテクチャ概要

この章では、rperf のハイレベルなアーキテクチャと、プロファイリングを支えるコアデータ構造を説明します。これらの内部構造を理解することで、プロファイリング結果のエッジケースの解釈や設計上のトレードオフの評価に役立ちます。

8.1 システム図

rperf は C 拡張と Ruby ラッパーで構成されています:

mermaid source

graph TD
    subgraph "C 拡張 (ext/rperf/rperf.c)"
        Timer["タイマー<br/>(signal or pthread_cond_timedwait)"]
        PJ["rb_postponed_job_trigger"]
        Sample["サンプルコールバック<br/>(rperf_sample_job)"]
        ThreadHook["スレッドイベントフック<br/>(SUSPENDED/READY/RESUMED/EXITED)"]
        GCHook["GC イベントフック<br/>(ENTER/EXIT/START/END_MARK/END_SWEEP)"]
        BufA["サンプルバッファ A"]
        BufB["サンプルバッファ B"]
        Worker["ワーカースレッド<br/>(timer + aggregation)"]
        FrameTable["フレームテーブル<br/>(VALUE → frame_id)"]
        AggTable["集約テーブル<br/>(stack → weight)"]
    end

    subgraph "Ruby (lib/rperf.rb)"
        API["Rperf.start / Rperf.stop"]
        Encoders["エンコーダー<br/>(PProf / Collapsed / Text)"]
        StatOut["stat 出力"]
    end

    Timer -->|triggers| PJ
    PJ -->|at safepoint| Sample
    Sample -->|writes| BufA
    ThreadHook -->|writes| BufA
    GCHook -->|writes| BufA
    BufA -->|swap when full| BufB
    BufB -->|aggregated by| Worker
    Worker -->|inserts| FrameTable
    Worker -->|inserts| AggTable
    API -->|calls| Sample
    AggTable -->|read at stop| Encoders
    FrameTable -->|resolved at stop| Encoders
    Encoders --> StatOut

C 拡張はすべての性能重視の操作（タイマー管理、サンプル記録、集約）を処理します。Ruby レイヤーはユーザー向け API、出力エンコーディング、統計のフォーマットを提供します。

8.2 グローバルプロファイラ状態

rperf は単一のグローバルな rperf_profiler_t 構造体を使用します。一度に 1 つのプロファイリングセッションのみ有効です（単一セッション制限）。この構造体は以下を保持します:

タイマー設定（周波数、モード、シグナル番号）
ダブルバッファリングされたサンプルストレージ（2 つのサンプルバッファ + フレームプール、満杯時にスワップ）
フレームテーブル（VALUE → uint32 frame_id、フレーム参照の重複排除）
集約テーブル（ユニークスタック → 累積重み）
ワーカースレッドハンドル（統合タイマー + 集約スレッド）
スレッドごとのデータ用のスレッド固有キー
GC フェーズ追跡状態
サンプリングオーバーヘッドカウンター

8.3 スレッドごとのデータ

各スレッドには Ruby のスレッド固有データ API（rb_internal_thread_specific_set）で格納される rperf_thread_data_t 構造体があります。以下を追跡します:

prev_time_ns: 前回の時刻読み取り（重み計算用）
prev_wall_ns: 前回の wall time 読み取り
suspended_at_ns: スレッドがサスペンドされた wall タイムスタンプ
ready_at_ns: スレッドが準備完了になった wall タイムスタンプ
（バックトレースは SUSPENDED 時には保存せず、RESUMED 時に再取得します）
label_set_id: 現在のラベルセット ID（0 = ラベルなし）

スレッドデータは最初の検出時に遅延作成され、EXITED イベントまたはプロファイラ停止時に解放されます。

8.4 GC 安全性

フレーム VALUE はガベージコレクションから保護する必要があります。rperf はプロファイラ構造体を TypedData オブジェクトでラップし、3 つの領域をマークするカスタム dmark 関数を持ちます:

両方のフレームプール（アクティブバッファとスタンバイバッファ）
フレームテーブルキー（ユニークフレーム VALUE）

フレームテーブルキー配列は 4,096 エントリから始まり、満杯時に 2 倍に拡張されます。拡張時は新しい配列を確保し、既存データをコピーし、ポインタをアトミックにスワップします（memory_order_release）。古い配列は stop まで保持され、GC の mark フェーズが同時にそれを読み取っている場合の use-after-free を防ぎます。dmark 関数はキーポインタを memory_order_acquire でロードし、カウントも memory_order_acquire でロードして一貫したビューを保証します。

8.5 Fork 安全性とマルチプロセスプロファイリング

8.5.1 Fork 安全性（C レベル）

rperf は fork された子プロセスでプロファイリングを静かに停止する pthread_atfork 子ハンドラを登録します（fork 安全性）:

タイマー/シグナル状態をクリア
mutex/condvar を再初期化（fork 時に親のワーカースレッドがロックしていた可能性がある）
イベントフック（スレッドイベント、GC イベント）を削除
サンプルバッファ、フレームテーブル、集約テーブルを解放
GC 状態、統計、profile refcount をリセット

親プロセスは影響を受けずにプロファイリングを継続します。子プロセスは必要に応じて新しいプロファイリングセッションを開始できます。

8.5.2 マルチプロセスプロファイリング（Ruby レベル）

マルチプロセスプロファイリングが有効な場合（CLI のデフォルト）、rperf は Ruby の Process._fork フック（Ruby 3.1 以降）を使用して、fork された子プロセスで自動的にプロファイリングを再開します。流れは:

fork 前: _fork フックがセッションディレクトリを作成（初回 fork 時のみ）。ルートプロセスの出力先がセッションディレクトリにリダイレクトされます。
子プロセス内: pthread_atfork が C 状態をクリーンアップした後、_restart_in_child が新しいプロファイリングセッションを開始。出力先はセッションディレクトリ。%pid ラベルが設定されます。
子プロセス終了時: 継承された at_exit フックが Rperf.stop を呼び、子のプロファイルを .json.gz ファイルとしてセッションディレクトリに書き込みます。
ルートプロセス終了時: ルートが自身のプロファイルを書き込んだ後、セッションディレクトリ内のすべての .json.gz ファイルを集約して単一の出力（stat レポートまたはファイル）を生成。その後セッションディレクトリを削除します。

spawn された子プロセス（spawn、system）の場合、子は RUBYLIB（rperf の lib ディレクトリを指す）と RUBYOPT=-rrperf、および環境変数（RPERF_SESSION_DIR、RPERF_ROOT_PROCESS）を継承します。子の Ruby プロセスが rperf を読み込むと、オートスタートブロックがルートプロセスでないことを検出し、プロファイルをセッションディレクトリに直接書き込みます。

8.5.3 セッションディレクトリ

セッションディレクトリは $RPERF_TMPDIR、$XDG_RUNTIME_DIR、またはシステムの一時ディレクトリ（優先順）の下に、ユーザーごとのサブディレクトリ（rperf-$UID/、モード 0700）内に作成されます。各セッションは一意のディレクトリ名を持ちます: rperf-$PID-$RANDOM。

クラッシュしたプロセスの古いセッションディレクトリは自動的にクリーンアップされます: 新しい CLI セッション開始時に、ルート PID が生存していないセッションディレクトリを検出して削除します。

8.5.4 ラベルのマージ

複数プロセスのプロファイルを集約する際、ラベルセットは一貫性を保つために再マッピングされます。2 つの子プロセスが同じラベル値を使用している場合（例: 両方とも endpoint: "GET /users"）、マージ後の出力で同じラベルセット ID を共有します。%pid ラベルにより、マージ後も異なるプロセスのサンプルを区別できます。

8.6 既知の制限事項

8.6.1 Running EC レース

Ruby VM には、タイマースレッドからの rb_postponed_job_trigger が誤ったスレッドの実行コンテキストに割り込みフラグを設定してしまう既知のレース条件があります。これは、新しいスレッドのネイティブスレッドが GVL を取得する前に開始された場合に発生します。結果として、C のビジーウェイトを行っているスレッドのタイマーサンプルが欠落し、その CPU 時間が次の SUSPENDED イベントのスタックに漏れ出す可能性があります。

これは rperf のバグではなく Ruby VM のバグであり、すべての postponed job ベースのプロファイラに影響します。

ただし、rperf は Linux において SIGEV_THREAD_ID を使用してタイマーシグナルを専用のワーカースレッドにのみ配信することで、この問題を緩和しています。任意の Ruby スレッド上でシグナルハンドラが発火するのではなく、ワーカースレッド自身が rb_postponed_job_trigger を呼び出すため、古い running_ec を参照するレースウィンドウが大幅に狭くなります。nanosleep フォールバック（macOS および signal: false）も同様に、ワーカースレッドが rb_postponed_job_trigger の唯一の呼び出し元であるため、同じ効果があります。

8.6.2 単一セッション

グローバルプロファイラ状態のため、一度に 1 つのプロファイリングセッションのみ有効です。プロファイリング中に Rperf.start を呼び出すことはサポートされていません。

8.6.3 メソッドレベルの粒度

rperf はメソッドレベルでプロファイルし、行レベルではありません。フレームラベルは修飾名（例: Integer#times、MyClass#method_name）に rb_profile_frame_full_label を使用します。行番号は含まれません。

View on GitHub

9. サンプリング

この章では、rperf がサンプルを収集する方法を説明します。サンプリングをトリガーするタイマーメカニズム、サンプリングコールバック自体、そして GVL と GC のアクティビティをキャプチャするイベントフックについて解説します。

9.1 タイマーとワーカースレッド

rperf は、タイマーのトリガーと定期的なサンプル集約の両方を処理する単一のワーカースレッドを使用します。タイマーメカニズムはプラットフォームによって異なります。

9.1.1 Linux: シグナルベースのタイマー（デフォルト）

Linux では、rperf は timer_create と SIGEV_THREAD_ID を使用して、設定された周波数でワーカースレッドにのみリアルタイムシグナル（デフォルト: SIGRTMIN+8）を配信します。シグナルハンドラは rb_postponed_job_trigger を呼び出してサンプリングコールバックをスケジュールします。

SIGEV_THREAD_ID を使用することで、タイマーシグナルがワーカースレッドのみをターゲットにし、Ruby スレッドの nanosleep、read、その他のブロッキングシステムコール（例: rb_thread_call_without_gvl 内）を中断しないようにします。

sequenceDiagram participant Kernel as Linux カーネル participant Worker as ワーカースレッド participant VM as Ruby VM participant Job as Postponed Job Kernel->>Worker: SIGRTMIN+8 (1000Hz で 1ms ごと) Worker->>VM: rb_postponed_job_trigger() Note over VM: セーフポイントまで継続... VM->>Job: rperf_sample_job() Job->>Job: rb_profile_frames() + サンプル記録 Note over Worker: swap_ready が設定されると
サンプルも集約

mermaid source

sequenceDiagram
    participant Kernel as Linux カーネル
    participant Worker as ワーカースレッド
    participant VM as Ruby VM
    participant Job as Postponed Job

    Kernel->>Worker: SIGRTMIN+8 (1000Hz で 1ms ごと)
    Worker->>VM: rb_postponed_job_trigger()
    Note over VM: セーフポイントまで継続...
    VM->>Job: rperf_sample_job()
    Job->>Job: rb_profile_frames() + サンプル記録
    Note over Worker: swap_ready が設定されると<br/>サンプルも集約

このアプローチにより、正確なタイミング（1000Hz で中央値 ~1000us のインターバル）が実現されます。

9.1.2 フォールバック: pthread_cond_timedwait

macOS、または Linux で signal: false が設定されている場合、ワーカースレッドは絶対デッドラインを持つ pthread_cond_timedwait をタイマーとして使用します:

タイムアウト（デッドライン到達）: rb_postponed_job_trigger をトリガーし、デッドラインを進める
シグナル（swap_ready が設定）: スタンバイバッファを即座に集約

デッドラインベースのアプローチにより、集約に時間がかかる場合のドリフトを回避します。このモードはシグナルベースのアプローチと比較して ~100us のタイミング不精度があります。

9.2 サンプリングコールバック

postponed job が発火すると、rperf_sample_job は現在 GVL を保持しているスレッド上で実行されます。rb_thread_current() を使用してそのスレッドのみをサンプリングします。

これは意図的な設計判断です:

rb_profile_frames は現在のスレッドのスタックのみをキャプチャできる
Thread.list を反復する必要がない — GVL イベントフックと組み合わせることで、rperf はすべてのスレッドを広範にカバーします（ただし Ruby VM の既知のレースにより、サンプルが欠落する場合があります）

サンプリングコールバックの処理:

スレッドごとのデータ（rperf_thread_data_t）を取得または作成
現在のクロックを読み取り（CPU モードは CLOCK_THREAD_CPUTIME_ID、wall モードは CLOCK_MONOTONIC）
重みを time_now - prev_time として計算
rb_profile_frames でフレームプールに直接バックトレースをキャプチャ
サンプルを記録（フレーム開始インデックス、深さ、重み、vm_state）
prev_time を更新

vm_state フィールドは通常のタイマーサンプルでは NORMAL、GVL/GC イベントでは対応する状態（GVL_BLOCKED、GVL_WAIT、GC_MARK、GC_SWEEP）が設定されます。この vm_state は停止時に Ruby 側で merge_vm_state_labels! によって %GVL/%GC ラベルに変換されます。

9.3 GVL イベント追跡（wall モード）

wall モードでは、rperf は Ruby のスレッドイベント API にフックして GVL の遷移を追跡します。これにより、サンプリングだけでは見逃す時間（GVL 外の時間）をキャプチャします。

stateDiagram-v2 [*] --> Running: スレッドが GVL を保持 Running --> Suspended: SUSPENDED イベント
(サンプル記録 + タイムスタンプを保存) Suspended --> Ready: READY イベント
(wall タイムスタンプを記録) Ready --> Running: RESUMED イベント
(vm_state=GVL_BLOCKED/GVL_WAIT サンプルを記録) Running --> [*]: EXITED イベント
(スレッドデータをクリーンアップ)

mermaid source

stateDiagram-v2
    [*] --> Running: スレッドが GVL を保持
    Running --> Suspended: SUSPENDED イベント<br/>(サンプル記録 + タイムスタンプを保存)
    Suspended --> Ready: READY イベント<br/>(wall タイムスタンプを記録)
    Ready --> Running: RESUMED イベント<br/>(vm_state=GVL_BLOCKED/GVL_WAIT サンプルを記録)
    Running --> [*]: EXITED イベント<br/>(スレッドデータをクリーンアップ)

9.3.1 SUSPENDED

スレッドが GVL を解放するとき（例: I/O の前）:

現在のバックトレースをフレームプールにキャプチャ
通常のサンプルを記録（前回のサンプルからの時間）
wall タイムスタンプを保存（バックトレースは RESUMED 時に再取得します）

9.3.2 READY

スレッドが実行可能になったとき（例: I/O 完了）:

wall タイムスタンプを記録（GVL 不要 — シンプルな C 操作のみ）

9.3.3 RESUMED

スレッドが GVL を再取得したとき:

現在のバックトレースをフレームプールにキャプチャ（ダブルバッファスワップ後の不整合を避けるため、SUSPENDED 時ではなくここで取得します。GVL 外にいる間 Ruby スタックは変化しません）
vm_state=GVL_BLOCKED のサンプルを記録: 重み = ready_at - suspended_at（GVL 外の時間）
vm_state=GVL_WAIT のサンプルを記録: 重み = resumed_at - ready_at（GVL 競合時間）

これにより、スレッドが GVL 外にある間はタイマーベースのサンプリングが不可能であるにもかかわらず、GVL 外の時間と GVL 競合がそれらをトリガーしたコードに正確に帰属されます。停止時に vm_state は %GVL=blocked/%GVL=wait ラベルに変換されます。

9.4 GC フェーズ追跡

rperf は Ruby の内部 GC イベントにフックしてガベージコレクション時間を追跡します:

イベント	アクション
`GC_START`	フェーズを marking に設定
`GC_END_MARK`	フェーズを sweeping に設定
`GC_END_SWEEP`	フェーズをクリア
`GC_ENTER`	wall タイムスタンプ、スレッド情報をキャプチャ
`GC_EXIT`	バックトレースをキャプチャし、`vm_state=GC_MARK` または `vm_state=GC_SWEEP` のサンプルを記録

GC サンプルはプロファイリングモードに関係なく常に wall time を使用します。GC 時間はアプリケーションのレイテンシに影響する実際の経過時間だからです。停止時に vm_state は %GC=mark/%GC=sweep ラベルに変換されます。

9.5 遅延文字列解決

サンプリング中、rperf はフレームプールに生のフレーム VALUE（Ruby 内部のオブジェクト参照）を格納します。文字列ではありません。この遅延文字列解決により、ホットパスがアロケーションフリーかつ高速に保たれます。

文字列解決は停止時に行われます:

Rperf.stop が _c_stop を呼び出す
フレームテーブルが各ユニークフレーム VALUE を rb_profile_frame_full_label と rb_profile_frame_path で [path, label] 文字列ペアにマッピング
解決された文字列が Ruby エンコーダーに渡される

これにより、サンプリングは事前確保されたバッファに整数（VALUE ポインタとタイムスタンプ）のみを書き込みます。Ruby オブジェクトは作成されず、プロファイリング中に GC プレッシャーが追加されません。

View on GitHub

10. 集約

この章では、rperf が生サンプルをコンパクトで重複排除されたデータ構造に処理する方法を説明します。集約により、長時間のプロファイリングセッション中のメモリ使用量が抑制されます。

10.1 概要

デフォルト（aggregate: true）では、rperf はバックグラウンドで定期的にサンプルを集約します。生サンプル（それぞれスタックトレース、重み、メタデータを含む）は、フレームテーブルと集約テーブルの 2 つのハッシュテーブルにマージされます。同一スタックは重みを合計してまとめられます。

10.2 ダブルバッファリング

2 つのサンプルバッファが役割を交互に切り替え、サンプリングと集約の同時実行を可能にします:

アクティブバッファがサンプリングコールバックからの新しいサンプルを受け取る
アクティブバッファが 10,000 サンプルに達するとバッファがスワップ
スタンバイバッファがワーカースレッドによってバックグラウンドで処理される

graph LR subgraph "アクティブ" A["サンプルバッファ A
(サンプル受信中)"] end subgraph "スタンバイ" B["サンプルバッファ B
(集約中)"] end A -->|"10K サンプルで
スワップ"| B B -->|"集約後に
スワップバック"| A

mermaid source

graph LR
    subgraph "アクティブ"
        A["サンプルバッファ A<br/>(サンプル受信中)"]
    end
    subgraph "スタンバイ"
        B["サンプルバッファ B<br/>(集約中)"]
    end
    A -->|"10K サンプルで<br/>スワップ"| B
    B -->|"集約後に<br/>スワップバック"| A

ワーカースレッドがスタンバイバッファの処理を完了していない場合、スワップはスキップされ、アクティブバッファは増大を続けます（無制限モードへのフォールバック）。

各バッファは独自のフレームプールを持ちます。アクティブバッファのフレームプールは rb_profile_frames からの生の VALUE 参照を受け取ります。バッファがスワップされると、スタンバイバッファのフレームプールが処理され、再利用のためにクリアされます。

10.3 フレームテーブル

フレームテーブル（VALUE → uint32_t frame_id）はフレーム参照を重複排除します。各ユニークフレーム VALUE に小さな整数 ID が割り当てられます。

キー: 生のフレーム VALUE（集約データの唯一の GC マーク対象）
値: 集約テーブルで使用されるコンパクトな uint32_t ID
初期容量: 4,096 エントリ、満杯時に 2 倍に拡張
拡張は GC dmark 安全性のためにアトミックポインタスワップを使用（GC 安全性を参照）

完全な VALUE の代わりに uint32_t フレーム ID を使用することで、集約テーブルでのスタック格納に必要なメモリが半減します。

10.4 集約テーブル

集約テーブルは同一スタックの重みを合計してマージします:

キー: (frame_ids[], thread_seq, label_set_id, vm_state) — スタック（フレーム ID として）、スレッド連番、ラベルセット、VM 状態
値: ナノ秒単位の累積重み

フレーム ID は別のスタックプールに連続して格納されます。各集約テーブルエントリはこのプールを指します（開始インデックス + 深さ）。

vm_state フィールドがキーの一部であるため、同じスタックでも GVL/GC の状態が異なるサンプルは別々に保持されます。停止時に vm_state は Ruby 側で %GVL/%GC ラベルに変換され、label_sets にマージされます。

label_set_id もキーの一部であるため、同じスタックでも異なるユーザーラベルを持つサンプルは別々に保持されます。

10.5 メモリ使用量

バッファ	初期サイズ	要素サイズ	初期メモリ
サンプルバッファ (×2)	16,384	32B	512KB × 2
フレームプール (×2)	131,072	8B (VALUE)	1MB × 2
フレームテーブルキー	4,096	8B (VALUE)	32KB
フレームテーブルバケット	8,192	4B (uint32)	32KB
集約テーブルバケット	2,048	28B	56KB
スタックプール	4,096	4B (uint32)	16KB

合計: aggregate: true で約 3.6MB、aggregate: false（単一バッファのみ）で約 1.5MB。フレームテーブルと集約テーブルは必要に応じて動的に拡張されます。

集約が有効な場合、メモリ使用量はプロファイリング時間に関係なくほぼ一定です。重要なのは収集されたサンプルの総数ではなく、ユニークスタックの数のみです。

View on GitHub

11. 出力エンコーディング

この章では、rperf が内部の集約データを出力形式に変換する方法を説明します。エンコーディングは停止時に一度だけ実行されるため、パフォーマンスクリティカルではありません。

11.1 集約データから出力へ

Rperf.stop が呼び出されると、C 拡張は集約データを Ruby に返します:

フレームテーブル: フレーム ID でインデックスされた [path, label] 文字列ペアの配列。文字列は rb_profile_frame_full_label と rb_profile_frame_path を通じて生の VALUE から解決されます。
集約テーブル: [frame_ids, weight, thread_seq, label_set_id, vm_state] エントリの配列。
ラベルセット: ラベルキーと値のマッピングを持つ frozen Hash の配列。

Ruby の Rperf.stop は merge_vm_state_labels! を呼び出して、各サンプルの vm_state を %GVL/%GC ラベルに変換し、既存の label_sets にマージします。例えば、vm_state=GVL_BLOCKED は %GVL: "blocked" ラベルになります。これにより、VM 状態がユーザーラベル（endpoint など）と同じ仕組みで pprof のサンプルラベルに書き込まれます。

Ruby エンコーダーがこれらの配列を消費して最終出力を生成します。

11.2 JSON 形式（デフォルト）

rperf のネイティブ出力形式は JSON です。デフォルトの拡張子は .json.gz（gzip 圧縮）で、.json はプレーンテキスト JSON を出力します。どちらも Rperf.load で自動検出されます。JSON ファイルには内部データがすべて保持されます: フレームスタック付きの集約サンプル、重み、スレッド連番、ラベルセット ID、完全なラベルセット配列。プロファイリングメタデータ（モード、周波数、期間、サンプル数）も含まれます。

JSON は rperf ビューア（rperf report でブラウザで直接開ける）や Ruby でのプログラム的な分析に推奨される形式です。pprof と異なり、外部ツールは不要です。jq やテキストエディタで直接確認したい場合は .json を使用してください。

11.3 pprof エンコーダー

rperf は pprof^{[Gang, 2010]} protobuf 形式を protobuf gem に依存せず、完全に Ruby でエンコードします。Rperf::PProf.encode のエンコーダー:

文字列テーブルを構築（インデックス 0 は常に空文字列）
文字列フレームをインデックスフレームに変換し、同一スタックをマージ
ロケーションテーブルと関数テーブルを構築
Profile protobuf メッセージをフィールドごとにエンコード
結果を gzip で圧縮

この手書きのエンコーダーはシンプル（約 100 行）で、rperf が必要とする pprof protobuf スキーマのサブセットのみを処理します。

11.3.1 埋め込みメタデータ

各 pprof プロファイルにはコメントとしてメタデータが含まれます: rperf バージョン、プロファイリングモード、周波数、Ruby バージョン。time_nanos と duration_nanos フィールドには、プロファイルの収集時期と期間が記録されます。

11.3.2 スレッド連番ラベル

すべてのサンプルには thread_seq 数値ラベルが付きます。これは rperf が各スレッドを初めて検出したときに割り当てられる 1 始まりのスレッド連番です。これにより、pprof ツールでスレッドごとにフレームグラフをグループ化できます。

11.4 サンプルラベル

Rperf.label はサンプルのコンテキストごとのアノテーションを可能にします。ホットパスを最小限に保つため、実装は Ruby と C に分割されています:

Ruby 側: ラベルセットを frozen Hash オブジェクトとして配列（@label_set_table）で管理します。重複排除インデックス（@label_set_index）が各ユニーク Hash を整数 ID にマッピングします。Rperf.label(key: value) は新しいラベルを現在のセットとマージし、結果をインターンし、整数 ID を C に渡します。
C 側: 各 rperf_thread_data_t は label_set_id（整数）を格納します。サンプルが記録されるとき、現在のスレッドの label_set_id がサンプルにコピーされます。単一の整数であり、ホットパスにアロケーションオーバーヘッドがゼロです。集約テーブルはハッシュキーに label_set_id を含むため、同一スタックで異なるラベルのサンプルは別エントリとして維持されます。
エンコーディング: エンコード時に、エンコーダーが label_sets 配列を読みます。JSON エンコーダーはラベルをそのまま保持し、PProf エンコーダーは各ラベルを pprof の Sample.Label（キーバリュー文字列ペア）として書き込みます。rperf ビューアの tagfocus、tagroot、tagleaf、tagignore コントロールで、どちらの形式でもラベルによるフィルタリングとグループ化が可能です。

11.5 Collapsed stacks エンコーダー

collapsed stacks エンコーダーはユニークスタックごとに 1 行を生成します。フレームがセミコロンで結合され（ボトムからトップ）、スペースとナノ秒単位の重みが続きます。この形式は FlameGraph ツールや speedscope で消費されます。

ラベルセットが存在する場合、collapsed 出力にラベルは含まれません。ラベル対応の分析には pprof 形式を使用してください。

11.6 テキストレポートエンコーダー

テキストレポートエンコーダーは以下を含む人間が読めるサマリーを生成します:

ヘッダー: プロファイルされた合計時間、サンプル数、周波数
フラットテーブル: セルフタイム（関数がリーフ/最深部フレームだった時間）でソートされた上位 50 関数
キュムレイティブテーブル: トータルタイム（関数がスタックのどこかに出現した時間）でソートされた上位 50 関数

この形式は外部ツールを必要とせず、クイック分析、イシューレポート、AI による分析に適しています。

View on GitHub

A. クイックリファレンス

A.1 CLI チートシート

# 性能の概要を表示
rperf stat ruby my_app.rb

# プロファイルテーブル付きの性能概要
rperf stat --report ruby my_app.rb

# 完全な性能レポート（stat --report と同等）
rperf exec ruby my_app.rb

# デフォルトファイルに記録（rperf.json.gz、json 形式、cpu モード）
rperf record ruby my_app.rb

# オプション付きで記録
rperf record -m wall -f 500 -o profile.pb.gz ruby my_app.rb

# テキストプロファイルを stdout に出力して記録
rperf record -p ruby my_app.rb

# テキスト形式で記録
rperf record -o profile.txt ruby my_app.rb

# collapsed stacks で記録
rperf record -o profile.collapsed ruby my_app.rb

# プリフォーキングサーバーの全ワーカーをプロファイル
rperf stat -m wall bundle exec unicorn

# 子プロセストラッキングを無効化
rperf stat --no-inherit ruby my_app.rb

# rperf ビューアでプロファイルを表示
rperf report

# 上位の関数を出力
rperf report --top profile.pb.gz

# 2 つのプロファイルを比較
rperf diff before.pb.gz after.pb.gz

# 完全なドキュメント
rperf help

A.2 Ruby API チートシート

require "rperf"

# ブロック形式
data = Rperf.start(output: "profile.pb.gz", mode: :cpu) do
  # プロファイルしたいコード
end

# 手動形式
Rperf.start(frequency: 1000, mode: :wall)
# ...
data = Rperf.stop

# ファイルに保存
Rperf.save("profile.pb.gz", data)
Rperf.save("profile.collapsed", data)
Rperf.save("profile.txt", data)

# スナップショット（停止せずにデータを取得）
snap = Rperf.snapshot
Rperf.save("snap.pb.gz", snap)
snap = Rperf.snapshot(clear: true)  # スナップショット後にリセット

# 遅延開始 + 対象を絞ったプロファイリング
Rperf.start(defer: true, mode: :wall)
Rperf.profile(endpoint: "/users") do
  # このブロックのみサンプリングされる
end
data = Rperf.stop

# ラベル（サンプルにコンテキストを付与）
Rperf.label(request: "abc") do
  # 内部のサンプルに request="abc" ラベルが付く
end
Rperf.labels       # 現在のラベルを取得

# Rack ミドルウェア（エンドポイントでリクエストにラベル付け）
require "rperf/rack"
use Rperf::RackMiddleware                    # Rails: config.middleware.use Rperf::RackMiddleware
use Rperf::RackMiddleware, label_key: :route # カスタムラベルキー

# Active Job（クラス名でジョブにラベル付け）
require "rperf/active_job"
class ApplicationJob < ActiveJob::Base
  include Rperf::ActiveJobMiddleware
end

# Sidekiq（ワーカークラス名でジョブにラベル付け）
require "rperf/sidekiq"
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add Rperf::SidekiqMiddleware
  end
end

A.3 環境変数

これらは CLI がオートスタートプロファイラを設定するために内部的に使用します:

変数	値	説明
`RPERF_ENABLED`	`1`	require 時にオートスタート
`RPERF_OUTPUT`	path	出力ファイルパス
`RPERF_FREQUENCY`	integer	サンプリング周波数 (Hz)
`RPERF_MODE`	`cpu`, `wall`	プロファイリングモード
`RPERF_FORMAT`	`json`, `pprof`, `collapsed`, `text`	出力形式
`RPERF_VERBOSE`	`1`	統計を stderr に出力
`RPERF_STAT`	`1`	stat モード出力を有効化
`RPERF_STAT_REPORT`	`1`	stat 出力にプロファイルテーブルを含める
`RPERF_STAT_COMMAND`	string	stat 出力ヘッダーに表示するコマンド文字列
`RPERF_AGGREGATE`	`0`	サンプル集約を無効化（生サンプルを返す）
`RPERF_SIGNAL`	integer または `false`	タイマーシグナル (Linux のみ): シグナル番号、または `false` で nanosleep スレッド
`RPERF_DEFER`	`1`	タイマーを停止状態で開始; `Rperf.profile` で有効化
`RPERF_ROOT_PROCESS`	PID 文字列	マルチプロセスセッションのルートプロセス PID (CLI が設定)
`RPERF_SESSION_DIR`	パス	マルチプロセスプロファイル収集用セッションディレクトリ (CLI が設定)
`RPERF_TMPDIR`	パス	セッションディレクトリのベースディレクトリを上書き (デフォルト: `$XDG_RUNTIME_DIR` またはシステム tmpdir)

A.4 プロファイリングモードの比較

項目	cpu	wall
クロック	`CLOCK_THREAD_CPUTIME_ID`	`CLOCK_MONOTONIC`
I/O 時間	計測されない	`%GVL=blocked` ラベル
Sleep 時間	計測されない	`%GVL=blocked` ラベル
GVL 競合	計測されない	`%GVL=wait` ラベル
GC 時間	`%GC=mark`, `%GC=sweep` ラベル	`%GC=mark`, `%GC=sweep` ラベル
適したケース	CPU ホットスポット	レイテンシ分析

A.5 出力形式の比較

拡張子	形式	必要なツール
`.json.gz` (デフォルト)	JSON (rperf 独自形式)	なし (`rperf report`)
`.pb.gz`	pprof protobuf	Go (`rperf report`)
`.collapsed`	Collapsed stacks	flamegraph.pl or speedscope
`.txt`	テキストレポート	なし

A.6 VM 状態ラベル

ラベル	モード	意味
`%GVL=blocked`	wall	スレッドが GVL 外（I/O, sleep, C 拡張）
`%GVL=wait`	wall	スレッドが GVL 待ち（競合）
`%GC=mark`	両方	GC marking フェーズ（wall time）
`%GC=sweep`	両方	GC sweeping フェーズ（wall time）
`%pid`	PID 文字列	両方	子プロセスのプロセス ID（マルチプロセスモードのみ）

これらはサンプルのラベルとして label_sets に格納されます。pprof で -tagfocus=%GVL=blocked、-tagfocus=%pid=1234 のようにフィルタリングできます。

View on GitHub

参考文献

[Aman, 2014] Aman Gupta. stackprof: A sampling call-stack profiler for Ruby 2.2+. 2014.
[Arnaldo, 2010] Arnaldo Carvalho de Melo. The New Linux perf Tools. 2010.
[Brendan, 2016] Brendan Gregg. "The Flame Graph". Communications of the ACM, 59(6), pp. 48--57. 2016. DOI
[Gang, 2010] Gang Ren and Eric Tune and Tipp Moseley and Yixin Shi and Silvius Rus and Robert Hundt. "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers". IEEE Micro, 30(4), pp. 65--79. 2010. DOI
[Susan, 1982] Susan L. Graham and Peter B. Kessler and Marshall K. McKusick. "gprof: A Call Graph Execution Profiler". In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, pp. 120--126. 1982. DOI
[Todd, 2010] Todd Mytkowicz and Amer Diwan and Matthias Hauswirth and Peter F. Sweeney. "Evaluating the Accuracy of Java Profilers". In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '10), pp. 187--197. 2010. DOI

索引

%

%pid label: 3. CLI の使い方

C

collapsed stacks: 1. イントロダクション; 1. イントロダクション; 6. 出力形式
cpu mode: 3. CLI の使い方
cumulative time: 5. 結果の解釈

D

deferred string resolution: 9. サンプリング

F

Framework Integration: 4. Ruby API
flat time: 5. 結果の解釈
fork safety: 8. アーキテクチャ概要

G

GVL: 1. イントロダクション; 5. 結果の解釈

L

label set: 8. アーキテクチャ概要

M

multi-process profiling: 8. アーキテクチャ概要

P

postponed job: 1. イントロダクション; 1. イントロダクション
pprof: 1. イントロダクション; 1. イントロダクション; 4. Ruby API; 6. 出力形式
preforking server: 3. CLI の使い方

R

Rperf.label: 4. Ruby API; 4. Ruby API; 6. 出力形式
Rperf.labels: 4. Ruby API
Rperf.profile: 4. Ruby API; 4. Ruby API; 4. Ruby API; 7. フレームワーク統合; 7. フレームワーク統合
Rperf.save: 4. Ruby API
Rperf.snapshot: 4. Ruby API
Rperf.start: 4. Ruby API; 7. フレームワーク統合
Rperf.stop: 4. Ruby API
Rperf::Viewer: 1. イントロダクション; 6. 出力形式
rperf: 1. イントロダクション
rperf diff: 3. CLI の使い方
rperf exec: 3. CLI の使い方
rperf record: 1. イントロダクション; 3. CLI の使い方
rperf report: 1. イントロダクション; 3. CLI の使い方
rperf stat: 1. イントロダクション; 3. CLI の使い方

S

safepoint: 1. イントロダクション; 1. イントロダクション
safepoint bias: 1. イントロダクション; 1. イントロダクション
sampling: 1. イントロダクション
session directory: 8. アーキテクチャ概要
single session: 8. アーキテクチャ概要

W

wall mode: 3. CLI の使い方; 3. CLI の使い方; 3. CLI の使い方
weight: 1. イントロダクション

rperf guide

rperf ガイド

rperf guide

Contents

Getting Started

1. Introduction

1.1 What is rperf?

1.1.1 Advantages

1.1.2 Limitations

1.2 Why another profiler?

1.2.1 The safepoint bias problem

1.2.2 Other advantages

1.3 Requirements

1.4 Quick start

2. Installation

2.1 Installing the gem

2.2 Verifying the installation

2.3 Platform support

2.4 Optional: Go toolchain

How to Use

3. CLI Usage

3.1 rperf stat

3.1.1 Example: CPU-bound program

3.1.2 Example: Mixed CPU and I/O

3.1.3 stat options

3.2 rperf exec

3.2.1 exec options

3.3 rperf record

3.3.1 Example: Recording a profile

3.3.2 Choosing a profiling mode

3.3.3 Choosing an output format

3.3.4 Example: Text output

3.3.5 Example: Wall mode text output

3.3.6 Verbose output

3.3.7 record options

3.4 rperf report

3.4.1 Example: Top and text output

3.4.2 report options

3.5 rperf diff

3.5.1 Workflow example

3.6 Multi-process profiling

3.6.1 How it works

3.6.2 Disabling child tracking

3.6.3 Example: Fork with distinct work

3.6.4 Limitations

3.7 rperf help

4. Ruby API

4.1 Basic usage

4.1.1 Block form (recommended)

4.1.2 Example: Profiling a Fibonacci function

4.1.3 Manual start/stop

4.2 Rperf.start parameters

4.3 Rperf.stop return value

4.4 Rperf.save

4.5 Rperf.snapshot

4.6 Sample labels

4.6.1 Block form

4.6.2 Without block

4.6.3 Merging and deleting

4.6.4 Nested blocks

4.6.5 Filtering by label

4.6.6 Reading labels

4.7 Deferred start and Rperf.profile

4.7.1 Why defer?

4.7.2 Rperf.profile

4.7.3 Nesting

4.7.4 Combining with Rperf.label

4.7.5 Without defer

4.7.6 Error handling

4.8 Practical examples

4.8.1 Profiling a web request handler

4.8.2 Comparing CPU and wall profiles

4.8.3 Processing samples

4.8.4 Generating collapsed stacks for FlameGraph

5. Interpreting Results

5.1 Flat vs. cumulative time

5.2 Weight unit

5.3 VM state labels

5.3.1 %GVL: blocked

5.3.2 %GVL: wait