sperf guide
A safepoint-based sampling performance profiler for Ruby.
sperf profiles your Ruby programs by sampling at safepoints and using actual time deltas as weights to correct safepoint bias. It provides a perf-like CLI, a Ruby API, and outputs standard pprof format.
Key features:
- CPU and wall-clock profiling modes
- Time-weighted sampling that corrects safepoint bias
- GVL contention and GC phase tracking
- pprof, collapsed stacks, and text output formats
- Low overhead (< 0.2% at default 1000 Hz)
- Requires Ruby >= 3.4.0 on POSIX systems (Linux, macOS)
Getting Started
This part covers what sperf is, why it exists, and how to install it on your system.
1. Introduction
1.1 What is sperf?
sperf is a safepoint-based sampling performance profiler for Ruby. It helps you find where your Ruby program spends its time — whether that’s CPU computation, I/O waits, GVL contention, or garbage collection.
Unlike traditional sampling profilers that count samples uniformly, sperf uses actual time deltas (in nanoseconds) as weights for each sample. This corrects the safepoint bias problem inherent in postponed job-based sampling, producing more accurate results.
sperf is inspired by Linux perf[Arnaldo, 2010], providing a familiar CLI interface with subcommands like record, stat, and report.
1.1.1 Advantages
- Accurate profiling: Time-delta weighting corrects safepoint bias, producing results closer to real time distribution than traditional count-based profilers.
- GVL / GC visibility: In wall mode, tracks off-GVL blocking, GVL contention, and GC marking/sweeping as distinct frames — no separate tool needed.
- Standard output formats: Outputs pprof protobuf (compatible with go tool pprof), collapsed stacks (for flame graphs / speedscope), and human-readable text.
- Low overhead: Default 1000 Hz sampling adds < 0.2% overhead, suitable for production.
- Simple CLI: sperf stat for a quick overview, sperf record + sperf report for detailed analysis.
1.1.2 Limitations
- Ruby 3.4+ only: Requires APIs introduced in Ruby 3.4.
- POSIX only: Linux and macOS. Windows is not supported.
- Method-level granularity: No line-number resolution — profiles show method names only.
- Single session: Only one profiling session can be active at a time (global state in the C extension).
- Safepoint latency: Samples are still deferred to safepoints. The time-delta weighting corrects the bias but cannot recover the exact interrupted instruction pointer.
1.2 Why another profiler?
Ruby already has profiling tools like stackprof[Aman, 2014]. So why sperf?
1.2.1 The safepoint bias problem
Most Ruby sampling profilers collect backtraces by calling rb_profile_frames directly in the signal handler. This approach yields backtraces at the actual signal timing, but relies on undocumented internal VM state — rb_profile_frames is not guaranteed to be async-signal-safe, and the results can be unreliable if the VM is in the middle of updating its internal structures.
sperf takes a different approach: it uses the postponed job mechanism (rb_postponed_job), which is the Ruby VM’s official API for safely deferring work from signal handlers. Backtrace collection is deferred to the next safepoint — a point where the VM is in a consistent state and rb_profile_frames can return reliable results. The trade-off is that when a timer fires between safepoints, the actual sample is delayed until the next safepoint.
If each sample were counted equally (weight = 1), a sample delayed by a long-running C method would get the same weight as one taken immediately. This is the safepoint bias[Todd, 2010] problem: functions that happen to be running when the thread reaches a safepoint appear more often than they should, while functions between safepoints are under-represented.
mermaid source
sequenceDiagram
participant Timer
participant VM
participant Profiler
Timer->>VM: Signal (t=0ms)
Note over VM: Running C method...<br/>cannot stop yet
VM->>Profiler: Safepoint reached (t=3ms)
Note over Profiler: Traditional: weight = 1<br/>sperf: weight = 3ms
Timer->>VM: Signal (t=1ms)
VM->>Profiler: Safepoint reached (t=1ms)
Note over Profiler: Traditional: weight = 1<br/>sperf: weight = 1ms
sperf solves this by recording clock_now - clock_prev as the weight of each sample. A sample delayed by 3ms gets 3x the weight of a 1ms sample, accurately reflecting where time was actually spent.
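The effect of this weighting is easy to see in a toy aggregation. The sketch below is purely illustrative — the method names and delays are invented, but the two aggregation rules match the traditional and sperf approaches described above:

```ruby
# Hypothetical samples as [leaf_method, time_delta_ns] pairs.
# One sample was delayed 3ms by a long C call before reaching a safepoint;
# two arrived on time at the 1ms tick.
samples = [
  ["slow_c_method", 3_000_000],
  ["fast_method",   1_000_000],
  ["fast_method",   1_000_000],
]

counts  = Hash.new(0)  # traditional: weight = 1 per sample
weights = Hash.new(0)  # sperf: weight = time delta in nanoseconds

samples.each do |method, delta_ns|
  counts[method]  += 1
  weights[method] += delta_ns
end

# Count-based profiling says fast_method dominates (2 of 3 samples),
# but time-weighted profiling correctly attributes 60% to slow_c_method.
counts   # => {"slow_c_method"=>1, "fast_method"=>2}
weights  # => {"slow_c_method"=>3000000, "fast_method"=>2000000}
```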
1.2.2 Other advantages
- GVL and GC awareness: In wall mode, sperf tracks time spent blocked off the GVL, waiting to reacquire the GVL, and in GC marking/sweeping phases — each as distinct synthetic frames.
- perf-like CLI: The sperf stat command gives you a quick performance overview (like perf stat), while sperf record + sperf report gives you detailed profiling.
- Standard output: sperf outputs pprof protobuf format, compatible with Go’s pprof tool ecosystem, as well as collapsed stacks for flame graphs[Brendan, 2016] and speedscope.
- Low overhead: Default 1000 Hz sampling adds < 0.2% overhead, suitable for production use.
1.3 Requirements
- Ruby >= 3.4.0
- POSIX system (Linux or macOS)
- Go (optional, for the sperf report and sperf diff subcommands)
1.4 Quick start
Profile a Ruby script and view the results:
# Install sperf
gem install sperf
# Quick performance overview
sperf stat ruby my_app.rb
# Record a profile and view it interactively
sperf record ruby my_app.rb
sperf report
Or use sperf from Ruby code:
require "sperf"
Sperf.start(output: "profile.pb.gz") do
# code to profile
end
2. Installation
2.1 Installing the gem
sperf is distributed as a Ruby gem with a C extension:
gem install sperf
Or add it to your Gemfile:
gem "sperf"
Then run:
bundle install
2.2 Verifying the installation
Check that sperf is installed correctly:
sperf --help
You should see:
Usage: sperf record [options] command [args...]
sperf stat [options] command [args...]
sperf report [options] [file]
sperf diff [options] base.pb.gz target.pb.gz
sperf help
Run 'sperf help' for full documentation
2.3 Platform support
sperf supports POSIX systems:
| Platform | Timer implementation | Notes |
|---|---|---|
| Linux | timer_create + signal (default) | Best precision (~1000us at 1000Hz) |
| Linux | nanosleep thread (with signal: false) | Fallback, ~100us drift/tick |
| macOS | nanosleep thread | Signal-based timer not available |
On Linux, sperf uses timer_create with SIGEV_SIGNAL and a sigaction handler by default. This provides precise interval timing with no extra thread. The signal number defaults to SIGRTMIN+8 and can be changed via the signal: option.
On macOS (and when signal: false is set on Linux), sperf falls back to a dedicated pthread with a nanosleep loop.
2.4 Optional: Go toolchain
The sperf report and sperf diff subcommands are thin wrappers around go tool pprof. If you want to use these commands, install Go from go.dev.
Without Go, you can still use all other sperf features. You can view pprof files with other tools like speedscope or generate text/collapsed output directly.
How to Use
This part explains how to profile Ruby programs with sperf — from the command line and as a library — and describes the available output formats.
3. CLI Usage
sperf provides a perf-like command-line interface with four main subcommands: record, stat, report, and diff.
3.1 sperf stat
sperf stat is the quickest way to get a performance overview. It runs your command with wall-mode profiling and prints a summary to stderr.
sperf stat ruby my_app.rb
3.1.1 Example: CPU-bound program
Here is a simple Fibonacci computation:
# fib.rb
def fib(n)
return n if n <= 1
fib(n - 1) + fib(n - 2)
end
fib(35)
Running sperf stat:
sperf stat ruby fib.rb
Performance stats for 'ruby fib.rb':
661.9 ms user
24.1 ms sys
602.8 ms real
590.1 ms 100.0% CPU execution
9.0 ms [Ruby] GC time (8 count: 5 minor, 3 major)
99,774 [Ruby] allocated objects
49,270 [Ruby] freed objects
18 MB [OS] peak memory (maxrss)
20 [OS] context switches (4 voluntary, 16 involuntary)
0 MB [OS] disk I/O (0 MB read, 0 MB write)
Top 1 by flat:
590.1 ms 100.0% Object#fib (fib.rb)
590 samples (16 unique stacks), 0.2% profiler overhead
The output tells you:
- user/sys/real: Standard timing (like the time command)
- Time breakdown: Where wall time was spent — CPU execution, GVL blocked (I/O/sleep), GVL wait (contention), GC marking, GC sweeping
- Ruby stats: GC count, allocated/freed objects, YJIT ratio (if enabled)
- OS stats: Peak memory, context switches, disk I/O
- Top functions: Hottest functions by flat time
3.1.2 Example: Mixed CPU and I/O
# mixed.rb
def cpu_work(n)
sum = 0
n.times { |i| sum += i * i }
sum
end
def io_work
sleep(0.05)
end
5.times do
cpu_work(500_000)
io_work
end
Running sperf stat:
sperf stat ruby mixed.rb
Because stat always uses wall mode, you can see how time is divided between CPU and I/O. The [Ruby] GVL blocked line shows time spent sleeping/in I/O, while CPU execution shows compute time.
3.1.3 stat options
sperf stat [options] command [args...]
| Option | Description |
|---|---|
| -o PATH | Also save profile to file (default: none) |
| -f HZ | Sampling frequency in Hz (default: 1000) |
| -v | Print additional sampling statistics |
3.2 sperf record
sperf record profiles a command and saves the result to a file. This is the primary way to capture profiles for detailed analysis.
sperf record ruby my_app.rb
By default, it saves to sperf.data in pprof format with CPU mode.
3.2.1 Example: Recording a profile
sperf record ruby fib.rb
This creates sperf.data. You can then analyze it with sperf report or other pprof-compatible tools.
3.2.2 Choosing a profiling mode
sperf supports two profiling modes:
- cpu (default): Measures per-thread CPU time. Best for finding functions that consume CPU cycles. Ignores time spent sleeping, in I/O, or waiting for the GVL.
- wall: Measures wall-clock time. Best for finding where wall time goes, including I/O, sleep, and GVL contention.
# CPU mode (default)
sperf record ruby my_app.rb
# Wall mode
sperf record -m wall ruby my_app.rb
3.2.3 Choosing an output format
sperf auto-detects the format from the file extension:
# pprof format (default)
sperf record -o profile.pb.gz ruby my_app.rb
# Collapsed stacks (for FlameGraph / speedscope)
sperf record -o profile.collapsed ruby my_app.rb
# Human-readable text
sperf record -o profile.txt ruby my_app.rb
You can also force the format explicitly:
sperf record --format text -o profile.dat ruby my_app.rb
3.2.4 Example: Text output
sperf record -o profile.txt ruby fib.rb
The text output looks like this:
Total: 509.5ms (cpu)
Samples: 509, Frequency: 1000Hz
Flat:
509.5ms 100.0% Object#fib (fib.rb)
Cumulative:
509.5ms 100.0% Object#fib (fib.rb)
509.5ms 100.0% <main> (fib.rb)
3.2.5 Example: Wall mode text output
sperf record -m wall -o wall_profile.txt ruby mixed.rb
Total: 311.8ms (wall)
Samples: 80, Frequency: 1000Hz
Flat:
250.6ms 80.4% [GVL blocked] (<GVL>)
44.1ms 14.1% Object#cpu_work (mixed.rb)
13.9ms 4.5% Integer#times (<internal:numeric>)
3.2ms 1.0% Kernel#sleep (<C method>)
0.0ms 0.0% [GVL wait] (<GVL>)
Cumulative:
311.8ms 100.0% Integer#times (<internal:numeric>)
311.8ms 100.0% block in <main> (mixed.rb)
311.8ms 100.0% <main> (mixed.rb)
253.8ms 81.4% Kernel#sleep (<C method>)
253.8ms 81.4% Object#io_work (mixed.rb)
250.6ms 80.4% [GVL blocked] (<GVL>)
58.0ms 18.6% Object#cpu_work (mixed.rb)
0.0ms 0.0% [GVL wait] (<GVL>)
In wall mode, [GVL blocked] appears as the dominant cost — this is the sleep time in io_work. The CPU time for cpu_work is clearly separated.
3.2.6 Verbose output
The -v flag prints sampling statistics to stderr during profiling:
sperf record -v ruby my_app.rb
[sperf] mode=cpu frequency=1000Hz
[sperf] sampling: 98 calls, 0.11ms total, 1.1us/call avg
[sperf] samples recorded: 904
[sperf] top 10 by flat:
[sperf] 53.4ms 50.1% Object#cpu_work (-e)
[sperf] 17.0ms 15.9% Integer#times (<internal:numeric>)
...
3.2.7 record options
sperf record [options] command [args...]
| Option | Description |
|---|---|
| -o PATH | Output file (default: sperf.data) |
| -f HZ | Sampling frequency in Hz (default: 1000) |
| -m MODE | cpu or wall (default: cpu) |
| --format FMT | pprof, collapsed, or text (default: auto from extension) |
| -v | Print sampling statistics to stderr |
3.3 sperf report
sperf report opens a pprof profile for analysis. It wraps go tool pprof and requires Go to be installed.
# Open interactive web UI (default)
sperf report
# Open a specific file
sperf report profile.pb.gz
# Print top functions
sperf report --top
# Print text report
sperf report --text
3.3.1 Example: Top and text output
Using the fib.rb profile recorded earlier:
sperf report --top sperf.data
Type: cpu
Showing nodes accounting for 577.31ms, 100% of 577.31ms total
flat flat% sum% cum cum%
577.31ms 100% 100% 577.31ms 100% Object#fib
0 0% 100% 577.31ms 100% <main>
The default behavior (without --top or --text) opens an interactive web UI in your browser with flame graphs, top function views, and call graph visualizations powered by pprof[Gang, 2010].
3.3.2 report options
| Option | Description |
|---|---|
| --top | Print top functions by flat time |
| --text | Print text report |
| (default) | Open interactive web UI in browser |
3.4 sperf diff
sperf diff compares two pprof profiles, showing the difference (target - base). This is useful for measuring the impact of optimizations.
# Open diff in browser
sperf diff before.pb.gz after.pb.gz
# Print top functions by diff
sperf diff --top before.pb.gz after.pb.gz
# Print text diff
sperf diff --text before.pb.gz after.pb.gz
3.4.1 Workflow example
# Profile the baseline
sperf record -o before.pb.gz ruby my_app.rb
# Make your optimization changes...
# Profile again
sperf record -o after.pb.gz ruby my_app.rb
# Compare
sperf diff before.pb.gz after.pb.gz
3.5 sperf help
sperf help prints the full reference documentation, including profiling modes, output formats, synthetic frames, and diagnostic tips.
sperf help
This outputs detailed documentation suitable for both human reading and AI-assisted analysis.
4. Ruby API
sperf provides a Ruby API for programmatic profiling. This is useful when you want to profile specific sections of code, integrate profiling into test suites, or build custom profiling workflows.
4.1 Basic usage
4.1.1 Block form (recommended)
The simplest way to use sperf is with the block form of Sperf.start. It profiles the block and returns the profiling data:
require "sperf"
data = Sperf.start(output: "profile.pb.gz", frequency: 1000, mode: :cpu) do
# code to profile
end
When output: is specified, the profile is automatically written to the file when the block finishes. The method also returns the raw data hash for further processing.
4.1.2 Example: Profiling a Fibonacci function
require "sperf"
def fib(n)
return n if n <= 1
fib(n - 1) + fib(n - 2)
end
data = Sperf.start(frequency: 1000, mode: :cpu) do
fib(33)
end
Sperf.save("profile.txt", data)
Running this produces:
Total: 192.7ms (cpu)
Samples: 192, Frequency: 1000Hz
Flat:
192.7ms 100.0% Object#fib (example.rb)
Cumulative:
192.7ms 100.0% Object#fib (example.rb)
192.7ms 100.0% block in <main> (example.rb)
192.7ms 100.0% Sperf.start (lib/sperf.rb)
192.7ms 100.0% <main> (example.rb)
4.1.3 Manual start/stop
For cases where block form is awkward, you can manually start and stop profiling:
require "sperf"
Sperf.start(frequency: 1000, mode: :wall)
# ... code to profile ...
data = Sperf.stop
Sperf.stop returns the data hash, or nil if the profiler was not running.
4.2 Sperf.start parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| frequency: | Integer | 1000 | Sampling frequency in Hz |
| mode: | Symbol | :cpu | :cpu or :wall |
| output: | String | nil | File path to write on stop |
| verbose: | Boolean | false | Print statistics to stderr |
| format: | Symbol | nil | :pprof, :collapsed, :text, or nil (auto-detect from output extension) |
| signal: | Integer/Boolean | nil | Linux only: nil = timer signal (default), false = nanosleep thread, positive integer = specific RT signal number |
4.3 Sperf.stop return value
Sperf.stop returns nil if the profiler was not running. Otherwise it returns a Hash:
{
mode: :cpu, # or :wall
frequency: 1000,
sampling_count: 1234, # number of timer callbacks
sampling_time_ns: 56789, # total time spent sampling (overhead)
samples: [ # Array of [frames, weight]
[frames, weight], # frames: [[path, label], ...] deepest-first
... # weight: Integer (nanoseconds)
]
}
Each sample has:
- frames: An array of [path, label] pairs, ordered deepest-first (leaf frame at index 0)
- weight: Time in nanoseconds attributed to this sample
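The sampling_count and sampling_time_ns counters also let you compute the average cost per timer callback, which is a quick sanity check on profiler overhead. The values below are invented, matching the shape of the hash above:

```ruby
# Hypothetical counters from a Sperf.stop data hash (values invented).
data = { sampling_count: 1234, sampling_time_ns: 56_789 }

# Average microseconds spent in the sampling callback per tick.
avg_us = data[:sampling_time_ns] / data[:sampling_count].to_f / 1_000.0
avg_us.round(3)  # => 0.046
```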
4.4 Sperf.save
Sperf.save writes profiling data to a file in any supported format:
Sperf.save("profile.pb.gz", data) # pprof format
Sperf.save("profile.collapsed", data) # collapsed stacks
Sperf.save("profile.txt", data) # text report
The format is auto-detected from the file extension. You can override it with the format: keyword:
Sperf.save("output.dat", data, format: :text)
4.5 Practical examples
4.5.1 Profiling a web request handler
require "sperf"
class ApplicationController
def profile_action
data = Sperf.start(mode: :wall) do
# Simulate a typical request
users = User.where(active: true).limit(100)
result = users.map { |u| serialize_user(u) }
render json: result
end
Sperf.save("request_profile.txt", data)
end
end
Using wall mode here captures not just CPU time but also database I/O and any GVL contention.
4.5.2 Comparing CPU and wall profiles
require "sperf"
def workload
# Mix of CPU and I/O
100.times do
compute_something
sleep(0.001)
end
end
# CPU profile: shows where CPU cycles go
cpu_data = Sperf.start(mode: :cpu) { workload }
Sperf.save("cpu.txt", cpu_data)
# Wall profile: shows where wall time goes
wall_data = Sperf.start(mode: :wall) { workload }
Sperf.save("wall.txt", wall_data)
The CPU profile will focus on compute_something, while the wall profile will show the sleep calls as [GVL blocked] time.
4.5.3 Processing raw samples
You can work with the raw sample data programmatically:
require "sperf"
data = Sperf.start(mode: :cpu) { workload }
# Find the hottest function
flat = Hash.new(0)
data[:samples].each do |frames, weight|
leaf_label = frames.first&.last # frames[0] is the leaf
flat[leaf_label] += weight
end
top = flat.sort_by { |_, w| -w }.first(5)
top.each do |label, weight_ns|
puts "#{label}: #{weight_ns / 1_000_000.0}ms"
end
4.5.4 Generating collapsed stacks for FlameGraph
require "sperf"
data = Sperf.start(mode: :cpu) { workload }
Sperf.save("profile.collapsed", data)
The collapsed format is one line per unique stack, compatible with Brendan Gregg’s FlameGraph[Brendan, 2016] tools and speedscope:
frame1;frame2;...;leaf weight_ns
You can generate a flame graph SVG:
flamegraph.pl profile.collapsed > flamegraph.svg
Or open the .collapsed file directly in speedscope.
5. Output Formats
sperf supports three output formats. The format is auto-detected from the file extension, or can be set explicitly with the --format flag (CLI) or format: parameter (API).
5.1 pprof (default)
The pprof format is a gzip-compressed Protocol Buffers binary. This is the standard format used by Go’s pprof tooling and is the default output of sperf.
Extension convention: .pb.gz
How to view:
# Interactive web UI (requires Go)
sperf report profile.pb.gz
# Top functions
sperf report --top profile.pb.gz
# Text report
sperf report --text profile.pb.gz
# Or use go tool pprof directly
go tool pprof -http=:8080 profile.pb.gz
go tool pprof -top profile.pb.gz
go tool pprof -flame profile.pb.gz
You can also import pprof files into speedscope via its web interface.
Advantages: Standard format supported by a wide ecosystem of tools. Supports diff comparison between two profiles. Interactive exploration with flame graphs, call graphs, and source annotations.
5.2 Collapsed stacks
The collapsed stacks format is a plain text format with one line per unique stack trace. Each line contains a semicolon-separated stack (bottom-to-top) followed by a space and the weight in nanoseconds.
Extension convention: .collapsed
Format:
bottom_frame;...;top_frame weight_ns
Example output:
<main>;Integer#times;block in <main>;Object#cpu_work;Integer#times;Object#cpu_work 53419170
<main>;Integer#times;block in <main>;Object#cpu_work;Integer#times 16962309
<main>;Integer#times;block in <main>;Object#io_work;Kernel#sleep 2335151
How to use:
# Generate a FlameGraph SVG
sperf record -o profile.collapsed ruby my_app.rb
flamegraph.pl profile.collapsed > flamegraph.svg
# Open in speedscope (drag-and-drop the file)
open https://www.speedscope.app/
Advantages: Simple text format, easy to process with command-line tools. Compatible with Brendan Gregg’s FlameGraph[Brendan, 2016] tools and speedscope.
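Going the other direction — from the [frames, weight] sample shape described in Chapter 4 to collapsed lines — is equally short. A minimal sketch with invented sample data:

```ruby
# frames are deepest-first in sperf's data, so reverse for bottom-to-top order,
# then merge duplicate stacks by summing their weights.
samples = [
  [[["fib.rb", "Object#fib"], ["fib.rb", "<main>"]], 3_000_000],
  [[["fib.rb", "Object#fib"], ["fib.rb", "<main>"]], 2_000_000],
]

lines = samples
  .group_by { |frames, _| frames.reverse.map(&:last).join(";") }
  .map { |stack, group| "#{stack} #{group.sum { |_, w| w }}" }

lines  # => ["<main>;Object#fib 5000000"]
```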
5.2.1 Parsing collapsed stacks programmatically
File.readlines("profile.collapsed").each do |line|
stack, weight = line.rpartition(" ").then { |s, _, w| [s, w.to_i] }
frames = stack.split(";")
# frames[0] is bottom (main), frames[-1] is leaf (hot)
puts "#{frames.last}: #{weight / 1_000_000.0}ms"
end
5.3 Text report
The text format is a human-readable (and AI-readable) report showing flat and cumulative top-N tables.
Extension convention: .txt
Example output:
Total: 509.5ms (cpu)
Samples: 509, Frequency: 1000Hz
Flat:
509.5ms 100.0% Object#fib (fib.rb)
Cumulative:
509.5ms 100.0% Object#fib (fib.rb)
509.5ms 100.0% <main> (fib.rb)
Sections:
- Header: Total profiled time, sample count, and frequency
- Flat table: Functions sorted by self time (the function was the leaf/deepest frame)
- Cumulative table: Functions sorted by total time (the function appeared anywhere in the stack)
Advantages: No tooling required — readable with cat. Top 50 entries per table by default. Good for quick analysis, sharing in issue reports, or feeding to AI assistants.
5.4 Format comparison
| Feature | pprof | collapsed | text |
|---|---|---|---|
| File size | Small (binary + gzip) | Medium (text) | Small (text) |
| Flame graph | Yes (via pprof web UI) | Yes (via flamegraph.pl) | No |
| Call graph | Yes | No | No |
| Diff comparison | Yes (sperf diff) | No | No |
| No tools needed | No (requires Go) | No (requires flamegraph.pl) | Yes |
| Programmatic parsing | Complex (protobuf) | Simple | Simple |
| AI-friendly | No | Yes | Yes |
5.5 Auto-detection rules
| File extension | Format |
|---|---|
| .collapsed | Collapsed stacks |
| .txt | Text report |
| Anything else | pprof |
The default output file (sperf.data) uses pprof format.
Deep Dive
This part explores how to read and act on profiling results, and explains how sperf works under the hood.
6. Interpreting Results
6.1 Flat vs. cumulative time
Understanding the difference between flat and cumulative time is essential for effective profiling — a distinction that goes back to classic profiling tools such as gprof[Susan, 1982].
- Flat time: Time attributed directly to a function — it was the leaf (deepest frame) in the sample. High flat time means the function itself is doing expensive work.
- Cumulative time: Time for all samples where the function appears anywhere in the stack. High cumulative time means the function (or something it calls) is expensive.
mermaid source
graph TD
A["main()"] -->|calls| B["process()"]
B -->|calls| C["compute()"]
B -->|calls| D["format()"]
style C fill:#ff6b6b,color:#fff
style D fill:#ffa07a,color:#fff
If compute() has high flat time, optimize compute() itself. If process() has high cumulative but low flat time, the cost is in its children (compute() and format()).
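Both views can be computed directly from sperf's raw samples. Here is a sketch mirroring the call tree above (sample data invented; weights in nanoseconds):

```ruby
# Each sample: [frames (deepest-first), weight_ns].
samples = [
  [[["app.rb", "compute"], ["app.rb", "process"], ["app.rb", "main"]], 8_000_000],
  [[["app.rb", "format"],  ["app.rb", "process"], ["app.rb", "main"]], 2_000_000],
]

flat = Hash.new(0)
cum  = Hash.new(0)

samples.each do |frames, weight|
  flat[frames.first.last] += weight                               # leaf frame only
  frames.map(&:last).uniq.each { |label| cum[label] += weight }   # anywhere in stack
end

flat["compute"]  # => 8000000   (high flat: optimize compute itself)
cum["process"]   # => 10000000  (high cumulative, zero flat: cost is in children)
```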
6.2 Weight unit
All weights in sperf are in nanoseconds, regardless of profiling mode:
- 1,000 ns = 1 us (microsecond)
- 1,000,000 ns = 1 ms (millisecond)
- 1,000,000,000 ns = 1 s (second)
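When post-processing raw samples, a small formatting helper keeps these conversions in one place. This is illustrative only — format_ns is not part of the sperf API:

```ruby
# Format a nanosecond weight with the largest sensible unit.
def format_ns(ns)
  if    ns >= 1_000_000_000 then "#{ns / 1_000_000_000.0}s"
  elsif ns >= 1_000_000     then "#{ns / 1_000_000.0}ms"
  elsif ns >= 1_000         then "#{ns / 1_000.0}us"
  else                           "#{ns}ns"
  end
end

format_ns(3_000_000)  # => "3.0ms"
format_ns(1_500)      # => "1.5us"
```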
6.3 Synthetic frames
In addition to normal Ruby frames, sperf records synthetic frames that represent non-CPU activity. These always appear as the leaf (deepest) frame in a sample.
6.3.1 [GVL blocked]
Mode: wall only
Time the thread spent off the GVL — during I/O operations, sleep, or C extensions that release the GVL. This time is attributed to the stack captured at the SUSPENDED event (when the thread released the GVL).
High [GVL blocked] time indicates your program is I/O bound. Look at the cumulative view to find which functions are triggering the I/O.
6.3.2 [GVL wait]
Mode: wall only
Time the thread spent waiting to reacquire the GVL after becoming ready. This indicates GVL contention — another thread is holding the GVL while this thread wants to run.
High [GVL wait] time means your threads are serialized on the GVL. Consider reducing GVL-holding work, using Ractors, or moving work to child processes.
6.3.3 [GC marking]
Mode: cpu and wall
Time spent in the GC marking phase. Always measured in wall time. Attributed to the stack that triggered GC.
High [GC marking] time means too many live objects. Reduce the number of long-lived allocations.
6.3.4 [GC sweeping]
Mode: cpu and wall
Time spent in the GC sweeping phase. Always measured in wall time. Attributed to the stack that triggered GC.
High [GC sweeping] time means too many short-lived objects. Consider reusing objects or using object pools.
6.4 Diagnosing common problems
6.4.1 High CPU usage
Mode: cpu
Look for functions with high flat CPU time. These are the functions consuming CPU cycles.
Action: Optimize the hot function (better algorithm, caching) or call it less frequently.
6.4.2 Slow requests / high latency
Mode: wall
Look for functions with high cumulative wall time.
- If [GVL blocked] is dominant: I/O or sleep is the bottleneck. Check database queries, HTTP calls, file I/O.
- If [GVL wait] is dominant: GVL contention. Reduce GVL-holding work or move to Ractors/child processes.
6.4.3 GC pressure
Mode: cpu or wall
Look for [GC marking] and [GC sweeping] samples.
- High [GC marking]: Too many live objects. Reduce allocations of long-lived objects.
- High [GC sweeping]: Too many short-lived objects. Reuse or pool objects.
The sperf stat output also shows GC counts and allocated/freed object counts, which can help diagnose allocation-heavy code.
6.4.4 Multithreaded app slower than expected
Mode: wall
Look for [GVL wait] time across threads.
sperf stat ruby threaded_app.rb
Example output for a GVL-contended workload:
Performance stats for 'ruby threaded_app.rb':
89.9 ms user
14.0 ms sys
41.2 ms real
0.5 ms 1.2% CPU execution
9.8 ms 23.8% [Ruby] GVL blocked (I/O, sleep)
30.8 ms 75.0% [Ruby] GVL wait (contention)
Here, 75% of wall time is GVL contention. The four threads are fighting for the GVL, serializing their CPU work.
6.5 Tips for effective profiling
- Default frequency (1000 Hz) works well for most cases. For long-running production profiling, 100-500 Hz reduces overhead further.
- Profile representative workloads, not micro-benchmarks. Profiling a real request or batch job gives more actionable results than profiling a tight loop.
- Compare cpu and wall profiles to distinguish CPU-bound from I/O-bound code. If a function is hot in wall mode but not in CPU mode, it’s blocked on I/O.
- Use sperf diff to measure the impact of optimizations. Profile before and after, then compare.
- Use the verbose flag (-v) to check profiler overhead and see top functions immediately.
7. Internals
This chapter describes how sperf works under the hood. Understanding the internals helps you interpret edge cases in profiling results and appreciate the design trade-offs.
7.1 Architecture overview
sperf consists of a C extension and a Ruby wrapper:
mermaid source
graph TD
subgraph "C Extension (ext/sperf/sperf.c)"
Timer["Timer<br/>(signal or nanosleep)"]
PJ["rb_postponed_job_trigger"]
Sample["Sample callback<br/>(sperf_sample_job)"]
ThreadHook["Thread event hook<br/>(SUSPENDED/READY/RESUMED/EXITED)"]
GCHook["GC event hook<br/>(ENTER/EXIT/START/END_MARK/END_SWEEP)"]
SampleBuf["Sample buffer"]
FramePool["Frame pool"]
end
subgraph "Ruby (lib/sperf.rb)"
API["Sperf.start / Sperf.stop"]
Encoders["Encoders<br/>(PProf / Collapsed / Text)"]
StatOut["stat output"]
end
Timer -->|triggers| PJ
PJ -->|at safepoint| Sample
Sample -->|writes| SampleBuf
Sample -->|writes| FramePool
ThreadHook -->|writes| SampleBuf
GCHook -->|writes| SampleBuf
API -->|calls| Sample
SampleBuf -->|read at stop| Encoders
FramePool -->|resolved at stop| Encoders
Encoders --> StatOut
7.2 Global profiler state
sperf uses a single global sperf_profiler_t struct. Only one profiling session can be active at a time (single session limitation). The struct holds:
- Timer configuration (frequency, mode, signal number)
- Sample buffer (dynamically growing array)
- Frame pool (dynamically growing VALUE array)
- Thread-specific key for per-thread data
- GC phase tracking state
- Sampling overhead counters
7.3 Timer implementation
sperf uses a timer to periodically trigger sampling. The timer implementation depends on the platform:
7.3.1 Linux: Signal-based timer (default)
On Linux, sperf uses timer_create with SIGEV_SIGNAL to deliver a real-time signal (default: SIGRTMIN+8) at the configured frequency. A sigaction handler calls rb_postponed_job_trigger to schedule the sampling callback.
mermaid source
sequenceDiagram
participant Kernel as Linux Kernel
participant Signal as Signal Handler
participant VM as Ruby VM
participant Job as Postponed Job
Kernel->>Signal: SIGRTMIN+8 (every 1ms at 1000Hz)
Signal->>VM: rb_postponed_job_trigger()
Note over VM: Continues until safepoint...
VM->>Job: sperf_sample_job()
Job->>Job: rb_profile_frames() + record sample
This approach provides precise timing (~1000us median interval at 1000Hz) with no extra thread.
7.3.2 Fallback: nanosleep thread
On macOS, or when signal: false is set on Linux, sperf creates a dedicated pthread that loops with nanosleep:
while (prof->running) {
rb_postponed_job_trigger(prof->pj_handle);
nanosleep(&interval, NULL);
}
This is simpler but has ~100us drift per tick due to nanosleep’s imprecision.
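Back-of-envelope, that drift lowers the effective sampling rate. Assuming the ~100us-per-tick figure from the platform table, at a requested 1000 Hz:

```ruby
interval_ns  = 1_000_000   # requested 1ms tick (1000 Hz)
drift_ns     = 100_000     # ~100us nanosleep overshoot per tick (doc's estimate)

effective_hz = 1_000_000_000.0 / (interval_ns + drift_ns)
effective_hz.round  # => 909
```

The time-delta weighting compensates for this: each sample still carries the actual elapsed time, so totals stay accurate even when ticks arrive late.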
7.4 Sampling: current-thread-only
When the postponed job fires, sperf_sample_job runs on whatever thread currently holds the GVL. It only samples that thread using rb_thread_current().
This is a deliberate design choice:
- rb_profile_frames can only capture the current thread’s stack
- There’s no need to iterate Thread.list — combined with GVL event hooks, sperf gets complete coverage of all threads
The sampling callback:
- Gets or creates per-thread data (sperf_thread_data_t)
- Reads the current clock (CLOCK_THREAD_CPUTIME_ID for CPU mode, CLOCK_MONOTONIC for wall mode)
- Computes the weight as time_now - prev_time
- Captures the backtrace with rb_profile_frames directly into the frame pool
- Records the sample (frame start index, depth, weight, type)
- Updates prev_time
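The weight computation at the heart of this callback can be modeled in a few lines. This is a conceptual Ruby sketch of the C logic, not the actual implementation:

```ruby
# Per-thread state keeps the clock value observed at the previous sample.
def take_sample(thread_data, now_ns)
  weight = now_ns - thread_data[:prev_time]
  thread_data[:prev_time] = now_ns
  weight
end

td = { prev_time: 0 }
take_sample(td, 1_000_000)  # => 1000000  (on-time tick: 1ms weight)
take_sample(td, 4_000_000)  # => 3000000  (delayed safepoint: 3ms weight)
```

Because the weight is the measured delta rather than a constant, a delayed safepoint automatically absorbs the time that elapsed while the VM could not stop.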
7.5 GVL event tracking (wall mode)
In wall mode, sperf hooks into Ruby’s thread event API to track GVL transitions. This captures time that sampling alone would miss — time spent off the GVL.
mermaid source
stateDiagram-v2
[*] --> Running: Thread holds GVL
Running --> Suspended: SUSPENDED event<br/>(capture backtrace + timestamp)
Suspended --> Ready: READY event<br/>(record wall timestamp)
Ready --> Running: RESUMED event<br/>(record GVL blocked + wait samples)
Running --> [*]: EXITED event<br/>(cleanup thread data)
7.5.1 SUSPENDED
When a thread releases the GVL (e.g., before I/O):
- Capture the current backtrace into the frame pool
- Record a normal sample (time since last sample)
- Save the backtrace and wall timestamp for later use
7.5.2 READY
When a thread becomes ready to run (e.g., I/O completed):
- Record the wall timestamp (no GVL needed — only simple C operations)
7.5.3 RESUMED
When a thread reacquires the GVL:
- Record a `[GVL blocked]` sample: weight = `ready_at - suspended_at` (off-GVL time)
- Record a `[GVL wait]` sample: weight = `resumed_at - ready_at` (GVL contention time)
- Both samples reuse the backtrace captured at SUSPENDED
This way, off-GVL time and GVL contention are accurately attributed to the code that triggered them, even though no timer-based sampling can occur while the thread is off the GVL.
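The two weights recorded at RESUMED are simple timestamp differences. A sketch of the arithmetic (hypothetical Ruby mirroring the description above; the real logic lives in the C extension):

```ruby
# At RESUMED, split the time since SUSPENDED into two synthetic samples:
# off-GVL time (blocked) and time spent waiting to reacquire the GVL.
def gvl_samples(suspended_at:, ready_at:, resumed_at:)
  {
    "[GVL blocked]" => ready_at - suspended_at,  # e.g. I/O in flight
    "[GVL wait]"    => resumed_at - ready_at,    # runnable, GVL contended
  }
end

# 8 ms blocked (I/O), then 1.5 ms waiting for the GVL:
s = gvl_samples(suspended_at: 0, ready_at: 8_000_000, resumed_at: 9_500_000)
```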
7.6 GC phase tracking
sperf hooks into Ruby’s internal GC events to track garbage collection time:
| Event | Action |
|---|---|
| `GC_START` | Set phase to marking |
| `GC_END_MARK` | Set phase to sweeping |
| `GC_END_SWEEP` | Clear phase |
| `GC_ENTER` | Capture backtrace + wall timestamp |
| `GC_EXIT` | Record `[GC marking]` or `[GC sweeping]` sample |
GC samples always use wall time regardless of the profiling mode, because GC time is real elapsed time that affects application latency.
7.7 Deferred string resolution
During sampling, sperf stores raw frame VALUEs (Ruby internal object references) in the frame pool — not strings. This deferred string resolution keeps the hot path allocation-free and fast.
String resolution happens at stop time:
- `Sperf.stop` calls `_c_stop`, which returns raw data
- For each frame VALUE, `rb_profile_frame_full_label` and `rb_profile_frame_path` are called to get human-readable strings
- These strings are then passed to the Ruby encoders
This means sampling only writes integers (VALUE pointers and timestamps) to pre-allocated buffers. No Ruby objects are created, no GC pressure is added during profiling.
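The two-phase scheme can be illustrated with a small Ruby stand-in (a hypothetical `DeferredRecorder` class; the real hot path is C writing VALUEs into a pre-allocated buffer):

```ruby
# Hot path appends only integer IDs; name lookup is deferred to stop time.
class DeferredRecorder
  def initialize
    @frame_ids = []            # integer IDs only, no string resolution here
  end

  def record(frame_id)
    @frame_ids << frame_id     # the only work done while profiling
  end

  def resolve(id_to_name)      # called once, at stop
    @frame_ids.map { |id| id_to_name.fetch(id) }
  end
end

rec = DeferredRecorder.new
rec.record(101)
rec.record(102)
names = rec.resolve(101 => "MyClass#work", 102 => "Integer#times")
```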
7.8 Frame pool and GC safety
The frame pool is a contiguous array of VALUEs that stores raw frame references from rb_profile_frames. Since these are Ruby object references, they must be protected from garbage collection.
sperf wraps the profiler struct in a TypedData object with a custom dmark function that calls rb_gc_mark_locations on the entire frame pool. This tells the GC that all VALUEs in the pool are reachable and must not be collected.
The frame pool starts at ~1MB and doubles in size via realloc when needed. Similarly, the sample buffer starts with 1024 entries and grows by 2x.
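Doubling means a pool that reaches size n pays O(n) total copy cost, i.e. amortized O(1) per append. A sketch of the capacity arithmetic (illustrative numbers):

```ruby
# Double capacity until it covers the needed element count.
def capacity_after(initial, needed)
  cap = initial
  cap *= 2 while cap < needed
  cap
end

cap = capacity_after(1024, 5000)  # 1024 -> 2048 -> 4096 -> 8192
```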
7.9 Per-thread data
Each thread gets a sperf_thread_data_t struct stored via Ruby’s thread-specific data API (rb_internal_thread_specific_set). This tracks:
- `prev_cpu_ns`: Previous CPU time reading (for computing weight)
- `prev_wall_ns`: Previous wall time reading
- `suspended_at_ns`: Wall timestamp when the thread was suspended
- `ready_at_ns`: Wall timestamp when the thread became ready
- `suspended_frame_start` / `depth`: Saved backtrace from the SUSPENDED event
Thread data is created lazily on first encounter and freed on the EXITED event or at profiler stop.
7.10 Fork safety
sperf registers a pthread_atfork child handler that silently stops profiling in the forked child process (fork safety):
- Clears the timer/signal state
- Removes event hooks
- Frees sample and frame buffers
The parent process continues profiling unaffected. The child can start a fresh profiling session if needed.
7.11 pprof encoder
sperf encodes the pprof[Gang, 2010] protobuf format entirely in Ruby, with no protobuf gem dependency. The encoder in Sperf::PProf.encode:
- Builds a string table (index 0 is always the empty string)
- Converts string frames to index frames and merges identical stacks
- Builds location and function tables
- Encodes the Profile protobuf message field by field
This hand-written encoder is simple (~100 lines) and only runs once at stop time, so performance is not a concern.
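The string-table step follows the pprof convention that index 0 is always the empty string. A sketch of the interning logic (hypothetical Ruby, not the actual `Sperf::PProf` code):

```ruby
# Intern strings into a pprof-style table: index 0 is "", and duplicate
# strings map to their existing index.
class StringTable
  attr_reader :strings

  def initialize
    @strings = [""]            # pprof requires string_table[0] == ""
    @index = { "" => 0 }
  end

  def intern(str)
    @index[str] ||= (@strings << str; @strings.size - 1)
  end
end

tab = StringTable.new
a = tab.intern("Integer#times")
b = tab.intern("Integer#times")  # deduplicated: returns the same index
```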
7.12 Known limitations
7.12.1 Running EC race
There is a known race condition in the Ruby VM where rb_postponed_job_trigger from the timer thread may set the interrupt flag on the wrong thread’s execution context. This happens when a new thread’s native thread starts before acquiring the GVL. The result is that timer samples may miss threads doing C busy-wait, with their CPU time leaking into the next SUSPENDED event’s stack.
This is a Ruby VM bug, not a sperf bug, and affects all postponed-job-based profilers.
7.12.2 Single session
Only one profiling session can be active at a time due to the global profiler state. Calling Sperf.start while already profiling is not supported.
7.12.3 Method-level granularity
sperf profiles at the method level, not the line level. Frame labels use rb_profile_frame_full_label for qualified names (e.g., Integer#times, MyClass#method_name). Line numbers are not included.
A. Quick Reference
A.1 CLI cheat sheet
```sh
# Quick performance overview
sperf stat ruby my_app.rb

# Record to default file (sperf.data, pprof format, cpu mode)
sperf record ruby my_app.rb

# Record with options
sperf record -m wall -f 500 -o profile.pb.gz ruby my_app.rb

# Record to text format
sperf record -o profile.txt ruby my_app.rb

# Record to collapsed stacks
sperf record -o profile.collapsed ruby my_app.rb

# View profile in browser (requires Go)
sperf report

# Print top functions
sperf report --top profile.pb.gz

# Compare two profiles
sperf diff before.pb.gz after.pb.gz

# Full documentation
sperf help
```
A.2 Ruby API cheat sheet
```ruby
require "sperf"

# Block form
data = Sperf.start(output: "profile.pb.gz", mode: :cpu) do
  # code to profile
end

# Manual form
Sperf.start(frequency: 1000, mode: :wall)
# ...
data = Sperf.stop

# Save to file
Sperf.save("profile.pb.gz", data)
Sperf.save("profile.collapsed", data)
Sperf.save("profile.txt", data)
```
A.3 Environment variables
These are used internally by the CLI to configure the auto-started profiler:
| Variable | Values | Description |
|---|---|---|
| `SPERF_ENABLED` | `1` | Enable auto-start on require |
| `SPERF_OUTPUT` | path | Output file path |
| `SPERF_FREQUENCY` | integer | Sampling frequency in Hz |
| `SPERF_MODE` | `cpu`, `wall` | Profiling mode |
| `SPERF_FORMAT` | `pprof`, `collapsed`, `text` | Output format |
| `SPERF_VERBOSE` | `1` | Print statistics to stderr |
A.4 Profiling mode comparison
| Aspect | cpu | wall |
|---|---|---|
| Clock | `CLOCK_THREAD_CPUTIME_ID` | `CLOCK_MONOTONIC` |
| I/O time | Not measured | `[GVL blocked]` |
| Sleep time | Not measured | `[GVL blocked]` |
| GVL contention | Not measured | `[GVL wait]` |
| GC time | `[GC marking]`, `[GC sweeping]` | `[GC marking]`, `[GC sweeping]` |
| Best for | CPU hotspots | Latency analysis |
A.5 Output format comparison
| Extension | Format | Tooling required |
|---|---|---|
| `.pb.gz` (default) | pprof protobuf | Go (`sperf report`) |
| `.collapsed` | Collapsed stacks | flamegraph.pl or speedscope |
| `.txt` | Text report | None |
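For reference, the collapsed format is plain text: one line per unique stack, frames joined by `;`, then the weight. A minimal Ruby emitter sketch (illustrative, not sperf's encoder):

```ruby
# Emit collapsed stacks: "frame1;frame2;... weight" per unique stack.
def to_collapsed(totals)
  totals.map { |stack, weight| "#{stack.join(';')} #{weight}" }.join("\n")
end

totals = {
  %w[main App#run Integer#times] => 420,
  %w[main App#run] => 58,
}
out = to_collapsed(totals)
# => "main;App#run;Integer#times 420\nmain;App#run 58"
```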
A.6 Synthetic frames
| Frame | Mode | Meaning |
|---|---|---|
| `[GVL blocked]` | wall | Thread off-GVL (I/O, sleep, C ext) |
| `[GVL wait]` | wall | Thread waiting for GVL (contention) |
| `[GC marking]` | both | GC marking phase (wall time) |
| `[GC sweeping]` | both | GC sweeping phase (wall time) |
Bibliography
- [Aman, 2014] Aman Gupta. stackprof: A sampling call-stack profiler for Ruby 2.2+. 2014.
- [Arnaldo, 2010] Arnaldo Carvalho de Melo. The New Linux perf Tools. 2010.
- [Brendan, 2016] Brendan Gregg. "The Flame Graph". Communications of the ACM, 59(6), pp. 48--57. 2016. DOI
- [Gang, 2010] Gang Ren and Eric Tune and Tipp Moseley and Yixin Shi and Silvius Rus and Robert Hundt. "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers". IEEE Micro, 30(4), pp. 65--79. 2010. DOI
- [Susan, 1982] Susan L. Graham and Peter B. Kessler and Marshall K. McKusick. "gprof: A Call Graph Execution Profiler". In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, pp. 120--126. 1982. DOI
- [Todd, 2010] Todd Mytkowicz and Amer Diwan and Matthias Hauswirth and Peter F. Sweeney. "Evaluating the Accuracy of Java Profilers". In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '10), pp. 187--197. 2010. DOI
Index
C
- collapsed stacks
  - 1. Introduction
  - 5. Output Formats
- cpu mode
  - 3. CLI Usage
- cumulative time
  - 6. Interpreting Results
D
- deferred string resolution
  - 7. Internals
F
- flat time
  - 6. Interpreting Results
- fork safety
  - 7. Internals
- frame pool
  - 7. Internals
P
- postponed job
  - 1. Introduction
- pprof
  - 1. Introduction
  - 5. Output Formats
S
- Sperf.save
  - 4. Ruby API
- Sperf.start
  - 4. Ruby API
- Sperf.stop
  - 4. Ruby API
- safepoint
  - 1. Introduction
- safepoint bias
  - 1. Introduction
- sampling
  - 1. Introduction
- single session
  - 7. Internals
- sperf
  - 1. Introduction
- sperf diff
  - 3. CLI Usage
- sperf record
  - 1. Introduction
  - 3. CLI Usage
- sperf report
  - 1. Introduction
  - 3. CLI Usage
- sperf stat
  - 1. Introduction
  - 3. CLI Usage
- synthetic frames
  - 1. Introduction
  - 6. Interpreting Results
W
- wall mode
  - 3. CLI Usage
- weight
  - 1. Introduction
[
- [GC marking]
  - 6. Interpreting Results
- [GC sweeping]
  - 6. Interpreting Results
- [GVL blocked]
  - 6. Interpreting Results
- [GVL wait]
  - 6. Interpreting Results