sperf guide

sperf project

AI Generated
  • Getting Started
    • 1. Introduction
      • 1.1 What is sperf?
      • 1.1.1 Advantages
      • 1.1.2 Limitations
      • 1.2 Why another profiler?
      • 1.2.1 The safepoint bias problem
      • 1.2.2 Other advantages
      • 1.3 Requirements
      • 1.4 Quick start
    • 2. Installation
      • 2.1 Installing the gem
      • 2.2 Verifying the installation
      • 2.3 Platform support
      • 2.4 Optional: Go toolchain
  • How to Use
    • 3. CLI Usage
      • 3.1 sperf stat
      • 3.1.1 Example: CPU-bound program
      • 3.1.2 Example: Mixed CPU and I/O
      • 3.1.3 stat options
      • 3.2 sperf record
      • 3.2.1 Example: Recording a profile
      • 3.2.2 Choosing a profiling mode
      • 3.2.3 Choosing an output format
      • 3.2.4 Example: Text output
      • 3.2.5 Example: Wall mode text output
      • 3.2.6 Verbose output
      • 3.2.7 record options
      • 3.3 sperf report
      • 3.3.1 Example: Top and text output
      • 3.3.2 report options
      • 3.4 sperf diff
      • 3.4.1 Workflow example
      • 3.5 sperf help
    • 4. Ruby API
      • 4.1 Basic usage
      • 4.1.1 Block form (recommended)
      • 4.1.2 Example: Profiling a Fibonacci function
      • 4.1.3 Manual start/stop
      • 4.2 Sperf.start parameters
      • 4.3 Sperf.stop return value
      • 4.4 Sperf.save
      • 4.5 Practical examples
      • 4.5.1 Profiling a web request handler
      • 4.5.2 Comparing CPU and wall profiles
      • 4.5.3 Processing raw samples
      • 4.5.4 Generating collapsed stacks for FlameGraph
    • 5. Output Formats
      • 5.1 pprof (default)
      • 5.2 Collapsed stacks
      • 5.2.1 Parsing collapsed stacks programmatically
      • 5.3 Text report
      • 5.4 Format comparison
      • 5.5 Auto-detection rules
  • Deep Dive
    • 6. Interpreting Results
      • 6.1 Flat vs. cumulative time
      • 6.2 Weight unit
      • 6.3 Synthetic frames
      • 6.3.1 [GVL blocked]
      • 6.3.2 [GVL wait]
      • 6.3.3 [GC marking]
      • 6.3.4 [GC sweeping]
      • 6.4 Diagnosing common problems
      • 6.4.1 High CPU usage
      • 6.4.2 Slow requests / high latency
      • 6.4.3 GC pressure
      • 6.4.4 Multithreaded app slower than expected
      • 6.5 Tips for effective profiling
    • 7. Internals
      • 7.1 Architecture overview
      • 7.2 Global profiler state
      • 7.3 Timer implementation
      • 7.3.1 Linux: Signal-based timer (default)
      • 7.3.2 Fallback: nanosleep thread
      • 7.4 Sampling: current-thread-only
      • 7.5 GVL event tracking (wall mode)
      • 7.5.1 SUSPENDED
      • 7.5.2 READY
      • 7.5.3 RESUMED
      • 7.6 GC phase tracking
      • 7.7 Deferred string resolution
      • 7.8 Frame pool and GC safety
      • 7.9 Per-thread data
      • 7.10 Fork safety
      • 7.11 pprof encoder
      • 7.12 Known limitations
      • 7.12.1 Running EC race
      • 7.12.2 Single session
      • 7.12.3 Method-level granularity
  • Appendix
    • A. Quick Reference
      • A.1 CLI cheat sheet
      • A.2 Ruby API cheat sheet
      • A.3 Environment variables
      • A.4 Profiling mode comparison
      • A.5 Output format comparison
      • A.6 Synthetic frames
  • Bibliography
  • Index

sperf guide


A safepoint-based sampling performance profiler for Ruby.

sperf profiles your Ruby programs by sampling at safepoints and using actual time deltas as weights to correct safepoint bias. It provides a perf-like CLI, a Ruby API, and outputs standard pprof format.

Key features:

  • CPU and wall-clock profiling modes
  • Time-weighted sampling that corrects safepoint bias
  • GVL contention and GC phase tracking
  • pprof, collapsed stacks, and text output formats
  • Low overhead (< 0.2% at default 1000 Hz)
  • Requires Ruby >= 3.4.0 on POSIX systems (Linux, macOS)

Getting Started

This part covers what sperf is, why it exists, and how to install it on your system.


1. Introduction

1.1 What is sperf?

sperf is a safepoint-based sampling performance profiler for Ruby. It helps you find where your Ruby program spends its time — whether that’s CPU computation, I/O waits, GVL contention, or garbage collection.

Unlike traditional sampling profilers that count samples uniformly, sperf uses actual time deltas (in nanoseconds) as weights for each sample. This corrects the safepoint bias problem inherent in postponed job-based sampling, producing more accurate results.

sperf is inspired by Linux perf[Arnaldo, 2010], providing a familiar CLI interface with subcommands like record, stat, and report.

1.1.1 Advantages

  • Accurate profiling: Time-delta weighting corrects safepoint bias, producing results closer to real time distribution than traditional count-based profilers.
  • GVL / GC visibility: In wall mode, tracks off-GVL blocking, GVL contention, and GC marking/sweeping as distinct frames — no separate tool needed.
  • Standard output formats: Outputs pprof protobuf (compatible with go tool pprof), collapsed stacks (for flame graphs / speedscope), and human-readable text.
  • Low overhead: Default 1000 Hz sampling adds < 0.2% overhead, suitable for production.
  • Simple CLI: sperf stat for a quick overview, sperf record + sperf report for detailed analysis.

1.1.2 Limitations

  • Ruby 3.4+ only: Requires APIs introduced in Ruby 3.4.
  • POSIX only: Linux and macOS. Windows is not supported.
  • Method-level granularity: No line-number resolution — profiles show method names only.
  • Single session: Only one profiling session can be active at a time (global state in the C extension).
  • Safepoint latency: Samples are still deferred to safepoints. The time-delta weighting corrects the bias but cannot recover the exact interrupted instruction pointer.

1.2 Why another profiler?

Ruby already has profiling tools like stackprof[Aman, 2014]. So why sperf?

1.2.1 The safepoint bias problem

Most Ruby sampling profilers collect backtraces by calling rb_profile_frames directly in the signal handler. This approach yields backtraces at the actual signal timing, but relies on undocumented internal VM state — rb_profile_frames is not guaranteed to be async-signal-safe, and the results can be unreliable if the VM is in the middle of updating its internal structures.

sperf takes a different approach: it uses the postponed job mechanism (rb_postponed_job), which is the Ruby VM’s official API for safely deferring work from signal handlers. Backtrace collection is deferred to the next safepoint — a point where the VM is in a consistent state and rb_profile_frames can return reliable results. The trade-off is that when a timer fires between safepoints, the actual sample is delayed until the next safepoint.

If each sample were counted equally (weight = 1), a sample delayed by a long-running C method would get the same weight as one taken immediately. This is the safepoint bias[Todd, 2010] problem: functions that happen to be running when the thread reaches a safepoint appear more often than they should, while functions between safepoints are under-represented.

mermaid source
sequenceDiagram
    participant Timer
    participant VM
    participant Profiler

    Timer->>VM: Signal (t=0ms)
    Note over VM: Running C method...<br/>cannot stop yet
    VM->>Profiler: Safepoint reached (t=3ms)
    Note over Profiler: Traditional: weight = 1<br/>sperf: weight = 3ms
    Timer->>VM: Signal (t=1ms)
    VM->>Profiler: Safepoint reached (t=1ms)
    Note over Profiler: Traditional: weight = 1<br/>sperf: weight = 1ms

sperf solves this by recording clock_now - clock_prev as the weight of each sample. A sample delayed by 3ms gets 3x the weight of a 1ms sample, accurately reflecting where time was actually spent.
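
The effect of this weighting can be sketched in plain Ruby. The samples below are hand-built for illustration (they are not real profiler output): a slow C method reaches a safepoint only once, but 3 ms late, while a fast method hits safepoints twice at 1 ms intervals. Count-based aggregation over-represents the fast method; weight-based aggregation reflects actual elapsed time:

```ruby
# Hypothetical samples as [leaf_label, delta_ns], mimicking sperf's weighting.
samples = [
  ["slow_c_method", 3_000_000],  # one delayed sample, weight = 3 ms
  ["fast_method",   1_000_000],
  ["fast_method",   1_000_000],
]

by_count  = Hash.new(0)  # traditional: every sample weighs 1
by_weight = Hash.new(0)  # sperf: every sample weighs its time delta
samples.each do |label, weight_ns|
  by_count[label]  += 1
  by_weight[label] += weight_ns
end

# Count-based: fast_method looks twice as hot (2 samples vs 1).
# Weight-based: slow_c_method correctly gets 3 ms of the 5 ms total.
p by_count
p by_weight
```
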

1.2.2 Other advantages

  • GVL and GC awareness: In wall mode, sperf tracks time spent blocked off the GVL, waiting to reacquire the GVL, and in GC marking/sweeping phases — each as distinct synthetic frames.
  • perf-like CLI: The sperf stat command gives you a quick performance overview (like perf stat), while sperf record + sperf report gives you detailed profiling.
  • Standard output: sperf outputs pprof protobuf format, compatible with Go’s pprof tool ecosystem, as well as collapsed stacks for flame graphs[Brendan, 2016] and speedscope.
  • Low overhead: Default 1000 Hz sampling adds < 0.2% overhead, suitable for production use.

1.3 Requirements

  • Ruby >= 3.4.0
  • POSIX system (Linux or macOS)
  • Go (optional, for sperf report and sperf diff subcommands)

1.4 Quick start

Profile a Ruby script and view the results:

# Install sperf
gem install sperf

# Quick performance overview
sperf stat ruby my_app.rb

# Record a profile and view it interactively
sperf record ruby my_app.rb
sperf report

Or use sperf from Ruby code:

require "sperf"

Sperf.start(output: "profile.pb.gz") do
  # code to profile
end

This chapter was generated by AI. Accuracy is not guaranteed.

2. Installation

2.1 Installing the gem

sperf is distributed as a Ruby gem with a C extension:

gem install sperf

Or add it to your Gemfile:

gem "sperf"

Then run:

bundle install

2.2 Verifying the installation

Check that sperf is installed correctly:

sperf --help

You should see:

Usage: sperf record [options] command [args...]
       sperf stat [options] command [args...]
       sperf report [options] [file]
       sperf diff [options] base.pb.gz target.pb.gz
       sperf help

Run 'sperf help' for full documentation

2.3 Platform support

sperf supports POSIX systems:

  Platform  Timer implementation                    Notes
  Linux     timer_create + signal (default)         Best precision (~1000us at 1000Hz)
  Linux     nanosleep thread (with signal: false)   Fallback, ~100us drift/tick
  macOS     nanosleep thread                        Signal-based timer not available

On Linux, sperf uses timer_create with SIGEV_SIGNAL and a sigaction handler by default. This provides precise interval timing with no extra thread. The signal number defaults to SIGRTMIN+8 and can be changed via the signal: option.

On macOS (and when signal: false is set on Linux), sperf falls back to a dedicated pthread with a nanosleep loop.

2.4 Optional: Go toolchain

The sperf report and sperf diff subcommands are thin wrappers around go tool pprof. If you want to use these commands, install Go from go.dev.

Without Go, you can still use all other sperf features. You can view pprof files with other tools like speedscope or generate text/collapsed output directly.


How to Use

This part explains how to profile Ruby programs with sperf — from the command line and as a library — and describes the available output formats.


3. CLI Usage

sperf provides a perf-like command-line interface with four main subcommands: record, stat, report, and diff.

3.1 sperf stat

sperf stat is the quickest way to get a performance overview. It runs your command with wall-mode profiling and prints a summary to stderr.

sperf stat ruby my_app.rb

3.1.1 Example: CPU-bound program

Here is a simple Fibonacci computation:

# fib.rb
def fib(n)
  return n if n <= 1
  fib(n - 1) + fib(n - 2)
end
fib(35)

Running sperf stat:

sperf stat ruby fib.rb
 Performance stats for 'ruby fib.rb':

           661.9 ms   user
            24.1 ms   sys
           602.8 ms   real

           590.1 ms 100.0%  CPU execution
             9.0 ms         [Ruby] GC time (8 count: 5 minor, 3 major)
          99,774            [Ruby] allocated objects
          49,270            [Ruby] freed objects
              18 MB         [OS] peak memory (maxrss)
              20            [OS] context switches (4 voluntary, 16 involuntary)
               0 MB         [OS] disk I/O (0 MB read, 0 MB write)

 Top 1 by flat:
           590.1 ms 100.0%  Object#fib (fib.rb)

  590 samples (16 unique stacks), 0.2% profiler overhead

The output tells you:

  • user/sys/real: Standard timing (like the time command)
  • Time breakdown: Where wall time was spent — CPU execution, GVL blocked (I/O/sleep), GVL wait (contention), GC marking, GC sweeping
  • Ruby stats: GC count, allocated/freed objects, YJIT ratio (if enabled)
  • OS stats: Peak memory, context switches, disk I/O
  • Top functions: Hottest functions by flat time

3.1.2 Example: Mixed CPU and I/O

# mixed.rb
def cpu_work(n)
  sum = 0
  n.times { |i| sum += i * i }
  sum
end

def io_work
  sleep(0.05)
end

5.times do
  cpu_work(500_000)
  io_work
end

Running sperf stat:

sperf stat ruby mixed.rb

Because stat always uses wall mode, you can see how time is divided between CPU and I/O. The [Ruby] GVL blocked line shows time spent sleeping/in I/O, while CPU execution shows compute time.

3.1.3 stat options

sperf stat [options] command [args...]
  Option   Description
  -o PATH  Also save profile to file (default: none)
  -f HZ    Sampling frequency in Hz (default: 1000)
  -v       Print additional sampling statistics

3.2 sperf record

sperf record profiles a command and saves the result to a file. This is the primary way to capture profiles for detailed analysis.

sperf record ruby my_app.rb

By default, it saves to sperf.data in pprof format with CPU mode.

3.2.1 Example: Recording a profile

sperf record ruby fib.rb

This creates sperf.data. You can then analyze it with sperf report or other pprof-compatible tools.

3.2.2 Choosing a profiling mode

sperf supports two profiling modes:

  • cpu (default): Measures per-thread CPU time. Best for finding functions that consume CPU cycles. Ignores time spent sleeping, in I/O, or waiting for the GVL.
  • wall: Measures wall-clock time. Best for finding where wall time goes, including I/O, sleep, and GVL contention.
# CPU mode (default)
sperf record ruby my_app.rb

# Wall mode
sperf record -m wall ruby my_app.rb

3.2.3 Choosing an output format

sperf auto-detects the format from the file extension:

# pprof format (default)
sperf record -o profile.pb.gz ruby my_app.rb

# Collapsed stacks (for FlameGraph / speedscope)
sperf record -o profile.collapsed ruby my_app.rb

# Human-readable text
sperf record -o profile.txt ruby my_app.rb

You can also force the format explicitly:

sperf record --format text -o profile.dat ruby my_app.rb

3.2.4 Example: Text output

sperf record -o profile.txt ruby fib.rb

The text output looks like this:

Total: 509.5ms (cpu)
Samples: 509, Frequency: 1000Hz

Flat:
     509.5ms 100.0%  Object#fib (fib.rb)

Cumulative:
     509.5ms 100.0%  Object#fib (fib.rb)
     509.5ms 100.0%  <main> (fib.rb)

3.2.5 Example: Wall mode text output

sperf record -m wall -o wall_profile.txt ruby mixed.rb
Total: 311.8ms (wall)
Samples: 80, Frequency: 1000Hz

Flat:
     250.6ms  80.4%  [GVL blocked] (<GVL>)
      44.1ms  14.1%  Object#cpu_work (mixed.rb)
      13.9ms   4.5%  Integer#times (<internal:numeric>)
       3.2ms   1.0%  Kernel#sleep (<C method>)
       0.0ms   0.0%  [GVL wait] (<GVL>)

Cumulative:
     311.8ms 100.0%  Integer#times (<internal:numeric>)
     311.8ms 100.0%  block in <main> (mixed.rb)
     311.8ms 100.0%  <main> (mixed.rb)
     253.8ms  81.4%  Kernel#sleep (<C method>)
     253.8ms  81.4%  Object#io_work (mixed.rb)
     250.6ms  80.4%  [GVL blocked] (<GVL>)
      58.0ms  18.6%  Object#cpu_work (mixed.rb)
       0.0ms   0.0%  [GVL wait] (<GVL>)

In wall mode, [GVL blocked] appears as the dominant cost — this is the sleep time in io_work. The CPU time for cpu_work is clearly separated.

3.2.6 Verbose output

The -v flag prints sampling statistics to stderr during profiling:

sperf record -v ruby my_app.rb
[sperf] mode=cpu frequency=1000Hz
[sperf] sampling: 98 calls, 0.11ms total, 1.1us/call avg
[sperf] samples recorded: 904
[sperf] top 10 by flat:
[sperf]       53.4ms  50.1%  Object#cpu_work (-e)
[sperf]       17.0ms  15.9%  Integer#times (<internal:numeric>)
...

3.2.7 record options

sperf record [options] command [args...]
  Option        Description
  -o PATH       Output file (default: sperf.data)
  -f HZ         Sampling frequency in Hz (default: 1000)
  -m MODE       cpu or wall (default: cpu)
  --format FMT  pprof, collapsed, or text (default: auto from extension)
  -v            Print sampling statistics to stderr

3.3 sperf report

sperf report opens a pprof profile for analysis. It wraps go tool pprof and requires Go to be installed.

# Open interactive web UI (default)
sperf report

# Open a specific file
sperf report profile.pb.gz

# Print top functions
sperf report --top

# Print text report
sperf report --text

3.3.1 Example: Top and text output

Using the fib.rb profile recorded earlier:

sperf report --top sperf.data
Type: cpu
Showing nodes accounting for 577.31ms, 100% of 577.31ms total
      flat  flat%   sum%        cum   cum%
  577.31ms   100%   100%   577.31ms   100%  Object#fib
         0     0%   100%   577.31ms   100%  <main>

The default behavior (without --top or --text) opens an interactive web UI in your browser with flame graphs, top function views, and call graph visualizations powered by pprof[Gang, 2010].

3.3.2 report options

  Option     Description
  --top      Print top functions by flat time
  --text     Print text report
  (default)  Open interactive web UI in browser

3.4 sperf diff

sperf diff compares two pprof profiles, showing the difference (target - base). This is useful for measuring the impact of optimizations.

# Open diff in browser
sperf diff before.pb.gz after.pb.gz

# Print top functions by diff
sperf diff --top before.pb.gz after.pb.gz

# Print text diff
sperf diff --text before.pb.gz after.pb.gz

3.4.1 Workflow example

# Profile the baseline
sperf record -o before.pb.gz ruby my_app.rb

# Make your optimization changes...

# Profile again
sperf record -o after.pb.gz ruby my_app.rb

# Compare
sperf diff before.pb.gz after.pb.gz

3.5 sperf help

sperf help prints the full reference documentation, including profiling modes, output formats, synthetic frames, and diagnostic tips.

sperf help

This outputs detailed documentation suitable for both human reading and AI-assisted analysis.


4. Ruby API

sperf provides a Ruby API for programmatic profiling. This is useful when you want to profile specific sections of code, integrate profiling into test suites, or build custom profiling workflows.

4.1 Basic usage

4.1.1 Block form (recommended)

The simplest way to use sperf is with the block form of Sperf.start. It profiles the block and returns the profiling data:

require "sperf"

data = Sperf.start(output: "profile.pb.gz", frequency: 1000, mode: :cpu) do
  # code to profile
end

When output: is specified, the profile is automatically written to the file when the block finishes. The method also returns the raw data hash for further processing.

4.1.2 Example: Profiling a Fibonacci function

require "sperf"

def fib(n)
  return n if n <= 1
  fib(n - 1) + fib(n - 2)
end

data = Sperf.start(frequency: 1000, mode: :cpu) do
  fib(33)
end

Sperf.save("profile.txt", data)

Running this produces:

Total: 192.7ms (cpu)
Samples: 192, Frequency: 1000Hz

Flat:
     192.7ms 100.0%  Object#fib (example.rb)

Cumulative:
     192.7ms 100.0%  Object#fib (example.rb)
     192.7ms 100.0%  block in <main> (example.rb)
     192.7ms 100.0%  Sperf.start (lib/sperf.rb)
     192.7ms 100.0%  <main> (example.rb)

4.1.3 Manual start/stop

For cases where block form is awkward, you can manually start and stop profiling:

require "sperf"

Sperf.start(frequency: 1000, mode: :wall)

# ... code to profile ...

data = Sperf.stop

Sperf.stop returns the data hash, or nil if the profiler was not running.

4.2 Sperf.start parameters

  Parameter   Type             Default  Description
  frequency:  Integer          1000     Sampling frequency in Hz
  mode:       Symbol           :cpu     :cpu or :wall
  output:     String           nil      File path to write on stop
  verbose:    Boolean          false    Print statistics to stderr
  format:     Symbol           nil      :pprof, :collapsed, :text, or nil (auto-detect from output extension)
  signal:     Integer/Boolean  nil      Linux only: nil = timer signal (default), false = nanosleep thread, positive integer = specific RT signal number

4.3 Sperf.stop return value

Sperf.stop returns nil if the profiler was not running. Otherwise it returns a Hash:

{
  mode: :cpu,              # or :wall
  frequency: 1000,
  sampling_count: 1234,    # number of timer callbacks
  sampling_time_ns: 56789, # total time spent sampling (overhead)
  samples: [               # Array of [frames, weight]
    [frames, weight],      #   frames: [[path, label], ...] deepest-first
    ...                    #   weight: Integer (nanoseconds)
  ]
}

Each sample has:

  • frames: An array of [path, label] pairs, ordered deepest-first (leaf frame at index 0)
  • weight: Time in nanoseconds attributed to this sample
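
Given that structure, a profile can be summarized without any sperf helper. In this sketch, `data` is a hand-built stand-in with the documented shape (it is not real profiler output), used to compute total profiled time and the profiler's own overhead ratio:

```ruby
# A hand-built data hash matching the documented Sperf.stop shape.
data = {
  mode: :cpu,
  frequency: 1000,
  sampling_count: 3,
  sampling_time_ns: 4_500,  # time spent inside the sampler itself
  samples: [
    [[["fib.rb", "Object#fib"], ["fib.rb", "<main>"]], 2_000_000],
    [[["fib.rb", "Object#fib"], ["fib.rb", "<main>"]], 1_000_000],
    [[["fib.rb", "<main>"]],                           1_000_000],
  ],
}

# Total profiled time is the sum of all sample weights (nanoseconds).
total_ns = data[:samples].sum { |_frames, weight| weight }
overhead_pct = data[:sampling_time_ns].fdiv(total_ns) * 100
puts format("total: %.1fms, overhead: %.4f%%", total_ns / 1e6, overhead_pct)
```
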

4.4 Sperf.save

Sperf.save writes profiling data to a file in any supported format:

Sperf.save("profile.pb.gz", data)        # pprof format
Sperf.save("profile.collapsed", data)    # collapsed stacks
Sperf.save("profile.txt", data)          # text report

The format is auto-detected from the file extension. You can override it with the format: keyword:

Sperf.save("output.dat", data, format: :text)

4.5 Practical examples

4.5.1 Profiling a web request handler

require "sperf"

class ApplicationController
  def profile_action
    data = Sperf.start(mode: :wall) do
      # Simulate a typical request
      users = User.where(active: true).limit(100)
      result = users.map { |u| serialize_user(u) }
      render json: result
    end

    Sperf.save("request_profile.txt", data)
  end
end

Using wall mode here captures not just CPU time but also database I/O and any GVL contention.

4.5.2 Comparing CPU and wall profiles

require "sperf"

def workload
  # Mix of CPU and I/O
  100.times do
    compute_something
    sleep(0.001)
  end
end

# CPU profile: shows where CPU cycles go
cpu_data = Sperf.start(mode: :cpu) { workload }
Sperf.save("cpu.txt", cpu_data)

# Wall profile: shows where wall time goes
wall_data = Sperf.start(mode: :wall) { workload }
Sperf.save("wall.txt", wall_data)

The CPU profile will focus on compute_something, while the wall profile will show the sleep calls as [GVL blocked] time.

4.5.3 Processing raw samples

You can work with the raw sample data programmatically:

require "sperf"

data = Sperf.start(mode: :cpu) { workload }

# Find the hottest function
flat = Hash.new(0)
data[:samples].each do |frames, weight|
  leaf_label = frames.first&.last  # frames[0] is the leaf
  flat[leaf_label] += weight
end

top = flat.sort_by { |_, w| -w }.first(5)
top.each do |label, weight_ns|
  puts "#{label}: #{weight_ns / 1_000_000.0}ms"
end

4.5.4 Generating collapsed stacks for FlameGraph

require "sperf"

data = Sperf.start(mode: :cpu) { workload }
Sperf.save("profile.collapsed", data)

The collapsed format is one line per unique stack, compatible with Brendan Gregg’s FlameGraph[Brendan, 2016] tools and speedscope:

frame1;frame2;...;leaf weight_ns

You can generate a flame graph SVG:

flamegraph.pl profile.collapsed > flamegraph.svg

Or open the .collapsed file directly in speedscope.
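
Since frames in the data hash are ordered deepest-first while collapsed lines run bottom-to-top, producing collapsed output by hand means reversing each stack. A minimal sketch (the samples array is hand-built for illustration):

```ruby
# Samples in the documented shape: [frames, weight_ns], frames leaf-first.
samples = [
  [[["fib.rb", "Object#fib"], ["fib.rb", "<main>"]], 3_000_000],
  [[["fib.rb", "<main>"]],                           1_000_000],
]

lines = samples.map do |frames, weight_ns|
  # Each frame is a [path, label] pair; reverse to get bottom-to-top order.
  stack = frames.reverse.map { |_path, label| label }.join(";")
  "#{stack} #{weight_ns}"
end

puts lines
# <main>;Object#fib 3000000
# <main> 1000000
```
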


5. Output Formats

sperf supports three output formats. The format is auto-detected from the file extension, or can be set explicitly with the --format flag (CLI) or format: parameter (API).

5.1 pprof (default)

The pprof format is a gzip-compressed Protocol Buffers binary. This is the standard format used by Go’s pprof tooling and is the default output of sperf.

Extension convention: .pb.gz

How to view:

# Interactive web UI (requires Go)
sperf report profile.pb.gz

# Top functions
sperf report --top profile.pb.gz

# Text report
sperf report --text profile.pb.gz

# Or use go tool pprof directly
go tool pprof -http=:8080 profile.pb.gz
go tool pprof -top profile.pb.gz
go tool pprof -flame profile.pb.gz

You can also import pprof files into speedscope via its web interface.

Advantages: Standard format supported by a wide ecosystem of tools. Supports diff comparison between two profiles. Interactive exploration with flame graphs, call graphs, and source annotations.

5.2 Collapsed stacks

The collapsed stacks format is a plain text format with one line per unique stack trace. Each line contains a semicolon-separated stack (bottom-to-top) followed by a space and the weight in nanoseconds.

Extension convention: .collapsed

Format:

bottom_frame;...;top_frame weight_ns

Example output:

<main>;Integer#times;block in <main>;Object#cpu_work;Integer#times;Object#cpu_work 53419170
<main>;Integer#times;block in <main>;Object#cpu_work;Integer#times 16962309
<main>;Integer#times;block in <main>;Object#io_work;Kernel#sleep 2335151

How to use:

# Generate a FlameGraph SVG
sperf record -o profile.collapsed ruby my_app.rb
flamegraph.pl profile.collapsed > flamegraph.svg

# Open in speedscope (drag-and-drop the file)
open https://www.speedscope.app/

Advantages: Simple text format, easy to process with command-line tools. Compatible with Brendan Gregg’s FlameGraph[Brendan, 2016] tools and speedscope.

5.2.1 Parsing collapsed stacks programmatically

File.readlines("profile.collapsed").each do |line|
  stack, weight = line.rpartition(" ").then { |s, _, w| [s, w.to_i] }
  frames = stack.split(";")
  # frames[0] is bottom (main), frames[-1] is leaf (hot)
  puts "#{frames.last}: #{weight / 1_000_000.0}ms"
end

5.3 Text report

The text format is a human-readable (and AI-readable) report showing flat and cumulative top-N tables.

Extension convention: .txt

Example output:

Total: 509.5ms (cpu)
Samples: 509, Frequency: 1000Hz

Flat:
     509.5ms 100.0%  Object#fib (fib.rb)

Cumulative:
     509.5ms 100.0%  Object#fib (fib.rb)
     509.5ms 100.0%  <main> (fib.rb)

Sections:

  • Header: Total profiled time, sample count, and frequency
  • Flat table: Functions sorted by self time (the function was the leaf/deepest frame)
  • Cumulative table: Functions sorted by total time (the function appeared anywhere in the stack)

Advantages: No tooling required — readable with cat. Top 50 entries per table by default. Good for quick analysis, sharing in issue reports, or feeding to AI assistants.

5.4 Format comparison

  Feature               pprof                   collapsed                    text
  File size             Small (binary + gzip)   Medium (text)                Small (text)
  Flame graph           Yes (via pprof web UI)  Yes (via flamegraph.pl)      No
  Call graph            Yes                     No                           No
  Diff comparison       Yes (sperf diff)        No                           No
  No tools needed       No (requires Go)        No (requires flamegraph.pl)  Yes
  Programmatic parsing  Complex (protobuf)      Simple                       Simple
  AI-friendly           No                      Yes                          Yes

5.5 Auto-detection rules

  File extension  Format
  .collapsed      Collapsed stacks
  .txt            Text report
  Anything else   pprof

The default output file (sperf.data) uses pprof format.
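
The detection rule above can be expressed as a small helper. This is an illustrative sketch, not part of the sperf API (`detect_format` is a hypothetical name):

```ruby
# Map an output path to a format symbol following the documented rules.
def detect_format(path)
  case File.extname(path)
  when ".collapsed" then :collapsed
  when ".txt"       then :text
  else                   :pprof  # including .pb.gz and the default sperf.data
  end
end

detect_format("profile.collapsed")  # => :collapsed
detect_format("profile.txt")        # => :text
detect_format("profile.pb.gz")      # => :pprof (extname sees only ".gz")
detect_format("sperf.data")         # => :pprof
```
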


Deep Dive

This part explores how to read and act on profiling results, and explains how sperf works under the hood.


6. Interpreting Results

6.1 Flat vs. cumulative time

Understanding the difference between flat and cumulative time is essential for effective profiling, as described in classic profiling literature such as gprof[Susan, 1982].

  • Flat time: Time attributed directly to a function — it was the leaf (deepest frame) in the sample. High flat time means the function itself is doing expensive work.
  • Cumulative time: Time for all samples where the function appears anywhere in the stack. High cumulative time means the function (or something it calls) is expensive.
mermaid source
graph TD
    A["main()"] -->|calls| B["process()"]
    B -->|calls| C["compute()"]
    B -->|calls| D["format()"]

    style C fill:#ff6b6b,color:#fff
    style D fill:#ffa07a,color:#fff

If compute() has high flat time, optimize compute() itself. If process() has high cumulative but low flat time, the cost is in its children (compute() and format()).
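
Both views can be computed directly from weighted samples. In this sketch the stacks are hand-built (leaf-first, as in sperf's data hash): process() appears in every stack, so it accumulates all the cumulative time, but it is rarely the leaf, so its flat time stays small:

```ruby
# Each sample: [stack (leaf frame first), weight_ns].
samples = [
  [["compute()", "process()", "main()"], 7_000_000],
  [["format()",  "process()", "main()"], 2_000_000],
  [["process()", "main()"],              1_000_000],
]

flat = Hash.new(0)
cum  = Hash.new(0)
samples.each do |stack, w|
  flat[stack.first] += w                 # only the leaf earns flat time
  stack.uniq.each { |fn| cum[fn] += w }  # uniq avoids double-counting recursion
end

flat["process()"]  # small: it was the leaf only once
cum["process()"]   # large: it appears in every stack
```
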

6.2 Weight unit

All weights in sperf are in nanoseconds, regardless of profiling mode:

  • 1,000 ns = 1 us (microsecond)
  • 1,000,000 ns = 1 ms (millisecond)
  • 1,000,000,000 ns = 1 s (second)
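
A tiny helper for turning raw nanosecond weights into readable units (illustrative only; sperf's own reports already do this formatting):

```ruby
# Format a nanosecond weight using the largest fitting unit.
def format_ns(ns)
  if    ns >= 1_000_000_000 then format("%.1fs",  ns / 1e9)
  elsif ns >= 1_000_000     then format("%.1fms", ns / 1e6)
  elsif ns >= 1_000         then format("%.1fus", ns / 1e3)
  else                           "#{ns}ns"
  end
end

format_ns(509_500_000)  # => "509.5ms"
format_ns(1_100)        # => "1.1us"
format_ns(42)           # => "42ns"
```
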

6.3 Synthetic frames

In addition to normal Ruby frames, sperf records synthetic frames that represent non-CPU activity. These always appear as the leaf (deepest) frame in a sample.

6.3.1 [GVL blocked]

Mode: wall only

Time the thread spent off the GVL — during I/O operations, sleep, or in C extensions that release the GVL. This time is attributed to the stack captured at the SUSPENDED event (when the thread released the GVL).

High [GVL blocked] time indicates your program is I/O bound. Look at the cumulative view to find which functions are triggering the I/O.

6.3.2 [GVL wait]

Mode: wall only

Time the thread spent waiting to reacquire the GVL after becoming ready. This indicates GVL contention — another thread is holding the GVL while this thread wants to run.

High [GVL wait] time means your threads are serialized on the GVL. Consider reducing GVL-holding work, using Ractors, or moving work to child processes.

6.3.3 [GC marking]

Mode: cpu and wall

Time spent in the GC marking phase. Always measured in wall time. Attributed to the stack that triggered GC.

High [GC marking] time means too many live objects. Reduce the number of long-lived allocations.

6.3.4 [GC sweeping]

Mode: cpu and wall

Time spent in the GC sweeping phase. Always measured in wall time. Attributed to the stack that triggered GC.

High [GC sweeping] time means too many short-lived objects. Consider reusing objects or using object pools.
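
One way to reduce sweep pressure from short-lived objects is a simple pool that reuses instances instead of allocating fresh ones per iteration. A minimal sketch — not a production pool (no thread safety, no size limit), and the `Pool` class is hypothetical, not part of sperf:

```ruby
# A trivial object pool: borrow an object for a block, then return it.
class Pool
  def initialize(&factory)
    @factory = factory
    @free = []
  end

  def with
    obj = @free.pop || @factory.call  # reuse if possible, allocate otherwise
    yield obj
  ensure
    @free.push(obj) if obj            # return the object for reuse
  end
end

pool = Pool.new { String.new(capacity: 1024) }

1_000.times do
  pool.with do |buf|
    buf.clear
    buf << "row data"
    # ...process buf...
  end
end
# The same String is reused across iterations instead of 1,000 allocations.
```
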

6.4 Diagnosing common problems

6.4.1 High CPU usage

Mode: cpu

Look for functions with high flat CPU time. These are the functions consuming CPU cycles.

Action: Optimize the hot function (better algorithm, caching) or call it less frequently.

6.4.2 Slow requests / high latency

Mode: wall

Look for functions with high cumulative wall time.

  • If [GVL blocked] is dominant: I/O or sleep is the bottleneck. Check database queries, HTTP calls, file I/O.
  • If [GVL wait] is dominant: GVL contention. Reduce GVL-holding work or move to Ractors/child processes.

6.4.3 GC pressure

Mode: cpu or wall

Look for [GC marking] and [GC sweeping] samples.

  • High [GC marking]: Too many live objects. Reduce allocations of long-lived objects.
  • High [GC sweeping]: Too many short-lived objects. Reuse or pool objects.

The sperf stat output also shows GC counts and allocated/freed object counts, which can help diagnose allocation-heavy code.
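
The same allocation counters are available from Ruby's built-in GC.stat, which you can sample around a suspect block to confirm what the profile suggests (`allocations_during` is an illustrative helper, not part of sperf):

```ruby
# Count objects allocated while a block runs, using GC.stat's
# monotonically increasing :total_allocated_objects counter.
def allocations_during
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

n = allocations_during { Array.new(1_000) { [] } }
puts "allocated ~#{n} objects"  # at least the 1,000 inner arrays
```
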

6.4.4 Multithreaded app slower than expected

Mode: wall

Look for [GVL wait] time across threads.

sperf stat ruby threaded_app.rb

Example output for a GVL-contended workload:

 Performance stats for 'ruby threaded_app.rb':

            89.9 ms   user
            14.0 ms   sys
            41.2 ms   real

             0.5 ms   1.2%  CPU execution
             9.8 ms  23.8%  [Ruby] GVL blocked (I/O, sleep)
            30.8 ms  75.0%  [Ruby] GVL wait (contention)

Here, 75% of wall time is GVL contention. The threads are fighting for the GVL, serializing their CPU work.

6.5 Tips for effective profiling

  • Default frequency (1000 Hz) works well for most cases. For long-running production profiling, 100-500 Hz reduces overhead further.
  • Profile representative workloads, not micro-benchmarks. Profiling a real request or batch job gives more actionable results than profiling a tight loop.
  • Compare cpu and wall profiles to distinguish CPU-bound from I/O-bound code. If a function is hot in wall mode but not in CPU mode, it’s blocked on I/O.
  • Use sperf diff to measure the impact of optimizations. Profile before and after, then compare.
  • Use the verbose flag (-v) to check profiler overhead and see top functions immediately.
This chapter was generated by AI. Accuracy is not guaranteed.

7. Internals

This chapter describes how sperf works under the hood. Understanding the internals helps you interpret edge cases in profiling results and appreciate the design trade-offs.

7.1 Architecture overview

sperf consists of a C extension and a Ruby wrapper:

mermaid source
graph TD
    subgraph "C Extension (ext/sperf/sperf.c)"
        Timer["Timer<br/>(signal or nanosleep)"]
        PJ["rb_postponed_job_trigger"]
        Sample["Sample callback<br/>(sperf_sample_job)"]
        ThreadHook["Thread event hook<br/>(SUSPENDED/READY/RESUMED/EXITED)"]
        GCHook["GC event hook<br/>(ENTER/EXIT/START/END_MARK/END_SWEEP)"]
        SampleBuf["Sample buffer"]
        FramePool["Frame pool"]
    end

    subgraph "Ruby (lib/sperf.rb)"
        API["Sperf.start / Sperf.stop"]
        Encoders["Encoders<br/>(PProf / Collapsed / Text)"]
        StatOut["stat output"]
    end

    Timer -->|triggers| PJ
    PJ -->|at safepoint| Sample
    Sample -->|writes| SampleBuf
    Sample -->|writes| FramePool
    ThreadHook -->|writes| SampleBuf
    GCHook -->|writes| SampleBuf
    API -->|calls| Sample
    SampleBuf -->|read at stop| Encoders
    FramePool -->|resolved at stop| Encoders
    Encoders --> StatOut

7.2 Global profiler state

sperf uses a single global sperf_profiler_t struct. Only one profiling session can be active at a time (single session limitation). The struct holds:

  • Timer configuration (frequency, mode, signal number)
  • Sample buffer (dynamically growing array)
  • Frame pool (dynamically growing VALUE array)
  • Thread-specific key for per-thread data
  • GC phase tracking state
  • Sampling overhead counters

7.3 Timer implementation

sperf uses a timer to periodically trigger sampling. The timer implementation depends on the platform:

7.3.1 Linux: Signal-based timer (default)

On Linux, sperf uses timer_create with SIGEV_SIGNAL to deliver a real-time signal (default: SIGRTMIN+8) at the configured frequency. A sigaction handler calls rb_postponed_job_trigger to schedule the sampling callback.

mermaid source
sequenceDiagram
    participant Kernel as Linux Kernel
    participant Signal as Signal Handler
    participant VM as Ruby VM
    participant Job as Postponed Job

    Kernel->>Signal: SIGRTMIN+8 (every 1ms at 1000Hz)
    Signal->>VM: rb_postponed_job_trigger()
    Note over VM: Continues until safepoint...
    VM->>Job: sperf_sample_job()
    Job->>Job: rb_profile_frames() + record sample

This approach provides precise timing (~1000us median interval at 1000Hz) with no extra thread.

7.3.2 Fallback: nanosleep thread

On macOS, or when signal: false is set on Linux, sperf creates a dedicated pthread that loops with nanosleep:

while (prof->running) {
    rb_postponed_job_trigger(prof->pj_handle);  /* schedule a sample at the next VM safepoint */
    nanosleep(&interval, NULL);                 /* sleep for one sampling interval */
}

This is simpler but has ~100us drift per tick due to nanosleep’s imprecision.

7.4 Sampling: current-thread-only

When the postponed job fires, sperf_sample_job runs on whatever thread currently holds the GVL. It only samples that thread using rb_thread_current().

This is a deliberate design choice:

  1. rb_profile_frames can only capture the current thread’s stack
  2. There’s no need to iterate Thread.list — combined with GVL event hooks, sperf gets complete coverage of all threads

The sampling callback:

  1. Gets or creates per-thread data (sperf_thread_data_t)
  2. Reads the current clock (CLOCK_THREAD_CPUTIME_ID for CPU mode, CLOCK_MONOTONIC for wall mode)
  3. Computes weight as time_now - prev_time
  4. Captures the backtrace with rb_profile_frames directly into the frame pool
  5. Records the sample (frame start index, depth, weight, type)
  6. Updates prev_time
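The callback's weight bookkeeping can be modeled in Ruby pseudocode (names mirror, but do not quote, the C implementation):

```ruby
# Ruby model of the C sampling callback: each thread keeps the time of
# its previous sample, and a new sample's weight is the time elapsed
# since then. Frames are stored as-is; resolution happens later.
ThreadData = Struct.new(:prev_time_ns)
Sample     = Struct.new(:frames, :weight_ns, :type)

SAMPLES = []
TDATA   = Hash.new { |h, k| h[k] = ThreadData.new(nil) }

def sample_job(thread_id, now_ns, backtrace)
  td = TDATA[thread_id]
  td.prev_time_ns ||= now_ns            # first sample establishes the baseline
  weight = now_ns - td.prev_time_ns     # elapsed since this thread's last sample
  SAMPLES << Sample.new(backtrace, weight, :timer)
  td.prev_time_ns = now_ns
end

sample_job(1, 1_000_000, ["main"])          # baseline sample, weight 0
sample_job(1, 2_000_000, ["main", "work"])  # weight 1_000_000 ns
```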

7.5 GVL event tracking (wall mode)

In wall mode, sperf hooks into Ruby’s thread event API to track GVL transitions. This captures time that sampling alone would miss — time spent off the GVL.

mermaid source
stateDiagram-v2
    [*] --> Running: Thread holds GVL
    Running --> Suspended: SUSPENDED event<br/>(capture backtrace + timestamp)
    Suspended --> Ready: READY event<br/>(record wall timestamp)
    Ready --> Running: RESUMED event<br/>(record GVL blocked + wait samples)
    Running --> [*]: EXITED event<br/>(cleanup thread data)

7.5.1 SUSPENDED

When a thread releases the GVL (e.g., before I/O):

  1. Capture the current backtrace into the frame pool
  2. Record a normal sample (time since last sample)
  3. Save the backtrace and wall timestamp for later use

7.5.2 READY

When a thread becomes ready to run (e.g., I/O completed):

  1. Record the wall timestamp (no GVL needed — only simple C operations)

7.5.3 RESUMED

When a thread reacquires the GVL:

  1. Record a [GVL blocked] sample: weight = ready_at - suspended_at (off-GVL time)
  2. Record a [GVL wait] sample: weight = resumed_at - ready_at (GVL contention time)
  3. Both samples reuse the backtrace captured at SUSPENDED

This way, off-GVL time and GVL contention are accurately attributed to the code that triggered them, even though no timer-based sampling can occur while the thread is off the GVL.
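The weight arithmetic at RESUMED, with illustrative timestamps (the variable names follow the per-thread fields described in this chapter):

```ruby
# Weights recorded at RESUMED, derived from the three timestamps
# sperf tracks per thread (all values here are illustrative, in ns).
suspended_at = 100_000    # thread released the GVL (e.g., started a read)
ready_at     = 5_100_000  # I/O completed; thread became runnable
resumed_at   = 5_400_000  # thread actually reacquired the GVL

gvl_blocked = ready_at   - suspended_at  # off-GVL time (I/O, sleep)
gvl_wait    = resumed_at - ready_at      # GVL contention time
```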

7.6 GC phase tracking

sperf hooks into Ruby’s internal GC events to track garbage collection time:

Event         Action
GC_START      Set phase to marking
GC_END_MARK   Set phase to sweeping
GC_END_SWEEP  Clear phase
GC_ENTER      Capture backtrace + wall timestamp
GC_EXIT       Record [GC marking] or [GC sweeping] sample

GC samples always use wall time regardless of the profiling mode, because GC time is real elapsed time that affects application latency.

7.7 Deferred string resolution

During sampling, sperf stores raw frame VALUEs (Ruby internal object references) in the frame pool — not strings. This deferred string resolution keeps the hot path allocation-free and fast.

String resolution happens at stop time:

  1. Sperf.stop calls _c_stop which returns raw data
  2. For each frame VALUE, rb_profile_frame_full_label and rb_profile_frame_path are called to get human-readable strings
  3. These strings are then passed to the Ruby encoders

This means sampling only writes integers (VALUE pointers and timestamps) to pre-allocated buffers. No Ruby objects are created, no GC pressure is added during profiling.
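The same idea in a Ruby-level sketch: the hot path records only opaque integers, and labels are produced once at stop time. The real code stores C-level VALUEs and resolves them with rb_profile_frame_full_label; this stand-in uses a plain lookup table:

```ruby
# Model of deferred string resolution: sampling appends opaque frame
# ids (integers) to a pool, and human-readable labels are built only
# once, at stop time.
FRAME_POOL = []  # raw frame ids appended during "sampling"
LABELS = { 101 => "MyClass#work", 102 => "Integer#times" }  # resolver stand-in

def record_sample(frame_ids)
  start = FRAME_POOL.size      # hot path: integers only, no strings
  FRAME_POOL.concat(frame_ids)
  [start, frame_ids.size]      # the sample stores (frame start, depth)
end

sample = record_sample([101, 102])

# Stop time: resolve every pooled id to a label, once.
resolved = FRAME_POOL.map { |id| LABELS.fetch(id, "unknown") }
```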

7.8 Frame pool and GC safety

The frame pool is a contiguous array of VALUEs that stores raw frame references from rb_profile_frames. Since these are Ruby object references, they must be protected from garbage collection.

sperf wraps the profiler struct in a TypedData object with a custom dmark function that calls rb_gc_mark_locations on the entire frame pool. This tells the GC that all VALUEs in the pool are reachable and must not be collected.

The frame pool starts at ~1MB and doubles in size via realloc when needed. Similarly, the sample buffer starts with 1024 entries and grows by 2x.
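With doubling, capacity grows geometrically, so even large profiles need only a handful of reallocations. A quick sketch of the arithmetic (the helper is illustrative, not sperf code):

```ruby
# Doubling growth: starting from an initial capacity, count how many
# realloc-style doublings are needed to hold n entries.
def reallocs_needed(n, cap = 1024)
  count = 0
  while cap < n
    cap *= 2
    count += 1
  end
  count
end

reallocs_needed(1_000_000)  # 10 doublings: 1024 -> 1_048_576
```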

7.9 Per-thread data

Each thread gets a sperf_thread_data_t struct stored via Ruby’s thread-specific data API (rb_internal_thread_specific_set). This tracks:

  • prev_cpu_ns: Previous CPU time reading (for computing weight)
  • prev_wall_ns: Previous wall time reading
  • suspended_at_ns: Wall timestamp when thread was suspended
  • ready_at_ns: Wall timestamp when thread became ready
  • suspended_frame_start/depth: Saved backtrace from SUSPENDED event

Thread data is created lazily on first encounter and freed on the EXITED event or at profiler stop.

7.10 Fork safety

sperf registers a pthread_atfork child handler that silently stops profiling in the forked child process (fork safety):

  • Clears the timer/signal state
  • Removes event hooks
  • Frees sample and frame buffers

The parent process continues profiling unaffected. The child can start a fresh profiling session if needed.

7.11 pprof encoder

sperf encodes the pprof [Gang, 2010] protobuf format entirely in Ruby, with no dependency on the protobuf gem. The encoder in Sperf::PProf.encode:

  1. Builds a string table (index 0 is always the empty string)
  2. Converts string frames to index frames and merges identical stacks
  3. Builds location and function tables
  4. Encodes the Profile protobuf message field by field

This hand-written encoder is simple (~100 lines) and only runs once at stop time, so performance is not a concern.
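Two building blocks carry most of that work: base-128 varint encoding and the string table. A minimal illustration of both (not the actual Sperf::PProf code; the `intern` helper is ours):

```ruby
# Minimal protobuf building blocks, as used by a hand-written encoder.

# Base-128 varint: 7 payload bits per byte, high bit set on all but
# the last byte.
def varint(n)
  bytes = []
  loop do
    byte = n & 0x7f
    n >>= 7
    byte |= 0x80 if n > 0  # continuation bit: more bytes follow
    bytes << byte
    break if n.zero?
  end
  bytes.pack("C*")
end

# String table: index 0 is always the empty string (pprof convention);
# interning returns the existing index or appends a new entry.
strings = [""]
def intern(strings, s)
  strings.index(s) || (strings << s; strings.size - 1)
end
```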

7.12 Known limitations

7.12.1 Running EC race

There is a known race condition in the Ruby VM where rb_postponed_job_trigger from the timer thread may set the interrupt flag on the wrong thread’s execution context. This happens when a new thread’s native thread starts before acquiring the GVL. The result is that timer samples may miss threads doing C busy-wait, with their CPU time leaking into the next SUSPENDED event’s stack.

This is a Ruby VM bug, not a sperf bug, and affects all postponed-job-based profilers.

7.12.2 Single session

Only one profiling session can be active at a time due to the global profiler state. Calling Sperf.start while already profiling is not supported.

7.12.3 Method-level granularity

sperf profiles at the method level, not the line level. Frame labels use rb_profile_frame_full_label for qualified names (e.g., Integer#times, MyClass#method_name). Line numbers are not included.


A. Quick Reference

A.1 CLI cheat sheet

# Quick performance overview
sperf stat ruby my_app.rb

# Record to default file (sperf.data, pprof format, cpu mode)
sperf record ruby my_app.rb

# Record with options
sperf record -m wall -f 500 -o profile.pb.gz ruby my_app.rb

# Record to text format
sperf record -o profile.txt ruby my_app.rb

# Record to collapsed stacks
sperf record -o profile.collapsed ruby my_app.rb

# View profile in browser (requires Go)
sperf report

# Print top functions
sperf report --top profile.pb.gz

# Compare two profiles
sperf diff before.pb.gz after.pb.gz

# Full documentation
sperf help

A.2 Ruby API cheat sheet

require "sperf"

# Block form
data = Sperf.start(output: "profile.pb.gz", mode: :cpu) do
  # code to profile
end

# Manual form
Sperf.start(frequency: 1000, mode: :wall)
# ...
data = Sperf.stop

# Save to file
Sperf.save("profile.pb.gz", data)
Sperf.save("profile.collapsed", data)
Sperf.save("profile.txt", data)

A.3 Environment variables

These are used internally by the CLI to configure the auto-started profiler:

Variable         Values                  Description
SPERF_ENABLED    1                       Enable auto-start on require
SPERF_OUTPUT     path                    Output file path
SPERF_FREQUENCY  integer                 Sampling frequency in Hz
SPERF_MODE       cpu, wall               Profiling mode
SPERF_FORMAT     pprof, collapsed, text  Output format
SPERF_VERBOSE    1                       Print statistics to stderr

A.4 Profiling mode comparison

Aspect          cpu                          wall
Clock           CLOCK_THREAD_CPUTIME_ID      CLOCK_MONOTONIC
I/O time        Not measured                 [GVL blocked]
Sleep time      Not measured                 [GVL blocked]
GVL contention  Not measured                 [GVL wait]
GC time         [GC marking], [GC sweeping]  [GC marking], [GC sweeping]
Best for        CPU hotspots                 Latency analysis

A.5 Output format comparison

Extension         Format            Tooling required
.pb.gz (default)  pprof protobuf    Go (sperf report)
.collapsed        Collapsed stacks  flamegraph.pl or speedscope
.txt              Text report       None

A.6 Synthetic frames

Frame          Mode  Meaning
[GVL blocked]  wall  Thread off-GVL (I/O, sleep, C ext)
[GVL wait]     wall  Thread waiting for GVL (contention)
[GC marking]   both  GC marking phase (wall time)
[GC sweeping]  both  GC sweeping phase (wall time)

Bibliography

  • [Aman, 2014] Aman Gupta. stackprof: A sampling call-stack profiler for Ruby 2.2+. 2014.
  • [Arnaldo, 2010] Arnaldo Carvalho de Melo. The New Linux perf Tools. 2010.
  • [Brendan, 2016] Brendan Gregg. "The Flame Graph". Communications of the ACM, 59(6), pp. 48–57. 2016.
  • [Gang, 2010] Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt. "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers". IEEE Micro, 30(4), pp. 65–79. 2010.
  • [Susan, 1982] Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. "gprof: A Call Graph Execution Profiler". In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, pp. 120–126. 1982.
  • [Todd, 2010] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. "Evaluating the Accuracy of Java Profilers". In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '10), pp. 187–197. 2010.

Index

C

collapsed stacks
1. Introduction
5. Output Formats
cpu mode
3. CLI Usage
cumulative time
6. Interpreting Results

D

deferred string resolution
7. Internals

F

flat time
6. Interpreting Results
fork safety
7. Internals
frame pool
7. Internals

G

GVL
1. Introduction
6. Interpreting Results

P

postponed job
1. Introduction
pprof
1. Introduction
5. Output Formats

S

Sperf.save
4. Ruby API
Sperf.start
4. Ruby API
Sperf.stop
4. Ruby API
safepoint
1. Introduction
safepoint bias
1. Introduction
sampling
1. Introduction
single session
7. Internals
sperf
1. Introduction
sperf diff
3. CLI Usage
sperf record
1. Introduction
3. CLI Usage
sperf report
1. Introduction
3. CLI Usage
sperf stat
1. Introduction
3. CLI Usage
synthetic frames
1. Introduction
6. Interpreting Results

W

wall mode
3. CLI Usage
weight
1. Introduction

[

[GC marking]
6. Interpreting Results
[GC sweeping]
6. Interpreting Results
[GVL blocked]
6. Interpreting Results
[GVL wait]
6. Interpreting Results