Puma is the most popular web server for running Ruby applications. If you’ve recently started working on a new Rails application, chances are you’re using Puma to serve HTTP traffic, as it is the default Ruby on Rails server. Puma has stood the test of time, having gone through three major Ruby versions and been around for over 15 years, making it a treasure trove of interesting and mature design decisions.

Technically, Puma is a preforking, multi-threaded HTTP/1.1 server that can run Rack-compatible Ruby programs.

The purpose of this article is to take a deep dive into the internals of a real, battle-tested web server, explore the trade-offs it makes, and highlight curiosities I personally find interesting. If you want a brief overview or condensed practical guidance instead, refer to Puma’s documentation.

If you want to learn more about how real HTTP servers operate, are studying socket programming and want to build a toy web server inspired by a real example, or if you’re a seasoned veteran curious about the design choices behind the most popular Ruby web server, this article will hopefully have something to offer you.

If you’re a Ruby programmer, this article may introduce you to concepts you haven’t encountered or worked with before.

To read this article effortlessly, it’s ideal to have a basic grasp of Ruby, be familiar with the Ruby ecosystem in general (and Rack specifically), and have some surface-level knowledge of networking. If you’ve at least heard of sockets, you’re more than ready.

The code discussed and shown is sourced from version 6.5.0 of Puma. Ruby specifics are valid as of version 3.4 running in MRI/CRuby on Linux.

For those already familiar with Puma and seeking insights into specific topics, be warned: some parts of Puma will be omitted and not explored in detail for brevity. Most of the auxiliary code for observability and monitoring, such as the control application, will be skipped, along with lesser-used features such as SystemD socket activation. We will also not cover Puma’s SSL/TLS implementation or its HTTP parser.

This article turned out to be quite long, which is not surprising, since there is a lot of ground to cover - from socket programming to the management of forked processes and thread pools.

Although it’s recommended to read the article sequentially, since later sections won’t re-explain aspects covered earlier, you can still use the table of contents to jump straight to a topic of interest.

While the first section provides a relatively descriptive overview of the boot process, the next dives deeper into the underlying mechanics of connection handling - explaining not just what Puma does, but why it does things a certain way, the possible alternatives, and how to make the best use of this information when running an application with Puma.

If you want to skip the configuration and setup process and go directly to connection handling, click here.

Finally, before we start, a disclaimer is in order: I’m not a maintainer or contributor to Puma. All of the research and insights come from the project’s GitHub discussions, git blame, local experimentation, and personally observed production experience. If you notice any mistakes or errors, please do not hesitate to contact me.

Booting up

Let’s begin our tour of Puma by looking into its entry points.

There are two different ways to start a Rack app, such as a Rails application, with Puma. The first is by using the puma CLI command directly. The second comes into play when the underlying framework’s CLI, such as rails server, is used.

CLI

The entry point to the Puma CLI is the bin/puma script (an executable with a shebang line), which requires the source file containing the CLI class, instantiates a CLI object, and calls run on it.

Let’s see what happens when the CLI is instantiated:

module Puma
  class CLI
    # ...

    def initialize(argv, log_writer = LogWriter.stdio, events = Events.new, env: ENV)
      # ...

      @argv = argv.dup
      # ...
      @events = events

      @conf = nil

      # ...

      setup_options env

      begin
        @parser.parse! @argv

        if file = @argv.shift
          @conf.configure do |user_config, file_config|
            file_config.rackup file
          end
        end
      rescue UnsupportedOption
        exit 1
      end

      @conf.configure do |user_config, file_config|
        if @stdout || @stderr
          user_config.stdout_redirect @stdout, @stderr, @append
        end

        if @control_url
          user_config.activate_control_app @control_url, @control_options
        end
      end

      @launcher = Puma::Launcher.new(@conf, env: ENV, log_writer: @log_writer, events: @events, argv: argv)
    end

    # ...
  end
end

The most significant part of the initializer is the setup of configuration.

Some configuration options are supplied as CLI arguments and extracted from the ARGV array. However, if you’ve used Puma in the past, you know that most option tuning is usually done in a special configuration file.

Let’s examine the setup_options method to figure out when and how exactly the interpretation of this file takes place:

module Puma
  class CLI
    # ...

    def setup_options(env)
      @conf = Configuration.new({}, {events: @events}, env) do |user_config, file_config|
        @parser = OptionParser.new do |o|
          o.on "-b", "--bind URI", "URI to bind to (tcp://, unix://, ssl://)" do |arg|
            user_config.bind arg
          end

          # ... and a multitude of other options
        end
      end
    end

    # ...
  end
end

As can be seen, this method instantiates a Configuration object.

Its initializer takes a block argument, which this method uses to create an instance of OptionParser - the Ruby standard library’s built-in way to implement CLI applications.

The parser declares all of the available arguments that can be passed to the Puma executable. Blocks provided to each on call on the yielded parser object are invoked when the corresponding argument is detected. For example, when the URI to which Puma must bind is provided using the -b argument, bind is called on a user_config with the provided argument. This, however, happens later - the blocks are actually evaluated only when parse! is invoked on the parser object, which happens right after the call to setup_options in the CLI initializer.
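
To make the deferred evaluation concrete, here’s a standalone sketch (not Puma’s code) of the same pattern:

require 'optparse'

options = {}
parser = OptionParser.new do |o|
  # The block is stored at this point, not executed
  o.on "-b", "--bind URI", "URI to bind to" do |arg|
    (options[:binds] ||= []) << arg
  end
end

argv = ['-b', 'tcp://127.0.0.1:3000']
parser.parse!(argv) # only now is the block above invoked
options             # => {:binds=>["tcp://127.0.0.1:3000"]}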

An aspect of this method that needs closer examination is the Configuration object and its yielded user_config and file_config arguments.

Configuration

Puma, like any other background process or daemon, needs to solve the problem of configuration - users should be able to provide overrides to the defaults, and usually in several ways. In the case of Puma, it’s either via CLI arguments or by using the aforementioned file. This implies an inherent ordering of values provided for the same option.

Puma takes a direct approach to this problem by materializing this hierarchy in a single class - UserFileDefaultOptions. But before we get to it, let’s first look at Configuration’s initializer:

module Puma
  class Configuration
    DEFAULTS = {
      auto_trim_time: 30,
      binds: ['tcp://0.0.0.0:9292'.freeze],
      clean_thread_locals: false,
      debug: false,
      enable_keep_alives: true,
      # ... and other default values
    }

    # ...

    def initialize(user_options={}, default_options = {}, env = ENV, &block)
      default_options = self.puma_default_options(env).merge(default_options)

      @options     = UserFileDefaultOptions.new(user_options, default_options)
      @plugins     = PluginLoader.new
      @user_dsl    = DSL.new(@options.user_options, self)
      @file_dsl    = DSL.new(@options.file_options, self)
      @default_dsl = DSL.new(@options.default_options, self)

      if !@options[:prune_bundler]
        default_options[:preload_app] = (@options[:workers] > 1) && Puma.forkable?
      end

      @puma_bundler_pruned = env.key? 'PUMA_BUNDLER_PRUNED'

      if block
        configure(&block)
      end
    end

    # ...

    def puma_default_options(env = ENV)
      defaults = DEFAULTS.dup
      puma_options_from_env(env).each { |k,v| defaults[k] = v if v }
      defaults
    end

    def puma_options_from_env(env = ENV)
      min = env['PUMA_MIN_THREADS'] || env['MIN_THREADS']
      max = env['PUMA_MAX_THREADS'] || env['MAX_THREADS']
      workers = if env['WEB_CONCURRENCY'] == 'auto'
        require_processor_counter
        ::Concurrent.available_processor_count
      else
        env['WEB_CONCURRENCY']
      end

      {
        min_threads: min && min != "" && Integer(min),
        max_threads: max && max != "" && Integer(max),
        workers: workers && workers != "" && Integer(workers),
        environment: env['APP_ENV'] || env['RACK_ENV'] || env['RAILS_ENV'],
      }
    end

    # ...
  end
end

First, it creates a default_options hash by merging the actual defaults that are hard-coded in the DEFAULTS hash with ‘defaults’ dynamically provided either via environment variables or the second initializer argument (the only option provided this way is an events pub-sub container, which we will take a closer look at soon).

Then, it instantiates a UserFileDefaultOptions object, which is the actual source of all configuration values. Let’s briefly examine it before returning to the main configuration object:

module Puma
  class UserFileDefaultOptions
    def initialize(user_options, default_options)
      @user_options    = user_options
      @file_options    = {}
      @default_options = default_options
    end

    attr_reader :user_options, :file_options, :default_options

    def [](key)
      fetch(key)
    end

    def []=(key, value)
      user_options[key] = value
    end

    def fetch(key, default_value = nil)
      return user_options[key]    if user_options.key?(key)
      return file_options[key]    if file_options.key?(key)
      return default_options[key] if default_options.key?(key)

      default_value
    end

    def all_of(key)
      user    = user_options[key]
      file    = file_options[key]
      default = default_options[key]

      user    = [user]    unless user.is_a?(Array)
      file    = [file]    unless file.is_a?(Array)
      default = [default] unless default.is_a?(Array)

      user.compact!
      file.compact!
      default.compact!

      user + file + default
    end

    def finalize_values
      @default_options.each do |k,v|
        if v.respond_to? :call
          @default_options[k] = v.call
        end
      end
    end

    def final_options
      default_options
        .merge(file_options)
        .merge(user_options)
    end
  end
end

As already stated, the hierarchy of provided configuration values is neatly implemented in the fetch method - CLI arguments take precedence over values specified in the configuration file, which in turn take precedence over default values.
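
To make the precedence concrete, here’s a small usage sketch (the option name is illustrative):

options = Puma::UserFileDefaultOptions.new({}, { port: 9292 })

options[:port]                     # => 9292, falls through to the defaults
options.file_options[:port] = 3000
options[:port]                     # => 3000, file options shadow defaults
options[:port] = 3001              # #[]= always writes to user options
options[:port]                     # => 3001, user options win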

Additionally, as finalize_values suggests, it’s possible to specify default values as Proc instances, allowing default values to depend on environment variables or anything else that might get loaded or evaluated later in the host application’s lifecycle. However, this doesn’t seem to be used by Puma at all.

Following the creation of @options, Configuration’s initializer creates several other objects. Among them is a PluginLoader:

module Puma
  class Configuration
    # ...

    def initialize(user_options={}, default_options = {}, env = ENV, &block)
      default_options = self.puma_default_options(env).merge(default_options)

      @options     = UserFileDefaultOptions.new(user_options, default_options)
      @plugins     = PluginLoader.new
      # ...
    end

    # ...
  end
end

Plugins

As can already be deduced, Puma has a simple yet powerful plugin system.

Plugins are added to a registry during their definition via calls to Puma::Plugin.create in third-party libraries, or, as in the case of tmp_restart and systemd, in Puma itself.

Plugins implement the start method, which accepts an instance of a Launcher as its sole argument. We haven’t covered the Launcher yet, but for now, note that it provides access to Puma’s lifecycle callbacks, exposed using the events pub-sub container that we briefly mentioned when looking at the Configuration class. This allows plugins to hook into Puma and augment its behavior.

One interesting design choice of this plugin system is the automatic loading. It allows loading plugins that haven’t been defined yet by relying on conventions around the load path. If a plugin is required in the configuration file while its corresponding Puma::Plugin.create hasn’t been evaluated yet, Puma will try to load it by requireing puma/plugins/<plugin_name>. If that third-party plugin is defined in a separate gem under this expected path, it will get loaded.

The Plugin API also provides a streamlined way to run logic in a separate thread via its #in_background method, which is a simple wrapper around Thread.new. The key difference between using this method and manually starting a thread is that blocks defined using #in_background will only be executed when Puma considers it appropriate to run them, which happens right before the server is about to start.
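
Putting these pieces together, a minimal plugin might look like the following sketch (the background logic is hypothetical):

Puma::Plugin.create do
  def start(launcher)
    # Hook into a lifecycle event via the events pub-sub container
    launcher.events.on_booted do
      puts 'Server booted'
    end

    # Runs in its own thread, started right before the server starts
    in_background do
      loop do
        sleep 60
        # hypothetical: report metrics, touch a heartbeat file, etc.
      end
    end
  end
end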

DSL

Coming back to Configuration’s initializer, the next three instance variables being set are all instances of the DSL class. We’ll now pull back the curtain on the #configure method:

module Puma
  class Configuration
    # ...

    def initialize(user_options={}, default_options = {}, env = ENV, &block)
      default_options = self.puma_default_options(env).merge(default_options)

      @options     = UserFileDefaultOptions.new(user_options, default_options)
      @plugins     = PluginLoader.new
      @user_dsl    = DSL.new(@options.user_options, self)
      @file_dsl    = DSL.new(@options.file_options, self)
      @default_dsl = DSL.new(@options.default_options, self)
      # ...
      if block
        configure(&block)
      end
    end

    # ...

    attr_reader :options, :plugins

    def configure
      yield @user_dsl, @file_dsl, @default_dsl
    ensure
      @user_dsl._offer_plugins
      @file_dsl._offer_plugins
      @default_dsl._offer_plugins
    end
  end
end

If you look back at Puma::CLI’s #setup_options method we saw earlier, you’ll notice that the OptionParser calls a bind method on the user_config argument yielded from the #configure call implicitly invoked by Configuration’s initializer. However, we’ve already seen that the instance of UserFileDefaultOptions stores plain hashes - you can’t call writer methods like bind on those, only set key-value pairs using #[]. This is why all three hashes are wrapped in DSL instances, and the benefit of doing that will become apparent soon.

Here’s DSL’s initializer and several methods it exposes as wrappers around its option hash:

module Puma
  class DSL
    def initialize(options, config)
      @config  = config
      @options = options

      @plugins = []
    end

    # ...

    def bind(url)
      @options[:binds] ||= []
      @options[:binds] << url
    end

    # ...

    def port(port, host=nil)
      host ||= default_host
      bind URI::Generic.build(scheme: 'tcp', host: host, port: Integer(port)).to_s
    end

    # ...

    def environment(environment)
      @options[:environment] = environment
    end

    # ...

    def on_restart(&block)
      @options[:on_restart] ||= []
      @options[:on_restart] << block
    end

    # ...
  end
end

As we can see, most of the methods simply provide convenience for working with the @options hash - some of them also encapsulate additional logic that processes the arguments.

We won’t go over each DSL configuration method just yet, but instead will examine them individually as we encounter them in the code. However, a couple of additional methods should be highlighted now:

module Puma
  class DSL
    ON_WORKER_KEY = [String, Symbol].freeze

    # ...

    def on_worker_boot(key = nil, &block)
      warn_if_in_single_mode('on_worker_boot')

      process_hook :before_worker_boot, key, block, 'on_worker_boot'
    end

    # ...

    def on_worker_fork(&block)
      warn_if_in_single_mode('on_worker_fork')

      process_hook :before_worker_fork, nil, block, 'on_worker_fork'
    end

    def after_worker_fork(&block)
      warn_if_in_single_mode('after_worker_fork')

      process_hook :after_worker_fork, nil, block, 'after_worker_fork'
    end

    # ...

    private

    # ...

    def process_hook(options_key, key, block, meth)
      @options[options_key] ||= []
      if ON_WORKER_KEY.include? key.class
        @options[options_key] << [block, key.to_sym]
      elsif key.nil?
        @options[options_key] << block
      else
        raise "'#{meth}' key must be String or Symbol"
      end
    end

    # ...
  end
end

These methods, along with a few analogous ones not shown in the snippet above, allow Puma users to run arbitrary code at key moments of the server lifecycle. We’ll see exactly when and where they are triggered once we dive deeper.

Finally, DSL instances also provide a couple of important methods:

module Puma
  class DSL
    # ...

    def _offer_plugins
      @plugins.each do |o|
        if o.respond_to? :config
          @options.shift
          o.config self
        end
      end

      @plugins.clear
    end

    # ...

    def plugin(name)
      @plugins << @config.load_plugin(name)
    end

    # ...
  end
end

Remember the PluginLoader created in the Configuration initializer? When plugin is called on a DSL object, it delegates to the main Configuration object, which loads the plugin by calling create on the loader. Then, Configuration#configure calls _offer_plugins on each DSL object, allowing plugins that implement the config method to configure Puma itself. This can be useful when settings need adjustments only when a particular plugin is enabled - a sketch follows the snippet below.

module Puma
  class Configuration
    # ...

    def configure
      yield @user_dsl, @file_dsl, @default_dsl
    ensure
      @user_dsl._offer_plugins
      @file_dsl._offer_plugins
      @default_dsl._offer_plugins
    end

    # ...

    def load_plugin(name)
      @plugins.create name
    end

    # ...
  end
end
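
As a sketch, a plugin that adjusts settings through this mechanism could look like this (the tweaked option is purely illustrative):

Puma::Plugin.create do
  # Called from _offer_plugins with the DSL instance that loaded the plugin
  def config(dsl)
    dsl.log_requests true
  end

  def start(launcher)
    # no-op; configuration is the only concern of this sketch
  end
end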

Now, _load_from - this method makes it possible to have a Puma configuration file:

module Puma
  class DSL
    # ...

    def load(file)
      @options[:config_files] ||= []
      @options[:config_files] << file
    end

    # ...

    def _load_from(path)
      if path
        @path = path
        instance_eval(File.read(path), path, 1)
      end
    ensure
      _offer_plugins
    end

    # ...
  end
end

module Puma
  class Configuration
    # ...

    def load
      config_files.each { |config_file| @file_dsl._load_from(config_file) }

      @options
    end

    def config_files
      files = @options.all_of(:config_files)

      return [] if files == ['-']
      return files if files.any?

      first_default_file = %W(config/puma/#{@options[:environment]}.rb config/puma.rb).find do |f|
        File.exist?(f)
      end

      [first_default_file]
    end

    # ...
  end
end

Whenever Configuration#load is called, Puma looks up all paths to config files set using any means of configuration and executes instance_eval with the source code of the files passed as plain strings. This is why if you’ve used Puma before and looked at its configuration file, some of the DSL methods might look familiar - these are the very same methods. Puma reuses the same DSL approach for configuration provided via CLI arguments and other sources.
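
For illustration, here’s what a minimal config/puma.rb might look like - since the file is instance_eval’d, every bare method call resolves to a method on the file_config DSL instance:

# config/puma.rb
environment ENV.fetch('RACK_ENV', 'development')
port ENV.fetch('PORT', 3000)
threads 1, 5

on_restart do
  puts 'Restarting...'
end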

The most common way to override the path to the configuration file is by providing the -C CLI argument. Here’s what happens when it’s supplied:

module Puma
  class CLI
    # ...

    def setup_options(env)
      @conf = Configuration.new({}, {events: @events}, env) do |user_config, file_config|
        @parser = OptionParser.new do |o|
          # ...

          o.on "-C", "--config PATH", "Load PATH as a config file" do |arg|
            file_config.load arg
          end

          # ...
        end
      end
    end

    # ...
  end
end

The path is passed directly to the file_config DSL object, which, in turn, adds the path to the config_files options hash key. Unless reconfigured, Puma defaults to config/puma.rb.

CLI Finalisation

Before we look at the alternative way to boot up a Puma server, we need to examine the remaining logic executed whenever a CLI object is instantiated:

module Puma
  class CLI
    def initialize(argv, log_writer = LogWriter.stdio, events = Events.new, env: ENV)
      # ...

      setup_options env

      begin
        @parser.parse! @argv

        if file = @argv.shift
          @conf.configure do |user_config, file_config|
            file_config.rackup file
          end
        end
      rescue UnsupportedOption
        exit 1
      end

      @conf.configure do |user_config, file_config|
        if @stdout || @stderr
          user_config.stdout_redirect @stdout, @stderr, @append
        end

        if @control_url
          user_config.activate_control_app @control_url, @control_options
        end
      end

      @launcher = Puma::Launcher.new(@conf, env: ENV, log_writer: @log_writer, events: @events, argv: argv)
    end

    # ...

    def run
      @launcher.run
    end
  end
end

After executing @parser.parse!, the CLI initializer propagates several values captured by the OptionParser to the DSL configuration objects. It configures the rackup option if a filepath was passed as an argument to the Puma executable, and stdout_redirect and activate_control_app if the corresponding arguments are provided.

Finally, it initializes an instance of a Launcher and passes the configuration object, along with several other values, as arguments. run will be called on the launcher in the eponymous CLI method immediately after the CLI instance is created in the executable script. We will delve into the inner workings of the launcher in a later section.

Rackup

Before we dive into Puma’s Launcher and other components, a brief refresher on Rack and Rackup is in order. However, we will not go into too much detail on them in this article.

Rack is simply a specification of an API (and simultaneously an implementation of some parts of this specification) that serves as glue between app servers like Puma and frameworks like Rails, which implement the endpoint logic (routing and controllers in the case of Rails).

Ultimately, all abstractions introduced by a framework manifest as a Rack app - an object that responds to #call and takes an env hash as an argument. This holds true for Rails, as internally, all controllers and routing logic are bundled into a stack of middlewares that form a Rack app.
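
The entire contract can be demonstrated in a couple of lines - this lambda is a complete, valid Rack app:

app = lambda do |env|
  [200, { 'content-type' => 'text/plain' }, ["Hello from #{env['PATH_INFO']}"]]
end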

Rackup, in turn, can be thought of as a server-agnostic entrypoint to start the Rack app built by the framework around user-provided logic.

When a Puma server is not started using its own executable, but rather with the underlying framework’s CLI instead (e.g., rails server), the responsibility to actually start the server is left to the framework. It wouldn’t be pragmatic to force framework developers to implement this logic for every popular app server that may be running the underlying application, and that’s where Rackup comes in and solves this issue by acting as a middleman.

But how does the framework, such as Rails, know which server to use, and how does it activate it? The answer lies in the Rackup::Handler. Puma defines a Puma::RackHandler module under the lib/rack/handler path (which is not loaded from within Puma). Here’s its shortened version:

module Puma
  module RackHandler
    def config(app, options = {})
      require_relative '../../puma'
      require_relative '../../puma/configuration'
      require_relative '../../puma/log_writer'
      require_relative '../../puma/launcher'

      default_options = DEFAULT_OPTIONS.dup

      # ...

      conf = ::Puma::Configuration.new(options, default_options.merge({events: @events})) do |user_config, file_config, default_config|
        # ...

        if options[:environment]
          user_config.environment options[:environment]
        end

        # ...

        user_config.app app
      end
      conf
    end

    def run(app, **options)
      conf = self.config(app, options)

      # ...

      launcher = ::Puma::Launcher.new(conf, :log_writer => log_writer, events: @events)

      yield launcher if block_given?
      begin
        launcher.run
      rescue Interrupt
        puts "* Gracefully stopping, waiting for requests to finish"
        launcher.stop
        puts "* Goodbye!"
      end
    end

    # ...
  end
end

if Object.const_defined?(:Rackup) && ::Rackup.const_defined?(:Handler)
  module Rackup
    module Handler
      module Puma
        class << self
          include ::Puma::RackHandler
        end
      end
      register :puma, Puma
    end
  end
else
  do_register = Object.const_defined?(:Rack) && ::Rack.release < '3'
  module Rack
    module Handler
      module Puma
        class << self
          include ::Puma::RackHandler
        end
      end
    end
  end
  ::Rack::Handler.register(:puma, ::Rack::Handler::Puma) if do_register
end

The final if-statement is key here - whenever this file is loaded, Puma registers the defined handler in the internal Rackup registry. We’ll learn where this file is loaded from soon. The branching exists because, starting with Rack 3, Rackup was extracted into its own gem.

The handler serves as an alternative to Puma’s CLI class. Similarly, it creates a launcher and applies configuration, albeit slightly differently.

Whenever Puma is started using this handler (i.e., not via a Puma executable), code in the CLI class that we’ve already seen will never be executed. And vice versa, whenever Puma is started using its own executable, the Rackup handler is never called.

Rackup::Server, which is a different component, is responsible for building the Rack app (the process we’ll cover in the next section) and for invoking the appropriate handler by calling a run method, which we’ve seen defined in Puma::RackHandler. The server itself gets called by framework internals. In the case of Rails, it’s called from the server command handler.

How do the framework and Rackup know which server to use? Unless the name of a specific server is explicitly passed as a framework-specific CLI argument or an environment variable, Rackup will try to load the first available handler from a list of servers popular enough to be hardcoded in Rackup itself as part of a constant. This is why Puma::RackHandler is defined under the lib/rack/handler/puma.rb path - it is a convention that Rackup expects app servers to follow so it can seamlessly load them on-demand.

At this point, Puma drops out of the picture, so we won’t cover the rest. However, if you wish to learn a bit more about the integration between Rack, Rackup, the app server, and the framework, it’s recommended to read this piece of Rails documentation. Even though it’s specific to Rails, similar principles apply to other frameworks.

Loading the App

We have looked at two alternative ways by which Puma gets to create and run a Launcher. However, we have not yet seen an important part of the boot process - requiring the application and user-defined code.

If you have worked with any Ruby web applications in the past, you have most definitely seen the config.ru file, usually located at the root of the project. It’s called a ‘Rackup file’, but it doesn’t actually require Rackup to be used - we’ll soon see the details. As an example, here are the contents of a basic Rackup file in a freshly generated Rails application:

# config.ru

# This file is used by Rack-based servers to start the application.

require_relative "config/environment"

run Rails.application
Rails.application.load_server

Notice the require_relative directive with config/environment as an argument - this loads the Rails application. A Rackup file is an entrypoint to user-defined code and is used regardless of how the Puma server is started, so it is expected to do all the necessary loading-related work.

Apart from requiring all the user-defined code, this file’s other responsibility is to build the Rack application. It is done using a special DSL defined by Rack::Builder - that’s where the run method comes from in the snippet above. In this case, Rails.application abstracts away all Rack middlewares, along with Rails’ internal code that implements the mapping of routes to controllers. Rack simply accepts a prepared object that responds to call and does all the necessary work.

Now, at what point is this user-provided file evaluated? It depends on the way in which the Puma server is started.

Let’s look at the way the Rackup file will be loaded in case Puma is started using its executable. You might remember this part from CLI’s initializer:

module Puma
  class CLI
    # ...

    def initialize(...)
      # ...

      begin
        @parser.parse! @argv

        if file = @argv.shift
          @conf.configure do |user_config, file_config|
            file_config.rackup file
          end
        end
      rescue UnsupportedOption
        exit 1
      end

      # ...
    end

    # ...
  end
end

As we’ve already established, Rackup itself is not involved in the process of booting up a Puma server when it’s started using Puma CLI. However, the CLI object explicitly configures the rackup option with a filename which can optionally be passed to the puma command. The filename is expected to point to a Rackup file, and as mentioned earlier, this file is relevant for non-Rackup startup procedures, despite its name.

Let’s take a look at the Configuration class again:

module Puma
  class Configuration
    # ...

    class ConfigMiddleware
      def initialize(config, app)
        @config = config
        @app = app
      end

      def call(env)
        env[Const::PUMA_CONFIG] = @config
        @app.call(env)
      end
    end

    def app_configured?
      @options[:app] || File.exist?(rackup)
    end

    def rackup
      @options[:rackup]
    end

    def app
      found = options[:app] || load_rackup

      if @options[:log_requests]
        require_relative 'commonlogger'
        logger = @options[:logger]
        found = CommonLogger.new(found, logger)
      end

      ConfigMiddleware.new(self, found)
    end

    # ...

    private

    def rack_builder
      if @puma_bundler_pruned
        begin
          require 'bundler/setup'
        rescue LoadError
        end
      end

      begin
        require 'rack'
        require 'rack/builder'
        ::Rack::Builder
      rescue LoadError
        require_relative 'rack/builder'
        Puma::Rack::Builder
      end
    end

    def load_rackup
      raise "Missing rackup file '#{rackup}'" unless File.exist?(rackup)

      rack_app, rack_options = rack_builder.parse_file(rackup)

      # ...

      rack_app
    end

    # ...
  end
end

Whenever Configuration#app is called for the first time, it triggers a chain of other method calls that ultimately result in the Rackup file being loaded:

found = options[:app] || load_rackup

The load_rackup method uses a previously mentioned Rack::Builder class by calling the rack_builder method. It’s interesting that Puma comes with a minimal version of the builder that it can fall back to in case the rack gem is unavailable. For example, this can happen if puma is invoked from outside the Bundler context (i.e., not via bundle exec), while rack is not installed as a system-wide gem.

The @puma_bundler_pruned conditional (set from env.key? 'PUMA_BUNDLER_PRUNED' in the initializer) ensures that Puma picks up the proper rack from the Bundler context, instead of the one installed as a system-wide dependency (via gem install rack), in case the user enables Bundler pruning, which unloads gems that are part of the bundle from the load path. We won’t be taking a deeper look into the pruning mechanism, but we’ll take note that it’s relevant pretty much exclusively to hot restarts.

When the Rackup file is read and the Rack app becomes available, Puma wraps it in a ConfigMiddleware, which adds the configuration object into the environment hash available for every request being processed. This can be used for debugging purposes within the application.
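
For example, a Rack app served by Puma can read the configuration back out of the env hash (Const::PUMA_CONFIG is simply the string 'puma.config'):

# config.ru, assuming Puma is the server
run lambda { |env|
  config = env[Puma::Const::PUMA_CONFIG]
  [200, { 'content-type' => 'text/plain' }, ["max_threads: #{config.options[:max_threads]}"]]
}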

Even though it does not fall within Puma’s domain, let’s look at the way parse_file is implemented in the original Rack::Builder, since it involves some Ruby curiosities. Here’s a version of this method altered for brevity:

module Rack; end
Rack::BUILDER_TOPLEVEL_BINDING = ->(builder){builder.instance_eval{binding}}

module Rack
  class Builder
    # ...

    def self.parse_file(path, **options)
      # ...

      builder_script = ::File.read(path)

      # ...

      builder = self.new(**options)

      binding = BUILDER_TOPLEVEL_BINDING.call(builder)
      eval(builder_script, binding, path)

      builder.to_app
    end

    # ...
  end
end

The interesting part here is the usage of eval with an explicit binding argument. An alternative way to implement this logic would be to use the instance_eval method on an instance of a Builder directly. However, this would negatively impact the way constants would get resolved from within the Rackup file, which could be demonstrated using the following example:

module Rack
  class Builder
    class CustomConstant
    end
  end
end

class CustomConstant
end

puts Rack::Builder::CustomConstant.object_id
# => 60
puts CustomConstant.object_id
# => 80

script_string = 'puts CustomConstant.object_id'

# Direct instance_eval with a string argument
builder = Rack::Builder.new
builder.instance_eval(script_string)
# => 60
# Resolves `CustomConstant` to the constant nested under `Rack::Builder`

# eval with a binding yielded from instance_eval
BUILDER_TOPLEVEL_BINDING = ->(builder) { builder.instance_eval { binding } }
builder_binding = BUILDER_TOPLEVEL_BINDING.call(builder)
eval(script_string, builder_binding)
# => 80
# Resolves `CustomConstant` to the top-level constant

The approach that involves a binding object yielded from instance_eval is more appropriate, as it avoids potential collisions of constant names.

instance_eval is needed here so that Builder’s DSL can be utilized in the file directly, i.e., methods like run, use, and others. In previous versions, Rack interpolated the contents of a configuration file in the following fashion:

eval("Rack::Builder.new {\n" + builder_script + "\n}.to_app", TOPLEVEL_BINDING, ...)
# `TOPLEVEL_BINDING` is a constant defined by the Ruby VM which unsurprisingly points to the global 'root' binding object.

But this made the Ruby VM ignore any magic comments potentially defined at the beginning of the Rackup file, such as frozen_string_literal: true and others, since technically, they would be preceded by a Rack::Builder.new line and therefore not be located at the start. The trick of providing a binding object as a second argument to eval allows us to circumvent that and have magic comments parsed correctly.
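
This is easy to verify: eval honors a magic comment only when it’s the very first line of the evaluated string:

src = "# frozen_string_literal: true\n'hello'.frozen?"

eval(src)           # => true, the magic comment leads the script
eval("nil\n" + src) # => false, the comment no longer comes first (Ruby may print a warning)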

Let’s return to the Configuration#app method now:

module Puma
  class Configuration
    def app
      found = options[:app] || load_rackup
      # ...
    end
  end
end

When the Puma server is launched using Rackup (e.g., rails server), the load_rackup method will never be invoked, as Rackup will build the app itself and inject it into options under the :app key. This means that Puma is not responsible for loading the application when a framework command is used.

An important implication of this is that, at least in the case of Rails, all framework code, including Bundler.setup(...), and sometimes user-defined code, will get loaded before Puma evaluates its configuration file. This is not the case when puma is invoked directly.

By the time run is called on a Launcher instance from the CLI object, the Rackup file is not yet evaluated. The exact moment at which a Rackup file gets loaded, whenever Puma is started using its own executable, depends on factors that will only be unveiled once we dive deeper into the Launcher and the adjacent components.

Entering Active Duty

We have looked at the two main ways in which a Puma server gets booted, both of which culminate in the initialisation of a Launcher object, with the run method being called on the instance.

Initialising the Launcher

Let’s look at Launcher#initialize:

module Puma
  class Launcher
    def initialize(conf, launcher_args={})
      @runner        = nil
      @log_writer    = launcher_args[:log_writer] || LogWriter::DEFAULT
      @events        = launcher_args[:events] || Events.new
      @argv          = launcher_args[:argv] || []
      @original_argv = @argv.dup
      @config        = conf

      env = launcher_args.delete(:env) || ENV
      # ...

      @config.load

      @binder        = Binder.new(@log_writer, conf)
      @binder.create_inherited_fds(ENV).each { |k| ENV.delete k }
      @binder.create_activated_fds(ENV).each { |k| ENV.delete k }

      @environment = conf.environment

      if ENV["NOTIFY_SOCKET"] && !Puma.jruby?
        @config.plugins.create('systemd')
      end

      if @config.options[:bind_to_activated_sockets]
        @config.options[:binds] = @binder.synthesize_binds_from_activated_fs(
          @config.options[:binds],
          @config.options[:bind_to_activated_sockets] == 'only'
        )
      end

      @options = @config.options

      # ...

      generate_restart_data

      # ...

      Dir.chdir(@restart_dir)

      prune_bundler!

      @environment = @options[:environment] if @options[:environment]
      set_rack_environment

      if clustered?

        # ...

        @runner = Cluster.new(self)
      else
        @runner = Single.new(self)
      end

      # ...

      @status = :run

      # ...
    end
  end
end

The first important thing that happens here is the @config.load call, the contents of which we have already examined. This is where Puma evaluates the configuration file.

Next is the initialization of a Binder. We will look at its internals later, but for now, we can establish that this class makes Puma bind to configured addresses (set using the binds option) and start listening on them.

We’ll skip looking into Binder#create_inherited_fds for now, as it’s only relevant during the restart process, which we will cover separately.

The Binder#create_activated_fds call and the following two conditionals are related to Puma’s systemd integration. If you’re not familiar with systemd, it can be thought of as a service manager for Linux - a way to supervise long-running processes that serves as an alternative to the containerization approach. We will not cover the integration in great detail, but we’ll mention that Puma supports systemd’s socket activation feature, which allows Puma to use sockets already opened by systemd itself.

@binder.synthesize_binds_from_activated_fs is responsible for rebuilding the binds configuration option based on activated sockets.

The next steps in the launcher initialization process are calls to generate_restart_data, Dir.chdir(@restart_dir), and prune_bundler!, all of which facilitate the restart logic. We will cover the former two later.

Lastly, the launcher sets the RACK_ENV environment variable to the configured value representing the environment in which Puma was launched, and creates a @runner. The clustered? method checks the configured number of workers via (@options[:workers] || 0) > 0, which dictates whether Puma should run in a single process or fork workers. At this point, we will look into the Single version of the runner, and dive into Cluster, which is activated when at least one worker is explicitly configured, afterwards. Finally, the @status instance variable is set to :run.

Starting the Main Loop

Now let’s look at Launcher#run:

module Puma
  class Launcher
    def run
      previous_env = get_env

      # ...

      @config.plugins.fire_starts self

      setup_signals

      # ...

      @runner.run

      do_run_finished(previous_env)
    end
  end
end

The call to get_env returns the current ENV hash, which contains environment variables, and this is passed to do_run_finished at the very end, presumably when the server’s run has ended. This is another concern relevant for restarts and shutdowns, so we will not cover it just yet.

@config.plugins is an instance of PluginLoader, which, as we have already seen, gets created when the Configuration object is initialized. The fire_starts method iterates over all loaded plugins and calls start on them if they have it defined.

Trapping Signals

Next, run calls setup_signals. Signals are one of the ways to implement inter-process communication. If you need a refresher or want to dive into Ruby’s internal signal handling, you can read the corresponding section in my other article on Sidekiq Internals.
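
As a quick illustration (the PID file path is hypothetical), these handlers can be exercised from another process:

pid = File.read('tmp/pids/puma.pid').to_i

Process.kill('SIGUSR2', pid) # full restart
Process.kill('SIGUSR1', pid) # phased restart (falls back to a full restart in single mode)
Process.kill('SIGTERM', pid) # graceful stop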

module Puma
  class Launcher
    def setup_signals
      begin
        Signal.trap "SIGUSR2" do
          restart
        end
      rescue Exception
        log "*** SIGUSR2 not implemented, signal based restart unavailable!"
      end

      unless Puma.jruby?
        begin
          Signal.trap "SIGUSR1" do
            phased_restart
          end
        rescue Exception
          log "*** SIGUSR1 not implemented, signal based restart unavailable!"
        end
      end

      begin
        Signal.trap "SIGTERM" do
          do_graceful_stop

          raise(SignalException, "SIGTERM") if @options[:raise_exception_on_sigterm]
        end
      rescue Exception
        log "*** SIGTERM not implemented, signal based gracefully stopping unavailable!"
      end

      begin
        Signal.trap "SIGINT" do
          stop
        end
      rescue Exception
        log "*** SIGINT not implemented, signal based gracefully stopping unavailable!"
      end

      begin
        Signal.trap "SIGHUP" do
          if @runner.redirected_io?
            @runner.redirect_io
          else
            stop
          end
        end
      rescue Exception
        log "*** SIGHUP not implemented, signal based logs reopening unavailable!"
      end

      begin
        unless Puma.jruby?
          Signal.trap "SIGINFO" do
            thread_status do |name, backtrace|
              @log_writer.log(name)
              @log_writer.log(backtrace.map { |bt| "  #{bt}" })
            end
          end
        end
      rescue Exception
      end
    end

    # ...

    def thread_status
      Thread.list.each do |thread|
        name = "Thread: TID-#{thread.object_id.to_s(36)}"
        name += " #{thread['label']}" if thread['label']
        name += " #{thread.name}" if thread.respond_to?(:name) && thread.name
        backtrace = thread.backtrace || ["<no backtrace available>"]

        yield name, backtrace
      end
    end
  end
end

Each Signal.trap registers a corresponding signal handler. SIGUSR1 triggers a phased restart and SIGUSR2 a full restart, while SIGINT and SIGTERM stop Puma.

The SIGHUP signal is most commonly sent when a Puma process is launched from a terminal, and its handler has two branches. If stdout_redirect was configured (which translates into the redirect_stdout, redirect_stderr, and redirect_append configuration options), the handler will simply redirect the output and error streams to the specified destinations using IO.reopen. This was primarily useful for daemonizing Puma processes, but daemonization support was removed in version 5.0, so it’s unclear whether this branch of the SIGHUP handler is still useful.

If stdout_redirect is not configured, the SIGHUP handler will simply stop Puma in the same way SIGINT does. This behaviour is especially relevant in the development environment, where a forcefully shut down terminal will send a SIGHUP signal to an attached process on most systems. Without explicitly calling stop in the handler, Puma won’t be able to restart, as the original process would not be terminated and would continue to occupy the specified port.

The SIGINFO handler is extremely handy for debugging stuck Puma processes, as it logs the stack traces of all active threads using Thread.list. However, this signal is only available on BSD systems.

Let’s examine the methods invoked from the signal handlers:

module Puma
  class Launcher
    # ...

    def stop
      @status = :stop
      @runner.stop
    end

    def restart
      @status = :restart
      @runner.restart
    end

    def phased_restart
      unless @runner.respond_to?(:phased_restart) and @runner.phased_restart
        log "* phased-restart called but not available, restarting normally."
        return restart
      end
      true
    end

    # ...

    def do_graceful_stop
      @events.fire_on_stopped!
      @runner.stop_blocked
    end

    # ...
  end
end

All of them end up calling a method on the @runner instance, while some of them mutate the @status instance variable.

phased_restart falls back to a plain restart, as it’s only available when the runner is set to Cluster. We will revisit the Cluster runner in a later section.

We will dive deeper into each method invoked on @runner later. For now, let’s return to the rest of the Launcher#run method. The final two calls that we have yet to cover are @runner.run and do_run_finished:

module Puma
  class Launcher
    # ...

    def run
      previous_env = get_env

      # ...

      @runner.run

      do_run_finished(previous_env)
    end

    # ...

    def do_run_finished(previous_env)
      case @status
      when :halt
        do_forceful_stop
      when :run, :stop
        do_graceful_stop
      when :restart
        do_restart(previous_env)
      end

      close_binder_listeners unless @status == :restart
    end
  end
end

The do_run_finished call gives us a strong hint that signal handlers are expected to run before it gets invoked, as it relies on the @status variable being updated. Based on that, we can infer that @runner.run is a blocking call where the Puma main thread spends the majority of the process’ lifetime. But before we dive into @runner.run, let’s take a peek at what happens in methods invoked from do_run_finished, which seemingly gets called once @runner.run returns. So far, we can assume that this happens when the corresponding @runner methods are called from within signal handlers, such as @runner.stop and others.

module Puma
  class Launcher
    # ...

    def do_forceful_stop
      log "* Stopping immediately!"
      @runner.stop_control
    end

    def do_graceful_stop
      @events.fire_on_stopped!
      @runner.stop_blocked
    end

    def do_restart(previous_env)
      log "* Restarting..."
      ENV.replace(previous_env)
      @runner.stop_control
      restart!
    end

    # ...

    def close_binder_listeners
      @runner.close_control_listeners
      @binder.close_listeners
      unless @status == :restart
        log "=== puma shutdown: #{Time.now} ==="
        log "- Goodbye!"
      end
    end
  end
end

The nature of the code executed as the last step of the Puma lifecycle depends on which of these methods are called, which in turn depends on the value set for the @status variable. As we’ve seen, one way to modify it is by sending an appropriate signal to the Puma process.

We will not unravel the contents of all @runner methods and @binder.close_listeners right now. Instead, we will focus on @runner.run next specifically.

Final Preparations

With the upfront setup of signal handlers complete, the launcher delegates flow control to @runner.run. At this point, the Puma process has not yet bound to the specified endpoints, let alone started processing requests.

As we’ve already seen, the class of the @runner object depends on how Puma was launched - it can either be Cluster or Single, depending on the number of workers specified. Both classes are descendants of Runner, which contains logic shared between them. In Puma’s terminology, a worker is a separate forked child process and the Cluster runner contains additional logic for managing these workers. We will focus on the Single version of the runner in this section, as it allows us to concentrate on Puma’s networking behaviour without paying attention to the complexity of managing multiple processes.

Internals of Cluster and other details of running Puma in forked mode will be covered in a future section.

Here’s how Single#run and Runner#initialize look:

module Puma
  class Single < Runner
    def run
      # ...

      load_and_bind

      Plugins.fire_background

      @launcher.write_state
      start_control

      @server = server = start_server
      server_thread = server.run

      # ...

      @events.fire_on_booted!

      # ...

      begin
        server_thread.join
      rescue Interrupt
      end
    end
  end
end

module Puma
  class Runner
    def initialize(launcher)
      @launcher = launcher
      @log_writer = launcher.log_writer
      @events = launcher.events
      @config = launcher.config
      @options = launcher.options
      @app = nil
      @control = nil
      @started_at = Time.now
      @wakeup = nil
    end

    # ...

    def load_and_bind
      unless @config.app_configured?
        error "No application configured, nothing to run"
        exit 1
      end

      begin
        @app = @config.app
      rescue Exception => e
        log "! Unable to load application: #{e.class}: #{e.message}"
        raise e
      end

      @launcher.binder.parse @options[:binds]
    end

    # ...

    def app
      @app ||= @config.app
    end

    # ...
  end
end

The load_and_bind method is where the actual Ruby program, whether it’s a Rails application or a plain Rackup file, gets loaded, but only if Puma was started using its own CLI command. If Puma was launched via a framework-specific command (e.g., rails server), the application would already have been loaded by this point, since the internal rails server command evaluates the Rackup file before reaching Launcher#run.

The @launcher.binder.parse call from load_and_bind is the last step in the process of starting Puma’s main loop, which we’ll soon cover. Before we dive into that, let’s take a look at the rest of Single#run.

The Plugins.fire_background method delegates to PluginRegistry, which iterates over all blocks provided to in_background methods invoked within plugins defined via Puma::Plugin.create. This is the point where Puma spawns a thread for each registered block and starts executing it.

Next, @launcher.write_state and start_control are executed. You might have noticed references to control_app, control_url, and control_listeners in the code we’ve already reviewed. Puma comes with an optional control app that allows dynamic interaction with the running server. We won’t go into its internal workings, but we should note that write_state creates and persists a state file that supports some of the control app’s functions, while start_control starts the control app itself.

The part of Launcher#write_state that is most relevant to us right now is the call to write_pid, which creates a PID file if Puma is configured to do so:

module Puma
  class Launcher
    def write_pid
      path = @options[:pidfile]
      return unless path
      cur_pid = Process.pid
      File.write path, cur_pid, mode: 'wb:UTF-8'
      at_exit do
        delete_pidfile if cur_pid == Process.pid
      end
    end
  
    # ...

    def delete_pidfile
      path = @options[:pidfile]
      File.unlink(path) if path && File.exist?(path)
    end
  end
end

PID files are relevant to supervisors such as systemd, monit, and others. They signal to the supervisor whether a particular process is up and running. However, PID files are not used in most containerized setups, including containerd (which is used by Docker). As an example of PID file usage, Rackup, which is used to launch a Rack app in case Puma is started without using its own executable, aborts startup if it detects an existing PID file whose PID does not match that of the current process.

The next two lines are the most important, as they are responsible for launching the Server:

module Puma
  class Single < Runner
    def run
      # ...

      @server = server = start_server
      server_thread = server.run

      # ...

      @events.fire_on_booted!

      # ...

      begin
        server_thread.join
      rescue Interrupt
      end
    end
  end
end

module Puma
  class Runner
    def start_server
      server = Puma::Server.new(app, @events, @options)
      server.inherit_binder(@launcher.binder)
      server
    end
  end
end

A call to @events.fire_on_booted! is a great display of the way the Events pub-sub container is utilized by Puma. This particular dispatch signals that a Puma server has just been started and is ready to accept clients. This is the point where all blocks registered using the DSL#on_booted method get called, most of which are provided by third-party integrations and plugins.
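
For instance, a block can be subscribed to this very moment straight from the configuration file (the log line is illustrative):

# config/puma.rb
on_booted do
  puts 'Puma is ready to accept connections'
end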

We have also finally discovered what the main thread does for most of Puma’s main process lifetime: it blocks on a Thread#join, waiting for the server thread to finish. This stands in contrast to the way Sidekiq conducts its main thread, which consists of repeatedly reading from a pipe used by signal traps to delegate execution of the actual handler logic to the main thread itself. While the Launcher thread is not performing such reads, this does not mean that Puma shuns this approach, as we’ll soon learn after digging deeper.

The way requests get processed by the Server will finally be revealed in a future section after the inner workings of the Binder are unraveled. The binder instance that was created in Launcher’s initializer can be seen being explicitly propagated to a Server instance using inherit_binder.

Setting Up the Addresses

After reading and applying the configuration, Puma finally starts listening on the specified endpoints. This happens in the Binder#parse method, which gets called from load_and_bind during Single#run:

module Puma
  class Binder
    def initialize(log_writer, conf = Configuration.new, env: ENV)
      @log_writer = log_writer
      @conf = conf
      @listeners = []
      @inherited_fds = {}
      @activated_sockets = {}
      @unix_paths = []
      @env = env

      @proto_env = {
        "rack.version".freeze => RACK_VERSION,
        "rack.errors".freeze => log_writer.stderr,
        "rack.multithread".freeze => conf.options[:max_threads] > 1,
        "rack.multiprocess".freeze => conf.options[:workers] >= 1,
        "rack.run_once".freeze => false,
        RACK_URL_SCHEME => conf.options[:rack_url_scheme],
        "SCRIPT_NAME".freeze => env['SCRIPT_NAME'] || "",
        "QUERY_STRING".freeze => "",
        SERVER_SOFTWARE => PUMA_SERVER_STRING,
        GATEWAY_INTERFACE => CGI_VER
      }

      @envs = {}
      @ios = []
    end

    # ...

    def parse(binds, log_writer = nil, log_msg = 'Listening')
      log_writer ||= @log_writer
      binds.each do |str|
        uri = URI.parse str
        case uri.scheme
        when "tcp"
          if fd = @inherited_fds.delete(str)
            io = inherit_tcp_listener uri.host, uri.port, fd
            log_writer.log "* Inherited #{str}"
          elsif sock = @activated_sockets.delete([ :tcp, uri.host, uri.port ])
            io = inherit_tcp_listener uri.host, uri.port, sock
            log_writer.log "* Activated #{str}"
          else
            ios_len = @ios.length
            params = Util.parse_query uri.query

            low_latency = params.key?('low_latency') && params['low_latency'] != 'false'
            backlog = params.fetch('backlog', 1024).to_i

            io = add_tcp_listener uri.host, uri.port, low_latency, backlog

            @ios[ios_len..-1].each do |i|
              addr = loc_addr_str i
              log_writer.log "* #{log_msg} on http://#{addr}"
            end
          end

          @listeners << [str, io] if io
        when "unix"
          # ...
          @listeners << [str, io]
        when "ssl"
          # ...
          @listeners << [str, io] if io
        else
          log_writer.error "Invalid URI: #{str}"
        end
      end

      @inherited_fds.each do |str, fd|
        log_writer.log "* Closing unused inherited connection: #{str}"

        begin
          IO.for_fd(fd).close
        rescue SystemCallError
        end

        uri = URI.parse str
        if uri.scheme == "unix"
          path = "#{uri.host}#{uri.path}"
          File.unlink path
        end
      end

      unless @activated_sockets.empty?
        fds = @ios.map(&:to_i)
        @activated_sockets.each do |key, sock|
          next if fds.include? sock.to_i
          log_writer.log "* Closing unused activated socket: #{key.first}://#{key[1..-1].join ':'}"
          begin
            sock.close
          rescue SystemCallError
          end

          File.unlink key[1] if key.first == :unix
        end
      end
    end
  end
end

As can be seen, parse iterates over the passed binds array, which is taken directly from the configuration. When Puma is supposed to bind to a localhost address, the array contains a single string: tcp://localhost:3000.

Each bind string is then parsed using URI::parse and converted into an object whose class depends on the URI itself. If it’s a tcp or unix URI, then it will be converted into Ruby’s URI::Generic object, which implements RFC 2396. This object exposes URI parts as methods.

During iteration, a case-when statement evaluates each bind, which invokes appropriate logic based on the bind’s scheme. There are three possibilities here: tcp, unix, and ssl. Binds whose scheme is unix are UNIX domain sockets, which can be used for interprocess communication in the scope of a single machine. Logic that sets up ssl binds is similar to that which is responsible for tcp, but with the extra responsibility of handling TLS/SSL concerns. We will not be looking into how exactly binds of these two schemes get configured in this article, since after the Binder finishes its work, there’s no significant difference between these sockets from the perspective of the server.
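
For illustration, here’s what a hypothetical config/puma.rb could look like when listening on both a TCP and a UNIX socket (the socket path is made up):

# config/puma.rb
bind 'tcp://0.0.0.0:3000'
bind 'unix:///var/run/puma.sock'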

Let’s examine the tcp branch in more detail:

# ...
case uri.scheme
when "tcp"
  if fd = @inherited_fds.delete(str)
    io = inherit_tcp_listener uri.host, uri.port, fd
    log_writer.log "* Inherited #{str}"
  elsif sock = @activated_sockets.delete([ :tcp, uri.host, uri.port ])
    io = inherit_tcp_listener uri.host, uri.port, sock
    log_writer.log "* Activated #{str}"
  else
    ios_len = @ios.length
    params = Util.parse_query uri.query

    low_latency = params.key?('low_latency') && params['low_latency'] != 'false'
    backlog = params.fetch('backlog', 1024).to_i

    io = add_tcp_listener uri.host, uri.port, low_latency, backlog

    @ios[ios_len..-1].each do |i|
      addr = loc_addr_str i
      log_writer.log "* #{log_msg} on http://#{addr}"
    end
  end
# ...
end
# ...

The last branch is the most important out of the three here. Code utilising ios_len, which includes the last @ios[ios_len..-1] iteration, is simply there to log addresses being listened on without duplication. We’ll take note of how it is possible to specify low_latency and backlog parameters on the individual URI itself (e.g., tcp://0.0.0.0:3000?low_latency=true&backlog=1024) before taking a brief look at the other two branches, after which we’ll examine add_tcp_listener.

The first branch is used for inheriting sockets after Puma restarts. The second one is related to systemd socket activation, which we are not covering. Both @inherited_fds and @activated_sockets get populated during the initialisation of the Launcher instance, before the call to parse is made, which is why their entries get consumed with delete here.

You might wonder how inherited and systemd-activated sockets ever get registered, considering that they are still expected to appear in the binds configuration option.

Inherited binds get passed to the restarted Puma process using a slightly different mechanism. We’ll take a look at how this is achieved by the restart process later.

With sockets activated via systemd, the situation is also a little bit different. If you look back at Launcher#initialize, you will see the following:

module Puma
  class Launcher
    def initialize(...)
      # ...

      if @config.options[:bind_to_activated_sockets]
        @config.options[:binds] = @binder.synthesize_binds_from_activated_fs(
          @config.options[:binds],
          @config.options[:bind_to_activated_sockets] == 'only'
        )
      end

      # ...
    end
  end
end

In case the bind_to_activated_sockets DSL option is configured, Binder will automatically add sockets activated by systemd to the binds array. If this option is not configured, it’s expected that activated sockets get passed explicitly as Puma CLI arguments. If you want to learn how binds are propagated from systemd in the first place, refer to the corresponding Puma documentation page.

Binding

Let’s now look at add_tcp_listener:

module Puma
  class Binder
    # ...

    def add_tcp_listener(host, port, optimize_for_latency=true, backlog=1024)
      if host == "localhost"
        loopback_addresses.each do |addr|
          add_tcp_listener addr, port, optimize_for_latency, backlog
        end
        return
      end

      host = host[1..-2] if host&.start_with? '['
      tcp_server = TCPServer.new(host, port)

      if optimize_for_latency
        tcp_server.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
      end
      tcp_server.setsockopt(Socket::SOL_SOCKET,Socket::SO_REUSEADDR, true)
      tcp_server.listen backlog

      @ios << tcp_server
      tcp_server
    end

    # ...
  
    def loopback_addresses
      t = Socket.ip_address_list.select do |addrinfo|
        addrinfo.ipv6_loopback? || addrinfo.ipv4_loopback?
      end
      t.map! { |addrinfo| addrinfo.ip_address }; t.uniq!; t
    end

    # ...
  end
end

First, this method covers the edge case when ‘localhost’ is specified as the host. This gets translated into recursive calls to add_tcp_listener for each loopback address, which on most systems would be IPv4’s 127.0.0.1 and IPv6’s ::1.

In case the given address is IPv6, it’s transformed from the URI format ([::1]) to the IP address format (::1).

A TCPServer is then instantiated, which is simply a plain Ruby abstraction over a socket that acts as a server socket. Let’s examine what happens when TCPServer::new is called, which is implemented directly in C:

// ext/socket/tcpserver.c

static VALUE
tcp_svr_init(int argc, VALUE *argv, VALUE sock)
{
    VALUE hostname, port;

    rb_scan_args(argc, argv, "011", &hostname, &port);
    return rsock_init_inetsock(sock, hostname, port, Qnil, Qnil, INET_SERVER, Qnil, Qnil);
}

rsock_init_inetsock is a hefty function, which in the case of tcp_svr_init performs the socket and bind system calls.

// ext/socket/ipsocket.c

static VALUE
init_inetsock_internal(VALUE v)
{
    // ...
        status = rsock_socket(res->ai_family,res->ai_socktype,res->ai_protocol);
        syscall = "socket(2)";
        fd = status;
        if (fd < 0) {
            error = errno;
            continue;
        }
        arg->fd = fd;
        if (type == INET_SERVER) {
#if !defined(_WIN32) && !defined(__CYGWIN__)
            status = 1;
            setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
                       (char*)&status, (socklen_t)sizeof(status));
#endif
            status = bind(fd, res->ai_addr, res->ai_addrlen);
            syscall = "bind(2)";
        }
    // ...

The bind syscall is made directly, while socket is abstracted away in rsock_socket. There’s an interesting detail hidden away there that makes it worth taking a look at:

// ext/socket/init.c

int
rsock_socket(int domain, int type, int proto)
{
    int fd;

    fd = rsock_socket0(domain, type, proto);
    if (fd < 0) {
        if (rb_gc_for_fd(errno)) {
            fd = rsock_socket0(domain, type, proto);
        }
    }
    if (0 <= fd)
        rb_update_max_fd(fd);
    return fd;
}

// ...

static int
rsock_socket0(int domain, int type, int proto)
{
#ifdef SOCK_CLOEXEC
    type |= SOCK_CLOEXEC;
#endif

#ifdef SOCK_NONBLOCK
    type |= SOCK_NONBLOCK;
#endif

    int result = socket(domain, type, proto);

    if (result == -1)
        return -1;

    rb_fd_fix_cloexec(result);

#ifndef SOCK_NONBLOCK
    rsock_make_fd_nonblock(result);
#endif

    return result;
}

// ...

When a socket is opened, it’s preconfigured with the SOCK_CLOEXEC and SOCK_NONBLOCK options.

The CLOEXEC option signifies that a socket should be closed when an exec-like system call is made (e.g. execve). This usually does not happen in application code (i.e. it’s unlikely that a call to Kernel.exec happens within an HTTP endpoint handler), but it’s a noteworthy detail. The very same method is used to initialise a plain Socket, which means that normally, all sockets opened using the Ruby API will have this flag set and will be closed automatically when a process is replaced via exec. This will become extra relevant once we cover restarts.

The NONBLOCK flag makes the opened file descriptor function in nonblocking mode. This will also become extremely important later as we progress further.
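
Both flags can be verified directly from Ruby. A quick sanity check (not Puma code; behaviour shown for Linux):

require 'socket'
require 'io/nonblock'

server = TCPServer.new('127.0.0.1', 0) # port 0 lets the OS pick a free one

server.close_on_exec? # => true - SOCK_CLOEXEC was applied
server.nonblock?      # => true - SOCK_NONBLOCK was applied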

Listening

At this point, the file descriptor for the socket is obtained, but Puma hasn’t yet started listening on it. Binder does a few more things in add_tcp_listener:

module Puma
  class Binder
    # ...

    def add_tcp_listener(host, port, optimize_for_latency=true, backlog=1024)
      # ...

      tcp_server = TCPServer.new(host, port)

      if optimize_for_latency
        tcp_server.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
      end
      tcp_server.setsockopt(Socket::SOL_SOCKET,Socket::SO_REUSEADDR, true)
      tcp_server.listen backlog

      @ios << tcp_server
      tcp_server
    end

    # ...
  end
end

If the bind address includes the low_latency parameter, the TCP_NODELAY flag is set on the socket using setsockopt. When enabled, it disables Nagle’s algorithm.

In essence, Nagle’s algorithm makes it so that data is not always immediately sent over the network, but buffered until there is a sufficient amount to send out or until an acknowledgment for the previous packets is received. This lowers the total number of network packets, thus improving throughput, but potentially increasing latency due to the added waiting before data gets sent out.

It’s worth noting that this flag will only be enabled if Puma is passed an explicit bind string with the necessary parameter, which in my experience is rarely done. There is no other way to enable it.

If you’re familiar with some of the discourse around Nagle’s algorithm and if you’re of the opinion that it’s a remnant of a bygone era, you may question why TCP_NODELAY isn’t set by default, considering that some frameworks and languages, like Go and Node.js, do that. While there doesn’t seem to be a definitive answer available, this comment highlights that enabling TCP_NODELAY on listening sockets won’t do much in the case of Puma, since it makes use of TCP_CORK, which takes precedence over TCP_NODELAY. We’ll dive into socket corking and what it does once we start looking into how Puma writes the responses.

Let’s get a little bit ahead of ourselves: you probably already realise that the socket we’re examining will start calling accept at some point, and you might ask why the TCP_NODELAY option is set on that listening socket, considering that actual writes happen on the sockets spawned from accept, not on the ‘server’ socket. The answer is that accepted sockets inherit this option from the listening ‘parent’. Interestingly, this behaviour might be unintentional on Linux, considering that the manpage explicitly mentions that new sockets returned by accept do not inherit options of the listening socket. Regardless of whether this is expected or not, this behaviour can be observed on many systems, and thus Puma has been relying on it for 13 years and counting.

Another option being set on this socket is SO_REUSEADDR. Without setting this option, Puma might not be able to rebind to the same interface in certain situations after it restarts. This will happen due to the port entering the TIME_WAIT state, usually for a duration of MSL multiplied by 2, depending on the system. This state handles potential straggling packets arriving from a previous connection in case the server gets killed. The risks of enabling this option seem to be minimal on modern systems - you can read more about this socket option here and here.

Finally, listen is called on the socket. The sole argument of the Ruby wrapper method is the desired size of the backlog, which gets passed to the underlying system call. This backlog dictates how many incoming connections can remain unaccepted. Any request arriving after the queue becomes full will face an ECONNREFUSED error, which on a higher level usually manifests as the 502 or 504 HTTP status codes. The default socket backlog size in Puma is 1024.

It’s worth noting that this backlog value is not the only moving part that requires tuning when configuring the server for higher capacity. The queue size is capped by the configurable OS value, which is SOMAXCONN on Linux, available at /proc/sys/net/core/somaxconn. It usually has a default of 4096.
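
If you need a larger queue, both knobs therefore have to move together. A hypothetical config/puma.rb raising the application-side value could look like this (the OS cap still applies):

# config/puma.rb
# Still capped by /proc/sys/net/core/somaxconn, so raise that value too if needed.
bind 'tcp://0.0.0.0:3000?backlog=2048'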

Another implicit limit when accepting connections from the listening socket is the per-process resource limit on the number of open file descriptors, also imposed by the OS. Any socket opened by a process correlates to a file descriptor. While it may not necessarily truncate the backlog size outright the way SOMAXCONN does, it won’t be possible to accept new simultaneous connections once the specified limit is reached. However, it’s unusual for a Puma-backed application to exhaust the open FD limit, considering Puma’s threaded nature - running out of worker capacity (i.e. the maximum number of threads serving clients) before hitting the resource limit is much more likely.

In the end, add_tcp_listener and its friends push the socket object into the @ios array. At this point, clients can reach the Puma server and initiate connections. These connections, however, are not yet accepted. In order for this to happen, Puma has to start explicitly accepting them, which happens in Puma::Server.

Establishing Connections

We’ve gone through Puma’s Binder, which parses the specified addresses on which it is supposed to listen for incoming connections, opens the corresponding sockets, and calls bind and listen on them, at which point the server is fully ready to accept connections.

The Single version of Runner then instantiates a Server instance, provides it with the aforementioned Binder object, and calls run on it. Let’s look at it again:

# lib/puma/single.rb

module Puma
  class Single < Runner
    def run
      # ...

      load_and_bind # Calls `Binder#parse`, which we've looked at in the previous section

      # ...

      @server = server = start_server
      server_thread = server.run

      # ...

      @events.fire_on_booted!

      # ...

      begin
        server_thread.join
      rescue Interrupt
      end
    end
  end
end

# lib/puma/runner.rb

module Puma
  class Runner
    def start_server
      server = Puma::Server.new(app, @events, @options)
      server.inherit_binder(@launcher.binder)
      server
    end
  end
end

It’s worth remembering that the main thread of Puma will spend the rest of its runtime waiting on the server_thread to be joined, i.e. for the server to finish.

Now is the time to delve into the inner workings of Server:

module Puma
  class Server
    # ...

    def initialize(app, events = nil, options = {})
      @app = app
      @events = events || Events.new

      @check, @notify = nil
      @status = :stop

      @thread = nil
      @thread_pool = nil

      @options = if options.is_a?(UserFileDefaultOptions)
        options
      else
        UserFileDefaultOptions.new(options, Configuration::DEFAULTS)
      end

      # A lot of `@options` are added as instance variables as shortcuts, most are omitted for brevity
      @queue_requests            = @options[:queue_requests]

      # ...

      ENV['RACK_ENV'] ||= "development"

      @mode = :http

      @precheck_closing = true

      @requests_count = 0

      @idle_timeout_reached = false
    end

    def inherit_binder(bind)
      @binder = bind
    end

    # ...
  end
end

The initializer sets some instance variables that will soon become relevant. Some of them are extracted from the Puma configuration options (omitted in the snippet above), while others are preset to either nil or starting values.

Let’s look at the run method:

module Puma
  class Server
    # ...

    def run(background=true, thread_name: 'srv')
      BasicSocket.do_not_reverse_lookup = true

      @events.fire :state, :booting

      @status = :run

      @thread_pool = ThreadPool.new(thread_name, options) { |client| process_client client }

      if @queue_requests
        @reactor = Reactor.new(@io_selector_backend) { |c| reactor_wakeup c }
        @reactor.run
      end

      @thread_pool.auto_reap! if options[:reaping_time]
      @thread_pool.auto_trim! if options[:auto_trim_time]

      @check, @notify = Puma::Util.pipe unless @notify

      @events.fire :state, :running

      if background
        @thread = Thread.new do
          Puma.set_thread_name thread_name
          handle_servers
        end
        return @thread
      else
        handle_servers
      end
    end

    # ...
  end
end

BasicSocket.do_not_reverse_lookup is immediately set to true, disabling DNS reverse lookups by altering the global Ruby configuration value.

// ext/socket/basicsocket.c

static VALUE
bsock_do_not_rev_lookup_set(VALUE self, VALUE val)
{
    rsock_do_not_reverse_lookup = RTEST(val);
    return val;
}

// ext/socket/init.c

int rsock_do_not_reverse_lookup = 1;

// ...

// An internal function that does some housekeeping and returns an instance of the specified socket class
VALUE
rsock_init_sock(VALUE sock, int fd)
{
    rb_io_t *fp;

    rb_update_max_fd(fd);
    MakeOpenFile(sock, fp);
    fp->fd = fd;
    fp->mode = FMODE_READWRITE|FMODE_DUPLEX;
    rb_io_ascii8bit_binmode(sock);
    if (rsock_do_not_reverse_lookup) {
        // A bitwise OR operation that sets the NOREVLOOKUP bit to 1, disabling reverse lookups
        fp->mode |= FMODE_NOREVLOOKUP;
    }
    rb_io_synchronized(fp);

    return sock;
}

When Ruby needs to extract the remote address from a TCP connection, e.g. when IPSocket#peeraddr or IPSocket#recvfrom is called, it will try to resolve the DNS name associated with the peer’s IP address. This is not desired, since Puma doesn’t need the DNS records that may potentially be associated with a client. Not only is doing a reverse lookup here mostly pointless, but it would also result in an outrageously unnecessary latency increase for all clients due to extra DNS queries. Puma makes use of peeraddr, whose source code can be found below, where the check for FMODE_NOREVLOOKUP happens. We’ll examine the reasons for its usage later.

// ext/socket/ipsocket.c

static VALUE
ip_peeraddr(int argc, VALUE *argv, VALUE sock)
{
    rb_io_t *fptr;
    union_sockaddr addr;
    socklen_t len = (socklen_t)sizeof addr;
    int norevlookup;

    GetOpenFile(sock, fptr);

    if (argc < 1 || !rsock_revlookup_flag(argv[0], &norevlookup))
        // A bitwise AND operation that checks whether FMODE_NOREVLOOKUP is set
        norevlookup = fptr->mode & FMODE_NOREVLOOKUP;
    if (getpeername(fptr->fd, &addr.addr, &len) < 0)
        rb_sys_fail("getpeername(2)");
    return rsock_ipaddr(&addr.addr, len, norevlookup);
}
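
To see the difference in action, here’s a small illustration (not from Puma; the actual addresses will vary):

require 'socket'

sock = TCPSocket.new('example.com', 80)

sock.do_not_reverse_lookup = false
sock.peeraddr # => ["AF_INET", 80, "<name from a DNS PTR query>", "<peer ip>"]

sock.do_not_reverse_lookup = true
sock.peeraddr # => ["AF_INET", 80, "<peer ip>", "<peer ip>"] - no DNS roundtrip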

After disabling reverse DNS lookups, Puma instantiates a ThreadPool and passes a block that calls Server’s process_client method on the client argument. We’ll look into ThreadPool shortly, but before that, let’s see the rest of run:

module Puma
  class Server
    # ...

    def run(background=true, thread_name: 'srv')
      # ...

      @status = :run

      @thread_pool = ThreadPool.new(thread_name, options) { |client| process_client client }

      if @queue_requests
        @reactor = Reactor.new(@io_selector_backend) { |c| reactor_wakeup c }
        @reactor.run
      end

      @thread_pool.auto_reap! if options[:reaping_time]
      @thread_pool.auto_trim! if options[:auto_trim_time]

      @check, @notify = Puma::Util.pipe unless @notify

      @events.fire :state, :running

      if background
        @thread = Thread.new do
          Puma.set_thread_name thread_name
          handle_servers
        end
        return @thread
      else
        handle_servers
      end
    end
  end
end

Note the assignment of :run to @status. Throughout the Server’s lifecycle, its status can be one of run, stop, halt, and restart. We’ll see how exactly it’s used later.

Pay attention to the @queue_requests conditional. For now, let’s imagine that the variable is set to false, even though it’s true by default. Disabling it can be a very bad idea. We’ll see why Reactor is important by first taking a look at how Puma functions without it.

@check and @notify are the read and write ends of an unnamed pipe created using the pipe system call (Puma::Util.pipe is a thin wrapper over IO.pipe). Just by the names of the variables, we can expect that they are used for some form of orchestration. Both of the returned values are instances of the IO class - this will become relevant in a second.

The Accept Loop

Finally, run calls handle_servers, wrapped in a separate thread. This is the thread that the Runner joins, as we’ve seen previously. Here’s the full body of handle_servers, save for lines that are only relevant for the Cluster runner:

module Puma
  class Server

    # ...

    def handle_servers
      begin
        check = @check
        sockets = [check] + @binder.ios
        pool = @thread_pool
        queue_requests = @queue_requests
        drain = options[:drain_on_shutdown] ? 0 : nil

        # ...

        while @status == :run || (drain && shutting_down?)
          begin
            ios = IO.select sockets, nil, nil, (shutting_down? ? 0 : @idle_timeout)
            unless ios
              unless shutting_down?
                @idle_timeout_reached = true

                # ...
                @log_writer.log "- Idle timeout reached"
                @status = :stop
              end

              break
            end

            # ...

            ios.first.each do |sock|
              if sock == check
                break if handle_check
              else
                pool.wait_until_not_full
                pool.wait_for_less_busy_worker(options[:wait_for_less_busy_worker]) if @clustered

                io = begin
                  sock.accept_nonblock
                rescue IO::WaitReadable
                  next
                end
                drain += 1 if shutting_down?
                pool << Client.new(io, @binder.env(sock)).tap { |c|
                  c.listener = sock
                  c.http_content_length_limit = @http_content_length_limit
                  c.send(addr_send_name, addr_value) if addr_value
                }
              end
            end
          rescue IOError, Errno::EBADF
            raise
          rescue StandardError => e
            @log_writer.unknown_error e, nil, "Listen loop"
          end
        end

        @log_writer.debug "Drained #{drain} additional connections." if drain
        @events.fire :state, @status

        if queue_requests
          @queue_requests = false
          @reactor.shutdown
        end

        graceful_shutdown if @status == :stop || @status == :restart
      rescue Exception => e
        @log_writer.unknown_error e, nil, "Exception handling servers"
      ensure
        [@check, @notify].each do |io|
          begin
            io.close unless io.closed?
          rescue Errno::EBADF
          end
        end
        @notify = nil
        @check = nil
      end

      @events.fire :state, :done
    end

    # ...
  end
end

At first, it might seem like a lot is going on here. Let’s tackle this bit by bit.

First, a sockets array is built: the reader end of the pipe that we’ve seen being created, @check, gets concatenated with @binder.ios - these are the listening sockets that Binder created. Since both the sockets and @check are instances of the IO class, they share the same interface and, from that perspective, are no different from each other.

We are then faced with a while loop that runs while @status is :run, which is the default initial value. The shutdown process, including the drain behaviour, will be covered in another section. For now, we can establish that this loop does not finish until the server is shutting down.

Each iteration of this loop then calls IO.select. In order to understand why it’s needed here and what it does, let’s first imagine a hypothetical more trivial version of a Server.

In a simplistic scenario, once listen is called on a socket, one could simply call accept to start processing incoming connections. The problem with this approach is that, in this case, accept would block unless there is a connection waiting to be accepted, making the thread stall potentially indefinitely. If there’s just one listening socket, then it may not be a big problem. However, this is not the case for Puma. Even though it may be rare to listen on multiple addresses and thus have multiple @binder.ios, we’ve just seen that Server always prepends the IO object that represents the reader end of the pipe to the ios array. This means that there will always be at least two objects in there.

Rushing ahead, one might question why a pipe is used for orchestration at all, when the main thread (the one that waits on the Server thread in the Single runner) could simply set the @status instance variable in a synchronised manner, without having to send messages through a pipe. Doing so, after all, would have eliminated the ever-present first element of the ios array by getting rid of the pipe, thus alleviating the necessity to multiplex between it and the listening sockets in most cases. Contrary to what might seem the most plausible explanation, the answer has nothing to do with the Clustered forking runner, which we’ll examine later. The actual answer lies within the process of the server shutdown.

If Puma had used a direct accept, it would not be able to process multiple listening sockets using a single thread. It would most likely be forced to spawn a separate one for each IO object, which would be redundant, since some of these threads would spend most of their time doing nothing. This would definitely be the case for the thread handling @check. Threads aren’t free, after all, and therefore this could be seen as an example of reckless usage of resources.

Another reason why calling a blocking accept immediately would be detrimental is that it would make it necessary to use exceptions for control flow.

Imagine the following scenario: a Puma process needs to be shut down and booted up again for whatever reason. However, connections can still keep being made during this time. One could check for the @status on each loop, which Server actually does anyway, but this would introduce an edge case where there are no incoming connections to be accepted when the server is supposed to shut down. The loop would simply never finish or would result in potentially significant lag in the shutdown process. To break out of this loop, the runner would have to raise an exception on the server thread.

Raising an exception on a thread is generally unsafe because it can interrupt the code at an arbitrary point, breaking assumptions that the developer made when writing it. Although, as we’ll see later, the Server thread does not invoke user-defined application code, which means that a careful and limited utilisation of Thread#raise or Thread#kill with the application of Thread.handle_interrupt could be an option. However, this could complicate Server’s logic.

So, even if our rudimentary server dropped support for listening on multiple addresses, we would still like to, at least ideally, avoid having to intrusively raise an exception on the thread that will be waiting for connections during shutdown. If we don’t do that, then it’s not clear how the server should be stopped gracefully.

It seems like the blocking nature of accept is the crux of the issue in our hypothetical scenario.

It turns out that there is a way to avoid the waiting by marking the accepting socket as non-blocking. We are already familiar with this: recall that when instantiating a TCPServer instance, which is a higher-level abstraction over the IPSocket class, Ruby automatically sets the SOCK_NONBLOCK option on it. In fact, this flag will be set on instances of all socket classes in Ruby, including IPSocket, UNIXSocket, and plain Socket, since all of them use the rsock_socket0 function under the hood. But if this option is always set, why would accept block? It blocks because Ruby opts out of having all sockets non-blocking by default and makes the blocking behaviour explicit in the high-level API methods it exposes instead. To the developer, accept will always block, even though the underlying socket is non-blocking, because Ruby implicitly retries on EAGAIN and EWOULDBLOCK. Contrary to this, accept_nonblock does not retry and instead propagates the error in the form of an exception all the way to the caller.
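
The difference is easy to demonstrate (a sketch; assume no client is connecting):

require 'socket'

server = TCPServer.new('127.0.0.1', 0)

server.accept_nonblock # raises IO::EAGAINWaitReadable (has IO::WaitReadable mixed in)
server.accept          # blocks the thread until a client actually connects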

You might have noticed that Puma does exactly this further down in the Server#handle_servers method, so using accept_nonblock seems to be the right direction when data needs to be pulled from multiple sockets.

The final question is: why not simply call accept_nonblock in a loop and immediately retry after rescuing exceptions with the IO::WaitReadable module mixed in when there is no connection ready to be accepted? The answer is that this would result in a so-called busy loop, where the program wastes CPU cycles by actively waiting for a condition to be satisfied. Active waiting means that the process is relentlessly checking whether the condition is met without pause. In the case of Puma, this waiting would be pointless, since there’s really nothing useful for the server thread to do when there are no connections available. sleeping arbitrarily during downtime would be a poor choice, since it would decrease the efficiency of the web server and add unnecessary waiting times for clients. Instead of busy-waiting, it would be optimal if there were a way to make the server react only when there are open connections, without occupying a CPU core when waiting.

Enter select - a system call that makes it possible to monitor desired file descriptors for their readiness to be written to or read from. Instead of having the application act on EAGAIN and the like, select makes the OS handle the check and makes it responsible for waking up the application when it’s appropriate. A call to it will return only once there are sockets ready to be accepted, in this case, or if the optional timeout passes. Naturally, select can handle more than one listening socket, which fits Puma’s use case of potentially having multiple bound addresses.

Let’s sum up our reverse engineering exercise.

Since Puma makes it possible to have several listening sockets, the Server needs to handle all of them fairly. Calling blocking accept directly is not an option, since the server would need to spawn a thread for each address, otherwise processing wouldn’t be fair. Spawning threads for this purpose is not desirable, since it’s a resource that isn’t free, and especially since it’s possible to achieve the desired result using just one. Using the non-blocking accept_nonblock, while making it possible to handle multiple sockets, results in busy-waits that waste CPU cycles. The select system call, which monitors the specified binder sockets, is used to solve all of the issues above. Additionally, it allows implementing orchestration control, such as shutdown signalling, using the same interface - server state changes are written to a pipe, which is also an IO object, just like the sockets. This IO object can be, and is, added by Puma to the list of file descriptors monitored using select, thus making the server itself responsible for reacting to state changes. The busy loop is eliminated since select is invoked without a timeout (unless idle_timeout is configured, which it is not by default), making it wait indefinitely until there are any incoming connections.
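
To make the whole pattern tangible, here is a condensed, standalone sketch of the same idea - one thread select-ing on both a listening socket and a control pipe (all names and the stop byte are made up, not Puma’s code):

require 'socket'

check, notify = IO.pipe
server = TCPServer.new('127.0.0.1', 0)

acceptor = Thread.new do
  loop do
    readable, = IO.select([check, server]) # blocks without burning CPU
    readable.each do |io|
      if io == check
        check.read(1) # any byte on the pipe means 'stop'
        Thread.exit
      else
        conn = io.accept_nonblock(exception: false)
        next if conn == :wait_readable # lost the race, select again
        conn.close # a real server would hand the connection to a worker here
      end
    end
  end
end

notify.write('?') # wake the loop up and ask it to stop
acceptor.join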

Now is the time to see what happens once IO.select returns:

module Puma
  class Server
    # ...

    def handle_servers
      begin
      # ...
        while @status == :run || (drain && shutting_down?)
          begin
            ios = IO.select sockets, nil, nil, (shutting_down? ? 0 : @idle_timeout)

            # ...

            ios.first.each do |sock|
              if sock == check
                break if handle_check
              else
                # ...
              end
            end

            # ...
          end
        end
      # ...
      end
    end

    # ...
  end
end

ios is a three-element array, where the first element is an array of file descriptors ready to be read from, or, in this case, listening sockets ready to accept pending connections. The other two elements (descriptors ready for writing and those in an exceptional state) are irrelevant here, since Server does not care about writing data.

Each socket that is ready is then iterated over using ios.first.each.

The first thing that happens here is a check to determine whether the socket is the reader end of the pipe used for communicating the current status to the server. This is a special case, since plain IO objects, such as @check, do not have the accept method. This obviously makes sense, since a pipe is not a network socket and does not have the notion of listening and accepting remote connections. The way data received from the pipe is handled is also different:

# lib/puma/const.rb

module Puma
  module Const
    # ...
    STOP_COMMAND = "?"
    HALT_COMMAND = "!"
    RESTART_COMMAND = "R"
    # ...
  end
end

# lib/puma/server.rb

module Puma
  class Server
    include Puma::Const

    # ...

    def handle_check
      cmd = @check.read(1)

      case cmd
      when STOP_COMMAND
        @status = :stop
        return true
      when HALT_COMMAND
        @status = :halt
        return true
      when RESTART_COMMAND
        @status = :restart
        return true
      end

      false
    end

    # ...
  end
end

Quite straightforward - exactly 1 byte is read, which represents a state change defined in Puma::Const. The status is set according to the processed command, and iteration over ios is stopped, nudging the server to start executing the shutdown logic.
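
The writing end of this protocol is just as simple: elsewhere, stopping the server boils down to pushing a single command byte into the pipe - conceptually something like this (a simplified sketch, not the exact Puma code):

# Executed from another thread, e.g. on behalf of a signal handler:
@notify << STOP_COMMAND # writes "?"; the select loop wakes up and reads it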

Processing of actual sockets resides in a separate branch, which starts with two calls to the ThreadPool, which was instantiated in the run method:

module Puma
  class Server
    # ...

    def run(background=true, thread_name: 'srv')
      # ...
      @thread_pool = ThreadPool.new(thread_name, options) { |client| process_client client }
      # ...
    end

    # ...

    def handle_servers
      begin
        # ...
        pool = @thread_pool
        # ...
        while @status == :run || (drain && shutting_down?)
          begin
            # ...
            ios.first.each do |sock|
              if sock == check
                # ...
              else
                pool.wait_until_not_full
                # ...

                io = begin
                  sock.accept_nonblock
                rescue IO::WaitReadable
                  next
                end
                drain += 1 if shutting_down?
                pool << Client.new(io, @binder.env(sock)).tap { |c|
                  c.listener = sock
                  c.http_content_length_limit = @http_content_length_limit
                  c.send(addr_send_name, addr_value) if addr_value
                }
              end
            end
            # ...
          end
        end
      end
    end

    # ...
  end
end

We won’t look into ThreadPool internals just yet, so for now, we can assume that wait_until_not_full is some kind of backpressure mechanism that prevents the server from accepting more connections than Puma can reasonably process.

Finally, the anticipated accept_nonblock call is made. It’s wrapped in a rescue IO::WaitReadable to account for possible race conditions, such as a select-ed socket no longer being readable by the time accept is reached. An instance of the appropriate socket class is returned and assigned to io.

The accepted socket, which represents an actual client connection, is then wrapped in an instance of Puma::Client and added to the Puma::ThreadPool via its << method.

This is where the main job of the Server, which consists of accepting connections, ends. It also handles shutdown logic and exposes some callbacks that are used by the thread pool and the reactor, but we’ll take a separate look at them in future sections.

Handling Connections

In the previous section, we’ve investigated how Puma’s Server accepts incoming connections. We’ve seen that sockets returned by accept are wrapped in instances of Puma::Client and passed to the Puma::ThreadPool, at which point the server completes its main loop. In order to follow the clients’ lifecycle further and see how and what processes them, let’s quickly look back at relevant parts of Server:

module Puma
  class Server
    # ...

    def run(background=true, thread_name: 'srv')
      # ...

      @thread_pool = ThreadPool.new(thread_name, options) { |client| process_client client }

      # ...

      @thread_pool.auto_reap! if options[:reaping_time]
      @thread_pool.auto_trim! if options[:auto_trim_time]
      # ...
    end

    # ...

    def handle_servers
      # ...
        addr_send_name, addr_value = case options[:remote_address]
        when :value
          [:peerip=, options[:remote_address_value]]
        when :header
          [:remote_addr_header=, options[:remote_address_header]]
        when :proxy_protocol
          [:expect_proxy_proto=, options[:remote_address_proxy_protocol]]
        else
          [nil, nil]
        end

        # ...
            ios.first.each do |sock|
              # ...
                pool.wait_until_not_full
                pool.wait_for_less_busy_worker(options[:wait_for_less_busy_worker]) if @clustered
                # ...
                @thread_pool << Client.new(io, @binder.env(sock)).tap { |c|
                  c.listener = sock
                  c.http_content_length_limit = @http_content_length_limit
                  c.send(addr_send_name, addr_value) if addr_value
                }
              # ...
            end
      # ...
    end

    # ...
  end
end

Before looking into how the ThreadPool handles clients, let’s see what happens in Client’s initializer, with some of the original comments preserved:

module Puma
  # An instance of this class represents a unique request from a client.
  # For example, this could be a web request from a browser or from CURL.
  # ...
  class Client
    def initialize(io, env=nil)
      @io = io
      @to_io = io.to_io
      @io_buffer = IOBuffer.new
      @proto_env = env
      @env = env&.dup

      @parser = HttpParser.new
      @parsed_bytes = 0
      @read_header = true
      @read_proxy = false
      @ready = false

      @body = nil
      @body_read_start = nil
      @buffer = nil
      @tempfile = nil

      @timeout_at = nil

      @requests_served = 0
      @hijacked = false

      @http_content_length_limit = nil
      @http_content_length_limit_exceeded = false

      @peerip = nil
      @peer_family = nil
      @listener = nil
      @remote_addr_header = nil
      @expect_proxy_proto = false

      @body_remain = 0

      @in_last_chunk = false

      # need unfrozen ASCII-8BIT, +'' is UTF-8
      @read_buffer = String.new # rubocop: disable Performance/UnfreezeString
    end

    attr_reader :env, :to_io, :body, :io, :timeout_at, :ready, :hijacked,
                :tempfile, :io_buffer, :http_content_length_limit_exceeded

    attr_writer :peerip, :http_content_length_limit

    attr_accessor :remote_addr_header, :listener

    # ...
  end
end

As the comment says, a Client encapsulates incoming requests from a particular TCP connection. The prepared instance variables also indicate that a lot of data reading is going to be conducted within this class. We’ll soon get a chance to become more familiar with them.

There are two interesting things to observe with regards to the start of the client’s lifecycle.

First is the c.send(addr_send_name, addr_value) if addr_value line in Server’s handle_servers loop. This call configures the way the remote address of the incoming connection gets extracted. By default, it uses the IPSocket#peeraddr method, which uses the getpeername system call. You may recall that Server sets BasicSocket.do_not_reverse_lookup to true as soon as it starts. If it hadn’t done this, all incoming connections would trigger a reverse DNS lookup unless the address extraction is reconfigured. Performing a reverse lookup here would be mostly pointless, since the IP address should be enough to identify a client. If you want to learn more about possible alternatives to peeraddr that Puma offers, you can read the documentation of Puma::DSL#set_remote_address.

Second, the last line of Client#initialize - @read_buffer = String.new - deserves attention. As the comment above it suggests, in Ruby, strings initialised using literals (e.g., 'string') are created with the UTF-8 encoding. The read buffer is used to read data off a socket, which means that the data is supplied by the client and is completely arbitrary - anything can arrive. This includes data that usually has special meaning, such as \r\n, also known as the CR LF sequence, or HTTP’s end-of-line marker. There should be no issues with this sequence in case it is part of the HTTP body, where it would not be considered part of the protocol itself. The trouble comes if, for example, a file containing \r\n strings is uploaded using multipart/form-data. If this data is added to a UTF-8 encoded buffer, it could be misinterpreted as an actual CR LF sequence, violating multipart boundaries and breaking the code that performs the read. The solution is to use the ASCII-8BIT (binary) encoding, where \r\n would be taken literally as a sequence of bytes. Strings created using an explicit String.new have the ASCII-8BIT encoding, contrary to literals. The described scenario would result in Puma not being able to process such requests prior to v6.3.0, where the bug was fixed3.
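
The difference is visible directly in IRB:

'string'.encoding   # => #<Encoding:UTF-8>
(+'').encoding      # => #<Encoding:UTF-8> - unary + unfreezes, but keeps UTF-8
String.new.encoding # => #<Encoding:ASCII-8BIT> (a.k.a. BINARY)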

Managing the Thread Pool

So far, we have identified two primary threads that are running by this point: the one that calls the Single runner from the launcher and the other that runs the server loop.

Once accepted, the clients are processed by a set of worker threads managed by a ThreadPool.

module Puma
  class ThreadPool
    # ...

    def initialize(name, options = {}, &block)
      @not_empty = ConditionVariable.new
      @not_full = ConditionVariable.new
      @mutex = Mutex.new

      @todo = []

      @spawned = 0
      @waiting = 0

      @name = name
      @min = Integer(options[:min_threads])
      @max = Integer(options[:max_threads])
      # ...
      @block = block
      @out_of_band = options[:out_of_band]
      # ...
      @reaping_time = options[:reaping_time]
      @auto_trim_time = options[:auto_trim_time]

      @shutdown = false

      @trim_requested = 0
      @out_of_band_pending = false

      @workers = []

      @auto_trim = nil
      @reaper = nil

      @mutex.synchronize do
        @min.times do
          spawn_thread
          @not_full.wait(@mutex)
        end
      end

      @force_shutdown = false
      @shutdown_mutex = Mutex.new
    end

    # ...
  end
end

There are a lot of instance variables being set in the initializer; we’ll see how most of them work once we observe them in use. One thing stands out, though - a mutex wrapping @min iterations of spawn_thread. This suggests that threads are spawned immediately during the creation of the thread pool. Before we dive into the called method, let’s examine the @not_full.wait(@mutex) line first.

As can be seen, @not_full is an instance of the ConditionVariable class. Even though condition variables are part of the standard Ruby concurrency toolkit, they are much less commonly encountered compared to Mutex and Queue. In essence, a condition variable allows code in a critical section to wait for a signal from another thread while releasing the mutex, thus guaranteeing atomicity and avoiding busy-waiting (which we’ve covered previously when we looked into the server select loop).
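
If you haven’t worked with condition variables before, the core idiom looks like this (a generic producer/consumer example, unrelated to Puma):

queue = []
mutex = Mutex.new
cond  = ConditionVariable.new

consumer = Thread.new do
  mutex.synchronize do
    # wait releases the mutex while sleeping and re-acquires it upon wakeup
    cond.wait(mutex) while queue.empty?
    puts "got #{queue.shift}"
  end
end

mutex.synchronize do
  queue << :job
  cond.signal # wake up one thread waiting on the condvar
end

consumer.join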

Here, the name of the condition variable suggests that the thread pool waits until something signals it that there is still room for more threads to be spawned. Also, note that if Puma is configured with a dynamic range of threads (e.g., threads 3 5 in config/puma.rb, where 3 is the minimum and 5 is the maximum), the pool will only be instantiated with the lowest possible number of threads.

Let’s find out what’s on the other end of @not_full:

module Puma
  class ThreadPool
    # ...

    def spawn_thread
      @spawned += 1

      # ...
      th = Thread.new(@spawned) do |spawned|
        # ...
        todo  = @todo
        block = @block
        mutex = @mutex
        not_empty = @not_empty
        not_full = @not_full

        while true
          work = nil

          mutex.synchronize do
            while todo.empty?
              if @trim_requested > 0
                @trim_requested -= 1
                @spawned -= 1
                @workers.delete th
                not_full.signal
                # ...
                Thread.exit
              end

              @waiting += 1
              if @out_of_band_pending && trigger_out_of_band_hook
                @out_of_band_pending = false
              end
              not_full.signal
              begin
                not_empty.wait mutex
              ensure
                @waiting -= 1
              end
            end

            work = todo.shift
          end

          # ...

          begin
            @out_of_band_pending = true if block.call(work)
          rescue Exception => e
            STDERR.puts "Error reached top of thread-pool: #{e.message} (#{e.class})"
          end
        end
      end

      @workers << th

      th
    end

    # ...
  end
end

Let’s start from the end of this method, which is actually executed much sooner than the rest of it. Once a thread is spawned, it’s immediately added to the @workers array, which plays an important role in the pool’s lifecycle management.

Now, back to the beginning. The main loop first obtains the mutex using Mutex#synchronize. It then checks if the todo array is empty. For now, let’s assume that it is. The first conditional, entered only when @trim_requested is positive, essentially stops the thread via Thread.exit, updates the relevant counters, and signals the @not_full condition variable. At this point, the counter is going to be 0. There are two possible ways for it to get incremented: during shutdown and when ThreadPool#trim is called.

Remember these two calls that happen right after the server instantiates a pool?

module Puma
  class Server
    def run(background=true, thread_name: 'srv')
      # ...
      @thread_pool.auto_reap! if options[:reaping_time]
      @thread_pool.auto_trim! if options[:auto_trim_time]
      # ...
    end
  end
end  

Let’s see what they do and how it’s relevant to @trim_requested:

module Puma
  class ThreadPool
    # ...

    def trim(force=false)
      with_mutex do
        free = @waiting - @todo.size
        if (force or free > 0) and @spawned - @trim_requested > @min
          @trim_requested += 1
          @not_empty.signal
        end
      end
    end

    # ...

    class Automaton
      def initialize(pool, timeout, thread_name, message)
        @pool = pool
        @timeout = timeout
        @thread_name = thread_name
        @message = message
        @running = false
      end

      def start!
        @running = true

        @thread = Thread.new do
          Puma.set_thread_name @thread_name
          while @running
            @pool.public_send(@message)
            sleep @timeout
          end
        end
      end

      def stop
        @running = false
        @thread.wakeup
      end
    end

    def auto_trim!(timeout=@auto_trim_time)
      @auto_trim = Automaton.new(self, timeout, "#{@name} threadpool trimmer", :trim)
      @auto_trim.start!
    end

    # ...

    def with_mutex(&block)
      @mutex.owned? ?
        yield :
        @mutex.synchronize(&block)
    end

    # ...
  end
end

When invoked, auto_trim! spawns a separate thread which routinely calls the aforementioned trim method. This method calculates how many threads are currently waiting and not doing any work. Availability here is determined by the @waiting count, whose manipulation we will see in a bit, and the number of clients in the todo array. The @trim_requested variable is then incremented if there is at least one free thread and the pool, minus the trims already requested, still holds more threads than the configured minimum. @trim_requested itself participates in the conditional check because trim needs to handle the potential situation where trims are requested multiple times in quick succession, before the first trimming takes place. @not_empty, the second condvar, is signalled here so that the threads waiting on it are woken up and get a chance to exit.

While we’re at it, let’s also look at reap, which works using the same principle as trim:

module Puma
  class ThreadPool
    # ...

    def reap
      with_mutex do
        dead_workers = @workers.reject(&:alive?)

        dead_workers.each do |worker|
          worker.kill
          @spawned -= 1
        end

        @workers.delete_if do |w|
          dead_workers.include?(w)
        end
      end
    end

    # ...

    def auto_reap!(timeout=@reaping_time)
      @reaper = Automaton.new(self, timeout, "#{@name} threadpool reaper", :reap)
      @reaper.start!
    end

    # ...
  end
end

The difference is that reap removes dead threads from the pool, freeing up slots that would otherwise remain occupied.

Back in the spawn_thread loop, here’s what happens in an iteration that does not trim the thread:

module Puma
  class ThreadPool
    # ...
    def spawn_thread
      th = Thread.new(@spawned) do |spawned|
        # ...

        while true
          work = nil

          mutex.synchronize do
            while todo.empty?
              if @trim_requested > 0
                # ...
              end

              @waiting += 1
              # ...
              not_full.signal
              begin
                not_empty.wait mutex
              ensure
                @waiting -= 1
              end
            end

            work = todo.shift
          end

          # ...

          begin
            # Slightly modified for brevity
            block.call(work)
          rescue Exception => e
            STDERR.puts "Error reached top of thread-pool: #{e.message} (#{e.class})"
          end
        end
      end
    end
  end
end  

The @waiting counter, which we have seen in action, gets incremented. Shortly after this, @not_full is signalled - this is where the original @min.times iteration in the initializer gets unblocked, and subsequent threads get a green light to be created. This condvar is also used in a couple more places, which we’ll cover later.

Right after this, the thread blocks on @not_empty.wait, releasing the wrapping mutex. So far, we’ve seen only one place where it gets signalled: the trim method. However, there is another important call site that we have not seen yet.

Once wait returns, @waiting is decremented, indicating that a thread now has something to do. Specifically, it extracts a client from the front of the makeshift queue (todo is a plain Ruby array) and calls the block that was previously provided to the ThreadPool initializer, which processes the client.

shifting the todo list makes it so that clients are dequeued in a first-in, first-out manner.

Keep in mind that everything other than executing the block callback is wrapped in a mutex, which makes it safe to shift work off the queue, increment and decrement counters, and so on.

This is where the main worker thread loop ends, although it’s still not clear exactly how clients end up in the todo list and what signals the not_empty condvar, on which the threads will be waiting for most of their lifetime. Now is the time to look at ThreadPool#<<:

module Puma
  class ThreadPool
    # ...

    def <<(work)
      with_mutex do
        if @shutdown
          raise "Unable to add work while shutting down"
        end

        @todo << work

        if @waiting < @todo.size and @spawned < @max
          spawn_thread
        end

        @not_empty.signal
      end
    end

    # ...
  end
end

Every time a client is added to the thread pool by the server, it is added straight to the @todo array, and @not_empty is signalled. This allows waiting threads to wake up and get a chance to pop the freshly added client off the queue.

You might remember that Puma spawns only the @min number of threads at first. The << method is responsible for spawning more threads if the pool is not at max capacity. To avoid scaling the pool up to full capacity prematurely, new threads are spawned only when the backlog of clients grows larger than the number of threads waiting for work. This means that if Puma sees low enough traffic, it will try to keep the pool as small as possible according to its configuration.

Before we move on from the ThreadPool, let’s revisit the final call that we glossed over previously. Remember that before the server accepts a connection, it calls wait_until_not_full on the pool:

# lib/puma/server.rb

module Puma
  class Server
    # ...
    def handle_servers
      # ...
      ios.first.each do |sock|
        # ...
        pool.wait_until_not_full
        # ...
      
        io = begin
          sock.accept_nonblock
        rescue IO::WaitReadable
          next
        end
        # ...
      end
    end
    # ...
  end
end

# lib/puma/thread_pool.rb

module Puma
  class ThreadPool
    # ...
    def wait_until_not_full
      with_mutex do
        while true
          return if @shutdown

          return if busy_threads < @max

          @not_full.wait @mutex
        end
      end
    end
    
    # ...

    def busy_threads
      with_mutex { @spawned - @waiting + @todo.size }
    end
  end
end

This is the backpressure mechanism, which prevents too many connections from being accepted when Puma’s capacity is exhausted. The method short-circuits and doesn’t wait if there is still room for the pool to grow according to the min and max configuration options; otherwise, the thread pool would never reach its max capacity.
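
A quick worked example of that arithmetic (the numbers are hypothetical):

# Snapshot with `threads 0, 5`:
#   @spawned   = 5 - all threads have been created
#   @waiting   = 2 - two threads are idle, blocked on not_empty
#   @todo.size = 1 - one client is queued but not yet picked up
busy_threads # => 5 - 2 + 1 = 4; 4 < 5, so the server keeps accepting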

Let’s summarise the pool’s internals based on what we’ve learned.

Clients get pushed to the todo array when << is called.

There are two condition variables: not_full and not_empty. The former is signalled whenever a thread has no work to do (i.e., there are no clients in the todo array to be picked up), or if a thread gets removed from the pool. The signalling happens once per iteration of the main loop within a worker thread, right before it starts waiting on the not_empty variable indefinitely until it gets signalled. The not_empty condvar is signalled each time a new client is added to the pool, which happens only after not_full is signalled, since Server calls wait_until_not_full before pushing a client via <<. It is also signalled when a thread should be trimmed from the pool, as otherwise threads waiting on this condvar would not get a chance to remove themselves until a new client is accepted.

The combination of these two condvars makes it possible for Puma to implement fine-grained and efficient control of when new clients should be accepted, which it does. It also allows for dynamic scale-downs and scale-ups of the pool in case Puma is configured with a variable number of worker threads.

The backpressure control is arguably the most important here. Puma, as we’ve just learned in detail, is a threaded web server. This means that there’s usually an inherent limit to the amount of work that can be processed concurrently, since the worker pool has a limited number of threads, even though the upper limit may be high.

Spawning a separate thread for each incoming connection would result in unbounded resource usage during times of high load, since threads aren’t free - each one allocates its own stack. There are also severe implications stemming from the fact that Ruby, specifically its C implementation (known as CRuby or MRI, which this article assumes), utilises a global VM lock (GVL), which prevents multiple threads of a single Ruby process from running in parallel. Due to this, threading in Ruby mostly offers benefits for IO-bound code, which releases the GVL whenever a blocking call happens, such as reading from or writing to a socket. Crunching Fibonacci sequences or performing other CPU-bound work is still possible to do concurrently, but not in parallel. Even though doing so may increase throughput, having more threads than the optimal number may - and most likely will - result in higher latencies due to extra context switching.
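
A quick way to feel the GVL’s effect for yourself (a standalone experiment, nothing to do with Puma):

def fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)

# CPU-bound: no parallelism - the GVL lets only one thread run Ruby code at a time
t = Time.now
4.times.map { Thread.new { fib(30) } }.each(&:join)
puts "cpu-bound: #{Time.now - t}s" # roughly the same as running them sequentially

# IO-bound: near-perfect overlap - blocking calls release the GVL
t = Time.now
4.times.map { Thread.new { sleep 1 } }.each(&:join)
puts "io-bound: #{Time.now - t}s" # ~1 second, not 4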

This is why Puma opts to have a preallocated pool of threads. This means that if Puma is configured to have a single worker with a thread pool of size 5, it can actively run client-serving code for at most 5 requests at the same time. In other words, at any given moment, there can only be at most 5 clients popped off the todo list, but the todo list itself may have more clients pending, even if all threads are busy executing application code or reading data from the socket.

Accepting more connections and not controlling the influx of clients into the todo queue would result in too much disruption. Unlucky clients would connect, start sending their requests, and then wait for a response, eventually timing out due to exhaustion of the server’s capacity. Instead, clients simply do not get accepted in the first place until the server knows that it can handle them, which is more transparent for the client and the infrastructure.

Now that we’ve covered the threaded nature of Puma, as a hook for a future section, here’s a question: can it happen that the todo list ever gets bigger than the total number of threads in the pool? Based on what we’ve seen so far - no, because of the not_full condvar synchronization. In reality - yes. This will be explained once we start looking into the Reactor.

Serving Clients

We’ve looked into the ThreadPool and learned how Puma manages its worker threads, how it applies backpressure to prevent overload, and why the number of threads in a pool is always limited.

It’s time to finally see how Puma reads the request data and writes back the response.

Our investigation continues from the callback passed to the thread pool’s initializer by the server, which gets called at the end of each iteration of the main loop that every worker thread runs. This callback is a Proc object converted from a block, which contains a single call to Server#process_client. Let’s take a look at the first part:

module Puma
  class Server
    # ...
    def process_client(client)
      # ...

      close_socket = true

      requests = 0

      begin
        if @queue_requests &&
          !client.eagerly_finish

          client.set_timeout(@first_data_timeout)
          if @reactor.add client
            close_socket = false
            return false
          end
        end

        with_force_shutdown(client) do
          client.finish(@first_data_timeout)
        end

        # ...

      end
    end
    # ...
  end
end

First, the method almost immediately returns false if the queue_requests configuration option is enabled and client#eagerly_finish returns a falsy value - that is, if the request could not be read in full right away - in which case the client is handed over to the @reactor. As we’ve mentioned previously, we will skip looking into request queueing and the @reactor for now.

Then, a call to client#finish happens, wrapped in with_force_shutdown(client). The examination of the shutdown method will also be deferred until a future section.

Client#finish is quite straightforward on the surface:

module Puma
  class Client
    # ...
    def finish(timeout)
      return if @ready
      @to_io.wait_readable(timeout) || timeout! until try_to_finish
    end

    def timeout!
      write_error(408) if in_data_phase
      raise ConnectionError
    end

    ERROR_RESPONSE = {
      # ...
      408 => "HTTP/1.1 408 Request Timeout\r\nConnection: close\r\n\r\n",
      # ...
    }

    def write_error(status_code)
      begin
        @io << ERROR_RESPONSE[status_code]
      rescue StandardError
      end
    end
    # ...
  end
end

@ready will always be false if the client is fresh, so let’s not analyse this guard clause just yet.

@to_io is the IO-like object which was previously sourced from the accept_nonblock call in the server.

IO#wait_readable is the first ‘proper’ read-related IO method we encounter in Puma. It is similar to accept_nonblock, which we’ve already seen, since it also follows the non-blocking IO approach. This makes it possible to specify exactly how long the wait should be, instead of waiting indefinitely until there’s data to be read. Under the hood, wait_readable utilises the select system call, which is also exposed directly via IO::select. Note that this method does not actually read data off a socket; it only returns a truthy value if any data has become available within the specified timeout, or nil if it has not.

The timeout argument, configured using the first_data_timeout option, is 30 seconds by default. Any client that stalls for longer than this while sending its request gets disconnected; if the stall happens while the body is being read, a 408 response is written before the connection is dropped.
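
The semantics are easy to demonstrate with a pipe - a minimal sketch:

require 'io/wait'

r, w = IO.pipe

r.wait_readable(0.5) # => nil: nothing to read within 500ms
w.write 'x'
r.wait_readable(0.5) # => r itself (truthy): data is ready, but still unread
r.read_nonblock(1)   # => "x"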

IO#<<, which is called to write the timeout response, ultimately calls the writev system call.

All of the above happens only if and while try_to_finish returns false, so let’s unpack it:

# lib/puma/const.rb

module Puma
  module Const
    # ...
    CHUNK_SIZE = 64 * 1024
    # ...
    MAX_HEADER = 1024 * (80 + 32)
    # ...
  end
end

# lib/puma/client.rb

module Puma
  class Client
    # ...

    include Puma::Const

    # ...

    def initialize(io, env=nil)
      # ...
      @read_header = true
      @read_proxy = false
      @ready = false
      # ...
      @buffer = nil
      # ...
    end

    # ...

    def try_to_finish
      if env[CONTENT_LENGTH] && above_http_content_limit(env[CONTENT_LENGTH].to_i)
        @http_content_length_limit_exceeded = true
      end

      if @http_content_length_limit_exceeded
        @buffer = nil
        @body = EmptyBody
        set_ready
        return true
      end

      return read_body if in_data_phase

      data = nil
      begin
        data = @io.read_nonblock(CHUNK_SIZE)
      rescue IO::WaitReadable
        return false
      rescue EOFError
      rescue SystemCallError, IOError
        raise ConnectionError, "Connection error detected during read"
      end

      unless data
        @buffer = nil
        set_ready
        raise EOFError
      end

      if @buffer
        @buffer << data
      else
        @buffer = data
      end

      return false unless try_to_parse_proxy_protocol

      @parsed_bytes = @parser.execute(@env, @buffer, @parsed_bytes)

      if @parser.finished? && above_http_content_limit(@parser.body.bytesize)
        @http_content_length_limit_exceeded = true
      end

      if @parser.finished?
        return setup_body
      elsif @parsed_bytes >= MAX_HEADER
        raise HttpParserError,
          "HEADER is longer than allowed, aborting client early."
      end

      false
    end

    # ...

    def in_data_phase
      !(@read_header || @read_proxy)
    end

    # ...

    def set_ready
      if @body_read_start
        @env['puma.request_body_wait'] = Process.clock_gettime(Process::CLOCK_MONOTONIC, :float_millisecond) - @body_read_start
      end
      # ...
      @ready = true
    end

    # ...
  end
end

The first two guard clauses are not triggered when a client is popped off the todo array for the first time, so let’s skip them.

The next guard clause, return read_body if in_data_phase, also would not get executed with a fresh client, since @read_header is always true initially. This suggests that the subsequent operations within try_to_finish are dedicated to reading the HTTP header fields.

The data is then read using read_nonblock. If there is more than 64 kilobytes of data available to be read from the connection, Puma does not read it immediately - more on how the rest is handled shortly. As its name suggests, read_nonblock is a non-blocking operation. If there’s no available data on the socket, an exception with the IO::WaitReadable mixin is raised, and the method returns false immediately, indicating the client has not been fully served yet.

If the local data variable is nil, it means either the socket was closed after being accepted or something else went wrong. In any case, @ready is set to true, and an exception is raised, ending the method’s execution. Otherwise, the returned string object is appended to @buffer (or assigned to it, if nothing has been read yet).

The next line is a call to try_to_parse_proxy_protocol, which is related to the set_remote_address Puma configuration option we looked at earlier. When Puma is running behind the HAProxy load balancer, it’s possible to propagate the original client IP via its Proxy protocol. Since any proxy technically creates another TCP connection to forward the client’s request, the client IP sourced by Puma via the getpeername system call would not represent the original client but rather the proxy itself. Many proxies solve this by providing the original client’s IP as a separate HTTP header. HAProxy can take it a step further: instead of exposing the IP as a header, it can prepend this information as the very first line within the HTTP message, preceding the protocol data. try_to_parse_proxy_protocol simply parses this information and ensures that client#peerip returns the appropriate data. In case Puma has not yet read the whole proxy line, the method detects this and returns false, which in turn makes try_to_finish return false, indicating that more data has to be read.

Parsing HTTP

Once any data has been read from the socket, the HTTP parser is invoked. Puma uses its own parser, found in ext/puma_http11. It is implemented as a C extension and contains some of the oldest code in Puma. The reason for this is that the parser was forked from another Ruby web server - Mongrel, which would most likely be familiar to you if you have been around the Ruby ecosystem for a while.

We will not cover the internals of the parser, aside from mentioning that it uses Ragel, which generates a finite state machine from the grammar specification. In this case, the grammar represents the HTTP specification. It can be found in ext/puma_http11/http11_parser_common.rl.

When execute is called on the parser, three arguments are provided: @env, @buffer, and @parsed_bytes. The first argument is a plain Ruby hash, representing the Rack environment along with Puma-specific information. Its purpose is to provide a server-agnostic way for the underlying application to extract request data. One of the most common pieces of data the env hash contains is HTTP headers. The parser’s execute method takes raw data from @buffer as input, parses it, and populates the @env hash with headers and other data. The third argument is an integer representing the number of bytes already parsed from @buffer. This allows execute to be called multiple times without re-parsing the entire payload each time more data is read from the socket, which happens when try_to_finish indicates that more data must be read by returning false. The method returns the total number of bytes parsed so far, including the offset passed as the third argument - this is why the return value can simply be assigned back to @parsed_bytes instead of incrementing it on every call.
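
If you have the puma gem installed, you can drive the parser by hand to see this incremental contract in action - a small sketch:

require 'puma/puma_http11' # the compiled C extension shipped with the gem

parser  = Puma::HttpParser.new
env     = {}
nparsed = 0

# Feed the request in two pieces to simulate data arriving across reads.
buffer  = +"GET /hello HTTP/1.1\r\n"
nparsed = parser.execute(env, buffer, nparsed)
parser.finished?    # => false: the header section is incomplete

buffer << "Host: example.com\r\n\r\n"
nparsed = parser.execute(env, buffer, nparsed)
parser.finished?    # => true
env['REQUEST_PATH'] # => "/hello"
env['HTTP_HOST']    # => "example.com"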

Afterwards, finished? is called on the parser to check whether it has successfully read the headers. This method returns true if the integer cs value of the puma_parser struct has reached 46, the number of the final state in the state machine generated by Ragel; once the parser’s current state reaches it, the header section has been fully and successfully read.

It’s important to understand that the parser itself does not cause any reads directly if the current @buffer does not contain all of the header data. Hence, the potential need to call Client#try_to_finish, which reads data from the socket before invoking the parser, multiple times.

Next, Puma checks whether the size of the HTTP body, if fully read, exceeds an optional limit configured via the http_content_length_limit option. There’s no limit by default.

What’s even more important to understand is that despite calling @parser.body.bytesize, it’s not guaranteed that @parser.body will actually contain the full request body. The body can be completely missing if only the headers have been sent over the network and read from the socket so far. finished? will return true regardless, since the parser only cares about the headers. The contents of the body are arbitrary and not constrained by the HTTP specification, so there’s nothing there for the parser to extract.

If the quality of the underlying network is decent and the request is not too large, it’s likely that the very first read_nonblock will read the whole HTTP request, including its body. But this assumption cannot always be made, as there are too many factors at play that may impact how data is being sent, such as the client’s implementation, operating system, and so on. We’re still at the TCP level reading from a socket in a non-blocking manner; a single HTTP header may be split across multiple TCP packets. Technically, each byte of an HTTP request can be sent and flushed separately by the client, and each call to read_nonblock may read exactly one byte at a time. This is why reading the HTTP body is encapsulated into separate methods - setup_body, read_body, and others.

If the parser has successfully parsed the header section of a request, setup_body is called. Before we look into it, let’s pay attention to the alternative branch with the @parsed_bytes >= MAX_HEADER conditional. It is evaluated when the parser halts execution (meaning that finished? returns false) due to the header section of an HTTP request exceeding the predefined maximum size, which is 114688 bytes by default. The parser defines this and several other limits here.

Reading the Request Body

Client#setup_body is invoked exactly once during a client’s lifetime, as it sets @read_header to false almost immediately after being called. This ensures that subsequent invocations of try_to_finish will hit an early guard clause and call Client#read_body instead.

Let’s see what setup_body does:

# lib/puma/const.rb

module Puma
  module Const
    # ...
    HTTP_EXPECT = "HTTP_EXPECT"
    CONTINUE = "100-continue"

    HTTP_11_100 = "HTTP/1.1 100 Continue\r\n\r\n"

    # ...
    TRANSFER_ENCODING2 = "HTTP_TRANSFER_ENCODING"
    # ...
    CONTENT_LENGTH = "CONTENT_LENGTH"
    # ...
  end
end

# lib/puma/client.rb

module Puma
  class Client
    # ...

    CONTENT_LENGTH_VALUE_INVALID = /[^\d]/.freeze

    # ...

    def setup_body
      @body_read_start = Process.clock_gettime(Process::CLOCK_MONOTONIC, :float_millisecond)

      if @env[HTTP_EXPECT] == CONTINUE
        @io << HTTP_11_100
        @io.flush
      end

      @read_header = false

      body = @parser.body

      te = @env[TRANSFER_ENCODING2]
      if te
        # Handle chunked bodies
        # ...
      end

      @chunked_body = false

      cl = @env[CONTENT_LENGTH]

      if cl
        if CONTENT_LENGTH_VALUE_INVALID.match?(cl) || cl.empty?
          raise HttpParserError, "Invalid Content-Length: #{cl.inspect}"
        end
      else
        @buffer = body.empty? ? nil : body
        @body = EmptyBody
        set_ready
        return true
      end

      content_length = cl.to_i

      remain = content_length - body.bytesize

      # The rest of the method is explored separately
      # ...
    end
    # ...
  end
end

The start time of the body reading process is saved using Process.clock_gettime with the monotonic clock provided by the OS. This offers much better accuracy compared to the wall clock, which can be subject to adjustments and skew. You may recall that Client#set_ready uses this time to calculate how much time has elapsed between the connection being first established and the body being fully read. This time is then made available to the underlying application as part of the env hash under the puma.request_body_wait key.

The env hash is then checked for the presence of the Expect HTTP header. Note that Puma’s parser uppercases all HTTP header names, replaces dashes with underscores, and prefixes most of them with HTTP_ - this is why the Expect header becomes the HTTP_EXPECT key in the env hash.

When the client sends 100-continue as the value of this header, it allows the server to tell the client whether it will serve the request or not. This helps prevent the client from sending a potentially large HTTP body that the server may reject due to its size. The client may postpone sending the body until it receives a positive response from the server. In other words, it prevents unnecessary work. Puma always responds to such requests with an HTTP/1.1 100 Continue\r\n\r\n response. Note that this response is one of the so-called informational responses and does not constitute a full-fledged HTTP response to the provided request.
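
From the client’s perspective, the handshake looks like this - a hedged sketch that assumes a Puma application listening on localhost:9292 and accepting POST /upload:

require 'socket'

sock = TCPSocket.new('localhost', 9292)
sock.write "POST /upload HTTP/1.1\r\n"  \
           "Host: localhost\r\n"        \
           "Content-Length: 5\r\n"      \
           "Expect: 100-continue\r\n\r\n"

puts sock.readline # => "HTTP/1.1 100 Continue"
sock.readline      # the blank line terminating the interim response

sock.write 'hello' # only now is the body sent
puts sock.readline # the real status line, e.g. "HTTP/1.1 200 OK"
sock.close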

But why does Puma do this when it’s possible to specify http_content_length_limit as a configuration option, which implies that some requests should not receive this kind of response? There are several possible scenarios: if the entire request, exceeding the preconfigured limit, is received via a single read_nonblock call, then the above_http_content_limit(@parser.body.bytesize) check that occurs right after @parser.execute will detect it. If this isn’t the case and multiple Client#try_to_finish calls are required, the two guard clauses at the beginning of the method - previously glossed over - will detect that the value from the now-parsed Content-Length header exceeds the allowed length limit and will prevent the body from being parsed.

However, there seems to be a third possibility: if all of the headers are read in a single read_nonblock call, but the body is either not read as part of it or is read only up to the http_content_length_limit, then setup_body will be executed, and the response with the 100 status code will be sent, even though it shouldn’t be. If this assessment is correct, it seems to be a bug. Its impact should be minimal, though, as this feature is largely implemented on a best-effort basis: most clients may simply not care and may send potentially large requests outright.

What’s interesting here is that the response is written immediately using IO#<<, after which IO#flush is called. The reason for calling flush here is that most of the IO write methods, such as <<, write data to an internal buffer managed by Ruby first, not directly to the OS-managed buffer. A call to << may call write, which moves data from the user-space buffer to the OS buffer, but only when certain conditions are met, such as when the Ruby buffer exceeds a threshold. Otherwise, data remains in the user-space buffer. Flushing guarantees that the current contents of the buffer are immediately seen by the OS.
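
This is easy to observe with a pipe, explicitly putting it into buffered mode (a minimal sketch):

r, w = IO.pipe
w.sync = false          # make Ruby buffer writes in user space

w << 'ping'             # lands in Ruby's internal buffer, not the pipe
begin
  r.read_nonblock(4)
rescue IO::WaitReadable
  puts 'nothing in the pipe yet'
end

w.flush                 # hand the buffered bytes over to the OS
puts r.read_nonblock(4) # => "ping"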

Why is flushing important here? Most likely because, without it, the client might not receive the 100 Continue response in time: a write operation without a subsequent flush may not actually send data over the network for a while. If this were to happen, the Client instance could get stuck on subsequent iterations of try_to_finish, as the actual client on the other side of the network would simply not send the request body, never having received a 100 status code response.

Next, @read_header is set to false. This marks the completion of header parsing and ensures that subsequent try_to_finish invocations will call read_body (not setup_body).

Following this demarcation, setup_body splits into two possible branches. The first branch, triggered if the request contains the Transfer-Encoding header, processes chunked requests:

# lib/puma/const.rb

module Puma
  module Const
    # ...
    CHUNKED = "chunked"
    # ...
    TRANSFER_ENCODING2 = "HTTP_TRANSFER_ENCODING"
    # ...
  end
end

# lib/puma/client.rb

module Puma
  class Client
    ALLOWED_TRANSFER_ENCODING = %w[compress deflate gzip].freeze
    # ...
    TE_ERR_MSG = 'Invalid Transfer-Encoding'
    # ...

    def setup_body
      # ...
      te = @env[TRANSFER_ENCODING2]
      if te
        te_lwr = te.downcase
        if te.include? ','
          te_ary = te_lwr.split ','
          te_count = te_ary.count CHUNKED
          te_valid = te_ary[0..-2].all? { |e| ALLOWED_TRANSFER_ENCODING.include? e }
          if te_ary.last == CHUNKED && te_count == 1 && te_valid
            @env.delete TRANSFER_ENCODING2
            return setup_chunked_body body
          elsif te_count >= 1
            raise HttpParserError   , "#{TE_ERR_MSG}, multiple chunked: '#{te}'"
          elsif !te_valid
            raise HttpParserError501, "#{TE_ERR_MSG}, unknown value: '#{te}'"
          end
        elsif te_lwr == CHUNKED
          @env.delete TRANSFER_ENCODING2
          return setup_chunked_body body
        elsif ALLOWED_TRANSFER_ENCODING.include? te_lwr
          raise HttpParserError     , "#{TE_ERR_MSG}, single value must be chunked: '#{te}'"
        else
          raise HttpParserError501  , "#{TE_ERR_MSG}, unknown value: '#{te}'"
        end
      end
      # ...
    end
    # ...
  end
end

Chunked transfer encoding is an HTTP/1.1 feature that, essentially, enables a rudimentary form of streaming, though not without limitations. With it, the request body can be split into chunks, providing an alternative to specifying the entire body size in the Content-Length header. Using Content-Length would require the whole body to be generated first. In fact, according to the specification, Content-Length must not be set when the body is encoded.

Encodings are listed in the order they were applied. For example, the Transfer-Encoding value can be gzip, chunked, meaning the body contents were first compressed and then chunked. Puma only allows requests where the last applied encoding is chunked, because the HTTP specification states that a request can be chunked only once and that chunking must be the final encoding applied. Therefore, chunking the request first and then compressing it must not occur. Any HTTP/1.1-compliant server must support chunking.

The conditionals in the snippet above simply determine whether the provided headers satisfy the HTTP specification. If they do, the header is deleted, and setup_chunked_body is invoked. This article will omit the internals of this and adjacent methods for brevity, considering them as a black box with a slightly more complicated version of the non-encoded request reading logic.

When the incoming request is chunked, setup_body returns early.

The second branch of setup_body is triggered when a request is not encoded. Let’s take a look at it in the snippet below with the original comments preserved:

module Puma
  class Client
    # ...
    CONTENT_LENGTH_VALUE_INVALID = /[^\d]/.freeze
    # ...

    def setup_body
      # ...

      te = @env[TRANSFER_ENCODING2]
      if te
         # ...
      end

      @chunked_body = false

      cl = @env[CONTENT_LENGTH]

      if cl
        if CONTENT_LENGTH_VALUE_INVALID.match?(cl) || cl.empty?
          raise HttpParserError, "Invalid Content-Length: #{cl.inspect}"
        end
      else
        @buffer = body.empty? ? nil : body
        @body = EmptyBody
        set_ready
        return true
      end

      content_length = cl.to_i

      remain = content_length - body.bytesize

      if remain <= 0
        # Part of the body is a pipelined request OR garbage. We'll deal with that later.
        if content_length == 0
          @body = EmptyBody
          if body.empty?
            @buffer = nil
          else
            @buffer = body
          end
        elsif remain == 0
          @body = StringIO.new body
          @buffer = nil
        else
          @body = StringIO.new(body[0,content_length])
          @buffer = body[content_length..-1]
        end
        set_ready
        return true
      end

      if remain > MAX_BODY
        @body = Tempfile.create(Const::PUMA_TMP_BASE)
        File.unlink @body.path unless IS_WINDOWS
        @body.binmode
        @tempfile = @body
      else
        # The body[0,0] trick is to get an empty string in the same
        # encoding as body.
        @body = StringIO.new body[0,0]
      end

      @body.write body

      @body_remain = remain

      false
    end
  end
end

The contents of the Content-Length header, if present, are validated against the /[^\d]/ regular expression, which matches any non-digit character; a match, or an empty value, results in an HttpParserError.

Here’s an interesting part: if the Content-Length header is not provided, the body is not read at all. setup_body returns true, signifying that the client is ready. The request will still be served, and application code (such as the corresponding Rails controller) will still be called, although this will most likely result in undesired effects.

The specified content length is then used to calculate how many bytes of the body have not been read yet.

If the remaining number of bytes is 0, it means that the body was fully read in one sitting. In this case, the body string is wrapped in a StringIO instance. This class simply wraps a String in an IO-compatible API.

If the remaining amount is less than 0, it means that the client sent more data than the size specified in Content-Length. In this case, the body is trimmed at the boundary, and the rest is added to the @buffer. Remember this, as it will become relevant later. Having both the @body (don’t confuse it with the body local variable sourced from @parser) and @buffer instance variables may seem redundant, but it is necessary, and we will see why soon.

Once Content-Length bytes have been read, the client is marked as ready, and the method returns. Otherwise, the request is not considered fully read.

Puma then performs an optimization: if the remaining bytes to be read exceed MAX_BODY, which is the same as the max header size (114688 bytes), Puma moves the body out of memory and into a tempfile. This prevents large request bodies from residing in the server’s memory. The tempfile is immediately unlinked, most likely for security and cleanup purposes. Interestingly, this does not delete its contents: unlinking only removes the directory entry, while the data remains accessible through the open file descriptor until it is closed. binmode is then called on the tempfile, which achieves the same result as the trick of instantiating a string via String.new instead of via a literal we saw earlier: the encoding of the file’s contents becomes ASCII-8BIT instead of UTF-8. The fact that @body can be a File instance here (Tempfile.create returns a plain File, which is a subclass of IO) should explain why a ‘normal’ body string is wrapped in StringIO in the alternative branch.
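
The unlink-early pattern is worth a quick illustration - a minimal sketch:

require 'tempfile'

body = Tempfile.create('puma-demo') # returns a plain File instance
File.unlink(body.path)              # gone from the directory tree...
body.binmode                        # contents now treated as ASCII-8BIT

body.write('a' * 200_000)           # ...but the file is still fully usable
body.rewind
body.read(5) # => "aaaaa"
body.close   # only now does the OS reclaim the storage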

Next, the body local variable, which was previously extracted from @parser, is written to the @body instance variable. This will eventually contain the whole request payload, possibly after multiple calls to Client#try_to_finish.

We’re done with setup_body. As you may remember, this is the last call in try_to_finish, so we’re almost done with Client’s internals. Before we finish, let’s quickly step back to the beginning of try_to_finish:

module Puma
  class Client
    # ...
    def try_to_finish
      if env[CONTENT_LENGTH] && above_http_content_limit(env[CONTENT_LENGTH].to_i)
        @http_content_length_limit_exceeded = true
      end

      if @http_content_length_limit_exceeded
        @buffer = nil
        @body = EmptyBody
        set_ready
        return true
      end

      return read_body if in_data_phase

      # ...
    end
  end
end

We’ve seen several places where the examined methods return false. Each time this happens, the caller must call try_to_finish again as many times as it takes until it returns true in order to fully read the request. This means that the body (or even the headers) may not be read, parsed, and processed in a single call.

When this occurs, and multiple calls to try_to_finish are required (which happens because data is read from a socket using read_nonblock in a non-blocking manner), the subsequent invocation will call read_body instead of following the path we’ve just explored. Let’s look into this method.

# lib/puma/const.rb

module Puma
  module Const
    # ...
    CHUNK_SIZE = 64 * 1024
    # ...
  end
end

# lib/puma/client.rb

module Puma
  class Client
    # ...
    def read_body
      if @chunked_body
        return read_chunked_body
      end

      remain = @body_remain

      if remain > CHUNK_SIZE
        want = CHUNK_SIZE
      else
        want = remain
      end

      begin
        chunk = @io.read_nonblock(want, @read_buffer)
      rescue IO::WaitReadable
        return false
      rescue SystemCallError, IOError
        raise ConnectionError, "Connection error detected during read"
      end

      unless chunk
        @body.close
        @buffer = nil
        set_ready
        raise EOFError
      end

      remain -= @body.write(chunk)

      if remain <= 0
        @body.rewind
        @buffer = nil
        set_ready
        return true
      end

      @body_remain = remain

      false
    end
    # ...
  end
end

As seen, this method picks up where setup_body leaves off. @body_remain is the number of bytes that still need to be read, calculated using the Content-Length header and the byte size of the first part of the body. If this number exceeds 64 kilobytes, the next read_nonblock call is capped at this limit. If any new data is read, @body_remain gets adjusted based on the amount, and the method can be called again. If, however, this is the call that fully reads the request, @body is rewound using IO#rewind, so that subsequent reads start from the very beginning, and the client is marked as ready.
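
One small detail in the snippet above: read_nonblock is given @read_buffer as a second argument, which makes Ruby reuse that string as the destination instead of allocating a new one on every call. A minimal sketch:

r, w = IO.pipe
w.write('x' * 100)

read_buffer = String.new # reused across calls, like Puma's @read_buffer
chunk = r.read_nonblock(64 * 1024, read_buffer)
chunk.equal?(read_buffer) # => true: no new string was allocated
chunk.bytesize            # => 100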

We’re now finally done with Client.

Processing Requests

We just took a deep dive into Client and its try_to_finish method, which we reached while originally looking into Server#process_client. try_to_finish reads data off a socket, parses HTTP, and buffers the request.

We established that ‘try’ in the client’s method name signifies that this method does not immediately read all the data available from the HTTP connection. It’s expected that this method will be called multiple times until it returns true.

Client#finish is a method that calls try_to_finish repeatedly until it returns true, which means it essentially acts as the blocking API for the client.

Let’s pick up where we left off in Server#process_client and recall where this method is being called from:

module Puma
  class Server
    # ...
    def process_client(client)
      # ...
      begin
        # ...
        with_force_shutdown(client) do
          client.finish(@first_data_timeout)
        end

        while true
          # ...
          case handle_request(client, requests + 1)
          when false
            # ...
          when :async
            # ...
          when true
            # ...
          end
        end

        # ...
      end
    end
    # ...
  end
end

When client.finish returns, a single HTTP request has been fully read and is ready to be processed. Server#handle_request, defined in the Puma::Request module mixed into the Server, is then invoked with the client as an argument. Let’s jump straight into it.

# lib/puma/server.rb

module Puma
  class Server
    # ...

    if Socket.const_defined?(:TCP_INFO) && Socket.const_defined?(:IPPROTO_TCP)
      UNPACK_TCP_STATE_FROM_TCP_INFO = "C".freeze

      def closed_socket?(socket)
        skt = socket.to_io
        return false unless skt.kind_of?(TCPSocket) && @precheck_closing

        begin
          tcp_info = skt.getsockopt(Socket::IPPROTO_TCP, Socket::TCP_INFO)
        rescue IOError, SystemCallError
          # ...
          @precheck_closing = false
          false
        else
          state = tcp_info.unpack(UNPACK_TCP_STATE_FROM_TCP_INFO)[0]
          # TIME_WAIT: 6, CLOSE: 7, CLOSE_WAIT: 8, LAST_ACK: 9, CLOSING: 11
          (state >= 6 && state <= 9) || state == 11
        end
      end
    else
      def closed_socket?(socket)
        false
      end
    end

    # ...
  end
end

# lib/puma/request.rb

module Puma
  module Request
    # ...
    def handle_request(client, requests)
      env = client.env
      io_buffer = client.io_buffer
      socket  = client.io
      app_body = nil

      return false if closed_socket?(socket)

      if client.http_content_length_limit_exceeded
        return prepare_response(413, {}, ["Payload Too Large"], requests, client)
      end

      normalize_env env, client

      env[PUMA_SOCKET] = socket

      if env[HTTPS_KEY] && socket.peercert
        env[PUMA_PEERCERT] = socket.peercert
      end

      env[HIJACK_P] = true
      env[HIJACK] = client

      env[RACK_INPUT] = client.body
      env[RACK_URL_SCHEME] ||= default_server_port(env) == PORT_443 ? HTTPS : HTTP

      if @early_hints
        env[EARLY_HINTS] = lambda { |headers|
          begin
            unless (str = str_early_hints headers).empty?
              fast_write_str socket, "HTTP/1.1 103 Early Hints\r\n#{str}\r\n"
            end
          rescue ConnectionError => e
            @log_writer.debug_error e
          end
        }
      end

      req_env_post_parse env

      # ...
    end
  end
end

The configured env hash, the underlying socket, and an io_buffer (which we haven’t seen yet) are assigned to local variables. At this point, io_buffer is empty.

A guard clause immediately checks if the socket is closed by the client. It turns out that determining this is not trivial in Ruby. A separate closed_socket? method encapsulates the corresponding logic, built on the BasicSocket#getsockopt method, which in turn uses the eponymous system call. This syscall allows the programmer to source socket options, with the option of interest here being TCP_INFO. Interestingly, this option is not available on all operating systems and kernels. Darwin, for example, does not expose it, so closed_socket? always returns false on Mac as a result. If the platform supports this option, simply calling getsockopt is not enough. The value returned by this method dumps the underlying tcp_info C struct as a raw string, so extraction of particular fields must be done manually. Fortunately, tcpi_state is the very first attribute of this struct, so it can be extracted with a simple unpack call using the C string format. The C argument tells Ruby to interpret the first byte of the payload as an 8-bit unsigned char, which matches the type of tcpi_state. After extracting the number representing the current state, it is compared against the predefined set of closed states - if a match occurs, the socket is considered closed.
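
Here’s the same technique in isolation - a Linux-only sketch that assumes outbound network access:

require 'socket'

sock = TCPSocket.new('example.com', 80)
raw  = sock.getsockopt(Socket::IPPROTO_TCP, Socket::TCP_INFO)

# tcpi_state is the first byte of struct tcp_info; 1 is TCP_ESTABLISHED.
state = raw.unpack('C')[0]
puts state # => 1 right after connecting
sock.close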

The reason requests are discarded if the client closes the socket after establishing the connection and sending data is to help mitigate basic denial-of-service attacks from malicious actors. This, however, prevents Puma from serving so-called fire-and-forget requests, where a client does not wait for the server to respond. This is a specific and arguably unorthodox use case, which could also suggest that a different server - one not as universal as Puma - should be used instead.

Next, Puma immediately responds with a 413 status code and terminates the request if its payload exceeds the optionally configured value. prepare_response is an important function used for other purposes as well, so we will explore it later.

After checking the payload size, normalize_env is called to normalize certain HTTP headers with which the env hash was populated. We’ll skip this since it doesn’t contain much of note.

The socket itself is then added to the env hash as a key-value pair, alongside the client. A mysterious HIJACK_P key is also added with its value set to true. We’ll see what it’s for in a moment.

If HTTP early hints are enabled using the early_hints configuration option, Puma constructs a lambda that emits the appropriate response and adds it to the env hash. Note that it’s not called immediately. In fact, it’s not called by Puma at all - the underlying Rack application has to do that when it deems it necessary. We won’t delve into how the payload is generated in str_early_hints, but we will look at fast_write_str later.

Following that, req_env_post_parse is called, which performs additional header manipulations.

Calling the App

After this, arguably the most pivotal event occurs - the reason why Puma exists in the first place: the underlying Rack app, such as a Rails application, is finally called.

module Puma
  module Request
    # ...

    def handle_request(client, requests)
      # ...

      begin
        if @supported_http_methods == :any || @supported_http_methods.key?(env[REQUEST_METHOD])
          status, headers, app_body = @thread_pool.with_force_shutdown do
            @app.call(env)
          end
        else
          @log_writer.log "Unsupported HTTP method used: #{env[REQUEST_METHOD]}"
          status, headers, app_body = [501, {}, ["#{env[REQUEST_METHOD]} method is not supported"]]
        end

        res_body = app_body

        return :async if client.hijacked

        status = status.to_i

        if status == -1
          unless headers.empty? and res_body == []
            raise "async response must have empty headers and body"
          end

          return :async
        end
      rescue ThreadPool::ForceShutdown => e
        @log_writer.unknown_error e, client, "Rack app"
        @log_writer.log "Detected force shutdown of a thread"

        status, headers, res_body = lowlevel_error(e, env, 503)
      rescue Exception => e
        @log_writer.unknown_error e, client, "Rack app"

        status, headers, res_body = lowlevel_error(e, env, 500)
      end
      prepare_response(status, headers, res_body, requests, client)
    ensure
      io_buffer.reset
      uncork_socket client.io
      app_body.close if app_body.respond_to? :close
      client&.tempfile_close
      # ...
    end

    # ...
  end
end

However, before the application is invoked, a quick check happens to determine whether the incoming HTTP request method is allowed - by default, Puma accepts a predefined list of standard (and WebDAV) methods, and the supported_http_methods configuration option can widen this to any method or narrow it further. If the incoming request does not pass, a 501 status code response is generated. Note that this response is not immediately sent.

The @app is a callable object that was previously passed to the Server all the way from Puma::Configuration. We referred to it as a ‘Rack app’ earlier. If you don’t recall where this app comes from, you can refresh your memory by revisiting the earlier section on Rackup.

As a quick refresher, the @app is an object that responds to the #call method with a single argument: the Rack env hash. This allows Puma to run any framework or any code that conforms to this API. For example, Rails follows this convention and exposes the application to Rack in this way. This occurs well before the routing table is consulted and the appropriate controller is extracted, since @app.call(env) is THE entry point to the underlying application.

Applications that conform to the Rack specification should return a 3-element array when finishing processing a request, containing the response status, a hash of headers, and the response body.
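
For illustration, the entire contract fits in a few lines - a minimal config.ru sketch:

# config.ru - runnable with `puma` or `rackup`
class HelloApp
  def call(env)
    body = "Hello from #{env['PATH_INFO']}\n"
    [200, { 'content-type' => 'text/plain' }, [body]]
  end
end

run HelloApp.new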

After @app.call returns, most of the application-specific code has already run. In a typical Rails application, this means that the controller action method will have already returned by now.

Before the response can be written, however, one important thing must be checked. Remember the HIJACK keys added to the env hash earlier in handle_request? This is Rack’s hijack feature, which allows the underlying application to either take over the connection completely, known as a full hijack, or only the response body, known as a partial hijack. Both modes allow streaming the response body directly to the client from the application and enable bi-directional streaming. Rails’ ActionCable uses this exact feature to integrate WebSockets.

If the app performed a full hijack, client.hijacked will have been set to true. The application can achieve this by calling the Client object passed from the server via the rack.hijack key in the env hash. For reference, here’s what Client#call does:

module Puma
  class Client
    # ...
    def call
      @hijacked = true
      env['rack.hijack_io'] ||= @io
    end
    # ...
  end
end

In the case of a full hijack, Puma completely delegates control over the client object, as well as the underlying socket, to the application. It does this by immediately returning the :async symbol from handle_request. We’ll see how this is handled by the Server shortly.

There’s another way to fully hijack the socket from Puma - by returning the [-1, {}, []] array from the Rack app. This behaviour seems to be the predecessor of the Rack hijack specification.

Partial hijack does not relinquish Puma’s control over the client and is handled later - we’ll get a chance to see how it’s implemented.

Finally, handle_request also handles errors by rescuing Exception - the ancestor of all exceptions in Ruby. In the rare case that the underlying Rack app did not handle an exception itself (frameworks like Rails usually rescue application errors on their own), Puma has no choice but to log it and return a 500 status code response. As in the previous rescue block, the response is not immediately written to the client.

Once the request is fully served, some cleanup is in order. This happens in the ensure block, which runs regardless of whether the wrapped code raised an exception. The io_buffer, which we briefly mentioned, is reset, the socket is uncorked (corking is covered shortly), and the response body is closed (if it is closable - more on that in the next section), along with the tempfile that could have been used to buffer large request bodies.

Buffering the Response

Now it’s time to prepare the response for sending. Here’s the first part of prepare_response, which handle_request calls at the very end:

module Puma
  module Request
    # ...
    def prepare_response(status, headers, res_body, requests, client)
      env = client.env
      socket = client.io
      io_buffer = client.io_buffer

      return false if closed_socket?(socket)

      force_keep_alive = if @enable_keep_alives
        # ...
      else
        false
      end

      resp_info = str_headers(env, status, headers, res_body, io_buffer, force_keep_alive)

      # ...
    end
    # ...
  end
end

The three most important attributes of a client are first extracted into local variables: the env hash, the io socket, and the io_buffer, an instance of a small Puma utility class IOBuffer, which wraps Ruby’s StringIO.

Puma then checks whether the socket is still open, in exactly the same way as in handle_request.

Following that, the force_keep_alive parameter is derived from the @enable_keep_alives Puma configuration option and some logic that we will skip over for now. We will come back to the topic of keep-alives in a separate section. For now, let’s consider this variable to be false.

A call to str_headers is then made. Let’s dive into it:

# lib/puma/const.rb

module Puma
  module Const
    # ...
    LINE_END = "\r\n"
    # ...
    COLON = ": "
    # ...
    REQUEST_METHOD = "REQUEST_METHOD"
    HEAD = "HEAD"
    # ...
    SERVER_PROTOCOL = "SERVER_PROTOCOL"
    HTTP_11 = "HTTP/1.1"
    # ...
    HTTP_CONNECTION = "HTTP_CONNECTION"
    # ...
    CLOSE = "close"
    KEEP_ALIVE = "keep-alive"
    # ...
    HTTP_11_200 = "HTTP/1.1 200 OK\r\n"
    HTTP_10_200 = "HTTP/1.0 200 OK\r\n"
    # ...
    STATUS_WITH_NO_ENTITY_BODY = {
      204 => true,
      205 => true,
      304 => true
    }.freeze
  end
end

# lib/puma/request.rb

module Puma
  module Request
    # ...
    def str_headers(env, status, headers, res_body, io_buffer, force_keep_alive)
      line_ending = LINE_END
      colon = COLON

      resp_info = {}
      resp_info[:no_body] = env[REQUEST_METHOD] == HEAD

      http_11 = env[SERVER_PROTOCOL] == HTTP_11
      if http_11
        resp_info[:allow_chunked] = true
        resp_info[:keep_alive] = env.fetch(HTTP_CONNECTION, "").downcase != CLOSE

        if status == 200
          io_buffer << HTTP_11_200
        else
          io_buffer.append "#{HTTP_11} #{status} ", fetch_status_code(status), line_ending

          resp_info[:no_body] ||= status < 200 || STATUS_WITH_NO_ENTITY_BODY[status]
        end
      else
        resp_info[:allow_chunked] = false
        resp_info[:keep_alive] = env.fetch(HTTP_CONNECTION, "").downcase == KEEP_ALIVE

        if status == 200
          io_buffer << HTTP_10_200
        else
          io_buffer.append "HTTP/1.0 #{status} ",
                       fetch_status_code(status), line_ending

          resp_info[:no_body] ||= status < 200 || STATUS_WITH_NO_ENTITY_BODY[status]
        end
      end

      # ...
    end
    # ...
  end
end

This method is responsible for writing out most of the response headers that have been set by the application. In addition, it also builds a resp_info hash, whose contents are derived from the response headers, as well as other factors.

Most notably, the beginning of this method is spent determining the values for no_body, allow_chunked, and keep_alive.

Even though Puma is advertised as an HTTP/1.1 server, it also supports HTTP/1.0, which is not surprising - HTTP/1.1 could be mostly considered a superset of 1.0. The if http_11 conditional is the first explicit instance of logic we’ve encountered that branches based on the incoming message’s protocol version. It clearly highlights the key differences between the two versions: HTTP/1.1 offers chunked transfer encoding, whose implementation we’ve seen earlier, and keep-alives, whose exploration we’ve deferred for later. Interestingly, as shown in the second branch, keep-alives are still technically possible with HTTP/1.0 if the client supports them and explicitly states this using the Connection: keep-alive header. The keep-alive behaviour can be disabled similarly if the HTTP/1.1 client explicitly disables it using Connection: close.

The very first response line is then written to IOBuffer, with an inlined optimisation for 200 responses. fetch_status_code extracts the message corresponding to a status from a Puma::HTTP_STATUS_CODES constant, which contains mappings like 404 => 'Not Found' and others.

Note that writes to IOBuffer do not get written straight to the client - this is not the socket, after all.

The method continues:

module Puma
  module Request
    # ...
    def str_headers(env, status, headers, res_body, io_buffer, force_keep_alive)
      # ...

      resp_info[:keep_alive] &&= @queue_requests

      resp_info[:keep_alive] &&= force_keep_alive

      resp_info[:response_hijack] = nil

      headers.each do |k, vs|
        next if illegal_header_key?(k)

        case k.downcase
        when CONTENT_LENGTH2
          next if illegal_header_value?(vs)
          resp_info[:content_length] = vs&.to_i
          next
        when TRANSFER_ENCODING
          resp_info[:allow_chunked] = false
          resp_info[:content_length] = nil
          resp_info[:transfer_encoding] = vs
        when HIJACK
          resp_info[:response_hijack] = vs
          next
        when BANNED_HEADER_KEY
          next
        end

        ary = if vs.is_a?(::Array) && !vs.empty?
          vs
        elsif vs.respond_to?(:to_s) && !vs.to_s.empty?
          vs.to_s.split NEWLINE
        else
          nil
        end
        if ary
          ary.each do |v|
            next if illegal_header_value?(v)
            io_buffer.append k, colon, v, line_ending
          end
        else
          io_buffer.append k, colon, line_ending
        end
      end

      if http_11
        io_buffer << CONNECTION_CLOSE if !resp_info[:keep_alive]
      else
        io_buffer << CONNECTION_KEEP_ALIVE if resp_info[:keep_alive]
      end
      resp_info
    end
    # ...
  end
end

The keep_alive parameter, which was derived from the client’s Connection header value, then goes through two boolean AND operations: one with the global @queue_requests configuration option and another with the force_keep_alive parameter we’ve already seen generated in prepare_response. If either of these evaluates to false, keep-alive behaviour is disabled regardless of what the client wanted.

The response headers obtained from the underlying Rack app are then iterated over and written to the buffer. The iteration has a few special cases, so let’s cover them:

  • The value of the Content-Length response header, if present, is stored in resp_info. The header itself does not get written to the buffer, at least not yet;
  • If the response header hash contains a rack.hijack key, then the Rack application has requested a partial hijack. This obviously should not be sent back to the client as a response header, since it’s an implementation detail, so it’s discarded just like Content-Length;
  • If the Transfer-Encoding response header is present, the content_length key, which could have been set by a previous iteration, gets nullified. The provided value is added to resp_info, while allow_chunked gets set to false.

The keep_alive flag then manifests as the Connection response header, which is only written when the desired behaviour differs from the protocol’s default: HTTP/1.0 closes the connection unless Connection: keep-alive is explicitly sent, while HTTP/1.1 keeps it open unless Connection: close is sent.

At this point, most of the response headers are written to the buffer. Let’s pick up where we left off in prepare_response:

module Puma
  module Request
    # ...
    def prepare_response(status, headers, res_body, requests, client)
      # ...

      resp_info = str_headers(env, status, headers, res_body, io_buffer, force_keep_alive)

      close_body = false
      response_hijack = nil
      content_length = resp_info[:content_length]
      keep_alive     = resp_info[:keep_alive]

      if res_body.respond_to?(:each) && !resp_info[:response_hijack]
        if !content_length && !resp_info[:transfer_encoding] && status != 204
          if res_body.respond_to?(:to_ary) && (array_body = res_body.to_ary) &&
              array_body.is_a?(Array)
            body = array_body.compact
            content_length = body.sum(&:bytesize)
          elsif res_body.is_a?(File) && res_body.respond_to?(:size)
            body = res_body
            content_length = body.size
          elsif res_body.respond_to?(:to_path) && File.readable?(fn = res_body.to_path)
            body = File.open fn, 'rb'
            content_length = body.size
            close_body = true
          else
            body = res_body
          end
        elsif !res_body.is_a?(::File) && res_body.respond_to?(:to_path) &&
            File.readable?(fn = res_body.to_path)
          body = File.open fn, 'rb'
          content_length = body.size
          close_body = true
        elsif !res_body.is_a?(::File) && res_body.respond_to?(:filename) &&
            res_body.respond_to?(:bytesize) && File.readable?(fn = res_body.filename)
          content_length = res_body.bytesize unless content_length
          if (body_str = res_body.to_hash[:source])
            body = [body_str]
          else
            body = File.open fn, 'rb'
            close_body = true
          end
        else
          body = res_body
        end
      else
        response_hijack = resp_info[:response_hijack] || res_body
      end

      # ...
    end
    # ...
  end
end

This giant conditional is where the Rack response body object provided by the underlying application gets manipulated for the first time.

Let’s start with the outermost condition: if res_body.respond_to?(:each) && !resp_info[:response_hijack].

The Rack specification dictates that the body returned from the Rack application must implement each or call. This means that a body can be an Array of strings, a file-like object, a Proc-like object, or another generic enumerable. The last option is usually the most common one.

Bodies that respond to each are considered enumerable, while those that respond to call are considered streaming bodies. This means the first branch is meant for the former.

The second branch is triggered for streaming bodies: if the application performs a partial hijack, it is expected to put a callable object into the rack.hijack header. This object’s #call should accept a single argument: an IO-like object representing the connection. If it is present, it gets assigned to the response_hijack variable.

Can it not be present? Yes. Since Rack 3, the partial hijack feature, whose purpose is mainly bi-directional streaming, can be implemented in a simpler manner: using the aforementioned streaming body. As we’ve found, if the body does not respond to #each, it is expected to respond to #call, which is the same interface as the hijack proc. This is why the body itself is assigned to the response_hijack variable in this case.
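
To make the two styles concrete, here’s a hedged sketch of both - the Rack 3 streaming body and the older partial hijack conveying an equivalent callable via the rack.hijack response header:

# Rack 3 streaming body: the body itself responds to #call.
class StreamingApp
  def call(env)
    body = proc do |stream|
      stream.write "one\n"
      stream.write "two\n"
      stream.close
    end
    [200, { 'content-type' => 'text/plain' }, body]
  end
end

# Partial hijack: the same kind of callable, passed as a response header.
class HijackApp
  def call(env)
    hijack = proc do |stream|
      stream.write "one\ntwo\n"
      stream.close
    end
    [200, { 'rack.hijack' => hijack, 'content-type' => 'text/plain' }, []]
  end
end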

We’re done with streaming bodies and hijacking for now, so let’s now analyze the first branch, which branches out further into the second layer of nesting.

The first branch in the second layer is triggered if the application did not provide the Content-Length header explicitly, did not specify the transfer encoding, and does not return a 204 ‘No Content’ response status. This constitutes most of the standard responses, such as those generated by Rails using render json: or render text:.

The second and third branches handle bodies that are supposed to be read from files.

Back to the first branch, the third layer of nested conditionals handles all the possible ways in which the underlying Rack application can return the body: as an enumerable, as an actual file, or as a path to the file. Ultimately, the full contents of any of these bodies are assigned to the body variable, with content_length calculated based on their sizes.

Before we proceed, we can already highlight the key difference between enumerable and streaming bodies: the former must be provided upfront, while streaming bodies, as we’ll see further, allow parts of the body to be streamed as it’s being generated by the application.

Writing the Response

At this point, prepare_response has either accumulated the entire body in memory, opened a file containing the body, or completely avoided touching the body if the application indicates that it wants to stream the response.

Here’s what happens next:

# lib/puma/request.rb

module Puma
  module Request
    # ...
    def prepare_response(status, headers, res_body, requests, client)
      # ...

      if res_body.respond_to?(:each) && !resp_info[:response_hijack]
        # ...
      end
      
      line_ending = LINE_END

      cork_socket socket

      # ...
    end
    # ...
  end
end

# lib/puma/server.rb

module Puma
  class Server
    # ...

    if Socket.const_defined?(:TCP_CORK) && Socket.const_defined?(:IPPROTO_TCP)
      def cork_socket(socket)
        skt = socket.to_io
        begin
          skt.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_CORK, 1) if skt.kind_of? TCPSocket
          # ...
        end
      end
  
      def uncork_socket(socket)
        skt = socket.to_io
        begin
          skt.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_CORK, 0) if skt.kind_of? TCPSocket
          # ...
        end
      end
    end

    # ...
  end
end

The TCP_CORK option is enabled on the socket using setsockopt, provided the kernel supports it.

If you recall the earlier section where we covered the low_latency bind address parameter, which enables the TCP_NODELAY option and disables Nagle’s algorithm, you’ll remember the brief mention of TCP_CORK and how it overrides TCP_NODELAY. Puma decides to utilize corking unconditionally, which makes TCP_NODELAY seemingly redundant if TCP_CORK is available.

So, what exactly does TCP_CORK do? In essence, it could be said that it’s similar to Nagle’s algorithm, since corking achieves the opposite of having TCP_NODELAY on. When enabled on a socket, data sent to it will be aggressively buffered until the application undoes the cork or until 200ms pass. This can potentially increase throughput when the application sends large responses.

After the socket gets corked, making it so that the kernel will try its best to buffer data before sending it over the network, prepare_response makes the final preparations:

# lib/puma/const.rb

module Puma
  module Const
    # ...
    CONTENT_LENGTH_S = "Content-Length: "
    # ...
    TRANSFER_ENCODING_CHUNKED = "Transfer-Encoding: chunked\r\n"
    # ...
  end
end

# lib/puma/request.rb

module Puma
  module Request
    # ...
    def prepare_response(status, headers, res_body, requests, client)
      # ...
      cork_socket socket

      if resp_info[:no_body]
        unless status == 101
          if content_length && status != 204
            io_buffer.append CONTENT_LENGTH_S, content_length.to_s, line_ending
          end

          io_buffer << LINE_END
          fast_write_str socket, io_buffer.read_and_reset
          socket.flush
          return keep_alive
        end
      else
        if content_length
          io_buffer.append CONTENT_LENGTH_S, content_length.to_s, line_ending
          chunked = false
        elsif !response_hijack && resp_info[:allow_chunked]
          io_buffer << TRANSFER_ENCODING_CHUNKED
          chunked = true
        end
      end

      io_buffer << line_ending

      if response_hijack
        fast_write_str socket, io_buffer.read_and_reset
        uncork_socket socket
        response_hijack.call socket
        return :async
      end

      fast_write_response socket, body, io_buffer, chunked, content_length.to_i
      body.close if close_body
      keep_alive
    end
    # ...
  end
end

If the response contains no body, Puma adds the final CR LF sequence demarcating the end of the header section, calls fast_write_str, and then flushes the socket – we’ve already clarified what this does and why it’s important. The flush also implies that fast_write_str actually writes data to the socket - we’ll look into it soon. The method then immediately returns the previously calculated keep_alive value.

If there is a body and the Content-Length header is present, it gets written to the buffer. If the header is not present, there are two options: either the application wants to stream the body via a partial hijack or by providing a call-able body, or the body is chunked. If it’s chunked, an appropriate header is written, and execution continues.

Afterwards, the final end-of-line marker is written to signify the end of the header section.

In case the body is streaming and not enumerable, or if the client has been hijacked by the application, headers accumulated so far in the buffer get written using fast_write_str. The socket gets uncorked, and the provided lambda, previously assigned to response_hijack, is called. The method then immediately returns the :async symbol.

If execution reaches this point, it means the body is enumerable and needs to be written to the socket by the web server itself. This happens in fast_write_response. After it returns, the body is closed in case it was an opened file, and the keep_alive value is returned.

Let’s look at fast_write_str, which we’ve seen called numerous times:

module Puma
  module Request
    # ...
    SOCKET_WRITE_ERR_MSG = "Socket timeout writing data"
    # ...
    def fast_write_str(socket, str)
      n = 0
      byte_size = str.bytesize
      while n < byte_size
        begin
          n += socket.write_nonblock(n.zero? ? str : str.byteslice(n..-1))
        rescue Errno::EAGAIN, Errno::EWOULDBLOCK
          unless socket.wait_writable 10
            raise ConnectionError, SOCKET_WRITE_ERR_MSG
          end
          retry
        rescue  Errno::EPIPE, SystemCallError, IOError
          raise ConnectionError, SOCKET_WRITE_ERR_MSG
        end
      end
    end
    # ...
  end
end

This method is straightforward: it takes the given string, gets its byte size, and calls write_nonblock, which uses the write system call, repeatedly until the string is fully written to the socket. Satisfying the while condition can take an indefinite number of system calls, because the underlying write operates in a non-blocking manner. wait_writable is used to pause until the socket can accept more data. We've already discussed this mode of operation when talking about accept_nonblock - the principle behind waiting is the same for both reads and writes.

Whenever this method is called, the client may begin to see the response data on its socket. Up until that moment, response data is usually being written to the local io_buffer by Puma. As we’ve seen above, once the time comes, read_and_reset is called on this buffer, and its return value (which is its entire contents) is passed as the str argument to fast_write_str.

There is one final part of response writing that we haven’t covered yet: the fast_write_response method, which is called at the very end of prepare_response. We will not examine it in detail – it’s quite large, as it handles both string and file bodies, as well as chunking. It writes data to the socket using fast_write_str, which we’ve just seen. If the response is chunked, it is written using multiple fast_write_str calls.

However, there is one important detail we must mention. When the body provided by the underlying application is a file larger than 64 kilobytes, it is not read into memory but sent directly from the file itself using IO.copy_stream. This is achieved via the sendfile system call, which copy_stream calls if it’s available on the underlying system. This is usually much more efficient, as the file contents do not have to be copied to user space and can instead be transferred between file descriptors entirely within the kernel.
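
This technique is available to any Ruby program, not just Puma. A minimal sketch of the pattern (the port and file name are made up):

require 'socket'

server = TCPServer.new(9292)
client = server.accept

# When copying from a File to a socket, IO.copy_stream can delegate to
# sendfile(2), so the file contents never enter the Ruby heap.
File.open('large_response.bin') do |file|
  IO.copy_stream(file, client)
end

client.close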

That’s it! We’re done with response writing.

Keeping Connections Alive

We just examined in depth how Puma processes individual HTTP requests in Server#handle_request. Before diving into this, we left off at Server#process_client, which ThreadPool calls when it pops available clients off the todo array.

Now, let’s continue:

module Puma
  class Server
    # ...
    def process_client(client)
      # ...
      close_socket = true
      # ...

        with_force_shutdown(client) do
          client.finish(@first_data_timeout)
        end

        while true
          @requests_count += 1
          case handle_request(client, requests + 1)
          when false
            break
          when :async
            close_socket = false
            break
          when true
            # ...
          end
        end
        true
      rescue StandardError => e
        client_error(e, client, requests)
        requests > 0
      ensure
        client.io_buffer.reset

        begin
          client.close if close_socket
        rescue IOError, SystemCallError
          # ...
        rescue StandardError => e
          @log_writer.unknown_error e, nil, "Client"
        end
      end
    end
    # ...
  end
end   

handle_request, as we’ve seen, returns three possible values: true, false, and :async.

Let’s address :async first. This value indicates that the application has taken over the connection. The application does this either via the Rack hijack specification or by returning a streaming callable body. In this case, Puma should not do anything further with the socket, which is why it simply breaks the loop. Note that the client is not closed in the ensure block, since close_socket is set to false. If Puma were to close the client here, the application would not be able to stream the body or perform its tasks.

If :async is returned, Puma completely forgets about the client connection. At this point, it is no longer part of the ThreadPool’s queue, so breaking out of the loop effectively removes it, with no cleanup necessary.

The remaining two possible boolean return values are related to keep-alive behaviour, also known as persistent connections. This is an HTTP 1.1 feature that allows the client and server to reuse the same TCP connection for multiple HTTP requests.

In HTTP 1.0, every response is supposed to be served on a separate connection. Once the response is sent back to the client, the connection is closed. This is obviously inefficient if the client sends multiple requests. Without keep-alive, the client would need to re-establish a new TCP connection for every new request.

HTTP 1.1, by contrast, makes connections persistent by default: unless the client sends a Connection: close header, an HTTP 1.1-compatible server like Puma should not close the connection immediately after sending the first response. (HTTP 1.0 clients can still opt in to persistence by explicitly sending Connection: keep-alive.)
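
A quick way to observe this behaviour is to send several requests over one raw TCP connection. This is a naive sketch that assumes a server is listening on localhost:9292 and skips proper response parsing:

require 'socket'

sock = TCPSocket.new('localhost', 9292)

2.times do
  sock.write "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"
  puts sock.readpartial(4096) # a real client would parse Content-Length
end

# Opting out of persistence explicitly:
sock.write "GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n"
puts sock.readpartial(4096)
sock.close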

We’ve seen that handle_request returns the keep_alive value unless the socket is closed by the client prematurely, in which case the caller receives false, regardless of the keep_alive setting. Let’s quickly review how the value of this variable is determined:

module Puma
  module Request
    # ...
    def prepare_response(status, headers, res_body, requests, client)
      # ...
      force_keep_alive = if @enable_keep_alives
        requests < @max_fast_inline ||
        @thread_pool.busy_threads < @max_threads ||
        !client.listener.to_io.wait_readable(0)
      else
        false
      end

      resp_info = str_headers(env, status, headers, res_body, io_buffer, force_keep_alive)
      # ...
      keep_alive     = resp_info[:keep_alive]

      # ...
      keep_alive
    end

    # ...
    def str_headers(env, status, headers, res_body, io_buffer, force_keep_alive)
      # ...
      if http_11
        # ...
        resp_info[:keep_alive] = env.fetch(HTTP_CONNECTION, "").downcase != CLOSE
        # ...
      else
        # ...
        resp_info[:keep_alive] = env.fetch(HTTP_CONNECTION, "").downcase == KEEP_ALIVE
        # ...
      end
      # ...
      resp_info[:keep_alive] &&= @queue_requests
      # ...
      resp_info[:keep_alive] &&= force_keep_alive
      # ...
    end
  end
end

First, the client header is consulted. If keep-alive is explicitly disabled by the client, then it won’t be used, no matter what.

If the client is fine with making the connection persistent, the decision then falls to the server.

First, if the @queue_requests Puma configuration option is disabled, keep-alive connections are not used at all. A reminder: we’ve encountered @queue_requests several times already, but we haven’t yet explored its functionality - this will be covered shortly.

Next, if the client allows persistence and request queueing is enabled, which is the default, the server checks whether persistent connections are enabled in general. By default, they are allowed, but it's possible to disable them altogether, which might be done for a good reason, as we'll see later.

Finally, the connection is kept if any of the following three conditions are satisfied:

  • The current call to process_client has processed fewer than max_fast_inline requests, which is a configuration option with a default value of 10. The number of calls is tracked using the requests variable, incremented each time handle_request is called.
  • If the thread pool is not saturated with work and there are idle threads. This check uses the same logic as ThreadPool#wait_until_not_full, which is invoked before a client is queued.
  • If there are no other connections on the listening socket that are ready to be accepted.

Note that if any of these conditions evaluates to true, the current connection will be persisted. Even if the current call to process_client has already handled max_fast_inline requests and there are no idle threads in the pool, the connection is still kept as long as no other connection is waiting to be accepted. Conversely, even if there is a waiting connection and the thread pool is saturated, subsequent requests from the same client may still occupy the worker thread until max_fast_inline requests have been processed.

Sounds good in theory, right? If there are no clients waiting to be served, why not continue serving additional requests immediately? And even if there are pending clients, why not optimise this particular connection if more requests are going to be sent over it? In practice, though, it’s not always that straightforward. Keep reading to learn what can go wrong.

Once handle_request returns, process_client makes a decision based on its return value:

module Puma
  class Server
    # ...
    def process_client(client)
      # ...
      begin
        # ...
        while true
          @requests_count += 1
          case handle_request(client, requests + 1)
          when false
            break
          when :async
            close_socket = false
            break
          when true
            # ...

            requests += 1

            fast_check = @status == :run

            fast_check = false if requests >= @max_fast_inline &&
              @thread_pool.backlog > 0

            next_request_ready = with_force_shutdown(client) do
              client.reset(fast_check)
            end

            unless next_request_ready
              break unless @queue_requests
              client.set_timeout @persistent_timeout
              if @reactor.add client
                close_socket = false
                break
              end
            end
          end
        end
        true
      end
    end
    # ...
  end
end

If the connection should not be kept alive, the loop simply breaks. Unlike the :async branch, where Puma does not close the socket, here the socket is closed. At this point, Puma has served its final request on this connection (or the connection was closed by the client prematurely), and the thread becomes free to serve other clients.

If the connection should be kept alive, Puma sets the fast_check variable to true if both of the following conditions are met:

  • The server is not in the shutdown process.
  • The number of requests processed by the current call to process_client is lower than max_fast_inline, or there are no other clients waiting in the pool’s backlog.

This variable is then passed as an argument to Client#reset:

module Puma
  class Client
    # ...
    def reset(fast_check=true)
      @parser.reset
      @io_buffer.reset
      @read_header = true
      @read_proxy = !!@expect_proxy_proto
      @env = @proto_env.dup
      @parsed_bytes = 0
      @ready = false
      @body_remain = 0
      @peerip = nil if @remote_addr_header
      @in_last_chunk = false
      @http_content_length_limit_exceeded = false

      if @buffer
        return false unless try_to_parse_proxy_protocol

        @parsed_bytes = @parser.execute(@env, @buffer, @parsed_bytes)

        if @parser.finished?
          return setup_body
        elsif @parsed_bytes >= MAX_HEADER
          raise HttpParserError,
            "HEADER is longer than allowed, aborting client early."
        end

        return false
      else
        begin
          if fast_check && @to_io.wait_readable(0.2)
            return try_to_finish
          end
        rescue IOError
        end
      end
    end
    # ...
  end
end

At this point, the parser is reset. This is actually an internal re-instantiation. IOBuffer is also reset - the previous response was written to it.

All instance variables that were mutated during the processing of the previous request are also reset to their initial values.

Next, the method branches: if the @buffer is present, this implies that some of the data for the subsequent request was already sent by the client during the processing of the previous request. This can happen if the client sends multiple requests over the same connection. It's possible because @buffer is filled directly from @io.read_nonblock, which operates on raw TCP data. This method does not account for persistent connections or HTTP itself, so no request splitting happens at this level. However, as you might recall, Puma specifies the exact number of bytes it wants to read from the socket, based on its Content-Length calculation. Since read_nonblock will not return more bytes than specified, reads stop at the end of the current request's body. So how can the buffer not be empty? The answer is simple: the very first read_nonblock, which primarily gets the headers, has a limit of 64 kilobytes. This makes it likely that the body is also read, along with any subsequent pipelined requests, in case the client sent them up-front.

In this case, the leftover contents of the buffer are read using the parser. If they contain the full header section, setup_body is called, which we’ve already covered.

The other case occurs when the buffer is empty. This can happen if the first request didn’t have a body, or if it wasn’t fully read by the first read_nonblock call. In such cases, the server needs to wait for any subsequent pipelined requests to arrive.

This is where fast_check comes in. If it's true, Puma waits up to 200 milliseconds for new data; if it arrives, the next request is processed immediately.

Now, let’s go back to Server#process_client to piece everything together:

module Puma
  class Server
    # ...
    def process_client(client)
      # ...
      begin
        # ...
        while true
          @requests_count += 1
          case handle_request(client, requests + 1)
          when false
            break
          when :async
            # ...
          when true
            # ...
            next_request_ready = with_force_shutdown(client) do
              client.reset(fast_check)
            end

            unless next_request_ready
              break unless @queue_requests
              client.set_timeout @persistent_timeout
              if @reactor.add client
                close_socket = false
                break
              end
            end
          end
        end
        true
      end
    end
    # ...
  end
end

When Client#reset returns true, the loop continues, as the subsequent request has been successfully processed.

It's important to note that fast_check won't have any effect if the client sends full requests quickly enough for a single read_nonblock to read them in full. Such rapid clients will eventually hit the circuit breaker implemented by force_keep_alive in Server#prepare_response - which gets called by handle_request during the next iteration - but the thread is only freed from the greedy client once none of the conditions that define force_keep_alive are met.

If Client#reset returns false, it means that either:

  • The data for the next request could not be read immediately using a single read_nonblock, or
  • fast_check was set to false.

In this case, the client is added to the yet-to-be-explored @reactor, which is enabled when the queue_requests configuration option is on (which it is by default). The loop breaks, and the worker thread can begin serving another connection.

Let’s summarise. When a worker thread from ThreadPool handles a client connection, it may:

  • Serve a single request and be done with this client.
  • Keep the connection open and delegate its management (and subsequent request handling) to the application.
  • Handle multiple requests sequentially sent via the keep-alive feature over a single connection.

You may have already guessed the potential issue here: a client using keep-alive connections could completely take over a thread by sending requests that are small enough and fast enough. Although this might not be reliably achievable (since network quality plays a significant role, and the client would have to get lucky with the force_keep_alive conditionals all returning false), it is a theoretical possibility.

It turns out that this is not just a theoretical issue, but a real one. As reported, while it may seem unlikely that the listening socket wouldn't have other connections queued up or that the thread pool wouldn't be saturated with work, relying on a point-in-time check of these conditions is futile. A new connection may be made right after the listener socket is checked for readability. If all threads are occupied by clients at that point, the new connection is out of luck, since every worker thread will be processing at least 10 requests from another client sequentially. Imagine if each of those 10 requests takes 100 milliseconds. Simply reducing the max_fast_inline configuration value is not enough to fully address this issue - the linked article explains why in greater detail. The ability to completely disable keep-alive connections became available in Puma as a result of this investigation.

While Puma's keep-alive behaviour may decrease latencies for individual clients, this improvement may come at the cost of disproportionately degrading latencies for others.

It’s also important to note that the root cause of this behaviour lies in Puma’s choice of implementation trade-offs, rather than the use of HTTP persistent connections themselves. As of Puma 6.5.0, an effort is underway to address this issue by delegating subsequent requests to the reactor, instead of processing them immediately, which would allow a chance to serve other clients.

Dealing With Slow Clients

In the previous sections, we fully examined how connections are handled, from reading requests to writing responses. Throughout our exploration, we kept encountering the nebulous ‘reactor’ and the associated queue_requests configuration option.

Before we wrap up this final part of the connection-handling algorithm, let’s identify the potentially fatal flaw in Puma’s behaviour, based on everything we’ve covered so far.

This flaw manifests in the same symptom as the previously investigated edge-case with keep-alive connections: degraded latencies for certain clients.

Let’s go back to the start of Server#process_client:

module Puma
  class Server
    # ...
    def process_client(client)
      # ...
      begin
        if @queue_requests &&
          !client.eagerly_finish

          client.set_timeout(@first_data_timeout)
          if @reactor.add client
            close_socket = false
            return false
          end
        end

        with_force_shutdown(client) do
          client.finish(@first_data_timeout)
        end

        # ...
      end

      # ...
    end
    # ...
  end
end

When we previously looked at this, we decided to set aside the @queue_requests conditional and assumed it was false. In this case, client.finish gets called. This method blocks until a single HTTP request is fully read from the socket.

The blocking nature of this method can be Puma's undoing. It can introduce negative effects, ranging from increased latency for some clients and lower throughput to complete denial of service.

A client may be deliberately or unintentionally slow in sending its request. The first_data_timeout configuration option aborts the connection if client.finish does not detect new data on the socket within 30 seconds by default. However, a malevolent actor could still abuse the server by sending a single byte at regular intervals, completely bypassing this timeout.

As we’ve seen, if subsequent requests from a persistent connection can’t be processed inline, they get delegated to the reactor. If the reactor is disabled, the connection is dropped, even though it could remain open while waiting for the client, which is inefficient.

Although the issue of slow clients can be mitigated by placing a reverse proxy (like NGINX) in front of Puma, it is still better for Puma to handle read buffering on its own as a fallback. This is why queue_requests is enabled by default, and why the option to disable the reactor may eventually be removed entirely in future versions. However, this doesn't mean Puma should always be exposed directly to the internet (although it can be)4. A reverse proxy still provides many benefits, such as handling SSL termination, serving static assets without impacting the throughput of dynamic traffic, and enabling more advanced DDoS mitigation.

The Reactor is a relatively simple abstraction that helps free up worker threads by taking over the responsibility of buffering requests. Let’s quickly remind ourselves of its API:

module Puma
  class Server
    # ...
    def run(background=true, thread_name: 'srv')
      # ...

      if @queue_requests
        @reactor = Reactor.new(@io_selector_backend) { |c| reactor_wakeup c }
        @reactor.run
      end

      # ...
      handle_servers
    end

    # ...

    def handle_servers
      # ...

      if queue_requests
        @queue_requests = false
        @reactor.shutdown
      end

      # ...
    end

    # ...

    def process_client(client)
      # ...

      begin
        if @queue_requests &&
          !client.eagerly_finish

          client.set_timeout(@first_data_timeout)
          if @reactor.add client
            close_socket = false
            return false
          end
        end

        while true
          @requests_count += 1
          case handle_request(client, requests + 1)
          # ...
          when true
            # ...

            unless next_request_ready
              break unless @queue_requests
              client.set_timeout @persistent_timeout
              if @reactor.add client
                close_socket = false
                break
              end
            end
          end
        end
      end

      # ...
    end

    # ...
  end
end

When instantiated, the Reactor instance takes a block argument that calls Server#reactor_wakeup.

Clients are then added to the reactor using Reactor#add. This occurs in the following two cases:

  • When the request cannot be fully read off the socket using eagerly_finish, which calls Client#try_to_finish only once if data is available;
  • When requests from a persistent connection can no longer be processed inline by Server#process_client.

Here’s the initializer and add method for the reactor:

module Puma
  class Reactor
    def initialize(backend, &block)
      require 'nio'
      valid_backends = [:auto, *::NIO::Selector.backends]
      unless valid_backends.include?(backend)
        raise ArgumentError.new("unsupported IO selector backend: #{backend} (available backends: #{valid_backends.join(', ')})")
      end

      @selector = ::NIO::Selector.new(NIO::Selector.backends.delete(backend))
      @input = Queue.new
      @timeouts = []
      @block = block
    end

    def run(background=true)
      if background
        @thread = Thread.new do
          Puma.set_thread_name "reactor"
          select_loop
        end
      else
        select_loop
      end
    end

    def add(client)
      @input << client
      @selector.wakeup
      true
    rescue ClosedQueueError, IOError
      false
    end

    # ...
  end
end

The line require 'nio' stands out. Before we delve into what NIO is, let’s attempt to understand how the Reactor manages to buffer requests.

The run method starts a thread that calls select_loop, which, presumably, does something in a loop.

We’re already familiar with IO::select and the underlying eponymous system call - this is the technique the server thread uses to monitor listening sockets. It’s reasonable to assume that the same approach is used by the reactor. Although possible, this isn’t the most efficient solution, due to the limitations of the select system call.

The most important limitation is the hard FD_SETSIZE cap of 1024: file descriptors with values above 1023 cannot be monitored by a single call. In practice, this would mean Puma could handle no more than roughly 1024 waiting client connections, which imposes an inherent scalability limit.

Another disadvantage of select is that the caller must iterate over the entire set of file descriptors every time the call returns - there is no way to find out which specific descriptors are ready without scanning them all.

If that is not convincing enough, the select manpage explicitly states that “all modern applications should instead use poll(2) or epoll(7)”5.

So, what are the alternatives? The manpage quote highlighted poll as an option, which can be thought of as a better version of select.

There are, however, true alternatives that employ completely different models of operation - event-driven and asynchronous IO. Examples of these approaches include epoll and the relatively new io_uring facility. The behaviour of io_uring differs significantly from basic select: it works by memory-mapping two ring buffers shared between user space and the kernel, achieving considerable performance improvements by drastically reducing the number of system calls required. If you're interested, here's a comprehensive document written by io_uring's author, detailing the IO facility and the motivations behind it.

While these underlying implementation details are intriguing, they are not crucial for Puma. An efficient implementation does improve the reactor's performance and scalability, but Puma is a cross-platform server, and facilities like epoll and io_uring are specific to certain operating systems and kernel versions. Puma may still function with the basic select if the load isn't high enough.

This is why Puma opts to use nio4r - a set of Ruby bindings for libev, which provides cross-platform asynchronous IO abstraction. This gem supports almost all possible facilities available on the host platform, automatically integrating them into its API, which Puma uses. We won’t dive into nio4r’s internals here.
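
To get a feel for the API Puma builds on, here is a minimal nio4r sketch (assuming the nio4r gem is installed) mirroring the reactor's pattern of registering IOs and blocking in select; the pipe and callback are illustrative:

require 'nio'

selector = NIO::Selector.new
reader, writer = IO.pipe

# Register interest in read-readiness and stash a callback, much like
# the reactor stores the client object in the monitor's value.
monitor = selector.register(reader, :r)
monitor.value = proc { puts reader.read_nonblock(64) }

writer.write "hello"

# Blocks for up to 1 second; a call to selector.wakeup from another
# thread would also unblock it.
ready = selector.select(1)
ready&.each { |m| m.value.call }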

Puma allows specifying the backend when multiple options are available, and the selector is instantiated accordingly. The internal selector loop then starts running.

An @input Queue is created, and its thread-safety is crucial because add will be called concurrently by multiple threads.

Whenever a client is added to the reactor, it is placed in the input queue, and wakeup is called on the selector. We’ll explore exactly what this method does next.

module Puma
  class Reactor
    # ...
    def select_loop
      close_selector = true
      begin
        until @input.closed? && @input.empty?
          timeout = (earliest = @timeouts.first) && earliest.timeout
          @selector.select(timeout) {|mon| wakeup!(mon.value)}

          timed_out = @timeouts.take_while {|t| t.timeout == 0}
          timed_out.each { |c| wakeup! c }

          unless @input.empty?
            until @input.empty?
              client = @input.pop
              register(client) if client.io_ok?
            end
            @timeouts.sort_by!(&:timeout_at)
          end
        end
      rescue StandardError => e
        STDERR.puts "Error in reactor loop escaped: #{e.message} (#{e.class})"
        STDERR.puts e.backtrace

        if NoMethodError === e
          close_selector = false
        else
          retry
        end
      end

      @timeouts.each(&@block)
      @selector.close if close_selector
    end
    # ...
  end
end

The loop continues until the input queue is fully drained and closed.

First, the @timeouts array is consulted. This array will likely be empty during the first iteration, so let’s assume that timeout is nil for now.

select is then called on the NIO selector. To clarify, this is not Ruby’s IO::select, though it may use the select system call if no other backend is available.

If no timeout value is passed, this call will block indefinitely until one of the monitored file descriptors becomes ready or wakeup is called on the selector. As we’ve just seen, add uses this method immediately after placing a client in the queue - this prevents the selector from sleeping indefinitely while clients are waiting to be processed.

For now, let’s skip over the block argument passed to select and move forward.

Once select returns, the @timeouts array is traversed again, and Reactor#wakeup is called for clients whose timeout has expired.

Next, the entire @input queue is drained, and register is called for each client:

module Puma
  class Reactor
    # ...
    def register(client)
      @selector.register(client.to_io, :r).value = client
      @timeouts << client
    rescue ArgumentError
    end
    # ...
  end
end

At this stage, the client is added to the selector using its register method. This method takes the IO object as the first argument and :r (indicating interest in read readiness) as the second. The selector returns a monitor, and the value of this monitor is set to the client object.

Back in the select_loop, the @timeouts array is sorted in-place by client timeout timestamps. This small optimisation allows the polling code to simply call @timeouts.first to retrieve the earliest timeout value, instead of repeatedly sorting a potentially large array on each read.

When the reactor stops - typically during shutdown - the @block provided to the initialiser is invoked for all clients currently being processed. This either attempts to serve the clients one last time or times them out gracefully.

Returning to @selector.select - once a monitored client has data ready to read, wakeup! is called with the client as an argument. The client is extracted from the monitor instance’s value.

# lib/puma/reactor.rb

module Puma
  class Reactor
    def initialize(backend, &block)
      # ...
      @block = block
    end

    # ...

    def wakeup!(client)
      if @block.call client
        @selector.deregister client.to_io
        @timeouts.delete client
      end
    end
    # ...
  end
end

# lib/puma/server.rb

module Puma
  class Server
    # ...
    def run(background=true, thread_name: 'srv')
      # ...
      @reactor = Reactor.new(@io_selector_backend) { |c| reactor_wakeup c }
      # ...
    end

    # ...

    def reactor_wakeup(client)
      shutdown = !@queue_requests
      if client.try_to_finish || (shutdown && !client.can_close?)
        @thread_pool << client
      elsif shutdown || client.timeout == 0
        client.timeout!
      else
        client.set_timeout(@first_data_timeout)
        false
      end
    rescue StandardError => e
      client_error(e, client)
      client.close
      true
    end
  end
end

The Reactor#wakeup! method calls the block, a Proc object that wraps Server#reactor_wakeup. If the server method returns true, the client is deregistered from the selector and removed from the timeouts array, effectively releasing the client from the reactor.

The reactor_wakeup callback functions as follows:

First, it attempts to fully read the current client’s request, regardless of circumstances. The request will only be buffered successfully if the entire payload is already available on the socket and ready to be read. Client#try_to_finish does not wait - if the request is read, the client is immediately pushed to the thread pool for processing according to the logic we’ve previously examined.

Even if the request isn’t fully buffered, the client is still pushed to the thread pool if the server is shutting down and the request isn’t actively being parsed. This improves client experience during shutdowns and restarts. We’ll explore the nuances of this process in a later section.

If the request isn’t being parsed, a 408 Request Timeout response is sent if the server is shutting down, or if the client has spent longer than @first_data_timeout between reactor wake-ups. Although it may seem like clients could time out even if they are slowly sending data, this is not the case.

If the request has not timed out but couldn’t be fully read, the timeout resets to the first_data_timeout value (30 seconds by default). This mechanism ensures that each successful write by the client debounces the timeout, preventing a strict 30-second window to complete the entire request. The method returns false, signalling the reactor to continue managing the connection.

If an exception occurs, the error is logged, and the client is closed6.

This method marks the exit point from the reactor. A client may either proceed to the thread pool or remain in the selector until it times out, the server shuts down, or data continues arriving from the client. It’s important to note that each reactor_wakeup call happens sequentially - with only one reactor thread, execution of callbacks is inherently serialised.

This also answers a question raised during the examination of the ThreadPool: can the todo array contain more items than there are threads? Yes, it can - thanks to the reactor. While managed by the reactor, clients are hidden from the thread pool. Once fully buffered, however, ThreadPool#<< inserts them directly into the pool's todo list without the wait_until_not_full size check. The backlog created this way in turn keeps wait_until_not_full blocking, preventing the Server from accepting new connections until the queue is drained.

In summary, the reactor enhances Puma’s capacity management by handling slow clients. The thread pool is no longer blocked on waiting for socket data, allowing it to perform useful tasks, such as executing application logic.

Once the thread pool picks up a connection, if the request data isn’t immediately available, the connection is pushed back to the reactor. The reactor leverages an appropriate IO backend to efficiently monitor registered sockets and invoke the server’s callback when data is ready to be read.

The reactor also allows keep-alive connections to remain open without tying up a dedicated thread. However, as we’ve observed, a flaw in Puma 6.5.0 can lead to the opposite effect.

With this, we've completed the exploration of reading and writing data from connections.

Forking

So far, we’ve only examined the Single Puma runner. This mode utilises a single main process that runs a single thread pool directly. Its concurrency model is thread-per-request, which can deliver significant throughput for IO-bound applications.

Puma also offers another mode of operation called cluster mode. In this mode, the main process forks multiple child processes, known as workers, to handle incoming connections in parallel - a model sometimes referred to as preforking.

Workers in clustered Puma are almost identical copies, each implementing the same request processing logic as described earlier. Each worker gets its own thread pool, server, and reactor.

As previously discussed, increasing the thread pool size eventually leads to diminishing - and potentially negative - returns. Unless the application frequently releases the Global VM Lock (GVL), for example, by making numerous external network calls, a minimal pool size of 3 to 5 threads often results in the best balance between throughput and latency. For CPU-bound applications, increasing thread count can degrade performance by introducing more context switching and GVL contention. It’s important to note that these are general recommendations - concurrency settings should always be tuned based on monitoring and measuring real-world impact on the application. Puma’s default settings are designed for typical web servers without severe IO bottlenecks, such as N+1 query issues or excessive timeout limits for downstream services.

Running Puma in clustered mode is the only way to achieve true parallel request handling on MRI (Ruby’s C implementation). Each forked process maintains its own GVL, allowing separate workers to run in parallel - unlike threads, which are only concurrent. A common guideline is to spawn as many workers as there are CPU cores on the host machine.

Another benefit of multiple workers is resource isolation. Puma does not enforce timeouts that span the call to the underlying application. For example, if you invoke sleep inside a Rails controller action, the worker thread handling that request will block indefinitely until the server restarts. Puma does not provide a mechanism to forcibly terminate threads stuck in application logic, and this is intentional. Forcefully interrupting threads, such as via Thread#raise or Thread#kill, is unsafe and unreliable. Exceptions raised this way can surface at unpredictable points - even within ensure blocks, potentially corrupting shared state (e.g., database connection pools or similar global structures). In such cases, the only recovery path is often a full restart.
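
A small illustration of why forced interruption is dangerous. This is a contrived sketch (timings are approximate, so the leak is probabilistic rather than guaranteed):

pool = Queue.new
pool << :connection

t = Thread.new do
  conn = pool.pop
  begin
    sleep # simulates a request stuck in application code
  ensure
    sleep 0.1    # cleanup takes a moment...
    pool << conn # ...and never happens if a second exception lands here
  end
end

sleep 0.2
t.raise "timeout" # interrupts the sleep; the ensure block starts running
sleep 0.05
t.raise "timeout" # may abort the ensure block itself
t.join rescue nil

puts pool.size # can print 0 - the connection has leaked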

Processes, however, do not suffer from this issue. If a process is terminated, its state is completely discarded, preventing contamination of shared state in other workers7. Any file descriptors the process still held open are closed by the operating system upon termination.

The ability to enforce global timeouts across application logic is a key advantage of preforking servers that use process-per-request models, such as pitchfork or unicorn. Unlike Puma, these servers can terminate a process without risking corruption of shared memory. Although Puma could theoretically terminate an entire worker when one of its threads exceeds a timeout, doing so would cancel all other in-flight requests handled by that worker, presenting a tradeoff.

Apart from timeouts, process-per-request models offer additional benefits, many of which are outlined in the pitchfork documentation.

Now, let’s return to the Launcher and revisit how a runner is configured:

# lib/puma/launcher.rb

module Puma
  class Launcher
    # ...
    def initialize(conf, launcher_args={})
      # ...
      if clustered?
        @options[:logger] = @log_writer

        @runner = Cluster.new(self)
      else
        @runner = Single.new(self)
      end
      # ...
    end

    # ...

    def run
      # ...
      @runner.run
      # ...
    end

    # ...
  end
end

# lib/puma/cluster.rb

module Puma
  class Cluster < Runner
    # ...
    def initialize(launcher)
      super(launcher)

      @phase = 0
      @workers = []
      @next_check = Time.now

      @phased_restart = false
    end

    # ...

    def run
      @status = :run

      output_header "cluster"

      log "*      Workers: #{@options[:workers]}"

      if @options[:preload_app]
        fork_safe = ->(t) { !t.is_a?(Process::Waiter) && t.thread_variable_get(:fork_safe) }

        before = Thread.list.reject(&fork_safe)

        log "*     Restarts: (\u2714) hot (\u2716) phased"
        log "* Preloading application"
        load_and_bind

        after = Thread.list.reject(&fork_safe)

        if after.size > before.size
          threads = (after - before)
          if threads.first.respond_to? :backtrace
            log "! WARNING: Detected #{after.size-before.size} Thread(s) started in app boot:"
            threads.each do |t|
              log "! #{t.inspect} - #{t.backtrace ? t.backtrace.first : ''}"
            end
          else
            log "! WARNING: Detected #{after.size-before.size} Thread(s) started in app boot"
          end
        end
      else
        log "*     Restarts: (\u2714) hot (\u2714) phased"

        unless @config.app_configured?
          error "No application configured, nothing to run"
          exit 1
        end

        @launcher.binder.parse @options[:binds]
      end

      # ...
    end
    # ...
  end
end

First, take note of the instance variables set during the initialiser. @workers starts off empty, while @phase is set to 0.

When run is called, the runner’s @status is updated to :run, which differs from the behaviour in the Single runner. This variable tracks the current state of the cluster.

The runner then checks if the preload_app! configuration option is enabled.

Forking processes is more resource-intensive than spawning threads because forked children receive a full copy of the parent’s memory. To optimise this, operating systems may offer copy-on-write (COW) functionality. With COW, forked children don’t immediately duplicate the parent’s memory. Instead, memory is copied only when the child modifies it, allowing shared, read-only memory to remain untouched.

Puma’s preload_app! feature leverages this mechanism by loading the application in the main process before forking workers. Enabling this option is almost always beneficial. However, there are scenarios where disabling it is necessary - for example, when Puma isn’t running behind a load balancer. The reason for this will be explained in the next section.
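
The effect can be sketched with plain fork. This is a toy example (in reality, Ruby's GC touches object headers, so some pages still get copied even on reads):

# Allocated in the parent before forking - a candidate for COW sharing.
big = Array.new(1_000_000) { |i| i }

pid = fork do
  # Reading mostly uses the parent's physical pages; the kernel copies
  # a page only when the child writes to it.
  puts big[500_000]
end

Process.wait(pid)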

If preloading is enabled, Puma iterates over all running threads and collects those without the fork_safe thread variable set to true. This variable can be manually assigned to any thread by calling thread.thread_variable_set(:fork_safe, true). This convention, while not extensively documented, is used by Puma and other servers to identify threads that may not be fork-safe. Background tasks, often spawned by gems, may run threads throughout the application's lifecycle. For instance, Datadog explicitly marks its workers as fork-safe to comply with this practice.
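
Marking a thread as fork-safe follows the same convention. A sketch of what a gem's background worker might do:

worker = Thread.new do
  loop { sleep 60 } # some periodic background task
end

# Tells preloading servers like Puma not to warn about this thread.
worker.thread_variable_set(:fork_safe, true)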

While thread-safety is a well-understood concept, fork-safety is less commonly discussed. Forked children inherit the parent’s memory, which can lead to leaks or deadlocks if not managed properly. Consider a Worker class that launches a thread to process an internal buffer. If this worker is started before the fork, the child process may inherit a buffer with partially processed work. The main process likely already handled that work, but after the fork, when the worker thread restarts (since only the main thread survives the fork), it could process the same buffer again - resulting in duplicate processing.

This scenario is precisely what dogstatsd mitigates. The consequences can be more severe than just duplicate processing - if the thread inherits a locked mutex instead of a buffer, the child process may deadlock.

To prevent such issues, Puma checks Thread.list before and after the application is loaded. If any unmarked threads are detected, Puma logs a warning.

Next, load_and_bind loads the Rack application and calls binder.parse to configure the listener sockets. Binding happens regardless of whether preloading is enabled - in the non-preloading branch, binder.parse is called directly - but the application itself is only loaded up-front when preloading is on.

It’s important to note that if Puma is started implicitly through a framework command (e.g., rails s), the application loads upfront - even if preloading is disabled. This happens earlier in the boot process, before Puma’s Launcher is invoked, as discussed in previous sections.

Managing Signals

Following the potential preload, Puma prepares to start workers by setting up several pipes and registering signal handlers.

module Puma
  class Cluster < Runner
    # ...
    def run
      # ...

      read, @wakeup = Puma::Util.pipe

      setup_signals

      @check_pipe, @suicide_pipe = Puma::Util.pipe

      @fork_pipe, @fork_writer = Puma::Util.pipe

      log "Use Ctrl-C to stop"

      single_worker_warning

      redirect_io

      Plugins.fire_background

      @launcher.write_state

      start_control

      @master_read, @worker_write = read, @wakeup

      @options[:worker_write] = @worker_write

      @config.run_hooks(:before_fork, nil, @log_writer)

      # ...
    end
    # ...
  end
end

The read, @wakeup pipe is first established - read represents the reader end, while @wakeup is the writer end. This pipe allows workers to communicate with the main process.

Next, the @check_pipe, @suicide_pipe pipe is created. This is a clever mechanism that enables workers to detect if the main process has exited. When the process that owns the writer end of a pipe terminates, readers of the other end observe EOF. It's important to remember that when a parent process forks children, those children do not automatically terminate if the parent dies. Instead, they continue running as orphans, eventually adopted by the init process (at least on Linux).
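
The trick can be reproduced in a few lines. A standalone sketch:

check_pipe, suicide_pipe = IO.pipe

fork do
  suicide_pipe.close # the child keeps only the reader end
  check_pipe.read    # blocks until EOF - that is, until the parent exits
  puts "parent exited, shutting down"
end

sleep 1
exit! # the kernel closes suicide_pipe, delivering EOF to the child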

Lastly, the @fork_pipe, @fork_writer pipe is set up. This pipe is part of a feature that hasn’t been covered yet, so we’ll revisit it later.

The next step is Cluster#setup_signals, which appears as follows:

module Puma
  class Cluster < Runner
    # ...
    def setup_signals
      if @options[:fork_worker]
        Signal.trap "SIGURG" do
          fork_worker!
        end

        if (fork_requests = @options[:fork_worker].to_i) > 0
          @events.register(:ping!) do |w|
            fork_worker! if w.index == 0 &&
              w.phase == 0 &&
              w.last_status[:requests_count] >= fork_requests
          end
        end
      end

      Signal.trap "SIGCHLD" do
        wakeup!
      end

      Signal.trap "TTIN" do
        @options[:workers] += 1
        wakeup!
      end

      Signal.trap "TTOU" do
        @options[:workers] -= 1 if @options[:workers] >= 2
        wakeup!
      end

      master_pid = Process.pid

      Signal.trap "SIGTERM" do
        if Process.pid != master_pid
          log "Early termination of worker"
          exit! 0
        else
          @launcher.close_binder_listeners

          stop_workers
          stop
          @events.fire_on_stopped!
          raise(SignalException, "SIGTERM") if @options[:raise_exception_on_sigterm]
          exit 0
        end
      end
    end
    # ...
  end
end

As we analyse the signal traps, it’s crucial to remember that these traps are registered in the main process. This is why Process.pid is considered the master process ID. However, the signal traps are inherited by child processes. This is evident from the SIGTERM handler, which checks the current Process.pid - if a child process receives this signal, it will evaluate to the child’s ID.

The TTIN and TTOU signals allow the dynamic adjustment of worker counts during runtime.
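
For example, assuming the master's PID is 12345 (a hypothetical value), an operator or another Ruby process could resize the cluster like this:

Process.kill("TTIN", 12345) # ask the master to spawn one more worker
Process.kill("TTOU", 12345) # ask the master to retire one worker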

SIGCHLD is delivered to the parent when a child process terminates. The handler for SIGCHLD, as well as for TTIN and TTOU, invokes the wakeup! method, which simply writes a WAKEUP request to the @wakeup pipe:

module Puma
  class Runner
    # ...
    def wakeup!
      return unless @wakeup

      @wakeup.write Puma::Const::PipeRequest::WAKEUP unless @wakeup.closed?
      # ...
    end
    # ...
  end
end

The SIGURG signal handler is only registered if the fork_worker configuration option is enabled. This ties back to the @fork_pipe, @fork_writer pipe we mentioned earlier. When enabled, a handler for the :ping! event is also registered on the Puma::Events pub-sub sink. This marks the first instance of Puma using its own callback system internally. Upon receiving ping!, fork_worker! is triggered if specific conditions are met. We’ll revisit this in detail later.

SIGTERM is the graceful shutdown signal8. When the main process receives SIGTERM, it executes the following steps:

  • Calls close_binder_listener on the launcher, which in turn calls Binder#close_listeners.
  • Stops the cluster by invoking stop_workers and stop.
  • Raises a SignalException with the SIGTERM argument by default.

In environments like Kubernetes, which expects a 143 exit code (indicating SIGTERM was handled properly), raising this exception achieves the desired result. If raise_exception_on_sigterm is disabled, the process exits with exit 0, producing an exit code of 0 (indicating success). The value set for this configuration option should depend on the infrastructure in which Puma operates.

Next, let’s examine how stop and stop_workers function:

module Puma
  class Cluster < Runner
    # ...

    def stop
      @status = :stop
      wakeup!
    end

    # ...

    def stop_workers
      log "- Gracefully shutting down workers..."
      @workers.each { |x| x.term }

      begin
        loop do
          wait_workers
          break if @workers.reject {|w| w.pid.nil?}.empty?
          sleep 0.2
        end
      rescue Interrupt
        log "! Cancelled waiting for workers"
      end
    end

    # ...
  end
end

The stop method is straightforward.

In contrast, stop_workers iterates over the @workers array, calling term on each worker. Afterward, it enters a loop that persists until all workers have completely stopped. We’ll delve into the internals of wait_workers shortly.

Worker Management

Returning to Cluster#run:

module Puma
  class Cluster < Runner
    # ...
  
    def run
      # ...
  
      spawn_workers
    
      Signal.trap "SIGINT" do
        stop
      end
  
      # ...
    end

    # ...
  
    def spawn_workers
      diff = @options[:workers] - @workers.size
      return if diff < 1
  
      master = Process.pid
      if @options[:fork_worker]
        @fork_writer << "-1\n"
      end
  
      diff.times do
        idx = next_worker_index
  
        if @options[:fork_worker] && idx != 0
          @fork_writer << "#{idx}\n"
          pid = nil
        else
          pid = spawn_worker(idx, master)
        end
  
        debug "Spawned worker: #{pid}"
        @workers << WorkerHandle.new(idx, pid, @phase, @options)
      end
  
      if @options[:fork_worker] && all_workers_in_phase?
        @fork_writer << "0\n"
      end
    end

    # ...

    def next_worker_index
      occupied_positions = @workers.map(&:index)
      idx = 0
      idx += 1 until !occupied_positions.include?(idx)
      idx
    end

    # ...
  end
end

Following the setup of signal handlers, the clustered runner initiates the worker spawning process.

A key characteristic of spawn_workers is its idempotency. This ensures that repeatedly calling spawn_workers does not fork additional workers unnecessarily. Idempotency is achieved by factoring in the current count of spawned workers.

From this point, the method branches into two paths:

  • If fork_worker is disabled (which it is by default), the main process forks all workers directly during each spawn_workers invocation.
  • If fork_worker is enabled, the main process forks only the first worker (ID 0). Worker 0 then forks the remaining workers, whose IDs were written to the @fork_writer pipe by the main process.

Despite its name, disabling fork_worker does not prevent worker processes from being forked - workers are always forked; the difference is in who forks them.

Enabling this option shifts the responsibility of forking subsequent workers from the main process to worker 0. This differs from the default setup, where the main process forks all workers.

The primary reason for enabling fork_worker lies in further optimising copy-on-write (COW) behaviour.

If the underlying application delays initialising certain shared states until after the main process forks the workers, those late-constructed resources will not benefit from COW. This results in unnecessary memory duplication across worker processes. By delegating the forking process to worker 0, late-initialised constructs remain shared, improving memory efficiency by maximising COW.

Here’s where the @events.register(:ping!) callback (previously mentioned) becomes relevant.

When a :ping! event is triggered, the forking of additional workers (by worker 0) is initiated:

# ...
if (fork_requests = @options[:fork_worker].to_i) > 0
  @events.register(:ping!) do |w|
    fork_worker! if w.index == 0 &&
      w.phase == 0 &&
      w.last_status[:requests_count] >= fork_requests
  end
end
# ...

It's safe to assume that the ping! event is sent regularly by workers. When the ping comes from worker 0, fork_worker! is called if that worker has processed more than the specified number of requests, which is 1000 by default. This will happen only once per server lifecycle, as each refork increments the current phase (we'll look into phased_restart later).

module Puma
  class Cluster < Runner
    # ...

    def fork_worker!
      if (worker = worker_at 0)
        worker.phase += 1
      end
      phased_restart(true)
    end

    # ...

    def worker_at(idx)
      @workers.find { |w| w.index == idx }
    end

    # ...
  end
end

If an application defers loading or instantiating some of its global components, memory will not be allocated for them immediately during startup. However, they will eventually be loaded at some point during the server's lifetime.

This eventuality is heuristically defined by the number of requests a particular worker has processed. Once worker 0 is sufficiently warmed up, forking other workers from its image will most likely result in a smaller overall memory footprint compared to immediate forking, as memory used for allocating deferred components will not have to be copied or initialised. The child process will instead read it from the parent process’s memory according to COW semantics. More about the intricacies and limitations of this feature is described here.

Now, let’s return to spawn_workers. Each returned pid is wrapped in a WorkerHandle, even if the main process was not responsible for forking it.

# lib/puma/cluster.rb

module Puma
  class Cluster < Runner
    # ...
    def spawn_workers
      # ...
      diff.times do
        idx = next_worker_index
      
        if @options[:fork_worker] && idx != 0
          @fork_writer << "#{idx}\n"
          pid = nil
        else
          pid = spawn_worker(idx, master)
        end
      
        debug "Spawned worker: #{pid}"
        @workers << WorkerHandle.new(idx, pid, @phase, @options)
      end

      # ...
    end
    # ...
  end
end

# lib/puma/cluster/worker_handle.rb

module Puma
  class Cluster < Runner
    class WorkerHandle
      def initialize(idx, pid, phase, options)
        @index = idx
        @pid = pid
        @phase = phase
        @stage = :started
        @signal = "TERM"
        @options = options
        @first_term_sent = nil
        @started_at = Time.now
        @last_checkin = Time.now
        @last_status = {}
        @term = false
      end

      attr_reader :index, :pid, :phase, :signal, :last_checkin, :last_status, :started_at

      # ...
    end
  end
end

WorkerHandle is a small class that encapsulates a running process by exposing methods that provide control over it. Instances of this class are what the @workers array contains. Since we’ve already seen the term method being called on workers, let’s make a quick detour and see how a handle provides control over the underlying process.

module Puma
  class Cluster < Runner
    class WorkerHandle
      # ...

      def term
        begin
          if @first_term_sent && (Time.now - @first_term_sent) > @options[:worker_shutdown_timeout]
            @signal = "KILL"
          else
            @term ||= true
            @first_term_sent ||= Time.now
          end
          Process.kill @signal, @pid if @pid
        rescue Errno::ESRCH
        end
      end

      # ...
    end
  end
end

As can be seen, the purpose of this method is twofold: to gracefully terminate the process with SIGTERM initially, or to unceremoniously SIGKILL it after worker_shutdown_timeout seconds in case it did not react to SIGTERM in time.

Despite its name, Process::kill is a generic facility for sending signals to a process identified by the specified PID. What actually happens when this method is called depends entirely on the signal argument.
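
For instance, signal 0 sends nothing at all and merely checks whether a process exists - which is exactly what the Errno::ESRCH rescue above guards against. A sketch with a hypothetical PID:

begin
  Process.kill(0, 12345) # no signal is delivered; only an existence check
  puts "process is alive"
rescue Errno::ESRCH
  puts "process is gone"
end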

Before we move on from spawn_workers and proceed to spawn_worker, which is called during the iteration, let’s pay attention to the final conditional.

module Puma
  class Cluster < Runner
    # ...
    def spawn_workers
      # ...

      if @options[:fork_worker] && all_workers_in_phase?
        @fork_writer << "0\n"
      end
    end

    # ...

    def all_workers_in_phase?
      @workers.all? { |w| w.phase == @phase }
    end

    # ...
  end
end

This gets evaluated outside of the diff iteration, which is a key detail. If fork_worker is enabled and all workers are in the same phase, worker 0 is additionally written to the pipe, which makes it possible for servers running with this option to restart. We will look into phases and restarts in more depth later.

Now, let’s see what happens in spawn_worker, which may be called by the main process:

module Puma
  class Cluster < Runner
    # ...

    def spawn_worker(idx, master)
      @config.run_hooks(:before_worker_fork, idx, @log_writer)

      pid = fork { worker(idx, master) }
      if !pid
        log "! Complete inability to spawn new workers detected"
        log "! Seppuku is the only choice."
        exit! 1
      end

      @config.run_hooks(:after_worker_fork, idx, @log_writer)
      pid
    end

    # ...
  end
end

We have reached the point where Kernel::fork, Ruby’s API for the fork system call, gets called. The child process begins executing the block with a call to the worker method immediately, while the parent continues, runs the appropriate hooks, and returns the process ID of the forked child.
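
A minimal sketch of these semantics:

pid = fork do
  # Only the child executes this block.
  puts "child pid: #{Process.pid}"
end

# Only the parent reaches this point, holding the child's PID.
puts "parent #{Process.pid} forked #{pid}"
Process.wait(pid)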

Let’s leave child processes to themselves for now and come back to the parent’s duties.

Back in Cluster#run, after spawn_workers returns, at which point child processes will be up and running, the following happens:

module Puma
  class Cluster < Runner
    # ...
    def run
      # ..
      spawn_workers

      Signal.trap "SIGINT" do
        stop
      end

      begin
        booted = false
        in_phased_restart = false
        workers_not_booted = @options[:workers]

        while @status == :run
          # ...
        end

        stop_workers unless @status == :halt
      ensure
        @check_pipe.close
        @suicide_pipe.close
        read.close
        @wakeup.close
      end
    end
    # ...
  end
end

The main process enters the loop where it will spend the rest of its time. At this point, @status will be true unless the server was shut down immediately after being started. Once the status changes, the loop finishes and stop_workers is called. You may recall that this method is also called directly from the SIGTERM handler. Once the server is shut down, which is represented by the run method returning, the previously created pipes get closed.

Let’s start examining the main loop:

module Puma
  class Cluster < Runner
    # ...
    def initialize(launcher)
      # ...
      @next_check = Time.now

      @phased_restart = false
    end

    # ...

    def run
      # ...

      while @status == :run
        begin
          if @options[:idle_timeout] && all_workers_idle_timed_out?
            log "- All workers reached idle timeout"
            break
          end
        
          if @phased_restart
            start_phased_restart
            @phased_restart = false
            in_phased_restart = true
            workers_not_booted = @options[:workers]
          end
        
          check_workers
      
          if read.wait_readable([0, @next_check - Time.now].max)
            # ...
          end
      
          if in_phased_restart && workers_not_booted.zero?
            @events.fire_on_booted!
            debug_loaded_extensions("Loaded Extensions - master:") if @log_writer.debug?
            in_phased_restart = false
          end
        rescue Interrupt
          @status = :stop
        end
      end

      # ...
    end

    # ...

    def all_workers_idle_timed_out?
      (@workers.map(&:pid) - idle_timed_out_worker_pids).empty?
    end

    def idle_timed_out_worker_pids
      idle_workers.keys
    end

    def idle_workers
      @idle_workers ||= {}
    end
  end
end

Each iteration starts by checking if all workers have been timed out due to being idle. The Server running in each worker is responsible for detecting this - you may recall that IO::select, which is used to monitor listening sockets, has an optional timeout if the idle_timeout configuration option is provided. Once this happens, the Server writes the following message to the @worker_write pipe, which was set up earlier for communication between workers and the main process: "#{PipeRequest::IDLE}#{Process.pid}\n".

We will learn how such messages get processed once we reach the read.wait_readable statement. In case the all_workers_idle_timed_out? condition is satisfied, the loop exits and the server shuts down.

The next conditional is there to implement a phased restart. We have a section dedicated to it coming up, so we’ll skip it for now.

module Puma
  class Cluster < Runner
    # ...
    def check_workers
      return if @next_check >= Time.now

      @next_check = Time.now + @options[:worker_check_interval]

      timeout_workers
      wait_workers
      cull_workers
      spawn_workers

      if all_workers_booted?
        # ...
      end

      t = @workers.reject(&:term?)
      t.map!(&:ping_timeout)

      @next_check = [t.min, @next_check].compact.min
    end
    # ...
  end
end

Workers are then checked. The method will execute only if the previously defined next_check timestamp is due. The default interval for passive checking is 5 seconds, controlled by the worker_check_interval configuration option.

If the check has to happen, a set of four methods is called in sequence. Let’s start with timeout_workers:

# lib/puma/cluster.rb

module Puma
  class Cluster < Runner
    # ...
    def timeout_workers
      @workers.each do |w|
        if !w.term? && w.ping_timeout <= Time.now
          details = if w.booted?
                      "(Worker #{w.index} failed to check in within #{@options[:worker_timeout]} seconds)"
                    else
                      "(Worker #{w.index} failed to boot within #{@options[:worker_boot_timeout]} seconds)"
                    end
          log "! Terminating timed out worker #{details}: #{w.pid}"
          w.kill
        end
      end
    end
    # ...
  end
end

# lib/puma/cluster/worker_handle.rb

module Puma
  class Cluster < Runner
    class WorkerHandle
      def initialize(idx, pid, phase, options)
        # ...
        @stage = :started
        # ...
        @last_checkin = Time.now
        # ...
      end

      # ...

      def ping_timeout
        @last_checkin +
          (booted? ?
            @options[:worker_timeout] :
            @options[:worker_boot_timeout]
          )
      end

      def boot!
        @last_checkin = Time.now
        @stage = :booted
      end

      def ping!(status)
        @last_checkin = Time.now
        # ...
      end

      def booted?
        @stage == :booted
      end

      def term!
        @term = true
      end

      def term?
        @term
      end

      # ...
    end
  end
end

To better understand what is going on, we need to unpack WorkerHandle a bit more. The concept of timed-out workers revolves around the PING command that workers are expected to send regularly. We’ve already encountered a ping handler when we briefly investigated fork_worker. That handler is not the reason pings exist; it simply piggybacks on them, since it’s more convenient to utilise an existing heartbeat mechanism. The primary purpose of pings is to indicate to the master process that a particular worker is alive and well.

Each worker handle keeps track of the last time the underlying process emitted a ping command by using the last_checkin timestamp. The timeout value is calculated by adding the predefined timeout value to it. Both worker_boot_timeout and worker_timeout are 60 seconds by default.

timeout_workers iterates over all workers and kills those that are not already being gracefully terminated (the w.term? check) but failed to emit a ping command within the predefined interval. We’ll see how workers emit commands later.

Next up in the check_workers pipeline is wait_workers:

module Puma
  class Cluster < Runner
    # ...
    def wait_workers
      reaped_children = {}
      loop do
        begin
          pid, status = Process.wait2(-1, Process::WNOHANG)
          break unless pid
          reaped_children[pid] = status
        rescue Errno::ECHILD
          break
        end
      end

      @workers.reject! do |w|
        next false if w.pid.nil?
        begin
          if reaped_children.delete(w.pid) || Process.wait(w.pid, Process::WNOHANG)
            true
          else
            w.term if w.term?
            nil
          end
        rescue Errno::ECHILD
          begin
            Process.kill(0, w.pid)
            w.term if w.term?
            false
          rescue Errno::ESRCH, Errno::EPERM
            true
          end
        end
      end

      reaped_children.each do |pid, status|
        log "! reaped unknown child process pid=#{pid} status=#{status}"
      end
    end
    # ...
  end
end

The purpose of this method is to reap terminated workers.

To collect the forked processes, Process::wait2 is used here, which is an abstraction over the wait system call. When -1 is provided as the first argument, it waits for any child process to exit, returning its PID and status. Since a single call reaps at most one process, it is invoked in a loop.

To avoid grinding the whole cluster to a halt if no children are terminated, the WNOHANG flag is provided as an argument. This makes wait2 return immediately if no child has exited. This method does not wait for all children to exit; it simply checks on a best-effort basis. It will be called again during the next check to ensure that all orphaned processes are eventually detected.
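
A tiny sketch of the non-blocking flavour (the timing is illustrative, so the call may occasionally return nil if the child has not exited yet):

pid = fork { exit 7 }
sleep 0.1 # the child has most likely exited and is a zombie until reaped

reaped_pid, status = Process.wait2(-1, Process::WNOHANG)
p reaped_pid == pid # => true
p status.exitstatus # => 7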

Why is iterating over @workers not enough here? On Linux, orphaned processes get re-parented to PID 1, which is traditionally the init system and is responsible for reaping them. In containerised environments, however, the container’s command becomes the process with ID 1, making it responsible for this cleanup. The Rack application may fork processes that Puma does not know about; if those exit without being reaped, they will linger as zombie processes. Puma takes this possibility into account and diligently reaps all children, not just its workers.

Once all PIDs and statuses of all children are collected, wait_workers starts iterating over @workers via reject!, which deletes handles of dead processes directly from the array.

next false if w.pid.nil? skips processes that are yet to be forked by worker 0 if fork_worker is enabled.

The PID of each registered worker is then deleted from reaped_children. If the hash did not contain the given PID, there’s a fallback that explicitly checks if the particular worker process exited. This can happen if fork_worker is enabled: workers will be children of worker 0 in this case, which will be the only child of the master, hence wait2 won’t detect them.

If the worker is active, Process::wait will return nil. The second branch checks if this worker is supposed to be terminated. As we’ve seen, Worker#term first sends the SIGTERM signal. If the worker does not terminate itself gracefully, the follow-up invocation to term will SIGKILL it - this is where it happens.

If Process::wait is called with a PID that is not a direct child of the master process, which happens when workers are forked from worker 0, this method will raise Errno::ECHILD. To handle this edge case, Process::kill sends a special signal 0. It’s special because it’s not actually sent to the process; its use case is to detect if the process with the given ID exists. If it does not exist, Errno::ESRCH or Errno::EPERM are raised, indicating that the worker has been terminated. If it does exist, the same logic as in the second branch of the if statement applies.
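
A minimal sketch of such an existence check - the alive? helper is made up for illustration:

def alive?(pid)
  Process.kill(0, pid) # performs the checks without delivering a signal
  true
rescue Errno::ESRCH
  false # no such process
rescue Errno::EPERM
  true  # the process exists but belongs to another user
end

alive?(Process.pid) # => true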

The third method in the check_workers pipeline is cull_workers:

module Puma
  class Cluster < Runner
    # ...
    def cull_workers
      diff = @workers.size - @options[:workers]
      return if diff < 1
      debug "Culling #{diff} workers"

      workers = workers_to_cull(diff)
      debug "Workers to cull: #{workers.inspect}"

      workers.each do |worker|
        log "- Worker #{worker.index} (PID: #{worker.pid}) terminating"
        worker.term
      end
    end

    # ...

    def workers_to_cull(diff)
      workers = @workers.sort_by(&:started_at)

      workers.reject! { |w| w.index == 0 } if @options[:fork_worker]

      workers[cull_start_index(diff), diff]
    end

    def cull_start_index(diff)
      case @options[:worker_culling_strategy]
      when :oldest
        0
      else
        -diff
      end
    end

    # ...
  end
end

Its purpose is to remove excess workers, which can appear during downscaling, upscaling, restarts, and other moments of Puma’s lifecycle. Without this detection, workers could leak, violating the contract imposed by the configured cluster size.

If there are more active workers than necessary, the youngest ones will be terminated. Their age is determined by the started_at attribute of their handle, and Puma can be configured to terminate the oldest workers first. Worker 0 is never culled if it’s responsible for forking.
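
A quick sketch of the slicing arithmetic behind workers_to_cull, with five illustrative workers:

workers = %w[w0 w1 w2 w3 w4] # sorted oldest to youngest by started_at
diff = 2

workers[-diff, diff] # => ["w3", "w4"] - the default culls the youngest
workers[0, diff]     # => ["w0", "w1"] - the :oldest strategy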

The final method called by check_workers is spawn_workers, which we have already analysed - the main process also calls it once at the very beginning.

The purpose of calling spawn_workers on each main loop iteration is to ensure the cluster runs at full capacity. Here is a non-exhaustive list of situations where this method will fork new workers:

  • If any of the workers could not be started for any reason;
  • If any of the workers timed out;
  • If another worker has terminated for any reason;
  • If a phased restart is happening.

Once the maintenance of workers is done, check_workers calculates the timestamp after which the next check-up should take place:

module Puma
  class Cluster < Runner
    # ...
    def check_workers
      return if @next_check >= Time.now

      @next_check = Time.now + @options[:worker_check_interval]

      timeout_workers
      wait_workers
      cull_workers
      spawn_workers

      if all_workers_booted?
        # ...
      end

      t = @workers.reject(&:term?)
      t.map!(&:ping_timeout)

      @next_check = [t.min, @next_check].compact.min
    end
    # ...
  end
end

The workers that are not marked for termination are mapped to their ping_timeout values, which we’ve seen, and @next_check is assigned the earliest of those timeouts and the regular interval. The check will still happen at least around every worker_check_interval (5 by default) seconds, no matter what.

The all_workers_booted? branch will be covered in the section covering phased restarts.

Let’s return to Cluster#run:

module Puma
  class Cluster < Runner
    # ...
    def run
      # ...
      check_workers

      if read.wait_readable([0, @next_check - Time.now].max)
        req = read.read_nonblock(1)
        next unless req

        if req == Puma::Const::PipeRequest::WAKEUP
          @next_check = Time.now
          next
        end

        result = read.gets
        pid = result.to_i

        if req == Puma::Const::PipeRequest::BOOT || req == Puma::Const::PipeRequest::FORK
          pid, idx = result.split(':').map(&:to_i)
          w = worker_at idx
          w.pid = pid if w.pid.nil?
        end

        if w = @workers.find { |x| x.pid == pid }
          case req
          when Puma::Const::PipeRequest::BOOT
            w.boot!
            log "- Worker #{w.index} (PID: #{pid}) booted in #{w.uptime.round(2)}s, phase: #{w.phase}"
            @next_check = Time.now
            workers_not_booted -= 1
          when Puma::Const::PipeRequest::EXTERNAL_TERM
            w.term!
          when Puma::Const::PipeRequest::TERM
            w.term unless w.term?
          when Puma::Const::PipeRequest::PING
            status = result.sub(/^\d+/,'').chomp
            w.ping!(status)
            @events.fire(:ping!, w)

            if in_phased_restart && workers_not_booted.positive? && w0 = worker_at(0)
              w0.ping!(status)
              @events.fire(:ping!, w0)
            end

            if !booted && @workers.none? {|worker| worker.last_status.empty?}
              @events.fire_on_booted!
              debug_loaded_extensions("Loaded Extensions - master:") if @log_writer.debug?
              booted = true
            end
          when Puma::Const::PipeRequest::IDLE
            if idle_workers[pid]
              idle_workers.delete pid
            else
              idle_workers[pid] = true
            end
          end
        else
          log "! Out-of-sync worker list, no #{pid} worker"
        end
      end
    # ...
  end
end

read is the reader’s end of the @worker_write pipe that workers use to communicate their state to the master process. wait_readable is called on it with a timeout, which is based on the @next_check value. So far, we’ve seen @next_check being updated in the just-called check_workers method.

This mechanism prevents busy-looping by calculating the next earliest time when a check should be performed. When the master process knows that there’s no reason to perform the heavy lifting of managing the cluster state during an iteration of the main loop, it simply spends this time waiting for the next command to arrive over the pipe.

But what if something unexpected happens, like when the server is being ordered to stop by the process manager via a signal? Cluster#wakeup! is meant for this exact situation. As we’ve seen, it gets called from signal handlers and a bunch of other places. Once a command is written, the sleeping main process will immediately wake up, and wait_readable will return. This is the exact same approach that the Server uses to process state changes, except there the pipe’s file descriptor is monitored using IO::select.

So, despite how it might seem, the main process being blocked on reading from the pipe is not a problem, as it will be able to react to any changes immediately.

Let’s examine how each command is processed, starting with WAKEUP. We will look into what emits them in the next section.

It has the most straightforward handler: it simply updates @next_check to the current time and starts a new iteration of the loop, which triggers a call to check_workers to update the cluster’s state.

All of the other commands come with the ID of the process from which they were sent.
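
Based on the handlers above and the stats payload we will see workers emit later, the wire format can be sketched roughly like this (the values are made up):

# One line per command: a 1-byte command, an optional payload, then "\n".
line = %Q!p12345{ "backlog":0 }\n!

req  = line[0]   # the command byte, read via read.read_nonblock(1)
rest = line[1..] # the remainder, returned by read.gets

pid    = rest.to_i                  # => 12345; to_i stops at the first non-digit
status = rest.sub(/^\d+/, '').chomp # => '{ "backlog":0 }'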

BOOT and FORK commands are first used to hydrate the PID of a handle for a worker that is supposed to be forked by worker 0. Remember that when fork_worker is enabled, a handle does not contain the process’s ID immediately since the main process has no way to know it.

For other commands, the worker handle is first extracted by the passed PID from the corresponding array.

When the BOOT command is received, the corresponding handle is marked as booted via boot!, which also refreshes its last_checkin value. The workers_not_booted local variable, set to the configured total number of workers, is decremented. It becomes clear that once the desired number of workers emit this command, the main process considers the cluster to be complete.

The EXTERNAL_TERM handler calls term! on the handle, which simply sets its internal term instance variable to true. It does not send a signal to the worker, and we will find out why once we look into its internals.

On the other hand, the TERM handler does send a signal by calling term. Note that this does not happen if the worker is already going through the process of termination, otherwise, workers would get killed without a chance to gracefully shut down.

We’ve already discussed the PING command a little. Its handler is the place where ping! is called on the worker handle, updating its last_checkin timestamp.

The next conditional is there to make worker 0 perform reforking if the server is in the process of a phased restart. Relying on the PING command emitted from worker 0 itself is not enough here; we’ll soon learn why.

The PING handler is also responsible for executing the callbacks once the application boots. Clustered mode considers the application to be booted once all workers are up and running.

The handler for the IDLE command updates the idle_workers hash, which is used by the timeout_workers method.

The main process will continue looping through commands until the server begins the shutdown process, at which point it will stop and call stop_workers if the shutdown is graceful.

There are two ways to fully stop Puma: the graceful shutdown using stop and an immediate shutdown using halt. stop is called on SIGTERM, while halt can be initiated only manually.

We’ve already seen what stop_workers does, but as a quick reminder, it sends SIGTERM to all worker processes and calls wait_workers in a loop until all child processes exit.

Children Processes

Now that we’ve completely covered the loop of the main process, let’s come back to Cluster#spawn_worker and see what the children do:

module Puma
  class Cluster < Runner
    # ...
    def spawn_worker(idx, master)
      # ...
      pid = fork { worker(idx, master) }
      # ...
    end

    # ...

    def worker(index, master)
      @workers = []

      @master_read.close
      @suicide_pipe.close
      @fork_writer.close

      pipes = { check_pipe: @check_pipe, worker_write: @worker_write }
      if @options[:fork_worker]
        pipes[:fork_pipe] = @fork_pipe
        pipes[:wakeup] = @wakeup
      end

      server = start_server if preload?
      new_worker = Worker.new index: index,
                              master: master,
                              launcher: @launcher,
                              pipes: pipes,
                              server: server
      new_worker.run
    end
    # ...
  end
end

Once forked, each child process closes the writer ends of the suicide and fork pipes, as well as @master_read - the reader end of the pipe that workers write to. This is a common and necessary practice when working with pipes across forked processes: otherwise, the file descriptors inherited from the parent would keep those pipe ends open, and a reader on the other side would never receive EOF after its counterpart exits.

The Server is then initialised if preload is enabled. Worker, perhaps a little confusingly, is a subclass of Puma::Runner just like Cluster itself. If the @app object (abstracted away by the Server here), which gets lazily initialised once load_and_bind is called, were not passed down, each worker would initialise its own app, completely eliminating the point of preloading. The Worker instance is then created:

module Puma
  class Cluster < Puma::Runner
    class Worker < Puma::Runner
      attr_reader :index, :master

      def initialize(index:, master:, launcher:, pipes:, server: nil)
        super(launcher)

        @index = index
        @master = master
        @check_pipe = pipes[:check_pipe]
        @worker_write = pipes[:worker_write]
        @fork_pipe = pipes[:fork_pipe]
        @wakeup = pipes[:wakeup]
        @server = server
        @hook_data = {}
      end

      # ...
    end
  end
end

Let’s look into Worker#run:

module Puma
  class Cluster < Puma::Runner
    class Worker < Puma::Runner
      # ...
      def run
        title  = "puma: cluster worker #{index}: #{master}"
        title += " [#{@options[:tag]}]" if @options[:tag] && !@options[:tag].empty?
        $0 = title

        Signal.trap "SIGINT", "IGNORE"
        Signal.trap "SIGCHLD", "DEFAULT"

        Thread.new do
          Puma.set_thread_name "wrkr check"
          @check_pipe.wait_readable
          log "! Detected parent died, dying"
          exit! 1
        end

        # ...

        @config.run_hooks(:before_worker_boot, index, @log_writer, @hook_data)

        begin
          server = @server ||= start_server
        rescue Exception => e
          log "! Unable to start worker"
          log e
          log e.backtrace.join("\n    ")
          exit 1
        end

        # ...
      end
    end
  end
end

When fork happens, signal handlers get inherited from the parent process by its children.

Interestingly, SIGINT and SIGCHLD signal traps get overridden: the handler for the former gets completely erased, while the latter is reset back to the default handler. This prevents child processes from calling stop when SIGINT is received and from reacting to SIGCHLD the way the master process would.

A thread is then started that will run indefinitely until it’s detected that the parent was killed. This acts as a safeguard; normally, the parent will terminate children first before stopping. If the suicide pipe receives an EOF, implying that it was closed, the worker will exit via exit!.
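
This is a classic trick for detecting the death of a parent process. A self-contained sketch of it, with illustrative names:

require "io/wait"

check, suicide = IO.pipe

if (child = fork)
  check.close # the parent keeps only the writer end
  sleep 2
  exit!       # the kernel closes the parent's copy of `suicide`
else
  suicide.close # the child keeps only the reader end
  Thread.new do
    check.wait_readable # a pipe becomes readable on EOF
    puts "parent died, dying"
    exit! 1
  end
  sleep # pretend to serve requests
end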

module Puma
  class Cluster < Puma::Runner
    class Worker < Puma::Runner
      # ...
      def run
        # ...

        restart_server = Queue.new << true << false

        fork_worker = @options[:fork_worker] && index == 0

        if fork_worker
          restart_server.clear
          worker_pids = []
          Signal.trap "SIGCHLD" do
            wakeup! if worker_pids.reject! do |p|
              Process.wait(p, Process::WNOHANG) rescue true
            end
          end

          Thread.new do
            Puma.set_thread_name "wrkr fork"
            while (idx = @fork_pipe.gets)
              idx = idx.to_i
              if idx == -1
                if restart_server.length > 0
                  restart_server.clear
                  server.begin_restart(true)
                  @config.run_hooks(:before_refork, nil, @log_writer, @hook_data)
                end
              elsif idx == 0
                restart_server << true << false
              else
                worker_pids << pid = spawn_worker(idx)
                @worker_write << "#{Puma::Const::PipeRequest::FORK}#{pid}:#{idx}\n" rescue nil
              end
            end
          end
        end

        # ...
      end
      # ...
    end
  end
end

If the worker has index 0 and fork_worker is enabled, this worker becomes responsible for forking other workers. First of all, it sets up a handler for SIGCHLD, which the process receives once a child exits. This handler calls wakeup!, just like the main process would do. The conditional around worker_pids relies on a subtlety of Array#reject!: the method returns nil when no elements were removed, so wakeup! only fires when at least one terminated child was actually reaped.
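
A quick demonstration of that return-value subtlety:

pids = [101, 102]

pids.reject! { |p| false }    # => nil - nothing was removed
pids.reject! { |p| p == 101 } # => [102] - truthy, something was reaped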

Worker 0 then spawns a thread that reads from the fork pipe, to which the master process writes the indices of workers that need to be forked. There are two special cases: -1 and 0.

When -1 is received, the worker conducts a graceful stop of the server synchronously, waiting until it’s done. Note that it’s a stop and not a shutdown - the listening sockets do not get closed, which is very important for availability. If they were closed here, some unlucky clients’ connections would regularly get interrupted during phased restarts or reforks.

Why is this needed? Any file descriptor that is open during fork will get inherited by the child process. If a worker gets forked before its parent, worker 0, finishes processing its requests, it will inherit not only file descriptors but also the todo list of the thread pool, which may contain connections that are already being processed. This is the exact same reason why Puma tracks fork-unsafe threads - state can leak if the fork is performed carelessly, leading to bugs that are incredibly hard to track down.

Once the forking thread receives 0 on the pipe, it considers it to be a green light to start the server again. As you may recall, the master process writes it right after it’s done generating indices for to-be-forked processes. Worker 0 should not be slacking off on handling requests just because it’s the parent of other workers; hence, once the forking process is over, it restarts the server. This means that worker 0’s server is stopped only for a short while during a phased restart or an explicit refork, which has a moderate impact on latency.

Finally, each number that is not -1 or 0 results in a new worker being spawned. Worker 0 then writes the FORK command to @worker_write, which, as we’ve seen, is used by the master process to mark the corresponding handle with a PID. Worker overrides spawn_worker with a minor change: it does not close the irrelevant ends of inherited pipes, since they are already closed.

At this point, the fork_worker logic that is relevant only to worker 0 is over.

module Puma
  class Cluster < Puma::Runner
    class Worker < Puma::Runner
      # ...
      def run
        # ...

        Signal.trap "SIGTERM" do
          @worker_write << "#{Puma::Const::PipeRequest::EXTERNAL_TERM}#{Process.pid}\n" rescue nil
          restart_server.clear
          server.stop
          restart_server << false
        end

        begin
          @worker_write << "#{Puma::Const::PipeRequest::BOOT}#{Process.pid}:#{index}\n"
        rescue SystemCallError, IOError
          # ...
          STDERR.puts "Master seems to have exited, exiting."
          return
        end

        # ...
      end
      # ...
    end
  end
end

Each worker overrides the inherited SIGTERM handler. The new handler reacts to SIGTERM signals fanned out from the main process once shutdown is initiated. It begins by writing the EXTERNAL_TERM command to the pipe, which causes the main process to mark the corresponding worker handle as terminating.

The worker then emits the BOOT command, indicating that it’s ready.

module Puma
  class Cluster < Puma::Runner
    class Worker < Puma::Runner
      # ...
      def run
        # ...

        restart_server = Queue.new << true << false

        # ...

        while restart_server.pop
          server_thread = server.run

          if @log_writer.debug? && index == 0
            debug_loaded_extensions "Loaded Extensions - worker 0:"
          end

          stat_thread ||= Thread.new(@worker_write) do |io|
            Puma.set_thread_name "stat pld"
            base_payload = "p#{Process.pid}"

            while true
              begin
                # ...
                payload = %Q!#{base_payload}{ "backlog":#{b}, "running":#{r}, "pool_capacity":#{t}, "max_threads":#{m}, "requests_count":#{rc} }\n!
                io << payload
              rescue IOError
                break
              end
              sleep @options[:worker_check_interval]
            end
          end
          server_thread.join
        end

        @config.run_hooks(:before_worker_shutdown, index, @log_writer, @hook_data)
      ensure
        @worker_write << "#{Puma::Const::PipeRequest::TERM}#{Process.pid}\n" rescue nil
        @worker_write.close
      end
    end
  end
end

Finally, the worker enters its main loop. It utilises an interesting way to decide whether it should keep running: at the top of each iteration, it pops an item from the restart_server queue. Ordinarily, the queue will contain true as its first element once the worker is booted up.

However, as we’ve seen, some actions, such as when worker 0 has to stop its server or when SIGTERM is received, clear the queue and push false, which remains as its only item. The SIGTERM handler also stops the server, which means that at some point, the next iteration of the worker’s main loop will pop that false from the queue, ending the execution of Worker#run.

The initial combination of items - true and false - also implies that if the server thread dies, it won’t be restarted, resulting in the worker being terminated after some time.
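
A minimal sketch of this queue-driven loop; the comment marks where a real handler would intervene:

restart_server = Queue.new << true << false

while restart_server.pop # first pop => true: run the server once
  puts "server ran and was joined"
  # a SIGTERM handler would do: restart_server.clear; restart_server << false
end
# second pop => false: the loop ends and the worker winds down
puts "worker exiting"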

Once inside the loop, the worker starts the server by calling run, at which point requests start being served.

It then spawns a thread responsible for two things:

  • Propagating worker stats to the parent for observability;
  • Emitting a PING command every worker_check_interval seconds, which is 5 seconds by default.

This means that as long as the worker process is alive, the main process will keep getting pinged by it.

Finally, once the server is shut down, the worker sends the TERM command to the pipe. Once processed, the main process sends SIGTERM to the emitting worker. Without this command, the server thread might finish, and the master process may not learn that a worker is conceptually dead.

Phased Restarts

Let’s now summarise what we’ve already seen of phased restarts and understand what their benefit is.

A phased restart is a cluster-only zero-downtime gradual replacement of all worker processes. It does not result in unavailability for clients.

In containerised environments, such as Kubernetes, restarts are usually handled at the infrastructure level. From the perspective of an individual Puma cluster, these are not restarts, but shutdowns, since they are usually implemented by sending a SIGTERM to the pod. Therefore, if running in Kubernetes, Puma will usually react to that signal and, unless explicitly reconfigured, will never encounter an automatically triggered phased restart.

Phased restarts are useful in alternative deployment strategies, where there is no orchestrator to conduct a restart of multiple Puma instances. Unless they are used to simply re-read application configuration from some storage, they are mostly pointless if the host system does not have access to the new version of application code. This is almost never the case when running in containers and is rather meant for deployment tooling like Capistrano, which has been extremely popular in the Ruby ecosystem. Tools like Capistrano create symbolic links to release directories after a new version gets uploaded to the host machine, and then restart applications, forcing them to re-read the directory from which they must source new code to run.

Even though phased restarts may be rare to encounter in the real world, we will look into them regardless.

A phased restart can be initiated by sending the SIGUSR1 signal to the master process. Its handler is defined in Launcher:

module Puma
  class Launcher
    # ...
    def setup_signals
      # ...
      begin
        Signal.trap "SIGUSR1" do
          phased_restart
        end
      rescue Exception
        log "*** SIGUSR1 not implemented, signal based restart unavailable!"
      end
      # ...
    end

    # ...

    def phased_restart
      unless @runner.respond_to?(:phased_restart) and @runner.phased_restart
        log "* phased-restart called but not available, restarting normally."
        return restart
      end
      true
    end
    # ...
  end
end

The Single runner does not support phased restarts because there’s only one process involved, which makes it impossible to perform a reload without causing downtime. This runner supports hot restarts, which will be covered next.

module Puma
  class Cluster < Runner
    # ...
    def phased_restart(refork = false)
      return false if @options[:preload_app] && !refork

      @phased_restart = true
      wakeup!

      true
    end
    # ...
  end
end

A phased restart is prevented if preload_app is enabled, which is logical: if the master process loads the application up-front, forked children will not be able to reload its code by re-reading the symlinked directory. The master process would have to be terminated in order to make it happen, which would not be a phased restart anymore.

This method simply sets @phased_restart to true and calls wakeup! so that the main loop notices immediately. Let’s see how it affects the main process:

module Puma
  class Cluster < Runner
    # ...
    def run
      # ...
      begin
        booted = false
        in_phased_restart = false
        workers_not_booted = @options[:workers]

        while @status == :run
          begin
            # ...

            if @phased_restart
              start_phased_restart
              @phased_restart = false
              in_phased_restart = true
              workers_not_booted = @options[:workers]
            end

            check_workers

            # ...
          end
          # ...
        end
        # ...
      end
      # ...
    end

    # ...

    def start_phased_restart
      @events.fire_on_restart!
      @phase += 1
      log "- Starting phased worker restart, phase: #{@phase}"

      dir = @launcher.restart_dir
      log "+ Changing to #{dir}"
      Dir.chdir dir
    end
    # ...
  end
end

Once the main process’s loop detects that a phased restart was signalled, it bumps the @phase of the cluster, which starts with 0 after startup.

Here is where the directory gets re-read, which enables reloading of code and gem dependencies. @restart_dir, unless explicitly configured, will default to Dir::pwd, which is the directory from where Puma was launched. If it’s a symlink, Dir::chdir will automatically force subsequent require calls during application loading to fetch newly deployed code.
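
A self-contained sketch of this mechanism, using throwaway directories instead of real releases:

require "tmpdir"
require "fileutils"

Dir.mktmpdir do |root|
  v1 = File.join(root, "releases", "v1")
  v2 = File.join(root, "releases", "v2")
  FileUtils.mkdir_p [v1, v2]

  current = File.join(root, "current")
  File.symlink(v1, current)
  Dir.chdir(current)
  puts Dir.pwd # the physical path of the v1 release

  File.unlink(current)
  File.symlink(v2, current)
  Dir.chdir(current) # effectively what start_phased_restart does
  puts Dir.pwd # now the v2 release

  Dir.chdir("/") # leave the directory so it can be cleaned up
end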

During the very same main loop iteration, check_workers will perform the following logic:

module Puma
  class Cluster < Runner
    # ...
    def check_workers
      # ...

      if all_workers_booted?
        w = @workers.find { |x| x.phase != @phase }

        if w
          log "- Stopping #{w.pid} for phased upgrade..."
          unless w.term?
            w.term
            log "- #{w.signal} sent to #{w.pid}..."
          end
        end
      end

      # ...
    end
    # ...
  end
end

If cluster capacity is satisfied, it picks the first worker that it can find whose phase is different from the current cluster phase. In the case of the first restart of a server, the cluster’s phase will be 1, while the phase of every worker will be 0.

This worker is then terminated.

During one of the next iterations of the main loop, the spawn_workers method, called by check_workers, will fork a new worker, since it will detect that the cluster is not running at full capacity due to the terminated worker being evicted from the @workers array.

module Puma
  class Cluster < Runner
    # ...
    def spawn_workers
      diff = @options[:workers] - @workers.size
      return if diff < 1

      # ...
    end
    # ...
  end
end

This will continue on each iteration of the main loop until the condition at the end of run is met:

module Puma
  class Cluster < Runner
    # ...
    def run
      # ...

      if in_phased_restart && workers_not_booted.zero?
        @events.fire_on_booted!
        debug_loaded_extensions("Loaded Extensions - master:") if @log_writer.debug?
        in_phased_restart = false
      end

      # ...
    end
    # ...
  end
end

In case it’s not clear how workers_not_booted.zero? ever evaluates to true, recall that each BOOT command emitted from forked processes decrements this variable. After the desired worker count is reached, the phased restart is considered to be over. At this point, all workers enter the same phase as the cluster.

Because of the all_workers_booted? condition in check_workers, at most one worker is ever restarting at a time, hence the name ‘phased’ restart. This results in minimal impact on throughput without sacrificing client latency, since workers are always shut down gracefully. We will look at how Servers achieve availability during restarts in the final section.

You might wonder what happens when the fork_worker option is enabled and an explicit request to refork is made. Worker 0 should never be terminated, otherwise this feature would be useless - each restart and refork would result in worker 0 starting from a clean slate, eliminating the benefits of deferred forking.

Reforking can be conducted either manually, by sending the SIGURG signal to the master process, or automatically, exactly once during the server’s lifetime, after a preconfigured number of requests has been processed by worker 0:

module Puma
  class Cluster < Runner
    # ...
    def setup_signals
      if @options[:fork_worker]
        Signal.trap "SIGURG" do
          fork_worker!
        end

        if (fork_requests = @options[:fork_worker].to_i) > 0
          @events.register(:ping!) do |w|
            fork_worker! if w.index == 0 &&
              w.phase == 0 &&
              w.last_status[:requests_count] >= fork_requests
          end
        end
      end

      # ...
    end
    # ...
  end
end

Here’s what prevents the forking worker from being reloaded:

module Puma
  class Cluster < Runner
    # ...
    def fork_worker!
      if (worker = worker_at 0)
        worker.phase += 1
      end
      phased_restart(true)
    end
    # ...
  end
end

As you can see, worker 0’s phase is always incremented to the same phase in which the cluster will end up during the next iteration. This initiates the normal phased restart with the guarantee that the forking worker will not get terminated.

Terminating

We’ve explored how cluster-mode exclusive phased restarts work, but we stopped at the process level - we did not fully investigate how the lower-level connection handling behaves during shutdowns. Throughout the previous sections, we looked into some aspects of the Server shutdown logic, but we’re yet to paint a full picture.

Now that we’re practically done with dissecting Puma, it is the perfect time to assemble everything we’ve learned into a holistic view of how this server affects client experience during a crucial cycle of almost every application’s life - deployments.

Shutting Down With Grace

In modern environments, servers such as Puma run in containers, which are usually gracefully terminated before new versions of applications get deployed. In this sense, we’re mostly interested in how Puma handles SIGTERM, which is the universal signal for graceful termination.

Let’s remember how Puma’s launcher sets up the SIGTERM handler:

module Puma
  class Launcher
    # ...
    def setup_signals
      begin
        Signal.trap "SIGTERM" do
          do_graceful_stop

          raise(SignalException, "SIGTERM") if @options[:raise_exception_on_sigterm]
        end
      rescue Exception
        log "*** SIGTERM not implemented, signal based gracefully stopping unavailable!"
      end
    end

    # ...

    def do_graceful_stop
      @events.fire_on_stopped!
      @runner.stop_blocked
    end

    # ...
  end
end

The above signal handler is only called for installations running the Single runner. For the Cluster runner’s signal handler, refer to the previous section.

# lib/puma/single.rb

module Puma
  class Single < Runner
    # ...
    def stop_blocked
      # ...
      @server&.stop true
    end
    # ...
  end
end

# lib/puma/server.rb

module Puma
  class Server
    # ...

    STOP_COMMAND = "?"
  
    # ...

    def run(background=true, thread_name: 'srv')
      # ...

      @check, @notify = Puma::Util.pipe unless @notify

      # ...

      if background
        @thread = Thread.new do
          Puma.set_thread_name thread_name
          handle_servers
        end
        return @thread
      else
        handle_servers
      end
    end

    # ...

    def stop(sync=false)
      notify_safely(STOP_COMMAND)
      @thread.join if @thread && sync
    end

    # ...

    def notify_safely(message)
      @notify << message
      # ...
    end

    # ...

    def handle_servers
      begin
        check = @check
        sockets = [check] + @binder.ios

        while @status == :run || (drain && shutting_down?)
          begin
            ios = IO.select sockets, nil, nil, (shutting_down? ? 0 : @idle_timeout)

            # ...

            ios.first.each do |sock|
              if sock == check
                break if handle_check
              # ...
              end

            # ...
          end
        end

        if queue_requests
          @queue_requests = false
          @reactor.shutdown
        end

        graceful_shutdown if @status == :stop || @status == :restart
      end
    end

    # ...

    def handle_check
      cmd = @check.read(1)

      case cmd
      when STOP_COMMAND
        @status = :stop
        return true
      when HALT_COMMAND
        @status = :halt
        return true
      when RESTART_COMMAND
        @status = :restart
        return true
      end

      false
    end

    # ...
  end
end

Once Runner#stop_blocked is called, a STOP_COMMAND is written to a pipe managed by the Server. The reader end of this pipe is added to the array of listening sockets on which IO::select is called in a loop, ensuring that once a command is available on the pipe, it will be processed.

A crucial detail here is that stop_blocked blocks until the server thread is joined.

Once the command reaches the server loop, its handler sets the current @status to :stop, causing the loop to stop.

This is the moment when the listening sockets stop accepting new connections, although the sockets themselves are not closed, which still makes it possible for clients to connect. If they do, they will be buffered in the socket queue of the configured size.
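
Putting the last few paragraphs together, here is a self-contained sketch of this control-pipe-plus-select pattern; the one-second timer stands in for Server#stop:

require "socket"

check, notify = IO.pipe
server = TCPServer.new("127.0.0.1", 0)
status = :run

Thread.new { sleep 1; notify << "?" } # a stand-in for Server#stop

while status == :run
  readable, = IO.select([check, server])
  readable.each do |io|
    if io == check
      status = :stop if io.read(1) == "?" # STOP_COMMAND
    else
      io.accept.close # accept and process a connection...
    end
  end
end
# the loop has exited, but `server` is still open: late clients complete the
# TCP handshake and wait in the listen backlog
puts "stopped accepting"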

Afterwards, Reactor is shut down:

module Puma
  class Reactor
    # ...

    def shutdown
      @input.close
      begin
        @selector.wakeup
      rescue IOError
      end
      @thread&.join
    end

    # ...

    def run(background=true)
      if background
        @thread = Thread.new do
          Puma.set_thread_name "reactor"
          select_loop
        end
      else
        select_loop
      end
    end

    # ...

    def select_loop
      close_selector = true
      begin
        until @input.closed? && @input.empty?
          # ...
        end
        # ...
      end
      @timeouts.each(&@block)
      @selector.close if close_selector
    end

    # ...
  end
end

The shutdown method makes the reactor stop. From this point on, no connections are added to the reactor - not even already-known ones - because @queue_requests has been set to false.

Every connection that is currently being buffered in the reactor is iterated over via @timeouts.each(&@block) and triggers a call to Server#reactor_wakeup:

module Puma
  class Server
    # ...
    def reactor_wakeup(client)
      shutdown = !@queue_requests
      if client.try_to_finish || (shutdown && !client.can_close?)
        @thread_pool << client
      elsif shutdown || client.timeout == 0
        client.timeout!
      else
        # ...
      end
    rescue StandardError => e
      client_error(e, client)
      client.close
      true
    end
    # ...
  end
end

In-flight connections that are actively being read from are placed into the @thread_pool, while idle connections are gracefully timed out with a 408 response code. The reactor shuts down fully now, as it has successfully gotten rid of all its work.

Back in Server#handle_servers, graceful_shutdown is called as the final method. Keep in mind that the signal trap is waiting all this time and is not terminating the process yet.

# lib/puma/server.rb

module Puma
  class Server
    # ...
    def graceful_shutdown
      # ...

      if @status != :restart
        @binder.close
      end

      if @thread_pool
        if timeout = options[:force_shutdown_after]
          @thread_pool.shutdown timeout.to_f
        else
          @thread_pool.shutdown
        end
      end
    end
    # ...
  end
end

# lib/puma/binder.rb

module Puma
  class Binder
    # ...
    def close
      @ios.each { |i| i.close }
    end

    # ...
  end
end

The binder is closed, which closes and unbinds all listening sockets. At this point, new connections cannot be made, and attempts to connect result in ECONNREFUSED.

The above is true for running in single mode only. If running in cluster mode, the listening sockets are opened before workers are forked. This means that the master process will be listening on these sockets indefinitely, while workers will be accepting connections. As long as there is always at least one active worker with sufficient capacity, clients will not be waiting too long to get accepted, and they will almost always be able to connect without being refused. Additionally, incoming connections will be subjected to rudimentary load balancing by the operating system.
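
A minimal sketch of that shared-socket arrangement - two children stand in for workers, and the kernel decides which blocked accept call receives each connection:

require "socket"

listener = TCPServer.new("127.0.0.1", 0) # bound once, before forking
port = listener.addr[1]

pids = 2.times.map do |i|
  fork do
    client = listener.accept # each child accepts from the inherited socket
    client.puts "served by child #{i} (pid #{Process.pid})"
    client.close
    exit!
  end
end

2.times { puts TCPSocket.new("127.0.0.1", port).gets }
pids.each { |pid| Process.wait(pid) }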

After the binder is closed, ThreadPool#shutdown is called. There is no timeout by default, which makes the server wait for all connections to be processed indefinitely or until the process is SIGKILLed.

module Puma
  class ThreadPool
    # ...
    def shutdown(timeout=-1)
      threads = with_mutex do
        @shutdown = true
        @trim_requested = @spawned
        @not_empty.broadcast
        @not_full.broadcast

        @auto_trim&.stop
        @reaper&.stop
        @workers.dup
      end

      if timeout == -1
        threads.each(&:join)
      else
        join = ->(inner_timeout) do
          start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
          threads.reject! do |t|
            elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
            t.join inner_timeout - elapsed
          end
        end

        join.call(timeout)

        @shutdown_mutex.synchronize do
          @force_shutdown = true
          threads.each do |t|
            t.raise ForceShutdown if t[:with_force_shutdown]
          end
        end
        join.call(@shutdown_grace_time)

        threads.each(&:kill)
        join.call(1)
      end

      @spawned = 0
      @workers = []
    end
    # ...
  end
end

The pool’s @shutdown variable gets set to true, and @trim_requested is set to the number of currently spawned threads. The latter makes threads check themselves out from the pool once the todo queue is empty. Both condition variables are broadcast in order to wake the threads up if they are sleeping.

If the timeout is not set, which is the default, the pool simply waits for all threads to finish. If set, threads that terminate within the defined timeout are removed from the threads array. If there are threads that did not exit within the timeout, ForceShutdown is raised on them.
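
A sketch of this shrinking-deadline join, with illustrative sleep durations:

threads = [0.1, 0.2, 60].map { |s| Thread.new { sleep s } }

deadline = 1.0
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)

threads.reject! do |t|
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  t.join(deadline - elapsed) # truthy (the thread) if it finished in time
end

p threads.size # => 1 - the sleeper that blew the budget would get ForceShutdown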

You may remember that some parts of the Server utilise the with_force_shutdown method:

module Puma
  class ThreadPool
    # ...
    def with_force_shutdown
      t = Thread.current
      @shutdown_mutex.synchronize do
        raise ForceShutdown if @force_shutdown
        t[:with_force_shutdown] = true
      end
      yield
    ensure
      t[:with_force_shutdown] = false
    end
    # ...
  end
end

There are exactly three places where it gets called:

  • Around Client#finish in Server#process_client, which waits indefinitely until a request is fully read;
  • Around Client#reset in Server#process_client, which attempts to read the next request off a keep-alive connection;
  • Around the invocation of the underlying Rack application, i.e., application code, in Server#handle_request.

The ForceShutdown exception is only ever raised when threads are within code wrapped by with_force_shutdown.

We previously covered why raising an exception on a thread can be dangerous. Corrupting Puma’s internal state, even during shutdown, could potentially result in unforeseen circumstances. That’s why these exceptions are not allowed to be raised everywhere.

You may rightly wonder why Puma marks arbitrary application code as interruptible, when it could be considered the least safe callsite in the whole Puma codebase. The reason is simple: the whole process will shut down almost immediately after exceptions are raised on hanging worker threads. Even if state corruption within the application occurs in this case, it is unlikely to result in severely negative consequences.

After raising the exceptions, Puma waits 5 additional seconds as a grace period and outright kills any remaining threads that are still running.

Now the call stack unwinds: Server#graceful_shutdown returns, Server#stop finally joins the server thread, Single#stop_blocked returns, and the signal handler finishes, optionally raising SignalException ("SIGTERM") - thus marking the completion of the shutdown process.

However, if Puma is not configured to raise an exception from the SIGTERM signal trap, the process will not terminate. Launcher implements the following logic to handle this:

# lib/puma/launcher.rb

module Puma
  class Launcher
    def run
      # ...
      @runner.run

      do_run_finished(previous_env)
    end

    # ...

    def do_run_finished(previous_env)
      case @status
      when :halt
        do_forceful_stop
      when :run, :stop
        do_graceful_stop
      when :restart
        do_restart(previous_env)
      end

      close_binder_listeners unless @status == :restart
    end

    # ...

    def close_binder_listeners
      # ...
      @binder.close_listeners
      unless @status == :restart
        log "=== puma shutdown: #{Time.now} ==="
        log "- Goodbye!"
      end
    end

    # ...
  end
end

# lib/puma/binder.rb

module Puma
  class Binder
    # ...

    def close_listeners
      @listeners.each do |l, io|
        begin
          io.close unless io.closed?
          uri = URI.parse l
          next unless uri.scheme == 'unix'
          unix_path = "#{uri.host}#{uri.path}"
          File.unlink unix_path if @unix_paths.include?(unix_path) && File.exist?(unix_path)
        rescue Errno::EBADF
        end
      end
    end

    # ...
  end
end

Once @runner.run returns, which in the case of a Single runner will happen once the server thread is joined, Launcher#run will call do_run_finished and do_graceful_stop again. These methods will simply exit early since the server thread will already be terminated. This has the side effect of triggering the on_stopped callbacks twice. It will also close any leftover binder listeners, which is mostly relevant for UNIX sockets that also require unbinding and not just closing.

Apart from SIGTERM, there’s another way to trigger the graceful shutdown of Puma, which is usually used in local environments: via SIGINT. Its only difference is that the signal handler will not wait for the process to shut down fully, as is the case with SIGTERM. It will simply set the state to :stop via Runner#stop, which at some point will cause the runner to stop. At that point, the launcher will call do_run_finished.

Let’s see how Puma’s main process handles SIGTERM and SIGINT in clustered mode:

module Puma
  class Cluster < Runner
    # ...
    def setup_signals
      # ...

      Signal.trap "SIGTERM" do
        if Process.pid != master_pid
          log "Early termination of worker"
          exit! 0
        else
          @launcher.close_binder_listeners

          stop_workers
          stop
          @events.fire_on_stopped!
          raise(SignalException, "SIGTERM") if @options[:raise_exception_on_sigterm]
          exit 0
        end
      end

      # ...
    end

    # ...

    def run
      # ...

      Signal.trap "SIGINT" do
        stop
      end

      # ...

      while @status == :run
        # ...
      end

      stop_workers unless @status == :halt

      # ...
    end

    # ...

    def stop
      @status = :stop
      wakeup!
    end

    # ...
  end
end

When SIGINT is received, the handler simply sets the state of the runner to :stop without doing anything else. Once the main loop of the master process detects that the state has changed, it calls stop_workers, which sends a SIGTERM to every child process and waits indefinitely until all of them exit.

Worker processes execute the same shutdown logic as Puma in single mode by calling Server#stop in a non-blocking manner, with the only difference being that workers will eventually get killed if they can’t terminate in time.

module Puma
  class Cluster < Puma::Runner
    class Worker < Puma::Runner
      # ...
      def run
        # ...

        Signal.trap "SIGTERM" do
          @worker_write << "#{Puma::Const::PipeRequest::EXTERNAL_TERM}#{Process.pid}\n" rescue nil
          restart_server.clear
          server.stop
          restart_server << false
        end

        # ...
      end
      # ...
    end
  end
end

The key difference between SIGINT and SIGTERM in clustered mode is that when the latter signal is sent, the listening sockets will be closed much earlier. The children will still be able to accept incoming connections for as long as their copies of these sockets remain open, but once all children have closed them, no new connections will be accepted. With a shutdown initiated by SIGINT, there may be a small timeframe during which clients can still connect, even though they’ll never get served.

Before we proceed, it’s worth mentioning that it’s also possible to halt a server. This uses the same mechanisms as the ordinary shutdown, with the exception that in-flight connections will be abruptly terminated.

Hot Restarts

Contrary to phased restarts, hot restarts are available in both modes.

The documentation claims that such restarts result in brief latency increases, but not in full unavailability, since clients will still be able to connect during the reboot.

While still better than shutting the process down completely and restarting it from scratch, hot restarts are probably the least scalable deployment strategy available. It’s recommended to either use phased restarts or let an orchestrator like Kubernetes implement the deployment strategy.

Such a restart can be initiated by sending the SIGUSR2 signal to the only process in single mode or to the main process of the cluster:

# lib/puma/launcher.rb

module Puma
  class Launcher
    # ...

    def setup_signals
      # ...
      begin
        Signal.trap "SIGUSR2" do
          restart
        end
      rescue Exception
        log "*** SIGUSR2 not implemented, signal based restart unavailable!"
      end
      # ...
    end

    # ...

    def restart
      @status = :restart
      @runner.restart
    end

    # ...
  end
end

# lib/puma/single.rb

module Puma
  class Single < Runner
    # ...

    def restart
      @server&.begin_restart
    end

    # ...
  end
end

# lib/puma/server.rb

module Puma
  class Server
    # ...

    RESTART_COMMAND = "R"

    # ...

    def begin_restart(sync=false)
      notify_safely(RESTART_COMMAND)
      @thread.join if @thread && sync
    end

    # ...
  end
end

# lib/puma/cluster.rb
module Puma
  class Cluster < Runner
    # ...
    def restart
      @restart = true
      stop
    end
    # ...
  end
end

Both restart behaviours are almost identical: the Single runner immediately sends the restart command to the server, while the Cluster runner stops all existing workers.

In both cases, the Launcher will eventually reach do_run_finished since its @status will be set to :restart:

module Puma
  class Launcher
    # ...

    def do_run_finished(previous_env)
      case @status
      when :halt
        do_forceful_stop
      when :run, :stop
        do_graceful_stop
      when :restart
        do_restart(previous_env)
      end

      close_binder_listeners unless @status == :restart
    end

    # ...

    def do_restart(previous_env)
      log "* Restarting..."
      ENV.replace(previous_env)
      @runner.stop_control
      restart!
    end

    def restart!
      @events.fire_on_restart!
      @config.run_hooks :on_restart, self, @log_writer

      if Puma.jruby?
        # ...
      elsif Puma.windows?
        # ...
      else
        argv = restart_args
        Dir.chdir(@restart_dir)
        ENV.update(@binder.redirects_for_restart_env)
        argv += [@binder.redirects_for_restart]
        Kernel.exec(*argv)
      end
    end

    # ...
  end
end

There, it sets everything up so that code and dependency reloading is possible and calls Kernel.exec with the arguments previously built in Launcher#generate_restart_data. Under the hood, the execve system call completely replaces a running process with another one. This way, a Puma server process is replaced by another server process running the newest version of the code and dependencies. Once Kernel.exec is invoked, the new process starts from scratch.
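
One notable property of execve is that the process is replaced in place: its PID stays the same, which is one reason signal-based restarts play well with process supervisors. A quick illustration:

puts "before exec: pid=#{Process.pid}"
exec("ruby", "-e", "puts 'after exec: pid=' + Process.pid.to_s")
# Never reached: the process image has been replaced by this point.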

How do hot restarts allegedly not result in downtime? First of all, it depends on the definition of downtime - for applications under high load, a complete absence of request handling, even for a brief period, may still be unacceptable. By “no downtime,” Puma means that connections opened right before exec, but after the Server instances are shut down, do not get closed.

The answer is that Puma does not close the listening sockets - as seen in do_run_finished above, close_binder_listeners is skipped when the status is :restart. Since the sockets are never closed, they survive exec and become available in the new process.

However, you may recall that Ruby creates all of its sockets with the CLOEXEC flag set, which makes them close automatically once an exec-like system call is made. Puma never disables this flag explicitly, so how does this work then?
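
This default is easy to verify (a quick check, not Puma code):

require "socket"

server = TCPServer.new("127.0.0.1", 0)
puts server.close_on_exec?  # => true: by default the descriptor would not survive exec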

Before the restart, Puma updates the global environment hash with variables generated via Binder#redirects_for_restart_env and passes additional arguments using Binder#redirects_for_restart:

module Puma
  class Binder
    # ...
    def redirects_for_restart
      redirects = @listeners.map { |a| [a[1].to_i, a[1].to_i] }.to_h
      redirects[:close_others] = true
      redirects
    end

    def redirects_for_restart_env
      @listeners.each_with_object({}).with_index do |(listen, memo), i|
        memo["PUMA_INHERIT_#{i}"] = "#{listen[1].to_i}:#{listen[0]}"
      end
    end
    # ...
  end
end

This is Puma making use of Ruby's obscure file descriptor redirection feature, sparsely documented in the Process module.

Once the new process boots, Binder builds a hash of inherited file descriptor numbers from the PUMA_INHERIT_ environment variables. While these allow Puma to detect inherited sockets, as we'll see shortly, the actual secret sauce for bypassing CLOEXEC is the set of exec arguments built in redirects_for_restart.

When close_others is set to true, Kernel.exec automatically closes all file descriptors, even those with CLOEXEC explicitly disabled, except for the three standard ones: STDIN, STDOUT, and STDERR. ‘Others’ in this argument's name implies that some file descriptors should not be closed - apart from the standard streams, exec can be made to retain any descriptor whose number is passed in a redirection argument. That's why Puma iterates over the @listeners array right before restarting: every existing listener is redirected to itself in the arguments to exec, which both exempts it from close_others and clears its CLOEXEC flag.
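
Here's a self-contained sketch of this mechanism (not Puma's code - the replacement.rb file name and the bind address are made up for illustration):

require "socket"

server = TCPServer.new("127.0.0.1", 9292)
fd = server.fileno

# Tell the next process which descriptor maps to which bind, Puma-style.
ENV["PUMA_INHERIT_0"] = "#{fd}:tcp://127.0.0.1:9292"

# Redirecting a descriptor to itself clears its CLOEXEC flag, while
# close_others: true closes every other non-standard descriptor.
exec("ruby", "replacement.rb", fd => fd, close_others: true)

Inside replacement.rb, TCPServer.for_fd would recover the same listening socket, already bound and accepting - which is exactly what the Puma code below does.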

Here’s how the restarted process consumes those:

module Puma
  class Binder
    # ...

    def create_inherited_fds(env_hash)
      env_hash.select {|k,v| k =~ /PUMA_INHERIT_\d+/}.each do |_k, v|
        fd, url = v.split(":", 2)
        @inherited_fds[url] = fd.to_i
      end.keys
    end

    # ...

    def parse(binds, log_writer = nil, log_msg = 'Listening')
      log_writer ||= @log_writer
      binds.each do |str|
        uri = URI.parse str
        case uri.scheme
        when "tcp"
          if fd = @inherited_fds.delete(str)
            io = inherit_tcp_listener uri.host, uri.port, fd
            log_writer.log "* Inherited #{str}"
          elsif sock = @activated_sockets.delete([ :tcp, uri.host, uri.port ])
            # ...
          else
            # ...
          end
        end
        # ...
      end
      # ...
    end

    # ...

    def inherit_tcp_listener(host, port, fd)
      s = fd.kind_of?(::TCPServer) ? fd : ::TCPServer.for_fd(fd)

      @ios << s
      s
    end

    # ...
  end
end

It extracts the listener data from the environment variables to discover which file descriptors were inherited from the previous process. Once parse is called, such binds skip the default branch with add_tcp_listener and call inherit_tcp_listener instead. This avoids redundant bind attempts, which would fail with Errno::EADDRINUSE since the inherited listening socket is already bound to the same address.
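
A small sketch of why this distinction matters (assuming nothing else is listening on the chosen port):

require "socket"

server = TCPServer.new("127.0.0.1", 9292)

begin
  TCPServer.new("127.0.0.1", 9292)  # a second bind to the same address
rescue Errno::EADDRINUSE => e
  puts "re-binding fails: #{e.message}"
end

# Wrapping the existing descriptor performs no bind or listen at all:
inherited = TCPServer.for_fd(server.fileno)
inherited.autoclose = false  # both objects now share a single descriptor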

The above does not happen if Puma is instructed to bind to localhost:

module Puma
  class Binder
    # ...

    def add_tcp_listener(host, port, optimize_for_latency=true, backlog=1024)
      if host == "localhost"
        loopback_addresses.each do |addr|
          add_tcp_listener addr, port, optimize_for_latency, backlog
        end
        return
      end

      # ...
    end

    # ...

    def parse(binds, log_writer = nil, log_msg = 'Listening')
      # ...
      binds.each do |str|
        uri = URI.parse str
        case uri.scheme
        when "tcp"
          # ...
          if fd = @inherited_fds.delete(str)
            # ...
          elsif sock = @activated_sockets.delete([ :tcp, uri.host, uri.port ])
            # ...
          else
            # ...
            io = add_tcp_listener uri.host, uri.port, low_latency, backlog
            # ...
          end

          @listeners << [str, io] if io
        # ...
        end
      # ...
      end
    end

  end
end

A localhost bind makes add_tcp_listener call itself recursively and return nil, so the opened sockets are never added to the @listeners array. Because of this, Puma does not propagate them during a restart, and close_others closes them automatically. This also illustrates why propagating sockets via environment variables is needed in the first place: if CLOEXEC had been explicitly disabled on these sockets after they were opened, and close_others had been set to false or omitted, the restarted Puma process would always crash, since it would try to bind the localhost listeners without knowing it had inherited sockets already bound to those addresses. The fact that localhost binds do not play well with hot restarts seems to be an oversight, and one that looks easily fixable.

Conclusion

Having gone through almost every aspect of Puma, it is safe to say that it is a solid default choice for most Ruby applications.

Puma strikes a balance between heavier servers like Unicorn and Pitchfork, which employ the process-per-request model, and lighter options such as Falcon9, which uses fibers to serve requests. Although capacity settings should be tuned for each individual application, Puma ships with sensible defaults, running as many clustered workers as there are CPU cores available to make the best use of the underlying hardware, while cautiously capping thread pools at five threads due to GVL limitations.

Even a single instance can provide decent availability when running in cluster mode with phased restarts, with client experience being smooth due to Puma’s graceful shutdown handling.

While a Puma cluster may suffer from high tail latency when keep-alive connections are enabled, I am confident the ongoing effort to fix this will land in a future release.

Although it’s almost always a good idea to run Puma behind a reverse proxy or a load balancer when possible and applicable, Puma can still stand on its own when required, thanks to the out-of-the-box support for SSL/TLS, read buffering, and graceful restarts.

  1. According to the BestGems.org index in the Web Servers category

  2. This can easily happen in a typical Rails application if Puma is launched using its own CLI and the third-party plugin is defined in a separate gem, since Puma will load the application and require bundled gems only after having evaluated the configuration file. 

  3. The @read_buffer string is used later as the second argument passed to read_nonblock. It’s not clear why it’s needed here in the first place, since if read_nonblock had been called without it, the returned string’s encoding would be ASCII-8BIT by default, at least on MRI v3.4, making it unnecessary to provide a buffer. The @read_buffer does not seem to be used anywhere else, which makes it look like a vestige. 

  4. It’s safe to say that Puma will almost always be behind a load balancer or a reverse proxy unless the deployment strategy and infrastructure are rudimentary or old-fashioned. 

  5. Tangential question: Why does Puma use select in the server accept loop when it’s so limited? Most likely, this is for cross-platform compatibility. Other multiplexors are usually OS-specific and not portable. Using select there is fine since it’s unlikely to have more than one bind address, let alone a thousand. 

  6. The impact of deleting items from an array, especially when there are many clients in the reactor, would be interesting to measure.

  7. Unless shared memory is involved. 

  8. You may realise that a handler for SIGTERM was already registered in the Launcher. When a handler is registered using Ruby's Signal.trap, the new block overrides the previous one; unless the previous handler, returned by Signal.trap, is explicitly invoked, it is dropped. This means the launcher's handler is never run after the new trap is established.

  9. Falcon is a very interesting alternative due to its concurrency model, which builds on a relatively recent addition to Ruby. With the introduction of non-blocking fibers in Ruby 3.0, most of Ruby's internal blocking IO calls integrate with an optionally set fiber scheduler. Thanks to this, Falcon can sustain a dramatically higher number of concurrent connections, but the underlying application would have to be heavily bound by downstream network calls or other blocking IO for latencies to remain tolerable.