Redis latency monitoring framework¶
Redis is often used in the context of demanding use cases, where it serves a big amount of queries per second per instance, and at the same time, there are very strict latency requirements both for the average response time and for the worst case latency.
While Redis is an in memory system, it deals with the operating system in different ways, for example, in the context of persisting to disk. Moreover Redis implements a rich set of commands. Certain commands are fast and run in constant or logarithmic time, other commands are slower O(N) commands, that can cause latency spikes.
Finally Redis is single threaded: this is usually an advantage from the point of view of the amount of work it can perform per core, and in the latency figures it is able to provide, but at the same time it poses a challenge from the point of view of latency, since the single thread must be able to perform certain tasks incrementally, like for example keys expiration, in a way that does not impact the other clients that are served.
For all this reasons, Redis 2.8.13 introduced a new feature called Latency Monitoring, that helps the user to check and troubleshoot possible latency problems. Latency monitoring is composed of the following conceptual parts:
- Latency hooks that sample different latency sensitive code paths.
- Time series recording of latency spikes split by different event.
- Reporting engine to fetch raw data from the time series.
- Analysis engine to provide human readable reports and hints according to the measurements.
The remaining part of this document covers the latency monitoring subsystem details, however for more information about the general topic of Redis and latency, please read the Redis latency problems troubleshooting page in this documentation.
Events and time series¶
Different monitored code paths have different names, and are called
events. For example command
is an event measuring latency spikes
of possibly slow commands executions, while fast-command
is the
event name for the monitoring of the O(1) and O(log N) commands. Other
events are less generic, and monitor a very specific operation performed
by Redis. For example the fork
event only monitors the time taken by
Redis to execute the fork(2)
system call.
A latency spike is an event that runs in more time than the configured latency threshold. There is a separated time series associated with every monitored event. This is how the time series work:
- Every time a latency spike happens, it is logged in the appropriate time series.
- Every time series is composed of 160 elements.
- Each element is a pair: an unix timestamp of the time the latency spike was measured, and the number of milliseconds the event took to executed.
- Latency spikes for the same event happening in the same second are merged (by taking the maximum latency), so even if continuous latency spikes are measured for a given event, for example because the user set a very low threshold, at least 180 seconds of history are available.
- For every element the all-time maximum latency is recorded.
How to enable latency monitoring¶
What is high latency for an use case, is not high latency for another. There are applications where all the queries must be served in less than 1 millisecond and applications where from time to time a small percentage of clients experiencing a 2 seconds latency is acceptable.
So the first step to enable the latency monitor is to set a latency threshold in milliseconds. Only events that will take more than the specified threshold will be logged as latency spikes. The user should set the threshold according to its needs. For example if for the requirements of the application based on Redis the maximum acceptable latency is 100 milliseconds, the threshold should be set to such a value in order to log all the events blocking the server for a time equal or greater to 100 milliseconds.
The latency monitor can easily be enabled at runtime in a production server with the following command:
CONFIG SET latency-monitor-threshold 100
By default monitoring is disabled (threshold set to 0), even if the actual cost of latency monitoring is near zero. However while the memory requirements of latency monitoring are very small, there is no good reason to raise the baseline memory usage of a Redis instance that is working well.
Information reporting with the LATENCY command¶
The user interface to the latency monitoring subsystem is the
LATENCY
command. Like many other Redis commands, LATENCY
accept
subcommands that modify the behavior of the command. The next sections
document each subcommand.
LATENCY LATEST¶
The LATENCY LATEST
command reports the latest latency events logged.
Each event has the following fields:
- Event name.
- Unix timestamp of the latest latency spike for the event.
- Latest event latency in millisecond.
- All time maximum latency for this event.
All time does not really mean the maximum latency since the Redis
instance was started, because it is possible to reset events data using
LATENCY RESET
as we’ll see later.
The following is an example output:
127.0.0.1:6379> debug sleep 1
OK
(1.00s)
127.0.0.1:6379> debug sleep .25
OK
127.0.0.1:6379> latency latest
1) 1) "command"
2) (integer) 1405067976
3) (integer) 251
4) (integer) 1001
LATENCY HISTORY event-name
¶
The LATENCY HISTORY
command is useful in order to fetch raw data
from the event time series, as timestamp-latency pairs. The command will
return up to 160 elements for a given event. An application may want to
fetch raw data in order to perform monitoring, display graphs, and so
forth.
Example output:
127.0.0.1:6379> latency history command
1) 1) (integer) 1405067822
2) (integer) 251
2) 1) (integer) 1405067941
2) (integer) 1001
LATENCY RESET [event-name
... event-name
]¶
The LATENCY RESET
command, if called without arguments, resets all
the events, discarding the currently logged latency spike events, and
resetting the maximum event time register.
It is possible to reset only specific events by providing the event names as arguments. The command returns the number of events time series that were reset during the command execution.
LATENCY GRAPH event-name
¶
Produces an ASCII-art style graph for the specified event:
127.0.0.1:6379> latency reset command
(integer) 0
127.0.0.1:6379> debug sleep .1
OK
127.0.0.1:6379> debug sleep .2
OK
127.0.0.1:6379> debug sleep .3
OK
127.0.0.1:6379> debug sleep .5
OK
127.0.0.1:6379> debug sleep .4
OK
127.0.0.1:6379> latency graph command
command - high 500 ms, low 101 ms (all time high 500 ms)
--------------------------------------------------------------------------------
#_
_||
_|||
_||||
11186
542ss
sss
The vertical labels under each graph column represent the amount of seconds, minutes, hours or days ago the event happened. For example “15s” means that the first graphed event happened 15 seconds ago.
The graph is normalized in the min-max scale so that the zero (the underscore in the lower row) is the minumum, and a # in the higher row is the maximum.
The graph subcommand is useful in order to get a quick idea about the
trend of a given latency event without using additional tooling, and
without the need to interpret raw data as provided by
LATENCY HISTORY
.
LATENCY DOCTOR¶
The LATENCY DOCTOR
command is the most powerful analysis tool in the
latency monitoring, and is able to provide additional statistical data
like the average period between latency spikes, the median deviation,
and an human readable analysis of the event. For certain events, like
fork
, additional informations are provided, like the rate at which
the system forks processes.
This is the output you should post in the Redis mailing list if you are looking for help about Latency related issues.
Example output:
127.0.0.1:6379> latency doctor
Dave, I have observed latency spikes in this Redis instance.
You don't mind talking about it, do you Dave?
1. command: 5 latency spikes (average 300ms, mean deviation 120ms,
period 73.40 sec). Worst all time event 500ms.
I have a few advices for you:
- Your current Slow Log configuration only logs events that are
slower than your configured latency monitor threshold. Please
use 'CONFIG SET slowlog-log-slower-than 1000'.
- Check your Slow Log to understand what are the commands you are
running which are too slow to execute. Please check
http://redis.io/commands/slowlog for more information.
- Deleting, expiring or evicting (because of maxmemory policy)
large objects is a blocking operation. If you have very large
objects that are often deleted, expired, or evicted, try to
fragment those objects into multiple smaller objects.
The doctor has erratic psychological behaviors, so we recommend interacting with it carefully.