lnmp访问速度变慢排查方法

最近线上服务器遇到一个问题是网站突然访问变慢,使用17ce的GET测试都能让把网站给搞卡死,只能等待响应超时。抽空整理一下整个排查过程。

定位问题点

修改log_format分析是nginx出问题不是上游服务器出问题

首先我们需要判断的是到底在在nginx出问题了还是后面干活的php-fpm出问题了,所以我们修改nginx的日志格式为如下所示:

log_format access '$remote_addr - $remote_user [$time_local] "$request" '
'$request_time $upstream_response_time '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" $http_x_forwarded_for';

其中第二行为新增配置,下面简单概括一下大致意思
$request_time

request processing time in seconds with a milliseconds resolution (1.3.9, 1.2.6); time elapsed since the first bytes were read from the client
从接收到第一个字节到发送回最后一个字节所经历的时间

$upstream_response_time

keeps time spent on receiving the response from the upstream server; the time is kept in seconds with millisecond resolution. Times of several responses are separated by commas and colons like addresses in the $upstream_addr variable.
接收上游服务器处理请求所消耗的时间(在lnmp架构里面一般就是php-fpm所处理的时间了)

查看访问日志

500783 119.167.246.14 - - [15/Feb/2016:16:06:31 +0800] "GET / HTTP/1.1" 9.492 9.451 200 10777 "-" "Baidu-YunGuanCe-SLABot(ce.baidu.com)" -
500784 183.61.236.15 - - [15/Feb/2016:16:06:34 +0800] "GET / HTTP/1.1" 12.541 12.470 200 11081 "-" "Baidu-YunGuanCe-SLABot(ce.baidu.com)" -
500785 183.60.235.19 - - [15/Feb/2016:16:06:34 +0800] "GET / HTTP/1.1" 12.887 12.834 200 10402 "-" "Baidu-YunGuanCe-SLABot(ce.baidu.com)" -
500786 51.255.65.82 - - [15/Feb/2016:16:06:35 +0800] "GET /photo/7938738 HTTP/1.1" 8.369 8.369 200 6837 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" -
500787 117.34.28.15 - - [15/Feb/2016:16:06:35 +0800] "GET / HTTP/1.1" 13.759 13.729 200 10499 "-" "Baidu-YunGuanCe-SLABot(ce.baidu.com)" -
500788 66.249.79.52 - - [15/Feb/2016:16:06:35 +0800] "GET /photo/7391384 HTTP/1.1" 8.119 8.119 200 6768 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" -
500789 117.27.149.13 - - [15/Feb/2016:16:06:35 +0800] "GET / HTTP/1.1" 13.700 13.660 200 10504 "-" "Baidu-YunGuanCe-SLABot(ce.baidu.com)" -

注意GET / HTTP/1.1”字符后面的两个数据,很明显可以观察出来是上游的服务器出现了问题(即php-fpm)

继续追踪上游服务器(php-fpm)

查看php-fpm的日志

[15-Feb-2016 00:22:17] WARNING: [pool www] server reached pm.max_children setting (20), consider raising it
[15-Feb-2016 02:06:25] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 9 total children
[15-Feb-2016 02:06:26] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 0 idle, and 10 total children
[15-Feb-2016 02:06:27] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 11 total children
[15-Feb-2016 02:06:28] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 12 total children
[15-Feb-2016 02:06:29] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 13 total children
[15-Feb-2016 02:06:30] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 14 total children
[15-Feb-2016 02:06:31] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 15 total children
[15-Feb-2016 02:06:32] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 32 children, there are 0 idle, and 16 total children

日志很明显告诉我们提高几个参数的值,去对应配置里面修改即可,修改完后发现网站还是时不时卡住无法正常加载内容页。继续往下走。。。

开启php-fpm自带的慢执行日志

request_slowlog_timeout = 5 # 处理时间超过5秒的即记录下来
slowlog = var/log/slow.log # 日志保存路径,注意这里是相对路径,为了调试方便用绝对路径会更方便点

查看访问日志

php-fpm.log如下所示,可以看到已经正常记录响应慢的请求了

[15-Feb-2016 11:16:26] WARNING: [pool www] child 24407, script '/home/wwwroot/laravel/public/index.php' (request: "GET /index.php") executing too slow (5.608171 sec), logging
[15-Feb-2016 11:16:26] WARNING: [pool www] child 23892, script '/home/wwwroot/laravel/public/index.php' (request: "GET /index.php") executing too slow (6.414836 sec), logging
[15-Feb-2016 11:16:26] WARNING: [pool www] child 23891, script '/home/wwwroot/laravel/public/index.php' (request: "GET /index.php") executing too slow (6.455498 sec), logging
[15-Feb-2016 11:16:26] WARNING: [pool www] child 23890, script '/home/wwwroot/laravel/public/index.php' (request: "GET /index.php") executing too slow (6.401039 sec), logging
[15-Feb-2016 11:16:26] WARNING: [pool www] child 23889, script '/home/wwwroot/laravel/public/index.php' (request: "GET /index.php") executing too slow (6.069696 sec), logging
[15-Feb-2016 11:16:26] WARNING: [pool www] child 23888, script '/home/wwwroot/laravel/public/index.php' (request: "GET /index.php") executing too slow (6.398848 sec), logging

继续观察slow.log的情况

[15-Feb-2016 21:30:15] [pool www] pid 25686
script_filename = /home/wwwroot/laravel/public/index.php
[0x00007f4f56e961b0] execute() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Database/Connection.php:321
[0x00007f4f56e95e28] Illuminate\Database\{closure}() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Database/Connection.php:644
[0x00007f4f56e95bf0] runQueryCallback() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Database/Connection.php:611
[0x00007f4f56e958f8] run() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Database/Connection.php:324
[0x00007f4f56e957b0] select() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Database/Query/Builder.php:1406
[0x00007f4f56e95598] runSelect() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Database/Query/Builder.php:1383
[0x00007f4f56e95390] get() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Database/Eloquent/Builder.php:409
[0x00007f4f56e95118] getModels() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Database/Eloquent/Builder.php:168
[0x00007f4f56e94ef8] get() /home/wwwroot/laravel/app/Http/Controllers/DoutuBase.php:74
[0x00007f4f56e94cb8] getRandomPhoto() /home/wwwroot/laravel/app/Http/Controllers/DoutuBase.php:32
[0x00007f4f56e945e8] __construct() /home/wwwroot/laravel/app/Http/Controllers/ImageController.php:28
[0x00007fff8f3f0ca0] __construct() unknown:0
[0x00007f4f56e943c8] newInstanceArgs() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Container/Container.php:776
[0x00007f4f56e93d88] build() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Container/Container.php:627
[0x00007f4f56e93808] make() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Foundation/Application.php:674
[0x00007f4f56e93620] make() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Routing/ControllerDispatcher.php:85
[0x00007f4f56e93468] makeController() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Routing/ControllerDispatcher.php:57
[0x00007f4f56e931c8] dispatch() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Routing/Route.php:203
[0x00007f4f56e92f30] runWithCustomDispatcher() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Routing/Route.php:134
[0x00007f4f56e92bb0] run() /home/wwwroot/laravel/vendor/laravel/framework/src/Illuminate/Routing/Router.php:712

一看文件就知道是哪里出问题了mysql,一定是有某条SQL执行太慢引起连锁反应了!

开启MySQL的慢查询日志

添加配置文件

修改/etc/my.cnf文件,添加如下配置

long_query_time = 2
slow_query_log = 1

查看慢查询日志

mysql默认会生成一个文件来保存日志信息,可以通过如下命令查询到具体位置

mysql> show variables like '%slow%';
+---------------------+--------------------------------------------+
| Variable_name | Value |
+---------------------+--------------------------------------------+
| log_slow_queries | ON |
| slow_launch_time | 2 |
| slow_query_log | ON |
| slow_query_log_file | /usr/local/mysql/var/iZ23r8gu7whZ-slow.log |
+---------------------+--------------------------------------------+
4 rows in set (0.00 sec)

打开/usr/local/mysql/var/iZ23r8gu7whZ-slow.log文件,查看一下日志如下所示

# User@Host: server[server] @ localhost []
# Query_time: 4.025911 Lock_time: 0.000016 Rows_sent: 32 Rows_examined: 65284
SET timestamp=1455591112;
select * from `photo` where `status` = '1' and `desc` is not null order by rand() limit 32;

一次查询消耗了4秒多,并发稍微一上来,php-fpm进程就全部被占用光了,全部都在等待状态,不卡死就怪了。

发现问题及解决

检查SQL是如何执行查询操作的

select * from `photo` where `status` = '1' and `desc` is not null order by rand() limit 32;

为什么上面的这条SQL会这么消耗时间呢?我们explain一下

mysql> explain select * from `photo` where `status` = '1' and `desc` is not null order by rand() limit 32;
+----+-------------+-------+------+---------------+------+---------+------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+-------+----------------------------------------------+
| 1 | SIMPLE | photo | ALL | NULL | NULL | NULL | NULL | 47606 | Using where; Using temporary; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+-------+----------------------------------------------+
1 row in set (0.00 sec)

可以看到photo表内总共47606条记录,使用 order by rand() 使用到了临时表,随着表记录的逐渐增加性能会迅速的下降,导致查询变慢。

解决方案

采取了一个非常简单的思路从程序上来实现从4万多条记录里随机获取32条。

  1. 程序随机获取1-40000中的200个随机数
  2. 使用一条in查询迅速从表内获取到记录结果
  3. 从200条记录里面重新筛选出符合要求的随机记录
    当然这个也有可能200条记录里面筛选不出符合条件的记录,对应增加到400条即可,问题不是很大。

关于order by rand() 的优化及benchmark可以看参考一下这篇文章
Selecting Random Rows Using MySQL + Benchmark Results