
Detailed Analysis of a Redis Cluster Failure

2020-01-23 10:10

3. Increase cluster-node-timeout; it must not be less than 15 s.

1.1.6. Resharding (Slot Reassignment)

Resharding is the process by which some of a Redis Cluster's slots change from being served by one master to being served by another master, that is, a reassignment of slots.

 

For ease of description, first create an empty master node, 7009, and then move all 5461 slots on 7000 to node 7009.

./redis-trib.rb add-node 192.168.197.101:7009 192.168.197.101:7000

 

The nodes currently look like this:

./redis-cli -c -h 192.168.197.101 -p 7000 cluster nodes

37ccec5145b4e071687e671bda36789e124fc9ed 192.168.197.101:7001 master - 0 1500106989599 2 connected 5461-10922

78ae31a28bcd62b87f93c932552b5f6c1fe3329c 192.168.197.101:7006 slave 4314bb678cda2ba1550e3ec1081db5d5fae74c87 0 1500106990102 10 connected

c48ead74999cf71f3f7446f6ae9771423de65890 192.168.197.101:7004 slave 37ccec5145b4e071687e671bda36789e124fc9ed 0 1500106991610 5 connected

5d0632d76008ea3010878317d804b3c0ae50a13f 192.168.197.101:7009 master - 0 1500106991914 9 connected

b8be626d33d07cb10094ab9f1345d6436d18d27f 192.168.197.101:7002 master - 0 1500106990908 3 connected 10923-16383

38f95bb38e691efdb45f926eb9157cdba7111d92 192.168.197.101:7005 slave b8be626d33d07cb10094ab9f1345d6436d18d27f 0 1500106992014 6 connected

4314bb678cda2ba1550e3ec1081db5d5fae74c87 192.168.197.101:7000 myself,master - 0 0 10 connected 0-5460

f53441ccbe2c3bec2fb03f8180f723c7c5b735c7 192.168.197.101:7007 slave 4314bb678cda2ba1550e3ec1081db5d5fae74c87 0 1500106990605 10 connected

[d@192.168.197.101:/opt/redis_cluster/7009]$./redis-cli -c -h 192.168.197.101 -p 7000

192.168.197.101:7000> keys *

1) "host"

192.168.197.101:7000> get host

"redis.coe2coe.me"

 

Now the actual resharding begins.

The following command migrates the 5461 slots served by node 7000 (NODEID: 4314bb678cda2ba1550e3ec1081db5d5fae74c87) to node 7009 (NODEID: 5d0632d76008ea3010878317d804b3c0ae50a13f).

./redis-trib.rb reshard --from 4314bb678cda2ba1550e3ec1081db5d5fae74c87 --to  5d0632d76008ea3010878317d804b3c0ae50a13f --slots 5461 --yes 192.168.197.101:7000

 

 

The output is as follows:

>>> Performing Cluster Check (using node 192.168.197.101:7000)

M: 4314bb678cda2ba1550e3ec1081db5d5fae74c87 192.168.197.101:7000

   slots:0-5460 (5461 slots) master

   2 additional replica(s)

M: 37ccec5145b4e071687e671bda36789e124fc9ed 192.168.197.101:7001

   slots:5461-10922 (5462 slots) master

   1 additional replica(s)

S: 78ae31a28bcd62b87f93c932552b5f6c1fe3329c 192.168.197.101:7006

   slots: (0 slots) slave

   replicates 4314bb678cda2ba1550e3ec1081db5d5fae74c87

S: c48ead74999cf71f3f7446f6ae9771423de65890 192.168.197.101:7004

   slots: (0 slots) slave

   replicates 37ccec5145b4e071687e671bda36789e124fc9ed

M: 5d0632d76008ea3010878317d804b3c0ae50a13f 192.168.197.101:7009

   slots: (0 slots) master

   0 additional replica(s)

M: b8be626d33d07cb10094ab9f1345d6436d18d27f 192.168.197.101:7002

   slots:10923-16383 (5461 slots) master

   1 additional replica(s)

S: 38f95bb38e691efdb45f926eb9157cdba7111d92 192.168.197.101:7005

   slots: (0 slots) slave

   replicates b8be626d33d07cb10094ab9f1345d6436d18d27f

S: f53441ccbe2c3bec2fb03f8180f723c7c5b735c7 192.168.197.101:7007

   slots: (0 slots) slave

   replicates 4314bb678cda2ba1550e3ec1081db5d5fae74c87

[OK] All nodes agree about slots configuration.

>>> Check for open slots...

>>> Check slots coverage...

[OK] All 16384 slots covered.

 

Ready to move 5461 slots.

  Source nodes:

    M: 4314bb678cda2ba1550e3ec1081db5d5fae74c87 192.168.197.101:7000

   slots:0-5460 (5461 slots) master

   2 additional replica(s)

  Destination node:

    M: 5d0632d76008ea3010878317d804b3c0ae50a13f 192.168.197.101:7009

   slots: (0 slots) master

   0 additional replica(s)

  Resharding plan:

    Moving slot 0 from 4314bb678cda2ba1550e3ec1081db5d5fae74c87

    Moving slot 1 from 4314bb678cda2ba1550e3ec1081db5d5fae74c87

    Moving slot 2 from 4314bb678cda2ba1550e3ec1081db5d5fae74c87

    Moving slot 3 from 4314bb678cda2ba1550e3ec1081db5d5fae74c87

    Moving slot 4 from 4314bb678cda2ba1550e3ec1081db5d5fae74c87

    Moving slot 5 from 4314bb678cda2ba1550e3ec1081db5d5fae74c87

// Several lines are omitted here to save space.

Moving slot 5457 from 192.168.197.101:7000 to 192.168.197.101:7009:

Moving slot 5458 from 192.168.197.101:7000 to 192.168.197.101:7009:

Moving slot 5459 from 192.168.197.101:7000 to 192.168.197.101:7009:

Moving slot 5460 from 192.168.197.101:7000 to 192.168.197.101:7009:

 

At this point, all 5461 slots of 7000 are served by the new master 7009. The result of the resharding can be verified with the following command:

./redis-cli -c -h 192.168.197.101 -p 7000 cluster nodes

37ccec5145b4e071687e671bda36789e124fc9ed 192.168.197.101:7001 master - 0 1500107530823 2 connected 5461-10922

78ae31a28bcd62b87f93c932552b5f6c1fe3329c 192.168.197.101:7006 slave 5d0632d76008ea3010878317d804b3c0ae50a13f 0 1500107529816 11 connected

c48ead74999cf71f3f7446f6ae9771423de65890 192.168.197.101:7004 slave 37ccec5145b4e071687e671bda36789e124fc9ed 0 1500107529816 5 connected

5d0632d76008ea3010878317d804b3c0ae50a13f 192.168.197.101:7009 master - 0 1500107530823 11 connected 0-5460

b8be626d33d07cb10094ab9f1345d6436d18d27f 192.168.197.101:7002 master - 0 1500107531327 3 connected 10923-16383

38f95bb38e691efdb45f926eb9157cdba7111d92 192.168.197.101:7005 slave b8be626d33d07cb10094ab9f1345d6436d18d27f 0 1500107531831 6 connected

4314bb678cda2ba1550e3ec1081db5d5fae74c87 192.168.197.101:7000 myself,master - 0 0 10 connected

f53441ccbe2c3bec2fb03f8180f723c7c5b735c7 192.168.197.101:7007 slave 5d0632d76008ea3010878317d804b3c0ae50a13f 0 1500107531831 11 connected

 

The output above shows that slots 0 through 5460, 5461 slots in total, have been successfully migrated from node 7000 to node 7009.

 

Query the relevant key to further verify that the key data was migrated:

./redis-cli -c -h 192.168.197.101 -p 7000

192.168.197.101:7000> keys *

(empty list or set)

192.168.197.101:7000> get host

-> Redirected to slot [2130] located at 192.168.197.101:7009

"redis.coe2coe.me"

192.168.197.101:7009> keys *

1) "host"

 

The key host is found in slot 2130 on node 7009, which shows that the key data was migrated successfully.
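The slot number in the redirect message can be reproduced offline. Redis Cluster maps a key to a slot as CRC16(key) mod 16384, using the XMODEM variant of CRC16, and hashes only the text inside the first non-empty {...} hash tag when one is present. The following is a minimal sketch of that mapping; the function names are illustrative and are not taken from any Redis client library.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT, XMODEM variant: polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to one of the 16384 cluster slots, honoring {hash tags}."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag replaces the whole key as the hashed part.
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


print(key_slot("host"))  # 2130, the slot shown in the redirect message
```

Because slot 2130 lies in the range 0-5460, the key follows that range to whichever master serves it, which is 7009 after the resharding.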

 

Now check the state of the cluster with the redis-trib.rb tool:

./redis-trib.rb check 192.168.197.101:7009

>>> Performing Cluster Check (using node 192.168.197.101:7009)

M: 5d0632d76008ea3010878317d804b3c0ae50a13f 192.168.197.101:7009

   slots:0-5460 (5461 slots) master

   2 additional replica(s)

M: 37ccec5145b4e071687e671bda36789e124fc9ed 192.168.197.101:7001

   slots:5461-10922 (5462 slots) master

   1 additional replica(s)

S: c48ead74999cf71f3f7446f6ae9771423de65890 192.168.197.101:7004

   slots: (0 slots) slave

   replicates 37ccec5145b4e071687e671bda36789e124fc9ed

S: 78ae31a28bcd62b87f93c932552b5f6c1fe3329c 192.168.197.101:7006

   slots: (0 slots) slave

   replicates 5d0632d76008ea3010878317d804b3c0ae50a13f

M: 4314bb678cda2ba1550e3ec1081db5d5fae74c87 192.168.197.101:7000

   slots: (0 slots) master

   0 additional replica(s)

S: 38f95bb38e691efdb45f926eb9157cdba7111d92 192.168.197.101:7005

   slots: (0 slots) slave

   replicates b8be626d33d07cb10094ab9f1345d6436d18d27f

S: f53441ccbe2c3bec2fb03f8180f723c7c5b735c7 192.168.197.101:7007

   slots: (0 slots) slave

   replicates 5d0632d76008ea3010878317d804b3c0ae50a13f

M: b8be626d33d07cb10094ab9f1345d6436d18d27f 192.168.197.101:7002

   slots:10923-16383 (5461 slots) master

   1 additional replica(s)

[OK] All nodes agree about slots configuration.

>>> Check for open slots...

>>> Check slots coverage...

[OK] All 16384 slots covered.

 

As the output shows, the 2 slaves of 7000 have become slaves of 7009.

 

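Summary:

When run with the reshard argument, the redis-trib.rb tool results in the following three actions:

(1) The slots served by the source master are reassigned to the target master.

(2) The key data stored on the source master is moved to the target master.

(3) The slaves of the source master become slaves of the target master, as observed above once the source master was left with no slots.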
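The three effects listed above can be pictured with a small in-memory model. This is only a sketch of the observable outcome, not of how redis-trib.rb or the cluster implements it; the Master class and the reshard_all function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    name: str
    slots: set = field(default_factory=set)       # slot numbers served
    keys: dict = field(default_factory=dict)      # key -> (slot, value)
    replicas: list = field(default_factory=list)  # names of slave nodes


def reshard_all(source: Master, target: Master) -> None:
    """Move every slot of source to target, mirroring the three effects."""
    # (1) Slot ownership changes hands.
    target.slots |= source.slots
    # (2) Keys stored in the moved slots follow their slots.
    moved = {k: v for k, v in source.keys.items() if v[0] in source.slots}
    target.keys.update(moved)
    for k in moved:
        del source.keys[k]
    source.slots.clear()
    # (3) Slaves of the now-empty source attach to the target.
    target.replicas.extend(source.replicas)
    source.replicas.clear()


m7000 = Master("7000", set(range(0, 5461)),
               {"host": (2130, "redis.coe2coe.me")}, ["7006", "7007"])
m7009 = Master("7009")
reshard_all(m7000, m7009)
print(len(m7009.slots), list(m7009.keys), m7009.replicas)
# 5461 ['host'] ['7006', '7007']
```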

 

Step 7: the new slave 8371 begins its first full resynchronization from the new master 8373:

8373 log:

4255:M 09 Sep 18:57:51.906 * Full resync requested by slave xx.x.xxx.200:8371

4255:M 09 Sep 18:57:51.906 * Starting BGSAVE for SYNC with target: disk

4255:M 09 Sep 18:57:51.941 * Background saving started by pid 5230

8371 log:

46590:S 09 Sep 18:57:51.948 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440721826993

1. Adding a slave node.

Step 18: slave 8371 finishes synchronizing successfully; it took 7 minutes:

46590:S 09 Sep 19:08:19.303 * MASTER <-> SLAVE sync: Finished with success

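1.1.5. Removing a Master Node

The nodes currently in the cluster are shown below. The master node to be removed is 7003. This master currently has 2 slave nodes, 7000 and 7007, serves slots 0 through 5460, and holds one key: host.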

./redis-cli -c -h 192.168.197.101 -p 7001 cluster nodes

37ccec5145b4e071687e671bda36789e124fc9ed 192.168.197.101:7001 myself,master - 0 0 2 connected 5461-10922

4314bb678cda2ba1550e3ec1081db5d5fae74c87 192.168.197.101:7000 slave dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 0 1500102709303 7 connected

b8be626d33d07cb10094ab9f1345d6436d18d27f 192.168.197.101:7002 master - 0 1500102708296 3 connected 10923-16383

c48ead74999cf71f3f7446f6ae9771423de65890 192.168.197.101:7004 slave 37ccec5145b4e071687e671bda36789e124fc9ed 0 1500102707288 5 connected

f53441ccbe2c3bec2fb03f8180f723c7c5b735c7 192.168.197.101:7007 slave dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 0 1500102708296 7 connected

78ae31a28bcd62b87f93c932552b5f6c1fe3329c 192.168.197.101:7006 slave 37ccec5145b4e071687e671bda36789e124fc9ed 0 1500102708296 2 connected

38f95bb38e691efdb45f926eb9157cdba7111d92 192.168.197.101:7005 slave b8be626d33d07cb10094ab9f1345d6436d18d27f 0 1500102708799 6 connected

dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 192.168.197.101:7003 master - 0 1500102707792 7 connected 0-5460

 

./redis-cli -c -h 192.168.197.101 -p 7003

192.168.197.101:7003> keys *

1) "host"

 

 

In this situation a direct removal fails with the error below, because only an empty master, one that serves no slots, can be removed.

[d@192.168.197.101:/opt/redis_cluster/7008]$./redis-trib.rb del-node 192.168.197.101:7003 dbcdc9682acbd8c52dd6184fe01bf5f9500b2180

>>> Removing node dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 from cluster 192.168.197.101:7003

[ERR] Node 192.168.197.101:7003 is not empty! Reshard data away and try again.
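Whether a master is empty can be checked before calling del-node by counting the slots listed for it in the cluster nodes output. The sketch below parses the text format shown above (node ID, address, flags, master ID, ping, pong, epoch, link state, then slot ranges); the helper names are hypothetical, and the sample lines are copied from the output in this section.

```python
SAMPLE = """\
37ccec5145b4e071687e671bda36789e124fc9ed 192.168.197.101:7001 myself,master - 0 0 2 connected 5461-10922
4314bb678cda2ba1550e3ec1081db5d5fae74c87 192.168.197.101:7000 slave dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 0 1500102709303 7 connected
dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 192.168.197.101:7003 master - 0 1500102707792 7 connected 0-5460
"""


def parse_cluster_nodes(text: str) -> list:
    """Turn `cluster nodes` text into a list of dicts, one per node."""
    nodes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # skip blank or truncated lines
        nodes.append({
            "id": parts[0],
            "addr": parts[1],
            "flags": parts[2].split(","),
            "master_id": None if parts[3] == "-" else parts[3],
            "slots": parts[8:],
        })
    return nodes


def slot_count(node: dict) -> int:
    """Count owned slots; entries are single numbers or first-last ranges."""
    total = 0
    for item in node["slots"]:
        if item.startswith("["):  # migrating/importing marker, not owned
            continue
        first, _, last = item.partition("-")
        total += int(last or first) - int(first) + 1
    return total


def is_empty_master(node: dict) -> bool:
    return "master" in node["flags"] and slot_count(node) == 0


for node in parse_cluster_nodes(SAMPLE):
    print(node["addr"], slot_count(node), is_empty_master(node))
```

For the sample, 7003 still reports 5461 slots, which is the condition that makes del-node refuse with the "is not empty" error.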

 

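Such a master can be removed in the following ways:

(1) Method 1: stop the 7003 master so that one of its slaves is automatically promoted to master. Start 7003 again and it automatically becomes a slave, which can then be removed easily, with no data loss and without resharding any slots.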

 

Run the following commands in order to carry this out:

(a) Stop the 7003 service.

./redis-cli -c -h 192.168.197.101 -p 7003 shutdown

While the service is stopped, the node cannot be removed directly; attempting to do so produces the following error:

./redis-trib.rb del-node 192.168.197.101:7000  dbcdc9682acbd8c52dd6184fe01bf5f9500b2180

>>> Removing node dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 from cluster 192.168.197.101:7000

[ERR] No such node ID dbcdc9682acbd8c52dd6184fe01bf5f9500b2180

 

(b) Restart the 7003 service.

After confirming that a slave of 7003 has been elected and promoted, restart the 7003 service. 7003 now becomes a slave of 7000.

[d@192.168.197.101:/opt/redis_cluster/7003]$./redis-server ./redis.conf

 

(c) Run the del-node operation to remove node 7003.

Node 7003 can now be removed from the cluster successfully.

./redis-trib.rb del-node 192.168.197.101:7000  dbcdc9682acbd8c52dd6184fe01bf5f9500b2180

>>> Removing node dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 from cluster 192.168.197.101:7000

>>> Sending CLUSTER FORGET messages to the cluster...

>>> SHUTDOWN the node.

 

The node has now been removed.

 

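(2) Method 2: use the CLUSTER FAILOVER command to trigger a manual failover, which promotes a slave. The principle is very similar to Method 1, so it is not covered here.

 

(3) Method 3: use Redis Cluster resharding to migrate the slots served by master 7003 to other masters, so that 7003 no longer serves any slots. 7003 then becomes an empty master and can be removed.

This involves the resharding operation described in section 1.1.6.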

 

Step 17: the cluster returns to normal:

42645:M 09 Sep 19:05:58.786 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: slave is reachable again.

3. Removing a slave node.

Step 14: slave 8371 starts synchronizing again, but the connection fails because master 8373 has reached its maximum number of client connections:

46590:S 09 Sep 19:00:21.103 * Connecting to MASTER xx.x.xxx.199:8373

46590:S 09 Sep 19:00:21.103 * MASTER <-> SLAVE sync started

46590:S 09 Sep 19:00:21.104 * Non blocking connect for SYNC fired the event.

46590:S 09 Sep 19:00:21.104 # Error reply to PING from master: '-ERR max number of clients reached'

2. Adding a master node.

xx.x.xxx.199

xx.x.xxx.200

xx.x.xxx.201

1.1.4. Removing a Slave Node

First use redis-cli to look up the NODEID of the node to be removed, then remove the node with the redis-trib.rb tool.

./redis-cli -c -h 192.168.197.101 -p 7008 cluster nodes |grep myself

5377470350bb3fec9165a24589d115ca4fc1a644 192.168.197.101:7008 myself,slave dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 0 0 0 connected

 

[d@192.168.197.101:/opt/redis_cluster/7008]$./redis-trib.rb del-node 192.168.197.101:7008 5377470350bb3fec9165a24589d115ca4fc1a644

>>> Removing node 5377470350bb3fec9165a24589d115ca4fc1a644 from cluster 192.168.197.101:7008

>>> Sending CLUSTER FORGET messages to the cluster...

>>> SHUTDOWN the node.

 

At this point node 7008 has not only been removed from the cluster; its service port has also been shut down.

 

About the client-output-buffer-limit parameter:

4. Removing a master node.

To summarize, there were the following problems:

1.1.3. Changing a Node's Master-Slave Relationship

Node 7008 is currently a newly added master node that serves no slots.

./redis-cli -c -h 192.168.197.101 -p 7008

192.168.197.101:7008> cluster nodes

5377470350bb3fec9165a24589d115ca4fc1a644 192.168.197.101:7008 myself,master - 0 0 0 connected

c48ead74999cf71f3f7446f6ae9771423de65890 192.168.197.101:7004 slave 37ccec5145b4e071687e671bda36789e124fc9ed 0 1500101360347 2 connected

b8be626d33d07cb10094ab9f1345d6436d18d27f 192.168.197.101:7002 master - 0 1500101359843 3 connected 10923-16383

dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 192.168.197.101:7003 master - 0 1500101360851 7 connected 0-5460

// Several lines are omitted here to save space.

f53441ccbe2c3bec2fb03f8180f723c7c5b735c7 192.168.197.101:7007 slave dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 0 1500101360851 7 connected


 

Now use the Redis Cluster command cluster replicate to turn master node 7008 into a slave of node 7003.

 

192.168.197.101:7008> cluster replicate dbcdc9682acbd8c52dd6184fe01bf5f9500b2180

OK

The change has succeeded. The result can be checked with the cluster nodes command:

 

192.168.197.101:7008> cluster nodes

5377470350bb3fec9165a24589d115ca4fc1a644 192.168.197.101:7008 myself,slave dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 0 0 0 connected

c48ead74999cf71f3f7446f6ae9771423de65890 192.168.197.101:7004 slave 37ccec5145b4e071687e671bda36789e124fc9ed 0 1500101430401 2 connected

b8be626d33d07cb10094ab9f1345d6436d18d27f 192.168.197.101:7002 master - 0 1500101430905 3 connected 10923-16383

dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 192.168.197.101:7003 master - 0 1500101429897 7 connected 0-5460

// Several lines are omitted here to save space.

 

Further verify that the replication relationship has been established:

192.168.197.101:7008> keys *

1) "host"

This shows that the key host has been replicated from its new master.

 

 

 

 

 

Step 3: slaves 8372/8373/8374 receive the message from master 8375 that 8371 is unreachable:

46986:S 09 Sep 18:57:50.120 * FAIL message received from 5ab4f85306da6d633e4834b4d3327f45af02171b about bedab2c537fe94f8c0363ac4ae97d56832316e65

5. Resharding (slot reassignment).

5. After the first full resynchronization failed, the slave took 30 seconds to reconnect to the master.

 

Machine layout:

1.1.1. Adding a Slave Node

How can a new node be added to a Redis Cluster as a slave of an existing node? There are at least the following methods:

 

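(1) Use the redis-trib.rb tool and let it choose the master node.

Again the redis-trib.rb tool is used. The following command, run through node 7001, adds node 7006 to the cluster as a slave. As for which master it becomes a slave of, one master is chosen at random from the masters that have the fewest slaves.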
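The selection rule described above can be sketched as follows. This models the rule as the article states it (a random pick among the masters with the fewest slaves); it is an assumption that redis-trib.rb behaves exactly this way, and pick_master is a hypothetical helper, not a function of the tool.

```python
import random


def pick_master(replica_counts: dict, rng: random.Random) -> str:
    """Pick one master at random among those with the fewest slaves."""
    fewest = min(replica_counts.values())
    candidates = sorted(m for m, n in replica_counts.items() if n == fewest)
    return rng.choice(candidates)


# Replica counts as reported by the cluster check at this point:
# every master has exactly one slave, so all three are candidates.
counts = {"7001": 1, "7002": 1, "7003": 1}
print(pick_master(counts, random.Random()))
```

In the run shown in this section the tool reports "Automatically selected master 192.168.197.101:7001", which is one of the three equally loaded candidates.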

./redis-trib.rb add-node --slave  192.168.197.101:7006 192.168.197.101:7001

>>> Adding node 192.168.197.101:7006 to cluster 192.168.197.101:7001

>>> Performing Cluster Check (using node 192.168.197.101:7001)

M: 37ccec5145b4e071687e671bda36789e124fc9ed 192.168.197.101:7001

   slots:5461-10922 (5462 slots) master

   1 additional replica(s)

S: 4314bb678cda2ba1550e3ec1081db5d5fae74c87 192.168.197.101:7000

   slots: (0 slots) slave

   replicates dbcdc9682acbd8c52dd6184fe01bf5f9500b2180

M: b8be626d33d07cb10094ab9f1345d6436d18d27f 192.168.197.101:7002

   slots:10923-16383 (5461 slots) master

   1 additional replica(s)

S: 38f95bb38e691efdb45f926eb9157cdba7111d92 192.168.197.101:7005

   slots: (0 slots) slave

   replicates b8be626d33d07cb10094ab9f1345d6436d18d27f

M: dbcdc9682acbd8c52dd6184fe01bf5f9500b2180 192.168.197.101:7003

   slots:0-5460 (5461 slots) master

   1 additional replica(s)

S: c48ead74999cf71f3f7446f6ae9771423de65890 192.168.197.101:7004

   slots: (0 slots) slave

   replicates 37ccec5145b4e071687e671bda36789e124fc9ed

[OK] All nodes agree about slots configuration.

>>> Check for open slots...

>>> Check slots coverage...

[OK] All 16384 slots covered.

Automatically selected master 192.168.197.101:7001

>>> Send CLUSTER MEET to node 192.168.197.101:7006 to make it join the cluster.

Waiting for the cluster to join.

>>> Configure node as replica of 192.168.197.101:7001.

[OK] New node added correctly.

 

(2) Use the redis-trib.rb tool and specify the master node manually.

Use the --master-id option to specify the NODEID of the master node.

./redis-trib.rb add-node --slave  --master-id 'dbcdc9682acbd8c52dd6184fe01bf5f9500b2180' 192.168.197.101:7007 192.168.197.101:7001

>>> Adding node 192.168.197.101:7007 to cluster 192.168.197.101:7001

>>> Performing Cluster Check (using node 192.168.197.101:7001)

// Several lines are omitted here to save space.

[OK] All nodes agree about slots configuration.

>>> Check for open slots...

>>> Check slots coverage...

[OK] All 16384 slots covered.

>>> Send CLUSTER MEET to node 192.168.197.101:7007 to make it join the cluster.

Waiting for the cluster to join.

>>> Configure node as replica of 192.168.197.101:7003.

[OK] New node added correctly.

 

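From the earlier verification, the slot of the key host is served by master 7003, and 7007 has now joined the cluster as a slave of 7003. So 7007 should hold the key host, but a query for host through 7007 is redirected to its master, 7003.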

./redis-cli  -c -h 192.168.197.101 -p 7007

192.168.197.101:7007> keys *

1) "host"

192.168.197.101:7007> get host

-> Redirected to slot [2130] located at 192.168.197.101:7003

"redis.coe2coe.me"

 

A client executed one command that took 8.3 s.

This article covers the following:

Why the node became unreachable:

1.1.2. Adding a Master Node

The redis-trib.rb tool makes adding a master node easy.

./redis-trib.rb add-node 192.168.197.101:7008 192.168.197.101:7001

>>> Adding node 192.168.197.101:7008 to cluster 192.168.197.101:7001

>>> Performing Cluster Check (using node 192.168.197.101:7001)

M: 37ccec5145b4e071687e671bda36789e124fc9ed 192.168.197.101:7001

   slots:5461-10922 (5462 slots) master

   2 additional replica(s)

// Several lines are omitted here to save space.

S: c48ead74999cf71f3f7446f6ae9771423de65890 192.168.197.101:7004

   slots: (0 slots) slave

   replicates 37ccec5145b4e071687e671bda36789e124fc9ed

[OK] All nodes agree about slots configuration.

>>> Check for open slots...

>>> Check slots coverage...

[OK] All 16384 slots covered.

>>> Send CLUSTER MEET to node 192.168.197.101:7008 to make it join the cluster.

[OK] New node added correctly.

 

Checking the state of node 7008 further shows that it joined as a master.

./redis-trib.rb check 192.168.197.101:7008

>>> Performing Cluster Check (using node 192.168.197.101:7008)

M: 5377470350bb3fec9165a24589d115ca4fc1a644 192.168.197.101:7008

   slots: (0 slots) master

   0 additional replica(s)

// Several lines are omitted here to save space.

[OK] All nodes agree about slots configuration.

>>> Check for open slots...

>>> Check slots coverage...

[OK] All 16384 slots covered.

 

The master node 7008 added by this command does not serve any slots yet, but it is indeed a node of the cluster.

./redis-cli -c -h 192.168.197.101 -p 7008

192.168.197.101:7008> keys *

(empty list or set)

192.168.197.101:7008> get host

-> Redirected to slot [2130] located at 192.168.197.101:7003

"redis.coe2coe.me"

 

 

That is all for this detailed analysis of a Redis cluster failure; we hope it is helpful. If anything is lacking, please point it out in a comment and it will be corrected promptly.

Measures taken:

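Because the machines are in the same rack, a network outage was unlikely, so the slow log was inspected with the SLOWLOG GET command. It showed that a KEYS command had been executed and had taken 8.3 seconds. The cluster node timeout was then checked and turned out to be 5 s (cluster-node-timeout 5000).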
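The reasoning here is a comparison of two numbers: a Redis node executes commands on a single thread, so while the 8.3 s KEYS command runs the node cannot answer cluster pings, and that exceeds the 5000 ms timeout. The sketch below is a deliberate simplification (real failure detection also involves gossip and agreement among masters), and the function name is hypothetical.

```python
def may_be_marked_failed(blocking_seconds: float, node_timeout_ms: int) -> bool:
    """True when one blocking command outlasts cluster-node-timeout."""
    return blocking_seconds * 1000 > node_timeout_ms


# Values from the incident: KEYS took 8.3 s, the timeout was 5000 ms.
print(may_be_marked_failed(8.3, 5000))   # True
# With the timeout raised to the 15 s recommended in the measures.
print(may_be_marked_failed(8.3, 15000))  # False
```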

5. Fix the client's frantic connection behavior, which resembles a SYN attack.
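The article does not say how the client was fixed. One common way to stop a client from hammering a server that has hit its connection limit is to space out reconnect attempts with a capped exponential backoff. The sketch below only computes the wait times; the function name and the default values (0.5 s base, 30 s cap) are assumptions chosen for illustration.

```python
def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> list:
    """Seconds to wait before each reconnect attempt, doubling up to a cap."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]


print(backoff_delays(8))
# [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

A production client would usually add random jitter to these delays so that many clients do not retry at the same instant.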

At the business level, errors reported that queries to Redis had failed.

Step 9: masters 8370/8375 determine that 8373 (the new master) has recovered:

60295:M 09 Sep 18:58:18.181 * Clear FAIL state for node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6: is reachable again and nobody is serving its slots after some time.

The process was found to have been running continuously for 3 months.

redis-server process status:

Symptoms of the failure:

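4. In addition, the connection pool of the PHP client was faulty and connected to the server frantically, producing an effect similar to a SYN attack.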

Cluster composition:

Step 10: master 8373 completes the BGSAVE required for the full resynchronization:

5230:C 09 Sep 18:59:01.474 * DB saved on disk

5230:C 09 Sep 18:59:01.491 * RDB: 7112 MB of memory used by copy-on-write

4255:M 09 Sep 18:59:01.877 * Background saving terminated with success

2016/9/9 18:57:43 The KEYS command starts executing

2016/9/9 18:57:50 8371 is judged to be unreachable

2016/9/9 18:57:51 The KEYS command finishes executing
