Rancher fails to start healthcheck and lb|rancher,healthcheck,lb,initializing,start|cyj

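# Rancher fails to start healthcheck and lb

> [Rancher](https://rancher.com/) is a container management platform; its latest 2.0 release has Kubernetes built in.

## Problem description

A new product was about to go live. All of its servers were Tencent Cloud ECS instances, with Rancher 1.6.17 installed for container orchestration. When hosts were added to the cluster, Rancher's **healthcheck** container and our custom **load balancer** container failed to start and stayed in the `Initializing` state indefinitely.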

![](https://blog-1256695615.cos.ap-shanghai.myqcloud.com/2018/05/29/02ba63ccf0c44c37b987d642dbedc962.png)

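## Troubleshooting

### Logs of the affected containers

The logs of the healthcheck and lb containers showed nothing unusual. Since we do not use Kubernetes, the Kubernetes-related line is only an informational notice.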

```
Failed to initialize Kubernetes controller: KUBERNETES_URL is not set, CATTLE_ACCESS_KEY is not set, skipping init of Rancher LB provider
Starting Rancher LB service
LB controller: rancher
LB provider: haproxy
starting rancher controller
Healthcheck handler is listening on :10241
Initializing event router" workerCount=25
Connection established
Starting websocket pings
 -- starting haproxy\n * Starting haproxy haproxy\n   ...done.\n
 -- reloading haproxy config with the new config changes\n * Reloading haproxy haproxy\n[WARNING] 147/162016 (61) : config : 'option forwardfor' ignored for proxy 'default' as it requires HTTP mode.\n[WARNING] 147/162016 (63) : config : 'option forwardfor' ignored for proxy 'default' as it requires HTTP mode.\n   ...done.\n
```

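### Clues found via Google

Problems like this are the most painful: there is no useful log output, so you have to rely on experience, and the logs that do exist can even point you in the wrong direction.

Rancher's GitHub issue [#9916](https://github.com/rancher/rancher/issues/9916) offered some clues, with two useful answers:

**Answer 1:**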

```
LB not working might be a different issue, this error message shouldn't affect the behavior. If the LB gets stuck in initializing state, in most of the cases - especially when there are multiple hosts in the system - it means that the healthcheck for LB is failing due to cross hosts communication failures.
```

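Testing confirmed this: with a single host in the cluster nothing went wrong, but as soon as the cluster had more than one host the problem appeared.

**Answer 2:**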

```
This is becoming an aggregation of issues, while the log message is not telling us anything. I marked this an enhancement.

For other issues, please ensure cross host networking is working properly (http://rancher.com/docs/rancher/v1.6/en/faqs/troubleshooting/#cross-host-communication). If this is the case, please file a new issue describing what isn't working as expected with all the relevant versions and information to reproduce.
```

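Testing showed that containers on different hosts could not ping each other. That was the moment it clicked: the network communication between hosts was broken.

When a host is added to the cluster without an explicit IP, Rancher defaults to the host's public IP, as shown below: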

![](https://blog-1256695615.cos.ap-shanghai.myqcloud.com/2018/05/29/ec3b615de2de4a519e67612d937c6b32.png)

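These are cloud servers, so for security reasons all inbound public traffic is rejected except for essential ports such as 80, 443, and 22. Because no private IP was specified when the hosts were added, Rancher registered them with their public IPs, cross-host container traffic was sent to those public addresses and blocked, and containers on different hosts could not communicate.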
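To spot this situation quickly, the IP shown for each host in the Rancher console can be checked against the private (RFC 1918) ranges. The following is a minimal sketch; `ip_scope` is a hypothetical helper name, and the sample addresses are placeholders rather than values from this incident.

```shell
#!/bin/sh
# Classify an IPv4 address as "private" (RFC 1918) or "public".
# ip_scope is a hypothetical helper, not part of Rancher or Docker.
ip_scope() {
    case "$1" in
        10.*|192.168.*) echo private ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) echo private ;;
        *) echo public ;;
    esac
}

# Placeholder addresses for illustration only.
ip_scope 10.0.0.5        # prints "private"
ip_scope 203.0.113.10    # prints "public"
```

A host that shows up as `public` here is a candidate for re-registration with its private IP. Note that a `private` result is not proof of a correct registration, since the Docker bridge range is also private.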

## Solution

Specify the private IP when adding each host. Once that was done, both the healthcheck and lb containers returned to normal.
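As a rough illustration of what "specifying the private IP" amounts to on the command line, the sketch below only prints a registration command with `CATTLE_AGENT_IP` set; it does not run it. `build_agent_cmd` is a hypothetical helper, and the agent image and registration URL are placeholders: the real values, and the authoritative flag list, should be copied from the command that Rancher's Add Host page generates.

```shell
#!/bin/sh
# Print (do not run) an agent registration command that pins the host IP.
# build_agent_cmd is a hypothetical helper. The flags are assumed to match
# what the Rancher 1.6 "Add Host" page generates; prefer the UI's own copy.
build_agent_cmd() {
    agent_ip="$1"           # private IP of the host being registered
    agent_image="$2"        # placeholder: agent image shown in the UI
    registration_url="$3"   # placeholder: registration URL shown in the UI
    printf 'sudo docker run -e CATTLE_AGENT_IP="%s" --rm --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher %s %s\n' \
        "$agent_ip" "$agent_image" "$registration_url"
}

# Placeholder arguments for illustration only.
build_agent_cmd 10.0.0.5 '<agent-image>' '<registration-url>'
```

Filling in the private IP field on the Add Host page should produce the same `-e CATTLE_AGENT_IP=...` option, so the helper is mainly useful when scripting registration across many hosts.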

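## Further reading

> The material below comes from Rancher's official English documentation: [CROSS HOST COMMUNICATION](https://rancher.com/docs/rancher/v1.6/en/faqs/troubleshooting/#cross-host-communication)

This section covers **cross-host communication**.

If containers on different hosts cannot ping each other, one of the following common scenarios may be the cause:

### How do I check whether cross-host communication is working?

Check the state of the `healthcheck` containers. If they are `active`, cross-host communication is working.

### Is the host IP shown in the console correct?

Occasionally, the host is registered with the Docker bridge IP instead of the machine's actual IP. These addresses are typically 172.17.42.1 or otherwise start with 172.17.x.x. If this is the case, set `CATTLE_AGENT_IP` in the `docker run` command and re-register the host.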
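The 172.17.x.x heuristic described above can be expressed as a small check. This is only a sketch: `is_docker_bridge_ip` is a hypothetical helper, and it assumes Docker's bridge uses its default 172.17.0.0/16 range, which can be reconfigured.

```shell
#!/bin/sh
# Return success if the address looks like a default Docker bridge IP.
# is_docker_bridge_ip is a hypothetical helper; it assumes the default
# 172.17.0.0/16 bridge range and will miss customized bridge networks.
is_docker_bridge_ip() {
    case "$1" in
        172.17.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Placeholder address for illustration only.
if is_docker_bridge_ip 172.17.42.1; then
    echo "host IP looks like a Docker bridge IP; re-register with CATTLE_AGENT_IP"
fi
```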

### Containers cannot communicate on Ubuntu

If `UFW` is enabled, either disable `UFW` or set the following in `/etc/default/ufw`:

```shell
DEFAULT_FORWARD_POLICY="ACCEPT"
```
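For hosts where the file has to be changed repeatedly, the edit can be scripted. The sketch below is one possible approach, not a Rancher-provided tool: `set_ufw_forward_policy` is a hypothetical helper that rewrites an existing `DEFAULT_FORWARD_POLICY` line in the file it is given and leaves every other line alone.

```shell
#!/bin/sh
# Rewrite the DEFAULT_FORWARD_POLICY line of a UFW defaults file to ACCEPT.
# set_ufw_forward_policy is a hypothetical helper. It only changes an
# existing line; it does not append one if the setting is missing.
set_ufw_forward_policy() {
    conf="$1"
    tmp="$(mktemp)" || return 1
    # Write the edited copy to a temp file, then copy it back over the
    # original so the original file's permissions are preserved.
    sed 's/^DEFAULT_FORWARD_POLICY=.*/DEFAULT_FORWARD_POLICY="ACCEPT"/' "$conf" > "$tmp" \
        && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Intended use, as root, followed by a reload so the change takes effect:
#   set_ufw_forward_policy /etc/default/ufw && ufw reload
```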

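### Why do Load Balancers stay in the INITIALIZING state?

Load Balancers have health checks enabled automatically. If they remain in this state, cross-host communication is most likely broken.

## Summary

Solving this problem requires understanding how Rancher's healthcheck and lb work; anyone familiar with both would have spotted the cause at a glance. Consider it a lesson learned.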

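This article was written by cyj. It may be freely reposted and quoted, provided the author is credited and the source is cited.

Article title: Rancher fails to start healthcheck and lb

Article link: https://chenyongjun.vip/articles/37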
