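Distributed Leader Election -- Master Election and High Availability with MySQL ACID and the Lease Protocol
In production work I ran into scenarios where several nodes coexist, one of them must be elected master, and failover has to be automatic (HA). I worked out an approach and would like to share it here.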
Lease protocol and MySQL ACID
High-availability master election design
Applicable scenarios
Java implementation
Further optimization
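Many systems need something like a master-slave architecture: the master server (Master) serves requests, while the slave server (Slave) is a hot standby that serves nothing but stays alive. If the Master crashes or loses its network, a Slave takes over and is promoted to Master (the new master). This is the typical case of multiple coexisting nodes where only one master may exist at any time and every node's state is maintained consistently.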
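The famous Paxos algorithm (http://baike.baidu.com/view/8438269.htm) probably comes to mind first. In short, Paxos decides a matter by having the nodes vote: once more than half of the nodes accept a proposal, Paxos produces a single agreed result and the nodes learn it. For master election, a node proposes itself as Master, the other nodes respond, and the cluster ends up agreeing on exactly one Master. ZooKeeper implements a Paxos-like protocol (ZAB) and is the natural fit for electing a master in this scenario. It has one obvious drawback, though: it stops working unless more than half of the ensemble is alive. With 10 zk nodes, for example, more than 5 (that is, at least 6) must be available.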
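In practice, if the requirements on the Master are not that strict, a few simplifications and trade-offs are enough. For example, the Master may be unreachable for a few seconds, and a conflict may occur during an election as long as another round of election resolves it. I designed a simple workaround that relies on MySQL's consistency and a simplified Lease.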
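MySQL's ACID guarantees keep a single row consistent and intact: concurrent reads and writes from multiple processes do not leave it inconsistent, and there is exactly one correct value. In the Lease protocol (the details are easy to find online), the Master is granted a lease; it acts as master for the duration of the lease and applies for a new one when the lease runs out. If the lease runs out while the network is broken, the Master takes itself offline so that other nodes can compete for the role. For example, suppose nodes A, B and C hold a first election and A becomes Master with a 10-second lease. If the current time is 00:00:00, A remains Master until 00:00:10. At 00:00:10, A, B and C hold a new election in which any node may win (in engineering practice A is the most likely to remain Master). If A's network is down at that moment and it cannot reach B and C, A goes offline by itself and does not compete, so no "split brain" occurs.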
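The self-demotion rule in the A/B/C example can be sketched in a few lines of Java. This is only an illustration under assumptions: the names `LeaseHolder`, `renewed` and `mayServe` are hypothetical, and the clock is injected so that the example does not depend on real time.

```java
import java.util.function.LongSupplier;

/** Minimal lease holder: a master may serve only while its lease is unexpired. */
final class LeaseHolder {
    private final long leaseMillis;
    private final LongSupplier clock;   // injected clock (assumption, keeps the demo deterministic)
    private long grantedAt;             // time of the last successful renewal

    LeaseHolder(long leaseMillis, LongSupplier clock) {
        this.leaseMillis = leaseMillis;
        this.clock = clock;
        this.grantedAt = clock.getAsLong();
    }

    /** Call after a successful renewal; a failed renewal simply never calls this. */
    void renewed() {
        grantedAt = clock.getAsLong();
    }

    /** True while now < grantedAt + lease; once false the node must stop serving. */
    boolean mayServe() {
        return clock.getAsLong() - grantedAt < leaseMillis;
    }
}

final class LeaseHolderDemo {
    public static void main(String[] args) {
        long[] now = {0L};                        // fake clock, 0 stands for 00:00:00
        LeaseHolder a = new LeaseHolder(10_000L, () -> now[0]);
        now[0] = 9_000L;
        System.out.println(a.mayServe());         // still inside the 10 s lease
        now[0] = 10_000L;                         // network is down, no renewal happened
        System.out.println(a.mayServe());         // lease used up, A steps down
    }
}
```

The point of the design is that the master needs no message from anyone to step down: not being able to renew is enough.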
----------------------------------------------------------------------------------------------
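The design is as follows (a server is one machine in the cluster, or equally one process; all servers are peers):
Servers synchronize their clocks through an NTP server (second-level accuracy between servers is enough).
Each server has a unique ID (IP + process ID) that identifies the server instance.
Each server defines a lease, in seconds.
A single row in a single MySQL table holds the global Master information; ACID keeps it consistent.
The Master server updates that row every half lease period, refreshing its heartbeat to keep its Master status.
Each Slave server reads the Master information from MySQL every half lease period. If the Master's lease in the database has expired (heartbeat_time + lease < current_time), it applies to become Master.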
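The per-tick decision described by these steps can be written as a pure function. The sketch below is an assumption-laden illustration, not the article's code: `ElectionTick`, `Action` and `decide` are hypothetical names, all times are in seconds, and no database is involved.

```java
/** One polling tick of the design above, as a pure function (no database involved). */
final class ElectionTick {
    enum Action { RENEW, STAND_BY, CHALLENGE }

    /**
     * @param masterId      ID stored in the single master row, or null if the row is empty
     * @param heartbeatTime last heartbeat written by the master, in seconds
     * @param lease         lease length in seconds
     * @param selfId        this server's unique ID
     * @param now           current time in seconds
     */
    static Action decide(String masterId, long heartbeatTime, long lease,
                         String selfId, long now) {
        if (masterId == null) {
            return Action.CHALLENGE;                  // nobody has registered yet
        }
        boolean expired = heartbeatTime + lease < now;
        if (expired) {
            return Action.CHALLENGE;                  // the master's lease ran out
        }
        // Lease still valid: the owner renews it, everyone else stands by.
        return masterId.equals(selfId) ? Action.RENEW : Action.STAND_BY;
    }

    /** Master and slaves both poll every half lease. */
    static long pollIntervalSeconds(long lease) {
        return lease / 2;
    }
}
```

For instance, `ElectionTick.decide("10.0.0.1", 100, 10, "10.0.0.2", 111)` yields `CHALLENGE`, because 100 + 10 < 111.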
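The tricky parts are:
1. Database access and the sleep time (half a lease) introduce delay, and MySQL and network failures have to be handled.
2. Several servers may try to take over as Master at the same time, so a verification step is needed to make every server that failed to win fall back to Slave automatically.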
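The following example illustrates this (10.0.0.1 is the Master):
10.0.0.1 crashes. Its master record in MySQL expires, and the other nodes try to take over.
Each node then reads the database again to check whether it was the one that won:
After that, 10.0.0.3 serves as Master. If 10.0.0.1 restarts, it can join as a Slave. If 10.0.0.1 could not maintain its heartbeat because of a network partition or another network failure, it stops serving automatically once its own lease runs out, so a "dual Master" situation does not arise.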
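The challenge-then-verify step can be illustrated with an in-memory stand-in for the single master row. The sketch assumes that a challenge simply overwrites the row, so the last writer wins; the actual SQL behind the challenge is not shown in this article, and `MasterRow`, `challenge` and `wonElection` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

/** In-memory stand-in for the single master row; the real design uses a MySQL table. */
final class MasterRow {
    private final AtomicReference<String> masterId = new AtomicReference<>();

    /** Challenge: overwrite the row unconditionally, so the last writer wins (assumption). */
    void challenge(String uniqueId) {
        masterId.set(uniqueId);
    }

    /** Verification: after waiting, re-read the row and compare it with our own ID. */
    boolean wonElection(String uniqueId) {
        return uniqueId.equals(masterId.get());
    }
}

final class ChallengeDemo {
    public static void main(String[] args) {
        MasterRow row = new MasterRow();
        row.challenge("10.0.0.2");   // both survivors see the expired lease
        row.challenge("10.0.0.3");   // and both write; this write lands last
        System.out.println("10.0.0.2 won: " + row.wonElection("10.0.0.2"));
        System.out.println("10.0.0.3 won: " + row.wonElection("10.0.0.3"));
    }
}
```

Because there is only one row, at most one ID can be in it when the nodes read it back, which is what lets every loser demote itself without talking to the winner.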
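Each server follows this flow:
Database design:
The Master information in the database at a given moment:
Current time: 45 min 15 s
Current Master lease: 6 s
Current Master lease valid until: 45 min 21 s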
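The deadline arithmetic in this snapshot can be reproduced with `java.time`. This is a small sketch only: `LeaseDeadline` and its methods are hypothetical names, the hour is arbitrary because the snapshot gives only minutes and seconds, and the heartbeat is assumed to have been written at 45 min 15 s.

```java
import java.time.Duration;
import java.time.LocalTime;

final class LeaseDeadline {
    /** Deadline = last heartbeat + lease. */
    static LocalTime usableUntil(LocalTime heartbeat, Duration lease) {
        return heartbeat.plus(lease);
    }

    /** Remaining lease at 'now'; negative once the lease has expired. */
    static Duration remaining(LocalTime heartbeat, Duration lease, LocalTime now) {
        return Duration.between(now, usableUntil(heartbeat, lease));
    }

    public static void main(String[] args) {
        LocalTime heartbeat = LocalTime.of(0, 45, 15);   // hour chosen arbitrarily
        Duration lease = Duration.ofSeconds(6);
        System.out.println("usable until: " + usableUntil(heartbeat, lease));
        System.out.println("remaining at 45:18 (s): "
                + remaining(heartbeat, lease, LocalTime.of(0, 45, 18)).getSeconds());
    }
}
```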
----------------------------------------------------------------------------------------------
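3. Applicable scenarios
1. MySQL is available throughout the service's lifetime, and the servers' clocks are synchronized.
2. The cluster must elect a single master to serve requests while the other nodes stand by as slaves and compete to become Master when the master's lease expires.
3. Unlike ZooKeeper, the scheme keeps working even when half of the cluster is down, for example with just one master and one standby.
4. The system tolerates second-level disruption during an election: there may be a window of lease/2 seconds in which the service is unavailable.
5. The system tolerates a dual-Master situation lasting up to lease/2 seconds in the extreme case, although the probability is low.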
----------------------------------------------------------------------------------------------
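4. Java implementation
Configuration values and the time- and sleep-related variables:
// Polling interval in milliseconds; lease is still in milliseconds at this point.
final long interval = lease / intervalDivisor;
long waitForLeaseChallenging = 0L;
// From here on lease is in seconds.
lease = lease / 1000L;
long challengeFailTimes = 0L;   // failed challenges since the last long rest
long takeRest = 0L;             // sleep time for the current iteration (ms)
long dbExceptionTimes = 0L;     // consecutive database failures
long offlineTime = 0L;          // when the node entered OFFLINE (ms)
Random rand = new Random();
Status stateMechine = Status.START;
long activeNodeLease = 0L;      // lease of the row read most recently
long activeNodeTimeStamp = 0L;  // heartbeat timestamp of the row read most recently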
Handling database exceptions:
KeepAlive keepaliveNode = null;
try {
    /* first, read the current master row from MySQL */
    keepaliveNode = dbService.accquireAliveNode();
    if (stateMechine != Status.START && keepaliveNode == null)
        throw new Exception();
    // reset the counter so that isolated network jitter is not counted
    dbExceptionTimes = 0L;
} catch (Exception e) {
    log.fatal("[Scanner] Database Exception with times : " + dbExceptionTimes++);
    if (stateMechine == Status.OFFLINE) {
        log.warn("[Scanner] Database Exception , OFFLINE ");
    } else if (dbExceptionTimes >= 3) {
        // three consecutive failures: take this node offline
        log.fatal("[Scanner] Database Exception , Node Offline Mode Active , uniqueid : " + uniqueID);
        stateMechine = Status.OFFLINE;
        dbExceptionTimes = 0L;
        offlineTime = System.currentTimeMillis();
        online = false;
    } else
        continue;
}
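The counting rule in this excerpt (reset on success, go offline on the third consecutive failure) can be isolated into a small helper. `DbFailureTracker` is a hypothetical name used only for this sketch; the listing keeps the counter inline.

```java
/** Counts consecutive database failures and reports when the node should go offline. */
final class DbFailureTracker {
    private final int threshold;
    private int consecutiveFailures;

    DbFailureTracker(int threshold) {
        this.threshold = threshold;
    }

    /** A successful read resets the counter, so isolated network jitter is forgiven. */
    void recordSuccess() {
        consecutiveFailures = 0;
    }

    /** @return true when this failure reaches the threshold and the node should go OFFLINE */
    boolean recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= threshold) {
            consecutiveFailures = 0;   // start counting afresh, as the listing does
            return true;
        }
        return false;
    }
}
```

With a threshold of 3, two failures followed by a success never trigger the offline transition; only three failures in a row do.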
The main loop and the state-machine transitions:
while (true) {
    SqlSession session = dbConnecction.openSession();
    ActionScanMapper dbService = session.getMapper(ActionScanMapper.class);
    KeepAlive keepaliveNode = null;
    try {
        /* first, read the current master row from MySQL */
        keepaliveNode = dbService.accquireAliveNode();
        if (stateMechine != Status.START && keepaliveNode == null)
            throw new Exception();
        // reset the counter so that isolated network jitter is not counted
        dbExceptionTimes = 0L;
    } catch (Exception e) {
        log.fatal("[Scanner] Database Exception with times : " + dbExceptionTimes++);
        if (stateMechine == Status.OFFLINE) {
            log.warn("[Scanner] Database Exception , OFFLINE ");
        } else if (dbExceptionTimes >= 3) {
            // three consecutive failures: take this node offline
            log.fatal("[Scanner] Database Exception , Node Offline Mode Active , uniqueid : " + uniqueID);
            stateMechine = Status.OFFLINE;
            dbExceptionTimes = 0L;
            offlineTime = System.currentTimeMillis();
            online = false;
        } else {
            session.close(); // release the session before retrying
            continue;
        }
    }
    try {
        activeNodeLease = keepaliveNode != null ? keepaliveNode.getLease() : activeNodeLease;
        activeNodeTimeStamp = keepaliveNode != null ? keepaliveNode.getTimestamp() : activeNodeTimeStamp;
        takeRest = interval;
        switch (stateMechine) {
            case START:
                if (keepaliveNode == null) {
                    log.fatal("[START] Accquire node is null , ignore ");
                    // no node has registered yet, so we challenge for the role
                    stateMechine = Status.CHALLENGE_REGISTER;
                    takeRest = 0;
                } else {
                    // check the lease, whether it is our own or another node's
                    if (activeNodeLease < timestampGap(activeNodeTimeStamp)) {
                        log.warn("[START] Lease Timeout scanner for uniqueid : " + uniqueID + ", timeout : "
                                + timestampGap(activeNodeTimeStamp));
                        if (keepaliveNode.getStatus().equals(STAT_CHALLENGE))
                            stateMechine = Status.HEARTBEAT;
                        else {
                            stateMechine = Status.CHALLENGE_MASTER;
                            takeRest = 0;
                        }
                    } else if (keepaliveNode.getUniqueID().equals(uniqueID)) {
                        // this node has restarted and still owns the row
                        log.info("[START] Restart Scanner for uniqueid : " + uniqueID
                                + ", timeout : " + timestampGap(activeNodeTimeStamp));
                        stateMechine = Status.HEARTBEAT;
                    } else {
                        log.info("[START] Already Exist Keepalive Node with uniqueid : " + uniqueID);
                        stateMechine = Status.HEARTBEAT;
                    }
                }
                break;
            case HEARTBEAT:
                /* the row is ours: uniqueID == keepaliveNode.uniqueID */
                if (keepaliveNode.getUniqueID().equals(uniqueID)) {
                    if (activeNodeLease < timestampGap(activeNodeTimeStamp)) {
                        // our own lease has expired: challenge right away, no need to check Status[CHALLENGE]
                        log.warn("[HEARTBEAT] HEART BEAT Lease is timeout for uniqueid : " + uniqueID
                                + ", time : " + timestampGap(activeNodeTimeStamp));
                        stateMechine = Status.CHALLENGE_MASTER;
                        takeRest = 0;
                        break;
                    } else {
                        // lease still valid: just refresh the keepalive row in MySQL
                        dbService.updateAliveNode(keepaliveNode.setLease(lease));
                        online = true;
                        log.info("[HEARTBEAT] update equaled keepalive node , uniqueid : " + uniqueID
                                + ", lease : " + lease + "s, remain_usable : " +
                                ((activeNodeTimeStamp * 1000L + lease * 1000L) - System.currentTimeMillis()) + " ms");
                    }
                } else {
                    /* the row belongs to another node: check its lease */
                    if (activeNodeLease < timestampGap(activeNodeTimeStamp)) {
                        if (keepaliveNode.getStatus().equals(STAT_CHALLENGE)) {
                            // another node is already challenging: give it extra time before joining in
                            waitForLeaseChallenging = (long) (activeNodeLease * awaitFactor);
                            if ((waitForLeaseChallenging) < timestampGap(activeNodeTimeStamp)) {
                                log.info("[HEARTBEAT] Lease Expired , Diff[" + timestampGap(activeNodeTimeStamp) + "] , Lease[" + activeNodeLease + "]");
                                stateMechine = Status.CHALLENGE_MASTER;
                                takeRest = 0;
                            } else {
                                log.info("[HEARTBEAT] Other Node Challenging , We wait for a moment ...");
                            }
                        } else {
                            log.info("[HEARTBEAT] Lease Expired , Diff[" + timestampGap(activeNodeTimeStamp) + "] , lease[" + activeNodeLease + "]");
                            stateMechine = Status.CHALLENGE_MASTER;
                            takeRest = 0;
                        }
                    } else {
                        // another node holds a valid lease: stay a standby
                        online = false;
                        log.info("[HEARTBEAT] Exist Active Node On The Way with uniqueid : "
                                + keepaliveNode.getUniqueID() + ", lease : " + keepaliveNode.getLease());
                    }
                }
                break;
            case CHALLENGE_MASTER:
                dbService.challengeAliveNode(new KeepAlive().setUniqueID(uniqueID).setLease(lease));
                online = false;
                // wait for the expired node to go offline by itself,
                // which also gives the other nodes a chance to challenge
                // (activeNodeLease is in seconds, Thread.sleep takes milliseconds)
                takeRest = activeNodeLease * 1000L;
                stateMechine = Status.CHALLENGE_COMPLETE;
                log.info("[CHALLENGE_MASTER] Other Node is timeout["
                        + timestampGap(activeNodeTimeStamp) + "s] , I challenge with uniqueid : " + uniqueID
                        + ", lease : " + lease + ", wait : " + lease);
                break;
            case CHALLENGE_REGISTER:
                dbService.registerNewNode(new KeepAlive().setUniqueID(uniqueID).setLease(lease));
                online = false;
                // wait for the expired node to go offline by itself,
                // which also gives the other nodes a chance to challenge
                // (activeNodeLease is in seconds, Thread.sleep takes milliseconds)
                takeRest = activeNodeLease * 1000L;
                stateMechine = Status.CHALLENGE_COMPLETE;
                log.info("[CHALLENGE_REGISTER] Regiter Keepalive uniqueid : " + uniqueID + ", lease : " + lease);
                break;
            case CHALLENGE_COMPLETE:
                if (keepaliveNode.getUniqueID().equals(uniqueID)) {
                    // the row carries our ID: the challenge succeeded
                    dbService.updateAliveNode(keepaliveNode.setLease(lease));
                    online = true;
                    log.info("[CHALLENGE_COMPLETE] I Will be the Master uniqueid : " + uniqueID);
                    // make the uptime correct
                    stateMechine = Status.HEARTBEAT;
                } else {
                    online = false;
                    log.warn("[CHALLENGE_COMPLETE] So unlucky , Challenge Failed By Other Node with uniqueid : " + keepaliveNode.getUniqueID());
                    // Math.floorMod keeps the random offset non-negative; nextLong() % n can be negative
                    if (challengeFailTimes++ >= Math.floorMod(rand.nextLong(), maxChallenge) + minChallenge) {
                        // no need to challenge again for a long while
                        takeRest = maxChallengeAwaitInterval;
                        stateMechine = Status.HEARTBEAT;
                        challengeFailTimes = 0L;
                        log.info("[CHALLENGE_COMPLETE] Challenge Try Times Used Up , let's take a long rest !");
                    } else {
                        stateMechine = Status.HEARTBEAT;
                        log.info("[CHALLENGE_COMPLETE] Challenge Times : " + challengeFailTimes + ", Never Give Up , to[" + stateMechine + "]");
                    }
                }
                break;
            case OFFLINE:
                log.fatal("[Scanner] Offline Mode Node with uniqueid : " + uniqueID);
                if (System.currentTimeMillis() - offlineTime >= maxOfflineFrozen) {
                    // frozen long enough: force this node back into the election
                    log.info("[Scanner] I am relive to activie node , uniqueid : " + uniqueID);
                    stateMechine = Status.HEARTBEAT;
                    offlineTime = 0L;
                } else if (keepaliveNode != null) {
                    // the database is reachable again
                    stateMechine = Status.HEARTBEAT;
                    offlineTime = 0L;
                    log.info("[Scanner] I am relive to activie node , uniqueid : " + uniqueID);
                }
                break;
            default:
                System.exit(0);
        }
        session.commit();
        session.close();
        if (takeRest != 0)
            Thread.sleep(takeRest);
        log.info("[Scanner] State Stage [" + stateMechine + "]");
    } catch (InterruptedException e) {
        log.fatal("[System] Thread InterruptedException : " + e.getMessage());
    } finally {
        log.info("[Scanner] UniqueID : " + uniqueID + ", Mode : " + (online ? "online" : "offline"));
    }
}
}
enum Status {
    START, HEARTBEAT, CHALLENGE_MASTER, CHALLENGE_REGISTER, CHALLENGE_COMPLETE, OFFLINE
}
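The listing calls `timestampGap` without showing it and computes a randomized give-up threshold inline. The sketch below shows one plausible shape for both, under assumptions: the heartbeat timestamp is taken to be in epoch seconds (as the `remain_usable` expression in the listing suggests), and `ScannerHelpers` and `challengeBudget` are hypothetical names.

```java
import java.util.Random;

final class ScannerHelpers {
    /** Seconds elapsed since a heartbeat timestamp given in epoch seconds (assumed unit). */
    static long timestampGap(long timestampSeconds, long nowMillis) {
        return nowMillis / 1000L - timestampSeconds;
    }

    /**
     * Randomized number of failed challenges tolerated before a long rest,
     * in the range [minChallenge, minChallenge + maxChallenge).
     * maxChallenge must be positive.
     */
    static long challengeBudget(Random rand, long minChallenge, long maxChallenge) {
        // Math.floorMod never returns a negative value for a positive modulus,
        // unlike the % operator applied to a negative nextLong().
        return Math.floorMod(rand.nextLong(), maxChallenge) + minChallenge;
    }

    public static void main(String[] args) {
        long heartbeat = System.currentTimeMillis() / 1000L - 7;   // written 7 s ago
        System.out.println("gap (s): " + timestampGap(heartbeat, System.currentTimeMillis()));
        System.out.println("budget: " + challengeBudget(new Random(), 2L, 5L));
    }
}
```

Randomizing the budget means that nodes which keep losing do not all back off and come back at the same moment.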
- Author: GugeMichael; source: 曦轩 技术茶话小屋
- Tags: ACID, Lease, distributed systems
- Published: 2016-03-19 22:46:41