lua定位CPU100%问题

在开始这个话题前,说下我遇到的 skynet 进程 CPU100% 占用问题,查找 bug 的过程比较繁琐,后来想到做改进,这也是促成我写这篇文章的原因。

查到是一段 lua 代码有问题,这里截取其中出问题的代码:

-- 活动开始时间: 2033-01-01 00:00:00
local start_time = 1988121600

function start_activity()
    local now = os.time()
    local diff = now - start_time
    if diff < 0 then
        skynet.timeout((-diff)*100, start_activity)
        return
    end

    -- todo 省略活动内容
end

以上代码,通过 skynet.timeout 设置定时器,定时触发 start_activity , 看似没有问题,但是,当 skynet.timeout 第一个参数过大时,会有意想不到的收获。

skynet.timeout

开始分析问题,我这里找当前最新发布版本 skynet-1.6.0 做说明。

-- skynet.lua

function skynet.timeout(ti, func)
	local session = c.intcommand("TIMEOUT",ti)
	assert(session)
	local co = co_create_for_timeout(func, ti)
	assert(session_id_coroutine[session] == nil)
	session_id_coroutine[session] = co
	return co	-- for debug
end

以上,skynet.timeout 实现通过 c.intcommand("TIMEOUT",ti) ,到 c 态执行 skynet_server.c 的 cmd_timeout 函数。

// skynet_server.c

static const char *
cmd_timeout(struct skynet_context * context, const char * param) {
	char * session_ptr = NULL;
	int ti = strtol(param, &session_ptr, 10);  // 获得参数 ti
	int session = skynet_context_newsession(context);
	skynet_timeout(context->handle, ti, session);
	sprintf(context->result, "%d", session);
	return context->result;
}

这里的 int ti 就是利用 strtol 获取 skynet.timeout 的第一个参数值,也就是前面传的 ti,但实际上,两者是有差异的。

strtol 是一个标准 C 函数,作用就是将一个字符串转换为长整型,其函数原型为:
long int strtol (const char* str, char** endptr, int base);

以上代码中, skynet 将 long int 转成 int 看似没错,但实际上, long int 在 linux64 下是 8 个字节, int只是 4 个字节。

所以,将 long int 转成 int 会丢失精度,可能导致结果由正数变为负数。给例子说明下 long int 和 int 两者的差异。

#include <stdio.h>
#include <stdlib.h>
 
int main()
{
	char * session_ptr = NULL;
	char param[] = "30061624500";
	long int ti = strtol(param, &session_ptr, 10);
	printf("long int value %ld\n", ti);
	
	int tii = strtol(param, &session_ptr, 10);
	printf("int value %d\n", tii);
	
	printf("int size %d, long int size %d\n", sizeof(int), sizeof(long int));
	return(0);
}

linux64 编译运行后,结果如下:

再结合 skynet 代码,当 ti 为负数时,相当于 skynet.timeout(0, func) ,skynet 会发信息给自己,将 func 放到下一条消息处理。

int
skynet_timeout(uint32_t handle, int time, int session) {
	if (time <= 0) {
		struct skynet_message message;
		message.source = 0;
		message.session = session;
		message.data = NULL;
		message.sz = (size_t)PTYPE_RESPONSE << MESSAGE_TYPE_SHIFT;

		if (skynet_context_push(handle, &message)) {
			return -1;
		}
	} else {
		struct timer_event event;
		event.handle = handle;
		event.session = session;
		timer_add(TI, &event, sizeof(event), time);
	}

	return session;
}

这样,结合文章开头提到的函数, start_activity 发给消息给自己,“延迟一帧”再执行 start_activity ,如此反复,无限循环。虽然 skyent 线程本身不会陷入死循环无法调度,但大部分的cpu时间会被这个循环执行的“空操作”抢占。

顺带提下, skynet.sleep 也有这个问题,所以这种情况的调用要注意了。

local wait_time = 30061624500
while true do
	skynet.sleep(wait_time)
	-- do something
end

CPU 100% 定位

导致 CPU 100% 的原因可能有很多,比如密集型计算、频繁的上下文切换、死循环等等。lua 死循环的问题,我在以前的文章讨论过了,这里不再赘述,有兴趣的话可以翻开看看《找到 lua 死循环代码》。

所以现在讨论的是,如何定位死循环以外的 cpu 100% 代码。

要利用 lua 的调试函数 debug.sethook 和 debug.getinfo 。
debug.sethook 提供了一种 hook 的方式,可以提前注册一个 hook 函数,等待某些既定事件达成时,lua 会回调这个 hook 函数,给你一个观察 lua 内部函数调用过程的机会。

debug.sethook
函数原型:debug.sethook ([thread,] hook, mask [, count])
参数 mask 类型为 string,表示什么时候 hook 函数会被回调,可以包含以下字符:
'c': 每当 lua 调用函数时;
'r': 每当 lua 从函数返回时;
'l': 每当 lua 执行新的一行代码时;
参数 count 类型为 integer,当值大于零时生效,表示每执行完 count 数量的指令后回调 hook 函数。
另外,debug.sethook 如果不带参数时,表示关闭 hook。
当 hook 回调时,第一个参数表示事件名,有 "call", "return", "tail return", "line", "count",你可以在 hook 函数内,通过 debug.getinfo(2, what) 获取当前调用函数信息。

debug.getinfo
函数原型:debug.getinfo ([thread,] f [, what])
参数 f 类型为 integer,表示当前调用函数栈第几级的函数,0 获得的是 debug.getinfo, 1 获得的是当前调用 debug.getinfo 的函数,如果 debug.sethook 想获取当前回调 hook 时处理的函数,就要用 2
参数 what 类型为 string,表示要返回的字段,可以包含以下字符:
'S': 获取 source, short_src, linedefined, lastlinedefined, what;
'l': 获取 currentline;
'n': 获取 name, namewhat;
't': 获取 istailcall;
'u': 获取 nups, nparams, isvararg;
'r': 获取 ftransfer, ntransfer;
这里说明以上字段的含义,避免表述不当,就以 lua 原文档说明吧。

what字段 含义
source the source of the chunk that created the function. If source starts with a '@', it means that the function was defined in a file where the file name follows the '@'. If source starts with a '=', the remainder of its contents describes the source in a user-dependent manner. Otherwise, the function was defined in a string where source is that string.
short_src a "printable" version of source, to be used in error messages.
linedefined the line number where the definition of the function starts.
lastlinedefined the line number where the definition of the function ends.
what the string "Lua" if the function is a Lua function, "C" if it is a C function, "main" if it is the main part of a chunk.
currentline the current line where the given function is executing. When no line information is available, currentline is set to -1.
name a reasonable name for the given function. Because functions in Lua are first-class values, they do not have a fixed name: some functions can be the value of multiple global variables, while others can be stored only in a table field. The lua_getinfo function checks how the function was called to find a suitable name. If it cannot find a name, then name is set to NULL.
namewhat explains the name field. The value of namewhat can be "global", "local", "method", "field", "upvalue", or "" (the empty string), according to how the function was called. (Lua uses the empty string when no other option seems to apply.)
istailcall true if this function invocation was called by a tail call. In this case, the caller of this level is not in the stack.
nups the number of upvalues of the function.
nparams the number of parameters of the function (always 0 for C functions).
isvararg true if the function is a variadic function (always true for C functions).
ftransfer the index in the stack of the first value being "transferred", that is, parameters in a call or return values in a return. (The other values are in consecutive indices.) Using this index, you can access and modify these values through lua_getlocal and lua_setlocal. This field is only meaningful during a call hook, denoting the first parameter, or a return hook, denoting the first value being returned. (For call hooks, this value is always 1.)
ntransfer The number of values being transferred (see previous item). (For calls of Lua functions, this value is always equal to nparams.)

所以,我们可以写个统计 hook 函数,统计一段时间内的所有函数调用次数,以及调用时消耗的时间,来定位问题。
代码如下:

local reports = {}

local function _get_key(funcinfo)
    local name = funcinfo.name or '[anon]'
    local line = funcinfo.linedefined or 0
    local source = funcinfo.short_src or '[C]'
    return string.format("%s|%s|%s", name, source, line)
end

local function _tracing_call(funcinfo)
    local key = _get_key(funcinfo)
    if not reports[key] then
        reports[key] = {
            callcount = 0,
            totaltime = 0,
        }
    end
    local report = reports[key]
    report.calltime    = os.clock()
    report.callcount   = report.callcount + 1
end

local function _tracing_return(funcinfo)
    local key = _get_key(funcinfo)
    local stoptime = os.clock()
    local report = reports[key]
    if report and report.calltime and report.calltime > 0 then
        report.totaltime = report.totaltime + (stoptime-report.calltime)
        report.calltime = 0
    end
end

local function _tracing_handler(hooktype)
    local funcinfo = debug.getinfo(2, 'nS')
    if hooktype == "call" then
        _tracing_call(funcinfo)
    elseif hooktype == "return" then
        _tracing_return(funcinfo)
    end
end



local function stat_start()
    debug.sethook(_tracing_handler, 'cr', 0)
end

local function stat_stop()
    local rep = reports
    debug.sethook()
    reports = {}
    for k,v in pairs(rep) do
        print(string.format("%s callcount %s totaltime %.5f", 
            k, v.callcount, v.totaltime))
    end
end

以上,调用 stat_start 开始统计,调用 stat_stop 结束统计。

文章到这里就结束了,感谢阅读!

发表评论

邮箱地址不会被公开。 必填项已用*标注