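Today a colleague was bitten by Bug #42981. After reading the article they shared, I decided the bug deserved a closer look. This post mainly explains how acpi_pad works.
Module registration
When a kernel module is loaded, its init function runs first; for acpi_pad that function is acpi_pad_init. acpi_pad_init eventually calls driver_register to register acpi_pad_driver.drv with the system.
acpi_pad_driver is defined as follows: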
```c
static struct acpi_driver acpi_pad_driver = {
	.name = "processor_aggregator",
	.class = ACPI_PROCESSOR_AGGREGATOR_CLASS,
	.ids = pad_device_ids,
	.ops = {
		.add = acpi_pad_add,
		.remove = acpi_pad_remove,
	},
};
```
There is no .drv field in that initializer? Let's look at the definition of struct acpi_driver:
```c
struct acpi_driver {
	char name[80];
	char class[80];
	const struct acpi_device_id *ids;
	unsigned int flags;
	struct acpi_device_ops ops;
	struct device_driver drv;
	struct module *owner;
};
```
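Note that acpi_driver embeds a device_driver struct directly rather than referring to it through a pointer. This matters, because it is what later allows the outer acpi_driver to be recovered from the address of its drv member.
Still, acpi_pad_driver.drv is never initialized in the definition above. After some digging, I found the initialization code in acpi_bus_register_driver: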
```c
driver->drv.name = driver->name;
driver->drv.bus = &acpi_bus_type;
driver->drv.owner = driver->owner;
```
At this point, driver is a pointer to acpi_pad_driver.
acpi_bus_type is defined as follows:
```c
struct bus_type acpi_bus_type = {
	.name = "acpi",
	.suspend = acpi_device_suspend,
	.resume = acpi_device_resume,
	.match = acpi_bus_match,
	.probe = acpi_device_probe,
	.remove = acpi_device_remove,
	.uevent = acpi_device_uevent,
};
```
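Once the driver is registered, the function to watch is acpi_device_probe. This is the path that actually creates the idlecpus file in sysfs (the file through which users control the acpi_pad feature).
The function static int acpi_device_probe(struct device * dev) is invoked by the kernel through the bus's .probe hook, so it is effectively a callback: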
```c
static int acpi_device_probe(struct device * dev)
{
	struct acpi_device *acpi_dev = to_acpi_device(dev);
	struct acpi_driver *acpi_drv = to_acpi_driver(dev->driver);
	int ret;

	ret = acpi_bus_driver_init(acpi_dev, acpi_drv);
	return ret;
}
```
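to_acpi_driver is simply a wrapper around the container_of macro. It converts the address of the drv member inside a struct acpi_driver into the address of the acpi_driver itself (in other words, given the address of an embedded struct, it recovers the address of the enclosing struct):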
```c
#define container_of(ptr, type, member) ({ \
	const typeof( ((type *)0)->member ) *__mptr = (ptr); \
	(type *)( (char *)__mptr - offsetof(type,member) );})
```
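To make the pointer arithmetic concrete, here is a minimal userspace sketch. The demo_* structs and functions are hypothetical stand-ins for struct device_driver and struct acpi_driver, and the macro is a simplified portable variant without the GNU typeof check, so treat it as an illustration rather than the kernel code.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof */

/* Simplified container_of: subtract the member's offset from its address. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct device_driver. */
struct demo_device_driver {
	const char *name;
};

/* Hypothetical stand-in for struct acpi_driver. */
struct demo_acpi_driver {
	char name[80];
	int (*add)(void);               /* plays the role of ops.add */
	struct demo_device_driver drv;  /* embedded by value, not a pointer */
};

static int demo_add(void)
{
	return 42;
}

static struct demo_acpi_driver demo_pad_driver = {
	.name = "processor_aggregator",
	.add = demo_add,
};

/* The "bus" only sees the embedded generic driver and must find the outer one. */
static int demo_probe(struct demo_device_driver *drv)
{
	struct demo_acpi_driver *outer =
		demo_container_of(drv, struct demo_acpi_driver, drv);

	assert(outer->add != NULL);
	return outer->add();
}
```

Calling demo_probe(&demo_pad_driver.drv) reaches demo_add through the recovered outer struct. This only works because drv lives inside the outer struct; with a pointer member there would be no fixed offset to subtract.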
Inside acpi_bus_driver_init, acpi_device_probe ends up calling acpi_pad_driver.ops.add, which is acpi_pad_add. From there, acpi_pad_add_sysfs attaches idlecpus to sysfs:
```c
static int acpi_pad_add_sysfs(struct acpi_device *device)
{
	int result;
	result = device_create_file(&device->dev, &dev_attr_idlecpus);
	return 0;
}
```
dev_attr_idlecpus is defined as:
```c
static DEVICE_ATTR(idlecpus, S_IRUGO|S_IWUSR,
	acpi_pad_idlecpus_show,
	acpi_pad_idlecpus_store);
```
The macro expands to the definition of a struct variable, struct device_attribute dev_attr_idlecpus.
The file's read and write handlers are acpi_pad_idlecpus_show and acpi_pad_idlecpus_store, respectively.
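The naming scheme can be sketched in userspace as follows. DEMO_DEVICE_ATTR, struct demo_attribute and the two handlers are hypothetical and much simpler than the kernel's versions (the handler signatures in particular are invented for the demo); the sketch only shows how token pasting yields a variable called dev_attr_idlecpus that carries the show and store callbacks.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct device_attribute. */
struct demo_attribute {
	const char *name;
	unsigned int mode;
	int (*show)(char *buf, size_t len);   /* read handler */
	int (*store)(const char *buf);        /* write handler */
};

/* Hypothetical macro: pastes the name into dev_attr_<name>. */
#define DEMO_DEVICE_ATTR(_name, _mode, _show, _store) \
	struct demo_attribute dev_attr_##_name = {        \
		.name = #_name, .mode = (_mode),              \
		.show = (_show), .store = (_store),           \
	}

static unsigned long demo_idle_cpus;

static int demo_idlecpus_show(char *buf, size_t len)
{
	return snprintf(buf, len, "%lu\n", demo_idle_cpus);
}

static int demo_idlecpus_store(const char *buf)
{
	char *end;
	unsigned long val;

	assert(buf != NULL);
	val = strtoul(buf, &end, 0);
	if (end == buf)
		return -1;   /* no digits parsed */
	demo_idle_cpus = val;
	return 0;
}

/* 0644 is the numeric value of S_IRUGO|S_IWUSR. */
static DEMO_DEVICE_ATTR(idlecpus, 0644, demo_idlecpus_show, demo_idlecpus_store);
```

A write to the file would then be routed to dev_attr_idlecpus.store and a read to dev_attr_idlecpus.show, which is the role the real handlers play for the sysfs file.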
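At this point the acpi_pad module is loaded and the idlecpus file exists in sysfs.
Changing CPU state through acpi_pad
According to the reproduction steps in the bug report: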
to make the failure more likely:
# echo 1 > rrtime
# echo 31 > idlecpus; echo 0 > idlecpus
# echo 31 > idlecpus; echo 0 > idlecpus
# echo 31 > idlecpus; echo 0 > idlecpus
(it usually takes only a few attempts)
etc. until the echo does not return
Through the idlecpus node we first idle 31 CPUs and then bring them back; after a few attempts the bug reproduces.
Each of these writes calls acpi_pad_idlecpus_store:
```c
static ssize_t acpi_pad_idlecpus_store(struct device *dev,
	struct device_attribute *attr, const char *buf, size_t count)
{
	unsigned long num;
	if (strict_strtoul(buf, 0, &num))
		return -EINVAL;
	mutex_lock(&isolated_cpus_lock);
	acpi_pad_idle_cpus(num);
	mutex_unlock(&isolated_cpus_lock);
	return count;
}
```
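It converts the user-supplied buf to an unsigned long and then takes the isolated_cpus_lock mutex (this lock is what leads to the bug mentioned above). With the lock held, it calls acpi_pad_idle_cpus to idle the number of CPUs the user asked for: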
```c
static void acpi_pad_idle_cpus(unsigned int num_cpus)
{
	get_online_cpus();

	num_cpus = min_t(unsigned int, num_cpus, num_online_cpus());
	set_power_saving_task_num(num_cpus);
	put_online_cpus();
}
```
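The logic of set_power_saving_task_num is simple. It compares the requested number with the current number of power_saving_thread threads: if there are too few, it creates more with create_power_saving_task; if there are too many, it removes some with destroy_power_saving_task.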
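The clamp-then-adjust logic described above can be sketched in userspace like this. Everything here is a hypothetical stand-in: DEMO_ONLINE_CPUS replaces num_online_cpus(), and the create/destroy helpers only bump a counter instead of starting or stopping real threads.

```c
#include <assert.h>

#define DEMO_ONLINE_CPUS 8u   /* assumed machine size for the demo */

static unsigned int demo_task_num;   /* number of "running" workers */

/* Stand-in for create_power_saving_task: returns 0 on success. */
static int demo_create_task(void)
{
	demo_task_num++;
	return 0;
}

/* Stand-in for destroy_power_saving_task. */
static void demo_destroy_task(void)
{
	assert(demo_task_num > 0);
	demo_task_num--;
}

/* Grow or shrink the worker pool until it matches the target. */
static void demo_set_task_num(unsigned int num)
{
	while (demo_task_num < num) {
		if (demo_create_task())
			return;   /* give up if a worker cannot be created */
	}
	while (demo_task_num > num)
		demo_destroy_task();
}

/* Clamp the request to the number of online CPUs, then adjust the pool. */
static unsigned int demo_idle_cpus(unsigned int requested)
{
	unsigned int num = requested < DEMO_ONLINE_CPUS ? requested : DEMO_ONLINE_CPUS;

	demo_set_task_num(num);
	return demo_task_num;
}
```

With 8 assumed online CPUs, a request for 31 is clamped to 8 workers, and a following request for 0 removes all of them. That is the create-then-destroy sequence the reproduction steps exercise.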
destroy_power_saving_task calls kthread_stop to end the surplus power_saving_thread threads. kthread_stop sets should_stop to 1 on the corresponding kthread and then waits for that kthread to exit:
```c
kthread->should_stop = 1;
wake_up_process(k);
wait_for_completion(&kthread->exited);
```
But while it waits for the kthread to exit, the caller is still holding isolated_cpus_lock!
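The stop handshake itself can be sketched with POSIX threads. This is an analogy under stated assumptions, not kernel code: struct demo_kthread and the demo_* functions are hypothetical, an atomic flag plays the role of should_stop, and pthread_join plays the role of wait_for_completion.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace analogue of a kthread with a stop flag. */
struct demo_kthread {
	pthread_t thread;
	atomic_int should_stop;
};

/* Worker loop: keeps running until someone asks it to stop. */
static void *demo_worker(void *arg)
{
	struct demo_kthread *k = arg;

	while (!atomic_load(&k->should_stop))
		sched_yield();   /* placeholder for the real work */
	return NULL;
}

static int demo_kthread_run(struct demo_kthread *k)
{
	assert(k != NULL);
	atomic_init(&k->should_stop, 0);
	return pthread_create(&k->thread, NULL, demo_worker, k);
}

/* Set the flag, then block until the worker has really exited. */
static int demo_kthread_stop(struct demo_kthread *k)
{
	assert(k != NULL);
	atomic_store(&k->should_stop, 1);
	return pthread_join(k->thread, NULL);
}
```

The important property is that demo_kthread_stop does not return until the worker leaves its loop. If the worker is blocked somewhere and never gets back to checking the flag, the stopper waits forever.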
Let's see what power_saving_thread actually does to trigger the bug.
```c
static int power_saving_thread(void *data)
{

	while (!kthread_should_stop()) {
		int cpu;
		u64 expire_time;

		try_to_freeze();

		if (last_jiffies + round_robin_time * HZ < jiffies) {
			last_jiffies = jiffies;
			round_robin_cpu(tsk_index);
		}
	}
}
```
Nothing looks wrong in this excerpt, so let's look at the code of round_robin_cpu:
```c
static void round_robin_cpu(unsigned int tsk_index)
{
	mutex_lock(&isolated_cpus_lock);
	cpumask_clear(tmp);
	mutex_unlock(&isolated_cpus_lock);

	set_cpus_allowed_ptr(current, cpumask_of(preferred_cpu));
}
```
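This code takes isolated_cpus_lock, and that is what causes the bug.
How the bug occurs
On one side, acpi_pad_idlecpus_store holds isolated_cpus_lock and, through destroy_power_saving_task, calls kthread_stop and waits for power_saving_thread to exit.
On the other side, the kthread being stopped (power_saving_thread) is inside round_robin_cpu, waiting for isolated_cpus_lock. The writer and the kthread each wait for the other, which is a deadlock.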
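The cycle can be demonstrated safely in userspace with POSIX threads. In this sketch the worker uses pthread_mutex_trylock, so the program reports that it would have blocked instead of actually hanging. All demo_* names are hypothetical, and giving the worker its own lock is shown only as one possible way to break the cycle under these assumptions, not as a statement of how the kernel was fixed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t demo_isolated_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t demo_private_lock = PTHREAD_MUTEX_INITIALIZER;

struct demo_result {
	pthread_mutex_t *lock;   /* lock the worker wants to take */
	bool would_block;        /* set when the lock was unavailable */
};

/* Plays the role of round_robin_cpu: needs its lock before it can finish. */
static void *demo_round_robin(void *arg)
{
	struct demo_result *r = arg;

	if (pthread_mutex_trylock(r->lock) == 0) {
		r->would_block = false;
		pthread_mutex_unlock(r->lock);
	} else {
		/* A blocking lock here would never return. */
		r->would_block = true;
	}
	return NULL;
}

/* Returns true when stopping the worker would deadlock. */
static bool demo_stop_would_deadlock(bool worker_shares_lock)
{
	struct demo_result r = {
		.lock = worker_shares_lock ? &demo_isolated_lock : &demo_private_lock,
		.would_block = false,
	};
	pthread_t worker;
	int err;

	pthread_mutex_lock(&demo_isolated_lock);    /* like acpi_pad_idlecpus_store */
	err = pthread_create(&worker, NULL, demo_round_robin, &r);
	assert(err == 0);
	pthread_join(worker, NULL);                 /* like kthread_stop waiting */
	pthread_mutex_unlock(&demo_isolated_lock);
	return r.would_block;
}
```

demo_stop_would_deadlock(true) reproduces the situation in this post: the stopper holds the lock the worker needs while it waits for the worker. demo_stop_would_deadlock(false) shows that the wait completes once the two sides no longer contend for the same lock.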