Today a colleague got bitten by Bug #42981. After reading the article they shared, I thought it was worth analyzing this bug. This post explains how acpi_pad works.
Module registration

When a kernel module is loaded, its init function runs first. The init function acpi_pad registers is acpi_pad_init, which ultimately calls driver_register to register acpi_pad_driver.drv with the system.
acpi_pad_driver is defined as follows:
```c
static struct acpi_driver acpi_pad_driver = {
    .name = "processor_aggregator",
    .class = ACPI_PROCESSOR_AGGREGATOR_CLASS,
    .ids = pad_device_ids,
    .ops = {
        .add = acpi_pad_add,
        .remove = acpi_pad_remove,
    },
};
```
No .drv field in sight? Take a look at the definition of struct acpi_driver:
```c
struct acpi_driver {
    char name[80];
    char class[80];
    const struct acpi_device_id *ids;
    unsigned int flags;
    struct acpi_device_ops ops;
    struct device_driver drv;
    struct module *owner;
};
```
Note that acpi_driver embeds a device_driver struct directly rather than referencing one through a pointer. This detail matters later.
But acpi_pad_driver.drv is never initialized in that definition! Digging around, I found the initialization code (in acpi_bus_register_driver):
```c
driver->drv.name = driver->name;
driver->drv.bus = &acpi_bus_type;
driver->drv.owner = driver->owner;
```
Here, driver is a pointer to acpi_pad_driver.
acpi_bus_type is defined as follows:
```c
struct bus_type acpi_bus_type = {
    .name = "acpi",
    .suspend = acpi_device_suspend,
    .resume = acpi_device_resume,
    .match = acpi_bus_match,
    .probe = acpi_device_probe,
    .remove = acpi_device_remove,
    .uevent = acpi_device_uevent,
};
```
Once the driver is registered, the function to watch is acpi_device_probe. It is the one that actually creates the idlecpus file in sysfs (this file is the entry point through which users control the acpi_pad feature).
The function acpi_device_probe is invoked by the kernel, effectively as a callback:

```c
static int acpi_device_probe(struct device *dev)
{
    struct acpi_device *acpi_dev = to_acpi_device(dev);
    struct acpi_driver *acpi_drv = to_acpi_driver(dev->driver);
    int ret;

    ret = acpi_bus_driver_init(acpi_dev, acpi_drv);
    return ret;
}
```
to_acpi_driver is just the container_of macro: given the address of the drv member inside a struct acpi_driver, it recovers the address of the acpi_driver itself (that is, it derives the parent struct's address from an embedded member's address):
```c
#define container_of(ptr, type, member) ({                      \
    const typeof( ((type *)0)->member ) *__mptr = (ptr);        \
    (type *)( (char *)__mptr - offsetof(type,member) );})
```
acpi_device_probe eventually, via acpi_bus_driver_init, calls acpi_pad_driver.ops.add, i.e. acpi_pad_add. That in turn calls acpi_pad_add_sysfs, which binds idlecpus into sysfs:
```c
static int acpi_pad_add_sysfs(struct acpi_device *device)
{
    int result;

    result = device_create_file(&device->dev, &dev_attr_idlecpus);
    return 0;
}
```
dev_attr_idlecpus is defined as:
```c
static DEVICE_ATTR(idlecpus, S_IRUGO|S_IWUSR,
    acpi_pad_idlecpus_show,
    acpi_pad_idlecpus_store);
```
This expands into the definition of a struct variable, struct device_attribute dev_attr_idlecpus. The file's read and write handlers are acpi_pad_idlecpus_show and acpi_pad_idlecpus_store, respectively.
At this point the acpi_pad module is fully loaded, and the idlecpus file is in place in sysfs.
Changing CPU state through acpi_pad

According to the bug's reproduction instructions:
```
to make the failure more likely:

# echo 1 > rrtime
# echo 31 > idlecpus; echo 0 > idlecpus
# echo 31 > idlecpus; echo 0 > idlecpus
# echo 31 > idlecpus; echo 0 > idlecpus

(it usually takes only a few attempts)
etc. until the echo does not return
```
Through the idlecpus node we first idle 31 CPUs, then bring them back; after a few tries, the bug reproduces.
During this process, acpi_pad_idlecpus_store is called:
```c
static ssize_t acpi_pad_idlecpus_store(struct device *dev,
    struct device_attribute *attr, const char *buf, size_t count)
{
    unsigned long num;

    if (strict_strtoul(buf, 0, &num))
        return -EINVAL;
    mutex_lock(&isolated_cpus_lock);
    acpi_pad_idle_cpus(num);
    mutex_unlock(&isolated_cpus_lock);
    return count;
}
```
It converts the user-supplied buf into a long and takes the isolated_cpus_lock mutex (this lock is what leads to the bug mentioned earlier). It then calls acpi_pad_idle_cpus to idle the requested number of CPUs:
```c
static void acpi_pad_idle_cpus(unsigned int num_cpus)
{
    get_online_cpus();

    num_cpus = min_t(unsigned int, num_cpus, num_online_cpus());
    set_power_saving_task_num(num_cpus);

    put_online_cpus();
}
```
The logic of set_power_saving_task_num is simple: it compares the requested count with the current number of power_saving_thread threads. If there are too few, create_power_saving_task tops them up; if there are too many, destroy_power_saving_task removes the surplus.
destroy_power_saving_task calls kthread_stop to terminate the surplus power_saving_thread threads. kthread_stop sets the corresponding kthread's should_stop to 1 and then waits for the kthread to exit:
```c
kthread->should_stop = 1;
wake_up_process(k);
wait_for_completion(&kthread->exited);
```
But while it waits for the kthread to exit, it is still holding isolated_cpus_lock!
Let's look at what power_saving_thread actually does to cause the bug.
```c
static int power_saving_thread(void *data)
{
    while (!kthread_should_stop()) {
        int cpu;
        u64 expire_time;

        try_to_freeze();

        if (last_jiffies + round_robin_time * HZ < jiffies) {
            last_jiffies = jiffies;
            round_robin_cpu(tsk_index);
        }
        /* ... */
    }
}
```
Nothing looks wrong so far. Now look at round_robin_cpu:
```c
static void round_robin_cpu(unsigned int tsk_index)
{
    mutex_lock(&isolated_cpus_lock);

    cpumask_clear(tmp);

    mutex_unlock(&isolated_cpus_lock);
    set_cpus_allowed_ptr(current, cpumask_of(preferred_cpu));
}
```
The code takes isolated_cpus_lock here, and that is what triggers the bug.
How the bug arises

On one side, acpi_pad_idlecpus_store takes isolated_cpus_lock and then calls kthread_stop, waiting for a power_saving_thread to exit.
On the other side, that very kthread, running power_saving_thread, sits in round_robin_cpu waiting for isolated_cpus_lock. The two threads wait on each other: a deadlock.