yupeng's Blog

Friday, February 11, 2011

The flow of one full-stripe write in Linux kernel RAID5 (based on kernel 2.6.33)

Table of Contents
=================
1 Order in which RAID5 writes data
2 The make_request function
    2.1 raid5_compute_sector
    2.2 get_active_stripe
        2.2.1 get_free_stripe
        2.2.2 init_stripe
            2.2.2.1 raid5_build_block
        2.2.3 The rest of get_active_stripe
        2.2.4 __find_stripe
    2.3 add_stripe_bio
    2.4 Between add_stripe_bio and release_stripe
    2.5 release_stripe
        2.5.1 __release_stripe
3 raid5d
    3.1 __get_priority_stripe
    3.2 handle_stripe
        3.2.1 handle_stripe5
            3.2.1.1 handle_stripe_dirtying5
                3.2.1.1.1 schedule_reconstruction
            3.2.1.2 raid_run_ops
            3.2.1.3 End of handle_stripe5
    3.3 release_stripe
4 ops_complete_reconstruct
5 Entering raid5d a second time
    5.1 ops_run_io
6 raid5_end_write_request
7 Entering raid5d a third time
    7.1 handle_stripe_clean_event
    7.2 release_stripe

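1 Order in which RAID5 writes data
###################################
Assume the stripe size is 4k, the chunk size is 32k, and the array has 4
disks. The first 4k of data is written to the first stripe of the first
chunk on the first disk, as shown in Figure 1:

Figure 1

The orange blocks in the figure are written data, and the blue blocks are
where parity will be stored. Only the first chunk is drawn, so all of the
parity shown lives on the last disk.
The next 4k of data goes to the second stripe of the first chunk on the
first disk, as shown in Figure 2:

Figure 2

After eight 4k writes, the first chunk on the first disk is full, as shown
in Figure 3:

Figure 3

Once these 32k have been written, the next 4k goes to the first stripe of
the first chunk on the second disk, as shown in Figure 4:

Figure 4

This continues until the first 4k on the third disk has been written. At
that point the first stripe, stripe0, is full, as shown in Figure 5:

Figure 5

In raid5's make_request function, once a stripe has been found to describe
the current write, the stripe is put on a global list and the raid5d
daemon is woken up to process it. make_request returns right away, without
waiting for raid5d to actually write the data to disk. So when a large
amount of data is written sequentially, make_request is called again and
again; each call picks a stripe to describe its piece of work and returns.
By default 256 stripes are available. When all 256 are in use and another
write arrives, make_request goes to sleep and is woken only after at least
a quarter of the stripes have been released; it then uses a newly released
stripe to describe the new write.
Let us follow the data in the first stripe (stripe0) through the whole
process.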

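2 The make_request function
############################
For a normal write, make_request calls four important functions. They are
described in order below.

2.1 raid5_compute_sector
~~~~~~~~~~~~~~~~~~~~~~~~~
    First, raid5_compute_sector is called to convert logical_sector, the
    sector number within the whole md device, into new_sector, the sector
    number on the member disk, and dd_idx, the index of the disk to use.
    We only look at the parts that concern stripe0.
    In this example the first 4k written falls into stripe0:
    logical_sector = 0 gives new_sector = 0 and dd_idx = 0, that is, the
    first sector of the first disk.
    The ninth 4k write falls into stripe0 again, with new_sector = 0,
    dd_idx = 1 and logical_sector = 64.
    The 17th 4k write falls into stripe0 once more, with new_sector = 0,
    dd_idx = 2 and logical_sector = 128.
    At this point the first stripe is full.
    Each call to raid5_compute_sector yields the two values new_sector and
    dd_idx. All disks in a stripe share the same sector number, so
    new_sector identifies which stripe the write belongs to, and a struct
    stripe_head is used to describe the write.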
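
    The mapping above can be reproduced with a small user-space sketch. It
    is not the kernel's code: it assumes the left-symmetric parity layout
    (which I believe is the usual md default), hard-codes the 4-disk,
    32k-chunk geometry of this example, and the name compute_sector is made
    up for the illustration.

```python
SECTOR_SIZE = 512
SECTORS_PER_CHUNK = 32 * 1024 // SECTOR_SIZE   # 32k chunk -> 64 sectors
RAID_DISKS = 4
DATA_DISKS = RAID_DISKS - 1                    # one disk's worth of parity


def compute_sector(logical_sector):
    """Map an md-device sector to (new_sector, dd_idx, pd_idx).

    Assumption: left-symmetric parity rotation.
    """
    chunk_number, chunk_offset = divmod(logical_sector, SECTORS_PER_CHUNK)
    chunk_row, dd_idx = divmod(chunk_number, DATA_DISKS)
    # The parity disk moves back by one for every row of chunks.
    pd_idx = DATA_DISKS - chunk_row % RAID_DISKS
    # Data disks are numbered starting just after the parity disk.
    dd_idx = (pd_idx + 1 + dd_idx) % RAID_DISKS
    new_sector = chunk_row * SECTORS_PER_CHUNK + chunk_offset
    return new_sector, dd_idx, pd_idx


for n in (1, 9, 17):                 # the 1st, 9th and 17th 4k write
    logical = (n - 1) * 8            # 4k = 8 sectors
    print(n, logical, compute_sector(logical))
# 1 0 (0, 0, 3)
# 9 64 (0, 1, 3)
# 17 128 (0, 2, 3)
```

    All three writes map to new_sector 0, which is why they end up in the
    same stripe_head.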

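2.2 get_active_stripe
~~~~~~~~~~~~~~~~~~~~~~
    The stripe_head is allocated in get_active_stripe. The function first
    calls __find_stripe to check whether the stripe targeted by this
    operation is already in use; if so, no new stripe has to be allocated.
    For the first 4k write, stripe0 cannot be in use yet, so __find_stripe
    returns NULL, and get_free_stripe is then called to obtain a free
    stripe.

2.2.1 get_free_stripe
======================
     get_free_stripe is simple: it takes an sh off the inactive_list and
     increments active_stripes to record that one more stripe is in use.
     It also removes the sh from the hash table it is on (if it is in fact
     hashed). That hash table is used by __find_stripe and is explained
     together with the later writes.

2.2.2 init_stripe
==================
     If an sh was obtained, init_stripe is called to initialize it. This
     sets the stripe's starting sector number and the index of the disk
     that holds parity, and sets the state sh->state to 0. We will see
     this state change repeatedly later on.
     Each block device belonging to the stripe is described by a struct
     r5dev, and these also need to be initialized. First dev->flags, which
     holds the state of that device, is cleared. Then raid5_build_block is
     called to initialize the bio-related fields.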

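2.2.2.1 raid5_build_block
--------------------------
      raid5_build_block first sets up the bi_io_vec used to submit a bio,
      the page that holds the data, and related fields. It then makes this
      call:
      dev->sector = compute_blocknr(sh, i, previous);
      The sector computed here is the sector number, within the whole md
      device, at which this disk's part of the stripe starts.

2.2.3 The rest of get_active_stripe
====================================
     Near the end of get_active_stripe there is this statement:
     atomic_inc(&sh->count);
     It increments the sh reference count. As described later, each call
     to release_stripe decrements count; when it reaches 0, the current
     round of processing on this sh is complete and the daemon is woken up
     to decide what to do with the sh next.

2.2.4 __find_stripe
====================
     For the first 4k written to the raid device, __find_stripe returns
     NULL. But when the ninth 4k is written, the situation shown in Figure
     4, __find_stripe looks the sector up as a key in a global hash table.
     The sector value is now the same as for the first 4k, and when the
     first 4k was written, init_stripe already inserted the sh allocated
     then into the hash table under that sector key. So this time
     __find_stripe finds the sh used for the first 4k write, which is the
     sh that describes stripe0.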
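
     The lookup-or-allocate logic of section 2.2 can be modelled roughly
     as below. This is a simplified sketch, not the kernel implementation:
     the class names StripeHead and StripeCache are invented, a dict
     stands in for the hash table, and re-using a cached stripe that sits
     idle on the inactive list is left out.

```python
class StripeHead:
    def __init__(self):
        self.sector = None   # which stripe this sh currently describes
        self.count = 0       # reference count, like sh->count
        self.state = 0       # flag bits, like sh->state


class StripeCache:
    def __init__(self, nr_stripes=256):
        self.inactive_list = [StripeHead() for _ in range(nr_stripes)]
        self.hash = {}       # sector -> StripeHead, like the stripe hash table
        self.active_stripes = 0

    def get_active_stripe(self, sector):
        sh = self.hash.get(sector)                 # __find_stripe
        if sh is None:
            if not self.inactive_list:
                raise RuntimeError("no free stripe (the kernel would sleep here)")
            sh = self.inactive_list.pop(0)         # get_free_stripe
            self.active_stripes += 1
            self.hash.pop(sh.sector, None)         # unhash its old identity, if any
            assert sh.count == 0                   # init_stripe checks this too
            sh.sector, sh.state = sector, 0        # init_stripe
            self.hash[sector] = sh
        sh.count += 1                              # atomic_inc(&sh->count)
        return sh


cache = StripeCache(nr_stripes=4)
first = cache.get_active_stripe(0)     # 1st 4k write: allocates a stripe
first.count -= 1                       # stand-in for release_stripe
ninth = cache.get_active_stripe(0)     # 9th 4k write: same new_sector
print(ninth is first, cache.active_stripes, ninth.count)   # True 1 1
```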

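2.3 add_stripe_bio
~~~~~~~~~~~~~~~~~~~
    The call to add_stripe_bio sits inconspicuously inside a conditional
    expression, but it does important work.
    It first determines whether the operation is a read or a write.
    It then points a temporary variable bip at the address of the bio
    pointer used for the disk write (towrite), so that later assignments
    through bip change towrite itself. Next it checks for overlap: an
    earlier request already targets this disk in this stripe, the current
    request targets the same disk in the same stripe, and the two ranges
    overlap. This cannot happen with our sequential writes.
    Then, for a write, it checks for overwrite, meaning that the pending
    writes cover the stripe's entire range on this disk. If they do, the
    xor parity calculation can use the data passed down from the upper
    layer directly for this disk. If not, the stripe's data on this disk
    has to be read back into memory, the new data copied over it in
    memory, the xor computed, and finally the in-memory data written to
    disk.
    The coverage check is simple. A cursor named sector starts at the
    stripe's first sector on this disk (sh->dev[dd_idx].sector). For each
    queued bio that starts at or before the cursor (bi->bi_sector <=
    sector), the cursor advances to the end of that bio. If the cursor
    ends up at or past the end of the stripe (sector >=
    sh->dev[dd_idx].sector + STRIPE_SECTORS), the write is an overwrite.
    In our example each write is 4k, exactly the stripe size, so it is an
    overwrite and the following statement is executed:
    set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);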
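
    The coverage check can be sketched in a few lines. This is an
    illustration under assumptions, not the kernel code: bios are
    represented as (start_sector, nr_sectors) tuples already sorted by
    start sector, and the function name is_overwrite is made up.

```python
STRIPE_SECTORS = 8   # one 4k stripe unit = 8 sectors of 512 bytes


def is_overwrite(dev_sector, bios):
    """Return True if the queued writes cover the whole stripe unit.

    dev_sector: first sector of the stripe unit on this disk.
    bios: list of (start_sector, nr_sectors), sorted by start_sector.
    """
    cursor = dev_sector
    end = dev_sector + STRIPE_SECTORS
    for start, nr in bios:
        if cursor >= end or start > cursor:
            break                       # fully covered, or a gap was found
        cursor = max(cursor, start + nr)
    return cursor >= end


print(is_overwrite(0, [(0, 8)]))            # True: one aligned 4k write
print(is_overwrite(0, [(0, 4)]))            # False: only the first 2k
print(is_overwrite(0, [(0, 4), (4, 4)]))    # True: two halves, no gap
print(is_overwrite(0, [(0, 4), (5, 3)]))    # False: sector 4 is missing
```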

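2.4 Between add_stripe_bio and release_stripe
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    At this point make_request performs two meaningful steps:
    set_bit(STRIPE_HANDLE, &sh->state);
    clear_bit(STRIPE_DELAYED, &sh->state);
    Setting STRIPE_HANDLE matters: it marks the stripe as needing further
    processing, and release_stripe uses this bit to decide whether the
    daemon has to be woken up.

2.5 release_stripe
~~~~~~~~~~~~~~~~~~~
    This is the last important function called from make_request.
    release_stripe is only a few lines long: take the lock, call
    __release_stripe, drop the lock. The real work is in
    __release_stripe.

2.5.1 __release_stripe
=======================
     It first drops one reference from sh->count. If the count reaches 0,
     all current operations on the sh are done and the daemon needs to be
     woken up for the next step.
     Recall that init_stripe, called earlier, contains this statement:
     BUG_ON(atomic_read(&sh->count) != 0);
     So a freshly obtained sh always has a reference count of 0. Then,
     near the end of get_active_stripe, this statement runs:
     atomic_inc(&sh->count);
     which increments the reference count of the sh.
     Three 4k writes go to stripe0 in total. Each time, get_active_stripe
     increments sh->count and __release_stripe brings it back to 0.
     Next sh->state is examined. Its value should now be 0x00000004: only
     STRIPE_HANDLE is set (make_request did this after calling
     add_stripe_bio and before calling release_stripe). The conditional
     therefore reaches this line:
     list_add_tail(&sh->lru, &conf->handle_list);
     which puts the stripe on the handle_list.
     Then:
     md_wakeup_thread(conf->mddev->thread);
     wakes up the raid5d daemon thread.
     Although the daemon thread is woken here, it does not run
     immediately. In fact, unless make_request goes to sleep because the
     sh pool is exhausted, raid5d gets no chance to run before
     make_request returns. So once all three 4k writes to stripe0 have
     passed through __release_stripe, and make_request has returned or
     gone to sleep, raid5d starts running and the data submitted to
     stripe0 is processed further.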
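
     The counting described in this section can be traced with a toy
     model. It is only a sketch: the classes Stripe and Conf and the
     helper get_stripe are invented for the illustration, the delayed and
     bitmap lists of the real function are ignored, and "waking raid5d" is
     reduced to a counter.

```python
STRIPE_HANDLE = 2            # bit number; the mask is 1 << 2 == 0x4


class Stripe:
    def __init__(self):
        self.count = 0
        self.state = 0


class Conf:
    def __init__(self):
        self.handle_list = []
        self.inactive_list = []
        self.wakeups = 0     # number of md_wakeup_thread calls


def get_stripe(conf, sh):
    """The part of get_active_stripe that matters here."""
    if sh.count == 0 and sh in conf.handle_list:
        conf.handle_list.remove(sh)      # an idle sh is taken off its list
    sh.count += 1


def release_stripe(conf, sh):
    sh.count -= 1
    if sh.count > 0:
        return                           # someone else still holds the sh
    if sh.state & (1 << STRIPE_HANDLE):
        conf.handle_list.append(sh)      # needs processing by raid5d
        conf.wakeups += 1
    else:
        conf.inactive_list.append(sh)    # nothing left to do


conf, sh = Conf(), Stripe()
for _ in range(3):                        # the three 4k writes to stripe0
    get_stripe(conf, sh)
    sh.state |= 1 << STRIPE_HANDLE        # done by make_request
    release_stripe(conf, sh)
print(hex(sh.state), sh.count, len(conf.handle_list), conf.wakeups)
# 0x4 0 1 3
```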

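3 raid5d
#########
When the raid5d thread starts running it enters a big loop. In each
iteration it calls __get_priority_stripe to get an sh that needs
processing, calls handle_stripe to process it, and finally calls
release_stripe, which decides whether to wake raid5d again or to free the
sh. The loop exits only when every sh has been processed. At the end of
raid5d, async_tx_issue_pending_all is called; if any dma operations were
set up during processing, this makes sure the dma starts running.

3.1 __get_priority_stripe
^^^^^^^^^^^^^^^^^^^^^^^^^^
   This function tries to take an sh from two lists, handle_list and
   hold_list. For writes, an sh that is already a full-stripe write is
   placed on handle_list; otherwise it goes on hold_list. handle_list is
   always searched first, so full-stripe writes are processed first. By
   the time they are done, new writes may have arrived and turned some sh
   that were not full-stripe writes into full-stripe writes, which
   improves performance.
   Here the sh for stripe0 is taken off handle_list and sh->count is
   incremented. Putting an sh on a list is normally done in
   release_stripe, and release_stripe only does so when sh->count drops
   to 0. So after the increment here, the count is always 1.

3.2 handle_stripe
^^^^^^^^^^^^^^^^^^
   After __get_priority_stripe has returned an sh, handle_stripe is called
   to process it. handle_stripe checks the raid level of the sh and calls
   the matching handler.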

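3.2.1 handle_stripe5
~~~~~~~~~~~~~~~~~~~~~
    It first loops over the flags of every dev and sets local variables
    according to them. For the stripe0 sh, the flags of dev0, dev1 and
    dev2 are 0x04 and the flags of dev3 are 0x00; that is, R5_OVERWRITE
    is set for disks 0, 1 and 2.
    After the loop a number of further state checks decide which
    operations this sh actually needs. In the end, the function that gets
    called is handle_stripe_dirtying5.

3.2.1.1 handle_stripe_dirtying5
================================
     For a write, this function decides between rcw and rmw. Suppose only
     one disk in a stripe is being written. One option is to read back
     the old data of that disk together with the old parity and xor both
     with the new data to get the new parity; this is rmw
     (read-modify-write). The other is to read back the data of all the
     other disks and xor it with the new data to compute the parity; this
     is rcw (reconstruct-write). The function counts how many disk reads
     each method would need and picks the one that needs fewer.
     A full-stripe write always uses rcw, because it needs no reads at
     all.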
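
     The read counting can be illustrated with a short sketch. It is a
     simplification under assumptions: disks are identified by index, sets
     say which disks have pending writes, full overwrites, or cached
     up-to-date data, and the names count_reads and choose are invented.
     Ties are resolved in favour of rcw here, which is my reading of the
     kernel's behaviour rather than something stated in this article.

```python
def count_reads(nr_disks, pd_idx, towrite, overwrite, uptodate=()):
    """Return (rmw, rcw): the disk reads each strategy would need."""
    rmw = rcw = 0
    for i in range(nr_disks):
        if i in uptodate:
            continue                  # already in memory, no read needed
        if i in towrite or i == pd_idx:
            rmw += 1                  # rmw reads old data of written disks + old parity
        if i not in overwrite and i != pd_idx:
            rcw += 1                  # rcw reads every data block not fully overwritten
    return rmw, rcw


def choose(rmw, rcw):
    return "rmw" if rmw < rcw else "rcw"


# Full-stripe write on 4 disks (parity on disk 3): rcw needs no reads.
print(count_reads(4, 3, {0, 1, 2}, {0, 1, 2}))      # (4, 0)
# One whole block written on a 5-disk array (parity on disk 4).
print(count_reads(5, 4, {0}, {0}))                  # (2, 3)
print(choose(*count_reads(5, 4, {0}, {0})))         # rmw
```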

3.2.1.1.1 schedule_reconstruction
----------------------------------
      When no read-back is needed at all, schedule_reconstruction is
      called to set some state flags.
      For our full-stripe write it sets:
      sh->reconstruct_state = reconstruct_state_drain_run;
      set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
      set_bit(STRIPE_OP_RECONSTRUCT, &s->ops_request);

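3.2.1.2 raid_run_ops
=====================
     Next, handle_stripe5 calls raid_run_ops to do the xor computation.
     raid5.c contains this macro definition:
     #define raid_run_ops __raid_run_ops
     so the function actually called is __raid_run_ops.
     It first uses memcpy to copy the data from the bios into the sh
     buffers, then computes the xor over the data in the sh buffers. If
     the hardware supports offloading memcpy and xor, these operations run
     asynchronously. On completion a callback is invoked; the callback
     calls release_stripe so that raid5d is woken at the right time for
     the next stage.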
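
     The two steps, copying the bio data into the stripe buffers and then
     xor-ing the buffers, look roughly like this in user space. The sketch
     only shows the arithmetic: it is synchronous, works on Python byte
     arrays, and the function names biodrain and xor_blocks are chosen for
     the illustration.

```python
STRIPE_SIZE = 4096


def biodrain(bio_payloads):
    """Copy each bio's payload into its own stripe buffer (the memcpy step)."""
    return [bytearray(payload) for payload in bio_payloads]


def xor_blocks(blocks):
    """Byte-wise xor of equally sized blocks; for RAID5 this is the parity."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return parity


# Three data disks, each receiving one 4k block from the upper layer.
payloads = [bytes([0x11]) * STRIPE_SIZE,
            bytes([0x22]) * STRIPE_SIZE,
            bytes([0x44]) * STRIPE_SIZE]
buffers = biodrain(payloads)
parity = xor_blocks(buffers)
print(hex(parity[0]))                       # 0x77

# Any single block can be rebuilt from the parity and the other blocks.
rebuilt = xor_blocks([parity, buffers[0], buffers[2]])
print(rebuilt == buffers[1])                # True
```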

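3.2.1.3 End of handle_stripe5
==============================
     Assume asynchronous hardware dma is used for the memcpy and xor.
     Then, for a full-stripe write, handle_stripe5 does nothing more of
     substance. Processing continues only after the hardware operations
     complete and raid5d is woken again.

3.3 release_stripe
^^^^^^^^^^^^^^^^^^^
   After handle_stripe returns, raid5d calls release_stripe again.
   Because ops_run_reconstruct5 executed:
   atomic_inc(&sh->count);
   sh->count is 2 at this point. Decrementing gives 1, not 0, so nothing
   further is done and the function returns.

4 ops_complete_reconstruct
###########################
When both the memcpy and the xor have completed, ops_complete_reconstruct
is called. It calls release_stripe again; on entry sh->count is 1, and the
decrement makes it 0, so raid5d is woken again.

5 Entering raid5d a second time
################################
As in the first run of raid5d, the sh is fetched, handle_stripe processes
it, and release_stripe is called at the end. This time, however,
handle_stripe5 calls ops_run_io to write the computed parity, together
with the data passed down through make_request, to disk.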

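5.1 ops_run_io
^^^^^^^^^^^^^^^
   The first call to handle_stripe also enters ops_run_io, but at that
   time neither R5_Wantwrite nor R5_Wantread is set, so it does nothing.
   ops_complete_reconstruct, however, sets sh->reconstruct_state to
   reconstruct_state_drain_result. As a result, handle_stripe5 sets
   R5_Wantwrite in the flags of every dev that needs to be written.
   ops_run_io uses this flag to decide which disks need a write. Before
   each write is issued, sh->count is incremented. When the write to a
   disk completes, the callback raid5_end_write_request is invoked. It
   calls release_stripe, which decrements sh->count and checks whether it
   has reached 0. So when the last write completes, release_stripe finds
   sh->count at 0 and wakes raid5d for the third time.

6 raid5_end_write_request
##########################
Called each time a disk write completes. It sets some flag bits and calls
release_stripe; when the last write has completed, release_stripe wakes
the raid5d daemon thread.

7 Entering raid5d a third time
###############################
As in the previous two runs, the sh is fetched, processed and released.
This time handle_stripe5 calls handle_stripe_clean_event.

7.1 handle_stripe_clean_event
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   Once processing of an sh has finished, this function is called to set
   the corresponding state.

7.2 release_stripe
^^^^^^^^^^^^^^^^^^^
   Processing of the sh is complete: it is put back on the inactive_list
   and active_stripes is decremented.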
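
To tie sections 3 to 7 together, the sketch below traces sh->count through
the three raid5d passes for one full-stripe write on 4 disks. It is a toy
model, not kernel code. Two details are assumptions that this article does
not spell out: that handle_stripe clears STRIPE_HANDLE when it starts, and
that the completion callbacks set it again before calling release_stripe.
All Python names are invented.

```python
class Stripe:
    def __init__(self):
        self.count = 0
        self.handle = True     # STRIPE_HANDLE, set by make_request
        self.trace = []        # (caller, count after the call)


def get(sh, who):
    sh.count += 1
    sh.trace.append((who, sh.count))


def release(sh, who):
    """Drop one reference; report what happens when the count hits 0."""
    sh.count -= 1
    sh.trace.append((who, sh.count))
    if sh.count == 0:
        return "wake raid5d" if sh.handle else "inactive_list"
    return None


def full_stripe_write(nr_disks=4):
    sh, events = Stripe(), []
    # Pass 1: schedule the copy and xor.
    get(sh, "__get_priority_stripe")
    sh.handle = False                        # assumed: cleared by handle_stripe
    get(sh, "ops_run_reconstruct5")
    events.append(release(sh, "raid5d"))     # 2 -> 1
    sh.handle = True                         # assumed: set by the callback
    events.append(release(sh, "ops_complete_reconstruct"))
    # Pass 2: issue one write per disk (3 data + 1 parity).
    get(sh, "__get_priority_stripe")
    sh.handle = False
    for _ in range(nr_disks):
        get(sh, "ops_run_io")
    events.append(release(sh, "raid5d"))
    for _ in range(nr_disks):
        sh.handle = True
        events.append(release(sh, "raid5_end_write_request"))
    # Pass 3: clean up and give the stripe back.
    get(sh, "__get_priority_stripe")
    sh.handle = False
    events.append(release(sh, "raid5d"))
    return sh, events


sh, events = full_stripe_write()
print([e for e in events if e])
# ['wake raid5d', 'wake raid5d', 'inactive_list']
```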

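Sunday, November 14, 2010

Red-black trees in the Linux kernel

Like the kernel's hash table, the kernel's red-black tree is fairly "bare". For efficiency, not every operation is wrapped up for you.

To insert, the caller has to walk the tree as a binary search tree to find the parent of the node to be inserted, call rb_link_node to link the node in, and then call rb_insert_color to rebalance ("rotate") the red-black tree as needed.

To search, the caller likewise walks the tree with the ordinary binary search tree method. How keys are compared is, of course, also up to the caller.

The file rbtree.txt in the kernel's Documentation directory describes in detail how to use red-black trees.

Below is a small example I wrote that uses a red-black tree; it runs on 2.6.32. In this program, coef is used as the key.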

#include <linux/init.h>
#include <linux/module.h>
#include <linux/rbtree.h>

struct q_coef
{
    u8 coef;
    u8 index;
    struct rb_node node;
};

#define COEF_NUM 15
u8 coef[COEF_NUM] = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
    0x1d, 0x3a, 0x74, 0xe8, 0xcd, 0x87, 0x13,
};
struct q_coef q_coef[COEF_NUM];

static void q_coef_init(void)
{
    int i;
    memset(&q_coef, 0, sizeof(q_coef));
    for (i = 0 ; i < COEF_NUM ; i++) {
        q_coef[i].coef = coef[i];
        q_coef[i].index = i + 1;
    }
}

struct rb_root q_coef_tree = RB_ROOT;

static int q_coef_insert(struct rb_root *root, struct q_coef *data)
{
    struct rb_node **new = &(root->rb_node), *parent = NULL;

    /* Figure out where to put new node */
    while (*new) {
        struct q_coef *this = rb_entry(*new, struct q_coef, node);
        parent = *new;
        if (data->coef < this->coef)
            new = &((*new)->rb_left);
        else if (data->coef > this->coef)
            new = &((*new)->rb_right);
        else
            return -1;
    }

    /* Add new node and rebalance tree. */
    rb_link_node(&data->node, parent, new);
    rb_insert_color(&data->node, root);

    return 0;
}

static struct q_coef *q_coef_search(struct rb_root *root, u8 coef)
{
    struct rb_node *node = root->rb_node;
    while (node) {
        struct q_coef *data = rb_entry(node, struct q_coef, node);
        if (coef < data->coef)
            node = node->rb_left;
        else if (coef > data->coef)
            node = node->rb_right;
        else
            return data;
    }
    return NULL;
}

static int rbtest_init (void)
{
    int i;
    struct q_coef *ptr;
    struct rb_node *node;
    int ret;

    q_coef_init();

    for (i = 0 ; i < COEF_NUM ; i++) {
        ret = q_coef_insert(&q_coef_tree, &q_coef[i]);
        if (ret < 0) {
            printk(KERN_WARNING "q_coef_insert failed, i=%d\n", i);
            return -1;
        }
    }

    printk(KERN_INFO "search by input order:\n");
    for (i = 0 ; i < COEF_NUM ; i++) {
        ptr = q_coef_search(&q_coef_tree, coef[i]);
        if (ptr == NULL) {
            printk(KERN_WARNING "q_coef_search failed, i=%d\n", i);
            return -1;
        }
        printk(KERN_INFO "coef[%02d]=0x%02x ptr->coef=0x%02x ptr->index=%02d\n",
            i, coef[i], ptr->coef, ptr->index);
    }

    printk(KERN_INFO "search from first:\n");
    for (node = rb_first(&q_coef_tree) ; node ; node = rb_next(node)) {
        ptr = rb_entry(node, struct q_coef, node);
        printk(KERN_INFO "ptr->coef=0x%02x ptr->index=%02d\n", ptr->coef, ptr->index);
    }

    printk(KERN_INFO "search from last:\n");
    for (node = rb_last(&q_coef_tree) ; node ; node = rb_prev(node)) {
        ptr = rb_entry(node, struct q_coef, node);
        printk(KERN_INFO "ptr->coef=0x%02x ptr->index=%02d\n", ptr->coef, ptr->index);
    }

    printk(KERN_INFO "rbtest done\n");
    /* Returning nonzero makes insmod fail, so the test module does not stay loaded. */
    return -1;
}

static void rbtest_exit (void)
{
}

module_init(rbtest_init);
module_exit(rbtest_exit);

MODULE_LICENSE("Dual BSD/GPL");
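
The caller-driven descent used by q_coef_insert and q_coef_search above can be shown without any kernel headers. The sketch below is a plain, unbalanced binary search tree in Python: it keeps the "find the parent and the link slot yourself" step but leaves out the rebalancing that rb_insert_color performs, so it is not a red-black tree. All class and function names are invented for the illustration.

```python
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = self.parent = None


class Root:
    def __init__(self):
        self.node = None


def insert(root, data):
    """Walk down to the empty slot where data belongs, then link it in."""
    parent, slot = None, (root, "node")    # slot stands in for struct rb_node **new
    while getattr(*slot) is not None:
        parent = getattr(*slot)
        if data.key < parent.key:
            slot = (parent, "left")
        elif data.key > parent.key:
            slot = (parent, "right")
        else:
            return False                   # duplicate key
    data.parent = parent                   # what rb_link_node does
    setattr(slot[0], slot[1], data)
    # A red-black tree would rebalance here (rb_insert_color); omitted.
    return True


def search(root, key):
    node = root.node
    while node is not None:
        if key < node.key:
            node = node.left
        elif key > node.key:
            node = node.right
        else:
            return node
    return None


def inorder(node):
    if node is not None:
        yield from inorder(node.left)
        yield node
        yield from inorder(node.right)


COEF = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
        0x1d, 0x3a, 0x74, 0xe8, 0xcd, 0x87, 0x13]
tree = Root()
for index, coef in enumerate(COEF, start=1):
    insert(tree, Node(coef, index))
print(search(tree, 0x3a).value)                       # 10
print([hex(n.key) for n in inorder(tree.node)][:4])   # ['0x1', '0x2', '0x4', '0x8']
```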

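Friday, November 12, 2010

A calculator for GF(2**8)

Implements addition, subtraction, multiplication, division and exponentiation in the Galois field with generator polynomial F(x) = x**8 + x**4 + x**3 + x**2 + 1. Parentheses are supported.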
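
As a side note, log and antilog tables for this field do not have to be typed in by hand. The sketch below shows one way to derive them from the polynomial, assuming 2 (the element x) is used as the generator; it is an illustration, not part of pq.py, and the function names are made up. POLY is the bit pattern of F(x), 0x11d.

```python
POLY = 0x11d    # x**8 + x**4 + x**3 + x**2 + 1


def build_tables():
    """Return (gflog, gfilog) for GF(2**8) with generator 2."""
    gflog = [0] * 256       # gflog[0] is a placeholder: log(0) is undefined
    gfilog = [0] * 255      # gfilog[i] = 2**i in the field
    x = 1
    for i in range(255):
        gfilog[i] = x
        gflog[x] = i
        x <<= 1             # multiply by 2 ...
        if x & 0x100:
            x ^= POLY       # ... and reduce modulo F(x)
    return gflog, gfilog


GFLOG, GFILOG = build_tables()


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return GFILOG[(GFLOG[a] + GFLOG[b]) % 255]


def gf_div(a, b):
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2**8)")
    if a == 0:
        return 0
    return GFILOG[(GFLOG[a] - GFLOG[b]) % 255]


print([hex(v) for v in GFILOG[:9]])
# ['0x1', '0x2', '0x4', '0x8', '0x10', '0x20', '0x40', '0x80', '0x1d']
print(hex(gf_mul(0x02, 0x80)))    # 0x1d
```

Python's % always returns a non-negative result for a positive modulus, so the subtraction in gf_div needs no separate "add 255" step.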

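Save the code below as pq.py, run chmod +x pq.py, then start the calculator with ./pq.py. Enter q to quit. The script is written for Python 2 (it uses the print statement and raw_input).

The expression parser has a known problem: it does not support mixing exponentiation with the other operations.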

#! /usr/bin/env python

# reference to http://www.cnblogs.com/flyingbread/archive/2007/02/03/638932.html

import re

def is_operator(ch):
    if ch == '+' or ch == '-' or ch == '*' or ch == '/' or ch == '**':
        return True
    else:
        return False

def is_parentheses_left(ch):
    if ch == '(':
        return True
    else:
        return False

def is_parentheses_right(ch):
    if ch == ')':
        return True
    else:
        return False

def opt_priority(ch):
    if ch == '+' or ch == '-':
        priority = 1
    elif ch == '*' or ch == '/':
        priority = 2
    elif ch == '**':
        priority = 3
    else:
        # maybe '('
        priority = 0
    return priority

# only binary operators are supported for now
def get_operation_number(ch):
    if ch == '+' or ch == '-' or ch == '*' or ch == '/' or ch == '**':
        return 2
    else:
        return 1

# primitive polynomial for GF(2**8)
# F(x) = x**8 + x**4 + x**3 + x**2 + 1
def do_add(a, b):
    return a ^ b

def do_sub(a, b):
    return a ^ b

gflog = [
    0x00, 0x00, 0x01, 0x19, 0x02, 0x32, 0x1a, 0xc6, 0x03, 0xdf, 0x33, 0xee, 0x1b, 0x68, 0xc7, 0x4b,
    0x04, 0x64, 0xe0, 0x0e, 0x34, 0x8d, 0xef, 0x81, 0x1c, 0xc1, 0x69, 0xf8, 0xc8, 0x08, 0x4c, 0x71,
    0x05, 0x8a, 0x65, 0x2f, 0xe1, 0x24, 0x0f, 0x21, 0x35, 0x93, 0x8e, 0xda, 0xf0, 0x12, 0x82, 0x45,
    0x1d, 0xb5, 0xc2, 0x7d, 0x6a, 0x27, 0xf9, 0xb9, 0xc9, 0x9a, 0x09, 0x78, 0x4d, 0xe4, 0x72, 0xa6,
    0x06, 0xbf, 0x8b, 0x62, 0x66, 0xdd, 0x30, 0xfd, 0xe2, 0x98, 0x25, 0xb3, 0x10, 0x91, 0x22, 0x88,
    0x36, 0xd0, 0x94, 0xce, 0x8f, 0x96, 0xdb, 0xbd, 0xf1, 0xd2, 0x13, 0x5c, 0x83, 0x38, 0x46, 0x40,
    0x1e, 0x42, 0xb6, 0xa3, 0xc3, 0x48, 0x7e, 0x6e, 0x6b, 0x3a, 0x28, 0x54, 0xfa, 0x85, 0xba, 0x3d,
    0xca, 0x5e, 0x9b, 0x9f, 0x0a, 0x15, 0x79, 0x2b, 0x4e, 0xd4, 0xe5, 0xac, 0x73, 0xf3, 0xa7, 0x57,
    0x07, 0x70, 0xc0, 0xf7, 0x8c, 0x80, 0x63, 0x0d, 0x67, 0x4a, 0xde, 0xed, 0x31, 0xc5, 0xfe, 0x18,
    0xe3, 0xa5, 0x99, 0x77, 0x26, 0xb8, 0xb4, 0x7c, 0x11, 0x44, 0x92, 0xd9, 0x23, 0x20, 0x89, 0x2e,
    0x37, 0x3f, 0xd1, 0x5b, 0x95, 0xbc, 0xcf, 0xcd, 0x90, 0x87, 0x97, 0xb2, 0xdc, 0xfc, 0xbe, 0x61,
    0xf2, 0x56, 0xd3, 0xab, 0x14, 0x2a, 0x5d, 0x9e, 0x84, 0x3c, 0x39, 0x53, 0x47, 0x6d, 0x41, 0xa2,
    0x1f, 0x2d, 0x43, 0xd8, 0xb7, 0x7b, 0xa4, 0x76, 0xc4, 0x17, 0x49, 0xec, 0x7f, 0x0c, 0x6f, 0xf6,
    0x6c, 0xa1, 0x3b, 0x52, 0x29, 0x9d, 0x55, 0xaa, 0xfb, 0x60, 0x86, 0xb1, 0xbb, 0xcc, 0x3e, 0x5a,
    0xcb, 0x59, 0x5f, 0xb0, 0x9c, 0xa9, 0xa0, 0x51, 0x0b, 0xf5, 0x16, 0xeb, 0x7a, 0x75, 0x2c, 0xd7,
    0x4f, 0xae, 0xd5, 0xe9, 0xe6, 0xe7, 0xad, 0xe8, 0x74, 0xd6, 0xf4, 0xea, 0xa8, 0x50, 0x58, 0xaf,
]

gfilog = [
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1d, 0x3a, 0x74, 0xe8, 0xcd, 0x87, 0x13, 0x26,
    0x4c, 0x98, 0x2d, 0x5a, 0xb4, 0x75, 0xea, 0xc9, 0x8f, 0x03, 0x06, 0x0c, 0x18, 0x30, 0x60, 0xc0,
    0x9d, 0x27, 0x4e, 0x9c, 0x25, 0x4a, 0x94, 0x35, 0x6a, 0xd4, 0xb5, 0x77, 0xee, 0xc1, 0x9f, 0x23,
    0x46, 0x8c, 0x05, 0x0a, 0x14, 0x28, 0x50, 0xa0, 0x5d, 0xba, 0x69, 0xd2, 0xb9, 0x6f, 0xde, 0xa1,
    0x5f, 0xbe, 0x61, 0xc2, 0x99, 0x2f, 0x5e, 0xbc, 0x65, 0xca, 0x89, 0x0f, 0x1e, 0x3c, 0x78, 0xf0,
    0xfd, 0xe7, 0xd3, 0xbb, 0x6b, 0xd6, 0xb1, 0x7f, 0xfe, 0xe1, 0xdf, 0xa3, 0x5b, 0xb6, 0x71, 0xe2,
    0xd9, 0xaf, 0x43, 0x86, 0x11, 0x22, 0x44, 0x88, 0x0d, 0x1a, 0x34, 0x68, 0xd0, 0xbd, 0x67, 0xce,
    0x81, 0x1f, 0x3e, 0x7c, 0xf8, 0xed, 0xc7, 0x93, 0x3b, 0x76, 0xec, 0xc5, 0x97, 0x33, 0x66, 0xcc,
    0x85, 0x17, 0x2e, 0x5c, 0xb8, 0x6d, 0xda, 0xa9, 0x4f, 0x9e, 0x21, 0x42, 0x84, 0x15, 0x2a, 0x54,
    0xa8, 0x4d, 0x9a, 0x29, 0x52, 0xa4, 0x55, 0xaa, 0x49, 0x92, 0x39, 0x72, 0xe4, 0xd5, 0xb7, 0x73,
    0xe6, 0xd1, 0xbf, 0x63, 0xc6, 0x91, 0x3f, 0x7e, 0xfc, 0xe5, 0xd7, 0xb3, 0x7b, 0xf6, 0xf1, 0xff,
    0xe3, 0xdb, 0xab, 0x4b, 0x96, 0x31, 0x62, 0xc4, 0x95, 0x37, 0x6e, 0xdc, 0xa5, 0x57, 0xae, 0x41,
    0x82, 0x19, 0x32, 0x64, 0xc8, 0x8d, 0x07, 0x0e, 0x1c, 0x38, 0x70, 0xe0, 0xdd, 0xa7, 0x53, 0xa6,
    0x51, 0xa2, 0x59, 0xb2, 0x79, 0xf2, 0xf9, 0xef, 0xc3, 0x9b, 0x2b, 0x56, 0xac, 0x45, 0x8a, 0x09,
    0x12, 0x24, 0x48, 0x90, 0x3d, 0x7a, 0xf4, 0xf5, 0xf7, 0xf3, 0xfb, 0xeb, 0xcb, 0x8b, 0x0b, 0x16,
    0x2c, 0x58, 0xb0, 0x7d, 0xfa, 0xe9, 0xcf, 0x83, 0x1b, 0x36, 0x6c, 0xd8, 0xad, 0x47, 0x8e, 0x00,
]

def do_mul(a, b):
    if a == 0 or b == 0:
        return 0
    else:
        tmp = gflog[a] + gflog[b]
        tmp = tmp % 255
        return gfilog[tmp]

def do_div(a, b):
    if a == 0:
        return 0
    elif b == 0:
        print 'cannot divide by 0'
        return 0
    else:
        tmp = gflog[a] - gflog[b]
        if tmp < 0:
            tmp = tmp + 255
        return gfilog[tmp]

def do_power(a, b):
    count = 0
    result = 1
    while (count < b):
        result = do_mul(result, a)
        count += 1
    return result

def calc_once(num, ch):
    if ch == '+':
        ret = do_add(num[1], num[0])
    elif ch == '-':
        ret = do_sub(num[1], num[0])
    elif ch == '*':
        ret = do_mul(num[1], num[0])
    elif ch == '/':
        ret = do_div(num[1], num[0])
    elif ch == '**':
        ret = do_power(num[1], num[0])
    else:
        print 'unknown operation'
        ret = 0
    return ret

# 1. read tokens from left to right
# 2. if the token is a number, output it
# 3. if the token is an operator or a parenthesis:
#    a: if it is '(', push it onto the stack
#    b: if it is ')', pop the stack until '(' is met
#    c: otherwise compare its priority with the top of the stack:
#          if it is higher than the top of the stack, push it
#          else pop the top once to the output, then push it
def midfix_to_posfix(midfix):
    stack = []
    posfix = []
    for ch in midfix:
        if is_parentheses_left(ch):
            stack.append(ch)
        elif is_parentheses_right(ch):
            while True:
                ch1 = stack.pop()
                if is_parentheses_left(ch1):
                    break
                else:
                    posfix.append(ch1)
        elif is_operator(ch):
            if len(stack) == 0:
                stack.append(ch)
            else:
                ch1 = stack[-1]
                if opt_priority(ch) > opt_priority(ch1):
                    stack.append(ch)
                else:
                    ch1 = stack.pop()
                    posfix.append(ch1)
                    stack.append(ch)
        else:
            if len(ch) > 2 and ch[0:2] == '0x':
                ch = int(ch, 16)
            else:
                ch = int(ch, 10)
            posfix.append(ch)
    while len(stack) != 0:
        posfix.append(stack.pop())
    return posfix

# read tokens from left to right
# if the token is a number, push it onto the stack
# if the token is an operator, pop the operands it needs, compute, and push the result
# when all tokens have been parsed, pop the stack to get the result
def calc_posfix(posfix):
    stack = []
    for ch in posfix:
        if is_operator(ch):
            num = []
            num.append(stack.pop())
            if get_operation_number(ch) > 1:
                num.append(stack.pop())
            stack.append(calc_once(num, ch))
        else:
            stack.append(ch)
    return stack.pop()

def main_loop():
    # match all hex and decimal numbers, and +, -, *, /, **
    # note: hex must come before decimal, and ** must come before *
    print "please do not mix ** with other operations, it is not supported!"
    regu_for_exp = re.compile(r'0x[0-9a-f]+|[0-9]+|\*\*|\*|\+|-|/|\(|\)')
    while True:
        expression = raw_input('pq>:')
        if expression != 'quit' and expression != 'q' and expression != 'exit':
            midfix = regu_for_exp.findall(expression)
            posfix = midfix_to_posfix(midfix)
            result = calc_posfix(posfix)
            if result is not None:
                print "0x%02x" % result
            else:
                print result
        else:
            return
if __name__ == '__main__':
    main_loop()
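
The restriction on mixing ** with other operators seems to come from midfix_to_posfix popping at most one operator from the stack before pushing the new one. For reference, here is a sketch of the usual infix-to-postfix conversion (the shunting-yard approach), which keeps popping while the operator on top of the stack binds at least as tightly. It is written in Python 3, is not a drop-in patch for pq.py, and treating ** as right-associative is my own choice. Tokens are kept as strings.

```python
PRIORITY = {'+': 1, '-': 1, '*': 2, '/': 2, '**': 3}
RIGHT_ASSOC = {'**'}


def to_postfix(tokens):
    """Convert a list of infix tokens to postfix order."""
    stack, out = [], []
    for tok in tokens:
        if tok == '(':
            stack.append(tok)
        elif tok == ')':
            while stack[-1] != '(':
                out.append(stack.pop())
            stack.pop()                          # discard the '('
        elif tok in PRIORITY:
            # Pop every operator that must be applied before tok.
            while stack and stack[-1] != '(':
                top = PRIORITY[stack[-1]]
                cur = PRIORITY[tok]
                if top > cur or (top == cur and tok not in RIGHT_ASSOC):
                    out.append(stack.pop())
                else:
                    break
            stack.append(tok)
        else:
            out.append(tok)                      # a number
    while stack:
        out.append(stack.pop())
    return out


print(to_postfix(['1', '+', '2', '*', '3', '**', '2', '+', '1']))
# ['1', '2', '3', '2', '**', '*', '+', '1', '+']
```

With the single-pop rule, the same input would leave '*' and the first '+' on the stack until the very end, so the trailing '+ 1' would be grouped with 3 ** 2 instead of being applied last.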

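Wednesday, November 10, 2010

gflog and gfilog in raid6

I wrote an article introducing gflog and gfilog in raid6. It uses a lot of math formulas, so I wrote it in lyx. Putting that on a web page is awkward, so I put it on google code; the download address is: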

http://raid6theory.googlecode.com/files/raid6_theory.pdf

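Wednesday, November 3, 2010

A small example of creating a /sys entry and using a waitqueue

A simple example program. It creates a readable and writable /sys interface; each read returns whatever was written last. Every read or write triggers an event that wakes a kernel thread.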

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/kthread.h>

static struct cdev test_cdev;
static dev_t test_devno;
static struct class *test_class;
struct device *test_device;

struct test_thread {
    wait_queue_head_t    wqueue;
    unsigned long           flags;
    struct task_struct    *tsk;
    unsigned long        timeout;
}test_thread;

#define FROM_SHOW 0
#define FROM_STORE 1
static const struct file_operations test_fops =
{
    .owner          = THIS_MODULE,
};

static int test_fun(void *arg)
{
    struct test_thread *thread = arg;
    allow_signal(SIGKILL);

    while (!kthread_should_stop()) {

        /* We need to wait INTERRUPTIBLE so that
        * we don't add to the load-average.
        * That means we need to be sure no signals are
        * pending
        */
        if (signal_pending(current))
            flush_signals(current);

        wait_event_interruptible_timeout
            (thread->wqueue,
                test_bit(FROM_SHOW, &thread->flags)
                || test_bit(FROM_STORE, &thread->flags)
                || kthread_should_stop(),
                thread->timeout);
        printk("thread %s is waken up by %s\n",
            thread->tsk->comm,
            test_bit(FROM_SHOW, &thread->flags) ? "show" :
            test_bit(FROM_STORE, &thread->flags) ? "store" : "kill");
        thread->flags = 0;
    }

    return 0;
}

#define TEST_LEN 4096
static char test_buf[TEST_LEN];
static ssize_t test_show(struct device *ddev,
            struct device_attribute *attr, char *buf)
{
    int len;
    struct test_thread *thread;

    thread = dev_get_drvdata(ddev);
    len = strlen(test_buf) + 1;
    memcpy(buf, test_buf, len);

    set_bit(FROM_SHOW, &thread->flags);
    wake_up(&thread->wqueue);

    return len;
}

static ssize_t test_store(struct device *ddev,
            struct device_attribute *attr, const char *buf, size_t count)
{
    size_t len;
    struct test_thread *thread;

    thread = dev_get_drvdata(ddev);

    if (count < TEST_LEN - 1)
        len = count;
    else
        len = TEST_LEN - 1;

    memcpy(test_buf, buf, len);
    test_buf[len] = 0;

    set_bit(FROM_STORE, &thread->flags);
    wake_up(&thread->wqueue);

    return count;
}

static DEVICE_ATTR(test1, 0644, test_show, test_store);

static int test_init(void)
{
    int ret;

    ret = alloc_chrdev_region(&test_devno, 0, 255, "test");
    if (ret) {
        printk(KERN_INFO "alloc_chrdev_region failed, ret=%d\n", ret);
        return ret;
    }

    cdev_init(&test_cdev, &test_fops);
    test_cdev.owner = THIS_MODULE;
    ret = cdev_add(&test_cdev, test_devno, 1);
    if (ret) {
        printk(KERN_INFO "cdev_add failed, ret=%d\n", ret);
        goto free_devno;
    }

    test_class = class_create(THIS_MODULE, "test_class");
    if (IS_ERR(test_class)) {
        ret = PTR_ERR(test_class);
        printk(KERN_INFO "class_create failed, ret=%d\n", ret);
        goto free_cdev;
    }

    test_device = device_create(test_class, NULL, test_devno, NULL, "test");
    if (IS_ERR(test_device)) {
        ret = PTR_ERR(test_device);
        printk(KERN_INFO "device_create failed, ret=%d\n", ret);
        goto free_class;
    }

    ret = device_create_file(test_device, &dev_attr_test1);
    if (ret) {
        printk(KERN_INFO "device_create_file failed, ret=%d\n", ret);
        goto free_device;
    }

    init_waitqueue_head(&test_thread.wqueue);
    test_thread.timeout = MAX_SCHEDULE_TIMEOUT;
    test_thread.flags = 0;
    dev_set_drvdata(test_device, &test_thread);
    test_thread.tsk = kthread_run(test_fun, &test_thread, "test_thread");
    if (IS_ERR(test_thread.tsk)) {
        ret = PTR_ERR(test_thread.tsk);
        printk(KERN_INFO "kthread_run failed, ret=%d\n", ret);
        goto free_file;
    }

    return 0;

free_file:
    device_remove_file(test_device, &dev_attr_test1);
free_device:
    device_destroy(test_class, test_devno);
free_class:
    class_destroy(test_class);
free_cdev:
    cdev_del(&test_cdev);
free_devno:
    unregister_chrdev_region(test_devno, 255);
    return ret;
}

void test_exit(void)
{
    kthread_stop(test_thread.tsk);
    device_remove_file(test_device, &dev_attr_test1);
    device_destroy(test_class, test_devno);
    class_destroy(test_class);
    cdev_del(&test_cdev);
    unregister_chrdev_region(test_devno, 255);
}
MODULE_LICENSE("GPL");
module_init (test_init);
module_exit (test_exit);
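
The wake-up pattern of the module above (set a flag, wake the waiter, let the waiter report why it woke) can be tried in user space. The sketch below is only an analogy built on Python's threading.Condition: the class WakeableThread and its methods are invented, and unlike the kernel module the waker blocks until the worker has handled the event, purely so that the log order is deterministic.

```python
import threading


class WakeableThread:
    """User-space stand-in for the module's test_thread."""

    def __init__(self):
        self.cond = threading.Condition()   # plays the role of the wait queue
        self.flags = set()                  # "show" / "store", like the flag bits
        self.stopping = False               # like kthread_should_stop()
        self.buf = ""                       # like test_buf
        self.log = []
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        with self.cond:
            while True:
                # Sleep until a flag is set or a stop is requested.
                self.cond.wait_for(lambda: self.flags or self.stopping)
                reason = ("show" if "show" in self.flags else
                          "store" if "store" in self.flags else "kill")
                self.log.append(reason)
                self.flags.clear()
                self.cond.notify_all()      # demo only: tell the waker we ran
                if reason == "kill":
                    return

    def _wake(self, flag):
        with self.cond:
            self.flags.add(flag)            # like set_bit()
            self.cond.notify_all()          # like wake_up()
            # Demo only: wait for the worker so the log order is fixed.
            self.cond.wait_for(lambda: not self.flags)

    def store(self, text):
        self.buf = text
        self._wake("store")

    def show(self):
        text = self.buf
        self._wake("show")
        return text

    def stop(self):
        with self.cond:
            self.stopping = True
            self.cond.notify_all()
        self.thread.join()


worker = WakeableThread()
worker.store("hello")
print(worker.show())        # hello
worker.stop()
print(worker.log)           # ['store', 'show', 'kill']
```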

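Monday, November 1, 2010

Creating the device node automatically when a driver is loaded

I found several ways online to create device nodes automatically, but they no longer work because the kernel interfaces have changed. The program below works on kernel 2.6.32: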

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/cdev.h>
#include <linux/device.h>

static struct cdev test_cdev;
static dev_t test_devno;
static struct class *test_class;
struct device *test_device;

static const struct file_operations test_fops =
{
   .owner          = THIS_MODULE,
};

static int test_init(void)
{
   int ret;

   ret = alloc_chrdev_region(&test_devno, 0, 255, "test");
   if (ret) {
       printk(KERN_INFO "alloc_chrdev_region failed, ret=%d\n", ret);
       return ret;
   }
   cdev_init(&test_cdev, &test_fops);
   test_cdev.owner = THIS_MODULE;
   ret = cdev_add(&test_cdev, test_devno, 1);
   if (ret) {
       printk(KERN_INFO "cdev_add failed, ret=%d\n", ret);
       goto free_devno;
   }
   test_class = class_create(THIS_MODULE, "test_class");
   if (IS_ERR(test_class)) {
       ret = PTR_ERR(test_class);
       printk(KERN_INFO "class_create failed, ret=%d\n", ret);
       goto free_cdev;
   }
   test_device = device_create(test_class, NULL, test_devno, NULL, "test");
   if (IS_ERR(test_device)) {
       ret = PTR_ERR(test_device);
       printk(KERN_INFO "device_create failed, ret=%d\n", ret);
       goto free_class;
   }
   return 0;
free_class:
   class_destroy(test_class);
free_cdev:
   cdev_del(&test_cdev);
free_devno:
   unregister_chrdev_region(test_devno, 255);
   return ret;
}

void test_exit(void)
{
   device_destroy(test_class, test_devno);
   class_destroy(test_class);
   cdev_del(&test_cdev);
   unregister_chrdev_region(test_devno, 255);
}
MODULE_LICENSE("GPL");
module_init (test_init);
module_exit (test_exit);

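Sunday, October 24, 2010

Using hash tables in the Linux kernel

I had long wanted to learn how to use the hashing facilities in the Linux
kernel. I googled several times and found plenty of material about hlist,
but nothing that showed how hashing is actually meant to be used. I had
assumed that hashing in the kernel would resemble what high-level
languages offer: I hand the kernel (key, value) pairs, then call some api
with a key and get the corresponding value back.

After looking at some places in the kernel that use hashing, I found that
it does not work that way at all. The algorithm that hashes input data
into a given range is chosen by whoever uses the hash table. The kernel
only provides the hlist structure, which chains colliding entries together
when a collision occurs.

For example, say you have created a hash table of length m and chosen a
hash function that maps input data to the range 0 ~ m-1. You then put a
struct hlist_head in each of the m table slots and define a struct
hlist_node inside the structure of each input item. Whenever an input is
mapped by the hash function to a value in 0 ~ m-1, its hlist_node is
linked onto the hlist_head of the corresponding slot. Given a key whose
value you want, you first compute the slot with the hash function, then
walk the hlist_node chain of that slot to find the entry whose key
matches, and return its value.

For example, take this array:
0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1d, 0x3a, 0x74, 0xe8,
0xcd, 0x87, 0x13,
where the elements have these index numbers:
1, 2, 3, 4, 5, ... 15
That is, for input 0x01 I want index 1, for input 0x08 I want 4, for input
0x3a I want 10...
This mapping from value to index can be implemented with hashing.

Below is kernel code that implements this. The hash function I chose is:
value = ((104 * key + 52) % 233) % 15
(In fact, when the input set is fixed, perfect hashing gives fully
constant access time. The function above is the first-level hash function
I found by searching a universal hash family while trying to use perfect
hashing. It turned out to be good enough already, with only one collision
in total, so there was no need for the second level that perfect hashing
would add.)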

#include <linux/init.h>
#include <linux/module.h>
#include <linux/list.h>

struct q_coef
{
    u8 coef;
    u8 index;
    struct hlist_node hash;
};

#define HASH_NUMBER 15
u8 coef[HASH_NUMBER] = {
    0x01, 0x02, 0x04, 0x08,0x10, 0x20, 0x40, 0x80, 0x1d, 0x3a, 0x74, 0xe8, 0xcd, 0x87, 0x13,
};
struct q_coef q_coef_list[HASH_NUMBER];

struct hlist_head hashtbl[HASH_NUMBER];

static inline int hash_func(u8 k)
{
    int a, b, p, m;
    a = 104;
    b = 52;
    p = 233;
    m = HASH_NUMBER;
    return ((a * k + b) % p) % m;
}

static void hash_init(void)
{
    int i, j;
    for (i = 0 ; i < HASH_NUMBER ; i++) {
        INIT_HLIST_HEAD(&hashtbl[i]);
        INIT_HLIST_NODE(&q_coef_list[i].hash);
        q_coef_list[i].coef = coef[i];
        q_coef_list[i].index = i + 1;
    }
    for (i = 0 ; i < HASH_NUMBER ; i++) {
        j = hash_func(q_coef_list[i].coef);
        hlist_add_head(&q_coef_list[i].hash, &hashtbl[j]);
    }
}

static void hash_test(void)
{
    int i, j;
    struct q_coef *q;
    struct hlist_node *hn;
    for (i = 0 ; i < HASH_NUMBER ; i++) {
        j = hash_func(coef[i]);
        hlist_for_each_entry(q, hn, &hashtbl[j], hash)
            if (q->coef == coef[i])
                printk("found: coef=0x%02x index=%d\n", q->coef, q->index);
    }
}
static int htest_init (void)
{
    hash_init();
    hash_test();
    /* Returning nonzero makes insmod fail, so the test module does not stay loaded. */
    return -1;
}

static void htest_exit (void)
{
}

module_init(htest_init);
module_exit(htest_exit);

MODULE_LICENSE("Dual BSD/GPL");
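
The behaviour of this hash function is easy to check outside the kernel. The sketch below mirrors the module above in Python, with a list of buckets standing in for the hlist_head array; the names build_table and lookup are invented for the illustration. It counts the collisions and performs the same value-to-index lookups.

```python
HASH_NUMBER = 15
COEF = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
        0x1d, 0x3a, 0x74, 0xe8, 0xcd, 0x87, 0x13]


def hash_func(k):
    return ((104 * k + 52) % 233) % HASH_NUMBER


def build_table():
    """One bucket (a list) per slot; colliding entries share a bucket."""
    table = [[] for _ in range(HASH_NUMBER)]
    for index, coef in enumerate(COEF, start=1):
        table[hash_func(coef)].insert(0, (coef, index))   # like hlist_add_head
    return table


def lookup(table, coef):
    """Hash to a slot, then walk that slot's chain comparing keys."""
    for key, index in table[hash_func(coef)]:
        if key == coef:
            return index
    return None


table = build_table()
collisions = sum(len(bucket) - 1 for bucket in table if bucket)
print(collisions)                                   # 1
print(lookup(table, 0x01), lookup(table, 0x08), lookup(table, 0x3a))
# 1 4 10
```

The single collision is between 0x40 and 0xcd, which both hash to slot 4.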
