
Re: [Xen-devel] blktap wedges when block-attached to dom0



Any chance this will be refreshed for 2.6.18? I very much enjoy being
able to block-attach in domain 0, but am less enamoured of the
frequent hangs when I fsck those devices...

On Tuesday, 02 January 2007 at 17:37, jake wrote:
> blktap devices attached to dom0 are liable to wedge during IO transfers.
> The problem does not occur in typical usage scenarios (i.e., virtual
> devices attached to guest domains); it is unique to the unanticipated
> case in which virtual devices are attached to dom0. 
> 
> The problem arises when processes in dom0 generate a large number of
> dirty pages while writing to a block-attached device.  Once the number
> of dirty pages reaches a certain threshold, the dom0 kernel begins
> throttling IO in balance_dirty_pages; processes traversing the buffered
> IO path will block in this function until the number of dirty pages
> decreases. 
> 
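To check my reading of the throttling condition, here is a minimal user-space model in C. It is only a sketch: vm_model, writer_must_throttle and writeback_done are hypothetical names of mine, and the real balance_dirty_pages logic is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of dom0's dirty-page accounting (not kernel code). */
struct vm_model {
    long nr_dirty;      /* pages currently dirty */
    long dirty_thresh;  /* writers are throttled above this count */
};

/* A buffered write dirties `pages` pages. Returns true if the writer
 * would now have to wait for writeback before continuing. */
bool writer_must_throttle(struct vm_model *vm, long pages)
{
    assert(vm != NULL && pages >= 0);
    vm->nr_dirty += pages;
    return vm->nr_dirty > vm->dirty_thresh;
}

/* Writeback of `pages` pages has completed; the count never goes below 0. */
void writeback_done(struct vm_model *vm, long pages)
{
    assert(vm != NULL && pages >= 0);
    vm->nr_dirty = pages > vm->nr_dirty ? 0 : vm->nr_dirty - pages;
}
```

In this model a writer only gets going again once something calls writeback_done, which is the role tapdisk plays for a dom0-attached device.
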
> This is bad for the tapdisk process, which is responsible for servicing
> IO requests from the blktap driver.  The tapdisk process normally
> performs direct IO, but if it writes to a hole in a sparse file, it
> falls into the buffered IO path.  If the tapdisk process blocks in
> balance_dirty_pages, it will do so indefinitely, because it is the only
> process that cleans the pages dirtied by the processes writing to the
> virtual device.  Thus dirty pages continue to amass in dom0 as IO is
> performed on the virtual device, but none of them make it to the
> physical devices because the tapdisk process is unable to service the
> requests. 
> 
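The wedge described above can be reproduced in a toy tick-based simulation. This is a sketch under simplifying assumptions of my own: a single dom0 writer, a single tapdisk, arbitrary fixed rates, and a tapdisk that is throttled whenever the writer is. pages_cleaned and both rate constants are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Arbitrary per-tick rates; the writer dirties faster than tapdisk cleans. */
enum { WRITER_RATE = 4, TAPDISK_RATE = 3 };

/* Returns the total number of pages tapdisk managed to clean in `ticks`
 * steps. With tapdisk_exempt == false, tapdisk stops as soon as the
 * dirty count exceeds `thresh`, and nothing ever lowers the count again. */
long pages_cleaned(long thresh, int ticks, bool tapdisk_exempt)
{
    long dirty = 0, cleaned = 0;

    assert(thresh > 0 && ticks >= 0);
    for (int t = 0; t < ticks; t++) {
        bool over = dirty > thresh;

        /* The dom0 writer is throttled while over the threshold. */
        if (!over)
            dirty += WRITER_RATE;

        /* tapdisk is the only cleaner; unless exempt, it is throttled too. */
        if (!over || tapdisk_exempt) {
            long n = dirty < TAPDISK_RATE ? dirty : TAPDISK_RATE;
            dirty -= n;
            cleaned += n;
        }
    }
    return cleaned;
}
```

Without the exemption, running the simulation for longer cleans no additional pages once the threshold has been crossed. With the exemption, the cleaned count keeps growing with the number of ticks.
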
> Note that when used as originally intended, blktap does not suffer from
> this problem: when blktap devices are attached to guest domains,
> performing IO on them dirties pages in the guest domain, not in dom0, so
> the tapdisk process doesn't get throttled in balance_dirty_pages.
> 
> Attached is a patch that avoids the dom0 problem by exempting the
> tapdisk process from blocking in balance_dirty_pages.  tapdisk processes
> servicing dom0-attached devices are granted special status using a
> modified setpriority syscall; a check in balance_dirty_pages ensures
> that such processes do not block indefinitely. 
> 
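One thing worth spelling out about reusing setpriority: judging by the lines the patch removes from sys_setpriority, a kernel without the patch clamps the nice value to the range -20..19, so the sentinel would be treated as a request for nice -20 rather than rejected. A small model of that normalization, with normalize_niceval as a hypothetical name and the `which` check from the patch left out:

```c
#include <assert.h>

#define PRIO_SPECIAL_IO (-9999)  /* sentinel value introduced by the patch */

/* Model of the niceval normalization in sys_setpriority.
 * patched != 0 models a kernel carrying blktap-ioprio.patch. */
int normalize_niceval(int niceval, int patched)
{
    assert(patched == 0 || patched == 1);

    /* Patched kernel: the sentinel is passed through untouched. */
    if (patched && niceval == PRIO_SPECIAL_IO)
        return niceval;

    /* Otherwise clamp to the valid nice range, as the stock code does. */
    if (niceval < -20)
        niceval = -20;
    if (niceval > 19)
        niceval = 19;
    return niceval;
}
```

So if tools built with this change run on an unpatched dom0 kernel, I would expect tapdisk to be reniced to -20 (or the call to fail for lack of privilege) instead of being marked as an IO process.
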
> This is clearly a hacky solution; any suggestions for improvement are
> welcome.

> # HG changeset patch
> # User Jake Wires <jwires@xxxxxxxxxxxxx>
> # Date 1166551978 28800
> # Node ID 34c6a9a2983ae46fad5dbba7e4b49520fb639a8c
> # Parent  df1e7ae878b4badf4e5555df12a1c4d233170fb9
> [BLKTAP] prevent tapdisk processes from blocking in balance_dirty_pages
> 
> This patch modifies the setpriority syscall so that processes can be marked as
> special IO processes, which are exempted from blocking in balance_dirty_pages.
> This patch is intended to avoid deadlocks when block-attaching a blktap VDI to
> dom0.
> 
> diff -r df1e7ae878b4 -r 34c6a9a2983a patches/linux-2.6.16.33/series
> +++ b/patches/linux-2.6.16.33/series  Tue Dec 19 10:12:58 2006 -0800
> @@ -5,6 +5,7 @@ git-4bfaaef01a1badb9e8ffb0c0a37cd2379008
>  git-4bfaaef01a1badb9e8ffb0c0a37cd2379008d21f.patch
>  linux-2.6.19-rc1-kexec-move_segment_code-x86_64.patch
>  blktap-aio-16_03_06.patch
> +blktap-ioprio.patch
>  device_bind.patch
>  fix-hz-suspend.patch
>  fix-ide-cd-pio-mode.patch
> diff -r df1e7ae878b4 -r 34c6a9a2983a tools/blktap/drivers/blktapctrl.c
> +++ b/tools/blktap/drivers/blktapctrl.c       Tue Dec 19 10:12:58 2006 -0800
> @@ -51,6 +51,7 @@
>  #include <xs.h>
>  #include <printf.h>
>  #include <sys/time.h>
> +#include <sys/resource.h>
>  #include <syslog.h>
>                                                                       
>  #include "blktaplib.h"
> @@ -535,6 +536,14 @@ int blktapctrl_new_blkif(blkif_t *blkif)
>                       goto fail;
>               }
>  
> +             /* exempt tapdisk from flushing when attached to dom0 */
> +             if (blkif->domid == 0) 
> +                     if (setpriority(PRIO_PROCESS, 
> +                                     blkif->tappid, PRIO_SPECIAL_IO)) {
> +                             DPRINTF("Unable to prioritize tapdisk proc\n");
> +                             goto fail;
> +                     }
> +
>               /* Both of the following read and write calls will block up to 
>                * max_timeout val*/
>               if (write_msg(blkif->fds[WRITE], CTLMSG_PARAMS, blkif, ptr) 
> diff -r df1e7ae878b4 -r 34c6a9a2983a tools/blktap/lib/blktaplib.h
> +++ b/tools/blktap/lib/blktaplib.h    Tue Dec 19 10:12:58 2006 -0800
> @@ -57,6 +57,8 @@
>  #define BLKTAP_QUERY_ALLOC_REQS      8
>  #define BLKTAP_IOCTL_FREEINTF             9
>  #define BLKTAP_IOCTL_PRINT_IDXS      100   
> +
> +#define PRIO_SPECIAL_IO             -9999
>  
>  /* blktap switching modes: (Set with BLKTAP_IOCTL_SETMODE)             */
>  #define BLKTAP_MODE_PASSTHROUGH      0x00000000  /* default            */
> diff -r df1e7ae878b4 -r 34c6a9a2983a patches/linux-2.6.16.33/blktap-ioprio.patch
> +++ b/patches/linux-2.6.16.33/blktap-ioprio.patch     Tue Dec 19 10:12:58 2006 -0800
> @@ -0,0 +1,81 @@
> +diff -pruN ../orig-linux-2.6.16.33/include/linux/sched.h ./include/linux/sched.h
> +--- ../orig-linux-2.6.16.33/include/linux/sched.h    2006-12-18 18:42:00.000000000 -0800
> ++++ ./include/linux/sched.h  2006-12-18 18:46:07.000000000 -0800
> +@@ -706,6 +706,7 @@ struct task_struct {
> +     prio_array_t *array;
> + 
> +     unsigned short ioprio;
> ++    short special_prio;
> + 
> +     unsigned long sleep_avg;
> +     unsigned long long timestamp, last_ran;
> +diff -pruN ../orig-linux-2.6.16.33/include/linux/resource.h ./include/linux/resource.h
> +--- ../orig-linux-2.6.16.33/include/linux/resource.h 2006-12-18 18:42:00.000000000 -0800
> ++++ ./include/linux/resource.h       2006-12-18 18:44:35.000000000 -0800
> +@@ -44,6 +44,7 @@ struct rlimit {
> + 
> + #define     PRIO_MIN        (-20)
> + #define     PRIO_MAX        20
> ++#define PRIO_SPECIAL_IO -9999
> + 
> + #define     PRIO_PROCESS    0
> + #define     PRIO_PGRP       1
> +diff -pruN ../orig-linux-2.6.16.33/include/linux/init_task.h ./include/linux/init_task.h
> +--- ../orig-linux-2.6.16.33/include/linux/init_task.h        2006-12-18 18:42:00.000000000 -0800
> ++++ ./include/linux/init_task.h      2006-12-18 18:45:56.000000000 -0800
> +@@ -85,6 +85,7 @@ extern struct group_info init_groups;
> +     .lock_depth     = -1,                                           \
> +     .prio           = MAX_PRIO-20,                                  \
> +     .static_prio    = MAX_PRIO-20,                                  \
> ++        .special_prio   = 0,                                            \
> +     .policy         = SCHED_NORMAL,                                 \
> +     .cpus_allowed   = CPU_MASK_ALL,                                 \
> +     .mm             = NULL,                                         \
> +diff -pruN ../orig-linux-2.6.16.33/kernel/sys.c ./kernel/sys.c
> +--- ../orig-linux-2.6.16.33/kernel/sys.c     2006-12-18 18:42:00.000000000 -0800
> ++++ ./kernel/sys.c   2006-12-18 18:43:30.000000000 -0800
> +@@ -245,6 +245,11 @@ static int set_one_prio(struct task_stru
> +             error = -EPERM;
> +             goto out;
> +     }
> ++    if (niceval == PRIO_SPECIAL_IO) {
> ++            p->special_prio = PRIO_SPECIAL_IO;
> ++            error = 0;
> ++            goto out;
> ++    }
> +     if (niceval < task_nice(p) && !can_nice(p, niceval)) {
> +             error = -EACCES;
> +             goto out;
> +@@ -272,10 +277,15 @@ asmlinkage long sys_setpriority(int whic
> + 
> +     /* normalize: avoid signed division (rounding problems) */
> +     error = -ESRCH;
> +-    if (niceval < -20)
> +-            niceval = -20;
> +-    if (niceval > 19)
> +-            niceval = 19;
> ++    if (niceval == PRIO_SPECIAL_IO) {
> ++            if (which != PRIO_PROCESS)
> ++                    return -EINVAL;
> ++    } else {
> ++            if (niceval < -20)
> ++                    niceval = -20;
> ++            if (niceval > 19)
> ++                    niceval = 19;
> ++    }
> + 
> +     read_lock(&tasklist_lock);
> +     switch (which) {
> +diff -pruN ../orig-linux-2.6.16.33/mm/page-writeback.c ./mm/page-writeback.c
> +--- ../orig-linux-2.6.16.33/mm/page-writeback.c      2006-12-19 10:03:59.000000000 -0800
> ++++ ./mm/page-writeback.c    2006-12-19 10:04:17.000000000 -0800
> +@@ -231,6 +231,9 @@ static void balance_dirty_pages(struct a
> +                     pages_written += write_chunk - wbc.nr_to_write;
> +                     if (pages_written >= write_chunk)
> +                             break;          /* We've done our duty */
> ++                    if (current->special_prio == PRIO_SPECIAL_IO)
> ++                            break;          /* Exempt IO processes */
> ++
> +             }
> +             blk_congestion_wait(WRITE, HZ/10);
> +     }
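
For anyone who wants to exercise the blktapctrl side without touching a live process, the dom0 check can be pulled into a helper that takes the priority-setting function as a parameter. This is a hedged sketch only: exempt_tapdisk_if_dom0, setprio_fn, stub_setprio and last_prio_requested are hypothetical names that do not appear in the patch, and the stub merely records the request.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/resource.h>

#ifndef PRIO_SPECIAL_IO
#define PRIO_SPECIAL_IO (-9999)  /* sentinel from the patch; absent from stock headers */
#endif

/* Same shape as setpriority(2), so a thin wrapper around it can be passed in. */
typedef int (*setprio_fn)(int which, id_t who, int prio);

/* Mirrors the check added to blktapctrl_new_blkif: only tapdisk processes
 * serving dom0 (domid 0) are marked. Returns 0 on success or when no
 * marking is needed, otherwise whatever `setprio` returned. */
int exempt_tapdisk_if_dom0(unsigned int domid, pid_t tappid, setprio_fn setprio)
{
    assert(setprio != NULL);
    if (domid != 0)
        return 0;
    return setprio(PRIO_PROCESS, (id_t)tappid, PRIO_SPECIAL_IO);
}

/* Stand-in that records the requested value instead of making a syscall. */
int last_prio_requested = 0;

int stub_setprio(int which, id_t who, int prio)
{
    (void)which;
    (void)who;
    last_prio_requested = prio;
    return 0;
}
```

Calling exempt_tapdisk_if_dom0(0, pid, stub_setprio) records the sentinel, while any non-zero domid leaves the stub untouched.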

> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel

