Re: [Xen-devel] Greater than 16 xvd devices for blkfront

To: Chris Lalancette <clalance@xxxxxxxxxx>
Subject: Re: [Xen-devel] Greater than 16 xvd devices for blkfront
From: "Daniel P. Berrange" <berrange@xxxxxxxxxx>
Date: Wed, 7 May 2008 02:55:02 +0100
Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Tue, 06 May 2008 18:55:25 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
In-reply-to: <48209705.4030005@xxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <48209705.4030005@xxxxxxxxxx>
Reply-to: "Daniel P. Berrange" <berrange@xxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mutt/1.4.1i
On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> All,
>      We've had a number of requests to increase the number of xvd devices
> that a PV guest can have.  Currently, if you try to connect > 16 disks, you
> get an error from xend.  The problem ends up being that both xend and
> blkfront assume that for dev_t, major/minor is 8 bits each, where in fact
> there are actually 10 bits for major and 22 bits for minor.
>      Therefore, it shouldn't really be a problem giving lots of disks to
> guests.  The problem is in backwards compatibility, and the details.  What I
> am initially proposing to do is to leave things where they are for
> /dev/xvd[a-p]; that is, still put the xenstore entries in the same place, and
> use 8 bits for the major and 8 bits for the minor.  For anything above that,
> we would end up putting the xenstore entry in a different place, and pushing
> the major into the top 10 bits (leaving the bottom 22 bits for the minor);
> that way old guests won't fire when the entry is added, and we will add code
> to newer guests' blkfront so that they will fire when they see that entry.
> Does anyone see any problems with this setup, or have any ideas how to do it
> better?

Looking at the blkfront code, I think we can increase the minor numbers
available for xvdX devices without requiring changes to where things are
stored.

The key is that in blkfront we can reliably detect the overflow that occurs
once we go past the 16th disk, because the resulting major number, 203,
doesn't clash with any of the other major numbers blkfront is looking for.

Consider the 17th disk, 'xvdq': this gives a device number in xenstore
of 51968.

Upon seeing this, current blkfront code will use

   #define BLKIF_MAJOR(dev) ((dev)>>8)
   #define BLKIF_MINOR(dev) ((dev) & 0xff)

and so gets back a major number of 203 and a minor number of 0.
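
For reference, here is that arithmetic as a small Python sketch of my own
(the helper names are mine; the 202 * 256 + 16 * index encoding is the one
used by xend's blkdev_name_to_number(), shown further below):

    XVD_MAJOR = 202                     # block major reserved for xvd disks

    def xvd_device_number(letter, partition=0):
        # Same math as xend's blkdev_name_to_number() for /dev/xvdX
        return XVD_MAJOR * 256 + 16 * (ord(letter) - ord('a')) + partition

    def blkif_major(dev):               # mirrors BLKIF_MAJOR(dev)
        return dev >> 8

    def blkif_minor(dev):               # mirrors BLKIF_MINOR(dev)
        return dev & 0xff

    dev = xvd_device_number('q')        # 17th disk, 'xvdq'
    print(dev, blkif_major(dev), blkif_minor(dev))   # -> 51968 203 0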

In the xlbd_get_major_info(int vdevice) function, there is a switch on major
numbers, and the xvdX case is handled as the default:

        major = BLKIF_MAJOR(vdevice);
        minor = BLKIF_MINOR(vdevice);

        switch (major) {
        case IDE0_MAJOR: index = 0; break;
         ....snipped...
        case IDE9_MAJOR: index = 9; break;
        case SCSI_DISK0_MAJOR: index = 10; break;
        case SCSI_DISK1_MAJOR ... SCSI_DISK7_MAJOR:
                index = 11 + major - SCSI_DISK1_MAJOR;
                break;
        case SCSI_CDROM_MAJOR: index = 18; break;
        default: index = 19; break;
        }


So, the 17th disk in fact gets treated as the 1st disk, the frontend assigns
it the name 'xvda', and then it promptly kernel panics because xvda already
exists in sysfs:

kobject_add failed for xvda with -EEXIST, don't try to register things with the 
same name in the same directory.

Call Trace:
 [<ffffffff80336951>] kobject_add+0x16e/0x199
 [<ffffffff8025ce3c>] exact_lock+0x0/0x14
 [<ffffffff8029b271>] keventd_create_kthread+0x0/0xc4
 [<ffffffff802f393e>] register_disk+0x43/0x198
 [<ffffffff8029b271>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8032e453>] add_disk+0x34/0x3d
 [<ffffffff88074eb8>] :xenblk:backend_changed+0x110/0x193
 [<ffffffff803a4029>] xenbus_read_driver_state+0x26/0x3b

Now, this kernel panic isn't a huge problem (though blkfront ought to handle
the kobject_add failure gracefully), because we can never do anything to make
existing frontends deal with > 16 disks. If an admin tries to add more than
16 disks to an existing guest, they should already expect doom.

For future frontends though, it looks like we can adapt the switch(major)
in xlbd_get_major_info(), so that it detects the overflow of minor numbers,
and re-adjusts the major/minor numbers to their intended value:


e.g. change

        case SCSI_CDROM_MAJOR: index = 18; break;
        default: index = 19; break;
        }


to

        case SCSI_CDROM_MAJOR: index = 18; break;
        default: 
              index = 19;
              if (major > XLBD_MAJOR_VBD_START) {
                  minor += 16 * (major - XLBD_MAJOR_VBD_START);
                  major = XLBD_MAJOR_VBD_START;
              }
              break;
        }


Now, I've not actually tested this, and there are a few other places in
blkfront needing similar tweaks, but I don't see anything in the code which
fundamentally stops this overflow detection & fixup.
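
To illustrate the intended effect, here is the fixup arithmetic as a Python
sketch of my own (not the eventual patch; I'm assuming XLBD_MAJOR_VBD_START
is 202, the first major reserved for xvd devices):

    XLBD_MAJOR_VBD_START = 202

    def fixup(vdevice):
        major, minor = vdevice >> 8, vdevice & 0xff
        if major > XLBD_MAJOR_VBD_START:
            # Fold the overflowed major back into the minor number range
            minor += 16 * (major - XLBD_MAJOR_VBD_START)
            major = XLBD_MAJOR_VBD_START
        return major, minor

    print(fixup(51968))                 # 17th disk 'xvdq' -> (202, 16)

i.e. the 17th disk comes back as major 202 with minor 16, which is exactly
where 'xvdq' would sit if the minor space simply continued past 'xvdp'.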

As far as the XenD backend is concerned, all we need to do is edit the XenD
blkdev_name_to_number() function in tools/python/xen/util/blkif.py to relax
the regex to allow > xvdp, and adapt the math so it overflows onto the major
numbers following XVD's 202.

e.g. change

    if re.match( '/dev/xvd[a-p]([1-9]|1[0-5])?', n):
        return 202 * 256 + 16 * (ord(n[8:9]) - ord('a')) + int(n[9:] or 0)

to

    if re.match( '/dev/xvd[a-z]([1-9]|1[0-5])?', n):
        return 202 * 256 + 16 * (ord(n[8:9]) - ord('a')) + int(n[9:] or 0)

gets you to 26 disks. This is how I got the guest to boot and the frontend to
crash on the 17th disk, 'xvdq'. It is a little more complex to cope with
2-letter drive names, but there is no show stopper there.
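
For what it's worth, here is a rough sketch (mine, not the actual blkif.py
change; the helper name and regex are hypothetical) of one way the name
parsing could be extended to 2-letter drives such as 'xvdaa':

    import re

    def xvd_name_to_number(name):
        # Sketch only: 1- or 2-letter xvd names, optional partition 1-15
        m = re.match(r'/dev/xvd([a-z]{1,2})([1-9]|1[0-5])?$', name)
        if not m:
            return None
        letters, part = m.group(1), int(m.group(2) or 0)
        index = 0
        for c in letters:               # bijective base-26, like sd names
            index = index * 26 + (ord(c) - ord('a') + 1)
        return 202 * 256 + 16 * (index - 1) + part

    print(xvd_name_to_number('/dev/xvdq'))    # 51968, matches the earlier example
    print(xvd_name_to_number('/dev/xvdaa'))   # 52128, i.e. major 203, minor 160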

So, unless I'm missing something obvious, we can keep compatibility with
existing guests for the first 16 disks and still (indirectly) make use of
the wider dev_t major/minor split for the 17th+ disk, without needing to
change how or where anything is stored in XenStore.

Regards,
Daniel.
-- 
|: Red Hat, Engineering, Boston   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel