[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Proposal: use disk sequence numbers to avoid races in blkback


  • To: Demi Marie Obenour <demi@xxxxxxxxxxxxxxxxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Wed, 11 May 2022 09:37:54 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=mG3BJ2lOouiYMf/wKmYraYk0XNuInJnwvHltPzJZYhc=; b=HomKU7ups08ugzIVEU3FYt4gMwBNqC/x64Q/gXCoU/RxB0g+nQbECzvq4Xfcccul3Mj3uNWk6VwWjYcCVRhmzc/ta0KBqoxoPJczRD7Y8Ba9BCSGJV39TZs0K8SvgtzLZA+3JxReGzxHN3HqWAiqG2tlwUTqw0wQcwP0Gf0QVNOokVKI9LvBcJlP6Ij/cL4GRYCOtyP2fxDu9/7IpV+cm855V5yu6UfpdIYjic/GH92+X+bTwb7bGbLpDvmmxQksSjEeO3nw76wCKzULWa4WQd7bM7c8NVuLCEH2Yr6wxlv82p9Q8FNP/uREDUpFLHje3+OkCtRkjfP4m2YHrP6cpA==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=J6DPeso0XoNlxxzTVcNZ4IRqPB4zpu+83Ws2+mlqup4wlZFWN5+15PZi9UVHiPshuN4qSnupd5qCmp59ksbFAf3nGWYpzJr0blir2jfhYCnZ8pVagLRTZT20Fh5fmhJb1EE5Wpbv+VvbuRuqtedoZ+MXdK6ux6ZEFBcBW5cnkQrU4W8kXc1W4DUa4pooYpUrZnsMqKtSD8BqUhlZKbkn7XTH5cG9Jpbh7ufaGpfDMCO3j4Dr4IN7m3rIl6x/sToDd90iNqS1D2mxhqtLSapySsy3hZ4BVOMJ4ff17kC8gKPNGUPgKDQ/+vT+tRt8pQJPJP/SfB3NAk5ffsWXtJOnGw==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: Xen developer discussion <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Mariusz Zaborski <oshogbo@xxxxxxxxxxx>
  • Delivery-date: Wed, 11 May 2022 07:38:26 +0000
  • Ironport-data: A9a23:YDVnjahhomchQ0AL7X4SRGvOX161QBEKZh0ujC45NGQN5FlHY01je htvUWHSPa2NZmX9etsibIXlpkoB68SEytM2HVRkq3w0Fn8b9cadCdqndUqhZCn6wu8v7a5EA 2fyTvGacajYm1eF/k/F3oDJ9CU6jefSLlbFILas1hpZHGeIcw98z0M68wIFqtQw24LhXlrU4 YqaT/D3YzdJ5RYlagr41IrbwP9flKyaVOQw5wFWiVhj5TcyplFNZH4tDfjZw0jQG+G4KtWSV efbpIxVy0uCl/sb5nFJpZ6gGqECaua60QFjERO6UYD66vRJjnRaPqrWqJPwwKqY4tmEt4kZ9 TlDiXC/YSR1N6qSmfYBaRJ7Hyt3LZBrxp/1PXfq5KR/z2WeG5ft69NHKRhseKc+qqNwC2wI8 uEEIjcQaBzFn/ix3L+wVuhrgIIkMdXvO4Qc/HpnyFk1D95/GcyFH/qMuIIehWlh7ixNNa+2i 84xcz1gYQ6GexRSElwWFIg/jKGjgXyXnzhw9wvN+PpnuDK7IApZ/b/faoP/Z9K2YO523Ua1h U/B53bCDURPXDCY4X/fmp62vcfDkCb6cIMUCryj9/RujUGTx2ocExkfXx2wpvzRol6zXZdTJ lIZ/gIqrLMu7wq7Q9/lRRq6rXWY+BkGVLJ4FuQg7QiXx6n84gCHB3MFRDpMdNwnssAtQTUgk FSOmrvBAidvt7KfTlqT7LqZpyi+fy8PIgc/iTQsSAIE55zpptE1hxeWFNJ7Svfr35vyBC36x C2MoG4mnbIPgMUX1qK9u1fanzaroZuPRQkwjunKYl+YAspCTNbNT+SVBZLztJ6s8K7xooG9g UU5
  • Ironport-hdrordr: A9a23:qC8hha7tNgeyoa0yGwPXwVqBI+orL9Y04lQ7vn2ZFiY5TiXIra qTdaogviMc6Ax/ZJjvo6HkBEClewKlyXcT2/hrAV7CZniehILMFu1fBOTZowEIdxeOldK1kJ 0QCZSWa+eAcmSS7/yKhzVQeuxIqLfnzEnrv5a5854Ed3AXV0gK1XYcNu/0KDwVeOEQbqBJaa Z0q/A37gaISDAyVICWF3MFV+/Mq5nik4/nWwcPA1oC5BOVhT2lxbbmG1zAty1uGA9n8PMHyy zoggb57qKsv7WSzQLd7Xba69BzlMH6wtVOKcSQgow+KynqiCyveIN9Mofy9AwdkaWK0hIHgd PMqxAvM4Ba7G7QRHi8pV/X1wzpwF8Vmgvf4G7dpUGmjd3yRTo8BcYEr5leaAHl500pu8w5+L 5X3kqC3qAnQi/orWDY3ZzlRhtqnk27rT4JiugIlUFSVoMYdft4sZEfxkVIC50NdRiKpLzPKN MeTf002cwmMW9zNxvizypSKZ2XLzkO9y69MwY/Upf/6UkVoJh7p3FosfD30E1wsa7VcKM0lt gsAp4Y6o2mcfVmHZ6VJN1xNvdfWVa9Ny4lDgqpUCfaPZBCHU7xgLjKx5hwzN2WWfUzvekPcd L6IRlliVI=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, May 10, 2022 at 12:22:51PM -0400, Demi Marie Obenour wrote:
> On Tue, May 10, 2022 at 12:57:48PM +0200, Roger Pau Monné wrote:
> > On Thu, May 05, 2022 at 08:30:17PM -0400, Demi Marie Obenour wrote:
> > > Proposal: Check disk sequence numbers in blkback
> > > ================================================
> > > 
> > > Currently, adding block devices to a domain is racy.  libxl writes the
> > > major and minor number of the device to XenStore, but it does not keep
> > > the block device open until blkback has opened it.  This creates a race
> > > condition, as it is possible for the device to be destroyed and another
> > > device allocated with the same major and minor numbers.  Loop devices
> > > are the most obvious example, since /dev/loop0 can be reused again and
> > > again, but the same problem can also happen with device-mapper devices.
> > > If the major and minor numbers are reused before blkback has attached to
> > > the device, blkback will pass the wrong device to the domain, with
> > > obvious security consequences.
> > > 
> > > Other programs on Linux have the same problem, and a solution was
> > > committed upstream in the form of disk sequence numbers.  A disk
> > > sequence number, or diskseq, is a 64-bit unsigned monotonically
> > > increasing counter.  The combination of a major and minor number and a
> > > disk sequence number uniquely identifies a block device for the entire
> > > uptime of the system.
> > 
> > Seems fine to me, this is just an extra check to make sure the block
> > device opened by blkback is the one that user space intended.  I would
> > see diskseq as a kind of checksum.
> 
> Ideally, diskseq would be the primary means of identifying a device, but
> that isn’t an option without more substantial changes, sadly.
> 
> > > I propose that blkback check for an unsigned 64-bit hexadecimal XenStore
> > > entry named “diskseq”.  If the entry exists, blkback checks that the
> > > number stored there matches the disk sequence number of the device.  If
> > > it does not exist, the check is skipped.  If reading the entry fails for
> > > any other reason, the entry is malformed, or if the sequence number is
> > > wrong, blkback refuses to export the device.
> > > 
> > > The toolstack changes are more involved for two reasons:
> > > 
> > > 1. To ensure that loop devices are not leaked if the toolstack crashes,
> > >    they must be created with the delete-on-close flag set.  This
> > >    requires that the toolstack hold the device open until blkback has
> > >    acquired a handle to it.
> > 
> > Does this work with loop devices?  I would expect that you need to
> > issue a losetup call to detach the device.
> 
> That is what the autoclear flag is for.  It will cause the device to be
> destroyed by the kernel as soon as the last handle to it has been
> closed.  This is why the toolstack needs to hold a file descriptor to
> the device.

What would happen if the backend closes the device (because the
connection is torn down) and then try to open it again (because the
guest has triggered a reconnection)?

> > Even more, the loop device is created by the block script, but there's
> > also a window between the block script execution and the toolstack
> > knowing about the device, which could also allow for a leak?
> 
> For this to work, either the toolstack or block script will need to open
> the file and perform loop(4) ioctls to assign the file descriptor to a
> loop device.  This cannot be done by a shell script, so I plan on using
> a C program to perform these tasks.  In Qubes OS, I expect this program
> to replace the block script entirely, as performance is critical and
> flexibility less so.  For upstream, I recommend having the block script
> be a script that calls this C program.

block scripts can be plain binary executables, so I think it would be
fine for libxl to just call the executable directly.

> > > 2. For block devices that are opened by path, the toolstack needs to
> > >    ensure that the device it has opened is actually the device it
> > >    intended to open.  This requires device-specific verification of the
> > >    open file descriptor.  This is not needed for regular files, as the
> > >    LOOP_CONFIGURE ioctl is called on an existing loop device and sets
> > >    its backing file.
> > > 
> > > The first is fairly easy in C.  It can be accomplished by means of a
> > > XenStore watch on the “status” entry.  Once that watch fires, blkback
> > > has opened the device, so the toolstack can safely close its file
> > > descriptor.
> > 
> > Does the toolstack really need to close the device?  What harm does it
> > do to keep the handle open until the domain is destroyed?
> 
> This would cause no harm, but it also would not help either, so I do not
> see any advantages to doing it.

Well, seems more complex because you need more synchronization between
blkback and the toolstack in order to detect when blkback has opened
the device.  If this is not strictly required I would rather avoid it:
more complexity just leads to more errors.

> > What about disk hotplug?  Which entity will keep the device opened in
> > that case?  Is xl block-attach going to block until the device
> > switches to the connected state?
> 
> Whichever program opens the file will need to do this.  

This is not trivial to implement with xl, as `xl block-attach` is a
short-lived command that just populates the xenstore entries for the
to be attached device, runs the hotplug script and exits after that.
I'm not sure we would want to change `xl block-attach` behavior to
wait until the backend has opened the device.

> This could be
> the program that is using libxl or the block script that libxl invokes.
> I am not familiar with xl block-attach as Qubes OS uses a custom wrapper
> around libvirt.
> 
> > > The second is significantly more difficult.  It requires the block
> > > script to be aware of at least device-mapper devices and LVM2 logical
> > > volumes.  The general technique is common to all block devices: obtain
> > > the sequence number (via the BLKGETDISKSEQ() ioctl) and its major and
> > > minor numbers (via fstat()).  Then open /sys/dev/block/MAJOR:MINOR to
> > > get a directory file descriptor, and use openat(2) and read(2) to get
> > > various sysfs attributes.  Finally, read the diskseq sysfs attribute and
> > > check that it matches the sequence number from BLKGETDISKSEQ().
> > > Alternatively, one can use device-specific methods, such as
> > > device-mapper ioctls.
> > > 
> > > Device-mapper devices can be detected via the ‘dm/name’ sysfs attribute,
> > > which must match the name under ‘/dev/mapper/’.  If the name is of the
> > > form ‘/dev/X/Y’, and the ‘dm/uuid’ attribute starts with the literal
> > > string “LVM-”, then the expected ‘dm/name’ attribute should be found by
> > > doubling all ‘-’ characters in X and Y, and then joining X and Y with
> > > another ‘-’.  This accounts for LVM2 logical volumes.  Alternatively,
> > > one can use device-mapper ioctls to both check if a device is a
> > > device-mapper device, and to obtain its name and UUID.  I plan on going
> > > with the latter route.
> > 
> > Likely a stupid remark, but needs obviously needs to be kept to Linux
> > only.
> 
> Indeed so.  I have CC’d Mariusz Zaborski to check if FreeBSD needs any
> similar changes.

So you know Mariusz, small world I guess, I've been to quite some BSD
conferences with him.  Hope you are doing fine Mariusz, long time no
see due to the covid mess.

FreeBSD blkback doesn't use loop devices because the kernel has an
interface to read files (so blkback can open raw files directly), so
I think this is all unneeded.

My comment was mostly iff this is implemented it needs to be contained
to Linux specific files (ie: libxl_linux.c).

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.