
Re: [Xen-devel] [BUG 1747]Guest could't find bootable device with memory more than 3600M



On 14/06/13 15:14, George Dunlap wrote:
On 14/06/13 12:34, Ian Campbell wrote:
On Fri, 2013-06-14 at 11:53 +0100, George Dunlap wrote:
On Thu, Jun 13, 2013 at 6:22 PM, Ian Campbell <Ian.Campbell@xxxxxxxxxx> wrote:
On Thu, 2013-06-13 at 17:55 +0100, Stefano Stabellini wrote:

We could have a xenstore flag somewhere that enables the old behaviour, so that people can revert to qemu-xen-traditional and make the PCI hole below 4G even bigger than 448MB; but I think that keeping the old behaviour around is going to make the code more difficult to maintain.
The downside of that is that things which worked with the old scheme may not work with the new one, though. Early in a release cycle, when we have time to discover what has broken, that might be OK, but is post-rc4
really the time to be risking it?
Yes, you are right: there are some scenarios that would have worked
before that wouldn't work anymore with the new scheme.
Are they important enough to warrant a workaround, given that they are
pretty difficult for a user to identify?
That question would be reasonable early in the development cycle. At rc4 the question should be: do we think this problem is so critical that we
want to risk breaking something else which currently works for people?

Remember that we are invalidating whatever passthrough testing people
have already done up to this point of the release.

It is also worth noting that the things which this change ends up
breaking may for all we know be equally difficult for a user to identify
(they are after all approximately the same class of issue).

The problem here is that the risk is difficult to evaluate: we just
don't know what will break with this change, and therefore we don't know whether the cure is worse than the disease. The conservative approach at this
point in the release would be to change nothing, or to change the
minimal possible number of things (which would preclude changes which
impact qemu-trad, IMHO).


WRT pretty difficult to identify -- the root of this thread suggests the
guest entered a reboot loop with "No bootable device"; that sounds
eminently release-notable to me. I also note that it was changing the
size of the PCI hole which caused the issue -- which does somewhat
underscore the risks involved in this sort of change.
But that bug was introduced by the first attempt to fix the root problem.
The root problem shows up as qemu crashing at some point because it
tried to access invalid guest gpfn space; see
http://lists.xen.org/archives/html/xen-devel/2013-03/msg00559.html.

Stefano tried to fix it with the above patch, just changing the hole
to start at 0xe0000000; but that was incomplete, as it didn't match
hvmloader's and seabios's view of the world.  That's what this bug
report is about.  This thread is an attempt to find a better fix.

So the root problem is that if we revert this patch, and someone
passes through a pci device using qemu-xen (the default) and the MMIO
hole is resized, at some point in the future qemu will randomly die.
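To make that concrete, here is a toy model of the failure mode (not Xen code; the 0xf0000000 default, the helper names, and the relocation arithmetic are assumptions based on this thread, using ~3600MiB of RAM as in the bug title):

#include <stdint.h>
#include <stdio.h>

/* qemu-xen's fixed idea of where the PCI hole starts (an assumption). */
#define QEMU_ASSUMED_HOLE_START 0xf0000000ULL

int main(void)
{
    uint64_t guest_ram_end  = 3600ULL << 20;   /* ~3600MiB, as in the bug */
    uint64_t new_hole_start = 0xe0000000ULL;   /* hvmloader's resized hole */

    /* RAM between the new hole start and the old one has to be
     * relocated above 4GiB by hvmloader.  qemu-xen still believes that
     * memory lives below QEMU_ASSUMED_HOLE_START, so it eventually
     * touches a gpfn it has no mapping for and dies. */
    if (guest_ram_end > new_hole_start)
        printf("RAM 0x%llx-0x%llx relocated; qemu-xen's map is now stale\n",
               (unsigned long long)new_hole_start,
               (unsigned long long)guest_ram_end);
    else
        printf("no RAM inside the resized hole; nothing to relocate\n");
    return 0;
}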
Right, I see, thanks for explaining.

If it's a choice between users experiencing, "My VM randomly crashes"
and experiencing, "I tried to pass through this device but the guest
OS doesn't see it", I'd rather choose the latter.
All other things being equal, obviously we all would. But the point I've
been trying to make is that we don't know the other consequences of
making that fix -- e.g. on existing working configurations. So the
choice is between "some VMs randomly crash, but other stuff works fine and we
have had a reasonable amount of user testing" and "those particular VMs
don't crash any more, but we don't know what other stuff no longer works,
and the existing test base has been at least partially invalidated".

I think that post-rc4 in a release we ought to be pretty
conservative about the risks of this sort of change, especially wrt
invalidating testing and the unknowns involved.

Aren't the configurations which might trip over this issue going to
be in the minority compared to those which we risk breaking?

So there are the technical proposals we've been discussing, each of which has different risks.

1. Set the default MMIO hole size to 0xe0000000.
2. If possible, relocate PCI devices that don't fit in the hole to the
   64-bit hole.
   - Here "if possible" will mean (a) the device has a 64-bit BAR, and
     (b) this hasn't been disabled by libxl (probably via a xenstore key).
3. If possible, resize the MMIO hole; otherwise refuse to map the device.
   - Currently "if possible" is always true; the new thing here would be
     making it possible for libxl to disable this, probably via a
     xenstore key.
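To make #3 concrete, here is a minimal sketch of how hvmloader's BAR placement might gate hole resizing on a libxl-controlled switch. The xenstore key name and both helpers are illustrative assumptions, not the actual hvmloader interface:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for hvmloader's xenstore accessor. */
extern const char *xenstore_read(const char *path);

/* Resizing is "possible" unless libxl has written '0' to the (assumed)
 * key; a missing key keeps today's behaviour, where resizing is always
 * allowed. */
static bool hole_resize_allowed(void)
{
    const char *v = xenstore_read("platform/allow-mmio-hole-resize");
    return v == NULL || v[0] != '0';
}

/* Place a naturally-aligned BAR (bar_size a power of two) at the top of
 * the free space, growing the hole downward only when permitted. */
static bool place_bar(uint64_t bar_size, uint64_t *hole_start,
                      uint64_t *next_free)
{
    uint64_t base = (*next_free - bar_size) & ~(bar_size - 1);

    if (base >= *hole_start) {
        *next_free = base;        /* fits in the current hole */
        return true;
    }
    if (hole_resize_allowed()) {
        *hole_start = base;       /* resize: lower the hole start */
        *next_free = base;
        return true;
    }
    return false;                 /* refuse to map this device */
}

The point of this shape is that the "refuse" branch is the only new behaviour; configurations that never hit it are untouched.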

Each of these will have different risks for qemu-traditional and qemu-xen.

Implementing #3 would have no risk for qemu-traditional, because we won't be changing the way anything works; what works will still work, what is broken (if anything) will still be broken.

Implementing #3 for qemu-xen only trades one kind of failure for another. If you resize the MMIO hole for qemu-xen, then you *will* eventually crash. So this will not break existing working configurations -- it will only change the failure from "qemu crashes at some point" to "the guest OS cannot see the device". This is a uniform improvement.

I suppose this is not strictly true. If you resize the MMIO hole *such that it overlaps what was originally guest memory*, then it will crash. If you have a smaller guest with, say, only 1 or 2GiB of RAM, then you can probably resize the MMIO hole arbitrarily on qemu-xen with no ill effects. So as stated ("never resize the MMIO hole"), this would turn some successes into "guest can't see the device" failures.

(Stefano, correct me if I'm wrong here.)

hvmloader should know whether this is the case, however, because if there is memory there it has to relocate it. So we should change "if possible" to mean "we don't need to relocate memory, or relocating memory has been enabled by libxl".
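That refined test is small enough to sketch directly (parameter names are illustrative, not actual hvmloader code):

#include <stdbool.h>
#include <stdint.h>

/* Resizing the hole down to new_hole_start is "possible" iff no RAM
 * below 4G would need relocating, or libxl has explicitly opted in. */
static bool hole_resize_possible(uint64_t new_hole_start,
                                 uint64_t low_ram_end,
                                 bool relocate_enabled_by_libxl)
{
    bool needs_relocation = low_ram_end > new_hole_start;
    return !needs_relocation || relocate_enabled_by_libxl;
}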

 -George
