
Re: [Xen-devel] Migration memory corruption - PV backends need to quiesce



On 30/06/14 10:52, Ian Campbell wrote:
> On Mon, 2014-06-30 at 10:46 +0100, Andrew Cooper wrote:
>> On 30/06/14 10:21, Ian Campbell wrote:
>>> On Mon, 2014-06-30 at 10:02 +0100, Andrew Cooper wrote:
>>>> On 30/06/14 09:38, Ian Campbell wrote:
>>>>> On Fri, 2014-06-27 at 17:51 +0100, Andrew Cooper wrote:
>>>>>> Hello,
>>>>>>
>>>>>> After a long time fixing my own memory corruption bugs with migration
>>>>>> v2, I have finally tracked down (what I really really hope is) the last
>>>>>> of the corruption.
>>>>>>
>>>>>>
>>>>>> There appears to be a systematic problem affecting all PV drivers,
>>>>>> whereby a non-quiescent backend can cause memory corruption in the VM.
>>>>>>
>>>>>> Active grant mapped pages are only reflected in the dirty bitmap after
>>>>>> the grant has been unmapped, as mapping the ring read-only would be
>>>>>> catastrophic to performance, and remapping as read-only when logdirty is
>>>>>> enabled is (as far as I understand) impossible, as Xen doesn't track the
>>>>>> PTEs pointing at granted frames.
>>>>>>
>>>>>> PV backend drivers hold their mappings of the rings (and persistently
>>>>>> granted frames) open until the domain is destroyed, which is after the
>>>>>> memory image has been sent.  Therefore, any requests which are processed
>>>>>> after the migration code has sent the ring frame on its first pass will
>>>>>> not be reflected in the resumed domain, as this frame will never be
>>>>>> marked as dirty in Xen.
>>>>>>
>>>>>> Furthermore, as the migration code uses memcpy() on the frames, it is
>>>>>> possible that a backend update intersects with the copy, and a corrupt
>>>>>> descriptor appears on the resumed side.
>>>>>>
>>>>>> In addition, after the domain has been paused, the backend might still
>>>>>> process requests.  The migration code expects the guest to be completely
>>>>>> quiesced after it has been suspended, so will only check the dirty
>>>>>> bitmap once.  Any requests which get processed and completed might still
>>>>>> be missed by the migration code.
>>>>>>
>>>>>> From a heavily instrumented Xen and migration code, I am fairly sure I
>>>>>> have confirmed that all pages corrupted on migration are a result of
>>>>>> still-active grant maps, grant copies which complete after domain
>>>>>> suspend, or the xenstore ring, which xenstored has a magic mapping of
>>>>>> and which will never be reflected in the dirty bitmap.
>>>>>>
>>>>>>
>>>>>> Overall, it would appear that there needs to be a hook for all PV
>>>>>> drivers to force quiescence.  In particular, a backend must guarantee to
>>>>>> unmap all active grant maps (so the frames get properly reflected in the
>>>>>> dirty bitmap), and never process subsequent requests (so no new frames
>>>>>> appear dirty in the bitmap after the guest has been paused).
>>>>>>
>>>>>> Thoughts/comments?
>>>>> I thought PV drivers were already (supposed to be) handling this in the
>>>>> frontend.
>>>>>
>>>>> For reasons of checkpoint performance I think Linux's net and blkfront
>>>>> are handling this on resume rather than on suspend by tearing down on
>>>>> resume and then requeueing any outstanding I/O after they reattach to
>>>>> the new backend. In the blkfront case this is explicit, whereas IIRC
>>>>> netfront just discards any active requests and relies on L3+
>>>>> retransmission to get the job done. (see netfront_resume and
>>>>> blkfront_resume/blkif_recover).
>>>>>
>>>>> Part of the tear down and reconnect should involve invalidating any
>>>>> inflight descriptors, whether or not they were partially completed or
>>>>> have corrupted replies in them etc. This ought to be happening before
>>>>> the new backend sees the ring at all.
>>>>>
>>>>> Can you give an example of an instance of the corruption which you've
>>>>> seen?
>>>> The intersection of requests processed by the backends and memcpy() in
>>>> the migration code means that the migration code might send a frame
>>>> which is actively being written by the backend at the same time.
>>> The I/O will either complete before or after the migration.
>>>
>>> If it is before then there is no problem, I think
>> IO completed before vm suspend is fine.
>>
>> However, there is a window of time between the vm suspend (after which
>> the frontend is quiesced) and the migration completing during which IO
>> is still being processed by the backend.
>>
>> After the vm suspend, the migration code does not expect the contents of
>> VM memory to be changing, and it only consults the dirty bitmap once more.
>>
>> Grants which are active until after this time (mappings of rings,
>> persistent areas) are never identified as dirty, which means that the
>> contents being sent will be whatever contents were present during the
>> first iteration of migration.
> Correct. And then that random contents will be fixed by the replay of
> the I/O on the resume side.
>
>> Grant copy operations do get correctly reflected in the dirty bitmap,
>> but any grant copies which happen after the final consultation of the
>> dirty bitmap will still be ignored.
> That's fine since there is no way to signal that the I/O has completed
> to the frontend, since it is no longer running. As long as the I/O is
> incomplete then the frontend will replay it.
>
>>> , the associated grant
>>> unmap will mark the page dirty (unless e.g. persistent grants or
>>> something has broken one of the assumptions here?)
>> All testing was done without persistent grant support, but persistent
>> grants will probably make the problem much more obvious, given a much
>> larger range of guest memory which will be out-of-date.
>>
>>>  before the reply hits
>>> the ring (i.e. before the guest can be aware of it).
>>>
>>> If the I/O completes after the migration then it won't actually complete
>>> (at least not anywhere which matters). Instead it will be reissued by
>>> the frontend and the new attempt will overwrite whatever partial data
>>> was there during the migration with the correct data.
>> Correct, which is why this problem goes mostly unnoticed.  So long as
>> the backend writing a frame does not intersect with the migration code
>> copying that frame (which would result in a corrupt frame being sent),
>> the replay of the ring will completely hide the issue.
> It's not "mostly unnoticed", it is working as designed.
>
> I think you need to consider the content of a frame which is under I/O
> at the time of migration to be "undefined" rather than looking for a
> hard "correct" vs. "corrupt" distinction.
>
> Ian.
>

There is no guarantee of the direction in which the ring frame gets
memcpy()'d.  Even with strict ordering of changes to the ring and to the
pointers, you can still end up in a situation where the producer index
has been incremented but the underlying descriptor has not been written
(as far as the frontend can see).
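
To make the race concrete, here is a simplified sketch (plain C; the
struct is illustrative and not the real xen/include/public/io/ring.h
layout, and save_frame() merely stands in for the migration code's copy).
The backend orders its own stores correctly, but memcpy() has no ordering
relationship with them, so the snapshot can contain the incremented
producer index together with the stale descriptor bytes:

    #include <stdint.h>
    #include <string.h>

    struct fake_ring {
        uint32_t rsp_prod;                    /* producer index, bumped last */
        struct { uint64_t id; int16_t status; } ring[32];
    };

    /* Backend: write the descriptor, barrier, then publish the index. */
    static void backend_push_response(struct fake_ring *r,
                                      uint64_t id, int16_t status)
    {
        unsigned int idx = r->rsp_prod & 31;

        r->ring[idx].id = id;
        r->ring[idx].status = status;
        __sync_synchronize();                 /* wmb(): descriptor before index */
        r->rsp_prod++;
    }

    /*
     * Migration: memcpy() runs with no synchronisation against the
     * backend's stores and may traverse the page in either direction,
     * so it can observe the new rsp_prod but the old descriptor bytes.
     */
    static void save_frame(void *dst, const struct fake_ring *live)
    {
        memcpy(dst, live, sizeof(*live));     /* racy by construction */
    }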

Fundamentally, it doesn't matter if the ring is out of date when it is sent
(as replaying the ring will fix this), but it *must* be consistent,
which means there needs to be a guarantee that the backend does not
write into the ring frame at the same time that it is being sent by the
migration code.

This necessitates quiescing the backend, or a guarantee that, after the
vm suspend, it will not update the ring page further even if outstanding
IO subsequently completes.  This allows a consistent ring page to be sent
in the stream.
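
As a purely illustrative sketch of what such a hook could look like (the
names pv_backend_ops and .quiesce are invented for this mail and are not
an existing xenbus or libxl interface), the contract it would need to
provide is roughly:

    /*
     * Hypothetical per-backend hook, invoked by the toolstack after the
     * guest has been suspended and before the final pass over the dirty
     * bitmap.  On return the backend guarantees that:
     *   - all of its grant maps (rings, persistently granted frames)
     *     have been unmapped, so those frames now appear in the dirty
     *     bitmap, and
     *   - no further requests will be processed and the ring page will
     *     not be written again until resume/abort.
     */
    struct pv_backend_ops {
        int  (*quiesce)(void *backend_priv);
        void (*resume)(void *backend_priv);   /* re-attach on resume/abort */
    };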

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

