This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


[Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.

On Wed, 2010-11-17 at 16:02 -0500, Jeremy Fitzhardinge wrote:
> On 11/17/2010 12:21 PM, Daniel Stodden wrote:
> > And, like all granted frames, not owning them implies they are not
> > resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup()
> > without the VM_FOREIGN hack.
> Hm, I see.  Well, I wonder if using _PAGE_SPECIAL would help (it is put
> on usermode ptes which don't have a backing struct page).  After all,
> there's no fundamental reason why it would need a pfn; the mfn in the
> pte is what's actually needed to ultimately generate a DMA descriptor.

The kernel needs the page structs at least for locking and refcounting.

There's also some trickier stuff in there. Like redirtying disk-backed
user memory after read completion, in case it's been laundered. (So that
an AIO on unpinned user memory doesn't subsequently get flashed back
when cycling through swap, if I understood that thing correctly.)

Doesn't apply for blktap (it's all reserved pages). All I mean is: I
wouldn't exactly see some innocent little dio hack or so shape up in

Kernel allowing to DMA into a bare pfnmap -- From the platform POV, I'd
agree. E.g. there's a concept of devices DMA-ing into arbitrary I/O
memory space, not host memory, on some bus architectures. PCI would come
to my mind (the old shared medium stuff, unsure about those newfangled
P-t-P topologies). But not in Linux, so I presently don't see anybody
upstream bothering to make block-I/O request addressing more forgiving
than it is.

PAGE_SPECIAL -- to the kernel, that means the opposite: page structs
which aren't backed by 'real' memory, so gup(), for example, is told to
fail (how nasty). In contrast, VM_FOREIGN is non-memory backed by page

> > Correct me if I'm mistaken. I used to be quicker looking up stuff on
> > arch-xen kernels, but I think fundamental constants of the Xen universe
> > didn't change since last time.
> No, but Linux has.

Not in that respect.

There's certainly a way to get VM_FOREIGN out of the mainline code. It
would involve an unlikely() branch in .pte_val(=xen_pte_val) to fall
back into a private local m2p hash lookup. Assuming that kind of thing
gets nowhere inlined. Not nice, but still more upstreamable than
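
For illustration only, a toy userspace model of that fallback: a global m2p table that resolves frames the domain owns, and a rarely taken branch into a private hash for granted frames. Every name here (m2p_sketch_init, foreign_map_insert, mfn_to_pfn_sketch, the table sizes) is hypothetical and not taken from any Xen or Linux tree; it only shows the shape of the lookup, not how xen_pte_val is actually written.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)
#endif

#define INVALID_PFN   UINT64_MAX
#define M2P_ENTRIES   16   /* toy size of the global m2p table */
#define FOREIGN_SLOTS 8    /* private hash size, power of two */

/* Hypothetical private map for frames we do not own (granted frames). */
struct foreign_entry { uint64_t mfn; uint64_t pfn; int used; };

static uint64_t m2p[M2P_ENTRIES];
static struct foreign_entry foreign_map[FOREIGN_SLOTS];

void m2p_sketch_init(void)
{
    size_t i;
    for (i = 0; i < M2P_ENTRIES; i++)
        m2p[i] = INVALID_PFN;
    memset(foreign_map, 0, sizeof(foreign_map));
}

/* Record a frame the domain owns in the global table. */
void m2p_set_local(uint64_t mfn, uint64_t pfn)
{
    if (mfn < M2P_ENTRIES)
        m2p[mfn] = pfn;
}

/* Insert a foreign mfn -> local pfn mapping; linear probing, -1 if full. */
int foreign_map_insert(uint64_t mfn, uint64_t pfn)
{
    size_t i, slot = (size_t)(mfn & (FOREIGN_SLOTS - 1));
    for (i = 0; i < FOREIGN_SLOTS; i++) {
        struct foreign_entry *e = &foreign_map[(slot + i) & (FOREIGN_SLOTS - 1)];
        if (!e->used || e->mfn == mfn) {
            e->mfn = mfn; e->pfn = pfn; e->used = 1;
            return 0;
        }
    }
    return -1;
}

uint64_t foreign_map_lookup(uint64_t mfn)
{
    size_t i, slot = (size_t)(mfn & (FOREIGN_SLOTS - 1));
    for (i = 0; i < FOREIGN_SLOTS; i++) {
        const struct foreign_entry *e = &foreign_map[(slot + i) & (FOREIGN_SLOTS - 1)];
        if (!e->used)
            break;              /* empty slot ends the probe chain */
        if (e->mfn == mfn)
            return e->pfn;
    }
    return INVALID_PFN;
}

/* Fast path: global m2p. Slow path: private hash, expected to be rare. */
uint64_t mfn_to_pfn_sketch(uint64_t mfn)
{
    uint64_t pfn = (mfn < M2P_ENTRIES) ? m2p[mfn] : INVALID_PFN;
    if (unlikely(pfn == INVALID_PFN))
        pfn = foreign_map_lookup(mfn);
    return pfn;
}
```

The point of the sketch is only that owned frames never touch the hash; the cost lands on the granted ones.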

> > [
> > Part of the reason why blktap *never* frees those pages, apart from
> > being slightly greedy, are deadlock hazards when writing those nodes in
> > dom0 through the pagecache, as dom0 might. You need memory pools on the
> > datapath to guarantee progress under pressure. That got pretty ugly
> > after 2.6.27, btw.
> > ]
> That's what mempools are intended to solve.

That's why the blktap frame pool is now a mempool, indeed.
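
As a rough userspace sketch of why a pool guarantees progress (this is not the blktap code; frame_pool and all its functions are made-up names, and a real kernel mempool would sleep rather than return NULL when the reserve is empty): allocation tries the normal allocator first and falls back to a preallocated reserve, and frees refill the reserve before giving memory back.

```c
#include <stddef.h>
#include <stdlib.h>

#define FRAME_SIZE 4096

/* Backing allocator, swappable so that memory pressure can be simulated. */
typedef void *(*frame_alloc_fn)(size_t size);

/* Hypothetical pool: a fixed reserve of preallocated frames. */
struct frame_pool {
    void         **reserve;  /* stack of preallocated frames */
    size_t         min_nr;   /* capacity of the reserve */
    size_t         curr_nr;  /* frames currently held in the reserve */
    frame_alloc_fn alloc;    /* tried first on every allocation */
};

/* Stand-in for an allocator that always fails, i.e. extreme pressure. */
void *frame_alloc_fail(size_t size)
{
    (void)size;
    return NULL;
}

void frame_pool_destroy(struct frame_pool *p)
{
    while (p->curr_nr > 0)
        free(p->reserve[--p->curr_nr]);
    free(p->reserve);
    p->reserve = NULL;
    p->min_nr = 0;
}

/* Preallocate min_nr frames up front (min_nr > 0 assumed); 0 on success. */
int frame_pool_init(struct frame_pool *p, size_t min_nr, frame_alloc_fn alloc)
{
    p->reserve = calloc(min_nr, sizeof(*p->reserve));
    if (!p->reserve)
        return -1;
    p->min_nr = min_nr;
    p->curr_nr = 0;
    p->alloc = alloc;
    while (p->curr_nr < min_nr) {
        void *f = malloc(FRAME_SIZE);
        if (!f) {
            frame_pool_destroy(p);
            return -1;
        }
        p->reserve[p->curr_nr++] = f;
    }
    return 0;
}

/* Normal allocator first; dip into the reserve only when it fails. */
void *frame_pool_alloc(struct frame_pool *p)
{
    void *f = p->alloc(FRAME_SIZE);
    if (f)
        return f;
    if (p->curr_nr > 0)
        return p->reserve[--p->curr_nr];
    return NULL;  /* a kernel mempool would wait for a free here */
}

/* Refill the reserve before returning memory to the system. */
void frame_pool_free(struct frame_pool *p, void *f)
{
    if (p->curr_nr < p->min_nr)
        p->reserve[p->curr_nr++] = f;
    else
        free(f);
}
```

With frame_alloc_fail plugged in, a pool of two frames still serves two requests, and a third succeeds as soon as one frame is freed back.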

> > In any case, let's skip trying what happens if a thundering herd of
> > several hundred userspace disks tries gfp()ing their grant slots out of
> > dom0 without arbitration.
> I'm not against arbitration, but I don't think that's something that
> should be implemented as part of a Xen driver.

Uhm, maybe I'm misunderstanding you, but isn't the whole thing a Xen driver?
What do you suggest?

> >>> I guess we've been meaning the same thing here, unless I'm
> >>> misunderstanding you. Any pfn does, and the balloon pagevec allocations
> >>> default to order 0 entries indeed. Sorry, you're right, that's not a
> >>> 'range'. With a pending re-xmit, the backend can find a couple (or all)
> >>> of the request frames have count>1. It can flip and abandon those as
> >>> normal memory. But it will need those lost memory slots back, straight
> >>> away or next time it's running out of frames. As order-0 allocations.
> >> Right.  GFP_KERNEL order 0 allocations are pretty reliable; they only
> >> fail if the system is under extreme memory pressure.  And it has the
> >> nice property that if those allocations block or fail it rate limits IO
> >> ingress from domains rather than being crushed by memory pressure at the
> >> backend (ie, the problem with trying to allocate memory in the writeout
> >> path).
> >>
> >> Also the cgroup mechanism looks like an extremely powerful way to
> >> control the allocations for a process or group of processes to stop them
> >> from dominating the whole machine.
> > Ah. In case it can be put to work to bind processes allocating pagecache
> > entries for dirtying to some boundary, I'd be really interested. I think
> > I came across it once but didn't take the time to read the docs
> > thoroughly. Can it?
> I'm not sure about dirtyness - it seems like something that should be
> within its remit, even if it doesn't currently have it.
> The cgroup mechanism is extremely powerful, now that I look at it.  You
> can do everything from setting block IO priorities and QoS parameters to
> CPU limits.

Thanks. I'll keep it under my pillow then.

