On Wed, 2010-11-17 at 16:02 -0500, Jeremy Fitzhardinge wrote:
> On 11/17/2010 12:21 PM, Daniel Stodden wrote:
> > And, like all granted frames, not owning them implies they are not
> > resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup()
> > without the VM_FOREIGN hack.
> Hm, I see. Well, I wonder if using _PAGE_SPECIAL would help (it is put
> on usermode ptes which don't have a backing struct page). After all,
> there's no fundamental reason why it would need a pfn; the mfn in the
> pte is what's actually needed to ultimately generate a DMA descriptor.
The kernel needs the page structs at least for locking and refcounting.
There's also some trickier stuff in there, like redirtying disk-backed
user memory after read completion, in case it's been laundered. (So that
data read by an AIO into unpinned user memory doesn't subsequently get
lost when the page cycles through swap, if I understood that thing correctly.)
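The read-completion side looks roughly like this (just an illustration
of the pattern, names included; not the literal fs/direct-io.c code):

#include <linux/mm.h>

/*
 * Sketch of the redirty-on-completion pattern for a direct-I/O read
 * into unpinned user memory: once the device has filled the pages,
 * mark them dirty again so the VM doesn't treat them as clean and
 * throw the fresh data away on the next reclaim/swap cycle.
 */
static void dio_read_complete_sketch(struct page **pages, int nr_pages)
{
	int i;

	for (i = 0; i < nr_pages; i++) {
		set_page_dirty_lock(pages[i]);	/* redirty after DMA read */
		put_page(pages[i]);		/* drop the gup() reference */
	}
}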
That redirtying business doesn't apply to blktap (it's all reserved
pages). All I mean is that I don't see some innocent little dio hack
shaping up in the kernel that would allow DMA into a bare pfnmap. From
the platform POV I'd agree it's conceivable: some bus architectures have
the notion of devices DMA-ing into arbitrary I/O memory space rather than
host memory. PCI comes to mind (the old shared-medium stuff; unsure about
those newfangled point-to-point topologies). But not in Linux, so I
presently don't see anybody upstream bothering to make block-I/O request
addressing more forgiving than it is.
_PAGE_SPECIAL, to the kernel, means the opposite: a pfn with no page
struct behind it at all, so gup(), for example, is told to fail (how
nasty). In contrast, VM_FOREIGN is non-memory that does have page structs
behind it.
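The gup() failure for special ptes boils down to the page-table walk
refusing to hand back a struct page; a simplified paraphrase of the
mm/memory.c check, not a verbatim copy:

#include <linux/mm.h>

static struct page *normal_page_sketch(pte_t pte)
{
	if (pte_special(pte))
		return NULL;			/* no struct page -> gup() fails */

	return pfn_to_page(pte_pfn(pte));	/* ordinary memory: struct page exists */
}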
> > Correct me if I'm mistaken. I used to be quicker looking up stuff on
> > arch-xen kernels, but I think fundamental constants of the Xen universe
> > didn't change since last time.
> No, but Linux has.
Not in that respect.
There's certainly a way to get VM_FOREIGN out of the mainline code. It
would involve an unlikely() branch in .pte_val (i.e. xen_pte_val) falling
back to a private local m2p hash lookup, assuming that kind of thing
doesn't get inlined anywhere. Not nice, but still more upstreamable than
the VM_FOREIGN hack.
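The shape of it would be roughly this (a sketch of the idea only;
blktap_m2p_lookup() is a made-up private helper, and the foreign-frame
test is simplified):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <asm/xen/page.h>

/* Hypothetical driver-private hash: foreign mfn -> local pfn. */
extern unsigned long blktap_m2p_lookup(unsigned long mfn);

static pteval_t pte_val_sketch(pte_t pte)
{
	unsigned long mfn = (pte.pte & PTE_PFN_MASK) >> PAGE_SHIFT;
	unsigned long pfn = mfn_to_pfn(mfn);

	if (unlikely(!pfn_valid(pfn)))		/* simplified: a granted, foreign frame */
		pfn = blktap_m2p_lookup(mfn);	/* fall back to the private hash */

	return (pte.pte & ~PTE_PFN_MASK) | ((pteval_t)pfn << PAGE_SHIFT);
}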
> > [
> > Part of the reason why blktap *never* frees those pages, apart from
> > being slightly greedy, are deadlock hazards when writing those nodes in
> > dom0 through the pagecache, as dom0 might. You need memory pools on the
> > datapath to guarantee progress under pressure. That got pretty ugly
> > after 2.6.27, btw.
> > ]
> That's what mempools are intended to solve.
That's why the blktap frame pool is now a mempool, indeed.
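In its barest form that's just the stock page-pool helpers; something
like the following (generic API usage, not the actual blktap code):

#include <linux/mempool.h>
#include <linux/mm.h>

#define FRAME_POOL_MIN	64	/* illustrative reserve size */

static mempool_t *frame_pool;

static int frame_pool_init(void)
{
	/* Pre-reserve order-0 pages so the datapath can always make progress. */
	frame_pool = mempool_create_page_pool(FRAME_POOL_MIN, 0);
	return frame_pool ? 0 : -ENOMEM;
}

static struct page *frame_get(void)
{
	/* Dips into the reserve when alloc_pages() can't deliver under pressure. */
	return mempool_alloc(frame_pool, GFP_KERNEL);
}

static void frame_put(struct page *page)
{
	mempool_free(page, frame_pool);
}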
> > In any case, let's skip trying what happens if a thundering herd of
> > several hundred userspace disks tries gfp()ing their grant slots out of
> > dom0 without arbitration.
> I'm not against arbitration, but I don't think that's something that
> should be implemented as part of a Xen driver.
Uhm, maybe I'm misunderstanding you, isn't the whole thing a Xen driver?
What do you suggest?
> >>> I guess we've been meaning the same thing here, unless I'm
> >>> misunderstanding you. Any pfn does, and the balloon pagevec allocations
> >>> default to order 0 entries indeed. Sorry, you're right, that's not a
> >>> 'range'. With a pending re-xmit, the backend can find a couple (or all)
> >>> of the request frames have count>1. It can flip and abandon those as
> >>> normal memory. But it will need those lost memory slots back, straight
> >>> away or next time it's running out of frames. As order-0 allocations.
> >> Right. GFP_KERNEL order 0 allocations are pretty reliable; they only
> >> fail if the system is under extreme memory pressure. And it has the
> >> nice property that if those allocations block or fail it rate limits IO
> >> ingress from domains rather than being crushed by memory pressure at the
> >> backend (ie, the problem with trying to allocate memory in the writeout
> >> path).
> >> Also the cgroup mechanism looks like an extremely powerful way to
> >> control the allocations for a process or group of processes to stop them
> >> from dominating the whole machine.
> > Ah. In case it can be put to work to bind processes allocating pagecache
> > entries for dirtying to some boundary, I'd be really interested. I think
> > I came across it once but didn't take the time to read the docs
> > thoroughly. Can it?
> I'm not sure about dirtiness - it seems like something that should be
> within its remit, even if it doesn't currently have it.
> The cgroup mechanism is extremely powerful, now that I look at it. You
> can do everything from setting block IO priorities and QoS parameters to
> CPU limits.
Thanks. I'll keep it under my pillow then.