Xen project Mailing List

On Jan 10, 2018, at 11:39, Ian Jackson <ian.jackson@xxxxxxxxxxxxx> wrote:

Jan Beulich writes ("Re: Radical proposal v2: Publish Amazon's verison now, Citrix's version soon"):

There are a couple of instances of "a branch", and I'm not really clear on which one that would be, yet in part my opinion depends on that, as this will affect what state certain branches will be in for subsequent work. As I agree with the PVH shim being the better baseline for work going forward, in particular I wouldn't like to see the Vixen series becoming the base of any branch going to be maintained going forward.

Anthony Liguori writes ("Re: [Xen-devel] Radical proposal v2: Publish Amazon's verison now, Citrix's version soon"):

What I would suggest is the following:

1) Merge Vixen into staging

2) Backport Vixen into stable-4.10 and cut a release

We do not have time any longer (if we had time to start with) to reconcile these divergent views.

[ Disclaimer: non-technical, PM-oriented message ahead. Opinions expressed are those of this individual and not former/current employers or clients. ]

Having worked with Xen since 2005, helping to ship XenServer, XenClient and OpenXT, I would like to challenge assertions that the community of Xen (or $HW_vendor or $OS_vendor or $APP_vendor) developers and users must settle for a non-consensus long-term mitigation.

Across the computer industry, it is clear that a small subset of specialists has known about this issue for some time: developers who worked on candidate fixes ahead of the public announcement, experts who warned about microarchitecture risks years ago, and any adversaries who acted on their warnings. Some people had advance information and time to consider candidate solutions; most [1] of the world did not.

As a customer of $HW_vendor / Xen / $OS_vendor / $APP_vendor, the last thing I want to hear is that world-class specialists who have had weeks/months to evaluate candidate fixes have been unable to reach agreement and propose to delegate the decision TO CUSTOMERS (?!). Those would be customers with only days of exposure to the CVE details, who still have to keep their regular business running while trying to understand a complex security issue that eluded experts for decades.

As a general-purpose open-source hypervisor not tied to one operating system or use case, Xen has always been susceptible to fragmentation. It can seem easier to make private modifications vs. upstreaming/revising changes for acceptance to a public codebase that serves many stakeholders. The non-upstreamed Xen forks of Amazon EC2, Citrix XenClient and Bromium have made the Xen community less strong, by reducing public Xen contributions, including but not limited to security.

Yet… a once-in-decades (?) security issue has brought forth a public contribution from the private, parallel Xen universe that is Amazon EC2, accompanied by engineering resources to review and test a long-term solution that could be acceptable to the public Xen codebase. If merged, this contribution could reduce fragmentation, increase dev/test resources and expand the risk pool of Xen customers sharing a common, battle-tested mitigation. Reconciliation of private/public Xen universes is never easy, but in this unique instance it would make the Xen community stronger.

Notes:

1) PVH is widely acknowledged as the long-term future of Xen. The recent security issue makes it even more important that PVH be well designed and widely tested with resources from the broader Xen community. Strategic PVH improvements need not be rushed to solve a tactical security issue. There are a number of Xen and security companies who have not yet contributed to recent public Xen design discussions, including Bromium. If there is a lack of design consensus, we can call upon additional Xen contributors to help achieve it.

2) Security: large swathes of customers in many markets have neither the time nor the expertise to make complex security decisions which involve functional tradeoffs and non-obvious interactions among opaque operating systems, microcode and hardware. This category of customer wants to know that (a) they are no worse off than "most" other customers, (b) they are on a supported path that will lead to a widely deployed long-term solution, and (c) they have clear documentation on operational constraints during their journey from temporary fix to long-term solution.

3) Community: given the unprecedented nature of this Amazon code contribution, it is in the interest of the Xen community for Amazon EC2 to migrate to a solution that is used by upstream Xen customers. This requires an incremental, bisectable path between the already-deployed EC2 Vixen and the in-development PVH shim. The reason for the Xen community to support this approach, however difficult, is to un-fork Amazon's version of Xen, in exchange for expanding the pool of Xen deployments which share a common risk profile.

Henry Baker posted [2] to the Cryptography mailing list about optimistic concurrency control:

"… Speculation is an extremely common, and an entirely human, reaction to *latency*. If the latency of some operation is too long, we pretend that the most common case is occurring and try to fix things later if/when we find out that we've been wrong … Only in the 20th and 21st centuries have we had the luxury of speed-of-light communications and sub-second latencies, so that we can often replace *optimistic* concurrency control with *pessimistic* concurrency control. When an ancient Roman general took his army over the horizon, he might be out of contact for *months*, so pessimistic concurrency control simply wasn't an option. If he screwed up, the only option was to send *another general and another army* over the horizon to repair the damage that the first general and army had caused."

Some teams working with Xen appear to have developed independent (?) mitigations for Spectre/Meltdown, in advance of public disclosure.  Some of those mitigations have already shipped to commercial customers.  By definition, each parallel, private effort addressed a narrower set of requirements than those of the broader Xen community.  We have now moved on to real-time coordination, where early private assumptions can be revisited in public as we seek consensus on a unified, long-term solution and a timeline for addressing every individual PV feature regression.

Each privately developed mitigation may be useful to a subset of Xen users.  Each can be hosted in a short-lived branch, with documentation of tradeoffs and an index of all mitigation branches.  Organizations with the time and resources to evaluate these short-lived branches may adopt one that matches their constraints.  But the broader Xen community needs consensus on the path to a unified solution that can be merged to release trees *and widely deployed*.

However long it takes to realize broad consensus on a solution for release branches, consensus is what customers expect in a critical fix from a trusted open-source provider of general-purpose virtualization.  If customers were satisfied with less, they would be using a more narrowly focused hypervisor.  We can educate users that for many software stacks, Spectre/Meltdown is not a "patch and forget" security issue, but one that will have industry-wide operational and economic impact for months or years.

Rich

[1] https://techcrunch.com/2018/01/06/how-tier-2-cloud-vendors-banded-together-to-cope-with-spectre-and-meltdown/

[2] http://www.metzdowd.com/pipermail/cryptography/2018-January/033541.html

[Xen-devel] Consensus in Parallel Universe Responses to Spectre/Meltdown