WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
 
   
 

xen-api

Re: AW: AW: [Xen-API] SG_IO for iscsi targets in XCP

To: Uli Stärk <Uli.Staerk@xxxxxxxxxxxxxx>, "xen-api@xxxxxxxxxxxxxxxxxxx" <xen-api@xxxxxxxxxxxxxxxxxxx>
Subject: Re: AW: AW: [Xen-API] SG_IO for iscsi targets in XCP
From: George Shuklin <george.shuklin@xxxxxxxxx>
Date: Wed, 20 Jul 2011 01:42:24 +0400
Cc:
Delivery-date: Tue, 19 Jul 2011 14:42:12 -0700
In-reply-to: <BD4874944A68BE4C92E66740829DD31619E03C@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-api-request@lists.xensource.com?subject=help>
List-id: Discussion of API issues surrounding Xen <xen-api.lists.xensource.com>
List-post: <mailto:xen-api@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-api>, <mailto:xen-api-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-api>, <mailto:xen-api-request@lists.xensource.com?subject=unsubscribe>
References: <1311073090.32638.1624.camel@mabase> <81A73678E76EA642801C8F2E4823AD21BC2D12C591@xxxxxxxxxxxxxxxxxxxxxxxxx> <1311078822.32638.1722.camel@mabase> <BD4874944A68BE4C92E66740829DD31619DFAE@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <1311084564.32638.1851.camel@mabase> <BD4874944A68BE4C92E66740829DD31619E03C@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Sender: xen-api-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18) Gecko/20110626 Icedove/3.1.11
Well..

I don't quite follow the reasoning. Are you talking about online or offline split brain? As I said earlier, online split brain can be prevented by using the same network adapter for replication and serving: if the link is lost, there is no replication, no new write operations, and no 'inconsistent' read operations.

Offline split brain can be prevented by manual startup (the host boots without active DRBD and iSCSI services). If only one server has rebooted, clients are served by the second server. If both go down, you find the most recent node (manually, with help from the DRBD sync process) and bring it up after a resync (you are already down anyway, so a little more time will not make a drastic change).

The main reason I want primary/primary DRBD is the doubled number of reading devices, which should really reduce the load; I expect a very significant difference. One more small point: in primary/primary mode some XCP hosts go to one target and others to the second, so if one node fails, only half of the customers see the fairly long lag before switchover.
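DRBD itself can complement this manual procedure with explicit split-brain policies in a resource's net section. A minimal sketch; the resource name and the chosen policies are illustrative, not a recommendation from this thread:

```
resource r0 {
  net {
    # automatic split-brain resolution, keyed by how many nodes were primary:
    after-sb-0pri discard-zero-changes;  # no primaries: keep the node that has changes
    after-sb-1pri discard-secondary;     # one primary: drop the secondary's fork
    after-sb-2pri disconnect;            # two primaries: refuse to auto-resolve
  }
}
```

With after-sb-2pri set to disconnect, a dual-primary split brain is deliberately left for exactly the kind of manual recovery described above.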

On 19.07.2011 19:10, Uli Stärk wrote:
SAN-level replication is not good enough because of the giant RAID sets. There is so much
(random) workload on the disks that a re-sync won't exceed 100 MB/s; we usually get
about 50 MB/s if we don't want to affect the running applications. Our RAID sets
would take more than a week to synchronize/verify :( We need the ability to
replicate smaller sets of data, so we use DRBD to replicate data the way you suggested
for SANs.
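The week-long figure is easy to sanity-check. The 50 MB/s rate comes from the mail above, but the 40 TB array size below is an assumed illustration, not a number from this thread:

```shell
RATE_MB_S=50                    # throttled resync rate stated above
ARRAY_TB=40                     # assumed array size (illustrative only)
SECONDS_NEEDED=$(( ARRAY_TB * 1000 * 1000 / RATE_MB_S ))   # TB -> MB at 10^6 MB/TB
echo "full resync: $(( SECONDS_NEEDED / 86400 )) days"     # prints "full resync: 9 days"
```

At roughly 4.3 TB of resync per day, anything in the tens of terabytes does indeed take the better part of a week or more.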

In our experience, there are several service interruptions on redundant
WAN connections. You can't avoid this! Usually the interruptions are very short
(less than 5 minutes), but in a master-master setup each one would trigger a
failover and end in split brain. In that case you will lose data if you discard
the changes on one node, and losing data is usually the worst thing that can
happen. A merge is usually not feasible at reasonable cost (duplicate database
key entries, etc.). A short service interruption is within the SLA and we don't
lose data. If we can predict that an interruption will take more than a few
minutes, we fail over to the second site; usually that happens only when the
datacenter burns to the ground or a redundant server or networking component
fails, which is less than once a year ;)

IMHO a master-master setup can only be recommended if there is no WAN
between the nodes and you use it for higher performance than a single node
can offer. In all other cases, use it for backup, and a backup should be a
master-slave setup.


-----Original Message-----
From: George Shuklin [mailto:george.shuklin@xxxxxxxxx]
Sent: Tuesday, 19 July 2011 16:09
To: Uli Stärk
Subject: Re: AW: [Xen-API] SG_IO for iscsi targets in XCP

There are two types of split-brain: online and offline.

Offline split-brain:

Two primary/primary (p/p) nodes are online:
1. The first goes down; the second primary operates alone for some time.
2. The second goes down.
3. The first comes back up. [stage 1]
4. The second comes back up and finds that its data conflicts with the first's. [stage 2]

This situation is somewhat bad. In stage 2 we will have to discard all of one
node's data, and the problem actually starts at stage 1, when we 'go to the past' by
bringing up the older machine.

In this situation we can either go down again and replicate all data from the second to
the first (losing the 'time fork' created during the second period of StandAlone operation),
OR
simply replicate the second from the first and continue operating in the 'past fork',
rolling state back to the moment the first went down and forgetting everything the second wrote.

All these problems can be solved by manual disaster recovery: if one of the
servers goes down, it must be brought back manually when it returns. In a normal
datacenter, downtime is usually attended by staff anyway.

The second case is 'online' split-brain.

DRBD does require a link between the 'heads'. If this link goes down, both heads
start to think that the remote node is dead and continue to operate independently.
(If we say 'shut down when the remote disconnects', we kill any fault
tolerance in DRBD, and there is no reason to run p/p DRBD at all.)
In this case we meet a horrible, complete data loss: some data goes to
one node, some to the second, and if we use load balancing we can only shut the
storage down and say 'oops, sorry guys, no more data'.

Even a dedicated cord between the DRBD hosts does not save you from constant fear of
online split brain.
What if someone unplugs it?
Or it gets pulled while moving equipment (something heavy dropped on it)? What if
the network card or the cord dies?
What if someone says 'ethX down' by mistake on one of the servers?

None of these cases is 'sorry, we have 36 hours of downtime'; every one of them is
'everything is lost'.

And there is a simple and elegant solution to all these fears: use the SAN interface
for replication (the same interface for replication and for iSCSI serving).

If you have enough bandwidth (10G usually does), this solves everything:

if a link, cord, network card and so on goes down, the host simply stops serving
clients. No I/O, no new data, no data-corruption problems.


So I think a dual head is possible in the case of XCP; its specific architecture
allows it. (I hope so; I'll test and report later.)

On Tue, 19/07/2011 at 12:50 +0000, Uli Stärk wrote:
My 5 cents: in real-world applications a split brain will cause so
much work/trouble (and even service interruption) that most admins
here would not consider using a dual-primary configuration ;)

-----Original Message-----
From: xen-api-bounces@xxxxxxxxxxxxxxxxxxx
[mailto:xen-api-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of George
Shuklin
Sent: Tuesday, 19 July 2011 14:34
To: Dave Scott
Cc: xen-api@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-API] SG_IO for iscsi targets in XCP

Thank you very much.

I feel safer now about the dual-primary DRBD configuration. I'll report the results
of a practical deployment under real-life load later.

On Tue, 19/07/2011 at 12:21 +0100, Dave Scott wrote:
Hi George,

XCP just uses shared LVM over iSCSI as a generic block device. This is only safe because 
(i) we modified LVM to run in a "read-only" mode on slaves; and (ii) we 
co-ordinate all LVM metadata updates across the pool in the XCP storage layer.
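Stock LVM has a related, but distinct, knob; XCP's behaviour comes from a source modification, not from this setting, so the fragment below only illustrates the general idea of a node that refuses metadata writes:

```
# lvm.conf (illustrative only; not how XCP implements slave read-only mode)
global {
    # refuse any on-disk LVM metadata change on this host
    metadata_read_only = 1
}
```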

I'm researching whether XCP issues any SCSI commands
such as reservation or persistent reservation. I grepped the
source code for the SG_IO ioctl() and found just a few innocent inquiry/ID requests.
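That kind of audit is a plain recursive grep. The snippet below is a self-contained demo against a made-up file (the file name and content are hypothetical; the real check was against the XCP source tree):

```shell
mkdir -p /tmp/sgio_demo
cat > /tmp/sgio_demo/scsiutil.py <<'EOF'
# hypothetical excerpt: an INQUIRY issued through the SG_IO ioctl
fcntl.ioctl(fd, SG_IO, inquiry_cdb)
EOF
# list every file and line that mentions SG_IO
grep -rn 'SG_IO' /tmp/sgio_demo
```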

Just to be sure: are any SCSI-specific features used in XCP for
cluster management or resource locking, or is iSCSI used only as a
generic block device with LVM?


_______________________________________________
xen-api mailing list
xen-api@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/mailman/listinfo/xen-api


