
Re: Recent upgrade of 4.13 -> 4.14 issue


  • To: Dario Faggioli <dfaggioli@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Frédéric Pierret <frederic.pierret@xxxxxxxxxxxx>
  • Date: Mon, 26 Oct 2020 20:10:40 +0100
  • Cc: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Mon, 26 Oct 2020 19:10:56 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>



On 10/26/20 at 6:54 PM, Dario Faggioli wrote:
On Mon, 2020-10-26 at 17:11 +0100, Frédéric Pierret wrote:
On 10/26/20 at 2:54 PM, Andrew Cooper wrote:
If anyone has any idea of what's going on, that would be very much
appreciated. Thank you.

Does booting Xen with `sched=credit` make a difference?

~Andrew

Thank you Andrew. Since your mail I have been testing this in
production, and it's clearly more stable than this morning. I won't
say it's solved yet, because yesterday I also had a few hours of
stability, but it's clearly encouraging: this morning it was just
hell every 15 to 30 minutes.

Ok, yes, let us know if the credit scheduler seems to not suffer from
the issue.


Yes, unfortunately. I had a few hours of stability, but it ended up with this:

```
[15883.967829] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[15883.967868] rcu:     12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=14879
[15883.967884]  (detected by 0, t=60002 jiffies, g=460221, q=89)
[15883.967901] Sending NMI from CPU 0 to CPUs 12:
[15893.970590] rcu: rcu_sched kthread starved for 9994 jiffies! g460221 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=9
[15893.970622] rcu: RCU grace-period kthread stack dump:
[15893.970631] rcu_sched       R  running task        0    10      2 0x80004008
[15893.970645] Call Trace:
[15893.970658]  ? xen_hypercall_xen_version+0xa/0x20
[15893.970670]  ? xen_force_evtchn_callback+0x9/0x10
[15893.970679]  ? check_events+0x12/0x20
[15893.970687]  ? xen_restore_fl_direct+0x1f/0x20
[15893.970697]  ? _raw_spin_unlock_irqrestore+0x14/0x20
[15893.970708]  ? force_qs_rnp+0x6f/0x170
[15893.970715]  ? rcu_nocb_unlock_irqrestore+0x30/0x30
[15893.970724]  ? rcu_gp_fqs_loop+0x234/0x2a0
[15893.970732]  ? rcu_gp_kthread+0xb5/0x140
[15893.970740]  ? rcu_gp_init+0x470/0x470
[15893.970748]  ? kthread+0x115/0x140
[15893.970756]  ? __kthread_bind_mask+0x60/0x60
[15893.970764]  ? ret_from_fork+0x35/0x40
[16063.972793] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16063.972825] rcu:     12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=57364
[16063.972840]  (detected by 5, t=240007 jiffies, g=460221, q=6439)
[16063.972855] Sending NMI from CPU 5 to CPUs 12:
[16243.977769] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16243.977802] rcu:     12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=99504
[16243.977817]  (detected by 11, t=420012 jiffies, g=460221, q=6710)
[16243.977830] Sending NMI from CPU 11 to CPUs 12:
[16253.980496] rcu: rcu_sched kthread starved for 10001 jiffies! g460221 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=9
[16253.980528] rcu: RCU grace-period kthread stack dump:
[16253.980537] rcu_sched       R  running task        0    10      2 0x80004008
[16253.980550] Call Trace:
[16253.980563]  ? xen_hypercall_xen_version+0xa/0x20
[16253.980575]  ? xen_force_evtchn_callback+0x9/0x10
[16253.980584]  ? check_events+0x12/0x20
[16253.980592]  ? xen_restore_fl_direct+0x1f/0x20
[16253.980602]  ? _raw_spin_unlock_irqrestore+0x14/0x20
[16253.980613]  ? force_qs_rnp+0x6f/0x170
[16253.980620]  ? rcu_nocb_unlock_irqrestore+0x30/0x30
[16253.980629]  ? rcu_gp_fqs_loop+0x234/0x2a0
[16253.980637]  ? rcu_gp_kthread+0xb5/0x140
[16253.980645]  ? rcu_gp_init+0x470/0x470
[16253.980653]  ? kthread+0x115/0x140
[16253.980661]  ? __kthread_bind_mask+0x60/0x60
[16253.980669]  ? ret_from_fork+0x35/0x40
[16423.982735] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16423.982789] rcu:     12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=139435
[16423.982820]  (detected by 10, t=600017 jiffies, g=460221, q=7354)
[16423.982842] Sending NMI from CPU 10 to CPUs 12:
[16433.984844] rcu: rcu_sched kthread starved for 10001 jiffies! g460221 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=3
[16433.984875] rcu: RCU grace-period kthread stack dump:
[16433.984885] rcu_sched       R  running task        0    10      2 0x80004000
[16433.984897] Call Trace:
[16433.984910]  ? xen_hypercall_xen_version+0xa/0x20
[16433.984922]  ? xen_force_evtchn_callback+0x9/0x10
[16433.984931]  ? check_events+0x12/0x20
[16433.984939]  ? xen_restore_fl_direct+0x1f/0x20
[16433.984949]  ? _raw_spin_unlock_irqrestore+0x14/0x20
[16433.984960]  ? force_qs_rnp+0x6f/0x170
[16433.984967]  ? rcu_nocb_unlock_irqrestore+0x30/0x30
[16433.984976]  ? rcu_gp_fqs_loop+0x234/0x2a0
[16433.984984]  ? rcu_gp_kthread+0xb5/0x140
[16433.984992]  ? rcu_gp_init+0x470/0x470
[16433.985000]  ? kthread+0x115/0x140
[16433.985007]  ? __kthread_bind_mask+0x60/0x60
[16433.985015]  ? ret_from_fork+0x35/0x40
[16603.987677] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16603.987710] rcu:     12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=179313
[16603.987725]  (detected by 0, t=780022 jiffies, g=460221, q=7869)
[16603.987740] Sending NMI from CPU 0 to CPUs 12:
[16783.992658] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16783.992710] rcu:     12-...0: (75 ticks this GP) idle=5c6/1/0x4000000000000000 softirq=139356/139357 fqs=219106
[16783.992741]  (detected by 13, t=960027 jiffies, g=460221, q=8300)
[16783.992768] Sending NMI from CPU 13 to CPUs 12:
[16793.995873] rcu: rcu_sched kthread starved for 10000 jiffies! g460221 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=4
[16793.995906] rcu: RCU grace-period kthread stack dump:
[16793.995915] rcu_sched       R  running task        0    10      2 0x80004000
[16793.995930] Call Trace:
[16793.995948]  ? xen_hypercall_xen_version+0xa/0x20
[16793.995963]  ? xen_force_evtchn_callback+0x9/0x10
[16793.995972]  ? check_events+0x12/0x20
[16793.995979]  ? xen_restore_fl_direct+0x1f/0x20
[16793.995992]  ? _raw_spin_unlock_irqrestore+0x14/0x20
[16793.996004]  ? force_qs_rnp+0x6f/0x170
[16793.996012]  ? rcu_nocb_unlock_irqrestore+0x30/0x30
[16793.996021]  ? rcu_gp_fqs_loop+0x234/0x2a0
[16793.996029]  ? rcu_gp_kthread+0xb5/0x140
[16793.996037]  ? rcu_gp_init+0x470/0x470
[16793.996046]  ? kthread+0x115/0x140
[16793.996054]  ? __kthread_bind_mask+0x60/0x60
[16793.996062]  ? ret_from_fork+0x35/0x40
```
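
For anyone reading along, here is a minimal sketch of how the "(detected by ...)" lines above can be summarised. It is not part of any Xen or Linux tooling: the function names `parse_stall_reports` and `estimate_hz` are made up for illustration, and the sample is three lines copied from the log. It assumes that `t=` counts jiffies elapsed since the start of the stalled grace period, so comparing it with the dmesg timestamps gives an estimate of the guest's tick rate.

```python
import re

# Matches lines such as:
# "[15883.967884]  (detected by 0, t=60002 jiffies, g=460221, q=89)"
DETECT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\(detected by (?P<by>\d+), "
    r"t=(?P<t>\d+) jiffies, g=(?P<g>\d+), q=(?P<q>\d+)\)"
)


def parse_stall_reports(log_text):
    """Return one dict per '(detected by ...)' line found in log_text."""
    reports = []
    for m in DETECT_RE.finditer(log_text):
        reports.append({
            "timestamp": float(m.group("ts")),   # dmesg timestamp, seconds
            "detected_by": int(m.group("by")),   # CPU that reported the stall
            "jiffies": int(m.group("t")),        # assumed: jiffies into the grace period
            "grace_period": int(m.group("g")),   # grace-period sequence number
        })
    return reports


def estimate_hz(reports):
    """Estimate jiffies per second from the first and last report."""
    first, last = reports[0], reports[-1]
    elapsed_jiffies = last["jiffies"] - first["jiffies"]
    elapsed_seconds = last["timestamp"] - first["timestamp"]
    return elapsed_jiffies / elapsed_seconds


# Three lines copied verbatim from the log above.
SAMPLE = """\
[15883.967884]  (detected by 0, t=60002 jiffies, g=460221, q=89)
[16063.972840]  (detected by 5, t=240007 jiffies, g=460221, q=6439)
[16783.992741]  (detected by 13, t=960027 jiffies, g=460221, q=8300)
"""

reports = parse_stall_reports(SAMPLE)
hz = estimate_hz(reports)
same_gp = len({r["grace_period"] for r in reports}) == 1

print(f"estimated HZ: {hz:.0f}")
print(f"all reports in the same grace period: {same_gp}")
print(f"stall duration at last report: {reports[-1]['jiffies'] / hz:.0f} s")
```

On these three lines the estimate comes out at roughly 1000 jiffies per second, and every report carries the same `g=460221`, which suggests one grace period that never completes (about 16 minutes by the last report) rather than a series of separate stalls.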

I'm curious about another thing, though. You mentioned, in your
previous email (and in the subject :-)) that this is a 4.13 -> 4.14
issue for you?

This has indeed been happening since I updated Xen from 4.13 to 4.14; 4.13 was
totally stable for me, and the server had been running for months without any issue.
Does that mean that the problem was not there on 4.13?

I'm asking because Credit2 was already the default scheduler in 4.13.

So, unless you were configuring things differently, you were already
using it there.
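
One way to confirm which scheduler a host is actually running is to read the `xen_scheduler` field printed by `xl info` in dom0. The snippet below is only a sketch under that assumption: the helper names `parse_xl_info` and `active_scheduler` are made up for illustration, and the sample text is a hypothetical excerpt, not output from the affected server.

```python
import subprocess


def parse_xl_info(text):
    """Parse 'key : value' lines, as printed by `xl info`, into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing ':' stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def active_scheduler():
    """Run `xl info` (needs dom0 privileges) and return the scheduler name."""
    out = subprocess.run(
        ["xl", "info"], check=True, capture_output=True, text=True
    ).stdout
    return parse_xl_info(out).get("xen_scheduler")


# Hypothetical excerpt, for illustration only; not taken from the affected host.
EXAMPLE = """\
xen_scheduler          : credit2
xen_commandline        : sched=credit2
"""

example_info = parse_xl_info(EXAMPLE)
print(example_info["xen_scheduler"])
```

Calling `active_scheduler()` on both the 4.13 and the 4.14 installation would show whether the two were really using the same scheduler before `sched=credit` was added.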

In principle, there is a new custom patch for S3 resume from Marek (in CC), and
he is much better placed than I am to detail the specific changes with
respect to 4.13.

If this is the case, it would hint at the fact that something that
changed between .13 and .14 could be the cause.

Regards


Thank you again for your help.

Attachment: OpenPGP_0x484010B5CDC576E2.asc
Description: application/pgp-keys

Attachment: OpenPGP_signature
Description: OpenPGP digital signature
