Hello, everybody:
 
   In my institute, there are hundreds of computers running Xen virtual
machines. Every virtual machine runs ntpdate once an hour to sync its time
with the NTP servers independently (echo 1 > /proc/sys/xen/independent_wallclock
is set in both DomU and Dom0). I have found that, every week or two, one
virtual machine has a time sync error. The misbehaving VM is different every
time (and so is the physical machine), but the error is always the same:
the VM's clock jumps ahead 36 minutes and then stops until 36 minutes
later. This error makes the applications fail every time.
   The dmesg log in Dom0 and the output of xm dmesg do not show anything
abnormal. The dmesg log in the misbehaving VM is listed below:
 
May  5 17:56:49  kernel: Badness in tcp_verify_wq at net/ipv4/tcp_ipv4.c:221
May  5 17:56:49  kernel:  [<c044c34d>] tcp_verify_wq+0x239/0x27b
May  5 17:56:49  kernel:  [<c0441db3>] tcp_ack+0x65/0x187f
May  5 17:56:49  kernel:  [<c045fff9>] ipt_do_table+0x1e7/0x322
May  5 17:56:49  kernel:  [<c0460108>] ipt_do_table+0x2f6/0x32
......
May  5 17:56:50  kernel:  [<c012d8c4>] autoremove_wake_function+0x0/0x3d
May  5 17:56:50  kernel:  [<c015d932>] vfs_write+0x8a/0xdd
May  5 17:56:50  kernel:  [<c015dea1>] sys_write+0x3f/0x6
May  5 17:56:50  kernel:  [<c0104c8d>] syscall_call+0x7/0xb
The warning above is due to a lack of memory, and the kernel kills some
processes.
May  5 17:57:01  /usr/sbin/cron[14544]:
At this point the clock jumps ahead 36 minutes and then stops.
After 36 minutes, the clock starts running again.
May  5 18:33:05  /usr/sbin/cron[14598]:
May  5 18:33:05  sshd[2790]:
May  5 18:33:05  sshd[14659]:
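
To locate similar events on other VMs, the log can be scanned for large gaps
between consecutive timestamps. The following is only a minimal Python
sketch: the function name find_time_jumps, the 5-minute threshold and the
assumed year are illustrative choices of mine, not part of any existing tool.
A gap in the log does not prove that the clock jumped; it only marks where
to look.

```python
from datetime import datetime, timedelta


def find_time_jumps(lines, threshold=timedelta(minutes=5), year=2009):
    """Return (previous, current) timestamp pairs whose gap exceeds threshold.

    Hypothetical helper: syslog lines start with "Mon D HH:MM:SS" and carry
    no year, so the year is an assumption passed in by the caller.
    """
    jumps = []
    previous = None
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # not a timestamped log line
        try:
            stamp = datetime.strptime(
                f"{year} {fields[0]} {fields[1]} {fields[2]}",
                "%Y %b %d %H:%M:%S",
            )
        except ValueError:
            continue  # annotation or continuation line, skip it
        if previous is not None and abs(stamp - previous) > threshold:
            jumps.append((previous, stamp))
        previous = stamp
    return jumps


sample = [
    "May  5 17:56:50  kernel:  [<c0104c8d>] syscall_call+0x7/0xb",
    "May  5 17:57:01  /usr/sbin/cron[14544]:",
    "May  5 18:33:05  /usr/sbin/cron[14598]:",
]
for before, after in find_time_jumps(sample):
    print(before.time(), "->", after.time(), "gap:", after - before)
```

On the three sample lines taken from the log above, this reports one gap of
36 minutes and 4 seconds, between 17:57:01 and 18:33:05.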
 
   The Xen version is xen-3.2.0-16718-14-0.4, and the Linux version is
SUSE-2.6.16.60-0.21. The CPU is an Intel Xeon E5405, memory is 2 GB, and
there are 4 DomU VMs on each physical machine. Although the CPU is 64-bit,
both Xen and Linux are 32-bit versions.
   36 minutes is 2,160,000,000 microseconds, and 2160000000 = 0x80BEFC00,
which is just above 2^31 (0x80000000). On a 32-bit system, could this be
caused by the overflow of some time-keeping variable?
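
The arithmetic behind this guess can be checked with a few lines of Python.
Note that 2,160,000,000 is 36 minutes expressed in microseconds. The variable
names are mine, and whether a signed 32-bit microsecond counter is really
involved is only an assumption, not something I have confirmed in the Xen or
Linux source.

```python
MINUTES = 36
microseconds = MINUTES * 60 * 1_000_000   # 36 minutes in microseconds
INT32_LIMIT = 2**31                       # first value a signed 32-bit int cannot hold

print(microseconds)                # 2160000000
print(hex(microseconds))           # 0x80befc00
print(microseconds - INT32_LIMIT)  # 12516352 (about 12.5 s past 2**31 microseconds)

# The same bit pattern reinterpreted as a signed 32-bit value (two's complement).
as_signed_32 = microseconds - 2**32 if microseconds >= INT32_LIMIT else microseconds
print(as_signed_32)                # -2134967296
```

So 2^31 microseconds is about 35 minutes 47.5 seconds, which is close to,
but not exactly, the 36 minutes observed.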
   For various reasons, I cannot upgrade Xen or Linux to the latest
version.
   Could anybody help me deal with this time sync error? Thank you very
much for your help!
 
Best Wishes!
 
Xiang Zhang
National Research Center for Intelligent Computing Systems
Institute of Computing Technology
Chinese Academy of Sciences
Jun 18th, 2009