
Re: Performance issues after upgrading to Zimbra 8.0.7



We applied patch P2 last night and that seems to have resolved the problem, as it's 9am here now and the phones aren't ringing off the hook. Thanks Tony and Quanah for bringing this to our attention! Not installing P2 was an oversight on my part. The bug we hit is this one:  https://bugzilla.zimbra.com/show_bug.cgi?id=89504. It's mentioned in the release notes as "89504 – Soap - GetInfoRequest has improved processing time" -- that's an understatement!

It only affects sites with lots of distribution lists (we have thousands; we use them only as ACLs).


Hi Matt!

Have you guys applied the 8.0.7 patch that was released a couple weeks ago yet?  If not, I strongly suggest that you do.  There are definitely some potential issues out there with running unpatched, as Steve found out yesterday, unfortunately.  The one strange thing about it is we haven't really seen a reason why some sites seem to immediately see it and others never do.

Tony


From: "Matt Mencel" <MR-Mencel@wiu.edu>
To: "zimbra-hied-admins" <zimbra-hied-admins@sfu.ca>
Sent: Tuesday, August 26, 2014 11:27:48 AM
Subject: Re: Performance issues after upgrading to Zimbra 8.0.7

We upgraded to 8.0.7 from 7.2.X in May after the students had left for the summer.  Shortly thereafter I converted our two LDAP servers from master/slave to MMR.  We didn't notice any issues before going to MMR, but it was summer so...light usage.

Yesterday was the first day of classes and we had no problems.  Our LDAPs are both VMs with 4 vCPUs and 4GB of RAM.  We have about 15,000 active (within the last 7 days) accounts, and we are running around 4500 HTTP connections and 170 IMAP connections through two proxy servers at any given time.

All nodes list both LDAP servers, but in different orders.  Odd-numbered hosts have ldap1 listed first and even-numbered hosts have ldap2 listed first.  These are the two keys from conf/localconfig.xml on our odd-numbered MTA.

<key name="ldap_master_url">
  <value>ldap://ldap01.here.com:389 ldap://ldap02.here.com:389</value>
</key>

<key name="ldap_url">
  <value>ldap://ldap01.here.com:389 ldap://ldap02.here.com:389</value>
</key>
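
To check which LDAP server a given host lists first, something like the following Python sketch could be used. It is only an illustration: the <localconfig> wrapper element, the SAMPLE string, and the helper names ldap_urls and preferred_server are assumptions made for this example, and it assumes the first URL in the list is the one the host tries first.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only. The <localconfig> root element is an assumption;
# the two keys and their values are the ones quoted in the message.
SAMPLE = """<localconfig>
  <key name="ldap_master_url">
    <value>ldap://ldap01.here.com:389 ldap://ldap02.here.com:389</value>
  </key>
  <key name="ldap_url">
    <value>ldap://ldap01.here.com:389 ldap://ldap02.here.com:389</value>
  </key>
</localconfig>"""


def ldap_urls(xml_text, key_name):
    """Return the space-separated URLs stored under key_name, in listed order."""
    root = ET.fromstring(xml_text)
    for key in root.iter("key"):
        if key.get("name") == key_name:
            # A missing <value> element yields an empty list.
            return key.findtext("value", default="").split()
    return []


def preferred_server(xml_text, key_name="ldap_url"):
    """Return the first listed URL (assumed to be tried first), or None."""
    urls = ldap_urls(xml_text, key_name)
    return urls[0] if urls else None


if __name__ == "__main__":
    print(preferred_server(SAMPLE))
```

Running the same check on an even-numbered host should report ldap02 first if the ordering described above is in place.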

 


And here is our 15-minute load graph for the last 48 hours from our two LDAP nodes.  Unfortunately, I don't have graph data from before we moved to MMR to compare against.




From: "Tony Publiski" <tonster@tonster.com>
To: "Steve Hillman" <hillman@sfu.ca>
Cc: "zimbra-hied-admins" <zimbra-hied-admins@sfu.ca>
Sent: Monday, August 25, 2014 9:23:22 PM
Subject: Re: Performance issues after upgrading to Zimbra 8.0.7

Hey Steve, did you install the ZCS 8.0.7 patch as well?  This issue sounds like one of the problems some sites encountered with some new delegated admin code that came into 8.0.7 and was resolved by the recently released patch.


From: "Steve Hillman" <hillman@sfu.ca>
To: "zimbra-hied-admins" <zimbra-hied-admins@sfu.ca>
Sent: Monday, August 25, 2014 10:07:04 PM
Subject: Performance issues after upgrading to Zimbra 8.0.7

Over the weekend, we upgraded our system from Zimbra 7.2.6 to 8.0.7. Everything seemed to go fine after the upgrade, and performance was fine for the rest of the weekend.

Then this morning came. By 9am the service was unusable. Mailbox logs showed numerous "lock timeout" errors as threads gave up waiting to secure a lock on a mailbox. The "elapsed=" time at the end of each call logged in mailbox.log was in the 5-6 digit range -- i.e. tens to hundreds of seconds to process a call. Most users couldn't even get logged in, as it would just hang at the "Loading..." screen.
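
For anyone wanting to look for the same symptom, here is a rough Python sketch that pulls out calls whose trailing "elapsed=" value is large. It assumes the value is in milliseconds (which matches 5-6 digits meaning tens to hundreds of seconds) and that it is the last thing on the line. The sample lines are made up for the example and are not real mailbox.log output; slow_calls and the 10,000 ms threshold are likewise just illustrative choices.

```python
import re

# Matches a trailing "elapsed=<digits>" at the end of a log line.
ELAPSED_RE = re.compile(r"elapsed=(\d+)\s*$")


def slow_calls(lines, threshold_ms=10_000):
    """Yield (elapsed_ms, line) for lines at or above threshold_ms.

    Assumes the elapsed value is in milliseconds.
    """
    for line in lines:
        match = ELAPSED_RE.search(line)
        if match and int(match.group(1)) >= threshold_ms:
            yield int(match.group(1)), line.rstrip("\n")


# Made-up sample lines, NOT real mailbox.log entries.
SAMPLE_LINES = [
    "sample call A elapsed=42\n",
    "sample call B elapsed=15230\n",
    "a line with no timing information\n",
    "sample call C elapsed=123456\n",
]

if __name__ == "__main__":
    for elapsed_ms, line in slow_calls(SAMPLE_LINES):
        print(f"{elapsed_ms / 1000:.1f}s  {line}")
```

To run it against a real log, pass an open file object in place of SAMPLE_LINES.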

We have two LDAP servers -- a primary and a secondary -- and we found that load on the secondary was really high. It was a 4 CPU VM and all four CPUs were 100% busy. We increased the vCPUs to 8 and that seemed to bring down the load, but the lock timeout errors never went away. This continued all day. Early in the afternoon we started adding a third LDAP server to the cluster, but it was after 3:30pm by the time it was up. By then the load on the overall system had dropped a bit as staff were starting to give up and go home. I added the third LDAP server to two of our 4 mailbox servers so they'd prefer that one. Gradually the load on that one climbed and the load on the other one fell. But by now it was 4pm and things had settled down. Users who tried to log in got sluggish response, but at least they could get in.

Now it's 7pm and everything is fine, but I fear it will all melt down again in the morning.

The weird thing was that the lock errors were all coming from inside LDAP code, which is something we hadn't seen before -- it appears as though Zimbra now arbitrates locks via LDAP. 
 
We are also preparing a big physical server to put back in as an LDAP server -- we migrated the LDAP servers to VMs in May and I did see some unexplained load spikes after that, but nothing like today. 
 
Did anyone else see a massive load spike on their LDAP servers after upgrading? Do you run master/slave or multi-master? Zimbra Support is recommending we switch to multi-master but I don't see that making a huge difference to performance, and they haven't so far explained to me the basis for their recommendation.


--
Steve Hillman        IT Architect
hillman@sfu.ca       IT Services
778-782-3960         Simon Fraser University




