Performance issues after upgrading to Zimbra 8.0.7



Over the weekend, we upgraded our system from Zimbra 7.2.6 to 8.0.7. Everything seemed to go fine after the upgrade, and performance was fine for the rest of the weekend.

Then this morning came. By 9am the service was unusable. Mailbox logs showed numerous "lock timeout" errors as threads gave up waiting to secure a lock on a mailbox. The "elapsed=" time at the end of each call logged in mailbox.log was in the 5-6 digit range -- i.e. 10s to 100s of seconds to process a call. Most users couldn't even log in, as the client would just hang at the "Loading..." screen.
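For anyone wanting to compare against their own logs, here is a rough sketch of how those two symptoms could be counted. It assumes the "elapsed=" value is in milliseconds (which is what 5-6 digits meaning 10s to 100s of seconds implies) and that it sits at the very end of the line; the function name, the threshold and the sample lines are made up for illustration and are not real mailbox.log output.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: the call duration appears as "elapsed=<digits>" at the end of a line.
ELAPSED_RE = re.compile(r"elapsed=(\d+)\s*$")


def summarize(lines: Iterable[str], slow_ms: int = 10_000) -> Tuple[int, List[int]]:
    """Return (number of lines mentioning a lock timeout, elapsed values >= slow_ms)."""
    lock_timeouts = 0
    slow_calls: List[int] = []
    for line in lines:
        if "lock timeout" in line.lower():
            lock_timeouts += 1
        match = ELAPSED_RE.search(line)
        if match:
            elapsed_ms = int(match.group(1))
            if elapsed_ms >= slow_ms:
                slow_calls.append(elapsed_ms)
    return lock_timeouts, slow_calls


# Hypothetical lines, only to exercise the parser.
SAMPLE = [
    "request A elapsed=42",
    "request B elapsed=15230",
    "request C failed: lock timeout",
    "request D elapsed=104800",
]

if __name__ == "__main__":
    timeouts, slow = summarize(SAMPLE)
    print(f"lock timeouts: {timeouts}, slow calls: {len(slow)}")
```

To run it against a real log, pass an open file handle for mailbox.log to summarize() instead of SAMPLE, and adjust the pattern if your log lines are laid out differently.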

We have two LDAP servers -- a primary and a secondary, and we found that load on the secondary was really high. It was a 4 CPU VM and all were 100% busy. We increased the vCPUs to 8 and that seemed to bring down the load, but the lock timeout errors never went away. This continued all day. Early in the afternoon we started adding a third LDAP server to the cluster, but it was after 3:30pm by the time it was up. By then the load on the overall system had dropped a bit as staff were starting to give up and go home. I added the third LDAP server to two of our 4 mailbox servers so they'd prefer that one. Gradually the load on that one climbed and fell on the other one. But now it was 4pm and things had settled down. Users who tried to log in got sluggish response but at least they could get in.

Now it's 7pm and everything is fine, but I fear it will all melt down again in the morning.

The weird thing was that the lock errors were all coming from inside LDAP code, which is something we hadn't seen before -- it appears as though Zimbra now arbitrates locks via LDAP. 

We are also preparing a big physical server to put back in as an LDAP server -- we migrated the LDAP servers to VMs in May and I did see some unexplained load spikes after that, but nothing like today. 

Did anyone else see a massive load spike on their LDAP servers after upgrading? Do you run master/slave or multi-master? Zimbra Support is recommending we switch to multi-master but I don't see that making a huge difference to performance, and they haven't so far explained to me the basis for their recommendation.


--
Steve Hillman        IT Architect
hillman@sfu.ca       IT Services
778-782-3960         Simon Fraser University