Strange issue with LDAP queries hanging
Zendesk issue: https://gitlab.zendesk.com/agent/tickets/14082
@patricio and I spent quite a lot of time troubleshooting this strange issue with a customer tonight. It came on suddenly and left a majority (but not all) of their LDAP users unable to sign in. We suspected it was related to LDAP because we started seeing read timeouts on the LDAP callback URL. Requests were timing out at 60 seconds and the workers were getting killed. Raising the timeout or the number of workers had no effect. Also, the outcome for any given user was entirely consistent: a user either could always sign in or never could.
We started adding debug logging, starting in the omniauth controller. We quickly got down into `lib/gitlab/ldap/access.rb` and traced it to the following code.

    def allowed?
      if Gitlab::LDAP::Person.find_by_dn(user.ldap_identity.extern_uid, adapter) # <--- This works fine!
        return true unless ldap_config.active_directory

        # Block user in GitLab if he/she was blocked in AD
        if Gitlab::LDAP::Person.disabled_via_active_directory?(user.ldap_identity.extern_uid, adapter)
          user.block
          false
        else
          user.activate if user.blocked? && !ldap_config.block_auto_created_users
          true
        end
      else
        # Block the user if they no longer exist in LDAP/AD
        user.block
        false
      end
    rescue
      false
    end

    def update_admin_status
      admin_group = Gitlab::LDAP::Group.find_by_cn(ldap_config.admin_group, adapter) # <--- This works fine!
      admin_user = Gitlab::LDAP::Person.find_by_dn(user.ldap_identity.extern_uid, adapter) # <--- This does *not* work fine! (Notice it's the same query from earlier).

      if admin_group && admin_group.has_member?(admin_user)
        unless user.admin?
          user.admin = true
          user.save
        end
      else
        if user.admin?
          user.admin = false
          user.save
        end
      end
    end

At this point we noticed we could collapse those 2 queries into one. `access.rb` already has an `ldap_user` method that memoizes the result with `||=`, so the query only runs once. After we changed both locations to use `ldap_user`, their users were all able to log in again.
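For anyone following along, here is a minimal sketch of the memoization pattern we leaned on. It is not the actual `access.rb` code: `FakeDirectory`, `AccessSketch`, and `admin_candidate` are hypothetical stand-ins, and the fake directory just counts lookups so the effect is visible.

```ruby
# Hypothetical stand-in for the LDAP person lookup; it only counts calls.
class FakeDirectory
  attr_reader :query_count

  def initialize
    @query_count = 0
  end

  def find_by_dn(dn)
    @query_count += 1
    { dn: dn }
  end
end

# Hypothetical, stripped-down version of the access class.
class AccessSketch
  def initialize(directory, dn)
    @directory = directory
    @dn = dn
  end

  # Memoized lookup: the query runs on the first call only.
  # Caveat: ||= runs the query again if the earlier result was nil or false.
  def ldap_user
    @ldap_user ||= @directory.find_by_dn(@dn)
  end

  # Both callers go through ldap_user instead of querying separately.
  def allowed?
    !ldap_user.nil?
  end

  def admin_candidate
    ldap_user
  end
end

directory = FakeDirectory.new
access = AccessSketch.new(directory, "uid=jdoe,ou=people,dc=example,dc=com")
access.allowed?
access.admin_candidate
puts directory.query_count # prints 1
```

The point is only that two callers sharing one memoized method issue a single lookup per object, which is why the person query no longer ran after the group query.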
However, this only fixed existing users. New users who had never signed in before were still unable to sign in. We created a test LDAP user and noticed that the GitLab user is created in the DB, as is the LDAP identity. Again we traced it down to a hanging query immediately after the admin group query. Except, since we had already optimized and were using `ldap_user`, it now hung at the next LDAP query, where it checks group membership.

    def update_admin_status
      admin_group = Gitlab::LDAP::Group.find_by_cn(ldap_config.admin_group, adapter) # <--- This works fine!

      if admin_group && admin_group.has_member?(ldap_user) # <--- Now hangs here (member query in `group.rb`)
        unless user.admin?
          user.admin = true
          user.save
        end
      else
        if user.admin?
          user.admin = false
          user.save
        end
      end
    end

So the questions are:
- Why does it hang for some users and not others? We couldn't find anything the affected users have in common.
- Why does the query immediately after the `Gitlab::LDAP::Group.find_by_cn` query fail each time?
- Is something causing the `adapter` to become unusable?
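One idea for narrowing these down is to wrap each LDAP call in a short deadline and log how long it took, so a hang shows up at the exact call instead of at the 60 second request timeout. The sketch below is only an illustration: `timed_ldap_call` is a hypothetical helper, not something in GitLab, and the blocks are placeholders for the real queries. Note that `Timeout.timeout` cannot always interrupt a call that is blocked inside native code, so treat it as a debugging aid only.

```ruby
require "timeout"

# Hypothetical debugging helper: runs the block with a deadline and logs
# the elapsed time to stderr, whether the block finished or timed out.
def timed_ldap_call(label, limit_seconds)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = Timeout.timeout(limit_seconds) { yield }
  [:ok, result]
rescue Timeout::Error
  [:timeout, nil]
ensure
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  warn format("%s finished after %.3fs", label, elapsed)
end

# Placeholder blocks standing in for the group query and the member query.
fast = timed_ldap_call("group lookup", 1) { :group_entry }
slow = timed_ldap_call("member lookup", 0.1) { sleep 1 }

p fast # prints [:ok, :group_entry]
p slow # prints [:timeout, nil]
```

If the second call times out only when it follows the group query on the same `adapter`, that would point at the connection state rather than at the user's data.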
@patricio and I are going to debug again Thursday. The urgent part of the issue is over for now. However, the customer says they add new users all the time, so this cannot wait. Any help or ideas that others can give us are greatly appreciated.