Strange issue with LDAP queries hanging

Zendesk issue: https://gitlab.zendesk.com/agent/tickets/14082

@patricio and I spent quite a lot of time troubleshooting this strange issue with a customer tonight. It came on suddenly and left a majority (but not all) of LDAP users unable to sign in. We suspected LDAP because we started seeing read timeouts on the LDAP callback URL. Requests were timing out at 60 seconds and the workers were getting killed. Raising the timeout or the number of workers had no effect. Also, whether a given user could sign in was entirely consistent: the same users failed every time.

We started adding debug logging, starting in the omniauth controller. We quickly got down into lib/gitlab/ldap/access.rb and traced it to the following code.

      def allowed?
        if Gitlab::LDAP::Person.find_by_dn(user.ldap_identity.extern_uid, adapter) # <--- This works fine!
          return true unless ldap_config.active_directory

          # Block user in GitLab if he/she was blocked in AD
          if Gitlab::LDAP::Person.disabled_via_active_directory?(user.ldap_identity.extern_uid, adapter)
            user.block
            false
          else
            user.activate if user.blocked? && !ldap_config.block_auto_created_users
            true
          end
        else
          # Block the user if they no longer exist in LDAP/AD
          user.block 
          false
        end
      rescue
        false
      end

      def update_admin_status
        admin_group = Gitlab::LDAP::Group.find_by_cn(ldap_config.admin_group, adapter) # <--- This works fine!
        admin_user = Gitlab::LDAP::Person.find_by_dn(user.ldap_identity.extern_uid, adapter) # <--- This does *not* work fine! (Notice it's the same query as in allowed?)

        if admin_group && admin_group.has_member?(admin_user)
          unless user.admin?
            user.admin = true
            user.save
          end
        else
          if user.admin?
            user.admin = false
            user.save
          end
        end
      end

At this point we noticed we could combine those two queries into one. access.rb already has an ldap_user method that memoizes its result with ||=, so the query runs only once. After we changed both locations to use ldap_user, their users were all able to log in again.
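
For anyone not familiar with the pattern, here is a minimal standalone sketch of memoizing a lookup with ||=. It is not the actual access.rb code: FakeAdapter and AccessSketch are hypothetical stand-ins invented for illustration, and the fake adapter only counts how many times it is queried.

```ruby
# Hypothetical stand-in for the LDAP adapter; it only counts lookups.
class FakeAdapter
  attr_reader :query_count

  def initialize
    @query_count = 0
  end

  def find_by_dn(dn)
    @query_count += 1
    { dn: dn }
  end
end

# Hypothetical stand-in for the access class; not the real implementation.
class AccessSketch
  def initialize(dn, adapter)
    @dn = dn
    @adapter = adapter
  end

  # Memoized: the first call runs the query, later calls reuse the result.
  # Caveat: ||= runs the query again if the stored result is nil or false.
  def ldap_user
    @ldap_user ||= @adapter.find_by_dn(@dn)
  end

  def allowed?
    !ldap_user.nil?
  end

  def update_admin_status
    ldap_user # reuses the cached entry instead of querying a second time
  end
end

adapter = FakeAdapter.new
access = AccessSketch.new("uid=jdoe,dc=example,dc=com", adapter)
access.allowed?
access.update_admin_status
puts adapter.query_count
```

With both methods going through ldap_user, the adapter is hit once per AccessSketch instance rather than once per method.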

However, this fixed sign-in only for existing users. New users who had never signed in before were still unable to sign in. We created a test LDAP user and noticed that the GitLab user is created in the DB, as is the LDAP identity. Again we traced it to a hanging query immediately after the admin group query. Except, since we had already switched to ldap_user, it now hung at the next LDAP query, where it checks group membership.

      def update_admin_status
        admin_group = Gitlab::LDAP::Group.find_by_cn(ldap_config.admin_group, adapter) # <--- This works fine!

        if admin_group && admin_group.has_member?(ldap_user) # <--- Now hangs here (member query in `group.rb`)
          unless user.admin?
            user.admin = true
            user.save
          end
        else
          if user.admin?
            user.admin = false
            user.save
          end
        end
      end

So the questions are:

  1. Why does it hang for some users and not others? We couldn't find anything the affected users have in common.
  2. Why does the query immediately after the Gitlab::LDAP::Group.find_by_cn query fail each time?
  3. Is something causing the adapter to become unusable?
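
One idea for the next debugging session, offered only as a rough sketch: wrap each LDAP step in a short timeout with timing output, so a hang shows up in seconds instead of at the 60 second worker timeout. timed_ldap_step is a hypothetical helper name, not something that exists in GitLab, and this assumes the blocking call can be interrupted by Ruby's Timeout.

```ruby
require "timeout"

# Hypothetical helper: runs the block under a time limit and logs the outcome.
def timed_ldap_step(label, limit: 5, log: $stderr)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = Timeout.timeout(limit) { yield }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  log.puts format("%s ok in %.3fs", label, elapsed)
  result
rescue Timeout::Error
  log.puts "#{label} timed out after #{limit}s"
  raise
end

# Stand-in blocks; in real debugging these would be the actual LDAP calls.
timed_ldap_step("find_by_cn") { :admin_group }

begin
  timed_ldap_step("has_member?", limit: 0.1) { sleep 1 }
rescue Timeout::Error
  # The slow step is reported and re-raised; swallowed here so the demo continues.
end
```

Wrapping the group lookup and the query right after it separately should at least confirm whether it is always the second query on the same adapter that stalls.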

@patricio and I are going to debug again Thursday. The urgent part of the issue is over for now. However, they say they add new users all the time, so this cannot wait. Any help or ideas that others can offer are greatly appreciated.