Closed Bug 976729 Opened 10 years ago Closed 10 years ago

[apk] memcache for signer

Categories

(Cloud Services :: Operations: Marketplace, task, P1)

x86
macOS
task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kumar, Assigned: jason)

Attachments

(1 file)

Let's add memcache to the APK Signer so we can check nonce values (bug 963141)
Blocks: 963141
Assignee: server-ops-amo → jthomas
Priority: -- → P1
Memcache is configured on -dev. This will be enabled on stage and prod on the next push.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Jason, let us know when you're able to stabilize memcache on dev. I'm re-opening this so we don't lose track of it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Sorry, disregard comment 4.
It looks like django.core.cache.backends.memcached.MemcachedCache is not playing nicely with uwsgi. After each worker handles 210 requests, it locks up and stops responding, and uwsgi reports the worker as 'busy'. Here is part of the gdb backtrace; the full backtrace is attached.

    Frame 0x3a73f30, for file /opt/apk-signer/apk-signer/venv/lib/python2.6/site-packages/django/dispatch/dispatcher.py, line 270, in _remove_receiver (self=<Signal(use_caching=False, lock=<thread.lock at remote 0x7fc6c20bf6a8>, providing_args=set([]), sender_receivers_cache={}, receivers=[((38228048, 7195216), <weakref at remote 0x247db50>), (((49751120, 49833936), 7195216), <BoundMethodWeakref(weakFunc=<weakref at remote 0x2f87050>, deletionMethods=[<instancemethod at remote 0x2f6c460>], weakSelf=<weakref at remote 0x2f87158>, funcName='close', selfName='<django.core.cache.backends.memcached.MemcachedCache object at 0x2f72450>', key=(49751120, 49833936)) at remote 0x2f25750>), (((57907344, 49833936), 7195216), <BoundMethodWeakref(weakFunc=<weakref at remote 0x3735a48>, deletionMethods=[<instancemethod at remote 0x3731500>], weakSelf=<weakref at remote 0x3735af8>, funcName='close', selfName='<django.core.cache.backends.memcached.MemcachedCache object at 0x3739890>', key=(57907344, 49833936)) at remote 0x37398d0>...(truncated)

Strangely, if I run the worker as root so that it is able to create bytecode files, the issue disappears.

I've tested with pylibmc and that works without an issue. See https://github.com/mozilla/apk-signer/pull/19 for it.
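For reference, switching the Django cache backend to pylibmc is normally a settings change along these lines. This is only a sketch of what such a change looks like, not a copy of the pull request: the server address is an assumed placeholder, and the actual signer settings may differ.

```python
# settings.py fragment (illustrative sketch, not the signer's real config).
CACHES = {
    "default": {
        # pylibmc-based backend that ships with Django, used here in place of
        # django.core.cache.backends.memcached.MemcachedCache.
        "BACKEND": "django.core.cache.backends.memcached.PyLibMCCache",
        # Assumed placeholder address; substitute the real memcached host.
        "LOCATION": "127.0.0.1:11211",
    }
}
```

The pylibmc package has to be installed in the virtualenv for this backend to load.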
Attached file gdb.txt
gdb bt full
This is the pylibmc patch: https://github.com/mozilla/apk-signer/commit/ba248a9ab116e018188137596ef94ef421e14c0b

This should fix the issue.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
The issues in comment 6 look related to this Django bug: https://code.djangoproject.com/ticket/21952
Unfortunately pylibmc only seems to delay the issue in comment 6. After about 2-3 days of uptime the worker becomes unresponsive.
Re-opening since it's still not fixed.

Here are some solutions I see:
- get a dev to spend time backporting the Django 1.7 code to 1.6. It looks like this got started by someone else but it probably will take a good chunk of time.
- try to use redis instead of memcache
- disable caching altogether. This opens us up to replay attacks in Hawk but that should be ok for what the signer does
- downgrade to Django 1.5. This will be tricky since the signer was built for 1.6
- periodically restart the workers :) I guess you're already doing that
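To make the replay-attack trade-off concrete, here is roughly what a cache-backed nonce check does. This is a minimal sketch under stated assumptions, not the signer's actual code: `InMemoryCache` is a hypothetical stand-in for a memcached client, and the `seen_nonce` name, key format, and 60-second TTL are made up for illustration.

```python
import time


class InMemoryCache:
    """Hypothetical stand-in for a memcached client; only add() is modelled."""

    def __init__(self):
        self._expiry_by_key = {}

    def add(self, key, value, timeout):
        # Like memcached's add: store only if the key is absent or expired.
        # The value itself is ignored here; only key presence matters.
        now = time.time()
        expires = self._expiry_by_key.get(key)
        if expires is not None and expires > now:
            return False
        self._expiry_by_key[key] = now + timeout
        return True


def seen_nonce(cache, sender_id, nonce, timestamp, ttl=60):
    """Return True if this (sender, nonce, timestamp) was already used."""
    key = "hawk-nonce:{0}:{1}:{2}".format(sender_id, nonce, timestamp)
    # add() fails when the key already exists, which signals a replay.
    return not cache.add(key, True, ttl)


cache = InMemoryCache()
first = seen_nonce(cache, "controller", "abc123", 1400000000)
replay = seen_nonce(cache, "controller", "abc123", 1400000000)
```

With caching disabled there is nowhere to remember used nonces, so the second call could not be told apart from the first.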
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Another workaround might be to switch to a multi-process wsgi setup rather than multithreaded. I guess it would use more memory.
Never mind comment #12; we were already running the APK Signer as processes.
(In reply to Kumar McMillan [:kumar] (needinfo for quickness) from comment #11)

> - periodically restart the workers :) I guess you're already doing that

Yes this is what we have in place right now and seems to be working okay.
Spoke with :kumar yesterday about this. I am going to configure Heka to push nginx logs to Kibana so we have an idea of how many requests are answered with a 500 Internal Server Error due to uwsgi harakiri. We are also aiming to move APK Signer to Django 1.7 once it is available.
Status: REOPENED → ASSIGNED
Summary: Memcache for APK Signer → [apk] memcache for signer
Component: Server Operations: AMO Operations → Operations: Marketplace
Product: mozilla.org → Mozilla Services
Version: other → unspecified
controller.apk.firefox.com nginx logs are available at https://kibana.shared.us-west-2.prod.mozaws.net/#/dashboard/elasticsearch/PROD%20-%20APK%20HTTP%20Status

signer nginx logs are available at https://kibana.shared.us-west-2.prod.mozaws.net/#/dashboard/elasticsearch/PROD%20-%20APK%20Signer%20HTTP%20Status

I don't see any 500 Internal Server Error responses, but there are a few 400 and 499 responses in the controller nginx logs. The signer nginx logs look okay.
Signer nginx logs look okay, no 500s. Closing this request out.
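The same check can be done directly against an access log without the dashboard. The sketch below assumes nginx's default "combined" log format; the sample lines and the `tally_statuses` helper are made up for illustration and are not taken from the real logs.

```python
import re
from collections import Counter

# Matches the three-digit status code that follows the quoted request line
# in nginx's default "combined" log format.
STATUS_RE = re.compile(r'" (\d{3}) ')


def tally_statuses(lines):
    """Count responses per HTTP status code across access-log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [01/May/2014:12:00:00 +0000] "POST /sign HTTP/1.1" 200 512 "-" "curl/7.30"',
    '10.0.0.2 - - [01/May/2014:12:00:01 +0000] "POST /sign HTTP/1.1" 499 0 "-" "curl/7.30"',
    '10.0.0.3 - - [01/May/2014:12:00:02 +0000] "GET /bad HTTP/1.1" 400 0 "-" "curl/7.30"',
]
counts = tally_statuses(sample)
# Total of all 5xx responses, which is what a harakiri'd request would produce.
server_errors = sum(n for status, n in counts.items() if status.startswith("5"))
```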
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED