InterLex setup
Setting up a new InterLex
Requirements
See the ebuild in tgbugs-overlay. https://github.com/tgbugs/tgbugs-overlay/blob/master/dev-python/interlex/interlex-9999.ebuild
Setup for pguri
postgres module
- Gentoo
  A `dev-db/pguri-9999.ebuild` is available in https://github.com/tgbugs/tgbugs-overlay

      layman -a tgbugs-overlay
      emerge pguri
- Ubuntu
      sudo apt-get install build-essential pkg-config liburiparser-dev postgresql-server-dev-all
      export PKG_CONFIG_PATH=/usr/lib/x86_64-linux-gnu/pkgconfig
      git clone https://github.com/petere/pguri.git
      cd pguri
      make PG_CONFIG=/usr/bin/pg_config
      sudo make PG_CONFIG=/usr/bin/pg_config install
Set up a virtual environment
In the working directory of this repository (usually the 'outer' interlex folder)
- run `pipenv install` (may complain about version issues)
- run `pipenv shell`

NOTE: if you have a local development version of pyontutils installed in `PYTHONPATH`, pipenv will point to those files for the source; HOWEVER, the command line scripts installed by pipenv will point to the venv, which could cause major confusion. The simplest solution is to run the interlex and interlex-* commands only inside the venv and any pyontutils commands outside the venv.
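A quick way to spot this kind of venv confusion is to check where each command line entry point actually resolves, both inside and outside `pipenv shell`. A minimal stdlib sketch (the command names checked here are just examples):

```python
# Minimal sketch: print where each command-line entry point resolves,
# to spot venv vs system confusion. Run it both inside and outside
# `pipenv shell` and compare the paths.
import shutil
import sys

for cmd in ('interlex', 'ontutils'):
    path = shutil.which(cmd)
    print(f'{cmd} -> {path if path else "not found"}')

# the interpreter itself is the most reliable tell
print(f'python -> {sys.executable}')
```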
Set up the database
- Set up postgres for your operating system.
- From the working directory run `interlex-dbsetup $PORT $DBNAME`.
  NOTE: if you change `$DBNAME` from `interlex_test`, make sure you also change it in the python code, since this has not been abstracted yet.
- You will need to set the passwords for `interlex-admin` and `interlex-user` manually and then run `interlex-dbsetup $PORT $DBNAME` again. To accomplish this run the following.

      su postgres
      psql
      \password interlex-admin
      \password interlex-user

- Add the passwords for `interlex-admin` and `interlex-user` to `~/.pgpass`.
- Run `interlex dbsetup` to add an initial 'authenticated' user.
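For reference, `~/.pgpass` entries are colon separated, one per line, in the form `hostname:port:database:username:password`. A hypothetical helper (not part of InterLex, and ignoring pgpass escaping rules for brevity) that parses entries in that format:

```python
# Hypothetical helper (not part of InterLex): parse ~/.pgpass-style lines
# of the form hostname:port:database:username:password.
# Ignores pgpass escaping rules (\: and \\) for brevity.
def parse_pgpass(text):
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blanks and comments
        host, port, dbname, user, password = line.split(':', 4)
        entries.append({'host': host, 'port': port, 'dbname': dbname,
                        'user': user, 'password': password})
    return entries

example = 'localhost:5432:interlex_test:interlex-admin:secret'
print(parse_pgpass(example)[0]['user'])  # interlex-admin
```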
Set up the message broker
- Install `rabbitmq` and add it and `epmd` to the default services.
- See `interlex-mqsetup`.
Bootstrapping a development environment on Gentoo
It is possible to run InterLex as a service as the `interlex` user and use the
`PYTHONPATH` option in /etc/conf.d/interlex (open as root) to use the development
codebase to run the daemon. There are a few steps required to get everything
working smoothly. ONLY DO THIS IF YOU TRUST ${DEV_USER}.

    # become interlex (probably via root)
    su interlex
    cd
    # point interlex to the dev packages
    # note that ~${DEV_USER} will not expand
    export PYTHONPATH=~DEV_USER/git/pyontutils:~DEV_USER/git/ontquery:~DEV_USER/git/interlex:~DEV_USER/git/pyontutils/ttlser:~DEV_USER/git/pyontutils/htmlfn:
    # set the location of devconfig.yaml
    export PYONTUTILS_DEVCONFIG=~/devconfig.yaml
    # create a new devconfig for the interlex user
    /home/${DEV_USER}/bin/ontutils devconfig --write
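The comment above warns that `~${DEV_USER}` will not expand in the shell. The same caveat applies if such paths are handled in Python: `os.path.expanduser` only resolves a literal `~username`, not one built from an unexpanded variable.

```python
# os.path.expanduser resolves a literal ~ (and ~username if the user
# exists), but leaves a path containing an unexpanded shell variable
# untouched, so paths built from ~${DEV_USER} must be expanded manually.
import os.path

print(os.path.expanduser('~'))           # the current user's home
print(os.path.expanduser('~$DEV_USER'))  # unchanged: ~$DEV_USER
```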
Run tests
- From the working directory run `python -m unittest test/test_constraints.py` (this is also run by `interlex-dbsetup`).
- From the working directory run `pytest` (you may need to install it first, e.g. via `pipenv install pytest`). NOTE: this will run stress tests.
Load content
Sync with mysql
At this point you should be able to synchronize the database with the existing mysql interlex installation. WARNING: there is a bug in the current loading process, and the loaded records do not match those generated by the alt server via MysqlExport.
- Make sure you create a `~/.mypass` file that conforms to the syntax of `~/.pgpass`, i.e. each line should look like `server.url.org:port:dbname:user:password`.
- If you do not have direct access to the mysql database servers you may need to set up ssh forwarding, in which case you should add the hostname of your devbox to `config.dev_remote_hosts` and forward to port `33060` to make use of core.py.
- Inside the venv run `interlex sync`.
- Once you drop into the IPython embed shell, run `self.load()` and the load should commence. NOTE: there is no user auth at the moment, so the code pretends to be `tgbugs`.
Start the uri server
For development
run `interlex uri` in the venv.

For production
run `interlex-uri` in the venv.
WARNING: if you run in this way you will not be able to use `embed` to debug and you will get strange errors.
Load ontologies
If you are running interlex via `interlex-uri`, replace the `-o` in these commands with `-c`.

- In the venv run `interlex post resource http://purl.obolibrary.org/obo/bfo.owl -o -u $YOURUSERNAME`.
- Repeat for as many ontologies as you want, for example `http://ontology.neuinfo.org/NIF/ttl/nif.ttl`. NOTE: currently this does not pull in the transitive closure.
Load curies
- In the venv run `interlex post curies -o base` and then `interlex post curies -o $YOURUSERNAME`.
Performance notes
On orpheus the primary bottleneck seems to be the number of gunicorn workers.
Total failures to respond within 5 seconds occur when 8 workers are confronted
with requests at 50 Hz full blast. What is very strange is that the same set of
failures shows up for every worker on output, so I think something is funky with
how errors are getting passed back out. A different set fail when looking at the
printout. HyperThreading doesn't seem to help here. Load seems split evenly
between the guni workers and postgres. Failures seem to happen in bursts at
higher guni worker counts.
| workers | avg failure % | cpu % sat all cores | effective rate Hz |
|---------|---------------|---------------------|-------------------|
| 2       | 50            | 25                  | 10                |
| 4       | 4             | 60                  | 16                |
| 4       | 9             | 60                  | 15                |
| 5       | 5             | 80                  | 18                |
| 8       | 4.5           | 100                 | 19                |
| 8       | 4             | 100                 | 19.5              |
Checking the logs, the ~20 Hz over 8 workers is indeed translating to about 160 requests per second, which still seems really low; I should be able to generate way more than 20 requests per second per worker.
urlblaster is a … bad piece of code.

    for id in {0100000..0120000}; do echo -e $id; done | xargs -P 50 -r -n 1 curl -s "http://localhost:8606/base/ilx_${id}" > /dev/null

hits nearly 800 rps of 404s, and

    for id in {0100000..0101000}; do echo -e "http://localhost:8606/base/ilx_${id}"; done | xargs -L 1 -P 100 curl -s > /dev/null

hits 180 rps running guni and db on the same server with 8 workers (when requesting from a machine other than the server), and hits 140 rps running guni and db on the same server with 4 workers.
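The one-liners above can also be sketched as a self-contained stdlib load test. This version (hypothetical, not part of InterLex) spins up a throwaway local server so it runs anywhere; in practice `base_url` would point at the running uri server, e.g. `http://localhost:8606/base/ilx_`.

```python
# Hypothetical "urlblaster"-style load test using only the stdlib.
# A throwaway local HTTP server makes the sketch self-contained;
# in practice base_url would point at the running InterLex uri server.
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.error import HTTPError
from urllib.request import urlopen

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(404)  # mimic the 404s from the id scan
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the benchmark quiet

server = ThreadingHTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f'http://127.0.0.1:{server.server_port}/base/ilx_'

def hit(ilx_id):
    try:
        urlopen(base_url + ilx_id, timeout=5).read()
        return True
    except HTTPError:
        return False  # a 404 still counts as a response

ids = [f'{i:07d}' for i in range(100000, 100200)]  # ilx_0100000 ...
start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(hit, ids))
rps = len(ids) / (time.time() - start)
print(f'{len(ids)} requests, {rps:.0f} rps')
server.shutdown()
```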
tornado seems pretty fast for 8 as well? who knows

Measuring with `time` from both the server and a remote shows that we are hitting between 100 and 140 rps.

Who knows, maybe a materialized memory view would help for some of this, though somehow I think the issue is probably in the python.
- pypy3 with the sync worker has roughly the same performance; gevent is monstrously slow
- gthread is about 20 rps slower than sync (1 s over 1k requests); sync can get up to ~150 rps; don't forget the cold boot effect on the first run, which adds a second to everything
- eventlet is about ~12 rps slower than sync (all for 8 workers; 4 workers is ~25 rps slower for sync; 6 workers for sync seems to get fairly close to 8-worker performance, and the total cpu usage is fairly close as well)
- tornado with 6 workers seems to push the limits and is a bit faster than sync at ~155 rps; taking it to 8 shows a slowdown to ~145 rps; 4 workers drops it to 133 rps; 5 hits 150 rps; so it seems that tornado with 6 is about the best for pypy3
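The comparisons above, restated as data so the best combination falls out programmatically (the rps figures are the approximate numbers from these notes, with the sync-relative deltas applied):

```python
# Approximate pypy3 gunicorn rps figures from the notes, organized so the
# best (worker_class, workers) combination can be picked programmatically.
measurements = {
    ('sync', 8): 150,      # "sync can get up to ~150rps"
    ('sync', 4): 125,      # ~25 rps slower than sync at 8
    ('gthread', 8): 130,   # ~20 rps slower than sync
    ('eventlet', 8): 138,  # ~12 rps slower than sync
    ('tornado', 4): 133,
    ('tornado', 5): 150,
    ('tornado', 6): 155,
    ('tornado', 8): 145,
}

best = max(measurements, key=measurements.get)
print(best, measurements[best])  # ('tornado', 6) 155
```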
pypy3 is clearly faster with tornado than anything running 3.6; a bonus is that
rdflib will be way faster too if we can get the memory leak during serialization
worked out. It is now way faster since fixing the "turns out that allocating
hundreds of thousands of empty lists just looks like a memory leak" bug. pypy3
is also about 4x faster when dumping nt straight from the database, peaking at
about 80 MBps to disk on the same computer while python3.6 hits ~20 MBps.
Most of the pypy3 numbers are tainted by the fact that they were tested from the server remotely. There seems to be some cycling in the cpu usage, not sure why, but tornado at 8 seems like the best setup; eventlet might be ok too; more systematic testing would be needed.
Turning `--log-level` to critical gives maybe an extra second over 1000 requests.
Tested bjoern but got issues with hung processes, and there is still quite high cpu usage. The best approach seems like it will be to cache things, since the issue is likely that we are hitting python code to retrieve mostly static content anyway.
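The caching idea above could be as simple as memoizing the render path for a given id. A minimal sketch with `functools.lru_cache`; `render_term_page` is a hypothetical stand-in for the expensive database/render work, not an actual InterLex function:

```python
# Minimal caching sketch. render_term_page is a hypothetical stand-in
# for the expensive database/render path behind /base/ilx_<id>.
from functools import lru_cache

def render_term_page(ilx_id):
    return f'<html><body>ilx_{ilx_id}</body></html>'

@lru_cache(maxsize=100_000)
def cached_term_page(ilx_id):
    # repeated requests for the same id skip the python/db work entirely
    return render_term_page(ilx_id)

print(cached_term_page('0100000'))
print(cached_term_page.cache_info().hits)  # 0 on the first call
cached_term_page('0100000')
print(cached_term_page.cache_info().hits)  # 1 after the repeat
```

The trade-off is staleness: any edit to a term would need to invalidate the entry (`cached_term_page.cache_clear()` is the blunt instrument).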