Commit dc1616e7 authored by Tomasz Walkowiak

Initial commit

# Directory contents
## Building the containers
### Dockerfiles
- base.Dockerfile - base image
- gtpprmc.Dockerfile - image with the gtpprmc tool, based on the base image,
  compiled from source
- gtpprmc-precomp.Dockerfile - image with the gtpprmc tool, based on the base
  image, installed from a precompiled package
- kwazon.Dockerfile - image with the kwazon tool, based on the gtpprmc image,
  used by docker-compose
### Helper scripts
- build.sh - script that simplifies building the images above
- install_resources.sh - script that simplifies installing large resources:
  - downloads them
  - installs them in the target directory
  - updates the configuration file so that it points at the installed resources
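
As a rough illustration of the configuration step, the sketch below mirrors in Python what install_resources.sh does with `sed`: every row of the `.install` file (`prop_name file_name url`) rewrites the matching `prop_name = ...` line of the config so that it points at the installed file. The function names and the sample values in the usage note are hypothetical, and the download itself is left out.

```python
import re


def parse_install_rows(text):
    """Parse whitespace-separated rows of the form: prop_name file_name url."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:  # skip blank or malformed rows
            rows.append(tuple(parts))
    return rows


def point_config_at_resources(config_text, rows, dest_dir):
    """Rewrite 'prop_name = ...' lines so they point at dest_dir/file_name."""
    if dest_dir.endswith("/"):
        dest_dir = dest_dir[:-1]  # the script also strips one trailing slash
    for prop_name, file_name, _url in rows:
        pattern = re.compile(
            r"^" + re.escape(prop_name) + r" *= *.*$", re.MULTILINE
        )
        new_line = "{} = {}/{}".format(prop_name, dest_dir, file_name)
        # A function replacement avoids backslash handling in new_line.
        config_text = pattern.sub(lambda match, line=new_line: line, config_text)
    return config_text
```

For example, a row `graph_file_path g.graphml http://example.invalid/g` with destination `/data/` would turn `graph_file_path = old` into `graph_file_path = /data/g.graphml` and leave all other lines untouched.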
### Resources for the scripts
- resources-0.9.install - file with the URLs of the resources installed while
  building kwazon (used by install_resources.sh)
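## Configuration scripts for the kwazon container
- init.sh - starts the tools inside the container
- clean.sh - cleanup script (installed as an hourly cron job in the base image);
  removes entries older than 200 minutes from the request directories under /samba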
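
The clean.sh included in this commit relies on `find ... -mmin +200 -exec rm -Rf` to drop stale entries. A minimal Python sketch of the same idea is shown below; it handles a single directory only, whereas the script walks glob patterns such as `/samba/requests/*/*`, and its age check is only approximately equal to the whole-minute rounding that `find` applies. The function name and its parameters are hypothetical.

```python
import os
import shutil
import time


def remove_stale_entries(parent_dir, max_age_minutes=200, now=None):
    """Delete direct children of parent_dir last modified before the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    removed = []
    for name in sorted(os.listdir(parent_dir)):
        path = os.path.join(parent_dir, name)
        if os.lstat(path).st_mtime >= cutoff:
            continue  # modified recently enough, keep it
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # directories are removed recursively
        else:
            os.remove(path)
        removed.append(name)
    return removed
```

The function returns the names it removed, which makes it easy to log or to dry-run against a temporary directory before pointing it at real data.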
## Service files
- kwazon_worker.py - worker for kwazon
- config.ini - configuration for the worker
# Building the kwazon tool
## Quick start
1. Build the dependencies
```
./build.sh base
./build.sh gtpprmc-precomp
```
This reuses the intermediate images cached by the previous build.
To rebuild everything from scratch, pass `--no-cache`:
```
./build.sh base --no-cache
./build.sh gtpprmc-precomp --no-cache
```
To compile from source, use *gtpprmc* instead of *gtpprmc-precomp*:
```
./build.sh gtpprmc --no-cache
```
2. Build the tool
```
cd ..
docker-compose build kwazon
```
3. Start the tool
```
docker-compose up -d kwazon
```
## Building the image
Building the image for the kwazon tool is split into three stages.
Four Dockerfiles are provided: one for the base image, two variants for the
custom build of graph-tool (precompiled package or full compilation from
source), and one for the kwazon tool. Each image is built on top of the
previous one.

The build.sh script simplifies building the images.

Build the given image and all successive ones (here starting from the first):

    ./build.sh base --all

Build a single, selected image:

    ./build.sh kwazon
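
To make the effect of `--all` concrete, here is a small Python sketch of the stage selection as it can be read from the build.sh in this commit; the function name is hypothetical. Note that both gtpprmc variants are tagged `azon/kwazon-gtpprmc` by the script, so when `--all` starts at `base` the precompiled variant is built second and ends up holding that tag.

```python
# Stage order as checked, top to bottom, by build.sh.
STAGES = ["base", "gtpprmc", "gtpprmc-precomp", "kwazon"]


def stages_to_build(stage, build_all=False):
    """Return the stages build.sh would build for the given arguments."""
    if stage not in STAGES:
        return []  # the script prints "Nothing was done."
    if not build_all:
        return [stage]
    # With --all, every later stage in the list is built as well.
    return STAGES[STAGES.index(stage):]
```

For instance, `stages_to_build("gtpprmc-precomp", build_all=True)` yields the precompiled graph-tool stage followed by `kwazon`, which matches building the dependencies and the final image in one call.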
### Build stages (arguments accepted by build.sh):
- base
- gtpprmc - variant compiled from source
- gtpprmc-precomp - variant installed from the precompiled package
- kwazon
### The images are built with the following tags:
- base_image: "azon/kwazon-base"
- gtpprmc_image: "azon/kwazon-gtpprmc"
- complete_image: "azon/kwazon"
### Resources required by each kwazon version
- The file 'resources-0.9.install' lists the resources required by version 0.9.
- The file 'resources-0.8.install' lists the resources required by version 0.8.
- The file 'resources-without-vecs.install' lists the resources required by version 0.6.
- The file 'resources.install' lists the resources required by version 0.7.
FROM ubuntu:16.04
# standard Ubuntu packages
RUN apt-get -y update && \
    apt-get install -y apt-utils && \
    apt-get install -y iputils-ping && \
    apt-get install -y iputils-tracepath && \
    apt-get install -y cmake && \
    apt-get install -y build-essential && \
    apt-get install -y git && \
    apt-get install -y subversion && \
    apt-get install -y libboost-all-dev && \
    apt-get install -y swig && \
    apt-get install -y python-dev && \
    apt-get install -y wget && \
    apt-get install -y software-properties-common python-software-properties && \
    apt-get install -y nano mc zip unzip && \
    apt-get install -y locales locales-all && \
    apt-get install -y apt-transport-https
RUN locale-gen pl_PL.UTF-8
ENV LANG='pl_PL.UTF-8' LANGUAGE='pl_PL:pl' LC_ALL='pl_PL.UTF-8'
RUN apt-get update && apt-get install -y --no-install-recommends \
    libicu-dev \
    libxml++2.6-dev \
    bison \
    flex \
    libloki-dev \
    libcppunit-dev \
    libantlr-dev \
    default-jdk \
    build-essential \
    autotools-dev \
    python \
    python-setuptools \
    python-stdeb \
    python-pip \
    python-all-dev \
    python-pyparsing \
    devscripts \
    acl \
    antlr \
    build-essential \
    libssl-dev \
    libffi-dev
RUN pip2 install wheel
RUN pip install pyinstaller
RUN apt-get install -y libgmp3-dev \
    libcgal-dev \
    python-numpy \
    libcairomm-1.0-dev \
    python-cairo-dev \
    libsparsehash-dev
# Install NLPWorkers structure
RUN mkdir /samba
RUN mkdir /samba/requests
RUN mkdir /samba/requests/dir
RUN mkdir /samba/requests/div
RUN mkdir /samba/requests/kwazon
RUN mkdir /home/work
RUN mkdir /home/work/models
RUN mkdir /home/work/nlpworkers
# Python libraries for the workers
RUN apt-get -y update && \
    apt-get install -y python3-pip && \
    pip3 install --upgrade pip && \
    pip2 install --upgrade pip
RUN pip2 install pika==0.10.0
RUN pip3 install pika==0.10.0
##CORPUS2
WORKDIR /home/install
RUN git clone http://nlp.pwr.wroc.pl/corpus2.git && \
    mkdir corpus2/bin && \
    cd corpus2/bin && \
    cmake .. && \
    make -j && \
    make install && \
    ldconfig
RUN apt-get remove -y python-pip
RUN apt-get install -y libxml2-utils
#WOSEDON
WORKDIR /home/install
# RUN rm -r wosedon_pub
RUN git clone http://nlp.pwr.edu.pl/wosedon_pub.git && \
    cd wosedon_pub/wosedon_current/tools/PLWNGraphBuilder && \
    python setup.py install && \
    cd ../../../wosedon_current && \
    python setup.py install
#CCLUTILS
RUN pip2 install --extra-index-url https://pypi.clarin-pl.eu/ corpus_ccl
#BASICUTILS
RUN pip2 install --extra-index-url https://pypi.clarin-pl.eu/ basicutils
RUN pip2 install requests
RUN pip3 install requests
#cleaning
ADD clean.sh /etc/cron.hourly/clean.sh
RUN ["chmod", "+x", "/etc/cron.hourly/clean.sh"]
ADD init.sh /init.sh
RUN ["chmod", "+x", "/init.sh"]
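# Define the default entrypoint. A CMD of ["/init.sh"] next to this ENTRYPOINT
# would be passed to /init.sh as an argument, so only ENTRYPOINT is set.
ENTRYPOINT ["/init.sh"]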
#!/bin/bash
USAGE="USAGE: $0 STAGE_IMAGE [--all] [--no-cache]
STAGE_IMAGE == {base|gtpprmc|gtpprmc-precomp|kwazon}
Use '--all' to build all SUCCESSIVE images."
base_image="azon/kwazon-base"
gtpprmc_image="azon/kwazon-gtpprmc"
complete_image="azon/kwazon"
dockerfile_name=""
image_name=""
build_to_end=""
next_image=""
no_cache=""
if [[ -z "$1" ]] || [[ $1 =~ "-h" ]];then
    echo -e "$USAGE"
    exit 1
fi

# '--all' may be given as the second or the third argument
if [[ ! -z "$2" ]] && [[ $2 =~ ^--all$ ]];then
    echo "Image for stage $1 and all successive will be built ..."
    build_to_end='y'
elif [[ ! -z "$3" ]] && [[ $3 =~ ^--all$ ]];then
    echo "Image for stage $1 and all successive will be built ..."
    build_to_end='y'
fi

# '--no-cache' may be given as the second or the third argument
if [[ ! -z "$2" ]] && [[ $2 =~ ^--no-cache$ ]];then
    echo "Image for stage $1 will be built without using cache ..."
    no_cache='--no-cache'
elif [[ ! -z "$3" ]] && [[ $3 =~ ^--no-cache$ ]];then
    echo "Image for stage $1 will be built without using cache ..."
    no_cache='--no-cache'
fi

stage_name="$1"
echo "stage_name : $stage_name"

if [[ "$stage_name" == "base" ]];then
    dockerfile_name="base.Dockerfile"
    image_name="$base_image"
    echo "Calling "'docker build . '"$no_cache"' -t='"$image_name"' -f '"$dockerfile_name"
    if [[ -z "$no_cache" ]];then
        docker build . -t="$image_name" -f "$dockerfile_name"
    else
        docker build . "$no_cache" -t="$image_name" -f "$dockerfile_name"
    fi
    if [ "$build_to_end" ];then
        next_image="y"
    fi
fi

if [[ "$stage_name" == "gtpprmc" ]] || [[ ! -z "$next_image" ]];then
    dockerfile_name="gtpprmc.Dockerfile"
    image_name="$gtpprmc_image"
    echo "Calling "'docker build . '"$no_cache"' -t='"$image_name"' -f '"$dockerfile_name"
    if [[ -z "$no_cache" ]];then
        docker build . -t="$image_name" -f "$dockerfile_name"
    else
        docker build . "$no_cache" -t="$image_name" -f "$dockerfile_name"
    fi
    if [ "$build_to_end" ];then
        next_image="y"
    fi
fi

if [[ "$stage_name" == "gtpprmc-precomp" ]] || [[ ! -z "$next_image" ]];then
    dockerfile_name="gtpprmc-precomp.Dockerfile"
    image_name="$gtpprmc_image"
    echo "Calling "'docker build . '"$no_cache"' -t='"$image_name"' -f '"$dockerfile_name"
    if [[ -z "$no_cache" ]];then
        docker build . -t="$image_name" -f "$dockerfile_name"
    else
        docker build . "$no_cache" -t="$image_name" -f "$dockerfile_name"
    fi
    if [ "$build_to_end" ];then
        next_image="y"
    fi
fi

if [[ "$stage_name" == "kwazon" ]] || [[ ! -z "$next_image" ]];then
    dockerfile_name="kwazon.Dockerfile"
    image_name="$complete_image"
    echo "Calling "'docker build . '"$no_cache"' -t='"$image_name"' -f '"$dockerfile_name"
    if [[ -z "$no_cache" ]];then
        docker build . -t="$image_name" -f "$dockerfile_name"
    else
        docker build . "$no_cache" -t="$image_name" -f "$dockerfile_name"
    fi
fi

if [[ -z "$image_name" ]];then
    echo "Nothing was done."
fi
#!/bin/sh
#find /samba/requests/*/*/* -mmin +200 -exec rm -Rf -- {} \;
find /samba/requests/*/* -mmin +200 -exec rm -Rf -- {} \;
find /samba/users/default/* -mmin +200 -exec rm -Rf -- {} \;
[service]
tool = kwazon
#root = /requests/
root = /samba/requests/
rabbit_host = rabbit.clarin.ws
rabbit_user = clarin
rabbit_password = clarin123
[tool]
workers_number = 4
kwazon_cfg_file = /home/install/kwazon/keyword_assignment_tool/config/config.ini
#workers_number = 4
[logging]
port = 9996
local_log_level = INFO
[logging_levels]
__main__ = INFO
kwazon_worker = INFO
version: '3.7'
services:
  kwazon:
    container_name: clarin_kwazon
    build:
      context: ./
      dockerfile: kwazon.Dockerfile
    working_dir: /home/work/nlpworkers/kwazon-worker
    entrypoint:
      - /usr/bin/python
      - kwazon_worker.py
    volumes:
      - .samba:/samba
      - ./config.ini:/home/work/nlpworkers/kwazon-worker/config.ini
      - ./kwazon_worker.py:/home/work/nlpworkers/kwazon-worker/kwazon_worker.py
FROM azon/kwazon-base
RUN apt-get install -y autotools-dev && apt-get install -y automake && apt-get install -y libboost-all-dev
#GRAPH-TOOL pprmc
WORKDIR /home/install
# Install a precompiled graph-tool library
RUN git clone -b current https://gitlab.clarin-pl.eu/team-semantics/gtpprmc.git && \
    cd gtpprmc && unzip graph_tool.zip && \
    mv graph_tool /usr/lib/python2.7/dist-packages/ && ldconfig
FROM azon/kwazon-base
RUN apt-get install -y autotools-dev && apt-get install -y automake && apt-get install -y libboost-all-dev
#GRAPH-TOOL pprmc
WORKDIR /home/install
# RUN svn co http://svn.clarin-pl.eu/svn/gtpprmc/branches/devel && \
# cd devel && \
# ./configure && \
# make -j && make install && \
# ldconfig
# Compile graph-tool from source (takes some time...)
RUN git clone -b current https://gitlab.clarin-pl.eu/team-semantics/gtpprmc.git && \
    cd gtpprmc && \
    ./configure && \
    make -j && make install && \
    ldconfig
#!/bin/sh
cd /home/work/nlpworkers/kwazon
nohup ./kwazon_worker.py </dev/null >/dev/null 2>&1 &
#!/bin/bash
USAGE="Script installs resources in package directory and modifies properties paths in config.ini.
USAGE: $0 INSTALLATION_PATHS DEST_DIR CONFIG_FILE_PATH
where INSTALLATION_PATHS row: prop_name dest_file_name url"

installation_paths=""
dest_dir=""

if [[ -z "$1" ]] || [[ -z "$2" ]] || [[ -z "$3" ]];then
    echo -e "$USAGE"
    exit 1
fi

installation_paths="$1"
# strip a trailing slash from the destination directory
dest_dir=$(echo "$2" | sed 's/\/$//g')
config_path="$3"

echo "installation_paths: $installation_paths"
echo "dest_dir: $dest_dir"
echo "config_path: $config_path"

# each row: property name in config.ini, destination file name, download URL
while read -r prop_name file_name url; do
    echo "Installing file $file_name into $dest_dir ..."
    dest_path="$dest_dir/$file_name"
    echo "wget $url -O $dest_path"
    if ! wget "$url" -O "$dest_path";then
        echo "Cannot download resource $file_name from $url or store it in $dest_path"
        exit 1
    fi
    # point the property in the config file at the downloaded resource
    sed -i 's|^'"$prop_name"' *= *.*$|'"$prop_name = $dest_path"'|' "$config_path"
done <"$installation_paths"
FROM azon/kwazon-gtpprmc
WORKDIR /home/install
RUN pip install gensim urllib3==1.23 sklearn
# fetch from git
RUN git clone -b current https://gitlab.clarin-pl.eu/team-semantics/kwazon.git && \
    cd kwazon && \
    python setup.py install
ADD install_resources.sh /home/install/install_resources.sh
ADD resources-0.9.install /home/install/resources.install
RUN ["chmod", "+x", "/home/install/install_resources.sh"]
WORKDIR /home/install
RUN ["./install_resources.sh", "/home/install/resources.install", "/home/install/kwazon/keyword_assignment_tool/data", "/home/install/kwazon/keyword_assignment_tool/config/config.ini"]
WORKDIR /home/install
## NLP_WS
RUN svn co http://svn.clarin-pl.eu/svn/nlpservices/src/nlp_ws && \
    pip install -e nlp_ws
# FIX for problems with config.ini:
# find the latest installed kwazon version and copy the patched config into its egg
RUN KWAZON_VERSION=$(echo "$(find "/usr/local/lib/python2.7/dist-packages" -maxdepth 1 -type d -name "kwazon-*" | sed -r -e 's/.*kwazon-(([0-9]\.).+)-py.*/\1/' | sort | tail -n 1)") && \
    cp "/home/install/kwazon/keyword_assignment_tool/config/config.ini" "/usr/local/lib/python2.7/dist-packages/kwazon-""$KWAZON_VERSION""-py2.7.egg/keyword_assignment_tool/config/config.ini"
# RUN ["cp","/home/install/kwazon/keyword_assignment_tool/config/config.ini","/usr/local/lib/python2.7/dist-packages/kwazon-0.8.1-py2.7.egg/keyword_assignment_tool/config/config.ini"]
#FIX for problems with 'tok_in_sent_index'
#CCLUTILS
RUN pip uninstall -y corpus_ccl
RUN pip install --force-reinstall --extra-index-url https://pypi.clarin-pl.eu/ corpus_ccl==0.93
#BASICUTILS
RUN pip uninstall -y basicutils
RUN pip install --force-reinstall --extra-index-url https://pypi.clarin-pl.eu/ basicutils
#!/usr/bin/python
# -*- coding: utf-8 -*-
import nlp_ws
import shutil, os
import subprocess
from keyword_assignment_tool import kwazon_plugin
import logging
import corpus2
from contextlib import contextmanager
from corpus2 import read_chunks_from_utf8_string as read_chunks_from_string

_log = logging.getLogger(__name__)


@contextmanager
def str_to_doc(string, tagset='nkjp', iformat='ccl'):
    # KwazonWorker is looked up at call time, so it may be defined below.
    string = str(KwazonWorker._deunicodify(string))
    # tagset = corpus2.get_named_tagset(tagset)
    doc = corpus2.Document()
    for chunk in read_chunks_from_string(string, tagset, iformat):
        doc.add_paragraph(chunk)
    yield doc


class KwazonWorker(nlp_ws.NLPWorker):
    @classmethod
    def static_init(cls, config):
        _log.info("Worker started loading models %s", "AS")
        cls.configtool = config['tool']
        cls.model = kwazon_plugin.KwazonPlugin(load_complete=True)
        #,cfg_file_path=config['tool']['kwazon_cfg_file']
        # cls.model = kwazon_plugin.KwazonPlugin()
        _log.info("Worker finished loading models ")

    # def init(self):
    #     self.model = wosedon_plugin.WoSeDonPlugin(
    #         self.configtool['elinker_cfg_file'])

    def process(self, inputFile, taskOptions, outputFile):
        # The input is either a directory containing text.ccl or a single file.
        if os.path.isdir(inputFile):
            shutil.copytree(inputFile, outputFile)
            ifn = inputFile + "/text.ccl"
        else:
            os.makedirs(outputFile)
            ifn = inputFile
            shutil.copy2(inputFile, outputFile + "/text.ccl")
        ofn = outputFile + "/kwazon.json"
        # _validate_xml(inputFile)
        # self.model.run_kwazon(str(ifn), str(ofn))
        self.model.run_kwazon_prepared(str(ifn), str(ofn))

    @staticmethod
    def _deunicodify(string):
        # Python2, SWIG and Unicode, aka the Hateful Three
        try:
            return string.encode("UTF-8")
        except UnicodeEncodeError:
            return string


class _InvalidXMLInRequest(Exception):
    pass


def _validate_xml(xmlfile):
    """
    This will do nothing if XML is valid and raise exception if it's not.
    """
    lint_call = subprocess.Popen(
        ('xmllint', '--nonet', '--noout', xmlfile),
        stderr=subprocess.PIPE,
    )
    xml_err = lint_call.communicate()[1]
    if lint_call.returncode != 0:
        raise _InvalidXMLInRequest("Wrong XML in input data")


if __name__ == '__main__':
    nlp_ws.NLPService.main(KwazonWorker)
graph_file_path graph-2018-11-13-categories-broader-narrower-weighted-001-filtered-meaningless.graphml https://nextcloud.clarin-pl.eu/index.php/s/JmsdkpwOc8F27Wt/download
concepts_categories_mapping concept_category_index_all_sources_extended_reduced_at_least_one_v7.bin.zip https://nextcloud.clarin-pl.eu/index.php/s/EaFQ0M4C0WMSMD1/download
categories_vectors cats_keyed_vectors_v2.bin.zip https://nextcloud.clarin-pl.eu/index.php/s/6CsD9UGXawFTS9I/download
concepts_vectors concepts_keyed_vectors.bin.zip https://nextcloud.clarin-pl.eu/index.php/s/Hcnn2zq7ZEPU9sN/download