Commit 1a301cec authored by Anton Melser

Initial commit

# EditorConfig helps developers define and maintain consistent
# coding styles between different editors and IDEs
# http://editorconfig.org
root = true
[*]
indent_style = space
indent_size = 4
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true
# Created by https://www.gitignore.io
### OSX ###
.DS_Store
.AppleDouble
.LSOverride
# Icon must end with two \r
Icon
# Thumbnails
._*
# Files that might appear on external disk
.Spotlight-V100
.Trashes
# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.cache
nosetests.xml
coverage.xml
# Translations
*.mo
*.pot
# Sphinx documentation
docs/_build/
# PyBuilder
target/
### Django ###
*.log
*.pot
*.pyc
__pycache__/
local_settings.py
# Anton
prod_settings.py
.env
db.sqlite3
/static/
/tmp/
[submodule "transcrobes/anki-bundled"]
path = transcrobes/anki-bundled
url = https://github.com/dae/anki.git
# Contributor Covenant Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at "not.ok at transcrob.es". All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
# Transcrobes
Transcrobes is the central repository of the Transcrobes project (https://transcrob.es) and houses the central API for interacting with the system.
Applications with independent lifecycles (web extensions, media player plugins, the SRS system, etc.) live in other repositories in the project group (also called transcrobes).
As the nexus point for the project, it contains various cross-concern elements and serves to house issues and discussions that don't clearly belong to one of the other sub-projects, in addition to the central API.
Documentation
=============
See https://transcrob.es
Status
======
This is PRE-alpha software with KNOWN DATALOSS BUGS. It works, sorta, if you hold it right. There are not yet any tests and you should not attempt to use this project for anything like "production" until at least tests have been added.
Installation
============
Come back soon
Configuration
=============
Come back soon
Development
===========
If you are a (Python) developer learning Chinese, or you really want this to be compatible with learning other languages, then your help is needed!
## Developer Certificate of Origin
Please sign off all your commits by using `git commit -s`. In the context of this project this means that you are signing the document available at https://developercertificate.org/. This basically certifies that you have the right to make the contributions you are making, given this project's licence. You retain copyright over all your contributions - this sign-off does NOT give others any special rights over your copyrighted material in addition to the project's licence.
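A sign-off added by git appears as a `Signed-off-by:` trailer line in the commit message. As a rough sketch only (the helper name `has_signoff` and the sample messages are illustrative and not part of this project), a check for that trailer might look like this:

```python
import re

# Matches a trailer line such as "Signed-off-by: Jane Doe <jane@example.com>".
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_signoff(message):
    """Return True if the commit message contains a Signed-off-by trailer."""
    return SIGNOFF_RE.search(message) is not None


if __name__ == "__main__":
    signed = "Fix tokeniser\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
    print(has_signoff(signed))
    print(has_signoff("Fix tokeniser\n"))
```

This only checks the shape of the trailer; it does not verify who made the commit.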
## Contributing
See [the website](https://transcrob.es/page/contribute) for more information. Please also take a look at our [code of conduct](https://transcrob.es/page/code_of_conduct) (or CODE\_OF\_CONDUCT.md in this repo).
beautifulsoup4==4.6.3
bs4==0.0.1
certifi==2018.10.15
chardet==3.0.4
decorator==4.3.0
Django==2.1.3
idna==2.7
Markdown==3.0.1
pkg-resources==0.0.0
psycopg2-binary==2.7.6.1
PyAudio==0.2.11
pytz==2018.7
requests==2.20.1
Send2Trash==1.5.0
urllib3==1.24.1
Subproject commit b5785f7ec8b3f95f88ba63cc43f9ee7ce829241a
# -*- coding: utf-8 -*-
import logging
import requests
import json
import sys, os, io
from django.conf import settings
import anki.consts
# MUST be set before importing anki.sync or it gets overridden!
anki.consts.SYNC_BASE = settings.ANKROBES_ENDPOINT
from anki.utils import checksum
from anki.sync import RemoteServer
logger = logging.getLogger(__name__)
# TODO: this module really needs a lot of work - it's dumb, and needs to be really smart
class AnkrobesServer(RemoteServer):
    def __init__(self, username, password):
        anki.sync.RemoteServer.__init__(self, None, None)

    def _word_known(self, word):
        self.postVars = dict(
            k=self.hkey,
            s=self.skey,
        )
        bys = io.BytesIO(json.dumps(dict(word=word)).encode("utf8"))
        ret = self.req("word_known", bys, badAuthRaises=True).decode("utf8")
        logger.debug("The query ret is: {}".format(ret))
        return 1 if int(ret) else 0

    def is_known(self, token):
        # TODO: This method should take into account the POS but we'll need to implement that in Anki somehow
        # a tag is probably the logical solution
        return self.is_known_chars(token)

    def is_known_chars(self, token):
        return self._word_known(token['word'])

    def add_ankrobes_note(self, simplified, pinyin, meanings, tags):
        self.postVars = dict(
            k=self.hkey,
            s=self.skey,
        )
        note = dict(
            simplified=simplified.replace('"', "'"),
            pinyin=pinyin.replace('"', "'"),
            meanings=[x.replace('"', "'") for x in meanings],
            tags=tags
        )
        bys = io.BytesIO(json.dumps(note).encode("utf8"))
        ret = self.req("add_ankrobes_note", bys, badAuthRaises=True).decode("utf8")
        logger.debug("The add note ret is: {}".format(ret))
        return 1 if int(ret) else 0

    def get_word(self, word, deck_name='transcrobes'):
        self.postVars = dict(
            k=self.hkey,
            s=self.skey,
        )
        bys = io.BytesIO(json.dumps(dict(word=word)).encode("utf8"))
        ret = self.req("get_word", bys, badAuthRaises=True).decode("utf8")
        return json.loads(ret)

    def set_word_known(self, simplified, pinyin, meanings=[], tags=[], review_in=7):
        self.postVars = dict(
            k=self.hkey,
            s=self.skey,
        )
        # TODO: validate input data properly...
        if not (simplified and pinyin and meanings):
            raise Exception("Missing obligatory input data. Simplified, Pinyin and meanings fields are required")
        note = dict(
            simplified=simplified.replace('"', "'"),
            pinyin=pinyin.replace('"', "'"),
            meanings=[x.replace('"', "'") for x in meanings],
            tags=tags,
            review_in=review_in,
        )
        logger.debug("Sending to ankrobes to set_word_known data: {}".format(note))
        bys = io.BytesIO(json.dumps(note).encode("utf8"))
        ret = self.req("set_word_known", bys, badAuthRaises=True).decode("utf8")
        return json.loads(ret)
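The request bodies built in `add_ankrobes_note` and `set_word_known` follow the same pattern: replace double quotes with single quotes, serialise the note to JSON, and wrap the UTF-8 bytes in a `BytesIO`. The sketch below isolates that step so it can be tried without Anki or Django. The helper name `build_note_payload` and the sample values are hypothetical and not part of this commit.

```python
import io
import json


def build_note_payload(simplified, pinyin, meanings, tags, review_in=None):
    """Build the JSON request body for a note as an in-memory byte stream."""
    note = dict(
        simplified=simplified.replace('"', "'"),
        pinyin=pinyin.replace('"', "'"),
        meanings=[m.replace('"', "'") for m in meanings],
        tags=tags,
    )
    # review_in is only sent by set_word_known, so it is optional here.
    if review_in is not None:
        note["review_in"] = review_in
    return io.BytesIO(json.dumps(note).encode("utf8"))


if __name__ == "__main__":
    payload = build_note_payload("你好", "nǐ hǎo", ['"hello"'], ["demo"], review_in=7)
    print(json.loads(payload.getvalue().decode("utf8")))
```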
# -*- coding: utf-8 -*-
import os
import sys
# This way of doing things is modelled on the original ankisyncd
# there may well be a better way!
sys.path.insert(0, "/usr/share/anki")
sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(__file__)), "anki-bundled"))
default_app_config = 'enrich.apps.EnrichConfig'
# -*- coding: utf-8 -*-
from django.contrib import admin
# Register your models here.
# -*- coding: utf-8 -*-
from django.apps import AppConfig
from django.core.cache import cache
import logging
import re
logger = logging.getLogger(__name__)
class EnrichConfig(AppConfig):
    name = 'enrich'

    def ready(self):
        pass
# -*- coding: utf-8 -*-
import requests
import json
import sys
import re
import http.client, urllib.parse, uuid
import logging
import collections
import unicodedata
from django.conf import settings
from enrich.nlp.provider import OpenNLPProvider
from enrich.translate.translator import BingTranslator, CCCedictTranslator, Translator, ABCDictTranslator
from enrich.translate.translator import hsk_dict, subtlex
from ankrobes import AnkrobesServer
logger = logging.getLogger(__name__)
has_chinese_chars = re.compile('[\u4e00-\u9fa5]+')
def _get_transliteratable_sentence(tokens):
    t_sent = ""
    for t in tokens:
        w = t['originalText']
        t_sent += w if has_chinese_chars.match(w) else " {}".format(w)
    return t_sent
def _add_transliterations(sentence, transliterator):
    tokens = sentence['tokens']
    clean_text = _get_transliteratable_sentence(tokens)
    trans = transliterator.transliterate(clean_text)
    clean_trans = " "
    i = 0
    while i < len(trans):
        char_added = False
        if not unicodedata.category(trans[i]).startswith('L') or not unicodedata.category(trans[i-1]).startswith('L'):
            clean_trans += " "
        clean_trans += trans[i]
        i += 1
    clean_trans = " ".join(list(filter(None, clean_trans.split(' '))))
    deq = collections.deque(clean_trans.split(' '))
    for t in tokens:
        w = t['originalText']
        pinyin = []
        i = 0
        nc = ""
        while i < len(w):
            if unicodedata.category(w[i]) == 'Lo':  # it's a Chinese char
                pinyin.append(deq.popleft())
            else:
                if not nc:
                    nc = deq.popleft()
                if w[i] != nc[0]:
                    raise Exception("{} should equal {} for '{}' and tokens '{}' with original {}".format(
                        w[i], nc, clean_trans, tokens, clean_text))
                pinyin.append(w[i])
                if len(nc) > 1:
                    nc = nc[1:]
                else:
                    nc = ""
            i += 1
        t['pinyin'] = pinyin
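The first loop in `_add_transliterations` splits the transliterated text so that runs of letters stay together while every other character becomes a piece of its own. A standalone sketch of that splitting step follows. The function name `split_transliteration` is hypothetical, and unlike the original loop it treats the first character as having no predecessor instead of looking at the last character, which gives the same pieces once the empty strings are filtered out.

```python
import unicodedata


def split_transliteration(trans):
    """Split text into runs of letters and single non-letter characters."""
    spaced = ""
    for i, ch in enumerate(trans):
        is_letter = unicodedata.category(ch).startswith("L")
        prev_is_letter = i > 0 and unicodedata.category(trans[i - 1]).startswith("L")
        # Start a new piece unless we are continuing a run of letters.
        if not (is_letter and prev_is_letter):
            spaced += " "
        spaced += ch
    # Space characters end up as empty pieces and are dropped here.
    return [piece for piece in spaced.split(" ") if piece]


if __name__ == "__main__":
    print(split_transliteration("ni hao, world!"))
```

The resulting pieces are what the second loop consumes from the deque: one piece per Chinese character, and one character at a time for everything else.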
def _enrich_model(model):
    # TODO: the settings shouldn't be here, they should probably be done in the class
    server = AnkrobesServer(settings.ANKROBES_USERNAME, settings.ANKROBES_PASSWORD)
    server.hostKey(settings.ANKROBES_USERNAME, settings.ANKROBES_PASSWORD)
    online_translator = BingTranslator()
    cedict = CCCedictTranslator()
    abcdict = ABCDictTranslator()
    for s in model['sentences']:
        _add_transliterations(s, online_translator)
        logger.debug("Looking for tokens to translate in {}".format(s))
        clean_sentence_text = ""
        for t in s['tokens']:
            # logger.info("Here is my tokie: {}".format(t))
            w = t['word']
            # logger.debug("Starting to check: {}".format(w))
            if w.startswith('<') and w.endswith('>'):  # html
                logger.debug("Looks like '{}' only has html, not adding to translatables".format(w))
                continue
            # From here we keep the words for the cleaned sentence (e.g., for translation)
            clean_sentence_text += w
            if t['pos'] in ['PU', 'OD', 'CD', 'NT', 'URL']:
                logger.debug("'{}' has POS '{}' so not adding to translatables".format(w, t['pos']))
                continue
            # TODO: decide whether to continue removing if doesn't contain any Chinese chars?
            # Sometimes yes, sometimes no!
            if not has_chinese_chars.match(w):
                continue
            # From here we attempt translation and create ankrobes entries
            ank_entry = _sanitise_ankrobes_entry(server.get_word(w))
            t["ankrobes_entry"] = ank_entry
            t["definitions"] = {
                'best': online_translator.get_standardised_defs(t),
                'second': abcdict.get_standardised_defs(t),
                'third': cedict.get_standardised_defs(t),
                'fallback': online_translator.get_standardised_fallback_defs(t)
            }
            t["normalized_pos"] = Translator.NLP_POS_TO_SIMPLE_POS[t["pos"]]
            # TODO: decide whether we really don't want to make a best guess for words we know
            # this might still be very useful though probably not until we have a best-trans-in-context SMT system
            # that is good
            # logger.debug("my ank_entry is {}".format(ank_entry))
            if not ank_entry or not ank_entry[0]["Is_Known"]:  # FIXME: Just using the first for now
                # get the best guess for the definition of the word given the context of the sentence
                _set_best_guess(s, t)
            t["stats"] = {
                'hsk': hsk_dict[w] if w in hsk_dict else None,
                'freq': subtlex[w] if w in subtlex else None,
            }
        s['cleaned'] = clean_sentence_text
        s['translation'] = online_translator.translate(clean_sentence_text)
def _sanitise_ankrobes_entry(entries):
    # we don't need the HTML here - we'll put the proper html back in later
    for entry in entries:
        entry['Simplified'] = re.sub("(?:<[^>]+>)*", '', entry['Simplified'], flags=re.MULTILINE)
        entry['Pinyin'] = re.sub("(?:<[^>]+>)*", '', entry['Pinyin'], flags=re.MULTILINE)
        entry['Meaning'] = re.sub("(?:<[^>]+>)*", '', entry['Meaning'], flags=re.MULTILINE)
    return entries
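To illustrate the tag stripping done by `_sanitise_ankrobes_entry`, here is a standalone sketch that applies the same regular expression to a made-up entry. The helper name `strip_tags` and the sample field values are assumptions for illustration; real entries come from the sync server.

```python
import re

# Same pattern as in _sanitise_ankrobes_entry: zero or more adjacent tags.
TAG_RE = re.compile(r"(?:<[^>]+>)*")


def strip_tags(text):
    """Remove anything that looks like an HTML tag from text."""
    return TAG_RE.sub("", text)


if __name__ == "__main__":
    entry = {
        "Simplified": '<span class="hanzi">你好</span>',
        "Pinyin": "<b>nǐ hǎo</b>",
        "Meaning": "hello<br/>hi",
    }
    print({field: strip_tags(value) for field, value in entry.items()})
```

Note that tags are removed without inserting any separator, so text on either side of a `<br/>` is joined directly.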
def _set_best_guess(sentence, token):
    # TODO: do something intelligent here
    # ideally this will translate the sentence using some sort of statistical method but get the best
    # translation for each individual word of the sentence, not the whole sentence, giving us the
    # most appropriate definition to show to the user
    best_guess = None
    others = []
    all_defs = []
    for t in ["best", "second", "third", "fallback"]:  # force the order
        for def_pos, defs in token["definitions"][t].items():
            if not defs:
                continue
            all_defs += defs
            if Translator.NLP_POS_TO_SIMPLE_POS[token["pos"]] == def_pos or Translator.NLP_POS_TO_ABC_POS[token["pos"]] == def_pos:
                # get the most confident for the right POS
                sorted_defs = sorted(defs, key=lambda i: i['confidence'], reverse=True)
                best_guess = sorted_defs[0]
                break
            elif def_pos == "OTHER":
                others += defs
        if best_guess:
            break
    if not best_guess and len(others) > 0:
        # it's bad
        logger.debug("No best_guess found for '{}', using the best 'other' POS defs {}".format(token["word"], others))
        best_guess = sorted(others, key=lambda i: i['confidence'], reverse=True)[0]
    if not best_guess and len(all_defs) > 0:
        # it's really bad
        best_guess = sorted(all_defs, key=lambda i: i['confidence'], reverse=True)[0]
        logger.debug("No best_guess found with the correct POS or OTHER for '{}', using the highest confidence with the wrong POS all_defs {}".format(
            token["word"], all_defs))
    logger.debug("Setting best_guess for '{}' POS {} to best_guess {}".format(token["word"], token["pos"], best_guess))
    token["best_guess"] = best_guess  # .split(',')[0].split(';')[0]
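The fallback order in `_set_best_guess` is: the most confident definition whose part of speech matches, then the most confident `"OTHER"` definition, then the most confident definition of any kind. The sketch below reproduces that order on plain dictionaries. It is a simplification: the name `pick_best_guess`, the `text` key and the sample data are made up, and it compares against a single POS label instead of consulting the two `Translator` POS mappings.

```python
def pick_best_guess(definitions, pos, sources=("best", "second", "third", "fallback")):
    """Pick a definition: matching POS first, then "OTHER", then anything."""
    others = []
    all_defs = []
    for source in sources:  # sources are tried in this fixed order
        for def_pos, defs in definitions.get(source, {}).items():
            if not defs:
                continue
            all_defs += defs
            if def_pos == pos:
                # First source with a matching POS wins outright.
                return max(defs, key=lambda d: d["confidence"])
            if def_pos == "OTHER":
                others += defs
    for candidates in (others, all_defs):
        if candidates:
            return max(candidates, key=lambda d: d["confidence"])
    return None


if __name__ == "__main__":
    definitions = {
        "best": {"NOUN": [], "OTHER": [{"text": "to run", "confidence": 0.4}]},
        "second": {"VERB": [{"text": "run", "confidence": 0.6},
                            {"text": "jog", "confidence": 0.9}]},
    }
    print(pick_best_guess(definitions, "VERB"))
    print(pick_best_guess(definitions, "ADJ"))
```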
def enrich_to_json(html):
    # TODO: make this OOP with a factory method controlled from the settings
    model = OpenNLPProvider().parse(html)
    logger.debug("Attempting to enrich: '{}'".format(html))
    _enrich_model(model)
    return model
# Generated by Django 2.1.1 on 2018-09-10 06:46
from django.db import migrations, models
class Migration(migrations.Migration):

    initial = True

    dependencies = [
    ]

    operations = [
        migrations.CreateModel(
            name='BingRequest',
            fields=[
                ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
                ('source_text', models.CharField(max_length=200)),
                ('response_json', models.CharField(max_length=25000)),
            ],
        ),
    ]
# Generated by Django 2.1.1 on 2018-09-10 07:24
from django.db import migrations, models
class Migration(migrations.Migration):

    dependencies = [
        ('enrich', '0001_initial'),
    ]

    operations = [
        migrations.CreateModel(
            name='BingAPITranslation',
            fields=[
                ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
                ('source_text', models.CharField(max_length=200)),
                ('response_json', models.CharField(max_length=25000)),
            ],
            options={
                'abstract': False,
            },
        ),
        migrations.RenameModel(
            old_name='BingRequest',
            new_name='BingAPILookup',
        ),
    ]
# Generated by Django 2.1.1 on 2018-09-21 06:00
from django.db import migrations, models