License Apache 2.0 Python 3.6

Automatic spelling correction component

The automatic spelling correction component is based on "An Improved Error Model for Noisy Channel Spelling Correction" by Eric Brill and Robert C. Moore. It uses a statistics-based error model, a static dictionary and an ARPA language model to correct spelling errors.
We provide everything you need to build a spelling correction module for Russian and English, along with some hints on how to collect appropriate datasets for other languages.
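
To make the noisy channel idea concrete, here is a minimal sketch, not the component's actual implementation. It picks the dictionary word w that maximizes P(observed | w) * P(w), where the first factor plays the role of the error model and the second the role of the language model. All probabilities and the names `LANGUAGE_MODEL`, `ERROR_MODEL` and `correct` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy numbers, chosen only for illustration.
# LANGUAGE_MODEL[w] stands in for P(w), the prior probability of a dictionary word.
LANGUAGE_MODEL = {"the": 0.05, "then": 0.01, "ten": 0.002}

# ERROR_MODEL[(w, s)] stands in for P(s | w): the chance that word w is typed as s.
ERROR_MODEL = {
    ("the", "teh"): 0.02,
    ("then", "teh"): 0.001,
    ("ten", "teh"): 0.005,
}


def correct(observed):
    """Return the dictionary word w maximizing log P(observed | w) + log P(w)."""
    best_word, best_score = observed, -math.inf
    for word, prior in LANGUAGE_MODEL.items():
        likelihood = ERROR_MODEL.get((word, observed), 0.0)
        if likelihood == 0.0:
            continue  # this word cannot produce the observed string in the toy model
        score = math.log(likelihood) + math.log(prior)
        if score > best_score:
            best_word, best_score = word, score
    # If no candidate explains the input, the input is returned unchanged.
    return best_word


print(correct("teh"))
```

In the real component the error model is learned from typo statistics and the prior comes from the ARPA language model, so candidates are scored in sentence context rather than word by word.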

Usage

Config parameters:

A working config could look like this:

{
  "model": {
    "name": "spelling_error_model",
    "save_path": "../download/error_model/error_model.tsv",
    "load_path": "../download/error_model/error_model.tsv",
    "train_now": true,
    "window": 1,
    "dictionary": {
      "name": "wikitionary_100K_vocab"
    },
    "lm_file": "/data/data/enwiki_no_punkt.arpa.binary"
  }
}

Usage example

This model expects a sentence string of lowercase, space-separated tokens as its input and returns the same string with corrected words. Here is an example script that reads input data from stdin line by line and writes the resulting text to the output.txt file:

import json
import sys

from deeppavlov.core.commands.infer import build_model_from_config
from deeppavlov.core.commands.utils import set_usr_dir

CONFIG_PATH = 'models/spellers/error_model/config_ru_custom_vocab.json'
set_usr_dir(CONFIG_PATH)

with open(CONFIG_PATH) as config_file:
    config = json.load(config_file)

model = build_model_from_config(config)
with open('output.txt', 'w') as f:
    for line in sys.stdin:
        # drop the trailing newline so print() does not emit blank lines
        print(model.infer(line.strip()), file=f, flush=True)

If we save it as example.py, it can be used like so:

cat input.txt | python3 example.py
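
Since the model expects lowercase, space-separated tokens, raw text may need to be normalized before it is passed in. The helper below is a minimal sketch under that assumption. The name `normalize` is hypothetical and not part of the library, and it deliberately leaves punctuation untouched because the source does not say how punctuation should be handled.

```python
def normalize(line):
    """Lowercase a line and collapse any run of whitespace into a single space."""
    # str.split() with no arguments also drops leading and trailing whitespace,
    # including the newline that comes with each line read from stdin.
    return " ".join(line.lower().split())


print(normalize("  Hello   World\n"))
```

A line could be passed through this helper before being handed to the model in example.py.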

Training

Error model

For the training phase, the config file also needs to include these parameters:

A working training config could look something like:

{
  "model": {
    "name": "spelling_error_model",
    "save_path": "../download/error_model/error_model.tsv",
    "load_path": "../download/error_model/error_model.tsv",
    "window": 1,
    "train_now": true,
    "dictionary": {
      "name": "wikitionary_100K_vocab"
    }
  },
  "dataset_reader": {
    "name": "typos_wikipedia_reader"
  },
  "dataset": {
    "name": "typos_dataset"
  }
}

And a script to use this config:

from deeppavlov.core.commands.train import train_model_from_config
from deeppavlov.core.commands.utils import set_usr_dir

MODEL_CONFIG_PATH = 'deeppavlov/models/spellers/error_model/config_en.json'
set_usr_dir(MODEL_CONFIG_PATH)
train_model_from_config(MODEL_CONFIG_PATH)

Language model

This model uses KenLM to process language models, so if you want to build your own, we suggest you consult the KenLM website. We also provide our own language models for English (5.5GB) and Russian (3.1GB).
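
As a rough illustration of how a KenLM model can be queried from Python, the sketch below ranks candidate sentences by their language model score. It assumes the `kenlm` Python package is installed. The helper names `load_language_model` and `rank_candidates` and the file path in the comment are hypothetical, and this is not how the component itself is wired up.

```python
def load_language_model(path):
    """Load an ARPA or binary KenLM file; requires the `kenlm` Python package."""
    import kenlm  # imported lazily so the ranking helper works without it
    return kenlm.Model(path)


def rank_candidates(lm, candidates):
    """Sort candidate sentences from most to least probable under the model."""
    # kenlm.Model.score returns a log10 probability, so higher is better.
    return sorted(
        candidates,
        key=lambda sentence: lm.score(sentence, bos=True, eos=True),
        reverse=True,
    )


# Example usage (hypothetical path, uncomment once a model file is available):
# lm = load_language_model("path/to/model.arpa.binary")
# print(rank_candidates(lm, ["i saw teh cat", "i saw the cat"]))
```

`rank_candidates` only needs an object with a `score` method, so it can be tried out with a stub before downloading one of the large model files.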

Comparison

We compared this module with Yandex.Speller and GNU Aspell on the test set for the SpellRuEval competition on Automatic Spelling Correction for Russian:

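| Correction method                          | Precision | Recall | F-measure |
|--------------------------------------------|-----------|--------|-----------|
| Yandex.Speller                             | 83.09     | 59.86  | 69.59     |
| Our model with the provided language model | 51.92     | 53.94  | 52.91     |
| Our model with no language model           | 41.42     | 37.21  | 39.20     |
| GNU Aspell, always first candidate         | 27.85     | 34.07  | 30.65     |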
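
The F-measure values above are consistent with the harmonic mean of precision and recall (the F1 score). That reading is inferred from the numbers rather than stated in the source. The sketch below recomputes the column from the precision and recall values in the table; the function name `f_measure` is our own.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1), in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the comparison table
results = {
    "Yandex.Speller": (83.09, 59.86),
    "Our model with the provided language model": (51.92, 53.94),
    "Our model with no language model": (41.42, 37.21),
    "GNU Aspell, always first candidate": (27.85, 34.07),
}

for name, (p, r) in results.items():
    print(f"{name}: {f_measure(p, r):.2f}")
```

Rounded to two decimals, the computed values match the F-measure column for all four rows.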

Ways to improve