Integrate AfroLID with Python code

(1) Install AfroLID

pip install git+https://github.com/UBC-NLP/afrolid.git --q
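
To confirm the installation before loading any model, a quick import of the classifier entry point used throughout this section is enough to catch a broken install:

# Sanity check: this import should succeed once the pip install above has finished.
from afrolid.main import classifier
print("AfroLID import OK:", classifier)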

(2) Initialize the AfroLID object

Import the required packages

import os, sys
import logging
from afrolid.main import classifier
logging.basicConfig(
      format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
      datefmt="%Y-%m-%d %H:%M:%S",
      level=os.environ.get("LOGLEVEL", "INFO").upper(),
      force=True, # Resets any previous configuration
)
logger = logging.getLogger("afrolid")
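
The level argument above is read from the LOGLEVEL environment variable and falls back to INFO, so more verbose output can be switched on from the shell without editing the code (the script name below is only a placeholder):

LOGLEVEL=DEBUG python classify_example.py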

Create the AfroLID classifier object

cl = classifier(logger, model_path="/path/to/model")
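
model_path should point at the directory holding the downloaded pretrained AfroLID model. If you run the code from several scripts or notebooks, one convenient pattern is to take that location from an environment variable (AFROLID_MODEL_DIR below is a hypothetical name, not part of AfroLID itself):

# AFROLID_MODEL_DIR is a hypothetical variable; set it to the extracted model directory.
model_dir = os.environ.get("AFROLID_MODEL_DIR", "/path/to/model")
cl = classifier(logger, model_path=model_dir)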

(3) Get language prediction(s)

## Gold label = dip
text = "6Acï looi aya në wuöt dït kɔ̈k yiic ku lɔ wuöt tɔ̈u tëmec piny de Manatha ku Eparaim ku Thimion , ku ɣään mec tɔ̈u të lɔ rut cï Naptali"
predicted_langs = cl.classify(text)  # default max_outputs=3
print("Predicted languages:")
for lang in predicted_langs:
    print("     |-- ISO: {}\tName: {}\tScript: {}\tScore: {}%".format(
        lang,
        predicted_langs[lang]['name'],
        predicted_langs[lang]['script'],
        predicted_langs[lang]['score']))
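
classify returns a dictionary keyed by ISO code, where each entry carries the language name, script, and confidence score; max_outputs controls how many candidates are returned (three by default, per the comment above). A quick sketch restricting it to the single best guess:

# Keep only the top prediction for the same input text.
top = cl.classify(text, max_outputs=1)
for iso in top:
    print(iso, top[iso]['name'], top[iso]['script'], top[iso]['score'])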

(4) Integrate with Pandas

wget https://raw.githubusercontent.com/UBC-NLP/afrolid/main/examples/examples.tsv -O examples.tsv

import pandas as pd
from tqdm import tqdm
tqdm.pandas()
df = pd.read_csv("examples.tsv", sep="\t")

def get_afrolid_prediction(text):
      # Ask AfroLID for the single best candidate and unpack it into flat columns.
      predictions = cl.classify(text, max_outputs=1)
      for lang in predictions:
            # predictions is keyed by ISO code; with max_outputs=1 this loop runs once.
            return lang, predictions[lang]['score'], predictions[lang]['name'], predictions[lang]['script']

df['predict_iso'], df['predict_score'], df['predict_name'], df['predict_script'] = zip(*df['content'].progress_apply(get_afrolid_prediction))
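
The four new columns now sit alongside the original content column, so the labelled DataFrame can be written straight back out for inspection (the output filename below is arbitrary):

df.to_csv("examples_with_predictions.tsv", sep="\t", index=False)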

Read and translate text from a file

  • -f or --input_file: read the input text from a file. The translations will be saved to a JSON file.

  • -bs or --batch_size: the maximum number of source examples processed in one iteration (default: 25).

  • gen_options: Generation options

gen_options = {"search_method":"beam", "seq_length": 300, "num_beams":5, "no_repeat_ngram_size":2, "max_outputs":1}
torj.translate_from_file("samples.txt", batch_size=25, **gen_options)
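
translate_from_file expects samples.txt to already exist in the working directory. A minimal sketch for preparing it, assuming one source sentence per line (the sentence here is only illustrative):

# Write the source sentences to samples.txt, one per line (assumed input format).
sentences = [
    "As US reaches one million COVID deaths, how are Americans coping?",
]
with open("samples.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences) + "\n")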