Simply create (N)grams: N ~ Bi | Tri ...
Welcome to bigrams, a Python project that provides a non-intrusive way to connect the tokens of tokenized sentences into (N)grams. It is focused on a single task: efficiently merging tokens from a list of tokenized sentences.
It's non-intrusive because it leaves tokenisation, stopword removal and other text preprocessing out of its flow.
Source Code: https://github.com/proteusiq/bigrams
PyPI: https://pypi.org/project/bigrams/
pip install -U bigrams
To use bigrams, import it into your Python script and use its scikit-learn-ish API to transform your tokens.
from bigrams import Grams
# expects tokenised sentences
in_sentences = [["this", "is", "new", "york", "baby", "again!"],
                ["new", "york", "and", "baby", "again!"],
               ]
g = Grams(window_size=2, threshold=2)
out_sentences = g.fit_transform(in_sentences)
# [["this", "is", "new_york", "baby_again!"],
# ["new_york", "and", "baby_again!"],
# ]
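To make the example above easier to follow, here is a minimal pure-Python sketch of the general idea: count adjacent token pairs across all sentences, then join the pairs that occur often enough. This is not the library's implementation. The function name `merge_frequent_pairs`, the `sep` parameter, and the "count >= threshold" rule are assumptions made for illustration, and `Grams` may count or score pairs differently.

```python
from collections import Counter
from typing import List


def merge_frequent_pairs(
    sentences: List[List[str]], threshold: int = 2, sep: str = "_"
) -> List[List[str]]:
    """Join adjacent tokens whose pair occurs at least `threshold` times.

    Hypothetical helper for illustration only; not part of bigrams.
    """
    # Count every adjacent (left, right) token pair across all sentences.
    counts = Counter(
        pair for sent in sentences for pair in zip(sent, sent[1:])
    )

    merged = []
    for sent in sentences:
        out = []
        i = 0
        while i < len(sent):
            # Assumption: a pair is merged when its count reaches the threshold.
            if i + 1 < len(sent) and counts[(sent[i], sent[i + 1])] >= threshold:
                out.append(sent[i] + sep + sent[i + 1])
                i += 2  # skip both tokens of the merged pair
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged


in_sentences = [
    ["this", "is", "new", "york", "baby", "again!"],
    ["new", "york", "and", "baby", "again!"],
]
result = merge_frequent_pairs(in_sentences, threshold=2)
```

In this sketch only `("new", "york")` and `("baby", "again!")` occur twice, so they are the only pairs joined. It scans left to right and merges greedily, which is one simple choice among several.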
- Clone this repository
- Requirements:
- Poetry
- Python 3.7+
- Create a virtual environment and install the dependencies
poetry install
- Activate the virtual environment
poetry shell
Pre-commit hooks run all the auto-formatters (e.g. black, isort), linters (e.g. mypy, flake8), and other quality checks to make sure the changeset is in good shape before a commit/push happens.
You can install the hooks so that they run on each commit with:
pre-commit install
Or if you want them to run only for each push:
pre-commit install -t pre-push
Or if you want to run all checks manually on all files:
pre-commit run --all-files
Contributions are welcome.
- [ ] Create Scikit-learn compatible transformer
- [ ] ~~create a save & load function~~
- [ ] compare it with gensim Phrases
- [ ] write replacer in Rust - PyO3