(N)Grams

bigrams

Simply create (N)grams: N ~ Bi | Tri ...

Welcome to bigrams, a Python project that provides a non-intrusive way to join the tokens of tokenized sentences into (N)grams. It takes a list of already tokenized sentences and focuses on a single task: merging their tokens efficiently.

It is non-intrusive because it leaves tokenisation, stop-word removal, and other text preprocessing out of its flow.
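Since preprocessing stays outside the library, it has to happen before the tokens are handed over. The following is a minimal sketch of such an upstream step using only the standard library; the `tokenize` helper and the `STOPWORDS` set are hypothetical names made up for illustration and are not part of bigrams.

```python
# Hypothetical stop-word list, for illustration only (not part of bigrams).
STOPWORDS = {"the", "a", "an"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


raw = ["This is New York baby again!", "New York and the baby again!"]

# The result is a list of tokenised sentences, the input shape bigrams expects.
tokenized = [tokenize(sentence) for sentence in raw]
print(tokenized)
```

Any tokenizer can take the place of this helper, as long as it produces a list of token lists.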


Source Code: https://github.com/proteusiq/bigrams

PyPI: https://pypi.org/project/bigrams/


Installation

pip install -U bigrams

Usage

To use bigrams, import it into your Python script and use its scikit-learn-style API to transform your tokens.

from bigrams import Grams

# expects tokenised sentences
in_sentences = [
    ["this", "is", "new", "york", "baby", "again!"],
    ["new", "york", "and", "baby", "again!"],
]
g = Grams(window_size=2, threshold=2)

out_sentences = g.fit_transform(in_sentences)
print(out_sentences)
# [["this", "is", "new_york", "baby_again!"],
#  ["new_york", "and", "baby_again!"],
# ]
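To make the example above easier to follow, here is a rough pure-Python sketch of one way a count-based bigram merge could work. It is not the library's implementation: `merge_bigrams` is a hypothetical helper, and it assumes that `threshold` means the minimum number of times an adjacent pair must occur across all sentences before it is joined, and that pairs are joined with `_`.

```python
from collections import Counter


def merge_bigrams(sentences, threshold=2, sep="_"):
    """Join adjacent token pairs that occur at least `threshold` times.

    Hypothetical illustration only; bigrams may count and score differently.
    """
    # Count every adjacent token pair across all sentences.
    counts = Counter(
        pair for sent in sentences for pair in zip(sent, sent[1:])
    )

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            pair = tuple(sent[i:i + 2])
            if len(pair) == 2 and counts[pair] >= threshold:
                # Frequent pair: emit one joined token and skip both parts.
                out.append(sep.join(pair))
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged


sentences = [
    ["this", "is", "new", "york", "baby", "again!"],
    ["new", "york", "and", "baby", "again!"],
]
result = merge_bigrams(sentences)
print(result)
# [['this', 'is', 'new_york', 'baby_again!'], ['new_york', 'and', 'baby_again!']]
```

Under these assumptions, only "new york" and "baby again!" occur twice, so only those two pairs are merged.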

Development

  • Clone this repository
  • Requirements:
      • Poetry
      • Python 3.7+
  • Create a virtual environment and install the dependencies:
poetry install
  • Activate the virtual environment:
poetry shell

Testing

pytest

Pre-commit

Pre-commit hooks run all the auto-formatters (e.g. black, isort), linters (e.g. mypy, flake8), and other quality checks to make sure the changeset is in good shape before a commit/push happens.

You can install the hooks so that they run on each commit:

pre-commit install

Or, if you want them to run only on each push:

pre-commit install -t pre-push

Or, if you want to run all checks manually on all files:

pre-commit run --all-files

Contributions are welcome

ToDo:

  • [ ] Create Scikit-learn compatible transformer
  • [ ] ~~create a save & load function~~
  • [ ] compare it with gensim Phrases
  • [ ] write replacer in Rust - PyO3