Ben Lau statistics . machine learning . programming . optimization . research

Natural Language Processing


Techniques

  • Lowercasing: Convert text to lowercase.
  • Punctuation Removal: Remove punctuation marks from text.
  • Stemming: Strip suffixes from words to reduce them to a common stem, which need not be a valid word. For example, “changing”, “changed”, and “change” all become “chang”.
  • Lemmatization: Similar to stemming, but each word is reduced to its lemma, the dictionary form of the word, so the result is always a valid word. For example, “changing”, “changed”, and “change” all become “change”.
  • Stop Words: Common words like “the”, “is”, and “and” that are often removed from text because they carry little meaning on their own.
  • Tokenization: Split text into words or sentences.
  • Bag of Words: Represent text as a multiset of its words (each word paired with its count), ignoring grammar and word order.
  • TF-IDF: Term Frequency-Inverse Document Frequency. It measures how important a word is to a document in a collection of documents. A word scores highly when it is frequent in that document but rare across the collection.
  • Word Embeddings: Represent words as dense real-valued vectors. Words with similar meanings lie closer together in the vector space.
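
Several of the techniques above can be chained into a small pipeline. The following is a minimal sketch using only the Python standard library, not a production implementation. The stop-word list, the `toy_stem` helper, and the example sentences are all hypothetical, and the TF-IDF variant shown (term count divided by document length, times log(N / document frequency)) is only one of several common formulations.

```python
import math
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "and", "a", "of", "on"}


def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    # string.punctuation covers ASCII marks only (not curly quotes, for example).
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Naive suffix stripping for illustration; this is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def bag_of_words(tokens):
    """Map each word to its count; word order is discarded."""
    return Counter(tokens)


def tf_idf(docs_tokens):
    """Return one {word: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return weights


docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat chased the dog!",
]
tokenized = [preprocess(d) for d in docs]
bows = [bag_of_words(t) for t in tokenized]
weights = tf_idf(tokenized)
stems = [toy_stem(w) for w in ("changing", "changed", "change")]
```

In this toy corpus, “mat” appears in only one document while “cat” appears in two, so “mat” receives the larger TF-IDF weight in the first document. Lemmatization and word embeddings are left out because they typically rely on a dictionary or a trained model rather than a few lines of code.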

Applications

  • Named Entity Recognition: Identify named entities like people, organizations, and locations in text.
  • Sentiment Analysis: Determine the sentiment of text, such as positive, negative, or neutral.
  • Topic Modeling: Discover topics in text documents.
  • Text Classification: Assign predefined categories or labels to text.
  • Machine Translation: Translate text from one language to another.
  • Text Summarization: Generate a concise summary of a text document.
  • Question Answering: Answer questions based on a given context or text.
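
As one illustration of the applications above, sentiment analysis can be approximated by counting words from positive and negative word lists. This is a rough sketch of a lexicon-based approach; the two word lists and the `sentiment` function are hypothetical, and practical systems usually use much larger lexicons or trained classifiers.

```python
import re

# Hypothetical toy lexicons; real sentiment lexicons contain thousands of entries.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}


def sentiment(text):
    """Label text by comparing counts of positive and negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


labels = [
    sentiment("I love this great phone"),
    sentiment("What a terrible, awful day"),
    sentiment("The package arrived on Tuesday"),
]
```

A word-counting approach like this ignores negation (“not good”) and context, which is one reason learned models are generally preferred.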