Andrew Gurung
Cosine Similarity


Last updated 6 years ago

The cosine similarity between two vectors is the cosine of the angle between them.
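For reference, the standard definition can be written out as below. The symbols $A$, $B$, $\theta$, and the dimension $n$ are notation chosen here for illustration and do not come from the original page.

```latex
\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}
           = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}
```

The value is undefined when either vector has zero length, since the denominator is then zero.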

Cosine similarity focuses on direction rather than magnitude, which is helpful in NLP and sentiment analysis: a long document can have the same theme as a short sentence.
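A minimal sketch of why magnitude does not matter, using plain Python. The helper name `cosine_sim` and the two count vectors are hypothetical and chosen only for illustration.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word counts: the long document uses the same words
# in the same proportions, just ten times as often.
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]

same_direction = cosine_sim(short_doc, long_doc)
print(round(same_direction, 6))  # both vectors point the same way
```

Scaling a vector changes its length but not its direction, so the similarity stays at (approximately) 1 regardless of document length.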

The angle between two vectors falls into one of three cases:
i) 90 degrees (orthogonal): not similar, i.e. independent
ii) Less than 90 degrees: similar
iii) Greater than 90 degrees: opposite
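The three cases can be checked with small 2D vectors. This is an illustrative sketch: the helper `cosine_sim` and the example vectors are assumptions, not part of the original page.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

reference = [1, 0]

# One hypothetical vector per case, compared against the reference.
cases = {
    "orthogonal (90 degrees)": [0, 1],
    "similar (less than 90 degrees)": [1, 1],
    "opposite (greater than 90 degrees)": [-1, 0],
}

results = {label: cosine_sim(reference, vec) for label, vec in cases.items()}
for label, value in results.items():
    print(f"{label}: {value:.4f}")
```

A value of zero corresponds to orthogonal vectors, a positive value to an angle under 90 degrees, and a negative value to an angle over 90 degrees.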

Note:
- Words and sentences must be converted to vectors before cosine similarity can be calculated
- Tools such as Word2Vec, or bag of words with either TF (term frequency) or TF-IDF (term frequency-inverse document frequency) weighting, can be used
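As a rough sketch of the bag-of-words idea with raw term frequency, sentences can be turned into count vectors by hand. This assumes simple lowercase whitespace tokenization and no IDF weighting; the helpers `bag_of_words` and `cosine_sim` are hypothetical names used for illustration.

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

doc_a = "the sun is bright"
doc_b = "the sun in the sky is bright"

# Shared vocabulary: every distinct word across both sentences, sorted.
vocabulary = sorted(set(doc_a.split()) | set(doc_b.split()))

vec_a = bag_of_words(doc_a, vocabulary)
vec_b = bag_of_words(doc_b, vocabulary)
similarity = cosine_sim(vec_a, vec_b)

print(vocabulary)
print(vec_a, vec_b)
print(round(similarity, 4))
```

Raw counts give common words such as "the" a lot of weight; TF-IDF reduces that effect by down-weighting terms that appear in many documents.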

Calculate Cosine Similarity using Scikit-learn

documents = ("The sky is blue",
"The sun is bright",
"The sun in the sky is bright",
"We can see the shining sun, the bright sun"
)

# Import the TF-IDF vectorizer and the pairwise cosine similarity function
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Instantiate scikit-learn's TF-IDF vectorizer and transform the documents into a TF-IDF matrix
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix.shape)  # (4, 11): 4 sentences, 11 unique words

# Calculate the cosine similarity between the first document and every document (including itself)
similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(similarities)
# [[1.         0.36651513 0.52305744 0.13448867]]

Link:

http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/