by: Abhishek Notetaker

1BoW.pdf CSCI GA-2590


About this Document

Description: Bag of Words methods for text mining.
Course: Natural Language Processing
Professor: Dr. Ralph Grishman
Type: Class Notes
Tags: NLP, Bag of Words





These 5 pages of class notes were uploaded by Abhishek Notetaker on Sunday, March 6, 2016. The notes belong to CSCI GA-2590 at NYU, taught by Dr. Ralph Grishman in Fall 2016. Since their upload, they have received 21 views. For similar materials, see Natural Language Processing in Computer Science at NYU.


Date Created: 03/06/16
Bag of Words methods for text mining

• Do we really need elaborate linguistic analysis?
• Look at text mining applications:
  – Document retrieval
  – Opinion mining
  – Association mining
• See how far we can get with document-level bag-of-words models
  – And introduce some of our mathematical approaches

Information Retrieval:
• Task: given a query = list of keywords, identify and rank relevant documents from a collection
• Basic idea: find documents whose set of words most closely matches the words in the query

Topic Vector:
• Suppose the document collection has n distinct words, w_1, …, w_n
• Each document is characterized by an n-dimensional vector whose i-th component is the frequency of word w_i in the document

Example:
• D1 = [The cat chased the mouse.]
• D2 = [The dog chased the cat.]
• W = [the, chased, dog, cat, mouse] (n = 5)
• V1 = [2, 1, 0, 1, 1]
• V2 = [2, 1, 1, 1, 0]

Weighting the components:
• Unusual words like "elephant" determine the topic much more than common words such as "the" or "have"
• Can ignore words on a stop list, or
• weight each term frequency tf_i by its inverse document frequency idf_i:
  idf_i = log(N / n_i)
  where N = size of the collection and n_i = number of documents containing term i
  w_i = tf_i × idf_i

Cosine Similarity metric:
• Define a similarity metric between topic vectors; a common choice is cosine similarity (a normalized dot product):
  sim(A, B) = (Σ_i a_i · b_i) / (sqrt(Σ_i a_i²) · sqrt(Σ_i b_i²))
• The cosine similarity metric is the cosine of the angle between the term vectors
• For heterogeneous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years
• Stemming is required for inflected languages
• Limited resolution: returns documents, not answers
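To make the retrieval pipeline above concrete, here is a minimal Python sketch (not part of the original notes) that builds term-frequency vectors for the D1/D2 example, weights them with idf_i = log(N / n_i), and ranks the documents against a query by cosine similarity. The crude tokenizer, the query string, and the two-document collection are illustrative assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Deliberately crude: lowercase and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

def tf_idf_vectors(docs):
    """Build one {term: tf * idf} dict per document, with idf_i = log(N / n_i)."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()                      # n_i: number of documents containing term i
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return vectors, doc_freq, n_docs

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The two-document example from the notes.
docs = ["The cat chased the mouse.", "The dog chased the cat."]
doc_vectors, doc_freq, n_docs = tf_idf_vectors(docs)

# Weight the query with the idf values learned from the collection.
query_tf = Counter(tokenize("dog cat"))
query_vec = {w: query_tf[w] * math.log(n_docs / doc_freq[w])
             for w in query_tf if w in doc_freq}

for i, vec in sorted(enumerate(doc_vectors),
                     key=lambda pair: cosine(query_vec, pair[1]), reverse=True):
    print(f"D{i + 1}: cosine similarity to query = {cosine(query_vec, vec):.3f}")
```

With only two documents, every word that occurs in both ("the", "cat", "chased") gets idf = 0, so the query term "dog" alone decides the ranking and D2 comes out on top.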
Opinion Mining:
• Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic
  – A classification task
  – Valuable for producers and marketers of all sorts of products
• Simple strategy: bag-of-words
  – Make lists of positive and negative words
  – See which predominate in a given document (and mark as 'no opinion' if there are few words of either type)
  – Problem: hard to make such lists; the lists will differ for different topics

Training a model:
• Instead, label the documents in a corpus and then train a classifier from the corpus
  – Labeling is easier than thinking up words
  – In effect, learn the positive and negative words from the corpus
• Probabilistic classifier: identify the most likely class
  s = argmax_{t ∈ {pos, neg}} P(t | W)
  argmax_t P(t | W) = argmax_t P(W | t) · P(t) / P(W)
                    = argmax_t P(W | t) · P(t)
                    = argmax_t P(w_1, …, w_n | t) · P(t)
                    = argmax_t ∏_i P(w_i | t) · P(t)
• The last step is based on the naïve assumption of independence of the word probabilities

Training:
• We estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators:
  P(t) = count(docs labeled t) / N
    (the probability that a document is labeled t)
  P(w_i | t) = count(docs labeled t containing w_i) / count(docs labeled t)
    (the probability that a document labeled t contains w_i)

Flavors of text categorization:
• Bernoulli model: use the presence (or absence) of a term in a document as a feature
• Multinomial model: based on the frequency of terms in documents:
  P(t) = total length of docs labeled t / total size of corpus
    (the probability that a word in the corpus is part of a doc labeled t)
  P(w_i | t) = count(instances of w_i in docs labeled t) / total length of docs labeled t
    (the probability that a word in a doc labeled t is w_i)
  – gives better performance on long documents

A problem:
• Suppose a glowing review GR (with lots of positive words) includes one word, "mathematical", previously seen only in negative reviews
• P(positive | GR) = ?
• P(positive | GR) = 0, because P("mathematical" | positive) = 0
• The maximum likelihood estimate is poor when there is very little data
• We need to 'smooth' the probabilities to avoid this problem

Laplace smoothing:
• A simple remedy is to add 1 to each count
• To keep the estimates as probabilities (Σ_t P(t) = 1), we increase the denominator N by the number of outcomes (values of t): 2, for 'positive' and 'negative'
• For the conditional probabilities P(w | t) there are similarly two outcomes (w is present or absent)
• (a small Python sketch of the smoothed classifier appears after these notes)

Ambiguous terms:
• Is "low" a positive or a negative term?
• "low" can be positive ("low price") or negative ("low quality")

Negation:
• How to handle "the equipment never failed"?
• Modify the words following a negation: "the equipment never NOT_failed"
• Treat them as a separate 'negated' vocabulary
• How far to go? "the equipment never failed and was cheap to run" → "the equipment never NOT_failed NOT_and NOT_was NOT_cheap NOT_to NOT_run"
• Have to determine the scope of negation (see the marking sketch after these notes)

Verdict: Mixed
• A simple bag-of-words strategy with a Naive Bayes model works quite well for simple reviews referring to a single item, but fails:
  – for ambiguous terms
  – for negation
  – for comparative reviews
  – to reveal aspects of an opinion
    • "the car looked great and handled well, but the wheels kept falling off"

Association mining:
• Goal: find interesting relationships among attributes of an object in a large collection … objects with attribute A also have attribute B
  – e.g., "people who bought A also bought B"
• For text: documents with term A also have term B
  – widely used in the scientific and medical literature

Bag-of-words approach:
• Simplest approach: look for words A and B for which frequency(A and B in the same document) >> frequency of A × frequency of B (see the co-occurrence sketch after these notes)
• It doesn't work well:
  – we want to find names (of companies, products, genes), not individual words
  – we are interested in specific types of terms
  – we want to learn from a few examples, which requires contexts to avoid noise

Effective text association mining needs:
• name recognition
• term classification
• preferably: the ability to learn patterns
• preferably: a good GUI

Conclusion:
• Some tasks can be handled effectively (and very simply) by bag-of-words models,
• but most benefit from an analysis of language structure
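As a companion to the opinion-mining sections above, here is a minimal Python sketch of a Bernoulli-style Naive Bayes classifier with add-one (Laplace) smoothing, in the spirit of the notes. The tiny labeled corpus and the test reviews are invented for illustration, and for brevity the scorer only multiplies the probabilities of words that are present; a full Bernoulli model would also include factors for absent vocabulary words.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(labeled_docs):
    """labeled_docs: list of (set_of_words, label) pairs.
    Returns Laplace-smoothed log priors and log conditional probabilities."""
    labels = sorted({label for _, label in labeled_docs})
    n_docs = len(labeled_docs)
    docs_per_label = defaultdict(int)
    word_doc_count = {t: defaultdict(int) for t in labels}   # label -> word -> #docs containing it
    vocab = set()
    for words, label in labeled_docs:
        docs_per_label[label] += 1
        vocab.update(words)
        for w in words:
            word_doc_count[label][w] += 1
    # Laplace smoothing: add 1 to each count, and add the number of possible
    # outcomes to each denominator (len(labels) for P(t); 2 for present/absent).
    log_prior = {t: math.log((docs_per_label[t] + 1) / (n_docs + len(labels))) for t in labels}
    log_cond = {t: {w: math.log((word_doc_count[t][w] + 1) / (docs_per_label[t] + 2))
                    for w in vocab}
                for t in labels}
    return log_prior, log_cond, vocab

def classify(words, log_prior, log_cond, vocab):
    """argmax_t P(t) * prod_i P(w_i | t), computed in log space;
    only words that are present (and in the vocabulary) are scored."""
    scores = {t: log_prior[t] + sum(log_cond[t][w] for w in words if w in vocab)
              for t in log_prior}
    return max(scores, key=scores.get)

corpus = [
    ({"great", "plot", "loved", "it"}, "pos"),
    ({"wonderful", "acting", "great", "fun"}, "pos"),
    ({"boring", "mathematical", "terrible", "acting"}, "neg"),
    ({"awful", "waste", "boring"}, "neg"),
]
log_prior, log_cond, vocab = train_bernoulli_nb(corpus)

# A "glowing review" that also contains "mathematical", seen only in a
# negative document; smoothing keeps P(pos | doc) from collapsing to zero.
print(classify({"great", "fun", "loved", "mathematical"}, log_prior, log_cond, vocab))  # pos
print(classify({"boring", "awful", "plot"}, log_prior, log_cond, vocab))                # neg
```

Without the +1 counts, P("mathematical" | pos) would be 0 and the first review would be forced to the negative class, exactly the failure the notes describe.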
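The NOT_-marking treatment of negation can be sketched as a single pass over the tokens. The negation-word list and the rule that sentence punctuation closes the negation scope are assumptions made here for illustration; as the notes point out, determining the real scope of negation is the hard part.

```python
import re

# Assumed negation cues; the notes leave the exact list and scope open.
NEGATION_WORDS = {"not", "no", "never", "cannot", "without"}
SCOPE_ENDERS = {".", ",", "!", "?", ";"}

def mark_negation(text):
    """Prefix every token after a negation word with NOT_ until the next
    punctuation mark; a crude stand-in for real negation-scope detection."""
    tokens = re.findall(r"[\w']+|[.,!?;]", text.lower())
    negating = False
    out = []
    for tok in tokens:
        if tok in SCOPE_ENDERS:
            negating = False            # punctuation closes the negation scope
            out.append(tok)
        elif tok in NEGATION_WORDS:
            negating = True
            out.append(tok)
        else:
            out.append("NOT_" + tok if negating else tok)
    return " ".join(out)

print(mark_negation("the equipment never failed and was cheap to run"))
# the equipment never NOT_failed NOT_and NOT_was NOT_cheap NOT_to NOT_run
```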
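Finally, the simple bag-of-words association test the notes describe (frequency of A and B in the same document >> frequency of A × frequency of B) can be phrased as a lift score over document co-occurrence counts. The toy corpus, the threshold, and the minimum-count cutoff below are illustrative assumptions; as the notes say, this crude approach mostly surfaces noise unless it is combined with name recognition and term classification.

```python
from collections import Counter
from itertools import combinations

def associated_pairs(docs, lift_threshold=2.0, min_pair_count=2):
    """Flag word pairs whose document co-occurrence rate greatly exceeds what
    independent occurrence would predict: lift = P(A and B) / (P(A) * P(B))."""
    n = len(docs)
    doc_sets = [set(d.lower().split()) for d in docs]
    word_freq = Counter(w for s in doc_sets for w in s)                 # docs containing each word
    pair_freq = Counter(p for s in doc_sets for p in combinations(sorted(s), 2))
    results = []
    for (a, b), joint in pair_freq.items():
        if joint < min_pair_count:
            continue
        lift = (joint / n) / ((word_freq[a] / n) * (word_freq[b] / n))
        if lift >= lift_threshold:
            results.append((a, b, lift))
    return sorted(results, key=lambda r: -r[2])

docs = [
    "aspirin reduces fever",
    "aspirin lowers fever and pain",
    "the trial measured pain relief",
    "the trial measured blood pressure",
]
for a, b, lift in associated_pairs(docs):
    print(f"{a} + {b}: lift {lift:.1f}")
```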

