COMPUTERS & LANGUAGE LING 120
These 5-page class notes were uploaded by Gretchen Schmidt on Saturday, September 26, 2015. They belong to LING 120 at Iowa State University, taught by Staff in Fall.
Date Created: 09/26/15
LING 120 Computers and Language
Topic: Authorship Attribution
Reading: Farringdon

- Authorship Attribution
    o Ransom letters
    o Hate mail
    o Any other document
- Similarity Analysis; Stylometry
    o Detecting plagiarism
- Authorship Profiling
    o Determining author's cohort

- Authorship attribution, or "fingerprinting": matching a sample of writing to an attested corpus of an author's work
    o Try to find author's unconscious signature
    o Can't use stylistic dimensions
    o Need statistical tests for significance

- Background
    o Authorship attribution goes back to the 1800s
    o First computer-based study: Mosteller & Wallace on The Federalist Papers (1964)
    o Much controversy on statistical methods, choice of features, meaning of results, and everything else
    o Forensic applications
        * High-profile results by Donald Foster (Vassar College):
            attribution of A Funerall Elegye to Shakespeare (1989)
            matched Primary Colors to Joe Klein (1996)
            matching the Unabomber's manifesto to other writing by Theodore Kaczynski (1996)
        * Big mess re the ransom note in the JonBenet Ramsey case (1997-99)
        * Lawsuit re linking Steven Hatfill to the 2001 anthrax scare

- Author's signature or fingerprint
    o Average sentence length, type-token ratio
    o Proportional pairs, e.g. since vs. because
    o Distinctiveness ratios, e.g. use of some unusual words or common words like although
    o Word-length frequency distribution, e.g. two-/three-letter words
    o Distribution of initial-vowel words
    o Some combinations of these, e.g. two-/three-letter words + initial-vowel words

- The Federalist Papers (http://en.wikipedia.org/wiki/Federalist_Papers), Mosteller & Wallace 1964
    o 85 articles: 73 of known authorship, 12 of disputed authorship
    o Basic method: distinctiveness ratios of 30 word types:
      according, also, although, always, an, apt, both, by, commonly, consequently, considerable(ly), direction, enough, innovations, kind, language, matters, of, on, particularly, probability, there, this, though, to, upon, vigorous, while, whilst, works
    o Measure how likely it is for one author to use the above words as frequently as they did, and compare the likelihood ratios
    o Method gives correct answer for all attested papers. Greatly favors Madison as author for all disputed papers
- Other techniques, e.g. cusum techniques
    o http://members.aol.com/qsums/QsumIntroduction.html

- Authorship Attribution vs. TC (text categorization)
    o Text categorization
        * Focuses on content words used to detect topic
        * Surface words reveal topic to a great extent
    o Stylometry (genre identification, gender identification, authorship attribution & authorship verification)
        * Focus on form of text; abstract away from content
        * Finding reliable features using shallow techniques is very hard

- Authorship Attribution as TC
    o Components of a text categorization system
        * Document representation
        * Selecting suitable features
        * Representing each text as a vector of frequencies
        * Dimensionality reduction
        * Eliminating irrelevant features
    o Learning method
        * Constructing a model for each category
        * Testing protocol
    o Previously used features
        * Frequencies of function words ("function word" defined loosely)
        * Word length and sentence length
        * Word tags and POS sequences

References
Farringdon, M. How to be a Literary Detective: Authorship Attribution. A brief introduction to cusum analysis. Available online at http://members.aol.com/qsums/QsumIntroduction.html
Mosteller, F. and D.L. Wallace. 1964. Inference and Disputed Authorship: The Federalist. Addison-Wesley, Reading, Mass.

Topic: Information Retrieval
Recommended Reading: Tzoukermann et al. 2003

- Information retrieval: the science of finding objects (text, picture, video, etc.) in any media relevant to a user query
- Text retrieval: the science of finding text relevant to a user query in a collection of documents
    o Query may be formulated by the user or may be preformulated based on a user profile
- Dates back to 1890, when Herman Hollerith invented a machine to tabulate US census data
- Most uses for years were in scientific, legal, and medical fields
- Widespread use with the advent of the World Wide Web

- Main components of any IR system
    o Document processing: represent documents in a searchable form
    o Query processing: represent queries in a searchable form
    o Matching and retrieval: mechanism to measure document relevance to query

- Document processing
    o Documents are usually represented as indexed keywords (inverted index):
          build -> 12
          school -> 85
      The numbers show the position of the word in a document, in characters. An inverted index is built for each document.
    o To save time/space, stop words (i.e. function words) are removed first
    o Also, key words are sometimes stemmed, i.e. prefixes and suffixes are removed before indexing. There is still debate on stemming due to mixed results. Stemming typically increases recall but reduces precision (see notes on text categorization).
        * Stemming could be done using
            a traditional stemmer (see Chapter 3 of textbook, exercise 334 number 3)
            or a full-fledged morphological analyzer

- Query processing
    o A number of models are generally used, e.g. boolean, vector space & probabilistic
    o Boolean systems: Queries are built based on the presence or absence of query terms in documents. These systems are very efficient and are widely used in bibliographic and other database searches. Decisions are binary (relevant or not relevant), so these systems may miss documents that are somewhat relevant.
    o Vector space models: Documents are represented as vectors of TF-IDFs
        * Documents are then ranked based on their similarity to the query
        * Similarity is measured based on the geometric positions of the query and documents in the vector space
        * These models are simple, fast, and popular. Mostly used in Internet search engines
    o Probabilistic models: Calculate the probability of a document being relevant given the query. The problem is that the relevance of documents is not known at the time the query is formed. Also, this approach doesn't take into account the frequencies of words in documents. Having been devised in the mid-1970s, this approach is not popular these days.

- Latent Semantic Indexing (LSI): Based on Singular Value Decomposition from linear algebra, this technique allows for the retrieval of potentially relevant documents that contain words semantically related to the query terms, not necessarily those words per se. That is, LSI represents documents in a concept space rather than a term space.
    o Increases recall substantially but reduces precision

- Approaches to improving retrieval
    o Query expansion: Add words from the most relevant documents to the query (e.g. the "more like this" feature of Google), or add synonyms (requires disambiguation)
    o Use popularity measures: Give higher points to documents which are pointed to frequently or which are viewed by users frequently
    o Use user profiles: Give higher points to documents that are similar to what the user has seen/searched for before
    o Use Linguistically Motivated Indexing (LMI): Index based on linguistic phrases, e.g. noun phrases or collocations

References
Tzoukermann, E., Klavans, J. & Strzalkowski, T. 2003. Information Retrieval. In Mitkov, R. (ed.), The Oxford Handbook of Computational Linguistics, pp. 529-544. Oxford: Oxford University Press.
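
The vector space model described in these notes (stop-word removal, TF-IDF document vectors, ranking by geometric similarity to the query) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and not part of the notes: the stop-word list, the toy documents, the function names, the plain log(N/df) weighting, and the choice of cosine similarity as the geometric measure.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list (function words); a real system would use a larger one.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for", "on", "by"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumed weighting: idf = log(N / df). Other variants exist.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(toks).items()}
        for toks in tokenized
    ]
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection are ignored.
    q_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))

# Toy collection, invented for this example.
DOCS = [
    "The school board will build a new school",
    "Census data is tabulated by machine",
    "Search engines rank documents for a query",
]

if __name__ == "__main__":
    for score, index in rank("build school", DOCS):
        print(f"{score:.2f}  {DOCS[index]}")
```

Unlike a boolean system, this ranking returns a graded score for each document, so a document that matches only some query terms still gets a non-zero score and is not dropped outright.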