MIS 373 Final Exam Review
MIS 373 Final Exam Review MIS 373
Popular in Social Media Analytics
verified elite notetaker
Popular in Business, management
This 78 page Study Guide was uploaded by Christopher Notetaker on Sunday May 1, 2016. The Study Guide belongs to MIS 373 at University of Texas at Austin taught by Chakrabarti in Spring 2016. Since its upload, it has received 22 views. For similar materials see Social Media Analytics in Business, management at University of Texas at Austin.
Reviews for MIS 373 Final Exam Review
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 05/01/16
Project and Final Review 01/25/2016 1. Exam Structure a. Only based on material in the slides b. May include material before the midterm i. Including Gephi c. Give an answer, and a concise explanation d. Dates th i. May 12 2:00-5:00pm ii. BUR 136 2. Outline a. Experiments and Causality i. Causality 1. Can I figure out when someone is influencing their friends b. Similarity i. Shoes up in recommendations 3. Experiments and Causality a. Does Internet Advertising work b. What should the experiment look like i. Split customers into two groups, such that 1. Group 1 those who saw ads 2. Group 2 those who didn’t 3. And that is the only difference between the two groups! ii. Compare sales for the two groups 1. Hopefully Group 1 generates more sales c. The wrong experiement i. Show ads on Yahoo! For one week ii. After this week, split the set of customers into two groups 1. Group 1: those who saw ads 2. Group 2: those who didn’t a. Compare sales for the two groups d. Why is this experiment wrong i. User in Group 1 1. Saw the ad 2. Must be logged in to Yahoo! This week 3. Possibly logs in at least once a week ii. User in Group 2 1. Did not see the ad 2. Did not log in to Yahoo! This week 3. Includes folks who almost never use Yahoo! iii. Heavy Users of Yahoo! 1. Almost always in group 1 iv. Light users of Yahoo! 1. Almost always in Group 2 v. The two groups have different types of people! 1. Suppose we see differences in sales between the two groups 2. Is this difference due to a. The effect of ads b. The effect of the systematic differences between the groups? e. Heavy vs. Light Users i. Heavy users are more likely to purchase ii. Light users are less likely to purchase iii. iv. f. The right experiment i. Split customers into two groups randomly 1. Group 1: show ads when they log in to Yahoo! 2. Group 2: Do not show ads ii. The random split ensures that both groups are similar 1. Except for whether they saw the ad or not iii. Compare sales for the two groups iv. 1. There is something else in the Level of Yahoo! Usage that creates the likelihood to act on the ad v. Are people influenced by friends during product adoption? 1. a. Causality = Social Influence b. Some other demographic factor You adopt the iPhone in addition to your friends adopting the product c. I want to create two groups of people, where in one group the cause is true, and in the other group the cause if false i. Adoption and non-adoption vi. How do we test between these types of hypotheses? 1. Randomized Experiment a. Randomly assign users into two groups i. Group 1: sees ad ii. Group 2: does not iii. The only difference is here b. Compare results (here, sales) i. Does group 1 have higher sales vii. For a randomized experiment with friends adopting 1. For a randomized experiment a. Randomly split people into two groups i. Group 1: Friends adopt iPhone ii. Group 2: Friends to not adopt product 2. But we cannot force anyone’s friends to adopt a product a. A randomized experiment will not work g. Propensity Score Matching i. Find pairs of people 1. Whose friends are equally likely to adopt the iPhone 2. But one’s person’s friends did adopt 3. And the other’s did not 4. See if the first person adopts the iPhone ii. First person in the pair Group 1 iii. Second person in the pair Group 2 1. Difference from randomized experiment a. We could not force people to be in either group b. We selected them based on equal propensities i. Propensity are like identical twins, people who are likely identical in terms of their behavior h. Key point: if you want to test if A because B i. Do a randomized experiment if possible 1. Group 1: Force condition A to be true 2. Group 2: Force condition A to be false 3. Check if group 1 has more of condition B ii. If assigning people to groups is impossible 1. Do propensity score matching 4. Measuring Similarity a. Similarity Shows up in many situations i. Computing bridgeness of a link 1. Two nodes are connected by a bridge if their friend circles are not similar ii. Computing similarity between users for recommendations 1. Two users are similar if they liked similar products iii. Computing similarity between products 1. Two products are similar if the same set of users likes both b. Can measure this in multiple ways i. Jaccard Similarity 1. Useful when each link conveys only yes/no a. A movie is liked or not b. A product is bought or not c. Some person is a friend or not 2. o Correlation Similarity Useful when each link conveys ratings How much is a movie liked How close is a friendship What do we ant to find? We want to find deviations away from her mean score o Co-sign Similarity o Content Based Filtering (Multiple similarity measures) Similarity of keywords, similarity of directors between two movies ▯ ▯ Notes 20 ▯ Notes 19 1. Presentation Notes a. Tell us about the client b. How did you structure your campaign c. How did you start off the camping d. Money spent e. Targeting attributes f. Bidding g. What did you learn? How did you adapt your campaign h. These insights are the most interesting part i. Finish with a high level summary j. What worked as instead, what surprised you, what did you do differently 2. Influence or Homophily a. i. Influence 1. Viral marketing doesn’t matter o Homophily Viral marketing doesn’t matter o Summary Why should “friend adopts” depend on these factors? Perhaps we are friends because we share these factors, AND these factors cause production adoption separately in my friend and me. ▯ ▯ 1. Internet Advertising a. Multi-billion dollar industry, high growth i. Over $40 B last year (15% YoY growth) [IAB] ii. Overall advertising: $170B annually b. Why this will continue? i. Broadband is cheap ii. Mobile data is ubiquitous iii. Advertisers shifting dollars c. Differences from traditional marketing? i. Massive scale, automated, low marginal cost ii. Key: Monetize more and better, “learn from data” iii. New discipline “Computational Advertising” 2. Computational Advertising a. b. i. Four parties: User, Content Provider, Ad Network, and Advertisers 1. Ad Network might run an auction process 2. Ad Network and Content Provider could be the same (Yahoo, Facebook), so think of these as four roles 3. Each time you show an ad to the user counts as an “impression” 4. This is only half the picture: this shows how the ads flow through the system, but how does the money flow? 3. Advertising Revenue Models a. CPM i. Cost per mille 1. The cost per one thousand impressions ii. b. CPC i. c. CPA i. 4. Revenue Models Review a. Revenue models = who pays how much, and when? b. Who: the advertiser c. When does the advertiser pay? i. CPM è just on displaying the ad 1. Marginal costs for the ad network need to be extremely small to justify such low prices 2. ii. CPC è displaying the ad and generating a click 1. Advertiser pays for clicks 2. which are obviously not known beforehand 3. but the chance of clicks can be computed 4. Click-through rate (CTR) a. Probability that an impression will generate a click i. depends on the ad, the person seeing the ad, the content around the ad b. CTR will be higher when the ad is shown to a matching user c. Expected number of clicks = N * CTR 5. Technology to match users to ads becomes crucial a. Different ad networks could have different CTRs for the same ad! b. This is where Google thrives, for instance 6. iii. CPA è displaying the ad, generating a click, followed by an action 1. CPA: Budget = N * CTR * Conv. Rate * CPA d. Summary i. e. Pay how much? 5. Next topic a. 6. Advertising Setting a. Display i. Graphical display ads ii. Mostly for brand awareness iii. Revenue model is typically CPM iv. Examples 1. Netflix b. Content Match i. The user intent is unclear ii. Revenue model is typically CPC iii. Query (webpage content) is long and noisy iv. Match the context of the website v. Examples 1. You go to a website about photography and see adds related to photography c. Sponsored i. User “declares” his/her intention ii. Query is short and less noisy than Content Match iii. Click rates generally higher than for Content Match iv. Revenue model is typically CPC (recently some CPA) d. Summary i. Different revenue models 1. Depends on the goal of the advertiser campaign ii. Brand awareness 1. Display advertising 2. Pay per impression (CPM) iii. Attracting users to advertised product 1. Content Match, Sponsored Search 2. Pay per click (CPC), Pay per action (CPA) 7. Unified Marketplace a. Publishers, Ad-networks, advertisers participate together in a singe exchange b. Publishers put impressions in the exchange; advertisers/ad- networks bid for it c. CPM, CPC, CPA are all integrated into a single auction mechanism d. There are four parties in Internet Advertising i. e. However, the marketplace is fragmented i. Bing, Google, and Facebook each have their own ad networks ii. A content producer must decide which ad network to use 1. How good and relevant are the ads? 2. Share of the advertising revenues? iii. Advertisers have a similar decision 1. Which ad network to use? Employ people to handle each? 2. How much advertising dollars for each network? f. All parties participate in the auction together i. Brings transparency and value ii. But considerable resistance from current ad networks 8. Homework 2 a. Hands on experience with i. A popular advertising platform ii. Social advertising iii. Audience targeting iv. Bid optimization v. Creating ad campaigns b. Deliverables i. Report is due on April 4 th ii. 10-minute group presentations on April 4 and 6 th 1. Groups will be randomly selected to these dates 2. Come prepared to speak on April 4 th c. 10% of the assignment grade will be based on attendance on both days d. More info i. Find someone who already has a Facebook page e. Experimenting with different options i. Targeting attributed 1. What is the right audience for the ads? ii. Bid optimization 1. pay for clicks, pay for conversions/likes? iii. social advertising 1. The ad says “your friends x, y, and z like this page” iv. Ad creative 1. What keywords/ad-text will work best? 2. Different background images? v. Custom audiences 1. Target users who also like this other related page f. Goal: Lots of likes on the FB page without spending too much i. Structure your ad campaign 1. First try with broad targeting 2. Then slowly zoom in on a demographic that seems to give conversions at lower cost 3. Spread it out over the two weeks g. What we want to see i. Interesting targeting that is relevant to the business ii. Evidence that you tried to experiment with at least a few options iii. Evidence that you tried to optimize for cost 1. Exam Structure a. Conceptual Questions i. Only based on material in the slides ii. Give an answer, and a concise explanation 1. Even if the question is multiple choice b. Analytic Questions i. Give the answer ii. And show the calculation steps c. Gephi i. Show the final PNG ii. Briefly mention the steps you took 2. Quick Tour a. Patterns in Social Networks i. What is triadic closure? 1. My friends tend to know each other ii. What is strong triadic closure 1. My strong ties always know each other a. Either by a strong tie or by a weak tie b. Weak ties may or may not know each other i. 1. Bridge nodes will always be a weak node in a community a. If the bridge was a strong tie there should be a strong tie to the other parts of the network i. If a bridge has a strong tie it is no longer a bridge 2. Bridges bring in new information to the network iii. How do we measure triadic closure 1. Clustering coefficient a. Count the number of pairs of close friends, and then what ratio of those friends know each other or have the triadic closure iv. How do we measure bridge-ness of a link? 1. Jaccard similarity a. Take the node and look at all of their friends, look at the number of friends that are jointly friends of both and divide it by the total number of friends between each other i. Overlap of friends / Total number of friends 1. If the number is low it means they share little and means it is a bridge link b. Also Katz Score 2. What you care about a. Bridge 3. (or alternative similarity measures as well) v. Small World (6 Degrees of Separation) 1. Everyone’s connected by short paths a. Six degrees of separation b. We would say “diameter of the social network = 6” i. Number of steps needed to reach from person A to any other person in the network 2. People can find these paths a. Without look at the entire social network vi. How do we end up with a small-world network 1. Strong ties local connectivity a. If I want to reach someone in Austin, you would probably use strong ties 2. Weak ties bridges connect to far-off communities a. If you want to reach someone across the globe, you would use the bridge ties 3. You need both weak and strong ties to have a small world vii. Homophily (People of the same type flock together) 1. People of the same type stick together 2. Checking if homophily is present is a three-step process a. What is the network (who is friends with whom), and look and see if two groups stick with their own groups i. Actual Data: Write down the table of contents ii. Random Baseline: Create a table of expected counts if there had been no homophily iii. Chi-Square Test: Check if actual is different from random 1. Test of independence, checking if being a democrat or republican is independent of being associated with democrats/republicans a. Small value means (check slide) b. Large value means (check slide) viii. Long Tail 1. There are many songs/products/cities that are rare individually 2. But significant when viewed in aggregate a. Add up all the rare people in the network/songs/products to make the long tail 3. Not a bell curve but a power law a. i. Top Few Users on Twitter get the bulk of the followers on the twitter ii. In aggregate 80% of the followers follow 20% of the users 4. Online companies usually capture the long tail market share b. Reputation/Information Asymmetry i. You need a reputation system to combat asymmetric information 1. Car salesmen knows more than the car buyer 2. Person buying insurance knows more than the company selling insurance ii. The market price must be such that (Lemons – Slides at the end) 1. The party with more information always makes a profit 2. The party with less information expects to make a profit 3. There may be no such price depending on the probability of the outcomes iii. The find this market price: 1. Build the table of values and the fractions of product types a. How much value does a buyer and seller attach to a product i. Buyers hopefully attaches higher value than seller iv. Draw the price bar 1. Draw lines for the side with more information 2. What happens if the side with less information offers these prices a. Compute fractions of product types at this price b. Compute expected values c. Profit? i. 1. Compute the value of the car based on the likelihood of what type of car the buyer will receive and the value of that car c. Communities i. The best community is the clique ii. Generalizations of the Clique: 1. K-Core: everyone knows at least K others in the community a. Large K-Core is good 2. N-Clique: everyone can reach everyone else in the community in N steps a. Low N-Clique number of steps required i. d. Clustering Methods (Mechanism for coming up with these communities) i. Max-flow min-cut 1. Think of network as pipes 2. Find bottlenecks in the flow of water bridges a. Bottlenecks are bridges and if I remove the bridges I will get two separate communities b. Works very well if there are only two communities 3. Create a source and a sink ii. Hierarchical clustering (Jaccard or Katz) 1. Find pairwise similarities 2. Connect most similar nodes 3. Iterate a. i. Based on Jaccard or Katz connect the nodes 1. Pair wise or Pair-wise way of connecting 4. Slice the communities at a certain level and get your communities iii. Betweenness 1. Compute betweenness for each edge a. Betweenness = how many pairs of shortest paths use that edge 2. Remove the edge with the highest betweenness 3. Iterate until the group splits into communities iv. Modularity 1. “more edges within community, few edges across communities” a. Used one particular formula to figure out if it is true or not b. Look at slides e. Crowdsourcing i. Build a quality table for each worker 1. How? Using a golden set a. ii. Get answers from multiple workers iii. Combine them using odds ratios 1. Multiply the odds of “yes” for each worker; and odds from prior 2. Odds > 1 “yes” more likely 3. Odds = 1 can’t decide between “yes” or “no” a. f. Centrality i. Are node popularities most important? 1. Degree centrality ii. Are shortest paths very important? 1. Short paths to everyone? a. Closeness centrality 2. Being on short paths between everyone else? a. Betweenness centrality iii. Is being connected to the important people most important 1. Pagerank g. Epidemics i. When does someone adopt a product? 1. When some fraction of his/her friends do so? a. Linear Threshold i. Everyone has a threshold, once that threshold is broken I will adopt the product, otherwise I wont 2. If each friend tries to influence him/her probabilistically a. Once? (Singular Product) i. Independent Cascade 1. Same as linear threshold but with probabilistically b. Repeatedly? (Repeated Products – iPhone) i. SIR and SIS models 1. Adopting new versions of the product h. Gephi i. Load in graph (directed? Undirected?) ii. Layout via force atlas iii. Statistics Average Path Length 1. Betweenness centrality 2. Diameter iv. Modularity 1. Communities v. Filters 1. Remove low-degree nodes if network looks too messy vi. Set node colors and sizes using these… vii. Export figure 1. Outline a. Epidemics b. Models of Influences i. Bass Model 1. Works at the population level 2. Some innovators who take the risk of adopting the product 3. Some imitators who wait to adopt until everyone else ii. Linear Threshold Model 1. Social Network Structure a. Who will adopt the product next iii. Independent Cascade Model 1. Social Network Structure a. Who will adopt the product next iv. SIR and SIS models 1. Infected then lose interest, then you are infected again when you are drawn back in a. SIS Model i. Infection never dies off and it is a good model for products that have many variations 1. New models over time c. Examples of Diffusion i. Corporate Governance ii. Rumors cascades 2. Bass Diffusion Model a. Matches adoption patterns for products (e.g., IBM mainframes) b. However, this does not consider the social network structure c. It can tell you aggregate statistics of product adoption d. but not whom to target with advertising or discounts i. 1. Word-of-mouth is pretty darn important, as compared to innovators. Makes sense; there are many many new products each year. Why not let the innovators weed it out for me. On the other hand, innovators are influencers, and hence get rewarded for taking the risk… 3. REVIEW SLIDES FOR OTHER 3 MODELS 4. Diffusion in corporate governance a. Hostile takeovers came to the forefront in 1970s to 1980s i. In response, management adopted countermeasures 1. Golden Parachutes a. Executives get significant payouts in case of merger/takeover b. “Change-of-control-benefits” 2. Poison Pills a. Measures that make it difficult for any one party to amass a large stake i. If there is a hostile takeover, a very bad action will occur, every employee gets 10x the severance package b. Adoption Rate of Two options i. ii. Poison Pills started later, but grew exponentially 1. Reason for starting later is easy a. Questionable legality b. Delaware Supreme court gave its blessing November 1985 c. Poison pills are of questionable legality in many countries, but allowed if used “proportionately” in Delaware. iii. Reason for exponential growth 1. Company boards look at what other companies do 2. Directors with experience of poison pills at other companies share experience 3. Board-to-board diffusion è Poison pills “spread” via the interlocking network of directors who serve on multiple boards iv. Different process for parachutes 1. Geographic diffusion è a. Firms adopted it to the extent other firms in their metropolitan area did so 2. Parachutes were considered “naked self-interest” of executives a. Boards didn’t want to adopt it b. Knowledge of parachutes spread among CEOs i. who met each other on the golf course è geographical diffusion c. Rumor Cascades i. FB reshares of Cabela store receipt attributing extra sales tax to “Obamacare” ii. Tendrils often stop after getting snoped (red dots) 1. Someone starts a rumor. Others add link to snopes.com debunking rumor iii. Does pointing to an authoritative source kill a rumor? 1. Authoritative source has little long-term effect 2. High virility can overcome temporary setbacks a. Even though some reshares are deleted b. Others who didn’t see the snope link continue resharing iv. Then what can be done? 1. Competing viral rumors (often humorous) v. Diffusion and epidemic processes seek to explain 1. exponential product adoption, 2. followed by slower spread vi. Basic word-of-mouth model 1. I adopt a product once enough of my friends do so vii. Such processes are important for 1. spreading critical information between companies (via boards) 2. viral marketing campaigns 3. rumors 1. Exam 1 a. Bring Laptops i. Upload the assignment back to canvas ii. Open notes iii. Open laptop iv. Use Google if you want v. Gephi (might be) b. Format 2. Popularity in the Network a. N-Degree High = A lot of followers b. Who is following, how important are they? i. Centrality 3. Centrality a. Who’s the most important person in a social network i. Creating lists of “top users you should follow” ii. Promotions targeted at key “influencers” iii. Identifying ringleaders of illegal groups iv. Finding weak spots of computer networks b. This is called “centrality” i. And there are many ways to measure it 1. a. We’ll look at several basic types of network structures that will help us distinguish between the various centrality measures c. Types of Centrality i. Degree centrality ii. Closeness centrality iii. Betweenness centrality iv. PageRank d. Who’s central in a social network? i. The most connected individual 1. Degree is a direct measure of popularity 2. So many folks want to connect to this person a. They must know something ii. But this can be misleading (Degree Centrality) 1. With Degree Centrality you only know that everyone in central because they have the same measure of degree centrality a. 4. Degree Centrality a. Normalized Degree Centrality i. Degree / (N-1) 1. N = number of nodes in the network 2. Degree = number of connections a single node has to other nodes a. Always between 0.0 and 1.0 i. Take a measure, any measure, and you normalize it to be within a standardized range b. How well centralized is the entire network? i. Measures homogeneity of network ii. Is it worthwhile to look for high-degree individuals c. Variance of degrees i. In excel: VAR(degrees) 1. High VAR (Variance) means that the network is not homogenous 2. Low VAR means that the network is homogenous a. People all have similar degrees d. Freeman’s Formula i. 1. 0 = homogenous 2. 1 = non-homogenous a. One person that is super central b. Everyone else is not central ii. Hub and Spoke Network 1. iii. Ring Network 1. iv. Other networks (? / Line Network) 1. a. Degrees almost equal i. Low centralization scores v. Degree Centrality 1. Problems a. Logical measure: just look at a person’s neighbors b. Ignores the big picture 5. Closeness Centrality a. Degree centrality how many people am I close to? i. “Close” means “directly connected b. Closeness centrality how close am I to everyone else? i. Broader definition of “closeness” 1. Closeness Centrality Equation a. 1 / (sum of distances to everyone else) 2. Normalized Closeness Centrality Equation a. Closeness centrality * (N – 1) = 1/12 * 6 = 0.5 c. Example i. d. How does it do on our example? i. Hub and Spoke 1. The “hub” of the network gets maximum score a. Just as in degree i. 1. CC = 1 / (1+1+1+1+1+1+1) = 1/7 2. Normal CC = 1/7 * (8-1) = 1 ii. Ring 1. a. Everyone’s the same on the ring network i. Symmetry iii. Line Network (Closeness Centrality) 1. Closeness Centrality a. = 1 / (sum of distances to everyone else) b. 2. Line Network (Harmonic Centrality) a. = sum of ( 1 / distance ) i. distance = distance to central node b. = 0.33 +.05 + 1.0 + 1.0 + 0.5 + 0.33 = 366 6. Betweenness Centrality a. Betweenness centrality how important am I to others? b. Betweenness centrality = number of shortest paths that go through you i. 1. We’ve seen “edge betweenness” earlier; this is node betweenness. This models information flows, people trying to communicate quickly. ii. More generally 1. If there are 5 shortest between Alice and Bob, and Charlie is on 3 of them, then this adds 3/5 = 0.6 to Charlie’s Betweenness a. 2. Highest value of betweenness a. You are on every shortest path between the remaining N-1 people b. E.g. hub of the star network 3. Normalized Betweenness a. = Betweenness / [(N-1)(N-2) / 2] i. N = number of nodes in the network 7. Page Rank a. Originated with measuring centrality on the Web b. Webpages link to each other i. Pages A and B are “voting” for C ii. In degree of C = total number of votes iii. But A and B could be voting for others too iv. Then do their votes count as much? c. Simple idea: Everyone gets a total vote of 1.0 i. Weight of A’s vote for C = 1.0 / out- degree of A d. But A itself could be more important than B i. A could be google.com ii. B could be somerandomdude.com e. Better idea: i. Weight of A’s vote for C = (Importance of A) / (Out- Degree of A) 1. Importance of C = [(Importance of A) / (Out-Degree of A)] + [(Importance of B) / (Out-Degree of B)] f. Equation i. Importance of node = sum of (Importance of follower) / (Out- degree of follower) ii. Problem: Spam Rings 1. a. Infinite importance g. Final Equation i. 1. Importance = PageRank a. Many important people voting (linking) to your website 8. Summary a. Degree i. Me and my friends b. Closeness i. How close am I to everyone else c. Betweenness i. How many use me to reach the rest of the network as the shortest path 1. Network flow structures (If this person left the network what happens) d. PageRank i. How many people vote for me 1. Most spam resistant 9. Which centrality is best? a. All centrality measures have proponents b. What are the basic properties that every measure should satisfy? i. Consider a network with one clique of K people and cycle of P people 1. Network is disconnected a. c. Three Theories/Principles to Test Good Centrality Measures i. Size axiom 1. For any k-clique a. The p-cycle people have higher centrality for large enough p 2. For any p-cycle a. The k-clique people have higher centrality for large enough k 3. Theory (Size axiom) a. For any clique, I can find a big enough ring that the ring is more central b. For any ring, I can find a big enough clique that is more central ii. Density Axiom 1. a. All other things equal, two rings, equal number of rings i. If you have a few extra connections, then you are going to have higher centrality measures because you are more dense 2. P-Clique is more central than p-cycle in this network iii. Score-monotonicity axiom 1. Adding any link pointing to a person increases his/her centrality iv. Summary 1. Three Axioms a. Size axiom i. A big enough cycle can dominate any clique ii. A big enough clique can dominate any cycle b. Density axiom i. People in denser sub-networks have higher centrality c. Score-monotonicity axiom i. An extra link to me increases my centrality 10. Do our measures satisfy these axioms a. Closeness Centrality i. 1 / (sum of distances to everyone else) b. Variant: Harmonic Centrality i. = sum of 1 / distance 11. Summary a. Which measure to use depends on what you want to do with the centrality measure 1. Introduction a. The history of computation i. started with humans ii. Orbit of Halley’s comet, 1758 1. 3 astronomers sharing the labor iii. Astronomical almanac, 1760 1. Computations done twice 2. and compare by a third verifier iv. Logarithmic tables, 1794 1. Many workers knowing basic arithmetic 2. Detailed task workflows v. Main ideas 1. Division of labor 2. Lots of workers 3. Process Management and QA vi. Now, once again, we need human computers 1. Find good tags for images 2. Determine relevance of search result 3. Catch inappropriate content 4. Protein folding 5. Problems that computers aren’t good at yet vii. Human computation 1. Computing work done jointly by humans and computers 2. each doing the parts they are best at viii. Crowdsourcing 1. Distribution of tasks across human agents a. Human computation is typically thought of as a symbiotic relationship between computers and humans. Crowdsourcing focuses only on splitting up work among many humans, and managing the entire process. http://en.wikipedia.org/wiki/Human- based_computation ix. The ESP Game 1. Guess words that describe a picture 2. Another unknown player does the same at the same time 3. Success if a guess of yours matches a guess of his 4. a. Why does the ESP game work? i. Why does the ESP game work? 1. It’s fun! ii. What will you guess? iii. Words that are likely to be guessed by the other person iv. Words that 1. best describe the picture, AND 2. are common words 3. è often best tags for the image v. Incentives of users aligned with the desired output 1. Fun: Points, sense of accomplishment, sense of connection to random other user, … But you won’t get the unique tags (e.g., this is a “rocket”, but not a “Atlas V rocket”) b. Amazon Mechanical Turk i. 2. Outline a. Intro to crowdsourcing i. What is it? ii. Who are the crowd? b. Quality Assurance i. Managing output quality ii. Workflow and scaling c. Case: TopCoder 3. Who are the crowd? a. Mechanical Turk: i. Mostly American (47%) or Indian (34%) ii. In the US, mostly female; in India, mostly male 1. US: Supplementary income for stay-at-home parent 2. India: More significant source of income iii. Gender: 1. US: mostly female 2. India: mostly male iv. Both biased towards 1. younger workers 2. single 3. unemployed or employed part-time 4. Managing Output Quality a. Individual workers could have different qualities i. How can we assure quality of final result? 1. Example: rate website as a. b. How can we assure quality of final result? i. Workers could be spammers! 1. Random results 2. Core difficulty of crowdsourcing ii. Is there a way around spammers? 1. Quality through redundancy iii. How can we detect spammers? 1. Quality through a golden set iv. Spammers could be maximizing the number of human intelligence tasks they accomplish per day, and hence maximizing revenue. We will see how to estimate if someone is a spammer or not a little bit later 5. Quality through redundancy a. Get multiple workers to rate the same item b. Combine their votes i. How? ii. Combine multiple votes: Majority votes 1. Works best when workers are on similar quality and at least 51% correct iii. What if every work has different quality? iv. c. What is every worker has a different quality (This is an adult website) i. 2 workers ii. Each workers must vote “yes” or “no” iii. Suppose I know how often each worker is correct iv. Worker 1 says “yes”, worker 2 says “no” v. Odds of “yes” for worker 1 = 0.55 / 0.45 = 1.22 vi. Odds of “yes” for worker 2 = 0.15 / 0.85 = 0.17 vii. viii. 6. Quality through the Golden Set? a. Golden set i. A set of questions for which we already know the correct answer ii. Ask workers for their votes on the golden set iii. Use this to compute the worker accuracies iv. What would the worker quality of a spammer look like? 1. 7. Creating a big golden set is expensive a. i. If there are enough good-quality workers 1. we can have them label more questions 2. and use this feedback workflow to better estimate their qualities 3. A big golden set is not needed 8. Scaling up crowdsourcing a. Many labeling tasks are too big even for crowdsourcing i. Label many websites as adult or not b. From crowdsourcing to human computation i. Human computation = combining man and machine… ii. Man è labels some example websites iii. Machine è learns from these examples è and automatically labels new websites c. First-Cut Solution i. Use crowdsourced labels to build an automatic model ii. Use automatic model on future problems iii. d. An iterative approach is better i. Automatic model answers questions when it is confident ii. Otherwise it asks humans iii. and improves its model iv. 1. Fewer questions need human involvement a. Lower cost b. But these are the harder questions 9. TopCoder a. Info i. Finding qualified programmers is time consuming ii. Programmers’ skill sets become obsolete quickly iii. High levels of turnover (“48% every 3 years”) 1. iv. TopCoder used programming competitions with prizes to maintain a skilled community v. Connect talent with companies seeking to hire 1. vi. Platform Manager works with client to interface with community vii. Many software components can be reused in new projects 1. a. The change they made TopCoder is going to build the software for them b. TopCoder build reusable machine learning components viii. Clients pay monthly platform fees ix. Risk of Program Manager breaking the entire network 1. a. Reduces expense of platform managers i. Heavier load on client companies; competitions for all parts of software development from conceptualization to coding to bug testing. ii. Differences from Mechanical Turk? b. Two types of contests i. Algorithm design 1. Difficult, fun, coolness-factor 2. Helps build database of components 3. Attracts new members 4. Maintains community ii. Software development 1. Series of contests from conceptualization to coding to bug testing 2. Allows for member specialization 3. Brought in Revenue c. Three types of incentives i. Prizes for the top 2 submissions 1. Direct monetary reward 2. 80% of the reward went to the top 5% 3. Need to be the best ii. Monthly bonuses 1. Top 5 competitors per contest awarded points 2. Bonus based on monthly aggregate of points 3. Need to compete a lot iii. Profile and ratings 1. Bragging rights 2. Useful for programming career d. Community actions i. Reviews of submissions provide feedback ii. Community forum 1. Less-experienced members can ask question 2. Instant feedback 3. TopCoder acts as a learning experience 4. Critical to engage the long tail iii. Community gathering in Las Vegas 1. Top guys invited 2. Builds a special sub-community 3. Maintains this core community 4. Critical for revenue e. What’s in it for the clients? i. Cheap, compared to hiring/firing ii. Big bang for your buck, if you can engage top members iii. Risks? 1. No in-house knowledge; many are involved, no one is committed 2. Can every project be split into 2-week contests? 3. Sudden failures? 4. Can the client easily hire programmers? f. What’s in it for the programmers? i. Earn money while sitting at home ii. Pick and choose projects iii. Risks? 1. No security; race to the bottom for cash rewards 2. No explicit rewards for being a good community member 3. Gaining new skills difficult while making money difficult 4. In any move, you must be near the top to gain credit g. What’s in it for TopCoder? i. Revenue stream of large workforce ii. without corresponding recurring costs iii. Risks? 1. Too dependent on the top 5% (or maybe the top 0.5%) 2. What if someone keeps poaching them? 3. They hold the risks for delivering the software 4. Little in-house talent to make critical fixes at short notice 1. Why care about communities a. Opinion formation i. Politicized crowds on twitter b. Customer Engagement and Support i. Brand clusters and community forums on twitter c. If people are affected by their friends’ opinions (some support for this in some circumstances, but not always), then being in a tightly knit community may expose you to the same point of view over and over again. d. What determines tightly-knit community? i. Are the ties mutual? 1. 2. Max-Flow min-cut a. Pick a “source” b. And a “sink” c. Think of the links as water pipes, with fixed carrying capacity d. All nodes are junctions of pipes e. Send as much water flow from sources to sink i. “maximum flow” f. The bridge links become the bottleneck i. “minimum cut” 1. If source and sink in separate clusters, a. then flow is limited by bridges b. removing this minimum cut can create good communities g. Nice intuitive method of finding “cuts” in the graph i. Problems? 1. Min-cuts need not be “balanced” 2. The “source” and “sink” must be in different communities 3. What if there are more than two communities ii. Advantage 1. Works on directed graphs 3. Hierarchical clustering a. Compute similarity weights between all pairs of people i. Number of paths between the two people 1. More the number of paths, more similar/connected they are 2. But should long paths count for as much as short paths ii. All paths, but weigh shorter paths more (Katz Score) Good measure of similarity 1. Similarity between two nodes depends mainly on the number of short paths between them 2. a. No direct edge (Path length 1) b. Edges of Length two – 3 c. Length 3 paths – 4 3. All paths, but weigh shorter paths more a. We use a discount factor of .8 i. Smaller factor means we care more about short paths b. Any discount factor < 1.0 works c. Small factor shorter paths are more important b. Compute similarity weights between all pairs of people i. Start with all nodes disconnected ii. Connect people with the highest similarity weight 1. Iterate a. iii. A “slice” of this hierarchy gives a clustering of people 1. How can we choose the right slice a. Slice at the green line and see where the communities fall i. No easy answer ii. Pick the slice that gives the desired number of communities iv. Problems 1. Nodes on the periphery typically have small similarity weight 2. Such nodes go into isolated communities of their own 3. Though they should belong to the closest community a. Isolated nodes are very common in actual networks. Still this isn’t a huge problem because at high levels of the hierarchy, they should merge with their true cluster. b. 4. Why are the president and instructor so close in hierarchical clustering a. Both have high degree b. Both are connected to all the bridges i. Many paths between them c. More paths connect them to each other than to their students 4. Betweenness Centrality (Clustering) a. If there is a particular edge which a lot of people use to get their shortest paths i. Find bind links, and remove them ii. Keep iterating iii. Until the network breaks into two communities b. How bridge-like is a link? i. But bridges are the links that connect communities so to get the bridges we need the communities ii. And for the communities, we need the bridges? c. How bridge like is a link? i. What makes the middle link special? ii. Removing it would split the graph into two pieces iii. Could this be a definition of a bridge 1. Not robust… such bridges are rare in social networks iv. Think of the network as the road network? 1. Links represent interstates 2. Removing I-35 would make some people’s commute longer v. Whose commute becomes longer? 1. Folks for whom I-35 lies on the shortest path 2. Betweeness of I-35 = number of pairs of nodes for whom I-35 lies on the shortest path 3. Find shortest path for each pair of nodes a. Compute betweenness for all links b. What if it is on one of 2 shortest paths? Consider this as 1/2 cost vi. Betweenness Clustering 1. Find link with highest betweenness (they exist on many different shortest path), and remove it 2. Re-compute betweenness scores (they may change!) 3. Keep iterating until graph splits into communities vii. Problems 1. Must compute betweenness between all pairs 2. Must re-compute after every iterations 3. Considers only shortest paths 5. Modularity a. If I gave you a cluster, can you measure how good it is? i. Want more links within cluster ii. Only a few links crossing the boundary iii. “More within, few across” can we make this precise iv. E within 1. Will be higher if we have lots of edges in the community v. E expected 1. Nice network, e within will be higher than e expected 2. vi. Compute the difference between E within and E expected 1. Multiple links within the network creates a modular network 2. More links across clusters increases the “E expected” and decrease modularity overall 3. Positive Number a. Modular community vii. Good Clustering 1. A community that has high modularity a. Try to merge communities viii. Major Advantage 1. A community that has more links within itself than links across is a modular community ▯ Summary We saw several community detection methods Max-flow min-cut o Which nodes are the bottleneck Hierarchical clustering o Start with individual nodes Similarity between nodes Closeness of nodes Betweenness clustering o Start off with entire network and then try and split it into two Remove the bridge Do this again and again until you split the network into two Modularity o Try to come up with a measure for a good community o Random Community the measure will be ZERO o Modularity that is high is a good cluster General Notes o Are shortest paths important? Then betweenness o Do you already know some members of each cluster? May max-flow min-cut o Modularity is a good general-purpose solution o There are many other methods as well. 1. Why care about communities? a. Opinion formation i. Possible for each community to have a separate opinion ii. People in each community tend to hear the same viewpoint from their friends 1. If people are affected by their friends’ opinions (some support for this in some circumstances, but not always), then being in a tightly knit community may expose you to the same point of view over and over again. b. Opinion formation i. Communities may converge on one product (iPhone or Android) ii. or one investment philosophy (real-estate only, index funds only) c. Customer Engagement i. A forum where users of a product can exchange ideas, workarounds, fixes ii. Businesses can learn about impending product issues 1. How worried are Blizzard users about game cheats and hacks d. Customer Support i. A forum also offers a controlled environment where customers can interact with business representatives 1. Intuit’s Turbotax product has a community for asking questions, where both other users and Intuit representatives can respond e. Customer Analytics i. Also enables measurement of splits in user community ii. and isolated sub-groups 2. What determines a tightly-knit community? a. Are the ties mutual? i. Reciprocity implies closer connections 1. On Twitter, reciprocal ties è real-life friendships ii. Unreciprocated ties could be customer → celebrity/authority 1. Perhaps the authority holds the community together? b. How close are the members? i. Everyone knows everyone is best, but unlikely ii. Everyone knows everyone through a friend iii. Generally, short paths between members c. How close and densely-connected are the members? i. Everyone should have connections to at least k others ii. Avoids a single point of failure d. Is the community separated from the rest of the world? i. Group members should have relatively more connections within the community than outside it e. What determines a tightly-knit community? i. Are the ties mutual? ii. Are the members close to each other iii. Are they densely-connected? iv. Is the community separated from the rest of the world? 3. What are the common community patterns? a. Clique i. Must have 1. Everybody know everybody 2. Most densely connected 3. Can have overlapping cliques ii. But not robust 1. Even a single missing edge disqualifies the community 2. So typically only have small cliques not interesting b. K-Core i. Everybody know at least K others in the group 1. At least one node knows exactly k others ii. Every clique is K-Core iii. Loser version of a clique 1. iv. As we go from large K to small K 1. The community structures becomes weaker 2. But we can find bigger communities a. K-cores offer this tradeoff between within- community linkage and community size. i. c. n-clique i. Another generalization of cliques ii. Everyone knows everybody, within n hops 1. At least one pair of nodes is exactly n hops away a. d. What are the common community patterns i. Clique ii. K-clique iii. N-clique 1. General Information a. Members close to each other n-clique b. They have densely-connected K-Core c. Ideally you want both d. The community is separated from the rest of the role e. The first 2-clique has “strong ties” only, while the second one has only “weak ties” (though in this case, topologically they are identical) 4. Twitter Topic Communit
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'