Group Project Specification
300958 Social Web Analytics
Part A Due Date: Friday of Week 10
Part B Due Date: Friday of Week 13
1 Aim
2 Method
3 Group Size and Organisation
4 Due Date and Submission
5 Report Format
6 Marks
7 Declaration
8 Project Description
1 Aim
The Group Project provides us with a chance to analyse the Social Web using knowledge obtained from this unit, with assistance from a computer-based statistical package. For this project, we will focus on identifying a chosen company’s Twitter image.
2 Method
To complete this project:
Read through this specification.
Form a group and register your group using the Project Groups section of vUWS.
Choose a company that is active on Twitter and check that it is not already on the list of Group Project Twitter Handles. Then submit the Twitter handle of the company using the same link. Note that a given company cannot be allocated to more than one group. If duplicate company names are found on the list, the group with the later time stamp will be asked to find a new company.
Complete the data analysis required by the specification.
Write up your analysis using your favourite word processing/typesetting program, making sure that all of the working is shown and presented well. Include all the R code along with its output in your assignment.
Include the student declaration text on the front page of your report. Please make sure that the names and student numbers of each group member are clearly displayed on the front page. If a group member did not contribute to any part of the project, do not put their name on the cover (no contribution means 0 marks).
Submit the report as a PDF by the due date using the Submit Group Project link. Put more detailed screenshots of your code in the Appendix of the assignment, and include comments in the code to explain what you tried to do.
3 Group Size and Organisation
Students in groups of size 3 or 4 are to work together to complete this project. One project report is to be submitted per group.
The group must be formed by signing up to a group within the Project section of 300958 in vUWS. Lone submissions will be awarded 0 marks.
Groups must be formed by week 7. Once the group is formed, one person should be nominated within the group to be responsible for submitting the report.
4 Due Date and Submission
The project report Part A is due by 11:59 p.m. on the Friday of Week 10. The project report Part B is due by 11:59 p.m. on the Friday of Week 13. The report must be submitted as a PDF file using the assignment submission facilities in the Project section of 300958 in vUWS. Only one student from each group needs to submit the assignment.
5 Report Format
Once the required analysis is performed by the group, the members of the group are to write up the analysis as a report. Remember that the assessor will only see the group’s report and will mark the group’s analysis based on that report. Therefore, the report should contain a clear and concise description of the procedures carried out, comments on the code, explanations of what you tried to do, the analysis of results and any conclusions reached from the analysis.
The required analysis in this specification covers the material presented in lectures and labs. Students should use the computer software R to carry out the required analysis and then present the results from the analysis in the report.
6 Marks
This project is worth 30% (Part A 19% + Part B 11%) of your final grade, and so the project will be marked out of 30. The project consists of four investigations (three sections in Part A, one section in Part B) and will be marked using the following criteria:
Marks (Part A)    Criteria Satisfied
7 marks           First section completed correctly.
5 marks           Second section completed correctly.
6 marks           Third section completed correctly.

Marks (Part B)    Criteria Satisfied
9 marks           Fourth section completed correctly.
Part A: There is one mark allocated to presentation (based on the report formatting, style, grammar, clarity and mathematical notation).
Part B: There are two marks in total: one allocated to the conclusion and one allocated to presentation (based on the report formatting, style, grammar, clarity and mathematical notation).
If a report is submitted late, the maximum mark it can achieve will be reduced by 10% per day.
7 Declaration
The following declaration must be included in a clearly visible and readable place on the first page of the report.
By including this statement, we the authors of this work, verify that:
We hold a copy of this assignment that we can produce if the original is lost or damaged.
We hereby certify that no part of this assignment/product has been copied from any other student’s work or from any other source except where due acknowledgement is made in the assignment.
No part of this assignment/product has been written/produced for us by another person except where such collaboration has been authorised by the subject lecturer/tutor concerned.
We are aware that this work may be reproduced and submitted to plagiarism detection software programs for the purpose of detecting possible plagiarism (which may retain a copy on its database for future plagiarism checking).
We hereby certify that we have read and understand what the School of Computing and Mathematics defines as minor and substantial breaches of misconduct as outlined in the learning guide for this unit.
Note: An examiner or lecturer/tutor has the right not to mark this project report if the above declaration has not been added to the cover of the report.
8 Project Description
PART A (due Week 10, Friday 11:59 pm)
A company is investigating its public image and has approached your team to identify what the public associates with the company name. The company wants three pieces of analysis to be performed in your first report.
8.1 Analysing the source of the tweets
In this section, we want to find out which sources people use when tweeting about the company.
Use the search_tweets function from the rtweet library to search for 750 tweets about the company you selected. Save these tweets as “tweets.about”.
Examine the source column to see the source of the tweets. Find out how many different levels of source exist in your tweets.
Obtain a vector of frequencies of each different source.
Create a data frame to save this information, where the first column holds the source names and the second column holds the source counts.
List the 10 most frequent tweet sources and draw a bar plot of the frequencies of these top 10 sources. Make sure each bar is labelled with the name of its source.
Comment on the bar plot.
The company owner claims that Twitter users are equally likely to use ‘Twitter for iPhone’, ‘Twitter for Android’ and ‘Twitter Web Client’ when they post a tweet about the company. Use your tweet sample to test at a 5% level of significance whether this claim is true. (Hint: First find the frequencies of these sources in your data frame and save these counts in a vector, then apply the appropriate statistical test.)
Comment on your findings.
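As a rough illustration of the steps in this section, the following R sketch counts sources, builds the data frame, draws the bar plot and runs a test of the "equally likely" claim. The source labels and counts are invented stand-ins for the source column of tweets.about, and the chi-squared goodness-of-fit test is one candidate for the "appropriate statistical test", not a prescribed answer.

```r
# Hypothetical source labels standing in for the source column of tweets.about.
tweet.sources <- rep(
  c("Twitter for iPhone", "Twitter for Android", "Twitter Web Client", "TweetDeck"),
  times = c(210, 180, 150, 60)
)

# Number of distinct sources in the sample.
n.sources <- length(unique(tweet.sources))

# Frequency of each source, most frequent first.
source.table <- sort(table(tweet.sources), decreasing = TRUE)
source.df <- data.frame(
  source = names(source.table),
  count = as.vector(source.table),
  stringsAsFactors = FALSE
)

# Bar plot of (at most) the ten most frequent sources, one labelled bar each.
top10 <- head(source.df, 10)
barplot(top10$count, names.arg = top10$source, las = 2,
        ylab = "Number of tweets")

# Counts for the three sources named in the claim.
claimed <- c("Twitter for iPhone", "Twitter for Android", "Twitter Web Client")
observed <- source.df$count[match(claimed, source.df$source)]

# Chi-squared goodness-of-fit test; H0: each source has probability 1/3.
gof <- chisq.test(observed, p = rep(1 / 3, 3))
print(gof)

# TRUE means the claim is rejected at the 5% level of significance.
reject.h0 <- gof$p.value < 0.05
```

With these invented counts the expected count is 180 per source, so the test statistic is 10 on 2 degrees of freedom and the claim would be rejected. Your own sample may well lead to a different conclusion.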
8.2 Word-cloud of the company tweets and public tweets
In this section, we want to visualize the similarity between the company tweets and the public tweets, as well as the language used in the tweets.
Download the last 750 tweets from the chosen Twitter handle’s timeline, and save as tweets.company.
After pre-processing the tweets:
Construct a document term matrix of TFIDF weights of the tweets.company.
Construct a document term matrix of TFIDF weights of the tweets.about.
Construct word clouds of the words in tweets.about and tweets.company. Comment on both word clouds.
Combine (merge) tweets.about with tweets.company and construct the document-term matrix of the merged tweets using TFIDF weighting.
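To make the TFIDF weighting concrete, here is a minimal base R sketch on a toy corpus. The tweet texts are invented, and the weighting used (raw term frequency multiplied by log(N / document frequency)) is only one common variant. Text mining packages may normalise term frequencies or use a different logarithm base, so their numbers can differ from these.

```r
# Hypothetical, already pre-processed tweet texts.
about.text <- c("great service great price", "slow service")
company.text <- c("great price")

# Merge the two sets, remembering which tweets came from the company.
merged.text <- c(about.text, company.text)
is.company <- rep(c(FALSE, TRUE),
                  times = c(length(about.text), length(company.text)))

# Term-frequency matrix: one row per tweet, one column per term.
tokens <- strsplit(merged.text, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))
tf <- t(vapply(tokens,
               function(tk) as.numeric(table(factor(tk, levels = terms))),
               numeric(length(terms))))
colnames(tf) <- terms

# Inverse document frequency: log(N / number of tweets containing the term).
idf <- log(nrow(tf) / colSums(tf > 0))

# TFIDF weights: scale each column of tf by that term's idf.
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))

# Total weight per term, a possible input for word sizes in a word cloud.
term.weight <- sort(colSums(tfidf), decreasing = TRUE)
```

Terms that occur in every tweet receive a weight of zero under this variant, which is why very common words contribute little to the resulting matrix.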
8.3 Connection between public and the company
In this section, we want to categorize (cluster) all the tweets and want to determine which topics are dominated by public tweets.
Compute the most appropriate number of clusters using the elbow method for the merged tweets you calculated in question 11.
Cluster the merged tweets using the most appropriate clustering method.
Visualize your clustering in 2-dimensional vector space. Show each cluster in a different colour, and show the tweets in tweets.about and tweets.company with different symbols in your visualization.
Comment on your visualization.
Compute the proportion of tweets.company in each cluster. Print these proportions for all clusters.
Which cluster is dominated by tweets.about? Print the top 20 words in that cluster and comment on the theme of this cluster.
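The clustering steps above could be sketched as follows. This is an illustration under assumptions: the matrix X is simulated data standing in for the merged TFIDF matrix, k-means on Euclidean distance is used as one possible clustering method, and classical multidimensional scaling is used as one possible way to obtain a 2-dimensional view. The variable names are hypothetical.

```r
set.seed(1)

# Simulated stand-in for the merged TFIDF matrix: 60 "tweets", 5 "terms",
# generated as three well-separated groups of 20 rows each.
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 5),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 5),
           matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 5))

# Hypothetical flag: TRUE for company tweets, FALSE for public tweets.
is.company <- rep(c(TRUE, FALSE), length.out = nrow(X))

# Elbow method: total within-cluster sum of squares for k = 1, ..., 8.
wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")

# The bend in the curve suggests the number of clusters (3 for this data).
K <- 3
fit <- kmeans(X, centers = K, nstart = 10)

# Project to two dimensions with classical multidimensional scaling.
xy <- cmdscale(dist(X), k = 2)

# Colour shows the cluster; symbol shows company (triangle) or public (circle).
plot(xy, col = fit$cluster, pch = ifelse(is.company, 17, 1),
     xlab = "Dimension 1", ylab = "Dimension 2")

# Proportion of company tweets in each cluster.
prop.company <- tapply(is.company, fit$cluster, mean)
print(prop.company)
```

A cluster whose proportion of company tweets is well below one half would be a candidate for a cluster dominated by public tweets.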
The company wants the above three parts of analysis to be written up as a professional report as the first deliverable. Each part should have its own section of the report and all questions should have thoughtful answers. Include all the code along with its output in your assignment.
Note: Save your data in your local drive, as you need it in Part B.
PART B (due Week 13, Friday 11:59 pm)
8.4 Network of the Tweets
In this section, we want to examine the network of tweets about the chosen company and the most influential users in the company’s network.
Compute the cosine similarity matrix from the merged tweets. Save it as “Cos”.
Examine the retweet count of each tweet and sketch the histogram of retweet counts. Count data is most likely to be skewed. If it is skewed, take the logarithm of the retweet counts and plot the histogram. (Note that taking the logarithm of zero creates a problem; eliminate zeros in a sensible way before taking logs.)
Use matrix “Cos” to create a graph of the tweets (not words) satisfying the conditions below:
Use the logarithm of retweet counts to set the size of each vertex (tweet) in your tweet network, so that vertex size increases with the tweet’s retweet count (the more a tweet is retweeted, the bigger the vertex is).
The graph is a weighted graph; make sure edge thickness shows the weights (similarity) between tweets.
Visualize and comment on the graph. Note that you cannot visualize 1500 tweets; you need to take a subgraph in a sensible way!
The PageRank algorithm can also be used to find the most influential tweets in a graph. Use the PageRank algorithm to find the top 10 most influential tweets in the network.
Examine the user names of the influential tweets and save 10 distinct user names of the most influential tweets you found using the PageRank algorithm. Save them as users.
An influential user tends to have a lot of followers. Find the number of followers for all users in your users variable.
Calculate the total number of tweets of each of the users to show how active a user is.
Based on the influence ratio and activity measure, draw a scatter plot of the users (vertical axis is the activity measure; horizontal axis is the influence ratio). Label each user in your scatter plot.
Comment on the scatter plot.
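The core computations in this section (cosine similarity, log-transformed retweet counts and PageRank) can be sketched in base R as below. Everything here is illustrative: the small matrix M and the retweet counts are invented, adding 1 before taking logs is only one way of handling zeros, and the damping factor of 0.85 is an assumed, commonly used value. A graph package could be used instead of the hand-written power iteration.

```r
# Hypothetical TFIDF rows for four tweets (rows) over three terms (columns).
M <- rbind(c(2, 1, 0),
           c(1, 1, 0),
           c(0, 1, 1),
           c(0, 1, 2))

# Cosine similarity: dot product of each pair of rows divided by the
# product of their lengths.
row.norm <- sqrt(rowSums(M^2))
Cos <- (M %*% t(M)) / (row.norm %o% row.norm)

# Hypothetical retweet counts. Adding 1 before taking logs keeps tweets
# with zero retweets; dropping them is another option.
retweets <- c(0, 3, 12, 150)
log.retweets <- log(retweets + 1)

# Weighted adjacency matrix: similarities as edge weights, no self-loops.
A <- Cos
diag(A) <- 0

# PageRank by power iteration on the weighted graph.
n <- nrow(A)
P <- A / rowSums(A)   # row i holds transition probabilities out of tweet i
d <- 0.85             # damping factor (assumed value)
pr <- rep(1 / n, n)
for (i in 1:100) {
  pr <- (1 - d) / n + d * as.vector(t(P) %*% pr)
}

# Tweets ordered from most to least influential.
top.tweets <- order(pr, decreasing = TRUE)
print(round(pr, 4))
```

In this toy example the two middle tweets share terms with every other tweet and so receive the highest scores. The values in log.retweets could then be used to scale vertex sizes when the graph is drawn.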
Conclusion
By combining all this information (both in Part A and Part B), what can we say about the Company’s image on Twitter? Draw a conclusion from your report.
The company wants the above two parts of analysis to be written up as a professional report as the second deliverable. All questions should have thoughtful answers and be clearly labelled. Include all the code along with its output in your assignment.