Assignment 2 Specifications
Description: Assignment 2 Written Practical Report
Marks out of: 100
Wtg (%): 30%
Word Count: 3000
Due date: 19/04/13
The key frameworks and concepts covered in modules 1–5 are particularly relevant for this assignment.
Assignment 2 relates to the specific course learning objectives 1, 2 and 4 and associated MBA program learning goals and skills: Global Content, Problem solving, Critical thinking, and Written Communication at level 3:
1. demonstrate applied knowledge of people, markets, finances, technology and management in a global context of business intelligence practice (data warehouse design, data mining process, data visualisation and performance management) and resulting organisational change and how these apply to implementation of business intelligence in organisation systems and business processes
2. identify and solve complex organisational problems creatively and practically through the use of business intelligence and critically reflect on how evidence based decision making and sustainable business performance management can effectively address real world problems
4. demonstrate the ability to communicate effectively in a clear and concise manner in written report style for senior management with correct and appropriate acknowledgment of main ideas presented and discussed.
Assignment 2 consists of two main tasks and a number of sub tasks.
Task 1 (Worth 60 marks) consists of the following sub tasks:
The sinking of the Titanic is a famous event. You may find it useful to research the facts surrounding the sinking of the Titanic to inform your understanding of the problem and ensuing interpretation of your data analysis of the factors determining the survival of passengers on the Titanic. Use the data mining tool RapidMiner to conduct an exploratory analysis of the titanic_train.csv data set which is provided on the CIS8008 S1 2013 course study desk.
1.1 You need to identify five key variables that contribute most to determining the survival rate of passengers on the ill-fated Titanic on its maiden voyage. Note you should also refer to the data dictionary provided with the titanic_train.csv file which describes each of the variables and their range of values.
(Hint: an exploratory analysis should be based on summary statistics, histograms, crosstab tables and scatterplots of individual variables, and on the relationship between individual variables and the target variable survived. Which variables are correlated with the target variable survived, and which with other variables?)
You might also need to consider reformatting some of the variables to facilitate the next stage of analysis of the titanic_train.csv and titanic_score.csv data sets using a Decision Tree (Hint: you will need to convert the survived variable to a nominal variable with the values Yes = 1, No = 0 in titanic_train.csv). See Data Mining for the Masses, Chapters 3 and 4, for guidance on exploratory data analysis using RapidMiner.
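The recoding mentioned in the hint is carried out in RapidMiner itself. Purely to illustrate the mapping (1 becomes Yes, 0 becomes No), a minimal Python sketch might look like the following. Python is not part of the assignment toolchain, and the sample rows and the helper name `recode_survived` are hypothetical.

```python
# Illustrative only: the assignment requires RapidMiner, not Python.
# Maps the numeric survived flag (1/0) to the nominal labels Yes/No.

SURVIVED_LABELS = {"1": "Yes", "0": "No"}


def recode_survived(rows):
    """Return copies of the rows with 'survived' recoded to Yes/No.

    Values other than 0 or 1 (for example blanks in a score file)
    become None so they can be reviewed rather than guessed.
    """
    recoded = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are not modified
        raw = str(row.get("survived", "")).strip()
        new_row["survived"] = SURVIVED_LABELS.get(raw)
        recoded.append(new_row)
    return recoded


# Hypothetical sample rows, not taken from titanic_train.csv.
sample = [
    {"pclass": 1, "survived": 1},
    {"pclass": 3, "survived": 0},
    {"pclass": 2, "survived": ""},
]
recoded_sample = recode_survived(sample)
```

Leaving unlabelled rows as None, rather than defaulting them to No, keeps a score file without survived values distinguishable from genuine non-survivors.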
Discuss each of your top five variables, and the results of your exploratory data analysis in RapidMiner more generally, including how you dealt with missing and unusual data. Inform your discussion with relevant supporting literature on the survival rate of passengers on the Titanic. Include appropriate statistical results, such as graphs and results tables from your exploratory data analysis in RapidMiner, together with some supporting references on exploratory data analysis using Decision Trees in data mining (about 1000 words).
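To make the idea of a crosstab and a missing-value count concrete, the sketch below computes both for a handful of invented records. It is an illustration in Python only, not a substitute for the RapidMiner analysis required here; the records and the function names `crosstab`, `survival_rate` and `missing_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical passenger records, NOT taken from titanic_train.csv.
# None marks a missing value.
passengers = [
    {"pclass": 1, "fare": 80.0, "survived": "Yes"},
    {"pclass": 1, "fare": None, "survived": "Yes"},
    {"pclass": 2, "fare": 13.0, "survived": "No"},
    {"pclass": 3, "fare": 7.5, "survived": "No"},
    {"pclass": 3, "fare": None, "survived": "Yes"},
    {"pclass": 3, "fare": 8.0, "survived": "No"},
]


def crosstab(rows, variable, target="survived"):
    """Count rows for each (variable value, target value) pair."""
    return Counter((row[variable], row[target]) for row in rows)


def survival_rate(rows, variable, target="survived"):
    """Share of 'Yes' outcomes within each value of the variable."""
    totals = Counter(row[variable] for row in rows)
    survivors = Counter(row[variable] for row in rows if row[target] == "Yes")
    return {value: survivors[value] / totals[value] for value in totals}


def missing_counts(rows):
    """Number of missing (None) entries per variable."""
    return Counter(
        key for row in rows for key, value in row.items() if value is None
    )


table = crosstab(passengers, "pclass")
rates = survival_rate(passengers, "pclass")
missing = missing_counts(passengers)
```

A variable whose survival rate differs sharply between its values is a candidate key variable; a variable with many missing entries needs an explicit handling decision before modelling.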
The following table lists the data dictionary for the data set titanic_train.csv.
(Note: titanic_score.csv is the same as titanic_train.csv but does not contain any values for the target variable survived, which is referred to as a label variable in RapidMiner.)
Variable    Description
pclass      Passenger Class (1 = 1st class; 2 = 2nd class; 3 = 3rd class)
survived    Survived (0 = No; 1 = Yes)
sibsp       Number of Siblings/Spouses Aboard
parch       Number of Parents/Children Aboard
ticket      Ticket Number
fare        Passenger Fare
embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
body        Body Identification Number
Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.
Age is in years; fractional if age is less than one (1). If the age is estimated, it is in the form xx.5.
Fare is in pre-1970 British pounds (£). Conversion factors: £1 = 20s = 240d and 1s = 12d.
With respect to the family relation variables (i.e. sibsp and parch), some relations were ignored. The following are the definitions used for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiancées Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. As well, some passengers travelled with very close friends or neighbours from a village; however, the definitions do not support such relations.
STORY BEHIND THE DATA: This dataset is based on the Titanic Passenger List edited by Michael A. Findlay, originally published in Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, and expanded with the help of the internet community.
1.2 Build a model for predicting the survival of passengers on the Titanic using a decision tree in RapidMiner (See Chapter 10 of Data Mining for the Masses textbook for guidance on Decision Trees in RapidMiner) using the two data sets, titanic3_train.csv and titanic3_score.csv. Then present and discuss the results of your Decision Tree analysis and a diagram showing your final Decision Tree. Comment on the relative predictive strength of this model and what you believe are the most significant variables that determined whether a passenger on the Titanic survived or not. Include some supporting references on using Decision Trees in data mining (about 1000 words).
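When commenting on which variables are most significant, it can help to recall how a decision tree ranks candidate splits. The sketch below computes information gain, one common split criterion, on a few invented records. It is a generic Python illustration and is not claimed to be how RapidMiner's Decision Tree operator is implemented or configured; the records and the function names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, variable, target="survived"):
    """Reduction in target entropy after splitting on the variable."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[variable], []).append(row[target])
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical records, not taken from the assignment data sets.
rows = [
    {"pclass": 1, "embarked": "S", "survived": "Yes"},
    {"pclass": 1, "embarked": "C", "survived": "Yes"},
    {"pclass": 2, "embarked": "S", "survived": "Yes"},
    {"pclass": 2, "embarked": "C", "survived": "No"},
    {"pclass": 3, "embarked": "S", "survived": "No"},
    {"pclass": 3, "embarked": "C", "survived": "No"},
]

gains = {name: information_gain(rows, name) for name in ("pclass", "embarked")}
best_split = max(gains, key=gains.get)
```

In this invented sample, pclass separates survivors from non-survivors far better than embarked, so a tree built on information gain would split on pclass first. Variables that appear near the root of your final tree can be read in the same way.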
1.3 Discuss how and why it is important that quality of information be incorporated into the design of a data warehouse architecture (500 words approx.).
Task 2 (Worth 30 marks) consists of the following sub tasks
Using the Excel data cube file sales.cub provided on the CIS8008 S1 2013 course study desk, produce the following four reports, each based on a pivot table and with an appropriate accompanying graph. Briefly comment on each report (about 125 words) in terms of the trends and patterns apparent in it. Note that you can complete this task in either Microsoft Excel using a pivot table or Tableau 7.0 Desktop.
The salesdata.cub file contains the following dimensions and information:
• Customer is categorised by country, state province, city, lastname
• Product is categorised by category, sub category and brand name
• Promotion by promotion name
• Store is categorised by store country, store state province, store city and store name
• Time is categorised by year, quarter and month
• Measures categorised by store cost, store sales, unit sales
2.1 Create a report and accompanying graph using a pivot table or Tableau 7.0 Desktop that lists the store values for each quarter of 1998 across country and state province and comment on key trends and patterns in this report (125 words approx.)
2.2 Create a report and accompanying graph using a pivot table or Tableau 7.0 Desktop that lists unit sales by country and state province for the product category and product sub category of breakfast foods, and comment on key trends and patterns in this report (125 words approx.)
2.3 Create a report and accompanying graph using a pivot table or Tableau 7.0 Desktop that lists for the states of Oregon (OR) and Washington (WA) the total sales and total sales value in dollar terms and comment on key trends and patterns in this report (125 words approx.)
2.4 Create a report and accompanying graph using a pivot table or Tableau 7.0 Desktop that lists by product categories of beer and wine and their product sub categories, by order of unit sales and comment on key trends and patterns in this report (125 words approx.)
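Each of the four reports above is an aggregation of a measure across one or more dimensions. The sketch below shows that operation on invented rows: it sums store sales by store country and state province (rows) across quarters (columns). It is a Python illustration only, since the reports themselves are to be produced in Excel or Tableau; the field names, values and the function name `pivot` are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows; the values are invented for illustration and
# are not taken from the sales cube supplied with the assignment.
facts = [
    {"store_country": "USA", "store_state": "WA", "quarter": "Q1", "store_sales": 120.0},
    {"store_country": "USA", "store_state": "WA", "quarter": "Q1", "store_sales": 30.0},
    {"store_country": "USA", "store_state": "WA", "quarter": "Q2", "store_sales": 80.0},
    {"store_country": "USA", "store_state": "OR", "quarter": "Q1", "store_sales": 50.0},
    {"store_country": "USA", "store_state": "OR", "quarter": "Q2", "store_sales": 70.0},
]


def pivot(rows, row_keys, column_key, measure):
    """Sum a measure into a {row_label: {column_label: total}} table."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        row_label = tuple(row[key] for key in row_keys)
        table[row_label][row[column_key]] += row[measure]
    # Convert to plain dicts so the result is easy to print or compare.
    return {label: dict(columns) for label, columns in table.items()}


report = pivot(facts, ("store_country", "store_state"), "quarter", "store_sales")
```

Reading across a row of such a table shows the trend over time for one state; reading down a column compares states within one quarter.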
Your assignment 2 report must be structured as follows, which is similar to the report structure detailed in Summers & Smith 2010:
1. Cover page for assignment 2 report
2. Introduction (250 words)
3. Body of report – main sections and subsections for the assignment 2 tasks and sub tasks, so Task 1 will be a main heading with an appropriate sub heading for each sub task, and so on
4. Conclusion (250 words)
5. List of References
Harvard referencing resources
Install a reference tool (for example, EndNote) which integrates with your word processor. These tools are a great help for referencing and citing sources in your assignments. For more information on how to get EndNote, visit the following webpage: http://www.usq.edu.au/library/services/staffresearch/endnote/ .
Originality of submission
• This assignment must be the expression of your own work. It is acceptable to discuss course content with others to improve your understanding and clarify requirements, but solutions to assignment questions must be done on your own. This also means that it is not sufficient to merely paraphrase the entire assignment content from a textbook or other source. Your assignment answers need to be a reflection and synthesis of your research of the associated topics. This should be evident in your annotated bibliography.
• You need to demonstrate your understanding of associated topics for each assignment. You must not copy from anyone, including tutors and fellow students, nor provide copies of your work to others. Assignments that do not adhere to this requirement will be deemed as being the result of collusion or plagiarism. This may lead to severe academic penalties as outlined in Academic Regulation 5.10 of the USQ Handbook. It is your own responsibility to ensure the integrity of your work. Refer to the Faculty of Business policy for further details.
• An indiscriminate overuse of incorrectly referenced or cited web pages in your assignment will result in poor marks.
Late submission of assignment work
Late assignments will be penalised unless permission has been granted by the course leader BEFORE the assignment due date. Late assignments may be penalised by reducing the allocated mark by 5% per day late.
a) Page size should be A4. Allow for TOP MARGIN – 2.5 cm; BOTTOM MARGIN – 2.5 cm; SIDE MARGINS – 2.5 cm
b) Major headings should be in upper case, 14 point Times New Roman Bold
c) Minor headings should be in lower case, 12 point Times New Roman Bold
d) Body text should be formatted to 12 point, Times New Roman
e) Use single or one and a half line spacing
f) Assignments should be in one of the following formats:
(1) Microsoft Word (2003)
(2) Microsoft Word (2007)
(3) Rich Text Format (RTF)
(4) Portable Document Format (PDF)
2) The file naming guidelines are as follows:
a) It is important that you use this convention to ensure that assignments can be tracked:
b) Replace ‘familyname’ with YOUR family or surname
c) Replace ‘studentnumber’ with YOUR student number (which starts with ‘00’)
3) BEFORE submitting your FINAL version of the assignment to the online submission system (EASE) you must upload your draft copies to the online TURNITIN system (see the course Study Desk for details on accessing and using the TURNITIN system).
a) TURNITIN produces an Originality Report that shows your overall similarity index. If the index is less than 25 percent, your assignment is considered to be plagiarism free. If not, you will have to rework your assignment, using the feedback in the Originality Report as a guide, and then resubmit it to TURNITIN for another round of checks.
4) Once you have obtained a plagiarism-free assignment you should upload your Originality Report along with the FINAL version of your assignment through the online assignment submission system (EASE). No hardcopy submissions will be accepted. See the course Study Desk for instructions on uploading your assignment.