Exploratory_Data_Analysis

Projects for Data Analyst Nanodegree with Udacity

Welcome!

Below are short descriptions of, and links to, the reports for the eight projects I completed as part of a two-term nanodegree in data analysis with Udacity (October 2017 - June 2018). The goals of the program are to learn how to clean data and how to create exploratory data analysis reports by uncovering patterns and insights, drawing meaningful conclusions, and clearly communicating critical findings.

This project presents an analysis of local and global temperatures over a period of more than 250 years. The global temperature trends are compared with local trends for Columbus (Ohio, US), Madrid (Spain) and Helsinki (Finland).

The presentation uses a Jupyter Notebook with a Python 2 kernel, on a Windows 7 Professional system. The first steps of exploratory data analysis are performed in Google Sheets. Further EDA is done with SQLAlchemy and Pandas, with Matplotlib for visualizations. Regression estimates are obtained with the linear model in sklearn.
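The regression step can be illustrated with a small, dependency-free sketch. It fits an ordinary least-squares line by hand instead of calling sklearn's linear model, so it is not the notebook's code; `linear_trend` is a hypothetical helper name and the yearly temperatures are made-up values, not data from the project.

```python
from statistics import mean


def linear_trend(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up yearly average temperatures in degrees Celsius (illustration only).
years = [2000, 2001, 2002, 2003, 2004, 2005]
temps = [8.0, 8.5, 8.2, 8.9, 8.6, 9.1]

slope, intercept = linear_trend(years, temps)
print(f"estimated trend: {slope:.3f} degrees per year")
```

Running the same fit on the global series and on each city's series gives slopes that can be compared directly, which is one way to contrast local and global warming rates.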

Link: Project1.Term1

2016 US Bike Share Activity Snapshot

This project is an exploratory analysis of data from Motivate, a bike-share system provider for many major cities in the United States. System usage is compared across three large cities: New York City, Chicago, and Washington (DC). Within each system, the differences between registered, regular users and short-term, casual users are also analyzed.

The project is done entirely in a Jupyter Notebook, using Python 3 and Matplotlib for visualizations. A notebook kernel is provided by Udacity, and it has the information required to complete the project. It also contains questions and several blocks of code that guide the analysis of the data.

Link: Project2.Term1

Investigate the TMDb movie dataset

The dataset analyzed here contains information about 10,000 movies collected from The Movie Database (TMDb) on Kaggle. The project investigates how various factors, such as budget, release time and genre, influence the revenue and the ratings of a movie. The dataset is assessed, cleaned and formatted. Exploratory data analysis is performed and visualizations are constructed.

The work is done in Jupyter Notebook, using a kernel provided by Udacity. The code is written in Python 3, using Pandas and NumPy.

Link: Project3.Term1

Analyze Experiment Results

Udacity provides a dataset reflecting the results of an A/B test run by an e-commerce website, together with a Jupyter Notebook containing various questions and required steps. The goal is to determine whether the company should implement the new page, keep the old page, or perhaps run the experiment longer before deciding. The analysis has three parts covering probability, A/B testing and regression.

The code is written in Python 3, using Pandas and NumPy. The visualizations are created with Matplotlib.
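As an illustration of the A/B testing part, the sketch below runs a one-sided two-proportion z-test using only the standard library. This is one common way to compare conversion rates and is not necessarily the procedure followed in the notebook; the function name `two_proportion_ztest` and all counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_old, n_old, conv_new, n_new):
    """One-sided z-test of H1: the new page converts better than the old page."""
    p_old = conv_old / n_old
    p_new = conv_new / n_new
    # Pooled conversion rate under the null hypothesis of equal rates.
    p_pool = (conv_old + conv_new) / (n_old + n_new)
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
    z = (p_new - p_old) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Made-up conversion counts (illustration only).
z, p_value = two_proportion_ztest(conv_old=1200, n_old=10000,
                                  conv_new=1260, n_new=10000)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.3f}")
```

A small p-value would favor implementing the new page; a large one suggests keeping the old page or collecting more data.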

Link: Project4.Term1

Test a Perceptual Phenomenon

The data is collected from testing the Stroop effect, a classic result in experimental psychology. After initial exploratory data analysis, two statistical tests are performed: a paired t-test for dependent means and an A/B test.

The work is presented in a Jupyter Notebook, using Python in conjunction with Pandas, NumPy and several other packages from SciPy. The visualizations are created with Matplotlib.
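The paired t-test mentioned above can be sketched as follows. The example computes only the t statistic and the degrees of freedom with the standard library; the helper name `paired_t_statistic` and the reaction times are made up for illustration and are not the project's data.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    # Work on per-participant differences, since the samples are dependent.
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    # stdev() is the sample standard deviation (divides by n - 1).
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Made-up reading times in seconds for the same five participants.
congruent = [12.0, 15.5, 10.2, 14.1, 11.8]
incongruent = [19.3, 22.0, 16.9, 21.4, 18.0]

t, dof = paired_t_statistic(congruent, incongruent)
print(f"t = {t:.2f} with {dof} degrees of freedom")
```

Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; SciPy's `scipy.stats.ttest_rel` returns both the statistic and the p-value for paired samples.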

Link: Project1.Term2

Explore and Summarize Data: Red Wines Analysis

Using R and exploratory data analysis techniques, a selected dataset is examined for distributions, outliers and anomalies. The dataset contains information about red “Vinho Verde” wine samples from the north of Portugal. The goal is to determine which physicochemical attributes are relevant to the quality of the wine. There are 1599 samples of wine and 13 attributes for each sample. Of these, 11 variables are based on physicochemical tests and one is the quality rating.

The project is written as a Markdown document in RStudio, and the analysis is performed in R.

Link: Project2.Term2

Wrangle and Analyze Data

Gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. Showcase the wrangling efforts through analysis and visualizations.

The wrangled dataset is an enhanced Twitter archive, provided by Udacity, of the user @dog_rates. This Twitter account, also known as WeRateDogs, rates people’s dogs with a humorous comment about the dog. The dataset contains basic tweet data for more than 2000 tweets. The tweet text was used to extract ratings, dog names and dog stages. The file was downloaded manually from Udacity’s website.
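To give a sense of how a rating might be pulled out of tweet text, here is a minimal sketch using a regular expression. The pattern, the helper name `extract_rating` and the sample tweets are assumptions made for illustration; they are not the method used to build the archive.

```python
import re

# A numerator (optionally with decimals), a slash, then an integer denominator.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) for the first rating-like match, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


# Made-up tweet texts (illustration only).
samples = [
    "This is Bella. She is a very good girl. 13/10 would pet",
    "Here we have a majestic floof. 11.5/10",
    "No rating in this one",
]
for text in samples:
    print(extract_rating(text))
```

A pattern like this also matches fractions that are not ratings, such as "24/7", which is one reason extracted values need to be assessed and cleaned afterwards.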

The Twitter archive is enriched with two more datasets. The first is obtained by querying the Twitter API with Python’s Tweepy library and contains each tweet’s retweet count and favorite (“like”) count. The second holds the results of running a neural network on the Twitter archive to predict what breed of dog is present in each tweet. This file is hosted on Udacity’s servers and was downloaded programmatically using the Requests library.

The file wrangle_report describes in detail the wrangling efforts which are contained in the Jupyter notebook wrangle_act. These are performed in Python 3, using Pandas and NumPy. The file act_report.pdf communicates the insights and displays the visualizations produced from the wrangled data.

Create a Tableau Story - PISA 2012

A data visualization is created in Tableau from a dataset, telling a story or highlighting trends or patterns in the data. The project reflects the theory and practice of data visualization, harnessing visual encodings and design principles for effective communication.

PISA is an international survey of students’ skills and knowledge in reading, mathematics and science as they approach the end of compulsory education. In 2012, around half a million students from more than 60 countries took this complex set of tests. In this report, PISA 2012 data is used to investigate the differences in achievement in mathematics tests based on location, gender and student attitudes.

The design of the Story is based on two factors: the main message/idea of the story (as described in the summary) and the possible audience for the presentation. It is assumed that the audience has some basic statistical knowledge and some interest in or experience with mathematics education.

The data cleaning and preparation are done in Python 3 using Pandas and NumPy: PISA 2012_dataWrangle

An outline of the story: Tableau Story Outline

The results and the analysis are visually presented in a: Tableau Story