This is my capstone project for the Google/Coursera Data Analytics course and my first original R Markdown file (http://rmarkdown.rstudio.com).
I have made a separate RMD that is geared toward the client presentation and does not include the many notes about the process that I thought would be helpful to fellow students in the course, as this one does. In this personal version I have also included extra code and plots to aid understanding of the process that are not needed in the actual proposal. I made both documents without any help beyond the course material and online documentation (which is wonderful and extensive). If you are reading this document as a potential employer, I hope you can see my keen interest in data analytics and data science and my ability, albeit nascent, to pursue answers in the data and present them in an interesting way. This document is a work in progress; I wanted it in some form so I could finish the course and move on to other learning opportunities.
That file is entitled: Bike Analysis with Proposal 1.0
(Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this license.) This is public data that you can use to explore how different customer types are using Cyclistic bikes. But note that data-privacy issues prohibit you from using riders’ personally identifiable information. This means that you won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.
On viewing the data you can see that many fields are empty, but they are confined to one variable, and substantial progress toward the business goal can be made without this data, so it does not present a problem. There is not a lot of metadata about the dataset, but in these circumstances we will assume that it is an accurate record of the company’s customers and bikes over the period it was collected. It does not contain any personal identifiers, so there is no concern about compromising personal information when publishing results and little concern about protecting the data in storage.
Data Import
# add read and view to base r
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
#read in .csv file to dataframe
aug_df<-read.csv("bike_data/202207-divvy-tripdata.csv")
The data was searched for duplicates; none were found.
The data was checked for NA values in critical columns; none were found.
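These checks can be sketched as follows; this is a minimal sketch against the data frame loaded below (note that in this dataset missing station fields are empty strings rather than NA, so both are counted):

```r
# count fully duplicated rows and duplicated ride IDs
sum(duplicated(aug_df))
sum(duplicated(aug_df$ride_id))

# count NA values per column, then empty strings per column
colSums(is.na(aug_df))
colSums(aug_df == "", na.rm = TRUE)
```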
NOTE: During analysis the time length for each ride was calculated, and any entry with a length of 0 was dropped. These were only about 0.5 to 1% of rides in any one month. What they represent would have to be confirmed with the stakeholders in a real task, but in this simulation it was decided to simply remove them to avoid errors in calculations.
I initially shrank each month to 40K rows and used this subset to test formatting and consolidation procedures, then proceeded with the analysis after producing one dataset.
#make file shorter for inital analysis
short_month<-tail(aug_df, n = 40000)
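When moving from the one-month subset to the full year, the monthly files can be combined into a single data frame. A minimal sketch, assuming the monthly CSVs all sit in the bike_data folder and share the divvy naming pattern (the pattern is an assumption based on the file read above):

```r
# list all monthly trip files and stack them into one data frame
files <- list.files("bike_data", pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
all_trips <- do.call(rbind, lapply(files, read.csv))
```

This works because every monthly file has the same columns; rbind would fail loudly if a month's schema differed.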
Unneeded columns (station names and IDs) were removed from the dataset.
Check for duplicates using group_by and filter (technique adapted from Stack Overflow):
short_month %>%
group_by(ride_id) %>%
filter(n()>1)
## # A tibble: 0 × 13
## # Groups: ride_id [0]
## # … with 13 variables: ride_id <chr>, rideable_type <chr>, started_at <chr>,
## # ended_at <chr>, start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
#note column names
colnames(short_month)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
short_month <- select(short_month, -c(start_station_name, start_station_id, end_station_name, end_station_id))
As part of preparing the data several new columns were added:
ride time length
weekday
month
ride distance
Ride length - shows the time each ride took, calculated using difftime() (part of base R).
#add column for ride length
short_month <- short_month %>%
  mutate(ride_length = round(difftime(ended_at, started_at, units = "mins"), digits = 1))
Outliers (ride times over 500 minutes, generally only a handful per 10K rides) were dropped using the subset() function to make better visuals without losing integrity. Some of these gave times around 1500 minutes and represent cases where there is no end time. Stakeholders would have to supply their interpretation of what they represent, either an error in reporting or a bicycle never returned; for the purposes of this analysis these cases have been dropped.
First, count the proposed outliers with ride times of 500 minutes or more:
#count outliers: ride_length >= 500; these have no recorded end time
nrow(short_month[short_month$ride_length >= 500, ])
## [1] 36
The second outlier group was cases with a time of 0 minutes. Most likely the client signed up for the bike but then did not use it, or it could be a technical error.
#count outliers where time = 0, presumably the bike was checked out but not used
nrow(short_month[short_month$ride_length == 0, ])
## [1] 1329
Neither group was numerically significant, so both categories were dropped pending further information from the stakeholders.
#drop outliers
short_month <- subset(short_month, ride_length > 0)
short_month <- subset(short_month, ride_length < 500)
# removes only a small fraction of the month's rides
Adding weekday and month columns: the next part of the process was extracting the weekday and month into separate columns to help the analysis. The lubridate library was added to help with these tasks.
#make weekday column
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
short_month$weekday <- wday(short_month$started_at, week_start=1)
Some_date<-as.Date(short_month$started_at)
short_month$month <-months(Some_date)
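wday() with week_start = 1 returns a number (1 = Monday). For more readable plot axes it can instead return an ordered factor of day names; a small optional tweak (the weekday_name column is an addition of mine, not part of the original workflow):

```r
# labeled version: ordered factor Mon < Tue < ... < Sun
short_month$weekday_name <- wday(short_month$started_at,
                                 label = TRUE, abbr = TRUE, week_start = 1)
```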
Look at the Dataframe:
head(short_month)
## ride_id rideable_type started_at ended_at
## 783489 EC38A45E65305ADF classic_bike 2022-07-05 17:11:07 2022-07-05 17:13:53
## 783490 0188C58DEEC9318E electric_bike 2022-07-18 17:06:05 2022-07-18 17:08:25
## 783491 F8733C8A5B1505A7 classic_bike 2022-07-16 15:33:35 2022-07-16 15:43:36
## 783492 1341551AA941887C electric_bike 2022-07-02 09:39:14 2022-07-02 09:59:38
## 783493 7CF8CA0B1629E0FF classic_bike 2022-07-31 15:20:18 2022-07-31 15:26:29
## 783494 95A083EBA69382DC classic_bike 2022-07-31 15:47:14 2022-07-31 15:51:01
##        start_lat start_lng  end_lat   end_lng member_casual ride_length
## 783489  41.87832 -87.64098 41.88224 -87.64107        member      3 mins
## 783490  41.87814 -87.64117 41.88224 -87.64107        casual      2 mins
## 783491  41.88698 -87.61281 41.88096 -87.61674        casual     10 mins
## 783492  41.88700 -87.61281 41.88096 -87.61674        casual     20 mins
## 783493  41.88698 -87.61281 41.88096 -87.61674        casual      6 mins
## 783494  41.87832 -87.64098 41.88224 -87.64107        casual      4 mins
## weekday month
## 783489 2 July
## 783490 1 July
## 783491 6 July
## 783492 6 July
## 783493 7 July
## 783494 7 July
Some preliminary summary comparison stats:
table(short_month$member_casual)
##
## casual member
## 20917 19047
#compare general summary by member/casual
short_month %>%
group_by(member_casual) %>%
summarise_at(vars(ride_length), list(avg = mean,max = max, min=min))
## # A tibble: 2 × 4
## member_casual avg max min
## <chr> <drtn> <drtn> <drtn>
## 1 casual 17.73763 mins 486 mins 0 mins
## 2 member 12.75492 mins 480 mins 0 mins
General distribution of ride length by member status: looking at a box plot is a good way to see if there is a difference in the distribution of ride length between the two groups:
#try boxplot
ggplot(short_month, aes(x=member_casual, y=ride_length, fill=member_casual)) + geom_boxplot()
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
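This warning appears because ride_length is a difftime object; converting it to plain numeric minutes before plotting avoids it. A one-line fix (an optional cleanup step of mine):

```r
# convert difftime to numeric minutes so ggplot picks a continuous scale cleanly
short_month$ride_length <- as.numeric(short_month$ride_length)
```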
There is very little difference between the two groups and their distributions. There is a small upward shift for the casual rider group, but probably not the type of difference that can be capitalized on or that is useful in understanding how the two groups differ.
For another look, a histogram of each group was plotted side by side, again not showing a dramatic difference; both spread in a similar pattern over a similar range.
ggplot(data = short_month) +
geom_bar(mapping = aes(x = ride_length,fill=member_casual))+
xlim(0, 110)+
facet_wrap(~member_casual)
## Warning: Removed 275 rows containing non-finite values (`stat_count()`).
## Warning: Removed 3 rows containing missing values (`geom_bar()`).
Next, the day-of-week data was plotted using a bar chart with the two member types side by side. Again they show a similar pattern.
# Create a grouped bar graph
ggplot(short_month, aes(x=weekday, y=ride_length, fill=member_casual)) +
geom_bar(stat="identity", position=position_dodge())
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
ggplot(short_month, aes(x=ride_length, colour=member_casual)) + geom_density()
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
#needed for lat/long distance calculation
#install.packages("geosphere")
install.packages('geosphere', repos='https://rspatial.r-universe.dev')
##
## The downloaded binary packages are in
## /var/folders/9x/7qfgqxt90qsf0x1p7x6143dh0000gt/T//RtmpsEne70/downloaded_packages
library(geosphere)
library(dplyr)
#lat-long distance calculation: note 0 represents a return to the original spot
#first write a function:
calc_dist_lat_long <- function(lat1, lon1, lat2, lon2) {
  # distGeo takes (lon, lat) pairs and returns metres; divide for km
  lat_long_dist <- distGeo(p1 = c(lon1, lat1), p2 = c(lon2, lat2)) / 1000
  lat_long_dist <- round(lat_long_dist, digits = 2)
  return(lat_long_dist)
}
#now write a loop over the rows
for (row in 1:nrow(short_month)) {
  start_lat  <- short_month[row, "start_lat"]
  start_long <- short_month[row, "start_lng"]
  end_lat    <- short_month[row, "end_lat"]
  end_long   <- short_month[row, "end_lng"]
  # result is already rounded inside the function
  short_month$distance[row] <- calc_dist_lat_long(start_lat, start_long, end_lat, end_long)
}
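The row-by-row loop works, but it is slow on a full year of data. distGeo() is vectorized and accepts two-column lon/lat matrices, so the same distances can be computed in a single call; an equivalent sketch:

```r
# vectorized: one distGeo() call over all rows, result rounded km
short_month$distance <- round(
  distGeo(cbind(short_month$start_lng, short_month$start_lat),
          cbind(short_month$end_lng, short_month$end_lat)) / 1000,
  digits = 2)
```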
#count the number of returns and non-returns
short_month %>% count(distance==0.00)
## distance == 0 n
## 1 FALSE 33251
## 2 TRUE 6711
## 3 NA 2
library(dplyr)
short_month %>% group_by(member_casual,distance==0.0) %>% summarise(count_dist = n() )
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 3
## # Groups: member_casual [2]
## member_casual `distance == 0` count_dist
## <chr> <lgl> <int>
## 1 casual FALSE 16933
## 2 casual TRUE 3982
## 3 casual NA 2
## 4 member FALSE 16318
## 5 member TRUE 2729
#drop the two rows with NA distance (the comparison excludes NA)
short_month <- subset(short_month, distance >= 0)
#try boxplot
ggplot(short_month, aes(x=member_casual, y=distance, fill=member_casual)) + geom_boxplot()
# Create a grouped bar graph
ggplot(short_month, aes(x=weekday, y=distance, fill=member_casual)) +
geom_bar(stat="identity", position=position_dodge())
devtools::install_github("dkahle/ggmap", ref = "tidyup")
## Skipping install of 'ggmap' from a github remote, the SHA1 (2d756e5e) has not changed since last install.
## Use `force = TRUE` to force installation
library("ggmap")
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY")) # keep the key out of the document; never publish it
library(RgoogleMaps)
library(rstudioapi)
#chicago area map
mapchico<-ggmap(get_googlemap(center = c(lon = -87.60, lat = 41.90),
zoom = 10,
maptype = 'terrain',
color = 'color'))+
geom_point(data=short_month, aes(x=start_lng, y=start_lat, color=member_casual), size = .2)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=41.9,-87.6&zoom=10&size=640x640&scale=2&maptype=terrain&key=xxx
print(mapchico)