
library(tidyverse)
library(skimr)
library(lubridate)
library(janitor)
library(readr)
library(gt)
library(scales)
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual ridersAlthough the pricing flexibility helps Cyclistic attract more customers, your manager believes that maximizing the number of annual members will be key to future growth.
Analyze the historical bike data of Cyclistic to identify trends and derive insights to help marketing team with new marketing strategy.
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
Lily Moreno: The director of marketing and your manager.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
From January 2021 to December 2021, a total of 12 datasets have been made accessible for each month. Each dataset contains the specifics of every ride registered by clients of Cyclistic . This data has been cleaned to remove rider’s personal information before being made public.
We have 13 columns and we can infer their content:
ride_id: Id for each trip taken, as of now we are not sure if they are unique or not, we have to find out
rideable_type: Represents the type of a bike
started_at: Date and time of the start time
ended_at: Date and time of the end time
start_station_name: Name of the starting station
start_station_id: Id of the starting station
end_station_name: Name of the ending station
end_station_id: Id of the ending station
start_lat: Latitude of the starting point
start_lng: Longitude of the starting point
end_lat: Latitude of the ending point
end_lng: Longitude of the ending point
member_casual: Represents the membership status
ds1 <- read.csv("202101-divvy-tripdata.csv")
ds2 <- read.csv("202102-divvy-tripdata.csv")
ds3 <- read.csv("202103-divvy-tripdata.csv")
ds4 <- read.csv("202104-divvy-tripdata.csv")
ds5 <- read.csv("202105-divvy-tripdata.csv")
ds6 <- read.csv("202106-divvy-tripdata.csv")
ds7 <- read.csv("202107-divvy-tripdata.csv")
ds8 <- read.csv("202108-divvy-tripdata.csv")
ds9 <- read.csv("202109-divvy-tripdata.csv")
ds10 <- read.csv("202110-divvy-tripdata.csv")
ds11 <- read.csv("202111-divvy-tripdata.csv")
ds12 <- read.csv("202112-divvy-tripdata.csv")
Merge all data frames into one big data frame
df <- rbind(ds1,ds2,ds3,ds4,ds5,ds6,ds7,ds8,ds9,ds10,ds11,ds12)
started_at and ended_at from Character to Datetimedf$started_at <- ymd_hms(df$started_at)
df$ended_at <- ymd_hms(df$ended_at)
ride_length, day_of_week, Month, Day to data for smoothing the analysis process.df1 <- df %>%
mutate(ride_length = difftime(ended_at,started_at,units = "mins")) %>%
mutate(day_of_week = weekdays(as.Date(df$started_at))) %>%
mutate(Month = month(df$started_at,label = TRUE)) %>%
mutate(Day = day(df$started_at))
ride_length value is negative.df1 %>%
count(ride_length<0) %>%
gt()
| ride_length < 0 | n |
|---|---|
| FALSE | 5594916 |
| TRUE | 147 |
Yes, there are some rows with negative ride_length and because of that we have to remove these rows as it can be a hindrance in the analysis process.
df1 <- df1[!(df1$ride_length< 0),]
ride_length from Factor to numeric so we can run calculations on the data.df1$ride_length <- as.numeric(as.character(df1$ride_length))
skim_without_charts(df1)
| Name | df1 |
| Number of rows | 5594916 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| factor | 1 |
| numeric | 6 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 5594916 | 0 |
| rideable_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 690789 | 0.88 | 3 | 53 | 0 | 847 | 0 |
| start_station_id | 690786 | 0.88 | 3 | 36 | 0 | 834 | 0 |
| end_station_name | 739149 | 0.87 | 10 | 53 | 0 | 844 | 0 |
| end_station_id | 739149 | 0.87 | 3 | 36 | 0 | 832 | 0 |
| member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
| day_of_week | 0 | 1.00 | 6 | 9 | 0 | 7 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Month | 0 | 1 | TRUE | 12 | Jul: 822397, Aug: 804323, Sep: 756111, Jun: 729590 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 42.07 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -87.52 |
| end_lat | 4770 | 1 | 41.90 | 0.05 | 41.39 | 41.88 | 41.90 | 41.93 | 42.17 |
| end_lng | 4770 | 1 | -87.65 | 0.03 | -88.97 | -87.66 | -87.64 | -87.63 | -87.49 |
| ride_length | 0 | 1 | 21.94 | 178.34 | 0.00 | 6.75 | 12.00 | 21.78 | 55944.15 |
| Day | 0 | 1 | 15.45 | 8.79 | 1.00 | 8.00 | 15.00 | 23.00 | 31.00 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2021-01-01 00:02:05 | 2021-12-31 23:59:48 | 2021-08-01 01:48:51 | 4677893 |
| ended_at | 0 | 1 | 2021-01-01 00:08:39 | 2022-01-03 17:32:18 | 2021-08-01 02:18:49 | 4671279 |
We can see right immediately that there is an outlier data under ride_length, with a maximum trip time of 55944 minutes, or about 38 days. This might be due to a data error or the fact that the individual rented it for 38 days.
gt(head(df1))
| ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual | ride_length | day_of_week | Month | Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E19E6F1B8D4C42ED | electric_bike | 2021-01-23 16:14:19 | 2021-01-23 16:24:44 | California Ave & Cortez St | 17660 | NA | NA | 41.90034 | -87.69674 | 41.89 | -87.72 | member | 10.4166667 | Saturday | Jan | 23 |
| DC88F20C2C55F27F | electric_bike | 2021-01-27 18:43:08 | 2021-01-27 18:47:12 | California Ave & Cortez St | 17660 | NA | NA | 41.90033 | -87.69671 | 41.90 | -87.69 | member | 4.0666667 | Wednesday | Jan | 27 |
| EC45C94683FE3F27 | electric_bike | 2021-01-21 22:35:54 | 2021-01-21 22:37:14 | California Ave & Cortez St | 17660 | NA | NA | 41.90031 | -87.69664 | 41.90 | -87.70 | member | 1.3333333 | Thursday | Jan | 21 |
| 4FA453A75AE377DB | electric_bike | 2021-01-07 13:31:13 | 2021-01-07 13:42:55 | California Ave & Cortez St | 17660 | NA | NA | 41.90040 | -87.69666 | 41.92 | -87.69 | member | 11.7000000 | Thursday | Jan | 7 |
| BE5E8EB4E7263A0B | electric_bike | 2021-01-23 02:24:02 | 2021-01-23 02:24:45 | California Ave & Cortez St | 17660 | NA | NA | 41.90033 | -87.69670 | 41.90 | -87.70 | casual | 0.7166667 | Saturday | Jan | 23 |
| 5D8969F88C773979 | electric_bike | 2021-01-09 14:24:07 | 2021-01-09 15:17:54 | California Ave & Cortez St | 17660 | NA | NA | 41.90041 | -87.69676 | 41.94 | -87.71 | casual | 53.7833333 | Saturday | Jan | 9 |
analysis1 <- df1 %>%
group_by(member_casual,day_of_week) %>%
summarise(Mean_ride_length = mean(ride_length), Median_ride_length = median(ride_length) ,Max_ride_length = max(ride_length),Min_ride_length = min(ride_length), count = n())%>%
gt()
analysis1 <- data.frame(analysis1)
analysis1 %>%
group_by(member_casual) %>%
gt()
| day_of_week | Mean_ride_length | Median_ride_length | Max_ride_length | Min_ride_length | count |
|---|---|---|---|---|---|
| casual | |||||
| Monday | 31.87545 | 15.950000 | 31681.650 | 0 | 286373 |
| Tuesday | 27.97233 | 14.283333 | 38922.917 | 0 | 274388 |
| Wednesday | 27.65731 | 13.966667 | 38963.083 | 0 | 278948 |
| Thursday | 27.70326 | 13.783333 | 49107.150 | 0 | 286064 |
| Friday | 30.34860 | 14.966667 | 55691.683 | 0 | 364075 |
| Saturday | 34.70623 | 17.816667 | 55944.150 | 0 | 557994 |
| Sunday | 37.56658 | 18.716667 | 53921.600 | 0 | 481104 |
| member | |||||
| Monday | 13.24753 | 9.200000 | 1499.950 | 0 | 416204 |
| Tuesday | 12.78812 | 9.133333 | 1499.950 | 0 | 465509 |
| Wednesday | 12.81916 | 9.216667 | 1499.967 | 0 | 477156 |
| Thursday | 12.77618 | 9.133333 | 1499.950 | 0 | 451520 |
| Friday | 13.32492 | 9.433333 | 1499.950 | 0 | 446423 |
| Saturday | 15.26457 | 10.816667 | 1559.933 | 0 | 433041 |
| Sunday | 15.65794 | 10.866667 | 1499.933 | 0 | 376117 |
ggplot(data=df1,aes(x=member_casual,y = prop.table(stat(count)),fill = member_casual,label = scales::percent(prop.table(stat(count)))))+
geom_bar(position = "dodge") +
geom_text(stat = 'count',position = position_dodge(.9),vjust = -0.5,size = 3) +
scale_y_continuous(labels = scales::percent)+
labs(title = "Categories of customer",subtitle = "Casual v/s Member",x="Customer Type",y="Percent",caption = "Figure 1")+
theme(
plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)
ggplot(data = df1)+
geom_bar(mapping = aes(x=member_casual,fill = day_of_week))+
facet_grid(day_of_week~Month)+
theme(legend.position = "bottom")+
labs(title = "Usage by Customers",subtitle = "Casual v/s Member",caption = "Figure 2")+
theme(
plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)
analysis3 <- df1 %>%
group_by(Month,member_casual) %>%
summarise(Number_of_Riders = n(), Average_Ride_Time = mean(ride_length))
analysis3 %>%
ggplot(aes(x = Month, y=Number_of_Riders,group = member_casual,colour = member_casual)) +
geom_point() +
geom_line() +
labs(title = "Line Graph",subtitle = "Casual v/s Member",y="Number of Riders",caption = "Figure 3")+
theme(
plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)
ggplot(data = analysis1,aes(x = day_of_week, y = Mean_ride_length, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Average Ride Length",subtitle = "Casual v/s Member",x="Weekday",y="Average Ride Length",caption = "Figure 4")+
theme(
plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)
Top10 <- df1 %>%
group_by(member_casual) %>%
count(start_station_name,sort = TRUE) %>%
slice(1:10)
Top10 <- data.frame(Top10)
Top10 %>%
filter(!start_station_name == "") %>%
ggplot(aes(x=n,y=start_station_name,fill = start_station_name))+
geom_bar(stat='identity')+
facet_wrap(~member_casual)+
theme(legend.position = "none",axis.title.x = element_blank(),axis.title.y = element_blank())+
labs(title = "Most Busiest Stations",subtitle = "Casual v/s Member",caption = "Figure 5")+
theme(
plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)
There is no direct answer to the question How do we convert casual users to annual members? but here are some insights and suggestions which we can implement for next year 2022.
As we can see that in Figure 1, the number of rides taken by casual type customers was 45.2% and for member type customers it was 54.8% which means that rides taken by members is 9.6% more than the rides taken by casual customers.
By Figures 2 & 3 we can see that users touched their peak in the month of July-August if we see in terms of the month. The rides taken by casual users saw a peak at the weekends but for member type customers the usage was similar throughout the week.
Even when there are more member type customers than casual users but when it comes to Average Ride Length casual users ride more than 2 times what members ride, as we can see in Figure 4.
In Figure 5, we can also see the list of the Top 10 busiest stations for both casual and member type users and most of them overlap each other.
We can conduct marketing campaigns on those most popular stations and at the weekends because then it can target a larger audience.
We can design annual membership in a way that seems much more reliable than daily passes. In doing so we have to add some more perks, some gifts or something like that if you purchase our annual membership with a certain credit card, you can get some additional instant discount.
We can increase the price of a daily pass on weekends only because there are much more rides taken on weekends by casual users and casual users may find it feasible to purchase an annual membership rather than a daily pass.