Case Study : Cyclistic

library(tidyverse)
library(skimr)
library(lubridate)
library(janitor)
library(readr)
library(gt)
library(scales)

1. Describing Scenario

1.1 Background

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual ridersAlthough the pricing flexibility helps Cyclistic attract more customers, your manager believes that maximizing the number of annual members will be key to future growth.

1.2 Business Task

Analyze the historical bike data of Cyclistic to identify trends and derive insights to help marketing team with new marketing strategy.

1.3 Objectives

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

1.4 Stakeholders

Lily Moreno: The director of marketing and your manager.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

2. Preparation of Data

2.1 Information about dataset

From January 2021 to December 2021, a total of 12 datasets have been made accessible for each month. Each dataset contains the specifics of every ride registered by clients of Cyclistic . This data has been cleaned to remove rider’s personal information before being made public.

We have 13 columns and we can infer their content:

ride_id: Id for each trip taken, as of now we are not sure if they are unique or not, we have to find out
rideable_type: Represents the type of a bike
started_at: Date and time of the start time
ended_at: Date and time of the end time
start_station_name: Name of the starting station
start_station_id: Id of the starting station
end_station_name: Name of the ending station
end_station_id: Id of the ending station
start_lat: Latitude of the starting point
start_lng: Longitude of the starting point
end_lat: Latitude of the ending point
end_lng: Longitude of the ending point
member_casual: Represents the membership status

ds1 <- read.csv("202101-divvy-tripdata.csv")
ds2 <- read.csv("202102-divvy-tripdata.csv")
ds3 <- read.csv("202103-divvy-tripdata.csv")
ds4 <- read.csv("202104-divvy-tripdata.csv")
ds5 <- read.csv("202105-divvy-tripdata.csv")
ds6 <- read.csv("202106-divvy-tripdata.csv")
ds7 <- read.csv("202107-divvy-tripdata.csv")
ds8 <- read.csv("202108-divvy-tripdata.csv")
ds9 <- read.csv("202109-divvy-tripdata.csv")
ds10 <- read.csv("202110-divvy-tripdata.csv")
ds11 <- read.csv("202111-divvy-tripdata.csv")
ds12 <- read.csv("202112-divvy-tripdata.csv")

Merge all data frames into one big data frame

df <- rbind(ds1,ds2,ds3,ds4,ds5,ds6,ds7,ds8,ds9,ds10,ds11,ds12)

3. Processing of Data

3.1 Manipulating Data

Changing data type of started_at and ended_at from Character to Datetime

df$started_at <- ymd_hms(df$started_at)
df$ended_at <- ymd_hms(df$ended_at)

Adding some relevant columns like ride_length, day_of_week, Month, Day to data for smoothing the analysis process.

df1 <- df %>% 
  mutate(ride_length = difftime(ended_at,started_at,units = "mins")) %>% 
  mutate(day_of_week = weekdays(as.Date(df$started_at))) %>% 
  mutate(Month = month(df$started_at,label = TRUE)) %>% 
  mutate(Day = day(df$started_at))

Checking if there is any ride_length value is negative.

df1 %>% 
  count(ride_length<0) %>% 
  gt()

ride_length < 0	n
FALSE	5594916
TRUE	147

Yes, there are some rows with negative ride_length and because of that we have to remove these rows as it can be a hindrance in the analysis process.

df1 <- df1[!(df1$ride_length< 0),]

Converting ride_length from Factor to numeric so we can run calculations on the data.

df1$ride_length <- as.numeric(as.character(df1$ride_length))

3.2 Summary Statistics of Dataset

skim_without_charts(df1)

Data summary
Name	df1
Number of rows	5594916
Number of columns	17
_______________________
Column type frequency:
character	8
factor	1
numeric	6
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1.00	16	16	5594916
rideable_type	0	1.00	11	13	3
start_station_name	690789	0.88	3	53	847
start_station_id	690786	0.88	3	36	834
end_station_name	739149	0.87	10	53	844
end_station_id	739149	0.87	3	36	832
member_casual	0	1.00	6	6	2
day_of_week	0	1.00	6	9	7

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Month	0	1	TRUE	12	Jul: 822397, Aug: 804323, Sep: 756111, Jun: 729590

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	0	1	41.90	0.05	41.64	41.88	41.90	41.93	42.07
start_lng	0	1	-87.65	0.03	-87.84	-87.66	-87.64	-87.63	-87.52
end_lat	4770	1	41.90	0.05	41.39	41.88	41.90	41.93	42.17
end_lng	4770	1	-87.65	0.03	-88.97	-87.66	-87.64	-87.63	-87.49
ride_length	0	1	21.94	178.34	0.00	6.75	12.00	21.78	55944.15
Day	0	1	15.45	8.79	1.00	8.00	15.00	23.00	31.00

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2021-01-01 00:02:05	2021-12-31 23:59:48	2021-08-01 01:48:51	4677893
ended_at	0	1	2021-01-01 00:08:39	2022-01-03 17:32:18	2021-08-01 02:18:49	4671279

We can see right immediately that there is an outlier data under ride_length, with a maximum trip time of 55944 minutes, or about 38 days. This might be due to a data error or the fact that the individual rented it for 38 days.

3.3 Glipmse of Dataset

gt(head(df1))

ride_id	rideable_type	started_at	ended_at	start_station_name	start_station_id	end_station_name	end_station_id	start_lat	start_lng	end_lat	end_lng	member_casual	ride_length	day_of_week	Month	Day
E19E6F1B8D4C42ED	electric_bike	2021-01-23 16:14:19	2021-01-23 16:24:44	California Ave & Cortez St	17660	NA	NA	41.90034	-87.69674	41.89	-87.72	member	10.4166667	Saturday	Jan	23
DC88F20C2C55F27F	electric_bike	2021-01-27 18:43:08	2021-01-27 18:47:12	California Ave & Cortez St	17660	NA	NA	41.90033	-87.69671	41.90	-87.69	member	4.0666667	Wednesday	Jan	27
EC45C94683FE3F27	electric_bike	2021-01-21 22:35:54	2021-01-21 22:37:14	California Ave & Cortez St	17660	NA	NA	41.90031	-87.69664	41.90	-87.70	member	1.3333333	Thursday	Jan	21
4FA453A75AE377DB	electric_bike	2021-01-07 13:31:13	2021-01-07 13:42:55	California Ave & Cortez St	17660	NA	NA	41.90040	-87.69666	41.92	-87.69	member	11.7000000	Thursday	Jan	7
BE5E8EB4E7263A0B	electric_bike	2021-01-23 02:24:02	2021-01-23 02:24:45	California Ave & Cortez St	17660	NA	NA	41.90033	-87.69670	41.90	-87.70	casual	0.7166667	Saturday	Jan	23
5D8969F88C773979	electric_bike	2021-01-09 14:24:07	2021-01-09 15:17:54	California Ave & Cortez St	17660	NA	NA	41.90041	-87.69676	41.94	-87.71	casual	53.7833333	Saturday	Jan	9

4. Anlaysis Phase

4.1 Descriptive Analysis

analysis1 <- df1 %>% 
  group_by(member_casual,day_of_week) %>% 
  summarise(Mean_ride_length = mean(ride_length), Median_ride_length = median(ride_length) ,Max_ride_length = max(ride_length),Min_ride_length = min(ride_length), count = n())%>% 
  gt()
analysis1 <- data.frame(analysis1)
analysis1 %>% 
  group_by(member_casual) %>% 
  gt()

day_of_week	Mean_ride_length	Median_ride_length	Max_ride_length	Min_ride_length	count
casual
Monday	31.87545	15.950000	31681.650	0	286373
Tuesday	27.97233	14.283333	38922.917	0	274388
Wednesday	27.65731	13.966667	38963.083	0	278948
Thursday	27.70326	13.783333	49107.150	0	286064
Friday	30.34860	14.966667	55691.683	0	364075
Saturday	34.70623	17.816667	55944.150	0	557994
Sunday	37.56658	18.716667	53921.600	0	481104
member
Monday	13.24753	9.200000	1499.950	0	416204
Tuesday	12.78812	9.133333	1499.950	0	465509
Wednesday	12.81916	9.216667	1499.967	0	477156
Thursday	12.77618	9.133333	1499.950	0	451520
Friday	13.32492	9.433333	1499.950	0	446423
Saturday	15.26457	10.816667	1559.933	0	433041
Sunday	15.65794	10.866667	1499.933	0	376117

4.2 Graphical Analysis

ggplot(data=df1,aes(x=member_casual,y = prop.table(stat(count)),fill = member_casual,label = scales::percent(prop.table(stat(count)))))+
  geom_bar(position = "dodge") + 
  geom_text(stat = 'count',position = position_dodge(.9),vjust = -0.5,size = 3) +
  scale_y_continuous(labels = scales::percent)+
  labs(title = "Categories of customer",subtitle = "Casual v/s Member",x="Customer Type",y="Percent",caption = "Figure 1")+
  theme(
  plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)

ggplot(data = df1)+
  geom_bar(mapping = aes(x=member_casual,fill = day_of_week))+
  facet_grid(day_of_week~Month)+
  theme(legend.position = "bottom")+
  labs(title = "Usage by Customers",subtitle = "Casual v/s Member",caption = "Figure 2")+
  theme(
  plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)

analysis3 <- df1 %>% 
  group_by(Month,member_casual) %>% 
  summarise(Number_of_Riders = n(), Average_Ride_Time = mean(ride_length))

analysis3 %>% 
ggplot(aes(x = Month, y=Number_of_Riders,group = member_casual,colour = member_casual)) +
  geom_point() +
  geom_line() +
  labs(title = "Line Graph",subtitle = "Casual v/s Member",y="Number of Riders",caption = "Figure 3")+
  theme(
  plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)

ggplot(data = analysis1,aes(x = day_of_week, y = Mean_ride_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Average Ride Length",subtitle = "Casual v/s Member",x="Weekday",y="Average Ride Length",caption = "Figure 4")+
  theme(
  plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)

Top10 <- df1 %>% 
  group_by(member_casual) %>% 
  count(start_station_name,sort = TRUE) %>% 
  slice(1:10) 
Top10 <- data.frame(Top10)
Top10 %>% 
  filter(!start_station_name == "") %>% 
  ggplot(aes(x=n,y=start_station_name,fill = start_station_name))+
  geom_bar(stat='identity')+
  facet_wrap(~member_casual)+
  theme(legend.position = "none",axis.title.x = element_blank(),axis.title.y = element_blank())+
  labs(title = "Most Busiest Stations",subtitle = "Casual v/s Member",caption = "Figure 5")+
  theme(
  plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
  )

5. Conclusion

There is no direct answer to the question How do we convert casual users to annual members? but here are some insights and suggestions which we can implement for next year 2022.

5.1 Insights

As we can see that in Figure 1, the number of rides taken by casual type customers was 45.2% and for member type customers it was 54.8% which means that rides taken by members is 9.6% more than the rides taken by casual customers.
By Figures 2 & 3 we can see that users touched their peak in the month of July-August if we see in terms of the month. The rides taken by casual users saw a peak at the weekends but for member type customers the usage was similar throughout the week.
Even when there are more member type customers than casual users but when it comes to Average Ride Length casual users ride more than 2 times what members ride, as we can see in Figure 4.
In Figure 5, we can also see the list of the Top 10 busiest stations for both casual and member type users and most of them overlap each other.

5.2 Suggetions

We can conduct marketing campaigns on those most popular stations and at the weekends because then it can target a larger audience.
We can design annual membership in a way that seems much more reliable than daily passes. In doing so we have to add some more perks, some gifts or something like that if you purchase our annual membership with a certain credit card, you can get some additional instant discount.
We can increase the price of a daily pass on weekends only because there are much more rides taken on weekends by casual users and casual users may find it feasible to purchase an annual membership rather than a daily pass.

How Does a Bike-Share Navigate Speedy Success?