library(tidyverse)
library(skimr)
library(lubridate)
library(janitor)
library(readr)
library(gt)
library(scales)

1. Describing Scenario

1.1 Background

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual ridersAlthough the pricing flexibility helps Cyclistic attract more customers, your manager believes that maximizing the number of annual members will be key to future growth.


1.2 Business Task

Analyze the historical bike data of Cyclistic to identify trends and derive insights to help marketing team with new marketing strategy.


1.3 Objectives

  • How do annual members and casual riders use Cyclistic bikes differently?

  • Why would casual riders buy Cyclistic annual memberships?

  • How can Cyclistic use digital media to influence casual riders to become members?


1.4 Stakeholders

  • Lily Moreno: The director of marketing and your manager.

  • Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data.

  • Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.


2. Preparation of Data

2.1 Information about dataset

From January 2021 to December 2021, a total of 12 datasets have been made accessible for each month. Each dataset contains the specifics of every ride registered by clients of Cyclistic . This data has been cleaned to remove rider’s personal information before being made public.

We have 13 columns and we can infer their content:

  • ride_id: Id for each trip taken, as of now we are not sure if they are unique or not, we have to find out

  • rideable_type: Represents the type of a bike

  • started_at: Date and time of the start time

  • ended_at: Date and time of the end time

  • start_station_name: Name of the starting station

  • start_station_id: Id of the starting station

  • end_station_name: Name of the ending station

  • end_station_id: Id of the ending station

  • start_lat: Latitude of the starting point

  • start_lng: Longitude of the starting point

  • end_lat: Latitude of the ending point

  • end_lng: Longitude of the ending point

  • member_casual: Represents the membership status

ds1 <- read.csv("202101-divvy-tripdata.csv")
ds2 <- read.csv("202102-divvy-tripdata.csv")
ds3 <- read.csv("202103-divvy-tripdata.csv")
ds4 <- read.csv("202104-divvy-tripdata.csv")
ds5 <- read.csv("202105-divvy-tripdata.csv")
ds6 <- read.csv("202106-divvy-tripdata.csv")
ds7 <- read.csv("202107-divvy-tripdata.csv")
ds8 <- read.csv("202108-divvy-tripdata.csv")
ds9 <- read.csv("202109-divvy-tripdata.csv")
ds10 <- read.csv("202110-divvy-tripdata.csv")
ds11 <- read.csv("202111-divvy-tripdata.csv")
ds12 <- read.csv("202112-divvy-tripdata.csv")

Merge all data frames into one big data frame

df <- rbind(ds1,ds2,ds3,ds4,ds5,ds6,ds7,ds8,ds9,ds10,ds11,ds12)

3. Processing of Data

3.1 Manipulating Data

  • Changing data type of started_at and ended_at from Character to Datetime
df$started_at <- ymd_hms(df$started_at)
df$ended_at <- ymd_hms(df$ended_at)
  • Adding some relevant columns like ride_length, day_of_week, Month, Day to data for smoothing the analysis process.
df1 <- df %>% 
  mutate(ride_length = difftime(ended_at,started_at,units = "mins")) %>% 
  mutate(day_of_week = weekdays(as.Date(df$started_at))) %>% 
  mutate(Month = month(df$started_at,label = TRUE)) %>% 
  mutate(Day = day(df$started_at))
  • Checking if there is any ride_length value is negative.
df1 %>% 
  count(ride_length<0) %>% 
  gt()
ride_length < 0 n
FALSE 5594916
TRUE 147

Yes, there are some rows with negative ride_length and because of that we have to remove these rows as it can be a hindrance in the analysis process.

df1 <- df1[!(df1$ride_length< 0),]
  • Converting ride_length from Factor to numeric so we can run calculations on the data.
df1$ride_length <- as.numeric(as.character(df1$ride_length))

3.2 Summary Statistics of Dataset

skim_without_charts(df1)
Data summary
Name df1
Number of rows 5594916
Number of columns 17
_______________________
Column type frequency:
character 8
factor 1
numeric 6
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 5594916 0
rideable_type 0 1.00 11 13 0 3 0
start_station_name 690789 0.88 3 53 0 847 0
start_station_id 690786 0.88 3 36 0 834 0
end_station_name 739149 0.87 10 53 0 844 0
end_station_id 739149 0.87 3 36 0 832 0
member_casual 0 1.00 6 6 0 2 0
day_of_week 0 1.00 6 9 0 7 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Month 0 1 TRUE 12 Jul: 822397, Aug: 804323, Sep: 756111, Jun: 729590

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 42.07
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -87.52
end_lat 4770 1 41.90 0.05 41.39 41.88 41.90 41.93 42.17
end_lng 4770 1 -87.65 0.03 -88.97 -87.66 -87.64 -87.63 -87.49
ride_length 0 1 21.94 178.34 0.00 6.75 12.00 21.78 55944.15
Day 0 1 15.45 8.79 1.00 8.00 15.00 23.00 31.00

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2021-01-01 00:02:05 2021-12-31 23:59:48 2021-08-01 01:48:51 4677893
ended_at 0 1 2021-01-01 00:08:39 2022-01-03 17:32:18 2021-08-01 02:18:49 4671279

We can see right immediately that there is an outlier data under ride_length, with a maximum trip time of 55944 minutes, or about 38 days. This might be due to a data error or the fact that the individual rented it for 38 days.


3.3 Glipmse of Dataset

gt(head(df1))
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual ride_length day_of_week Month Day
E19E6F1B8D4C42ED electric_bike 2021-01-23 16:14:19 2021-01-23 16:24:44 California Ave & Cortez St 17660 NA NA 41.90034 -87.69674 41.89 -87.72 member 10.4166667 Saturday Jan 23
DC88F20C2C55F27F electric_bike 2021-01-27 18:43:08 2021-01-27 18:47:12 California Ave & Cortez St 17660 NA NA 41.90033 -87.69671 41.90 -87.69 member 4.0666667 Wednesday Jan 27
EC45C94683FE3F27 electric_bike 2021-01-21 22:35:54 2021-01-21 22:37:14 California Ave & Cortez St 17660 NA NA 41.90031 -87.69664 41.90 -87.70 member 1.3333333 Thursday Jan 21
4FA453A75AE377DB electric_bike 2021-01-07 13:31:13 2021-01-07 13:42:55 California Ave & Cortez St 17660 NA NA 41.90040 -87.69666 41.92 -87.69 member 11.7000000 Thursday Jan 7
BE5E8EB4E7263A0B electric_bike 2021-01-23 02:24:02 2021-01-23 02:24:45 California Ave & Cortez St 17660 NA NA 41.90033 -87.69670 41.90 -87.70 casual 0.7166667 Saturday Jan 23
5D8969F88C773979 electric_bike 2021-01-09 14:24:07 2021-01-09 15:17:54 California Ave & Cortez St 17660 NA NA 41.90041 -87.69676 41.94 -87.71 casual 53.7833333 Saturday Jan 9

4. Anlaysis Phase

4.1 Descriptive Analysis

analysis1 <- df1 %>% 
  group_by(member_casual,day_of_week) %>% 
  summarise(Mean_ride_length = mean(ride_length), Median_ride_length = median(ride_length) ,Max_ride_length = max(ride_length),Min_ride_length = min(ride_length), count = n())%>% 
  gt()
analysis1 <- data.frame(analysis1)
analysis1 %>% 
  group_by(member_casual) %>% 
  gt()
day_of_week Mean_ride_length Median_ride_length Max_ride_length Min_ride_length count
casual
Monday 31.87545 15.950000 31681.650 0 286373
Tuesday 27.97233 14.283333 38922.917 0 274388
Wednesday 27.65731 13.966667 38963.083 0 278948
Thursday 27.70326 13.783333 49107.150 0 286064
Friday 30.34860 14.966667 55691.683 0 364075
Saturday 34.70623 17.816667 55944.150 0 557994
Sunday 37.56658 18.716667 53921.600 0 481104
member
Monday 13.24753 9.200000 1499.950 0 416204
Tuesday 12.78812 9.133333 1499.950 0 465509
Wednesday 12.81916 9.216667 1499.967 0 477156
Thursday 12.77618 9.133333 1499.950 0 451520
Friday 13.32492 9.433333 1499.950 0 446423
Saturday 15.26457 10.816667 1559.933 0 433041
Sunday 15.65794 10.866667 1499.933 0 376117

4.2 Graphical Analysis

ggplot(data=df1,aes(x=member_casual,y = prop.table(stat(count)),fill = member_casual,label = scales::percent(prop.table(stat(count)))))+
  geom_bar(position = "dodge") + 
  geom_text(stat = 'count',position = position_dodge(.9),vjust = -0.5,size = 3) +
  scale_y_continuous(labels = scales::percent)+
  labs(title = "Categories of customer",subtitle = "Casual v/s Member",x="Customer Type",y="Percent",caption = "Figure 1")+
  theme(
  plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)


ggplot(data = df1)+
  geom_bar(mapping = aes(x=member_casual,fill = day_of_week))+
  facet_grid(day_of_week~Month)+
  theme(legend.position = "bottom")+
  labs(title = "Usage by Customers",subtitle = "Casual v/s Member",caption = "Figure 2")+
  theme(
  plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)


analysis3 <- df1 %>% 
  group_by(Month,member_casual) %>% 
  summarise(Number_of_Riders = n(), Average_Ride_Time = mean(ride_length))
analysis3 %>% 
ggplot(aes(x = Month, y=Number_of_Riders,group = member_casual,colour = member_casual)) +
  geom_point() +
  geom_line() +
  labs(title = "Line Graph",subtitle = "Casual v/s Member",y="Number of Riders",caption = "Figure 3")+
  theme(
  plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)


ggplot(data = analysis1,aes(x = day_of_week, y = Mean_ride_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Average Ride Length",subtitle = "Casual v/s Member",x="Weekday",y="Average Ride Length",caption = "Figure 4")+
  theme(
  plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
)


Top10 <- df1 %>% 
  group_by(member_casual) %>% 
  count(start_station_name,sort = TRUE) %>% 
  slice(1:10) 
Top10 <- data.frame(Top10)
Top10 %>% 
  filter(!start_station_name == "") %>% 
  ggplot(aes(x=n,y=start_station_name,fill = start_station_name))+
  geom_bar(stat='identity')+
  facet_wrap(~member_casual)+
  theme(legend.position = "none",axis.title.x = element_blank(),axis.title.y = element_blank())+
  labs(title = "Most Busiest Stations",subtitle = "Casual v/s Member",caption = "Figure 5")+
  theme(
  plot.caption = element_text(color = "blue", face = "italic",hjust = 0)
  )


5. Conclusion

There is no direct answer to the question How do we convert casual users to annual members? but here are some insights and suggestions which we can implement for next year 2022.

5.1 Insights

  • As we can see that in Figure 1, the number of rides taken by casual type customers was 45.2% and for member type customers it was 54.8% which means that rides taken by members is 9.6% more than the rides taken by casual customers.

  • By Figures 2 & 3 we can see that users touched their peak in the month of July-August if we see in terms of the month. The rides taken by casual users saw a peak at the weekends but for member type customers the usage was similar throughout the week.

  • Even when there are more member type customers than casual users but when it comes to Average Ride Length casual users ride more than 2 times what members ride, as we can see in Figure 4.

  • In Figure 5, we can also see the list of the Top 10 busiest stations for both casual and member type users and most of them overlap each other.


5.2 Suggetions

  • We can conduct marketing campaigns on those most popular stations and at the weekends because then it can target a larger audience.

  • We can design annual membership in a way that seems much more reliable than daily passes. In doing so we have to add some more perks, some gifts or something like that if you purchase our annual membership with a certain credit card, you can get some additional instant discount.

  • We can increase the price of a daily pass on weekends only because there are much more rides taken on weekends by casual users and casual users may find it feasible to purchase an annual membership rather than a daily pass.