Today we are going to perform exploratory data analysis (EDA) on IPL data. Here is the link for the data (you can also check out the Cricsheet version of the data).
Here is a preview of what our data looks like:
Our task is to predict which batsman will perform well in the upcoming match, using Dream11 points as the target column.
Feature Engineering
We start by creating columns that help us evaluate a batsman's performance.
Total_runs
# total runs per batsman per match (same source dataframe as the other features)
total_runs = pd.DataFrame(df_2021.groupby(['battingteam','bowlingteam','matchid','batsmanname'])['scorevalue'].sum()).\
rename(columns={"scorevalue": "total_runs"})
Here is a preview of the output
Tip:
After the groupby, the grouping columns (battingteam, bowlingteam, matchid, batsmanname) become the index of the result, so you can no longer access them as regular columns. One workaround is to finish all your operations first, save the dataframe as a CSV file, and load it back: the index levels then come back as ordinary columns. Calling reset_index() on the dataframe gives the same result without the round trip through disk.
Leave a comment if you found this tip useful.
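Here is a minimal sketch of the index behaviour described in the tip. The column names follow the ones used in this post, but the rows are made up for illustration.

```python
import pandas as pd

# Toy ball-by-ball data; values are invented, column names follow the post.
balls = pd.DataFrame({
    "battingteam": ["A", "A", "A", "B"],
    "bowlingteam": ["B", "B", "B", "A"],
    "matchid": [1, 1, 1, 1],
    "batsmanname": ["p1", "p1", "p2", "p3"],
    "scorevalue": [4, 6, 1, 0],
})

keys = ["battingteam", "bowlingteam", "matchid", "batsmanname"]
runs = balls.groupby(keys)["scorevalue"].sum().to_frame("total_runs")

# The group keys now live in the index, not in the columns.
print("batsmanname" in runs.columns)  # False

# reset_index() moves every index level back into an ordinary column.
flat = runs.reset_index()
print(flat.columns.tolist())
```

After reset_index() the grouping keys can be filtered, merged, or plotted like any other column.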
Let's do this for all the important columns:
#number of sixes
batsmen_scores6 = pd.DataFrame(df_2021[df_2021['scorevalue'] == 6].groupby(['battingteam','bowlingteam','matchid', 'batsmanname'])['scorevalue'].count()).\
rename(columns={"scorevalue": "run_6"})
#number of fours
batsmen_scores4 = pd.DataFrame(df_2021[df_2021['scorevalue'] == 4].groupby(['battingteam','bowlingteam','matchid', 'batsmanname'])['scorevalue'].count()).\
rename(columns={"scorevalue": "run_4"})
#no of balls
batsmen_ball_faced_legal = pd.DataFrame(df_2021.groupby(['battingteam','bowlingteam','matchid', 'batsmanname'])['over'].nunique()).\
rename(columns={"over": "total_legal_balls_faced"})
#strikerate
# dividing two Series with different names drops the name, so the column is named explicitly
batsmen_Strikerate = ((df_2021.groupby(['battingteam','bowlingteam','matchid','batsmanname'])['scorevalue'].sum()/df_2021.groupby(['battingteam','bowlingteam','matchid', 'batsmanname'])['over'].nunique())*100).\
to_frame("strike_rate")
#fifties
fifties = pd.DataFrame(df_2021.groupby(['battingteam','bowlingteam','matchid','batsmanname'])['scorevalue'].sum() >= 50).\
rename(columns={"scorevalue": "50's"})
#hundreds
hundreds = pd.DataFrame(df_2021.groupby(['battingteam','bowlingteam','matchid','batsmanname'])['scorevalue'].sum() >= 100 ).\
rename(columns={"scorevalue": "100's"})
#duck
duck = pd.DataFrame(df_2021.groupby(['battingteam','bowlingteam','matchid','batsmanname'])['scorevalue'].sum() == 0).\
rename(columns={"scorevalue": "duck"})
# #batsmen_position
batsmen_position = pd.DataFrame(df_2021.groupby(['battingteam','bowlingteam','matchid', 'batsmanname'])['fallofwickets'].min())
#batting_team
# batting_team = pd.DataFrame(df_2021['battingteam'])
# batting_team
# bowling_team = pd.DataFrame(df_2021['bowlingteam'])
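The repeated groupby calls above could also be collapsed into a single pass with pandas named aggregation. The following is only a sketch on made-up rows, not the code used for this post. It assumes, as the nunique() calls above do, that each value in the over column identifies one delivery (for example 0.1, 0.2); the name balls_faced is chosen here for illustration.

```python
import pandas as pd

# Toy ball-by-ball data; values are invented for illustration.
balls = pd.DataFrame({
    "battingteam": ["A"] * 5,
    "bowlingteam": ["B"] * 5,
    "matchid": [1] * 5,
    "batsmanname": ["p1", "p1", "p1", "p2", "p2"],
    "over": [0.1, 0.2, 0.3, 0.4, 0.5],  # assumed: one distinct value per delivery
    "scorevalue": [4, 6, 0, 1, 4],
})

keys = ["battingteam", "bowlingteam", "matchid", "batsmanname"]

# One groupby, several named output columns.
features = balls.groupby(keys).agg(
    total_runs=("scorevalue", "sum"),
    run_6=("scorevalue", lambda s: (s == 6).sum()),
    run_4=("scorevalue", lambda s: (s == 4).sum()),
    balls_faced=("over", "nunique"),
)

# Derived columns are computed from the aggregated ones.
features["strike_rate"] = features["total_runs"] / features["balls_faced"] * 100
features["fifty"] = (features["total_runs"] >= 50).astype(int)
features["duck"] = (features["total_runs"] == 0).astype(int)

print(features)
```

Because every feature comes from the same groupby, all columns share one index and batsmen without a six or four get 0 instead of a missing value.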
Now combine all the single-column dataframes into one dataframe with multiple columns:
total_runs['duck'] = duck
total_runs['Sixes'] = batsmen_scores6
total_runs['Fours'] = batsmen_scores4
total_runs['balls'] = batsmen_ball_faced_legal
total_runs['Fifties'] = fifties
total_runs['hundreds'] = hundreds
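Column assignment in pandas aligns on the index, which is why batsmen who hit no sixes or fours end up with missing values rather than zeros. A small sketch of that behaviour, on invented rows and using a Series for simplicity:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, "p1"), (1, "p2")], names=["matchid", "batsmanname"]
)
total = pd.DataFrame({"total_runs": [10, 5]}, index=idx)

# Only p1 hit a six, so the sixes frame has a single row.
sixes = pd.DataFrame({"run_6": [1]}, index=idx[:1])

# Assignment matches rows by index label; p2 has no match and becomes NaN.
total["Sixes"] = sixes["run_6"]
print(total)

# Filling the gaps with 0 gives a usable count column.
filled = total.fillna(0)
print(filled)
```

This is the reason a fillna(0) step is needed before the counts can be used in arithmetic.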
Convert the Boolean flags to integers so they can be used in arithmetic:
df = total_runs.copy()
# batsmen with no sixes or fours have no row in those frames, so they show up as NaN here
df = df.fillna(0)
# True/False flags become 1/0; Sixes and Fours are already counts and need no conversion
for col in ['hundreds', 'Fifties', 'duck']:
    df[col] = df[col].astype(int)
Next we create the target column, the Dream11 score, which holds each player's points calculated according to the Dream11 points system.
#point system of dream11
pointsconfig = {
    'total_runs': 1,
    'run_6': 2,
    'run_4': 1,
    '>=50': 8,
    '>=100': 16,
    'duck': -2,
}
# apply the weights above per batsman per match; the trailing + 4 is a flat bonus added to every row
dream11_score = df['total_runs'] + df['Sixes']*2 + df['Fours'] + df['Fifties']*8 + df['hundreds']*16 - df['duck']*2 + 4
df['dream11_score'] = dream11_score
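To sanity-check the formula above, it helps to work it through for a single innings. The function below is a hypothetical helper written for this purpose, not part of the original notebook. It mirrors the formula as written, including the flat 4 points added to every row (reading that constant as a starting-lineup bonus is an assumption) and the fact that a century earns both the 50 and the 100 bonus.

```python
def batting_points(runs, fours, sixes, flat_bonus=4):
    """Batting points for one innings, mirroring the formula in this post.

    flat_bonus is the constant 4 from the formula; treating it as a
    lineup bonus is an assumption, not something the data confirms.
    """
    points = runs * 1 + fours * 1 + sixes * 2 + flat_bonus
    if runs >= 50:
        points += 8   # fifty bonus
    if runs >= 100:
        points += 16  # hundred bonus, stacked on the fifty bonus as in the formula
    if runs == 0:
        points -= 2   # duck penalty
    return points


# 62 runs with 5 fours and 3 sixes: 62 + 5 + 6 + 4 + 8
print(batting_points(62, 5, 3))   # 85

# A duck: 0 + 0 + 0 + 4 - 2
print(batting_points(0, 0, 0))    # 2
```

Comparing a few hand-computed innings like these against rows of the dataframe is a quick way to confirm the vectorised column is doing what you expect.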
Here is the final output:
In the next post we will dig into this data and draw out the insights.