
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import gc
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, mean_squared_error, accuracy_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, BaggingClassifier, VotingClassifier, AdaBoostClassifier
In this task, you will be asked to predict the chances of a user listening to a song repetitively after the first observable listening event within a time window was triggered. If there are recurring listening event(s) triggered within a month after the user's very first observable listening event, its target is marked 1, and 0 otherwise in the training set. The same rule applies to the testing set. KKBOX provides a training data set consisting of information on the first observable listening event for each unique user-song pair within a specific time duration. Metadata for each unique user and song pair is also provided. The use of public data to increase the level of accuracy of your prediction is encouraged.

The train and test data are selected from users' listening history in a given time period. Note that this time period is chosen to be before the WSDM-KKBox Churn Prediction time period. The train and test sets are split based on time, and the split of public/private is based on unique user/song pairs.

Tables

train.csv
- msno: user id
- song_id: song id
- source_system_tab: the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile apps functions. For example, tab my library contains functions to manipulate the local storage, and tab search contains functions relating to search.
- source_screen_name: name of the layout a user sees.
- source_type: an entry point a user first plays music on mobile apps. An entry point could be album, online-playlist, song, etc.
- target: the target variable. target=1 means there are recurring listening event(s) triggered within a month after the user's very first observable listening event; target=0 otherwise.

test.csv
- id: row id (will be used for submission)
- msno: user id
- song_id: song id
- source_system_tab, source_screen_name, source_type: as in train.csv

sample_submission.csv
- sample submission file in the format that we expect you to submit
- id: same as id in test.csv
- target: same definition as in train.csv

songs.csv
The songs. Note that the data is in unicode.
- song_id
- song_length: in ms
- genre_ids: genre category. Some songs have multiple genres, separated by |
- artist_name
- composer
- lyricist
- language

members.csv
User information.
- msno
- city
- bd: age. Note: this column has outlier values; please use your judgement.
- gender
- registered_via: registration method
- registration_init_time: format %Y%m%d
- expiration_date: format %Y%m%d

song_extra_info.csv
- song_id
- name: the name of the song.
- isrc: International Standard Recording Code, which theoretically can be used as an identity of a song. However, it is worth noting that ISRCs obtained from providers have not been officially verified, so the information in an ISRC, such as country code and reference year, can be misleading/incorrect. Multiple songs could share one ISRC, since a single recording can be re-published several times.
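Two of the fields above have internal structure worth noting. A minimal sketch of how they can be pulled apart (the sample values and the split_isrc helper are hypothetical, not part of the dataset):

import pandas as pd

# genre_ids: multiple genres separated by '|'
songs_demo = pd.DataFrame({'song_id': ['a', 'b'], 'genre_ids': ['465|958', '444']})
songs_demo['genre_list'] = songs_demo['genre_ids'].str.split('|')

# ISRC layout: country (2 chars), registrant (3), two-digit reference year (2),
# designation (5); as noted above, these fields are unverified and can be wrong
def split_isrc(isrc):
    if not isinstance(isrc, str) or len(isrc) != 12:
        return None
    return {'country': isrc[:2], 'registrant': isrc[2:5],
            'year': isrc[5:7], 'designation': isrc[7:]}

print(split_isrc('USAB10912345'))  # hypothetical ISRC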
In [2]:
import lightgbm as lgb
from sklearn.metrics import precision_recall_curve,roc_auc_score,classification_report,roc_curve
from tqdm import tqdm
 

Now that we have imported the necessary modules, we can start with:

EDA (exploratory data analysis), wrangling and visualizing the data, which gives us the statistical summaries and insights we need.

The last steps will be data imputation, merging, cross-validation, hyperparameter tuning, and a look at each algorithm we used, its results, and the time it took to produce them.
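Since the plan includes reporting how long each algorithm takes, here is a minimal sketch of the timing bookkeeping (the timed helper is our own convention, assumed rather than taken from the original kernel):

import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(name):
    # record the wall-clock time of the enclosed block under `name`
    start = time.time()
    yield
    timings[name] = time.time() - start

with timed('dummy work'):
    sum(range(10**6))
print(timings)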

In [3]:
from subprocess import check_output

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
songs = pd.read_csv('../input/songs.csv')
members = pd.read_csv('../input/members.csv')
sample = pd.read_csv('../input/sample_submission.csv')
In [4]:
train.head()
Out[4]:
   | msno                                         | song_id                                      | source_system_tab | source_screen_name  | source_type     | target
 0 | FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg= | BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik= | explore           | Explore             | online-playlist | 1
 1 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | bhp/MpSNoqoxOIB+/l8WPqu6jldth4DIpCm3ayXnJqM= | my library        | Local playlist more | local-playlist  | 1
 2 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | JNWfrrC7zNN7BdMpsISKa4Mw+xVJYNnxXh3/Epw7QgY= | my library        | Local playlist more | local-playlist  | 1
 3 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | 2A87tzfnJTSWqD7gIZHisolhe4DMdzkbd6LzO1KHjNs= | my library        | Local playlist more | local-playlist  | 1
 4 | FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg= | 3qm6XTZ6MOCU11x8FIVbAGH5l5uMkT3/ZalWG1oo2Gc= | explore           | Explore             | online-playlist | 1
In [5]:
test.head()
Out[5]:
   | id | msno                                         | song_id                                      | source_system_tab | source_screen_name  | source_type
 0 | 0  | V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM= | WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc= | my library        | Local playlist more | local-library
 1 | 1  | V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM= | y/rsZ9DC7FwK5F2PK2D5mj+aOBUJAjuu3dZ14NgE0vM= | my library        | Local playlist more | local-library
 2 | 2  | /uQAlrAkaczV+nWCd2sPF2ekvXPRipV7q0l+gbLuxjw= | 8eZLFOdGVdXBSqoAv5nsLigeH2BvKXzTQYtUM53I0k4= | discover          | NaN                 | song-based-playlist
 3 | 3  | 1a6oo/iXKatxQx4eS9zTVD+KlSVaAFbTIqVvwLC1Y0k= | ztCf8thYsS4YN3GcIL/bvoxLm/T5mYBVKOO4C9NiVfQ= | radio             | Radio               | radio
 4 | 4  | 1a6oo/iXKatxQx4eS9zTVD+KlSVaAFbTIqVvwLC1Y0k= | MKVMpslKcQhMaFEgcEQhEfi5+RZhMYlU3eRDpySrH8Y= | radio             | Radio               | radio
In [6]:
songs.head()
Out[6]:
   | song_id                                      | song_length | genre_ids | artist_name        | composer                         | lyricist    | language
 0 | CXoTN1eb7AI+DntdU1vbcwGRV4SCIDxZu+YD8JP8r4E= | 247640      | 465       | 張信哲 (Jeff Chang) | 董貞                             | 何啟弘      | 3.0
 1 | o0kFgae9QtnYgRkVPqLJwa05zIhRlUjfF7O1tDw0ZDU= | 197328      | 444       | BLACKPINK          | TEDDY| FUTURE BOUNCE| Bekuh BOOM | TEDDY       | 31.0
 2 | DwVvVurfpuz+XPuFvucclVQEyPqcpUkHR0ne1RQzPs0= | 231781      | 465       | SUPER JUNIOR       | NaN                              | NaN         | 31.0
 3 | dKMBWoZyScdxSkihKG+Vf47nc18N9q4m58+b4e7dSSE= | 273554      | 465       | S.H.E              | 湯小康                           | 徐世珍      | 3.0
 4 | W3bqWd3T+VeHFzHAUfARgW9AvVRaF4N5Yzm4Mr6Eo/o= | 140329      | 726       | 貴族精選           | Traditional                      | Traditional | 52.0
In [7]:
members.head()
Out[7]:
   | msno                                         | city | bd | gender | registered_via | registration_init_time | expiration_date
 0 | XQxgAYj3klVKjR3oxPPXYYFp4soD4TuBghkhMTD4oTw= | 1    | 0  | NaN    | 7              | 20110820               | 20170920
 1 | UizsfmJb9mV54qE9hCYyU07Va97c0lCRLEQX3ae+ztM= | 1    | 0  | NaN    | 7              | 20150628               | 20170622
 2 | D8nEhsIOBSoE6VthTaqDX8U6lqjJ7dLdr72mOyLya2A= | 1    | 0  | NaN    | 4              | 20160411               | 20170712
 3 | mCuD+tZ1hERA/o5GPqk38e041J8ZsBaLcu7nGoIIvhI= | 1    | 0  | NaN    | 9              | 20150906               | 20150907
 4 | q4HRBfVSssAFS9iRfxWrohxuk9kCYMKjHOEagUMV6rQ= | 1    | 0  | NaN    | 4              | 20170126               | 20170613
In [8]:
sample.head()
members.shape
train.info()
print("\n")
songs.info()
print("\n")
members.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7377418 entries, 0 to 7377417
Data columns (total 6 columns):
msno                  object
song_id               object
source_system_tab     object
source_screen_name    object
source_type           object
target                int64
dtypes: int64(1), object(5)
memory usage: 337.7+ MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2296320 entries, 0 to 2296319
Data columns (total 7 columns):
song_id        object
song_length    int64
genre_ids      object
artist_name    object
composer       object
lyricist       object
language       float64
dtypes: float64(1), int64(1), object(5)
memory usage: 122.6+ MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34403 entries, 0 to 34402
Data columns (total 7 columns):
msno                      34403 non-null object
city                      34403 non-null int64
bd                        34403 non-null int64
gender                    14501 non-null object
registered_via            34403 non-null int64
registration_init_time    34403 non-null int64
expiration_date           34403 non-null int64
dtypes: int64(5), object(2)
memory usage: 1.8+ MB
In [9]:
plt.figure(figsize=(20,15))
sns.set(font_scale=2)
sns.countplot(x='source_type',hue='source_type',data=train)
sns.set(style="darkgrid")
plt.xlabel('source types',fontsize=30)
plt.ylabel('count',fontsize=30)
plt.xticks(rotation='45')
plt.title('Count plot source types for listening music',fontsize=30)
plt.tight_layout()
 
 

From this first visualization we can see that the local library is preferred over every other source type, followed by online playlist and local playlist; the remaining source types appear far less often. We can't conclude much yet, though, since we haven't dealt with cleaning, imputation, or the statistics.

Still, it is a fair bet that building this system will revolve largely around the local library; let's see what the other results say.

In [10]:
plt.figure(figsize=(20,15))
sns.set(font_scale=2)
sns.countplot(y='source_screen_name',data=train,facecolor=(0,0,0,0),linewidth=5,edgecolor=sns.color_palette('dark',3))
sns.set(style="darkgrid")
plt.xlabel('source types',fontsize=30)
plt.ylabel('count',fontsize=30)
plt.xticks(rotation='45')
plt.title('Count plot for which  screen using ',fontsize=30)
plt.tight_layout()
 
 

The second visualization tells us that most users are listening through Local playlist more, i.e. through the screens the app itself provides; after that, most users come back to their songs via online playlist sources.

Very little comes from the other sources, so most of the variance concentrates in two areas: local libraries and online playlists.

In [11]:
plt.figure(figsize=(20,15))
sns.set(font_scale=2)
sns.countplot(x='source_system_tab',hue='source_system_tab',data=train)
sns.set(style="darkgrid")
plt.xlabel('source types',fontsize=30)
plt.ylabel('count',fontsize=30)
plt.xticks(rotation='45')
plt.title('Count plot for system tab there are using',fontsize=30)
plt.tight_layout()
 
 

So for anyone who has installed the KKBOX app, most users go back to their songs via my library rather than by discovering them again: there are several routes back to a song, but the most preferred one is my library.

Now let's do some visualization on members.csv.

In [12]:
import matplotlib as mpl

mpl.rcParams['font.size'] = 40.0
labels = ['Male','Female']
plt.figure(figsize = (12, 12))
sizes = pd.value_counts(members.gender)
patches, texts, autotexts = plt.pie(sizes, 
                                    labels=labels, autopct='%.0f%%',
                                    shadow=False, radius=1,startangle=90)
for t in texts:
    t.set_size('smaller')
plt.legend()
plt.show()
 
 

As we can see, we have more male users. The next visualization should combine this with the popular ways each gender goes back to their playlists.

In [13]:
import matplotlib.pyplot as plt
mpl.rcParams['font.size'] = 40.0
plt.figure(figsize = (20, 20)) 
# Make data: I have 3 groups and 7 subgroups
# order must match pd.value_counts(train.source_system_tab), which sorts by count
group_names=['my library','discover','search','radio','listen with','explore','notification','settings']
group_size=pd.value_counts(train.source_system_tab)
print(group_size)
subgroup_names=['Male','Female']
subgroup_size=pd.value_counts(members.gender)
 
# Create colors
a, b, c,d,e,f,g,h=[plt.cm.autumn, plt.cm.GnBu, plt.cm.YlGn,plt.cm.Purples,plt.cm.cool,plt.cm.RdPu,plt.cm.BuPu,plt.cm.bone]
 
# First Ring (outside)
fig, ax = plt.subplots()
ax.axis('equal')
mypie, texts = ax.pie(group_size, radius=3.0, labels=group_names, colors=[a(0.6), b(0.6), c(0.6), d(0.6), e(0.6), f(0.6), g(0.6), h(0.6)])  # eight groups, eight colors
plt.setp( mypie, width=0.3, edgecolor='white')
 
# Second Ring (Inside)
#mypie2, texts1 = ax.pie(subgroup_size, radius=3.0-0.3, labels=subgroup_names, labeldistance=0.7, colors=[h(0.5), b(0.4)])
#plt.setp( mypie2, width=0.3, edgecolor='white')
#plt.margins(0,0)
#for t in texts:
 #   t.set_size(25.0)
#for t in texts1:
 
    #t.set_size(25.0)    
plt.legend() 
# show it
plt.show()
 
my library      3684730
discover        2179252
search           623286
radio            476701
listen with      212266
explore          167949
notification       6185
settings           2200
Name: source_system_tab, dtype: int64
 
<matplotlib.figure.Figure at 0x7f84742910f0>
 
 

The intended inference from this chart was to compare, per gender, the popular ways of getting back to one's music: the idea being that men concentrate on one entry point in depth while women spread across many. With the inner gender ring commented out, though, the chart only shows the source_system_tab distribution, in which my library and discover dominate.

We are moving in the right direction for building an accurate system.

 

Now some statistical inferences.

We have numeric data in two CSV files and categorical data in the rest: members.csv has a few numeric columns, and so does songs.csv.

In [14]:
print(members.describe())
 
               city            bd  registered_via  registration_init_time  \
count  34403.000000  34403.000000    34403.000000            3.440300e+04   
mean       5.371276     12.280935        5.953376            2.013994e+07   
std        6.243929     18.170251        2.287534            2.954015e+04   
min        1.000000    -43.000000        3.000000            2.004033e+07   
25%        1.000000      0.000000        4.000000            2.012103e+07   
50%        1.000000      0.000000        7.000000            2.015090e+07   
75%       10.000000     25.000000        9.000000            2.016110e+07   
max       22.000000   1051.000000       16.000000            2.017023e+07   

       expiration_date  
count     3.440300e+04  
mean      2.016901e+07  
std       7.320925e+03  
min       1.970010e+07  
25%       2.017020e+07  
50%       2.017091e+07  
75%       2.017093e+07  
max       2.020102e+07  
In [15]:
print(songs.describe())
 
        song_length      language
count  2.296320e+06  2.296319e+06
mean   2.469935e+05  3.237800e+01
std    1.609200e+05  2.433241e+01
min    1.850000e+02 -1.000000e+00
25%    1.836000e+05 -1.000000e+00
50%    2.266270e+05  5.200000e+01
75%    2.772690e+05  5.200000e+01
max    1.217385e+07  5.900000e+01
 

Running some statistics on members.csv:

In [16]:
mpl.rcParams['font.size'] = 40.0
plt.figure(figsize = (20, 20)) 
sns.distplot(members.registration_init_time)
sns.set(font_scale=2)
plt.ylabel('ecdf',fontsize=50)
plt.xlabel('registration time ' ,fontsize=50)
Out[16]:
Text(0.5,0,'registration time ')
 
 

The inference we can draw from the two results above is that most registrations happened between 2012 and 2016; the distribution is strongly skewed toward recent years. One more thing: before applying models, we have to normalize this feature.
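A sketch of that normalization step, assuming registration_init_time is still the raw %Y%m%d integer loaded above:

# parse the yyyymmdd integers into real dates, then min-max scale the year
reg = pd.to_datetime(members['registration_init_time'], format='%Y%m%d')
reg_year = reg.dt.year
reg_year_scaled = (reg_year - reg_year.min()) / (reg_year.max() - reg_year.min())
print(reg_year_scaled.describe())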

In [17]:
members.describe()
Out[17]:
      | city         | bd           | registered_via | registration_init_time | expiration_date
count | 34403.000000 | 34403.000000 | 34403.000000   | 3.440300e+04           | 3.440300e+04
mean  | 5.371276     | 12.280935    | 5.953376       | 2.013994e+07           | 2.016901e+07
std   | 6.243929     | 18.170251    | 2.287534       | 2.954015e+04           | 7.320925e+03
min   | 1.000000     | -43.000000   | 3.000000       | 2.004033e+07           | 1.970010e+07
25%   | 1.000000     | 0.000000     | 4.000000       | 2.012103e+07           | 2.017020e+07
50%   | 1.000000     | 0.000000     | 7.000000       | 2.015090e+07           | 2.017091e+07
75%   | 10.000000    | 25.000000    | 9.000000       | 2.016110e+07           | 2.017093e+07
max   | 22.000000    | 1051.000000  | 16.000000      | 2.017023e+07           | 2.020102e+07
In [18]:
songs.describe()
Out[18]:
      | song_length  | language
count | 2.296320e+06 | 2.296319e+06
mean  | 2.469935e+05 | 3.237800e+01
std   | 1.609200e+05 | 2.433241e+01
min   | 1.850000e+02 | -1.000000e+00
25%   | 1.836000e+05 | -1.000000e+00
50%   | 2.266270e+05 | 5.200000e+01
75%   | 2.772690e+05 | 5.200000e+01
max   | 1.217385e+07 | 5.900000e+01
In [19]:
train.describe()
Out[19]:
      | target
count | 7.377418e+06
mean  | 5.035171e-01
std   | 4.999877e-01
min   | 0.000000e+00
25%   | 0.000000e+00
50%   | 1.000000e+00
75%   | 1.000000e+00
max   | 1.000000e+00
In [20]:
train.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7377418 entries, 0 to 7377417
Data columns (total 6 columns):
msno                  object
song_id               object
source_system_tab     object
source_screen_name    object
source_type           object
target                int64
dtypes: int64(1), object(5)
memory usage: 337.7+ MB
In [21]:
members.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34403 entries, 0 to 34402
Data columns (total 7 columns):
msno                      34403 non-null object
city                      34403 non-null int64
bd                        34403 non-null int64
gender                    14501 non-null object
registered_via            34403 non-null int64
registration_init_time    34403 non-null int64
expiration_date           34403 non-null int64
dtypes: int64(5), object(2)
memory usage: 1.8+ MB
 

We can see that in the members and songs CSV files there are large differences between min and max values, which suggests there are outliers that have to be removed before building the system.

Conversion between int, float, and categorical types also has to be done, to reduce the data size for computation as well as storage.
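A sketch of that conversion using pandas' built-in downcasting; the notebook defines its own change_datatype converter further down (In [32]), so this is just the library-level equivalent:

def downcast_numeric(df):
    # shrink integer columns to the smallest type that fits their range,
    # and float64 columns to float32
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')

# e.g. downcast_numeric(members); members.info()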

In [22]:
train_members = pd.merge(train, members, on='msno', how='inner')
# how='outer' also keeps songs that never appear in train; those rows get NaN
# in all train/member columns (visible in the missing-value counts below)
train_merged = pd.merge(train_members, songs, on='song_id', how='outer')
print(train_merged.head())
 
                                           msno  \
0  FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg=   
1  pouJqjNRmZOnRNzzMWWkamTKkIGHyvhl/jo4HgbncnM=   
2  xbodnNBaLMyqqI7uFJlvHOKMJaizuWo/BB/YHZICcKo=   
3  s0ndDsjI79amU0RBiullFN8HRz9HjE++34jGNa7zJ/s=   
4  Vw4Umh6/qlsJDC/XMslyAxVvRgFJGHr53yb/nrmY1DU=   

                                        song_id source_system_tab  \
0  BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=           explore   
1  BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=          discover   
2  BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=        my library   
3  BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=        my library   
4  BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik=        my library   

     source_screen_name      source_type  target  city    bd  gender  \
0               Explore  online-playlist     1.0   1.0   0.0     NaN   
1  Online playlist more  online-playlist     0.0  15.0  18.0    male   
2   Local playlist more    local-library     1.0   1.0   0.0     NaN   
3   Local playlist more    local-library     1.0   5.0  21.0  female   
4   Local playlist more    local-library     0.0   6.0  33.0  female   

   registered_via  registration_init_time  expiration_date  song_length  \
0             7.0              20120102.0       20171005.0     206471.0   
1             4.0              20151220.0       20170930.0     206471.0   
2             7.0              20120804.0       20171004.0     206471.0   
3             9.0              20110808.0       20170917.0     206471.0   
4             9.0              20070323.0       20170915.0     206471.0   

  genre_ids artist_name              composer lyricist  language  
0       359    Bastille  Dan Smith| Mark Crew      NaN      52.0  
1       359    Bastille  Dan Smith| Mark Crew      NaN      52.0  
2       359    Bastille  Dan Smith| Mark Crew      NaN      52.0  
3       359    Bastille  Dan Smith| Mark Crew      NaN      52.0  
4       359    Bastille  Dan Smith| Mark Crew      NaN      52.0  
In [23]:
test_members = pd.merge(test, members, on='msno', how='inner')
test_merged = pd.merge(test_members, songs, on='song_id', how='outer')
print(test_merged.head())
print(len(test_merged.columns))
 
          id                                          msno  \
0        0.0  V8ruy7SGk7tDm3zA51DPpn6qutt+vmKMBKa21dp54uM=   
1  1035059.0  08rvvaaab7dM7h78GC4SphLkUCSXPxpu6sY+k8aLUO4=   
2    89968.0  1NvrMNDUcvfqOIjhim8BgdK23znMzGwAO84W+qKs6dw=   
3   972394.0  GfSXhTVP3oj7h0545L/5xh6jD+7edQ7AH0iprl7dYbc=   
4  2194574.0  HkWEvfQyrb5Lve8X3B7HkCEkDFW8qFy/9kWFb4QbM5k=   

                                        song_id source_system_tab  \
0  WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=        my library   
1  WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=        my library   
2  WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=        my library   
3  WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=        my library   
4  WmHKgKMlp1lQMecNdNvDMkvIycZYHnFwDT72I5sIssc=          discover   

    source_screen_name          source_type  city    bd  gender  \
0  Local playlist more        local-library   1.0   0.0     NaN   
1  Local playlist more        local-library   5.0  29.0  female   
2  Local playlist more        local-library  14.0  20.0     NaN   
3  Local playlist more        local-library  22.0  22.0    male   
4     Discover Feature  song-based-playlist  15.0  26.0  female   

   registered_via  registration_init_time  expiration_date  song_length  \
0             7.0              20160219.0       20170918.0     224130.0   
1             7.0              20120105.0       20171113.0     224130.0   
2             3.0              20130908.0       20171003.0     224130.0   
3             7.0              20131011.0       20170911.0     224130.0   
4             9.0              20060616.0       20180516.0     224130.0   

  genre_ids         artist_name        composer lyricist  language  
0       458  梁文音 (Rachel Liang)  Qi Zheng Zhang      NaN       3.0  
1       458  梁文音 (Rachel Liang)  Qi Zheng Zhang      NaN       3.0  
2       458  梁文音 (Rachel Liang)  Qi Zheng Zhang      NaN       3.0  
3       458  梁文音 (Rachel Liang)  Qi Zheng Zhang      NaN       3.0  
4       458  梁文音 (Rachel Liang)  Qi Zheng Zhang      NaN       3.0  
18
In [24]:
del train_members
del test_members
In [25]:
ax = sns.countplot(y=train_merged.dtypes, data=train_merged)
 
In [26]:
print(train_merged.columns.to_series().groupby(train_merged.dtypes).groups)
print(test_merged.columns.to_series().groupby(test_merged.dtypes).groups)
 
{dtype('float64'): Index(['target', 'city', 'bd', 'registered_via', 'registration_init_time',
       'expiration_date', 'song_length', 'language'],
      dtype='object'), dtype('O'): Index(['msno', 'song_id', 'source_system_tab', 'source_screen_name',
       'source_type', 'gender', 'genre_ids', 'artist_name', 'composer',
       'lyricist'],
      dtype='object')}
{dtype('float64'): Index(['id', 'city', 'bd', 'registered_via', 'registration_init_time',
       'expiration_date', 'song_length', 'language'],
      dtype='object'), dtype('O'): Index(['msno', 'song_id', 'source_system_tab', 'source_screen_name',
       'source_type', 'gender', 'genre_ids', 'artist_name', 'composer',
       'lyricist'],
      dtype='object')}
 

Analysis on missing values

In [27]:
msno.heatmap(train_merged)
#msno.matrix(train_merged)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f84742623c8>
 
 

As we can see, a lot of missing values show up; the common pattern is that most of them arise from the members and songs tables (and from the outer merge keeping songs with no training rows).

The missingness heatmap also shows which gaps are positively correlated: gender with four variables from train.csv, and the rest among the members.csv variables.

In [28]:
#msno.dendrogram(train_merged)
 

A strong nullity correlation is visible here:

song_id -> language, song_length, artist_name, genre_ids

composer -> lyricist

gender -> song_id

From the heatmap we can also say that when gender is missing, about 70% of the time msno, target, city, etc. are missing too.
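Those readings can be double-checked without missingno: correlating the per-column missingness indicators reproduces the nullity correlations (a quick sketch; columns with no missing values come out as NaN):

null_corr = train_merged.isnull().corr()
# e.g. how gender's missingness tracks the columns that came from train.csv
print(null_corr.loc['gender', ['msno', 'target', 'city']])
print(null_corr.loc['composer', 'lyricist'])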

 

Now let's check for missing values and replace them with a sentinel value.

In [29]:
#--- Function to check if missing values are present and if so print the columns having them ---
def check_missing_values(df):
    print(df.isnull().values.any())
    if df.isnull().values.any():
        # keep everything inside the if, so there is no NameError when
        # the frame has no missing values at all
        columns_with_Nan = df.columns[df.isnull().any()].tolist()
        print(columns_with_Nan)
        for col in columns_with_Nan:
            print("%s : %d" % (col, df[col].isnull().sum()))

check_missing_values(train_merged)
check_missing_values(test_merged)
 
True
['msno', 'source_system_tab', 'source_screen_name', 'source_type', 'target', 'city', 'bd', 'gender', 'registered_via', 'registration_init_time', 'expiration_date', 'song_length', 'genre_ids', 'artist_name', 'composer', 'lyricist', 'language']
msno : 1936406
source_system_tab : 1961255
source_screen_name : 2351210
source_type : 1957945
target : 1936406
city : 1936406
bd : 1936406
gender : 4897885
registered_via : 1936406
registration_init_time : 1936406
expiration_date : 1936406
song_length : 114
genre_ids : 205338
artist_name : 114
composer : 2591558
lyricist : 4855358
language : 150
True
['id', 'msno', 'source_system_tab', 'source_screen_name', 'source_type', 'city', 'bd', 'gender', 'registered_via', 'registration_init_time', 'expiration_date', 'song_length', 'genre_ids', 'artist_name', 'composer', 'lyricist', 'language']
id : 2071581
msno : 2071581
source_system_tab : 2080023
source_screen_name : 2234464
source_type : 2078878
city : 2071581
bd : 2071581
gender : 3123805
registered_via : 2071581
registration_init_time : 2071581
expiration_date : 2071581
song_length : 25
genre_ids : 132345
artist_name : 25
composer : 1595714
lyricist : 3008577
language : 42
In [30]:
#--- Function to replace NaN values in float columns with -5 ---
def replace_Nan_non_object(df):
    float_cols = list(df.select_dtypes(include=['float']).columns)
    for col in float_cols:
        df[col] = df[col].fillna(-5)
       
replace_Nan_non_object(train_merged) 
replace_Nan_non_object(test_merged)  
In [31]:
#--- memory consumed by train dataframe ---
mem = train_merged.memory_usage(index=True).sum()
print("Memory consumed by training set  :   {} MB" .format(mem/ 1024**2))
 
#--- memory consumed by test dataframe ---
mem = test_merged.memory_usage(index=True).sum()
print("Memory consumed by test set      :   {} MB" .format(mem/ 1024**2))
 
Memory consumed by training set  :   1350.117919921875 MB
Memory consumed by test set      :   670.9216995239258 MB
In [32]:
def change_datatype(df):
    float_cols = list(df.select_dtypes(include=['float']).columns)
    for col in float_cols:
        # downcast each float column to the smallest integer type
        # that can hold its full range of values
        if (np.max(df[col]) <= 127) and (np.min(df[col]) >= -128):
            df[col] = df[col].astype(np.int8)
        elif (np.max(df[col]) <= 32767) and (np.min(df[col]) >= -32768):
            df[col] = df[col].astype(np.int16)
        elif (np.max(df[col]) <= 2147483647) and (np.min(df[col]) >= -2147483648):
            df[col] = df[col].astype(np.int32)
        else:
            df[col] = df[col].astype(np.int64)

change_datatype(train_merged)
change_datatype(test_merged)
In [33]:
data = train_merged.groupby('target').aggregate({'msno':'count'}).reset_index()
a4_dims = (15, 8)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x='target', y='msno', data=data)
 
 

The bar plot shows the event counts per target class: target=0 and target=1 are roughly balanced, at about 3.7M each.

The -5 bar marks rows whose target was missing, i.e. songs kept by the outer merge that never appear in the training events.

In [34]:
mpl.rcParams['font.size'] = 40.0
plt.figure(figsize = (20, 20)) 
data=train_merged.groupby('source_system_tab').aggregate({'msno':'count'}).reset_index()
sns.barplot(x='source_system_tab',y='msno',data=data)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8474172ac8>
 
In [35]:
data = train_merged.groupby('source_screen_name').aggregate({'msno':'count'}).reset_index()
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x='source_screen_name', y='msno', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
Out[35]:
[Text(0,0,'Album more'),
 Text(0,0,'Artist more'),
 Text(0,0,'Concert'),
 Text(0,0,'Discover Chart'),
 Text(0,0,'Discover Feature'),
 Text(0,0,'Discover Genre'),
 Text(0,0,'Discover New'),
 Text(0,0,'Explore'),
 Text(0,0,'Local playlist more'),
 Text(0,0,'My library'),
 Text(0,0,'My library_Search'),
 Text(0,0,'Online playlist more'),
 Text(0,0,'Others profile more'),
 Text(0,0,'Payment'),
 Text(0,0,'Radio'),
 Text(0,0,'Search'),
 Text(0,0,'Search Home'),
 Text(0,0,'Search Trends'),
 Text(0,0,'Self profile more'),
 Text(0,0,'Unknown')]
 
In [36]:
data = train_merged.groupby('source_type').aggregate({'msno':'count'}).reset_index()
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x='source_type', y='msno', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
Out[36]:
[Text(0,0,'album'),
 Text(0,0,'artist'),
 Text(0,0,'listen-with'),
 Text(0,0,'local-library'),
 Text(0,0,'local-playlist'),
 Text(0,0,'my-daily-playlist'),
 Text(0,0,'online-playlist'),
 Text(0,0,'radio'),
 Text(0,0,'song'),
 Text(0,0,'song-based-playlist'),
 Text(0,0,'top-hits-for-artist'),
 Text(0,0,'topic-article-playlist')]
 
In [37]:
data = train_merged.groupby('language').aggregate({'msno':'count'}).reset_index()
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x='language', y='msno', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
Out[37]:
[Text(0,0,'-5'),
 Text(0,0,'-1'),
 Text(0,0,'3'),
 Text(0,0,'10'),
 Text(0,0,'17'),
 Text(0,0,'24'),
 Text(0,0,'31'),
 Text(0,0,'38'),
 Text(0,0,'45'),
 Text(0,0,'52'),
 Text(0,0,'59')]
 
In [38]:
data = train_merged.groupby('registered_via').aggregate({'msno':'count'}).reset_index()
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x='registered_via', y='msno', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
Out[38]:
[Text(0,0,'-5'),
 Text(0,0,'3'),
 Text(0,0,'4'),
 Text(0,0,'7'),
 Text(0,0,'9'),
 Text(0,0,'13')]
 
 

Most users registered via methods 7 and 9.

In [39]:
print(train_merged.columns)
data = train_merged.groupby('city').aggregate({'msno':'count'}).reset_index()
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax = sns.barplot(x='city', y='msno', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
 
Index(['msno', 'song_id', 'source_system_tab', 'source_screen_name',
       'source_type', 'target', 'city', 'bd', 'gender', 'registered_via',
       'registration_init_time', 'expiration_date', 'song_length', 'genre_ids',
       'artist_name', 'composer', 'lyricist', 'language'],
      dtype='object')
Out[39]:
[Text(0,0,'-5'),
 Text(0,0,'1'),
 Text(0,0,'3'),
 Text(0,0,'4'),
 Text(0,0,'5'),
 Text(0,0,'6'),
 Text(0,0,'7'),
 Text(0,0,'8'),
 Text(0,0,'9'),
 Text(0,0,'10'),
 Text(0,0,'11'),
 Text(0,0,'12'),
 Text(0,0,'13'),
 Text(0,0,'14'),
 Text(0,0,'15'),
 Text(0,0,'16'),
 Text(0,0,'17'),
 Text(0,0,'18'),
 Text(0,0,'19'),
 Text(0,0,'20'),
 Text(0,0,'21'),
 Text(0,0,'22')]
 
 

Cities 1, 13, and 5 contain the most users.

In [40]:
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax=sns.countplot(x="source_system_tab",data=train_merged,palette=['lightblue','orange','green'],hue="target")
plt.xlabel("source_screen_tab")
plt.ylabel("count")
plt.title("source_system_tab vs target ")
plt.show()
 
 

Non-repeat events (target=0) come mainly from discover and my library, while repeat events (target=1) come overwhelmingly from my library.

In [41]:
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax=sns.countplot(x="source_screen_name",data=train_merged,palette=['#A8B820','yellow','#98D8D8'],hue="target")
plt.xlabel("source_screen_name")
plt.ylabel("count")
plt.title("source_screen_name vs target ")
plt.xticks(rotation='90')
plt.show()
 
 

Local playlist more is the most common screen through which both groups (target=0 and target=1) get back to their songs.

In [42]:
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax=sns.countplot(x="gender",data=train_merged,palette=['#705898','#7038F8','yellow'],hue="target")
plt.xlabel("male female participation")
plt.ylabel("count")
plt.title("male female participation vs target ")
plt.xticks(rotation='90')
plt.legend(loc='upper left')
plt.show()
 
 

Among users with a recorded gender, female users account for somewhat more repeat (target=1) events than male users.

In [43]:
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax=sns.heatmap(data=train_merged.corr(),annot=True,fmt=".2f")
 
In [44]:
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax=sns.boxplot(x="gender",y="city",data=train_merged,palette=['blue','orange','green'],hue="target")
plt.xlabel("gender")
plt.ylabel("city")
plt.title("city vs registered_via  ")
plt.show()
 
 

Here we can see that most of our users sit between city indexes 5 and 14, and the female distribution looks about the same.

For male users the typical city index is around 13 to 15.

In [45]:
ax=sns.lmplot(x="bd",y="registered_via",data=train_merged,palette=['blue','orange','green'],hue="target",fit_reg=False)
plt.xlabel("bd age group")
plt.ylabel("registred_via")
plt.title(" bd age group vs registration_via ")
plt.show()
 
 

Here we can see that users' ages range from 0 to 100, so bd clearly contains outliers. The interesting information is that most users are either youngsters or in the 30+ age group, concentrated between registered_via indexes 5 and 10.

In [46]:
ax=sns.lmplot(x="bd",y="city",data=train_merged,palette=['blue','orange','green'],hue="target",fit_reg=False)
plt.xlabel("bd age group")
plt.ylabel("city")
plt.title("bd (age group) vs city ")
plt.show()
 
 

The outliers are still in place; we will remove the bd outliers at the final stage before applying ML. Even with them, these results tell us most users are aged 20 to 30+ and sit in city indexes 5 to 14.

In [47]:
#removing outliers from the bd age group column
In [48]:
a4_dims = (15, 7)
fig, ax = plt.subplots(figsize=a4_dims)
ax=sns.boxplot(x="bd",y="gender",data=train_merged,palette=['blue','orange','green'])
plt.xlabel("bd age group")
plt.ylabel("gender")
plt.title("bd age group vs gender ")
plt.show()
 
 

As we can see, the mean age sits around 24 to 27; the maximum whisker is about 50 for females and 48 for males, with minimums of about 16 and 18 respectively.

We can also see more outliers on the female side, including impossible ages above 100. Whatever the cause, it comes down to unclean data, so these outliers have to be removed.

In [49]:
train_merged.describe()
def remove_outlier(df_in, col_name):
    # quantile-based fences degenerate on bd (the lower quartile is 0),
    # so use fixed, hand-picked fences instead
    fence_low  = 12
    fence_high = 45
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
df_final_train = remove_outlier(train_merged, 'bd')
 

I think we now have almost clean data. For this cleaning I used a brute-force approach: the lower quartile of bd comes out as 0, so fences based on quartiles or standard deviations are useless here, and I picked the cut-offs by inspection instead, specifically for the bd age group. Let's hope the system's accuracy doesn't go down.
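For reference, here is the quantile-based rule the commented-out lines were heading toward. On this data bd's 25th and 50th percentiles are both 0 (unknown ages stored as 0), so the IQR fences would still keep all those zeros, which is why the fixed fences (12, 45) were used instead:

def remove_outlier_iqr(df_in, col_name):
    # classic Tukey fences: 1.5 * IQR beyond the quartiles
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    return df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]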

 

Now moving toward the ML (machine learning) approach.

ROHAN (Data Scientist at Databytes Analytics)

My approach is to keep making the machine learn until we get a generalized model; I call it S.E.D.A.C.R.O.M.L.

First, clean the test data set the same way, so that we have a fully clean data set.

It is not yet preprocessed, though, so the preprocessing task comes next.

Why am I doing all the classification with trees? There is a reason: tree-based algorithms are well suited to this kind of tabular classification, with gradient boosting and XGBoost-style boosters iteratively improving a loss function. SVMs, by contrast, are not made for this kind of workload; they are powerful for text, audio, and images.

In [50]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import lightgbm as lgb

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.
 
members.csv
sample_submission.csv
song_extra_info.csv
songs.csv
test.csv
train.csv

In [51]:
print('Loading data...')
data_path = '../input/'
train = pd.read_csv(data_path + 'train.csv', dtype={'msno' : 'category',
                                                'source_system_tab' : 'category',
                                                  'source_screen_name' : 'category',
                                                  'source_type' : 'category',
                                                  'target' : np.uint8,
                                                  'song_id' : 'category'})
test = pd.read_csv(data_path + 'test.csv', dtype={'msno' : 'category',
                                                'source_system_tab' : 'category',
                                                'source_screen_name' : 'category',
                                                'source_type' : 'category',
                                                'song_id' : 'category'})
songs = pd.read_csv(data_path + 'songs.csv',dtype={'genre_ids': 'category',
                                                  'language' : 'category',
                                                  'artist_name' : 'category',
                                                  'composer' : 'category',
                                                  'lyricist' : 'category',
                                                  'song_id' : 'category'})
members = pd.read_csv(data_path + 'members.csv',dtype={'city' : 'category',
                                                      'bd' : np.uint8,
                                                      'gender' : 'category',
                                                      'registered_via' : 'category'},
                     parse_dates=['registration_init_time','expiration_date'])
songs_extra = pd.read_csv(data_path + 'song_extra_info.csv')
print('Done loading...')
 
Loading data...
Done loading...
In [52]:
song_cols = ['song_id', 'artist_name', 'genre_ids', 'song_length', 'language']
train = train.merge(songs[song_cols], on='song_id', how='left')
test = test.merge(songs[song_cols], on='song_id', how='left')

# registration_init_time and expiration_date were parsed as datetimes above,
# so use the .dt accessor (string-slicing a Timestamp would pick up the
# '-' separators and give a wrong month)
members['registration_year'] = members['registration_init_time'].dt.year
#members['registration_month'] = members['registration_init_time'].dt.month
#members['registration_day'] = members['registration_init_time'].dt.day

members['expiration_year'] = members['expiration_date'].dt.year
members['expiration_month'] = members['expiration_date'].dt.month

# leaving out some unimportant features


# Convert date to number of days
members['membership_days'] = (members['expiration_date'] - members['registration_init_time']).dt.days.astype(int)

#members = members.drop(['registration_init_time'], axis=1)
#members = members.drop(['expiration_date'], axis=1)
In [53]:
# categorize membership_days 
members['membership_days'] = members['membership_days']//200
members['membership_days'] = members['membership_days'].astype('category')
In [54]:
member_cols = ['msno','city','registered_via', 'registration_year', 'expiration_year', 'membership_days']

train = train.merge(members[member_cols], on='msno', how='left')
test = test.merge(members[member_cols], on='msno', how='left')
 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
In [55]:
train.info()
 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7377418 entries, 0 to 7377417
Data columns (total 15 columns):
msno                  object
song_id               object
source_system_tab     category
source_screen_name    category
source_type           category
target                uint8
artist_name           category
genre_ids             category
song_length           float64
language              category
city                  category
registered_via        category
registration_year     int64
expiration_year       int64
membership_days       category
dtypes: category(9), float64(1), int64(2), object(2), uint8(1)
memory usage: 448.0+ MB
In [56]:
def isrc_to_year(isrc):
    # characters 5:7 of an ISRC hold a two-digit reference year; bucket it
    # into 5-year bins (note: the two-digit year is used as-is, so 19xx
    # and 20xx years can share buckets)
    if type(isrc) == str:
        return int(isrc[5:7]) // 5
    return np.nan

# categorize song_year per 5 years
songs_extra['song_year'] = songs_extra['isrc'].apply(isrc_to_year)
songs_extra.drop(['isrc', 'name'], axis = 1, inplace = True)
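A quick worked check of the bucketing (the ISRCs here are hypothetical):

print(isrc_to_year('USAB10912345'))  # year digits '09' -> 9 // 5 = 1
print(isrc_to_year('TWXX09812345'))  # year digits '98' -> 98 // 5 = 19
print(isrc_to_year(np.nan))          # non-string -> nan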
In [57]:
train = train.merge(songs_extra, on = 'song_id', how = 'left')
test = test.merge(songs_extra, on = 'song_id', how = 'left')
 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
In [58]:
# keep only the first genre when several are separated by '|'
# (note: test['genre_ids'] is left untouched here and keeps the full strings)
train['genre_ids'] = train['genre_ids'].str.split('|').str[0]
In [59]:
temp_song_length = train['song_length']
In [60]:
train.drop('song_length', axis = 1, inplace = True)
test.drop('song_length',axis = 1 , inplace =True)
In [61]:
train.head()
Out[61]:
   | msno                                         | song_id                                      | source_system_tab | source_screen_name  | source_type     | target | artist_name     | genre_ids | language | city | registered_via | registration_year | expiration_year | membership_days | song_year
 0 | FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg= | BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik= | explore           | Explore             | online-playlist | 1      | Bastille        | 359       | 52.0     | 1    | 7              | 2012              | 2017            | 10              | 3.0
 1 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | bhp/MpSNoqoxOIB+/l8WPqu6jldth4DIpCm3ayXnJqM= | my library        | Local playlist more | local-playlist  | 1      | Various Artists | 1259      | 52.0     | 13   | 9              | 2011              | 2017            | 11              | 19.0
 2 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | JNWfrrC7zNN7BdMpsISKa4Mw+xVJYNnxXh3/Epw7QgY= | my library        | Local playlist more | local-playlist  | 1      | Nas             | 1259      | 52.0     | 13   | 9              | 2011              | 2017            | 11              | 1.0
 3 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | 2A87tzfnJTSWqD7gIZHisolhe4DMdzkbd6LzO1KHjNs= | my library        | Local playlist more | local-playlist  | 1      | Soundway        | 1019      | -1.0     | 13   | 9              | 2011              | 2017            | 11              | 2.0
 4 | FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg= | 3qm6XTZ6MOCU11x8FIVbAGH5l5uMkT3/ZalWG1oo2Gc= | explore           | Explore             | online-playlist | 1      | Brett Young     | 1011      | 52.0     | 1    | 7              | 2012              | 2017            | 10              | 3.0
In [62]:
song_count = train.loc[:,["song_id","target"]]

# measure repeat count by played songs
song_count1 = song_count.groupby(["song_id"],as_index=False).sum().rename(columns={"target":"repeat_count"})

# count play count by songs
song_count2 = song_count.groupby(["song_id"],as_index=False).count().rename(columns = {"target":"play_count"})
In [63]:
song_repeat = song_count1.merge(song_count2,how="inner",on="song_id")
song_repeat["repeat_percentage"] = round((song_repeat['repeat_count']*100) / song_repeat['play_count'],1)
song_repeat['repeat_count'] = song_repeat['repeat_count'].astype('int')
song_repeat['repeat_percentage'] = song_repeat['repeat_percentage'].replace(100.0,np.nan)
# most 100.0 values come from songs played once and repeated once; comparing
# those with heavily played songs would not be fair
In [64]:
train = train.merge(song_repeat,on="song_id",how="left")
test = test.merge(song_repeat,on="song_id",how="left")
 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
In [65]:
# type cast
test['song_id'] = test['song_id'].astype('category')
test['repeat_count'] = test['repeat_count'].fillna(0)
test['repeat_count'] = test['repeat_count'].astype('int')
test['play_count'] = test['play_count'].fillna(0)
test['play_count'] = test['play_count'].astype('int')
#train['repeat_percentage'].replace(100.0,np.nan)
In [66]:
artist_count = train.loc[:,["artist_name","target"]]

# measure repeat count by played songs
artist_count1 = artist_count.groupby(["artist_name"],as_index=False).sum().rename(columns={"target":"repeat_count_artist"})

# measure play count by songs
artist_count2 = artist_count.groupby(["artist_name"],as_index=False).count().rename(columns = {"target":"play_count_artist"})

artist_repeat = artist_count1.merge(artist_count2,how="inner",on="artist_name")
In [67]:
artist_repeat["repeat_percentage_artist"] = round((artist_repeat['repeat_count_artist']*100) / artist_repeat['play_count_artist'],1)
artist_repeat['repeat_count_artist'] = artist_repeat['repeat_count_artist'].fillna(0)
artist_repeat['repeat_count_artist'] = artist_repeat['repeat_count_artist'].astype('int')
artist_repeat['repeat_percentage_artist'] = artist_repeat['repeat_percentage_artist'].replace(100.0,np.nan)
In [68]:
#use only repeat_percentage_artist
del artist_repeat['repeat_count_artist']
#del artist_repeat['play_count_artist']
 

An aside: in earlier experiments, the plain decision tree did much better than the extra-trees classifier.
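That comparison is not shown in this notebook; a minimal sketch of how it could be run on a numeric slice of the current train frame (the feature list here is ours, purely illustrative):

from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

cols = ['target', 'registration_year', 'expiration_year', 'repeat_count', 'play_count']
sample = train[cols].dropna().sample(100000, random_state=1)
X, y = sample.drop('target', axis=1), sample['target']
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=1)
for model in (DecisionTreeClassifier(max_depth=8), ExtraTreeClassifier(max_depth=8)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))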

In [69]:
#merge it with artist_name to train dataframe
train = train.merge(artist_repeat,on="artist_name",how="left")
test = test.merge(artist_repeat,on="artist_name",how="left")
 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 

Here the tree-based approach gets its final showcase; we can expect a lot from this one.

In [70]:
train.info()
 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7377418 entries, 0 to 7377417
Data columns (total 20 columns):
msno                        object
song_id                     object
source_system_tab           category
source_screen_name          category
source_type                 category
target                      uint8
artist_name                 category
genre_ids                   object
language                    category
city                        category
registered_via              category
registration_year           int64
expiration_year             int64
membership_days             category
song_year                   float64
repeat_count                int64
play_count                  int64
repeat_percentage           float64
play_count_artist           float64
repeat_percentage_artist    float64
dtypes: category(8), float64(4), int64(4), object(3), uint8(1)
memory usage: 771.6+ MB
In [71]:
del train['artist_name']
del test['artist_name']
In [72]:
msno_count = train.loc[:,["msno","target"]]

# count repeat count by played songs
msno_count1 = msno_count.groupby(["msno"],as_index=False).sum().rename(columns={"target":"repeat_count_msno"})

# count play count by songs
msno_count2 = msno_count.groupby(["msno"],as_index=False).count().rename(columns = {"target":"play_count_msno"})

msno_repeat = msno_count1.merge(msno_count2,how="inner",on="msno")
In [73]:
msno_repeat["repeat_percentage_msno"] = round((msno_repeat['repeat_count_msno']*100) / msno_repeat['play_count_msno'],1)
msno_repeat['repeat_count_msno'] = msno_repeat['repeat_count_msno'].fillna(0)
msno_repeat['repeat_count_msno'] = msno_repeat['repeat_count_msno'].astype('int')
#msno_repeat['repeat_percentage_msno'] = msno_repeat['repeat_percentage_msno'].replace(100.0,np.nan)
# it can be meaningful so do not erase 100.0 
In [74]:
#merge it with msno to train dataframe
train = train.merge(msno_repeat,on="msno",how="left")
test = test.merge(msno_repeat,on="msno",how="left")
 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
 
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
In [75]:
import gc
#del members, songs; gc.collect();

for col in train.columns:
    if train[col].dtype == object:
        train[col] = train[col].astype('category')
        test[col] = test[col].astype('category')
In [76]:
train['song_year'] = train['song_year'].astype('category')
test['song_year'] = test['song_year'].astype('category')
In [77]:
train.head()
Out[77]:
   | msno                                         | song_id                                      | source_system_tab | source_screen_name  | source_type     | target | genre_ids | language | city | registered_via | ... | membership_days | song_year | repeat_count | play_count | repeat_percentage | play_count_artist | repeat_percentage_artist | repeat_count_msno | play_count_msno | repeat_percentage_msno
 0 | FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg= | BBzumQNXUHKdEBOB7mAJuzok+IJA1c2Ryg/yzTF6tik= | explore           | Explore             | online-playlist | 1      | 359       | 52.0     | 1    | 7              | ... | 10              | 3.0       | 102          | 215        | 47.4              | 1140.0            | 46.3                     | 2791              | 5511            | 50.6
 1 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | bhp/MpSNoqoxOIB+/l8WPqu6jldth4DIpCm3ayXnJqM= | my library        | Local playlist more | local-playlist  | 1      | 1259      | 52.0     | 13   | 9              | ... | 11              | 19.0      | 1            | 1          | NaN               | 303616.0          | 51.0                     | 4626              | 6227            | 74.3
 2 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | JNWfrrC7zNN7BdMpsISKa4Mw+xVJYNnxXh3/Epw7QgY= | my library        | Local playlist more | local-playlist  | 1      | 1259      | 52.0     | 13   | 9              | ... | 11              | 1.0       | 2            | 4          | 50.0              | 289.0             | 21.5                     | 4626              | 6227            | 74.3
 3 | Xumu+NIjS6QYVxDS4/t3SawvJ7viT9hPKXmf0RtLNx8= | 2A87tzfnJTSWqD7gIZHisolhe4DMdzkbd6LzO1KHjNs= | my library        | Local playlist more | local-playlist  | 1      | 1019      | -1.0     | 13   | 9              | ... | 11              | 2.0       | 1            | 1          | NaN               | 1.0               | NaN                      | 4626              | 6227            | 74.3
 4 | FGtllVqz18RPiwJj/edr2gV78zirAiY/9SmYvia+kCg= | 3qm6XTZ6MOCU11x8FIVbAGH5l5uMkT3/ZalWG1oo2Gc= | explore           | Explore             | online-playlist | 1      | 1011      | 52.0     | 1    | 7              | ... | 10              | 3.0       | 150          | 412        | 36.4              | 427.0             | 37.7                     | 2791              | 5511            | 50.6

5 rows × 22 columns

In [78]:
drop_list = ['repeat_count','repeat_percentage',
             'repeat_percentage_artist',
             'repeat_count_msno','repeat_percentage_msno'
            ]
train = train.drop(drop_list,axis=1)
test = test.drop(drop_list,axis=1)
 

As we saw earlier, accuracy is high, but the AUC score is still much lower.
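The gap is easy to reproduce: accuracy only checks a 0.5 cutoff, while AUC scores the full ranking, so the two can disagree. A toy check with hypothetical validation targets and probabilities:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_valid = np.array([1, 1, 1, 0, 0])             # hypothetical true targets
p_valid = np.array([0.9, 0.8, 0.55, 0.6, 0.4])  # hypothetical predicted probabilities
print('accuracy:', accuracy_score(y_valid, p_valid > 0.5))  # 0.8
print('auc     :', roc_auc_score(y_valid, p_valid))         # ~0.833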

In [79]:
test['play_count_msno'] = test['play_count_msno'].fillna(0)
test['play_count_msno'] = test['play_count_msno'].astype('int')


train['play_count_artist'] = train['play_count_artist'].fillna(0)
test['play_count_artist'] = test['play_count_artist'].fillna(0)
train['play_count_artist'] = train['play_count_artist'].astype('int')
test['play_count_artist'] = test['play_count_artist'].astype('int')
In [80]:
from sklearn.model_selection import KFold
# Create a Cross Validation with 3 splits
kf = KFold(n_splits=3)

predictions = np.zeros(shape=[len(test)])

# For each KFold split (note: no shuffling, and the data is time-ordered,
# which likely explains the falling validation AUC across the folds below)
for train_indices, validate_indices in kf.split(train):
    train_data = lgb.Dataset(train.drop(['target'],axis=1).loc[train_indices,:],label=train.loc[train_indices,'target'])
    val_data = lgb.Dataset(train.drop(['target'],axis=1).loc[validate_indices,:],label=train.loc[validate_indices,'target'])

    params = {
            'objective': 'binary',
            'boosting': 'gbdt',
            'learning_rate': 0.2 ,
            'verbose': 0,
            'num_leaves': 2**8,
            'bagging_fraction': 0.95,
            'bagging_freq': 1,
            'bagging_seed': 1,
            'feature_fraction': 0.9,
            'feature_fraction_seed': 1,
            'max_bin': 256,
            'num_rounds': 80,
            'metric' : 'auc'
        }
    # Train the model    
    lgbm_model = lgb.train(params, train_data, 100, valid_sets=[val_data])
    predictions += lgbm_model.predict(test.drop(['id'],axis=1))
    del lgbm_model
    # Average the accumulated fold predictions by dividing by the number of KFold splits.
predictions = predictions/3

INPUT_DATA_PATH = '../input/'

# Read the sample_submission CSV
submission = pd.read_csv(INPUT_DATA_PATH + '/sample_submission.csv')
# Set the target to our predictions
submission.target=predictions
# Save the submission file
submission.to_csv('submission.csv',index=False)
 
/opt/conda/lib/python3.6/site-packages/lightgbm/engine.py:99: UserWarning: Found `num_rounds` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
/opt/conda/lib/python3.6/site-packages/lightgbm/basic.py:681: UserWarning: categorical_feature in param dict is overrided.
  warnings.warn('categorical_feature in param dict is overrided.')
 
[1]	valid_0's auc: 0.733548
[2]	valid_0's auc: 0.741157
[3]	valid_0's auc: 0.747674
[4]	valid_0's auc: 0.750715
[5]	valid_0's auc: 0.761668
[6]	valid_0's auc: 0.762847
[7]	valid_0's auc: 0.76451
[8]	valid_0's auc: 0.767377
[9]	valid_0's auc: 0.768492
[10]	valid_0's auc: 0.769474
[11]	valid_0's auc: 0.770184
[12]	valid_0's auc: 0.770831
[13]	valid_0's auc: 0.771395
[14]	valid_0's auc: 0.77169
[15]	valid_0's auc: 0.771862
[16]	valid_0's auc: 0.772329
[17]	valid_0's auc: 0.772669
[18]	valid_0's auc: 0.772939
[19]	valid_0's auc: 0.773193
[20]	valid_0's auc: 0.77346
[21]	valid_0's auc: 0.773656
[22]	valid_0's auc: 0.773998
[23]	valid_0's auc: 0.774233
[24]	valid_0's auc: 0.774134
[25]	valid_0's auc: 0.774021
[26]	valid_0's auc: 0.774127
[27]	valid_0's auc: 0.774108
[28]	valid_0's auc: 0.773986
[29]	valid_0's auc: 0.774013
[30]	valid_0's auc: 0.773821
[31]	valid_0's auc: 0.773618
[32]	valid_0's auc: 0.773628
[33]	valid_0's auc: 0.773799
[34]	valid_0's auc: 0.773792
[35]	valid_0's auc: 0.773766
[36]	valid_0's auc: 0.773799
[37]	valid_0's auc: 0.773748
[38]	valid_0's auc: 0.773725
[39]	valid_0's auc: 0.773539
[40]	valid_0's auc: 0.773496
[41]	valid_0's auc: 0.773419
[42]	valid_0's auc: 0.773393
[43]	valid_0's auc: 0.77335
[44]	valid_0's auc: 0.773227
[45]	valid_0's auc: 0.77318
[46]	valid_0's auc: 0.773071
[47]	valid_0's auc: 0.772963
[48]	valid_0's auc: 0.77281
[49]	valid_0's auc: 0.77272
[50]	valid_0's auc: 0.772598
[51]	valid_0's auc: 0.772499
[52]	valid_0's auc: 0.77244
[53]	valid_0's auc: 0.772365
[54]	valid_0's auc: 0.772364
[55]	valid_0's auc: 0.772268
[56]	valid_0's auc: 0.772131
[57]	valid_0's auc: 0.772132
[58]	valid_0's auc: 0.77207
[59]	valid_0's auc: 0.772039
[60]	valid_0's auc: 0.77198
[61]	valid_0's auc: 0.771885
[62]	valid_0's auc: 0.771831
[63]	valid_0's auc: 0.771757
[64]	valid_0's auc: 0.771624
[65]	valid_0's auc: 0.771581
[66]	valid_0's auc: 0.771633
[67]	valid_0's auc: 0.771562
[68]	valid_0's auc: 0.771485
[69]	valid_0's auc: 0.771399
[70]	valid_0's auc: 0.771353
[71]	valid_0's auc: 0.771409
[72]	valid_0's auc: 0.77133
[73]	valid_0's auc: 0.771249
[74]	valid_0's auc: 0.771174
[75]	valid_0's auc: 0.771152
[76]	valid_0's auc: 0.771121
[77]	valid_0's auc: 0.771088
[78]	valid_0's auc: 0.77103
[79]	valid_0's auc: 0.770947
[80]	valid_0's auc: 0.770918
[1]	valid_0's auc: 0.691098
[2]	valid_0's auc: 0.69588
[3]	valid_0's auc: 0.700552
[4]	valid_0's auc: 0.703919
[5]	valid_0's auc: 0.708601
[6]	valid_0's auc: 0.711033
[7]	valid_0's auc: 0.71301
[8]	valid_0's auc: 0.715389
[9]	valid_0's auc: 0.716602
[10]	valid_0's auc: 0.7176
[11]	valid_0's auc: 0.718677
[12]	valid_0's auc: 0.719338
[13]	valid_0's auc: 0.720238
[14]	valid_0's auc: 0.721228
[15]	valid_0's auc: 0.722016
[16]	valid_0's auc: 0.723503
[17]	valid_0's auc: 0.724323
[18]	valid_0's auc: 0.725129
[19]	valid_0's auc: 0.725616
[20]	valid_0's auc: 0.726062
[21]	valid_0's auc: 0.726511
[22]	valid_0's auc: 0.726772
[23]	valid_0's auc: 0.727171
[24]	valid_0's auc: 0.727434
[25]	valid_0's auc: 0.727681
[26]	valid_0's auc: 0.7279
[27]	valid_0's auc: 0.728073
[28]	valid_0's auc: 0.728109
[29]	valid_0's auc: 0.728217
[30]	valid_0's auc: 0.728269
[31]	valid_0's auc: 0.728438
[32]	valid_0's auc: 0.72842
[33]	valid_0's auc: 0.72855
[34]	valid_0's auc: 0.728608
[35]	valid_0's auc: 0.728564
[36]	valid_0's auc: 0.728569
[37]	valid_0's auc: 0.72854
[38]	valid_0's auc: 0.728584
[39]	valid_0's auc: 0.728565
[40]	valid_0's auc: 0.728631
[41]	valid_0's auc: 0.728662
[42]	valid_0's auc: 0.728655
[43]	valid_0's auc: 0.728633
[44]	valid_0's auc: 0.728565
[45]	valid_0's auc: 0.728498
[46]	valid_0's auc: 0.728429
[47]	valid_0's auc: 0.7284
[48]	valid_0's auc: 0.728342
[49]	valid_0's auc: 0.728285
[50]	valid_0's auc: 0.728174
[51]	valid_0's auc: 0.72826
[52]	valid_0's auc: 0.728265
[53]	valid_0's auc: 0.728324
[54]	valid_0's auc: 0.728342
[55]	valid_0's auc: 0.728334
[56]	valid_0's auc: 0.728318
[57]	valid_0's auc: 0.728298
[58]	valid_0's auc: 0.728348
[59]	valid_0's auc: 0.728353
[60]	valid_0's auc: 0.728294
[61]	valid_0's auc: 0.728304
[62]	valid_0's auc: 0.728269
[63]	valid_0's auc: 0.728232
[64]	valid_0's auc: 0.728233
[65]	valid_0's auc: 0.728208
[66]	valid_0's auc: 0.728203
[67]	valid_0's auc: 0.728111
[68]	valid_0's auc: 0.728159
[69]	valid_0's auc: 0.728196
[70]	valid_0's auc: 0.728133
[71]	valid_0's auc: 0.728096
[72]	valid_0's auc: 0.727992
[73]	valid_0's auc: 0.727984
[74]	valid_0's auc: 0.72806
[75]	valid_0's auc: 0.728066
[76]	valid_0's auc: 0.728053
[77]	valid_0's auc: 0.728048
[78]	valid_0's auc: 0.728049
[79]	valid_0's auc: 0.72803
[80]	valid_0's auc: 0.728035
[1]	valid_0's auc: 0.652491
[2]	valid_0's auc: 0.657635
[3]	valid_0's auc: 0.662133
[4]	valid_0's auc: 0.663726
[5]	valid_0's auc: 0.667069
[6]	valid_0's auc: 0.668351
[7]	valid_0's auc: 0.669836
[8]	valid_0's auc: 0.672684
[9]	valid_0's auc: 0.673528
[10]	valid_0's auc: 0.674309
[11]	valid_0's auc: 0.675278
[12]	valid_0's auc: 0.67615
[13]	valid_0's auc: 0.676999
[14]	valid_0's auc: 0.678254
[15]	valid_0's auc: 0.678974
[16]	valid_0's auc: 0.68003
[17]	valid_0's auc: 0.680819
[18]	valid_0's auc: 0.681634
[19]	valid_0's auc: 0.682136
[20]	valid_0's auc: 0.682764
[21]	valid_0's auc: 0.683491
[22]	valid_0's auc: 0.683936
[23]	valid_0's auc: 0.684629
[24]	valid_0's auc: 0.685034
[25]	valid_0's auc: 0.685411
[26]	valid_0's auc: 0.685703
[27]	valid_0's auc: 0.685863
[28]	valid_0's auc: 0.685948
[29]	valid_0's auc: 0.686198
[30]	valid_0's auc: 0.686289
[31]	valid_0's auc: 0.686582
[32]	valid_0's auc: 0.686578
[33]	valid_0's auc: 0.686957
[34]	valid_0's auc: 0.687062
[35]	valid_0's auc: 0.687109
[36]	valid_0's auc: 0.687202
[37]	valid_0's auc: 0.687207
[38]	valid_0's auc: 0.687185
[39]	valid_0's auc: 0.687228
[40]	valid_0's auc: 0.687286
[41]	valid_0's auc: 0.68721
[42]	valid_0's auc: 0.687296
[43]	valid_0's auc: 0.687299
[44]	valid_0's auc: 0.687394
[45]	valid_0's auc: 0.687454
[46]	valid_0's auc: 0.687466
[47]	valid_0's auc: 0.687459
[48]	valid_0's auc: 0.687417
[49]	valid_0's auc: 0.687427
[50]	valid_0's auc: 0.68744
[51]	valid_0's auc: 0.687839
[52]	valid_0's auc: 0.687796
[53]	valid_0's auc: 0.687752
[54]	valid_0's auc: 0.687777
[55]	valid_0's auc: 0.687725
[56]	valid_0's auc: 0.68775
[57]	valid_0's auc: 0.687796
[58]	valid_0's auc: 0.687808
[59]	valid_0's auc: 0.68789
[60]	valid_0's auc: 0.687858
[61]	valid_0's auc: 0.688008
[62]	valid_0's auc: 0.687926
[63]	valid_0's auc: 0.687932
[64]	valid_0's auc: 0.687952
[65]	valid_0's auc: 0.688085
[66]	valid_0's auc: 0.688101
[67]	valid_0's auc: 0.688091
[68]	valid_0's auc: 0.688076
[69]	valid_0's auc: 0.68804
[70]	valid_0's auc: 0.688044
[71]	valid_0's auc: 0.688016
[72]	valid_0's auc: 0.687981
[73]	valid_0's auc: 0.687951
[74]	valid_0's auc: 0.687931
[75]	valid_0's auc: 0.68793
[76]	valid_0's auc: 0.687951
[77]	valid_0's auc: 0.687914
[78]	valid_0's auc: 0.687913
[79]	valid_0's auc: 0.687955
[80]	valid_0's auc: 0.687979
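One takeaway from the logs above: validation AUC peaks early (around round 23 on the first fold, round 40 or so on the second) and then drifts down, so a natural next step is to let LightGBM stop on the validation metric instead of always running 80 rounds. A sketch, assuming the same params dict with 'num_rounds' removed so it does not override the argument (early_stopping_rounds is standard lgb.train usage in the LightGBM version used here):

params.pop('num_rounds', None)   # let num_boost_round / early stopping decide
lgbm_model = lgb.train(params, train_data,
                       num_boost_round=500,
                       valid_sets=[val_data],
                       early_stopping_rounds=20)
predictions = lgbm_model.predict(test.drop(['id'], axis=1),
                                 num_iteration=lgbm_model.best_iteration)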