LinkedIn Network Analysis

2021, Dec 01    

Visualising my LinkedIn social network with networkx, pyvis and plotly to improve my understanding of network and graph theory.

## Importing Libraries

import numpy as np
import pandas as pd
import networkx as nx
from pyvis import network as net
import janitor

import plotly.express as px
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

## Loading dataset
df = pd.read_csv('data/Connections.csv',skiprows=2)
df.info() # summary info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   First Name     397 non-null    object
 1   Last Name      397 non-null    object
 2   Email Address  7 non-null      object
 3   Company        390 non-null    object
 4   Position       390 non-null    object
 5   Connected On   400 non-null    object
dtypes: object(6)
memory usage: 18.9+ KB

At a quick glance, I have about 400 connections.

## Data Cleaning

I clean the column names, drop the attributes I do not need and remove rows with null values in company or position.

new_df = (
        df.clean_names() # remove spacing and capitalisation
        .drop(columns=['first_name','last_name','email_address']) # dropped first, last and email
        .dropna(subset=['company','position']) # remove null values in company and position
        .to_datetime('connected_on', format='%d %b %Y') # convert date column to datetime object
)
new_df.head()
| | company | position | connected_on |
|---|---|---|---|
| 0 | InfoCepts | Talent Acquisition Lead | 2021-11-28 |
| 1 | Yara International | Associate data engineer | 2021-11-27 |
| 2 | Yara International | Lead Recruiter, Digital Ag Solutions | 2021-11-25 |
| 3 | Yara International | Data Scientist | 2021-11-25 |
| 4 | Yara International | Associate Digital Information Specialist | 2021-11-25 |
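
For readers who do not have pyjanitor installed, the same cleaning steps can be approximated in plain pandas. This is only a sketch: the helper name `clean_connections` and the three sample rows are made up for illustration, and the renaming rule is a simplified stand-in for `clean_names()`, not its exact behaviour.

```python
import pandas as pd


def clean_connections(raw: pd.DataFrame) -> pd.DataFrame:
    """Plain-pandas approximation of the pyjanitor chain (hypothetical helper)."""
    # lower-case the column names and replace spaces with underscores
    renamed = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    return (
        renamed
        .drop(columns=["first_name", "last_name", "email_address"])
        .dropna(subset=["company", "position"])  # rows missing either field are dropped
        .assign(
            connected_on=lambda d: pd.to_datetime(d["connected_on"], format="%d %b %Y")
        )
    )


# made-up sample rows that mimic the layout of the LinkedIn export
sample = pd.DataFrame({
    "First Name": ["A", "B", "C"],
    "Last Name": ["X", "Y", "Z"],
    "Email Address": [None, None, None],
    "Company": ["Acme", None, "Globex"],
    "Position": ["Data Scientist", "Engineer", "Analyst"],
    "Connected On": ["28 Nov 2021", "27 Nov 2021", "25 Nov 2021"],
})

cleaned = clean_connections(sample)
print(cleaned)
```

The second sample row has no company, so it is removed by `dropna`, which mirrors what happens to the few incomplete rows in the real export.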

## Data Exploration

  1. Connections at a glance
  2. New connections over time
  3. Top 15 companies my connections work at
  4. Top 15 roles my connections work as

Connections at a glance

new_df1 = new_df[['company','position']].copy() # copy to avoid SettingWithCopyWarning
new_df1['My Network'] = 'My Network' # constant column used as the treemap root

px.treemap(new_df1, path=['My Network', 'company', 'position'], width=1200, height=1200)

plotly treemap

New Connections over time

daily_connections = (new_df
                    .groupby(by=['connected_on']) # group by date
                    .size() # count new connections per day
)
daily_connections.plot() # plot line chart

connections line graph

Looking at new connections over time since I joined LinkedIn, the bulk of my connections were made early on, between the end of 2019 and the start of 2020.
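
Daily counts are noisy, so a monthly view can make a burst like this easier to see. The snippet below is a minimal sketch on made-up dates (the `toy` frame is not my real data), assuming the same `connected_on` column name as above.

```python
import pandas as pd

# made-up connection dates, for illustration only
toy = pd.DataFrame({
    "connected_on": pd.to_datetime(
        ["2019-12-03", "2019-12-20", "2020-01-05", "2020-01-06", "2021-11-28"]
    ),
    "company": ["A", "B", "C", "D", "E"],
})

monthly = (
    toy
    .set_index("connected_on")
    .resample("MS")  # one bin per calendar month, labelled by month start
    .size()          # number of new connections in each month
)
cumulative = monthly.cumsum()  # running total of connections

print(monthly[monthly > 0])
```

Calling `monthly.plot()` or `cumulative.plot()` on the real data would give a smoother line than the per-day chart.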

Top 15 companies my connections work at

companies_count = (new_df
                    .groupby(by=['company']) # group by company
                    .size() # count connections for each company
                    .to_frame('size') # convert to frame
                    .sort_values(by=['size'],ascending=False) # sort in descending order
                    .reset_index()
)
# horizontal bar chart of the top 15, labelled by company name
companies_count.head(15).plot(kind='barh', x='company', y='size').invert_yaxis()

companies bar chart

Top 15 roles my connections are working in

position_count = (new_df
                    .groupby(by=['position']) # group by position
                    .size() # count connections for each position
                    .to_frame('size') # convert to frame
                    .sort_values(by=['size'],ascending=False) # sort in descending order
)
position_count.head(15).plot(kind='barh').invert_yaxis() # horizontal bar chart of the top 15

positions bar chart

The top 3 companies my connections work at are Yara, Archisen and NTU. This is expected, given that I did my undergraduate degree at NTU and worked at Archisen after graduation before joining Yara International.

The most common roles among my connections are Research Assistant, Data Scientist and Software Engineer.
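
The group, count and sort steps behind both bar charts can also be written with `value_counts`, which returns the counts already sorted in descending order. A small sketch on made-up company names:

```python
import pandas as pd

# made-up company names, for illustration only
toy = pd.DataFrame(
    {"company": ["Acme", "Globex", "Acme", "Initech", "Acme", "Globex"]}
)

# count rows per company, largest first, and keep the top 2
top = toy["company"].value_counts().head(2)
print(top)
```

On the real data, `new_df['company'].value_counts().head(15)` should list the same companies as the first bar chart, although companies with equal counts may appear in a different order.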

## Network Analysis

# keep only companies with at least 2 connections
companies_count.reset_index(inplace=True,drop=True)
companies_count_reduced = companies_count.loc[companies_count['size'] >=2]
print(companies_count_reduced.shape)
(42, 2)

# keep only positions held by at least 2 connections
position_count.reset_index(inplace=True)
position_count_reduced = position_count.loc[position_count['size'] >=2]
print(position_count_reduced.shape)
(35, 2)

# Initialise Graph
g1 = nx.Graph()
g1.add_node('root') # initialising myself as the central node

# add one node per company and connect it to the root
for id,row in companies_count_reduced.iterrows():

    # store company name and count
    company = row['company']
    count = row['size']
    
    title = f"<b>{company}</b> - {count}"
    # extract the positions my connections hold and store them in a set to prevent duplication
    positions = set(new_df.loc[new_df['company'] == company, 'position'])
    positions = ''.join('<li>{}</li>'.format(x) for x in positions)

    position_list = f"<ul>{positions}</ul>"
    hover_info = title + position_list

    g1.add_node(company, size = count*2, title = hover_info, color='#3449eb')
    g1.add_edge('root',company,color='grey')

# Generate the graph
company_nt = net.Network(height='700px', width='700px', bgcolor="grey", font_color='white',notebook=True)
company_nt.from_nx(g1)
company_nt.hrepulsion()

company_nt.show('company_graph.html')
display(HTML(filename='company_graph.html')) # load the saved file rather than rendering the file name as text

# initialize graph
g2 = nx.Graph()
g2.add_node('root') # initialize myself as the central node

# use iterrows to iterate through the data frame
for id, row in position_count_reduced.iterrows():

    count = int(row['size']) # keep the count numeric so it can be used as the node size
    position = row['position']

    g2.add_node(position, size=count, color='#3449eb', title=str(count))
    g2.add_edge('root', position, color='grey')

# generate the graph
position_nt = net.Network(height='700px', width='700px', bgcolor="black", font_color='white', notebook = True)
position_nt.from_nx(g2)
position_nt.hrepulsion()

position_nt.show('position_graph.html')
display(HTML(filename='position_graph.html')) # load the saved file rather than rendering the file name as text
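
Both graphs above connect every node only to `root`, so in graph-theory terms each one is a star. A quick way to confirm this is to compute degree centrality with networkx. The sketch below uses made-up company names and counts (`toy_counts` is hypothetical) rather than my real data.

```python
import networkx as nx

# made-up company counts, for illustration only
toy_counts = {"Company A": 5, "Company B": 3, "Company C": 2}

g = nx.Graph()
g.add_node("root")  # myself as the central node
for company, count in toy_counts.items():
    g.add_node(company, size=count * 2)
    g.add_edge("root", company)

# degree centrality = degree / (number of nodes - 1)
centrality = nx.degree_centrality(g)
print(centrality)
```

Here `root` is linked to every other node and each company is linked only to `root`, so `root` gets the maximum centrality and all companies share the same lower value. Measures like this would only become informative once edges between companies or positions are added.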