Making People Finder in 2022 Using BeautifulSoup
Introduction
Hello and welcome back, everyone. In this part of the blog, I will share how we can create our own people finder tool using BeautifulSoup and Python.
Why do we need people finder?
An easier way to find the right person has many benefits, and it may be most valuable for HR companies. A list of professionals and their public profiles, organized by expertise and experience, is rich data. So let's try to build such a dataset for ourselves.
How will we do it?
We will automate searches on a search engine and on a search portal such as GitHub, store the results in a dataframe, and then save them to a file. First, we will run a Google search to find LinkedIn profiles based on a keyword.
Installing Dependencies
For this purpose, we are going to use BeautifulSoup, a Python library.
!pip install beautifulsoup4
Importing Libraries
import pandas as pd
import requests
import urllib3
from bs4 import BeautifulSoup as BS
import time
import bs4
import warnings
warnings.filterwarnings("ignore")
There are a bunch of libraries we will use:
- pandas to build the dataframe later on.
- requests to make HTTP requests.
- urllib3 to create a pool manager and set headers.
- BeautifulSoup to parse and search the page.
- time to show the scrape time.
- warnings to suppress warnings.
Google Search
Let's use Google Search to make our first search. Head over to Google.com and search for linkedin google engineer.
Whenever we search for something on Google, it puts our query into https://google.com/search?q= and shows a list of results. Most of the results, however, are location based.
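query = "linkedin google engineer"
url = f"https://google.com/search?q={query}"
# Note: this PoolManager is never used for the request below, so the
# User-agent set on it has no effect; requests.get sends its own default headers.
http = urllib3.PoolManager()
http.addheaders = [('User-agent', 'Mozilla/61.0')]
# web_page = http.request('GET',url)
web_page = requests.get(url)
# The 'html5lib' parser needs the html5lib package to be installed.
soup = BS(web_page.content, 'html5lib')
# soup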
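The query above contains spaces, and requests takes care of encoding them when it sends the request. As a minimal sketch, the encoding can also be made explicit with the standard library. The helper name build_search_url is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

def build_search_url(query, base="https://google.com/search"):
    """Return the search URL with the query safely encoded."""
    # urlencode turns spaces into '+' and escapes reserved characters.
    return f"{base}?{urlencode({'q': query})}"

print(build_search_url("linkedin google engineer"))
# prints https://google.com/search?q=linkedin+google+engineer
```

This keeps queries with characters such as & or + from being misread as extra URL parameters.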
Getting All URLs
The search result contains a lot of markup, and at the moment we only need the links. So let's find all links using the a element.
urls = soup.find_all("a")
# urls
Getting Only the Relevant URLs
Many of the links are not relevant to us at the moment, for example Google’s Sign In page or Privacy Policy, so let's do a simple check. We will put the name, the profile URL, and the role in a dictionary.
profiles = {"names":[],"urls":[],"roles":[]}
for url in urls:
    # Some anchors have no href; fall back to an empty string so "in" works.
    href = url.get("href") or ""
    if "/url?q=" in href and "linkedin" in href and \
       "accounts.google.com" not in href and "policies.google.com" not in href and "linkedin.com/in" in href:
        # Take the target URL out of Google's redirect link.
        nhref=href.split("=")[1].split("&")[0]
        print(url.text, nhref)
        profiles["names"].append(url.text.split("-")[0])
        profiles["roles"].append(url.text.split("-")[1])
        profiles["urls"].append(nhref)
Akshay Miterani - Software Engineer - Google - LinkedInin.linkedin.com › akshay-mite... https://in.linkedin.com/in/akshay-miterani-108827105
David Garry - Software Engineer - Google | LinkedInwww.linkedin.com › davidgar... https://www.linkedin.com/in/davidgarry1
Betty Chen - Software Engineer - Google | LinkedInwww.linkedin.com › bettyjxch... https://www.linkedin.com/in/bettyjxchen
Risab Manandhar - Software Engineer - Google - LinkedInwww.linkedin.com › risab-ma... https://www.linkedin.com/in/risab-manandhar
Hai Bi - Software Engineer - Google | LinkedInwww.linkedin.com › ... https://www.linkedin.com/in/hai-bi-b6a10010
Sabbir Yousuf Sanny - Software Engineer - Google | LinkedInwww.linkedin.com › ... https://www.linkedin.com/in/sabbir-yousuf-sanny-11aa7a21
Delia Lazarescu - Software Engineer - Google - LinkedInca.linkedin.com › delialazarescu https://ca.linkedin.com/in/delialazarescu
Shailee Patel - Software Engineer - Google | LinkedInwww.linkedin.com › shailee26 https://www.linkedin.com/in/shailee26
Sahil Gaba - Software Engineer - Google | LinkedInwww.linkedin.com › gabag26 https://www.linkedin.com/in/gabag26
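Splitting the href on "=" and "&" works for the links shown above, but it depends on the exact parameter order. A hedged alternative sketch parses the redirect link with urllib.parse instead. The helper names extract_target and split_title are hypothetical, and the sa and ved parameters in the sample link are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Return the destination URL inside a Google '/url?q=...' redirect link."""
    params = parse_qs(urlparse(href).query)
    return params.get("q", [None])[0]

def split_title(text):
    """Split 'Name - Role - Company ...' into (name, role), tolerating short titles."""
    parts = [p.strip() for p in text.split(" - ")]
    name = parts[0]
    role = parts[1] if len(parts) > 1 else None
    return name, role

# Sample values shaped like the output above (sa and ved are invented).
href = "/url?q=https://www.linkedin.com/in/davidgarry1&sa=U&ved=abc"
print(extract_target(href))
print(split_title("David Garry - Software Engineer - Google | LinkedIn"))
```

Splitting on " - " with surrounding spaces, rather than on a bare hyphen, also avoids cutting hyphenated names in half.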
Dataframe of the results
Dataframes make data analysis easy in pandas, so let's build one from the dictionary above.
pd.DataFrame(profiles)
 | names | urls | roles |
---|---|---|---|
0 | Akshay Miterani | https://in.linkedin.com/in/akshay-miterani-108... | Software Engineer |
1 | David Garry | https://www.linkedin.com/in/davidgarry1 | Software Engineer |
2 | Betty Chen | https://www.linkedin.com/in/bettyjxchen | Software Engineer |
3 | Risab Manandhar | https://www.linkedin.com/in/risab-manandhar | Software Engineer |
4 | Hai Bi | https://www.linkedin.com/in/hai-bi-b6a10010 | Software Engineer |
5 | Sabbir Yousuf Sanny | https://www.linkedin.com/in/sabbir-yousuf-sann... | Software Engineer |
6 | Delia Lazarescu | https://ca.linkedin.com/in/delialazarescu | Software Engineer |
7 | Shailee Patel | https://www.linkedin.com/in/shailee26 | Software Engineer |
8 | Sahil Gaba | https://www.linkedin.com/in/gabag26 | Software Engineer |
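The plan at the start of this post mentions storing the results in a file. A minimal sketch of that step with the standard csv module follows, assuming a dictionary of equal-length lists like profiles. The helper name profiles_to_csv and the sample row are only for illustration.

```python
import csv
import io

def profiles_to_csv(profiles, handle):
    """Write a dict of equal-length column lists to an open text handle as CSV."""
    columns = list(profiles)
    writer = csv.writer(handle)
    writer.writerow(columns)
    # zip(*...) turns the column lists into one tuple per row.
    writer.writerows(zip(*(profiles[c] for c in columns)))

# Tiny sample in the same shape as the profiles dictionary above.
sample = {
    "names": ["David Garry"],
    "urls": ["https://www.linkedin.com/in/davidgarry1"],
    "roles": ["Software Engineer"],
}
buffer = io.StringIO()
profiles_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open("profiles.csv", "w", newline="") instead of the in-memory buffer. With pandas, DataFrame.to_csv does the same job in one call.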
Pros and Cons of Using Google Search
- GET requests are easy to make, but Google may ask for a security check if we make too many of them, and at that point scraping fails.
- It is easy to find people because Google’s crawlers have already indexed results for our query, so we have to do very little to find the right information. But it can be tough to get more information by visiting the LinkedIn profile itself.
- Only with some luck can we view a person’s LinkedIn profile without logging in, so relying on Google Search to find LinkedIn profiles is not very fruitful.
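One way to reduce the chance of hitting that security check is to slow down and retry when the server pushes back. The sketch below assumes throttling is signalled with HTTP status 429, which is an assumption: a site may instead return a CAPTCHA page with a different status. The function name get_with_backoff and its parameters are hypothetical.

```python
import time

def get_with_backoff(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until the status code is not 429, waiting longer each time."""
    response = None
    for attempt in range(retries + 1):
        response = fetch(url)
        if response.status_code != 429:  # assumption: 429 means "too many requests"
            break
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
    return response
```

In this post's setting it could be called as get_with_backoff(requests.get, url). The sleep parameter exists only so the waiting can be replaced in tests.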
GitHub Search
In the part above, we scraped some LinkedIn profiles from Google Search, but we were unable to get people's portfolios. It is quite common among tech people to have a portfolio and a GitHub account. Let's use GitHub’s search to find people based on a keyword. Most people put their location, the company they work for, their Twitter handle, and their portfolio in their GitHub profile, and those are what we want to scrape.
The URL to get results is https://github.com/search?q=[QUERY]&type=users&p=[PAGE], where QUERY is the query we will search for, type is users, and p is the page number.
query="google engineer"
url = f"https://github.com/search?q={query}&type=users"
print(f"URL : {url}")
http = urllib3.PoolManager()
http.addheaders = [('User-agent', 'Mozilla/61.0')]
# web_page = http.request('GET',url)
web_page=requests.get(url)
soup = BS(web_page.content, 'html5lib')
pages = soup.find_all("em", class_="current")[0].get("data-total-pages")
max_page = 5
pages
URL : https://github.com/search?q=google engineer&type=users
'100'
In the result above, we made a GET request and received a web page. Upon inspecting the page in the browser, we can see its elements, such as div, a, and so on, where our desired information lives. In the same way, we searched for an em element with the class current, which has a data-total-pages attribute. Calling get on it returns the value. It seems that there are 100 pages of results.
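Indexing find_all(...)[0] raises an IndexError when the pagination element is missing, for example when a search returns a single page. A hedged sketch of a more forgiving lookup, run on a hand-written snippet that imitates the markup described above, is shown here. The helper name total_pages and the fallback value of 1 are assumptions, and the built-in html.parser is used so that html5lib is not required.

```python
from bs4 import BeautifulSoup

def total_pages(html):
    """Return data-total-pages of the current-page marker, or 1 if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    marker = soup.find("em", class_="current")
    if marker is None or marker.get("data-total-pages") is None:
        return 1
    return int(marker["data-total-pages"])

# Hand-written snippet imitating the pagination markup, not real GitHub output.
sample = '<div class="pagination"><em class="current" data-total-pages="100">1</em></div>'
print(total_pages(sample))
```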
Now we will loop over those pages to get each user's information, such as the name and profile URL.
github_profiles = {"name":[], "urls":[]}
if pages:
    pages=int(pages)
    print(f"Total Pages: {pages}")
    for page in range(1,pages):
        # Stop early so that we only scrape the first max_page pages.
        if page>max_page:
            break
        url = f"https://github.com/search?q={query}&type=users&p={page}"
        print(f"\n Current URL: {url} \n")
        http = urllib3.PoolManager()
        http.addheaders = [('User-agent', 'Mozilla/61.0')]
        web_page=requests.get(url)
        soup = BS(web_page.content, 'html5lib')
        for a in soup.find_all("a",class_="mr-1"):
            # href already starts with "/", hence the double slash in the printed URLs.
            gurl = "https://github.com/"+a.get("href")
            gname = a.text
            print(gname, gurl)
            github_profiles["name"].append(gname)
            github_profiles["urls"].append(gurl)
Total Pages: 100
Current URL: https://github.com/search?q=google engineer&type=users&p=1
Seth Vargo https://github.com//sethvargo
Kevin Naughton Jr. https://github.com//kdn251
Miguel Ángel Durán https://github.com//midudev
Jose Alcérreca https://github.com//JoseAlcerreca
Shubham Mathur https://github.com//googleknight
Dan Field https://github.com//dnfield
Nick Bourdakos https://github.com//bourdakos1
Mark https://github.com//MarkEdmondson1234
Current URL: https://github.com/search?q=google engineer&type=users&p=2
Pierfrancesco Soffritti https://github.com//PierfrancescoSoffritti
Parker Moore https://github.com//parkr
Gokmen Goksel https://github.com//gokmen
Justin Poehnelt https://github.com//jpoehnelt
Sanket Singh https://github.com//singhsanket143
Shanqing Cai https://github.com//caisq
Adam Silverstein https://github.com//adamsilverstein
Mizux https://github.com//Mizux
Valerii Iatsko https://github.com//viatsko
Zulkarnine Mahmud https://github.com//zulkarnine
Current URL: https://github.com/search?q=google engineer&type=users&p=3
Yacine Rezgui https://github.com//yrezgui
Zulkarnine Mahmud https://github.com//zulkarnine
Google https://github.com//Google987
Prateek Narang https://github.com//prateek27
Rakina Zata Amni https://github.com//rakina
Sriram Sundarraj https://github.com//ssundarraj
Irene Ros https://github.com//iros
Clément Mihailescu https://github.com//clementmihailescu
Márton Braun https://github.com//zsmb13
Andrey Kulikov https://github.com//andkulikov
Current URL: https://github.com/search?q=google engineer&type=users&p=4
Kate Lovett https://github.com//Piinks
Faisal Abid https://github.com//FaisalAbid
Viktor Turskyi https://github.com//koorchik
Milad Naseri https://github.com//mmnaseri
Rahul Ravikumar https://github.com//tikurahul
Robert Kubis https://github.com//hostirosti
Corey Lynch https://github.com//coreylynch
Emma Twersky https://github.com//twerske
Shivam Goyal https://github.com//ShivamGoyal1899
Abhinay Omkar https://github.com//abhiomkar
Current URL: https://github.com/search?q=google engineer&type=users&p=5
Marek Siarkowicz https://github.com//serathius
Mais Alheraki https://github.com//pr-Mais
Abhinay Omkar https://github.com//abhiomkar
Imaculate https://github.com//imaculate
Zhixun Tan https://github.com//phisiart
Jafer Haider https://github.com//itsjafer
Christie Wilson https://github.com//bobcatfish
Jason Feinstein https://github.com//jasonwyatt
Ryan Sepassi https://github.com//rsepassi
Nick Rout https://github.com//ricknout
In the example above, we looped over 5 pages and stored the results in the dictionary github_profiles.
Now we will open a person's profile and extract information such as the Twitter handle and portfolio. Let's select the last profile URL.
gurl
'https://github.com//ricknout'
Let's fetch that URL with requests and parse it with BeautifulSoup.
http = urllib3.PoolManager()
http.addheaders = [('User-agent', 'Mozilla/61.0')]
# web_page = http.request('GET',url)
web_page=requests.get(gurl)
soup = BS(web_page.content, 'html5lib')
# soup
Just like before, we look for the class that holds our information. For the headline, we can do the following.
headline = soup.find_all("div",class_="p-note user-profile-bio mb-3 js-user-profile-bio f4")[0].text
headline
'Android Developer Relations Engineer at Google 🇿🇦'
For the followers and following counts, we can do something like the following.
followers = soup.find_all("a",class_="Link--secondary no-underline no-wrap")[0].text.strip().split("\n")[0]
following = soup.find_all("a",class_="Link--secondary no-underline no-wrap")[1].text.strip().split("\n")[0]
followers,following
('510', '29')
For information like the Twitter handle and portfolio URL, we can do something like the following.
vcard = soup.find_all("ul",class_="vcard-details")[0].text
vcard = [v.strip() for v in vcard.strip().split("\n") if len(v.strip())>0]
vcard
['@google', 'Cape Town, South Africa', 'ricknout.dev', 'Twitter', '@ricknout']
But more easily, we can find this information using the itemprop attribute assigned to each list item.
vcard = soup.find_all("ul",class_="vcard-details")[0]
portfolio=None
home=None
work=None
twitter=None
for vc in vcard.find_all("li"):
    item=vc.get("itemprop")
    if item=="url":
        portfolio=vc.text.strip()
    if item=="homeLocation":
        home=vc.text.strip()
    if item=="worksFor":
        work=vc.text.strip()
    if item=="twitter":
        twitter=vc.text.strip()
portfolio,home,work,twitter
('ricknout.dev',
'Cape Town, South Africa',
'@google',
'Twitter\n\n @ricknout')
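The chain of if statements above can also be written as a lookup table from itemprop value to field name. The following is a minimal sketch on a hand-written snippet that imitates the vcard-details list. The names FIELDS and parse_vcard are hypothetical, the sample HTML is not real GitHub markup, and keeping only the last non-empty line of each item is an assumption made to drop the "Twitter" label seen in the output above.

```python
from bs4 import BeautifulSoup

# Maps the itemprop value on each <li> to the field name used in this post.
FIELDS = {"url": "portfolio", "homeLocation": "home", "worksFor": "work", "twitter": "twitter"}

def parse_vcard(html):
    """Return a dict of profile fields found in a vcard-details list."""
    details = dict.fromkeys(FIELDS.values())
    soup = BeautifulSoup(html, "html.parser")
    vcard = soup.find("ul", class_="vcard-details")
    if vcard is None:
        return details
    for li in vcard.find_all("li"):
        field = FIELDS.get(li.get("itemprop"))
        if field:
            # Keep only the last non-empty line, e.g. drop the "Twitter" label.
            lines = [t.strip() for t in li.get_text().splitlines() if t.strip()]
            details[field] = lines[-1] if lines else None
    return details

sample = """
<ul class="vcard-details">
  <li itemprop="worksFor">@google</li>
  <li itemprop="homeLocation">Cape Town, South Africa</li>
  <li itemprop="url">ricknout.dev</li>
  <li itemprop="twitter">Twitter
     @ricknout</li>
</ul>
"""
print(parse_vcard(sample))
```

Fields that are missing from a profile stay None, which matches how the variables are initialized above.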
Now let's combine the code above into a single script.
query="google engineer"
url = f"https://github.com/search?q={query}&type=users"
print(f"URL : {url}")
http = urllib3.PoolManager()
http.addheaders = [('User-agent', 'Mozilla/61.0')]
# web_page = http.request('GET',url)
web_page=requests.get(url)
soup = BS(web_page.content, 'html5lib')
pages = soup.find_all("em", class_="current")[0].get("data-total-pages")
max_page = 5
github_profiles = {"name":[], "urls":[], "portfolio":[],"headline":[],
                   "followers":[],"following":[],
                   "home":[], "work":[], "twitter":[]}
if pages:
    pages=int(pages)
    print(f"Total Pages: {pages}. Running upto {max_page}.")
    for page in range(1,pages):
        if page>max_page:
            break
        url = f"https://github.com/search?q={query}&type=users&p={page}"
        print(f"\n Current URL: {url} \n")
        http = urllib3.PoolManager()
        http.addheaders = [('User-agent', 'Mozilla/61.0')]
        web_page=requests.get(url)
        osoup = BS(web_page.content, 'html5lib')
        for a in osoup.find_all("a",class_="mr-1"):
            gurl = "https://github.com/"+a.get("href")
            gname = a.text
            print(f"Got: {gname}, {gurl}")
            github_profiles["name"].append(gname)
            github_profiles["urls"].append(gurl)
            http = urllib3.PoolManager()
            http.addheaders = [('User-agent', 'Mozilla/61.0')]
            web_page=requests.get(gurl)
            soup = BS(web_page.content, 'html5lib')
            headline = soup.find_all("div",class_="p-note user-profile-bio mb-3 js-user-profile-bio f4")[0].text
            github_profiles["headline"].append(headline)
            followers = soup.find_all("a",class_="Link--secondary no-underline no-wrap")[0].text.strip().split("\n")[0]
            following = soup.find_all("a",class_="Link--secondary no-underline no-wrap")[1].text.strip().split("\n")[0]
            github_profiles["followers"].append(followers)
            github_profiles["following"].append(following)
            vcard = soup.find_all("ul",class_="vcard-details")[0]
            portfolio=None
            home=None
            work=None
            twitter=None
            for vc in vcard.find_all("li"):
                item=vc.get("itemprop")
                if item=="url":
                    portfolio=vc.text.strip()
                if item=="homeLocation":
                    home=vc.text.strip()
                if item=="worksFor":
                    work=vc.text.strip()
                if item=="twitter":
                    twitter=vc.text.strip().split("\n")[-1].strip()
            github_profiles["portfolio"].append(portfolio)
            github_profiles["home"].append(home)
            github_profiles["work"].append(work)
            github_profiles["twitter"].append(twitter)
URL : https://github.com/search?q=google engineer&type=users
Total Pages: 100. Running upto 5.
Current URL: https://github.com/search?q=google engineer&type=users&p=1
Got: Seth Vargo, https://github.com//sethvargo
Got: Kevin Naughton Jr., https://github.com//kdn251
Got: Miguel Ángel Durán, https://github.com//midudev
Got: Jose Alcérreca, https://github.com//JoseAlcerreca
Got: Shubham Mathur, https://github.com//googleknight
Got: Nick Bourdakos, https://github.com//bourdakos1
Got: Mark, https://github.com//MarkEdmondson1234
Got: Dan Field, https://github.com//dnfield
Got: Pierfrancesco Soffritti, https://github.com//PierfrancescoSoffritti
Current URL: https://github.com/search?q=google engineer&type=users&p=2
Got: Parker Moore, https://github.com//parkr
Got: Gokmen Goksel, https://github.com//gokmen
Got: Sanket Singh, https://github.com//singhsanket143
Got: Justin Poehnelt, https://github.com//jpoehnelt
Got: Shanqing Cai, https://github.com//caisq
Got: Valerii Iatsko, https://github.com//viatsko
Got: Mizux, https://github.com//Mizux
Got: Gabriela D'Ávila Ferrara, https://github.com//gabidavila
Got: Zulkarnine Mahmud, https://github.com//zulkarnine
Current URL: https://github.com/search?q=google engineer&type=users&p=3
Got: Adam Silverstein, https://github.com//adamsilverstein
Got: Yacine Rezgui, https://github.com//yrezgui
Got: Google, https://github.com//Google987
Got: Prateek Narang, https://github.com//prateek27
Got: Rakina Zata Amni, https://github.com//rakina
Got: Clément Mihailescu, https://github.com//clementmihailescu
Got: Faisal Abid, https://github.com//FaisalAbid
Got: Kate Lovett, https://github.com//Piinks
Got: Andrey Kulikov, https://github.com//andkulikov
Got: Márton Braun, https://github.com//zsmb13
Current URL: https://github.com/search?q=google engineer&type=users&p=4
Got: Kate Lovett, https://github.com//Piinks
Got: Faisal Abid, https://github.com//FaisalAbid
Got: Viktor Turskyi, https://github.com//koorchik
Got: Milad Naseri, https://github.com//mmnaseri
Got: Rahul Ravikumar, https://github.com//tikurahul
Got: Robert Kubis, https://github.com//hostirosti
Got: Corey Lynch, https://github.com//coreylynch
Got: Emma Twersky, https://github.com//twerske
Got: Shivam Goyal, https://github.com//ShivamGoyal1899
Got: Abhinay Omkar, https://github.com//abhiomkar
Current URL: https://github.com/search?q=google engineer&type=users&p=5
Got: Marek Siarkowicz, https://github.com//serathius
Got: Greg Spencer, https://github.com//gspencergoog
Got: Mais Alheraki, https://github.com//pr-Mais
Got: Jafer Haider, https://github.com//itsjafer
Got: Zhixun Tan, https://github.com//phisiart
Got: Imaculate, https://github.com//imaculate
Got: Christie Wilson, https://github.com//bobcatfish
Got: Jason Feinstein, https://github.com//jasonwyatt
Got: Nick Rout, https://github.com//ricknout
Got: Joe Stanton, https://github.com//JoeStanton
Turn Result into Dataframe
Tabular data is easier to work with for data analysis, so let's convert the dictionary above into a dataframe.
df = pd.DataFrame(github_profiles)
df
 | name | urls | portfolio | headline | followers | following | home | work | twitter |
---|---|---|---|---|---|---|---|---|---|
0 | Seth Vargo | https://github.com//sethvargo | https://www.sethvargo.com | Engineer @google | 3.2k | 5 | Pittsburgh, PA | @sethvargo | |
1 | Kevin Naughton Jr. | https://github.com//kdn251 | youtube.com/kevinnaughtonjr | Software Engineer @google | 3.8k | 9 | New York, New York | @kevinnaughtonjr | |
2 | Miguel Ángel Durán | https://github.com//midudev | https://midu.dev | Software Engineer\n\nGitHub Star 🌟\nGoogle Dev... | 6.3k | 10 | Barcelona | @AdevintaSpain | @midudev |
3 | Jose Alcérreca | https://github.com//JoseAlcerreca | twitter.com/ppvi | Android Developer Relations Engineer @ Google | 2.3k | 0 | Madrid, Spain | @ppvi | |
4 | Shubham Mathur | https://github.com//googleknight | https://googleknight.github.io | Software engineer II @ MDL Bangalore\n | 29 | 40 | Bangalore, India | Mckinsey & Company | None |
5 | Nick Bourdakos | https://github.com//bourdakos1 | None | Software Engineer @google | 480 | 8 | New York City | None | |
6 | Mark | https://github.com//MarkEdmondson1234 | https://code.markedmondson.me/ | Data Engineer @iihnordic \nGoogle Developer E... | 783 | 117 | Copenhagen | @iihnordic | @HoloMarkeD |
7 | Dan Field | https://github.com//dnfield | None | Software Engineer @google for @flutter | 928 | 0 | None | None | |
8 | Pierfrancesco Soffritti | https://github.com//PierfrancescoSoffritti | https://pierfrancescosoffritti.com/ | Software engineer @google | 566 | 40 | London, UK | @psoffritti | |
9 | Parker Moore | https://github.com//parkr | https://byparker.com | 🍩 🌎 Senior Engineer. Currently: @google. Forme... | 1.3k | 316 | USA | None | |
10 | Gokmen Goksel | https://github.com//gokmen | None | Software Engineer @google | 396 | 58 | San Francisco | @gokmen | |
11 | Sanket Singh | https://github.com//singhsanket143 | None | SDE @google | SDE @linkedin | Google Summer Of... | 1.7k | 13 | India | None | |
12 | Justin Poehnelt | https://github.com//jpoehnelt | https://justin.poehnelt.com | @google, @googleworkspace Developer Relations... | 307 | 9 | United States | @jpoehnelt | |
13 | Shanqing Cai | https://github.com//caisq | https://caisq.github.io/ | Software Engineer @ Google Research | 343 | 59 | None | None | |
14 | Valerii Iatsko | https://github.com//viatsko | None | UI Engineer @ Google | 346 | 55 | None | None | |
15 | Mizux | https://github.com//Mizux | http://www.mizux.net | OSS Release Engineer @google | 167 | 81 | Tours, France | None | |
16 | Gabriela D'Ávila Ferrara | https://github.com//gabidavila | https://gabi.dev | Developer Relations Engineer @google | 247 | 25 | New Jersey | @google @GoogleCloudPlatform | @gabidavila |
17 | Zulkarnine Mahmud | https://github.com//zulkarnine | www.zulkarnine.com | Software Engineer at Google | 564 | 0 | None | None | |
18 | Adam Silverstein | https://github.com//adamsilverstein | http://www.earthbound.com | Developer Relations Engineer @ Google | 153 | 6 | Colorado, USA | None | |
19 | Yacine Rezgui | https://github.com//yrezgui | https://yrezgui.com | Creative software engineer.\nDeveloper advocat... | 513 | 143 | London, UK | None | |
20 | Google | https://github.com//Google987 | youtube.com/alittlecoding | Software Engineer\nYoutube: a little coding | 13 | 11 | India | None | @arif_decrypted |
21 | Prateek Narang | https://github.com//prateek27 | www.prateeknarang.com | Software Engineer-III at Google, Udemy Instruc... | 2.4k | 4 | Hyderabad | Google India | None |
22 | Rakina Zata Amni | https://github.com//rakina | None | Software Engineer @google @chromium 🇮🇩 🇯🇵👩💻 | 307 | 22 | Tokyo, Japan | None | |
23 | Clément Mihailescu | https://github.com//clementmihailescu | algoexpert.io/clem | Co-Founder & CEO, AlgoExpert | Ex-Google & Ex-... | 7.5k | 2 | None | AlgoExpert | @clemmihai |
24 | Faisal Abid | https://github.com//FaisalAbid | http://www.FaisalAbid.com | @google Developer Expert, Entrepreneur, and En... | 505 | 30 | Toronto | @eirene-cremations @bitstrapped @Shopistry | @FaisalAbid |
25 | Kate Lovett | https://github.com//Piinks | None | Software Engineer at @google for @flutter | 745 | 10 | Nashville, TN | @k8lovett | |
26 | Andrey Kulikov | https://github.com//andkulikov | http://linkedin.com/in/andkulikov/ | Software Engineer at Google | 378 | 0 | London | None | |
27 | Márton Braun | https://github.com//zsmb13 | https://zsmb.co/ | Android Developer Relations Engineer @google, ... | 418 | 7 | Budapest, Hungary | @zsmb13 | |
28 | Kate Lovett | https://github.com//Piinks | None | Software Engineer at @google for @flutter | 745 | 10 | Nashville, TN | @k8lovett | |
29 | Faisal Abid | https://github.com//FaisalAbid | http://www.FaisalAbid.com | @google Developer Expert, Entrepreneur, and En... | 505 | 30 | Toronto | @eirene-cremations @bitstrapped @Shopistry | @FaisalAbid |
df.shape
(48, 9)
We were able to scrape about 50 profiles. They contain rich information such as each person's portfolio, profile headline, and Twitter handle.
A little more cleaning is needed too, for example on the followers and following counts.
df["followers"] = df.followers.apply(lambda x: 1000*float(x.replace("k","")) if "k" in x else float(x))
df.head()
 | name | urls | portfolio | headline | followers | following | home | work | twitter |
---|---|---|---|---|---|---|---|---|---|
0 | Seth Vargo | https://github.com//sethvargo | https://www.sethvargo.com | Engineer @google | 3200.0 | 5 | Pittsburgh, PA | @sethvargo | |
1 | Kevin Naughton Jr. | https://github.com//kdn251 | youtube.com/kevinnaughtonjr | Software Engineer @google | 3800.0 | 9 | New York, New York | @kevinnaughtonjr | |
2 | Miguel Ángel Durán | https://github.com//midudev | https://midu.dev | Software Engineer\n\nGitHub Star 🌟\nGoogle Dev... | 6300.0 | 10 | Barcelona | @AdevintaSpain | @midudev |
3 | Jose Alcérreca | https://github.com//JoseAlcerreca | twitter.com/ppvi | Android Developer Relations Engineer @ Google | 2300.0 | 0 | Madrid, Spain | @ppvi | |
4 | Shubham Mathur | https://github.com//googleknight | https://googleknight.github.io | Software engineer II @ MDL Bangalore\n | 29.0 | 40 | Bangalore, India | Mckinsey & Company | None |
df["following"] = df.following.apply(lambda x: 1000*float(x.replace("k","")) if "k" in x else float(x))
df.head()
 | name | urls | portfolio | headline | followers | following | home | work | twitter |
---|---|---|---|---|---|---|---|---|---|
0 | Seth Vargo | https://github.com//sethvargo | https://www.sethvargo.com | Engineer @google | 3200.0 | 5.0 | Pittsburgh, PA | @sethvargo | |
1 | Kevin Naughton Jr. | https://github.com//kdn251 | youtube.com/kevinnaughtonjr | Software Engineer @google | 3800.0 | 9.0 | New York, New York | @kevinnaughtonjr | |
2 | Miguel Ángel Durán | https://github.com//midudev | https://midu.dev | Software Engineer\n\nGitHub Star 🌟\nGoogle Dev... | 6300.0 | 10.0 | Barcelona | @AdevintaSpain | @midudev |
3 | Jose Alcérreca | https://github.com//JoseAlcerreca | twitter.com/ppvi | Android Developer Relations Engineer @ Google | 2300.0 | 0.0 | Madrid, Spain | @ppvi | |
4 | Shubham Mathur | https://github.com//googleknight | https://googleknight.github.io | Software engineer II @ MDL Bangalore\n | 29.0 | 40.0 | Bangalore, India | Mckinsey & Company | None |
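The two lambdas above repeat the same conversion, so it can be pulled into one small function. This is a sketch only: the name parse_count is hypothetical, and support for an "m" suffix and for thousands separators is an assumption, since the scraped data above only shows plain numbers and the "k" suffix.

```python
def parse_count(text):
    """Convert a count string such as '3.2k' or '29' to an integer."""
    text = text.strip().lower().replace(",", "")
    # Assumption: "m" may appear for millions; only "k" is seen in the data above.
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids float artifacts when scaling values like 3.2.
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(float(text))

for raw in ["3.2k", "29", "1.3k"]:
    print(raw, parse_count(raw))
```

Applied to the raw string columns before any conversion, this would look like df["followers"].apply(parse_count).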
Plotting Followers
We will use Seaborn, a library built on top of Matplotlib.
import seaborn as sns
sns.set()
df.followers.hist()
<AxesSubplot:>
It seems that most people have relatively few followers.
Finding Contact Details
If a profile has a portfolio, there is a high chance that the portfolio has a contact page too. So again, we can scrape that portfolio and collect such information.
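As a rough sketch of that idea, the snippet below pulls email-like strings out of a page's HTML with a regular expression. The name find_emails, the pattern, and the sample HTML are all illustrative assumptions. The pattern is deliberately simple and will miss obfuscated addresses such as "name at example dot com".

```python
import re

# A simple, permissive pattern for email-like strings (not a full validator).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(html):
    """Return the unique email-like strings in a page, in order of appearance."""
    seen = []
    for match in EMAIL_RE.findall(html):
        if match not in seen:
            seen.append(match)
    return seen

# Made-up contact page fragment using reserved example domains.
sample = '<a href="mailto:hello@example.com">Email me</a> or write to hello@example.com / jobs@example.org'
print(find_emails(sample))
```

The HTML passed in would come from fetching the portfolio URL the same way the profile pages were fetched above.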
Pros and Cons of Using GitHub Search
- It is quite easy to find people in tech based on their skill set, but it is not possible to find people who work in tech and do not have a GitHub profile.
- Getting contact details is only possible if a person has a portfolio and that portfolio lists them. Either way, it is easier than finding the information from a LinkedIn profile.
- Sometimes results may not be shown once GitHub suspects something is wrong with our requests.