Text Analysis with WordCloud in Python
WordCloud in Python can be done in different ways but one of the most popular and easier ones is using the package wordcloud
. We can install it using the following way.
!pip install wordcloud
Requirement already satisfied: wordcloud in c:\programdata\anaconda3\lib\site-packages (1.8.1)
Requirement already satisfied: pillow in c:\programdata\anaconda3\lib\site-packages (from wordcloud) (8.0.1)
Requirement already satisfied: numpy>=1.6.1 in c:\programdata\anaconda3\lib\site-packages (from wordcloud) (1.19.2)
Requirement already satisfied: matplotlib in c:\users\viper\appdata\roaming\python\python38\site-packages (from wordcloud) (3.5.3)
Requirement already satisfied: pyparsing>=2.2.1 in c:\users\viper\appdata\roaming\python\python38\site-packages (from matplotlib->wordcloud) (3.0.9)
Requirement already satisfied: fonttools>=4.22.0 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->wordcloud) (4.37.1)
Requirement already satisfied: cycler>=0.10 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->wordcloud) (0.10.0)
Requirement already satisfied: python-dateutil>=2.7 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->wordcloud) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->wordcloud) (1.3.0)
Requirement already satisfied: packaging>=20.0 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->wordcloud) (20.4)
Requirement already satisfied: six in c:\programdata\anaconda3\lib\site-packages (from cycler>=0.10->matplotlib->wordcloud) (1.15.0)
WordCloud simply is the words scattered in an image and the word’s size differs based on different properties. Here in this blog, I will plot a word cloud for based on a tweet created with keywords ['worldcup', 'world cup', 'wcup', 'football', 'qatar worldcup prediction']
. You can read about how to scrape tweets from this blog and how to perform sentiment analysis from this blog.
Importing Packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import re
sns.set()
Reading CSV File
Let’s read our CSV file containing tweets.
df = pd.read_csv('en1670670038.8448372.csv')
df.head()
id | tweet_created_at | text | user | bio | location | hashtags | user_mentions | in_reply | protected | ... | verified | statuses_count | coordinates | is_quote_status | retweet_count | retweeted | lang | source | place | kwd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1601532325241950209 | 2022-12-10 11:00:36+00:00 | @mcbenwell @TheTotallyShow Unsurprisingly, thi... | DrLouiseClare1 | Historian looking at Argentine, British and US... | NaN | [] | 2 | 1.601523e+18 | False | ... | False | 1187 | NaN | False | 0 | False | en | Twitter for iPhone | NaN | w |
1 | 1601532314613334018 | 2022-12-10 11:00:33+00:00 | All the best to England playing in the quarter... | bookajet | Enjoy freedom without responsibility and let y... | Farnborough Airport | [{'text': 'worldcup', 'indices': [61, 70]}, {'... | 0 | NaN | False | ... | False | 959 | NaN | False | 0 | False | en | Hootsuite Inc. | NaN | w |
2 | 1601532312696811520 | 2022-12-10 11:00:33+00:00 | 1/It's Matchday⚽️\r\n\r\nShow support to your ... | 0xNeverWinn | @GaHunter688 suspended | NaN | [{'text': 'worldcup', 'indices': [61, 70]}, {'... | 1 | NaN | False | ... | False | 8442 | NaN | False | 4472 | False | en | Twitter Web App | NaN | w |
3 | 1601532303762677762 | 2022-12-10 11:00:30+00:00 | Good Luck England ⚽⚽⚽\r\n #Itscominghome #Worl... | 3LionsOnMaShirt | Sharing the latest #ThreeLions news and fan ta... | Manchester, England | [{'text': 'Itscominghome', 'indices': [40, 54]... | 1 | NaN | False | ... | False | 157 | NaN | False | 1 | False | en | VillaBotMan | NaN | w |
4 | 1601532286507552768 | 2022-12-10 11:00:26+00:00 | Guess the Quarter Final Winners ⚽️🥂\r\n\r\nThe... | EdehRonald | crypto enthusiast/trader | NaN | [{'text': 'WorldcupQatar2022', 'indices': [64,... | 1 | NaN | False | ... | False | 92 | NaN | False | 24 | False | en | Twitter for Android | NaN | w |
5 rows × 26 columns
Two columns, text and bio can be used to plot a word cloud.
Simple WordCloud
Let’s plot a simple WordCloud of the following text:
txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla fringilla ex nec massa sollicitudin, et condimentum mi vehicula. Integer enim urna, pellentesque a augue sed, malesuada ornare enim. Integer at ullamcorper tellus. Cras condimentum orci ac enim egestas, nec elementum dolor varius. Vestibulum molestie magna vel sapien tristique dictum. Nam auctor vitae enim vitae lacinia. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus mollis in est vitae dictum. Duis et mauris dui. Etiam aliquam in leo vitae placerat. Cras tincidunt neque id lectus tincidunt accumsan. Donec ut dignissim mi, at consequat elit.
Suspendisse vel vestibulum lorem, vel aliquam justo. Praesent hendrerit, est et lobortis condimentum, elit augue bibendum velit, sed volutpat purus tortor maximus nisi. In sed volutpat lectus. Aenean at turpis vel nisl egestas mollis at sit amet dolor. Nullam semper dapibus orci, facilisis tempor nisl volutpat consectetur. Curabitur elit est, vehicula venenatis interdum at, suscipit et magna. Vestibulum a pretium felis. Curabitur tristique euismod laoreet. Aliquam erat volutpat. Sed luctus nulla sed posuere mattis. Vivamus ligula turpis, sollicitudin non rutrum non, consequat sodales diam. Donec dapibus nec ligula eu tincidunt. Maecenas risus massa, malesuada eu lorem a, fringilla imperdiet leo.
"""
wc = WordCloud(max_words=500, width=1000, height=500)
wcimg=wc.generate(txt)
plt.figure(figsize=(15,10))
plt.imshow(wcimg)
plt.title('WordCloud Test')
plt.show()
WordCloud accepts some parameters which can be found in docstring as well.
txt = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla fringilla ex nec massa sollicitudin, et condimentum mi vehicula. Integer enim urna, pellentesque a augue sed, malesuada ornare enim. Integer at ullamcorper tellus. Cras condimentum orci ac enim egestas, nec elementum dolor varius. Vestibulum molestie magna vel sapien tristique dictum. Nam auctor vitae enim vitae lacinia. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus mollis in est vitae dictum. Duis et mauris dui. Etiam aliquam in leo vitae placerat. Cras tincidunt neque id lectus tincidunt accumsan. Donec ut dignissim mi, at consequat elit.
Suspendisse vel vestibulum lorem, vel aliquam justo. Praesent hendrerit, est et lobortis condimentum, elit augue bibendum velit, sed volutpat purus tortor maximus nisi. In sed volutpat lectus. Aenean at turpis vel nisl egestas mollis at sit amet dolor. Nullam semper dapibus orci, facilisis tempor nisl volutpat consectetur. Curabitur elit est, vehicula venenatis interdum at, suscipit et magna. Vestibulum a pretium felis. Curabitur tristique euismod laoreet. Aliquam erat volutpat. Sed luctus nulla sed posuere mattis. Vivamus ligula turpis, sollicitudin non rutrum non, consequat sodales diam. Donec dapibus nec ligula eu tincidunt. Maecenas risus massa, malesuada eu lorem a, fringilla imperdiet leo.
"""
wc = WordCloud(max_words=500, width=1000, height=500, background_color='red')
wcimg=wc.generate(txt)
plt.figure(figsize=(15,10))
plt.imshow(wcimg)
plt.title('WordCloud Test')
plt.show()
WordCloud of Bio
We will plot WordCloud of only those bios where there is actually a text!
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800,
collocations=False).generate(" ".join(df[df.bio.isna()==False].bio))
plt.figure(figsize=(15,10))
plt.imshow(wc)
plt.title('WordCloud of Bio')
plt.show()
Looking over the above WordCloud, we can see that the word https
is also there, it is a noise and we need to clear it.
Clearing Noise
def remove_noise(tweet):
'''
To remove noise
'''
return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800,
collocations=False).generate(" ".join(df[df.bio.isna()==False].bio.apply(remove_noise)))
plt.figure(figsize=(15,10))
plt.imshow(wc)
plt.title('WordCloud of Bio')
plt.show()
It worked!
WordCloud of Tweet Text
Let’s plot WordCloud in Python for our tweet’s text.
def remove_noise(tweet):
'''
To remove noise
'''
return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
wc = WordCloud(max_words = 1000 , width = 1600 , height = 800,
collocations=False).generate(" ".join(df[df.text.isna()==False].text.apply(remove_noise)))
plt.figure(figsize=(15,10))
plt.imshow(wc)
plt.title('WordCloud of Text')
plt.show()
It’s obvious that our most repeated word is the world cup!
And we could simply plot like below.
wc.to_image()
If we want to plot wordcloud in python but with different font of the text, we can pass the font file and plot as well.
Comments