WordCloud in Python: Text Analysis and Twitter Data Visualization
A WordCloud in Python is a simple and useful way to visualize the most frequent words in a text dataset. The more often a word appears, the larger it becomes in the image. This makes WordClouds helpful for quick text analysis, social media analysis, tweet analysis, and exploratory NLP projects.
In this post, I will create WordCloud visualizations from Twitter data related to the World Cup. The tweets were collected using keywords such as:
["worldcup", "world cup", "wcup", "football", "qatar worldcup prediction"]
I will show how to:
- install and import the
wordcloudpackage - read tweet data from a CSV file
- create a simple WordCloud
- generate a WordCloud from Twitter user bios
- clean noisy text such as URLs and mentions
- create a WordCloud from tweet text
You can also read my earlier posts about scraping tweets with Tweepy and World Cup tweet sentiment analysis in Python.
What Is a WordCloud?
A WordCloud is an image where words are shown with different sizes. Usually, the most frequent words are shown larger, while less frequent words are shown smaller.
For example, if the word football appears many times in a tweet dataset, it will be larger in the WordCloud. This gives us a quick visual summary of the most common words in the text.
WordClouds are not a complete text analysis method, but they are useful for quick exploration and presentation.
Install WordCloud in Python
The easiest way to create a WordCloud in Python is to use the wordcloud package.
You can install it with pip:
!pip install wordcloud
For a normal Python environment, you can also run:
pip install wordcloud
Import Required Packages
First, import the required Python packages.
import re
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
sns.set()
Read the Tweet CSV File
Now, let’s read the CSV file that contains the tweet data.
df = pd.read_csv("en1670670038.8448372.csv")
df.head()
For this WordCloud tutorial, I mainly use two columns:
df[["text", "bio"]].head()
The text column contains the tweet text. The bio column contains the user profile bio.
Create a Simple WordCloud in Python
Before using the tweet dataset, let’s create a simple WordCloud from sample text.
txt = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla fringilla ex nec massa
sollicitudin, et condimentum mi vehicula. Integer enim urna, pellentesque a augue sed,
malesuada ornare enim. Integer at ullamcorper tellus. Cras condimentum orci ac enim
egestas, nec elementum dolor varius.
Suspendisse vel vestibulum lorem, vel aliquam justo. Praesent hendrerit, est et
lobortis condimentum, elit augue bibendum velit, sed volutpat purus tortor maximus nisi.
"""
wc = WordCloud(
max_words=500,
width=1000,
height=500
)
wcimg = wc.generate(txt)
plt.figure(figsize=(15, 10))
plt.imshow(wcimg)
plt.axis("off")
plt.title("WordCloud Test")
plt.show()

Customize WordCloud Background Color
The WordCloud class accepts many parameters. For example, we can change the background color.
wc = WordCloud(
max_words=500,
width=1000,
height=500,
background_color="red"
)
wcimg = wc.generate(txt)
plt.figure(figsize=(15, 10))
plt.imshow(wcimg)
plt.axis("off")
plt.title("WordCloud with Red Background")
plt.show()

WordCloud from Twitter User Bios
Now, let’s create a WordCloud from the bio column of the Twitter dataset.
bio_text = " ".join(df[df.bio.isna() == False].bio)
wc = WordCloud(
max_words=1000,
width=1600,
height=800,
collocations=False
).generate(bio_text)
plt.figure(figsize=(15, 10))
plt.imshow(wc)
plt.axis("off")
plt.title("WordCloud of Twitter User Bios")
plt.show()

Clean Text Before Creating a WordCloud
Text from social media usually contains noise such as URLs, mentions, hashtags, emojis, punctuation, and extra spaces.
def remove_noise(text):
"""Clean tweet text or user bio text."""
text = str(text)
text = re.sub(r"http\S+|www\S+|https\S+", " ", text)
text = re.sub(r"@\w+", " ", text)
text = re.sub(r"#", "", text)
text = re.sub(r"[^A-Za-z\s]", " ", text)
text = " ".join(text.split())
return text
Cleaned WordCloud from Twitter User Bios
custom_stopwords = STOPWORDS.union({
"https",
"http",
"co",
"amp",
"RT"
})
clean_bio_text = " ".join(
df[df.bio.isna() == False].bio.apply(remove_noise)
)
wc = WordCloud(
max_words=1000,
width=1600,
height=800,
collocations=False,
stopwords=custom_stopwords,
background_color="white"
).generate(clean_bio_text)
plt.figure(figsize=(15, 10))
plt.imshow(wc)
plt.axis("off")
plt.title("Cleaned WordCloud of Twitter User Bios")
plt.show()

WordCloud from Tweet Text
Now, let’s create a WordCloud from the actual tweet text.
clean_tweet_text = " ".join(
df[df.text.isna() == False].text.apply(remove_noise)
)
wc = WordCloud(
max_words=1000,
width=1600,
height=800,
collocations=False,
stopwords=custom_stopwords,
background_color="white"
).generate(clean_tweet_text)
plt.figure(figsize=(15, 10))
plt.imshow(wc)
plt.axis("off")
plt.title("WordCloud of Tweet Text")
plt.show()

From the WordCloud, we can clearly see that World Cup related words appear very often. This makes sense because the tweets were collected using World Cup keywords.
Display WordCloud as an Image
The generated WordCloud can also be shown directly as an image.
wc.to_image()

Optional: Save the WordCloud Image
wc.to_file("worldcup_tweet_wordcloud.png")
Final Thoughts
In this post, I showed how to create a WordCloud in Python using Twitter data. We created a simple WordCloud, generated WordClouds from Twitter bios and tweet text, and cleaned noisy text before visualization.
The most important step is text cleaning. Without cleaning, words like https, co, and random symbols can dominate the WordCloud. After cleaning, the visualization becomes more useful and easier to understand.
WordClouds are simple, but they are a good starting point for text analysis and NLP projects. They help us quickly see the most common words in a dataset before doing deeper analysis such as sentiment analysis, topic modeling, or classification.
Comments