WordCloud in Python: Text Analysis and Twitter Data Visualization

A WordCloud in Python is a simple and useful way to visualize the most frequent words in a text dataset. The more often a word appears, the larger it becomes in the image. This makes WordClouds helpful for quick text analysis, social media analysis, tweet analysis, and exploratory NLP projects.

In this post, I will create WordCloud visualizations from Twitter data related to the World Cup. The tweets were collected using keywords such as:

["worldcup", "world cup", "wcup", "football", "qatar worldcup prediction"]

I will show how to:

  • install and import the wordcloud package
  • read tweet data from a CSV file
  • create a simple WordCloud
  • generate a WordCloud from Twitter user bios
  • clean noisy text such as URLs and mentions
  • create a WordCloud from tweet text

You can also read my earlier posts about scraping tweets with Tweepy and World Cup tweet sentiment analysis in Python.

What Is a WordCloud?

A WordCloud is an image where words are shown with different sizes. Usually, the most frequent words are shown larger, while less frequent words are shown smaller.

For example, if the word football appears many times in a tweet dataset, it will be larger in the WordCloud. This gives us a quick visual summary of the most common words in the text.

WordClouds are not a complete text analysis method, but they are useful for quick exploration and presentation.
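
To make the link between frequency and size concrete, here is a minimal sketch that counts words in a made-up sentence and scales each count by the largest one. It illustrates the idea only; the sample text and the variable names are assumptions, and the wordcloud package adds its own tokenizing and layout logic on top.

```python
from collections import Counter

# Made-up sample text, not taken from the tweet dataset.
sample = "football worldcup football qatar football worldcup goal"

# Count how often each word occurs.
counts = Counter(sample.split())

# Scale each count by the largest count, so the most frequent word gets 1.0.
top = max(counts.values())
relative = {word: n / top for word, n in counts.items()}

for word, weight in sorted(relative.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} count={counts[word]} relative={weight:.2f}")
```

A word with a relative weight close to 1.0 would be drawn largest, and words with small weights would be drawn small or dropped.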

Install WordCloud in Python

The easiest way to create a WordCloud in Python is to use the wordcloud package.

In a Jupyter or Colab notebook, install it with pip from a cell:

!pip install wordcloud

From a regular terminal, drop the leading exclamation mark:

pip install wordcloud
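
After installing, it can help to confirm that the packages used in this post are importable before running the rest of the code. The sketch below uses the standard library's importlib for that; the helper name is_installed is my own and not part of any of these packages.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


# Packages imported later in this post.
for name in ["wordcloud", "pandas", "seaborn", "matplotlib"]:
    status = "ok" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Any package reported as missing can be installed with pip in the same way as wordcloud.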

Import Required Packages

First, import the required Python packages.

import re

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

sns.set()

Read the Tweet CSV File

Now, let’s read the CSV file that contains the tweet data.

df = pd.read_csv("en1670670038.8448372.csv")
df.head()

For this WordCloud tutorial, I mainly use two columns:

df[["text", "bio"]].head()

The text column contains the tweet text. The bio column contains the user profile bio.

Create a Simple WordCloud in Python

Before using the tweet dataset, let’s create a simple WordCloud from sample text.

txt = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla fringilla ex nec massa
sollicitudin, et condimentum mi vehicula. Integer enim urna, pellentesque a augue sed,
malesuada ornare enim. Integer at ullamcorper tellus. Cras condimentum orci ac enim
egestas, nec elementum dolor varius.

Suspendisse vel vestibulum lorem, vel aliquam justo. Praesent hendrerit, est et
lobortis condimentum, elit augue bibendum velit, sed volutpat purus tortor maximus nisi.
"""

wc = WordCloud(
    max_words=500,
    width=1000,
    height=500
)

wcimg = wc.generate(txt)

plt.figure(figsize=(15, 10))
plt.imshow(wcimg)
plt.axis("off")
plt.title("WordCloud Test")
plt.show()

Simple WordCloud example created in Python

Customize WordCloud Background Color

The WordCloud class accepts many parameters. For example, we can change the background color.

wc = WordCloud(
    max_words=500,
    width=1000,
    height=500,
    background_color="red"
)

wcimg = wc.generate(txt)

plt.figure(figsize=(15, 10))
plt.imshow(wcimg)
plt.axis("off")
plt.title("WordCloud with Red Background")
plt.show()

WordCloud with red background color in Python

WordCloud from Twitter User Bios

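Now, let’s create a WordCloud from the bio column of the Twitter dataset. Rows with a missing bio are skipped. Setting collocations=False stops the cloud from also counting two-word phrases (bigrams), so a pair like "world cup" does not show up next to its individual words.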

# Join all non-missing bios into one long string.
bio_text = " ".join(df["bio"].dropna().astype(str))

wc = WordCloud(
    max_words=1000,
    width=1600,
    height=800,
    collocations=False
).generate(bio_text)

plt.figure(figsize=(15, 10))
plt.imshow(wc)
plt.axis("off")
plt.title("WordCloud of Twitter User Bios")
plt.show()

WordCloud generated from Twitter user bios

Clean Text Before Creating a WordCloud

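Text from social media usually contains noise such as URLs, mentions, hashtags, emojis, punctuation, and extra spaces. The remove_noise function below strips URLs and mentions, drops the # sign while keeping the hashtag word, and replaces every character that is not an ASCII letter or whitespace (digits, punctuation, emojis, accented or non-Latin letters) with a space.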

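def remove_noise(text):
    """Clean tweet text or user bio text."""
    text = str(text)

    # Remove URLs (http\S+ already covers https links).
    text = re.sub(r"http\S+|www\S+", " ", text)
    # Remove @mentions.
    text = re.sub(r"@\w+", " ", text)
    # Drop the hash sign but keep the hashtag word.
    text = re.sub(r"#", "", text)
    # Replace anything that is not an ASCII letter or whitespace.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse repeated whitespace.
    text = " ".join(text.split())

    return text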
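
To see what these steps do in practice, here is a small self-contained check on a made-up tweet (the sample string is my own, not a row from the dataset). The function is repeated in compact form only so that the snippet runs on its own.

```python
import re


def remove_noise(text):
    """Compact copy of the cleaning steps, repeated so this runs alone."""
    text = str(text)
    text = re.sub(r"http\S+|www\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)             # mentions
    text = re.sub(r"#", "", text)                 # hash sign only
    text = re.sub(r"[^A-Za-z\s]", " ", text)      # non-letters
    return " ".join(text.split())                 # extra spaces


# Made-up tweet for illustration.
sample_tweet = "RT @fan_account: What a match!! #WorldCup2022 https://t.co/abc123 ⚽ 3-1"
cleaned = remove_noise(sample_tweet)
print(cleaned)  # → RT What a match WorldCup
```

Two things are worth noticing: digits are removed, so "WorldCup2022" becomes "WorldCup", and "RT" survives the cleaning, which is why it is handled separately as a stopword.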

Cleaned WordCloud from Twitter User Bios

Next, extend the package's built-in STOPWORDS set with tokens that are common in tweets but carry no meaning, and apply remove_noise to every bio before joining them into one string.

custom_stopwords = STOPWORDS.union({
    "https",
    "http",
    "co",
    "amp",
    "RT"
})

clean_bio_text = " ".join(
    df["bio"].dropna().apply(remove_noise)
)

wc = WordCloud(
    max_words=1000,
    width=1600,
    height=800,
    collocations=False,
    stopwords=custom_stopwords,
    background_color="white"
).generate(clean_bio_text)

plt.figure(figsize=(15, 10))
plt.imshow(wc)
plt.axis("off")
plt.title("Cleaned WordCloud of Twitter User Bios")
plt.show()

Cleaned WordCloud generated from Twitter user bios

WordCloud from Tweet Text

Now, let’s create a WordCloud from the actual tweet text.

clean_tweet_text = " ".join(
    df["text"].dropna().apply(remove_noise)
)

wc = WordCloud(
    max_words=1000,
    width=1600,
    height=800,
    collocations=False,
    stopwords=custom_stopwords,
    background_color="white"
).generate(clean_tweet_text)

plt.figure(figsize=(15, 10))
plt.imshow(wc)
plt.axis("off")
plt.title("WordCloud of Tweet Text")
plt.show()

WordCloud generated from World Cup tweet text using Python

From the WordCloud, we can clearly see that World Cup related words appear very often. This makes sense because the tweets were collected using World Cup keywords.
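
A visual impression like this can be cross-checked with plain counts. The sketch below is one way to do that with the standard library; the short cleaned_text string and the small stopwords set are hypothetical stand-ins for clean_tweet_text and custom_stopwords, so the snippet runs without the dataset.

```python
from collections import Counter

# Hypothetical stand-ins for clean_tweet_text and custom_stopwords.
cleaned_text = (
    "RT the World Cup final was the best World Cup match "
    "RT football fans love the cup"
)
stopwords = {"rt", "the", "was"}

# Lowercase the tokens and drop stopwords before counting.
tokens = [w.lower() for w in cleaned_text.split()]
counts = Counter(w for w in tokens if w not in stopwords)

# Print the five most frequent remaining words.
for word, n in counts.most_common(5):
    print(word, n)
```

The numbers will not match the WordCloud exactly, since the package applies its own tokenizing and plural handling, but the top words should broadly agree.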

Display WordCloud as an Image

In a Jupyter notebook, the generated WordCloud can also be displayed directly as a PIL image, without matplotlib.

wc.to_image()

Final WordCloud image generated from tweet text

Optional: Save the WordCloud Image

To reuse the image in a report or blog post, write it to a PNG file with to_file.

wc.to_file("worldcup_tweet_wordcloud.png")

Final Thoughts

In this post, I showed how to create a WordCloud in Python using Twitter data. We created a simple WordCloud, generated WordClouds from Twitter bios and tweet text, and cleaned noisy text before visualization.

The most important step is text cleaning. Without cleaning, words like https, co, and random symbols can dominate the WordCloud. After cleaning, the visualization becomes more useful and easier to understand.

WordClouds are simple, but they are a good starting point for text analysis and NLP projects. They help us quickly see the most common words in a dataset before doing deeper analysis such as sentiment analysis, topic modeling, or classification.
