Scraping Tweets with Tweepy

7 minute read


Hello and welcome back, everyone! In this blog we are going to explore how to scrape tweets using Twitter’s API and Tweepy. Tweepy handles the API calls for us; we only need to supply the keys.

Getting API Keys

First we need a Twitter Developer Account, since only with one can we get the keys required to scrape tweets.

  • Visit developer.twitter.com and read carefully how many requests each type of account allows.
  • Sign in with your credentials.
  • Then create an app.

  • Generate and save API Keys.

Installing Tweepy

!pip install git+https://github.com/tweepy/tweepy.git
Collecting git+https://github.com/tweepy/tweepy.git
  Cloning https://github.com/tweepy/tweepy.git to c:\users\dell\appdata\local\temp\pip-req-build-div0g8k4
Requirement already satisfied: oauthlib<4,>=3.2.0 in c:\users\dell\appdata\roaming\python\python38\site-packages (from tweepy==4.10.0) (3.2.0)
Requirement already satisfied: requests<3,>=2.27.0 in c:\users\dell\appdata\roaming\python\python38\site-packages (from tweepy==4.10.0) (2.27.1)
Requirement already satisfied: requests-oauthlib<2,>=1.2.0 in c:\programdata\anaconda3\lib\site-packages (from tweepy==4.10.0) (1.3.0)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\users\dell\appdata\roaming\python\python38\site-packages (from requests<3,>=2.27.0->tweepy==4.10.0) (2.0.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\programdata\anaconda3\lib\site-packages (from requests<3,>=2.27.0->tweepy==4.10.0) (1.26.4)
Requirement already satisfied: idna<4,>=2.5 in c:\programdata\anaconda3\lib\site-packages (from requests<3,>=2.27.0->tweepy==4.10.0) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in c:\programdata\anaconda3\lib\site-packages (from requests<3,>=2.27.0->tweepy==4.10.0) (2020.12.5)
Building wheels for collected packages: tweepy
  Building wheel for tweepy (setup.py): started
  Building wheel for tweepy (setup.py): finished with status 'done'
  Created wheel for tweepy: filename=tweepy-4.10.0-py3-none-any.whl size=95239 sha256=21aef993404498afa5f673cece5e8ee6a9c82b6b89c7b063c963858021692b03
  Stored in directory: C:\Users\Dell\AppData\Local\Temp\pip-ephem-wheel-cache-yi90v1ry\wheels\ad\05\51\a78f66d15b87f9c623d2f3afc4401660ac4219e526c787fb8b
Successfully built tweepy
Installing collected packages: tweepy
  Attempting uninstall: tweepy
    Found existing installation: tweepy 4.8.0
    Uninstalling tweepy-4.8.0:
      Successfully uninstalled tweepy-4.8.0
Successfully installed tweepy-4.10.0


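Installing from GitHub pulls in the latest development version. If you just want the stable release from PyPI, a plain pip install works as well:

!pip install tweepy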

Preparing Keys

api_key="api_key here"
secret="secret key here"
bearer="bearer here"
access_token="access_token here"
access_token_secret="access_token_secret here"
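
Hard-coding keys in a notebook is risky if you ever share it. As a small optional sketch (the environment variable names below are my own choice, not part of the original post), the keys can be read from the environment instead:

import os

# Assumes these variables were exported beforehand, e.g. in the shell profile.
api_key = os.environ["TWITTER_API_KEY"]
secret = os.environ["TWITTER_API_SECRET"]
bearer = os.environ["TWITTER_BEARER_TOKEN"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]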

Import and prepare API Object

Pass api_key and secret into OAuthHandler, then set access_token and access_token_secret with set_access_token.

import tweepy as tw

# Authenticate with the consumer key and secret, then attach the access token.
auth = tw.OAuthHandler(api_key, secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)
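
Before making search calls, it can be worth confirming that the credentials actually work. A quick sanity check (not part of the original flow) uses API.verify_credentials, which returns the authenticated user and raises an error if the keys are wrong:

# Raises an Unauthorized error if the keys or tokens are invalid.
me = api.verify_credentials()
print(f"Authenticated as @{me.screen_name}")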

Making First API Call

key_word="Funny"
max_items=5
for tweet in tw.Cursor(api.search_tweets,
        q=key_word, count=max_items).items(max_items):
    print(tweet.text)
RT @SocDoneLeft: still support the Socialist Agenda? https://t.co/qWYlwscPuD
RT @hiraiboostan: i know it might sound weird but twice really did save me in some ways. they helped me feel less lonely and opened a door…
RT @memorytrain2012: 下北沢クラブキューに誘っていただきました。近藤さん(ex.PEALOUT)率いるmy funny hitchhiker、知り合いからずっと話を聞いていたthe MADRASとの3マンライブ。大先輩に囲まれて緊張感ありますが突然少年は遠慮…
I mean...

(https://t.co/cOvCZLjJVs)
#meme #memes #funny #joke #memebot https://t.co/HEi2bsOytd
RT @IcyJaime: “are you ok?” nah but l’m funny

In the above example, we searched for the keyword “Funny” and retrieved only 5 items.
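
The v1.1 search endpoint also understands query operators and extra parameters. As a rough sketch of the same call with a slightly refined query (using the standard -filter:retweets operator, the lang parameter, and extended tweet mode to avoid truncated text):

# Same search, but skip retweets, keep only English tweets, and print full text.
for tweet in tw.Cursor(api.search_tweets,
        q=key_word + " -filter:retweets", lang="en",
        tweet_mode="extended", count=max_items).items(max_items):
    print(tweet.full_text)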

Storing Tweets as CSV and DataFrame

Since we have already made some calls, let’s create a DataFrame where all the tweets will be stored.

import csv, time, os
import pandas as pd
def get_related_tweets(key_words, language="en", max_tweets=50, max_items=10):
    fname=language+str(time.time())+".csv"
    print(f"Filename {fname}")
    
    count=0
    tweets=max_tweets
    
    for key_word in key_words:
        print(f"Current Keyword: {key_word}")
        for tweet in tw.Cursor(api.search_tweets,
                               q=key_word, count=max_items).items(max_items):
            
            tweet_created_at = []
            text = []
            user=[]
            hashtags = []
            user_mentions = []
            in_reply = []
            protected = []
            followers_count = [] 
            friends_count = []
            listed_count = []
            created_at = []
            favourites_count = []
            geo_enabled = []
            verified =[]
            statuses_count=[]
            coordinates=[]
            is_quote_status=[]
            retweet_count=[]
            favorited=[]
            retweeted=[]
            source = []
            place=[]
            lang=[]
            kwd=[]
            ids=[]
            locations=[]
            description=[]

            if tweet.lang!=language:
                continue
            count+=1
            try:
              tweet_created_at.append(tweet.created_at)
              # Re-fetch the tweet in extended mode to get the untruncated text.
              status = api.get_status(tweet.id, tweet_mode="extended")
              try:
                  # A retweet keeps its full text on the original (retweeted) status.
                  txt = status.retweeted_status.full_text
              except AttributeError:
                  txt = status.full_text

              description.append(tweet.user.description)
              locations.append(tweet.user.location)
              ids.append(tweet.id)
              text.append(txt)
              user.append(tweet.user.screen_name)
              hashtags.append(tweet.entities["hashtags"])
              user_mentions.append(len(tweet.entities["user_mentions"]))
              in_reply.append(tweet.in_reply_to_status_id)
              protected.append(tweet.user.protected)
              followers_count.append(tweet.user.followers_count)
              friends_count.append(tweet.user.friends_count)
              listed_count.append(tweet.user.listed_count)
              created_at.append(tweet.user.created_at)
              favourites_count.append(tweet.user.favourites_count)
              geo_enabled.append(tweet.user.geo_enabled)
              verified.append(tweet.user.verified)
              statuses_count.append(tweet.user.statuses_count)
              coordinates.append(tweet.coordinates)
              is_quote_status.append(tweet.is_quote_status)
              retweet_count.append(tweet.retweet_count)
              favorited.append(tweet.favorited)
              retweeted.append(tweet.retweeted)
              source.append(tweet.source)
              place.append(tweet.place)
              lang.append(tweet.lang)
              kwd.append(key_word)

              dict_data={"id":ids,'tweet_created_at':tweet_created_at, 
                                    'text': text, 'user': user, "bio":description,"location":locations,
                                    "hashtags":hashtags, "user_mentions":user_mentions,
                                    "in_reply":in_reply, "protected":protected, "followers_count":followers_count,
                                    "friends_count":friends_count, "listed_count":listed_count, "created_at":created_at,
                                    "favourites_count":favourites_count, "geo_enabled":geo_enabled, "verified":verified,
                                    "statuses_count":statuses_count, "coordinates":coordinates, "is_quote_status":is_quote_status,
                                    "retweet_count":retweet_count,
                                    "retweeted":retweeted,"lang":lang,
                                    "source":source,"place":place,"kwd":key_word}
              csv_columns=list(dict_data.keys())
              # Each list holds a single value for the current tweet; unwrap it.
              dict_data = {k:v[0] for k,v in dict_data.items()}

              # Write the header only the first time the file is created.
              if not os.path.isfile(fname):
                with open(fname, 'a', encoding='utf-8', newline='') as csvfile:
                  writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
                  writer.writeheader()

              with open(fname, 'a', encoding='utf-8', newline='') as csvfile:
                  writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
                  writer.writerow(dict_data)

              if count>=tweets:
                break
            except Exception as e:
              print(f"Something went wrong ({e}). Skipping this tweet.")
    
    return pd.read_csv(fname, parse_dates=["tweet_created_at","created_at"])

The above function takes a list of keywords, a language, and the max_tweets and max_items values that will be used in the API calls.

  • Prepare a file name based on the current timestamp so that file names never repeat.
  • Initialize a counter and the maximum number of tweets so that at most max_tweets tweets are collected.
  • Loop over the keywords.
  • Loop over the results returned by Cursor. We pass api.search_tweets, which the Cursor calls for us.
  • Prepare lists to hold the important fields such as tweet date, account creation date, user, tweet text, and so on.
  • If the tweet’s language is not the language we want, skip the current result.
  • Append all the important values to their respective lists and finally create a dictionary.
  • Write that dictionary to the CSV file using csv.DictWriter.
  • Finally, read the CSV file back as a DataFrame and return it (an alternative that skips the CSV round-trip is sketched below).
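
As mentioned in the last bullet, writing to CSV and reading it back is only one option. A simplified alternative sketch (reduced to a few fields, with a hypothetical collect_tweets helper name) collects each row as a dictionary and builds the DataFrame directly with pandas:

def collect_tweets(key_word, language="en", max_items=10):
    rows = []
    for tweet in tw.Cursor(api.search_tweets, q=key_word,
                           tweet_mode="extended", count=max_items).items(max_items):
        if tweet.lang != language:
            continue
        # A retweet keeps its full text on the original (retweeted) status.
        txt = getattr(tweet, "retweeted_status", tweet).full_text
        rows.append({"id": tweet.id,
                     "tweet_created_at": tweet.created_at,
                     "text": txt,
                     "user": tweet.user.screen_name,
                     "retweet_count": tweet.retweet_count,
                     "kwd": key_word})
    return pd.DataFrame(rows)

Requesting tweet_mode="extended" directly in the search also avoids the extra api.get_status call per tweet.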

keywords=["climate","funny"]
tweets_df = get_related_tweets(keywords)
tweets_df
Filename en1654358920.3398395.csv
Current Keyword: climate
Current Keyword: funny
(Truncated preview of tweets_df, one row per scraped tweet for the keywords climate and funny. Columns: id, tweet_created_at, text, user, bio, location, hashtags, user_mentions, in_reply, protected, ..., verified, statuses_count, coordinates, is_quote_status, retweet_count, retweeted, lang, source, place, kwd.)
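
From here, a couple of ordinary pandas checks (nothing Tweepy-specific) give a quick feel for what was collected:

print(tweets_df.shape)                          # rows x columns
print(tweets_df["kwd"].value_counts())          # tweets collected per keyword
print(tweets_df["tweet_created_at"].min(),
      tweets_df["tweet_created_at"].max())      # time span of the scrape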

What’s next?

  • Performing EDA on tweets about some topic like COVID, college, and so on.
  • Performing sentiment analysis on the tweet text.
  • Performing tweet generation using some Markov models.
