Python for Stock Market Analysis: Working with Moving Averages

13 minute read

Introduction

This blog is a first part of our series Python for Stock Market Analysis.

Disclaimer: This blog is for educational purpose only and we do not recommend taking the knowledge gained from this blog to implement in real financial exercises.

This blog tries to implement preliminary metrics that are used in the stock market analysis. The dataset we will be using is available via yahoofinance.

For interactive version of this blog, please visit this link.

Preliminary Actions

Install Libraries

Please install:

YahooFinance as pip install yfinance for downloading data of stock’s history.
Pandas as pip install pandas for data analysis.
Plotly as pip install plotly for interactive visualizations.
Cufflinks as pip install cufflinks for using interactive plots in pandas DataFrame.

You might need to install pip install -U kaleido if you need to save plots as png image.

If you are new into plotly, then we have an awesome blog about it where we have done plots based on COVID 19 dataset.

!pip install yfinance

Requirement already satisfied: yfinance in c:\programdata\anaconda3\lib\site-packages (0.1.63)
Requirement already satisfied: numpy>=1.15 in c:\users\dell\appdata\roaming\python\python38\site-packages (from yfinance) (1.19.5)
Requirement already satisfied: requests>=2.20 in c:\users\dell\appdata\roaming\python\python38\site-packages (from yfinance) (2.26.0)
Requirement already satisfied: multitasking>=0.0.7 in c:\programdata\anaconda3\lib\site-packages (from yfinance) (0.0.9)
Requirement already satisfied: pandas>=0.24 in c:\programdata\anaconda3\lib\site-packages (from yfinance) (1.2.4)
Requirement already satisfied: lxml>=4.5.1 in c:\programdata\anaconda3\lib\site-packages (from yfinance) (4.6.3)
Requirement already satisfied: pytz>=2017.3 in c:\programdata\anaconda3\lib\site-packages (from pandas>=0.24->yfinance) (2021.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\programdata\anaconda3\lib\site-packages (from pandas>=0.24->yfinance) (2.8.1)
Requirement already satisfied: six>=1.5 in c:\programdata\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas>=0.24->yfinance) (1.15.0)
Requirement already satisfied: charset-normalizer~=2.0.0 in c:\users\dell\appdata\roaming\python\python38\site-packages (from requests>=2.20->yfinance) (2.0.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\programdata\anaconda3\lib\site-packages (from requests>=2.20->yfinance) (1.26.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\programdata\anaconda3\lib\site-packages (from requests>=2.20->yfinance) (2020.12.5)
Requirement already satisfied: idna<4,>=2.5 in c:\programdata\anaconda3\lib\site-packages (from requests>=2.20->yfinance) (2.10)

Import Required Libraries

import pandas as pd
import plotly.express as px
import cufflinks
import plotly.io as pio 
import yfinance as yf
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
pio.renderers.default = "notebook" # should change by looking into pio.renderers

pd.options.display.max_columns = None

Download Stock Data of Apple

By default, we are allowed to download data from 1900-01-01

symbols = ["AAPL"]

df = yf.download(tickers=symbols)
df.head()

[*********************100%***********************]  1 of 1 completed

	Open	High	Low	Close	Adj Close	Volume
Date
1980-12-12	0.128348	0.128906	0.128348	0.128348	0.100326	469033600
1980-12-15	0.122210	0.122210	0.121652	0.121652	0.095092	175884800
1980-12-16	0.113281	0.113281	0.112723	0.112723	0.088112	105728000
1980-12-17	0.115513	0.116071	0.115513	0.115513	0.090293	86441600
1980-12-18	0.118862	0.119420	0.118862	0.118862	0.092911	73449600

It seems that data is only available from 1980-12-12. The column names in the above fields are:

Open: It is the price of the stock at the start of the day of that date.
Close: It is the price of the stock at the end of that date.
High: It is the highest price of the stock on that date.
Low: It is the lowest price of the stock on that date.
Volume: It is the number of shares traded on that date.

Perform EDA

EDA or Exploratory Data Analysis is the first step in any Data Analysis and lets do that in our Stock Data too. We have blogs about doing EDA, Statistical and Inferential Analysis please check them out for more about EDAs.

Checking for Null Value

# convert column names into lowercase
df.columns = [c.lower() for c in df.columns]

ndf = pd.DataFrame()
for c in df.columns:
    nc = df[c].isna().sum()
    tr = len(df[c])
    rate = nc/tr
    ndf = ndf.append({"col_name":c,"total_rows": tr, 
                "null_rows": nc,
                "rate": rate},ignore_index=True)
ndf

	col_name	total_rows
0	open	10390.0
1	high	10390.0
2	low	10390.0
3	close	10390.0
4	adj_close	10390.0
5	volume	10390.0

It seems that we do not have any null rows present on the data.

View the Distribution

It gives us the frequency of value’s some range. It is simply a histogram.

fig = df.iplot(kind="hist",subplots=True, title="Distribution of All Variables", asFigure=True)
fig.write_image("stock_analysis/dist.png")
fig.show()

It seems that all values of the columns are left tailed.

View the Box Plot

Box Plot gives the clear picture of our descriptive nature of the data.

fig = df.iplot(kind="box",subplots=True, title="Box of All Variables", asFigure=True)
fig.write_image("stock_analysis/box.png")
fig.show()

It seems that we have too many outliers but it does not matter right now.

Summary of our data

df.describe()

	open	high	low	close	adj_close	volume
count	10390.000000	10390.000000	10390.000000	10390.000000	10390.000000	1.039000e+04
mean	13.689530	13.837209	13.542035	13.695320	13.077773	3.326112e+08
std	29.525352	29.857351	29.199483	29.542847	29.249790	3.394925e+08
min	0.049665	0.049665	0.049107	0.049107	0.038385	0.000000e+00
25%	0.281250	0.287946	0.273996	0.281250	0.234167	1.251712e+08
50%	0.466518	0.476004	0.459732	0.466518	0.385693	2.205952e+08
75%	14.034375	14.205357	13.918214	14.033482	12.025377	4.136293e+08
max	182.630005	182.940002	179.119995	182.009995	181.778397	7.421641e+09

Box Plot already gave us the summary of the data. We can see that the average volume is 3.326112e+08 but will it give a true picture about the volume’s flow over the course of the time? It won’t because there will be certain rise and falls of the values over the time. Lets try to visualize it as line plot.

fig=df.iplot(kind="line",subplots=True, title="Trend of All Variables", asFigure=True)
fig.write_image("stock_analysis/trend.png")
fig.show()

As we can see in the above plot that, the trend of the OHLC is in increasing order while Volume is not. The values of share increases/decreases but in overall, it seems to be increasing.

Moving Average

Moving average is a kind of average where we take the average of data within some time frame only. While looking at the time series data that have high volatility (e.g. standard deviation), the simple average DOES NOT give a clear picture of the mean or average value. One reason is that, in real world financial data, the amount/price does increase/decrease with some unexpected factors like COVID outbreak, or expected factors like Tesla’s new car. So to get the figure that will well represent the average amount, we will take the average over some time only. By doing so, we wont be caring much about the history that is too much old and does not affect much to our present.

Simple Moving Average (SMA)

Simple Moving Average is the simplest example of the Moving Average where we take the data from some time frame and divide it by number of data points. The size of the time frame is often known as the window of movement. It is an example of Technical Indicator (heuristic or pattern-based signals produced by the price or volume).

A formula to calculate Simple Moving Average is:

\[SMA = \frac{V_1 + V_2 + V_3 + ... + V_n}{n}\]

Where,

V is a value at period n
n is number of periods

Lets try to implement this concept in our data, we will take window size or n as 5.

tdf = df.copy()
smadf = tdf.rolling(window=5).mean()
smadf

	open	high	low	close	adj_close	volume
Date
1980-12-12	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-15	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-16	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-17	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-18	0.119643	0.119978	0.119420	0.119420	0.093347	182107520.0
...	...	...	...	...	...	...
2022-02-18	170.208002	171.663998	168.304001	170.080002	170.080002	72770540.0
2022-02-22	169.730002	171.085999	167.422000	169.168002	169.168002	73766000.0
2022-02-23	168.644000	169.725998	165.322000	166.624005	166.624005	78910580.0
2022-02-24	164.789999	167.628000	161.712000	164.662006	164.662006	94904600.0
2022-02-25	163.351999	166.269998	160.191998	163.856006	163.856006	99363080.0

10390 rows × 6 columns

for c in smadf.columns:
    tdf[f"sma_{c}"] = smadf[c]
tdf

	open	high	low	close	adj_close	volume	sma_open	sma_high	sma_low	sma_close	sma_adj_close	sma_volume
Date
1980-12-12	0.128348	0.128906	0.128348	0.128348	0.100326	469033600	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-15	0.122210	0.122210	0.121652	0.121652	0.095092	175884800	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-16	0.113281	0.113281	0.112723	0.112723	0.088112	105728000	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-17	0.115513	0.116071	0.115513	0.115513	0.090293	86441600	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-18	0.118862	0.119420	0.118862	0.118862	0.092911	73449600	0.119643	0.119978	0.119420	0.119420	0.093347	182107520.0
...	...	...	...	...	...	...	...	...	...	...	...	...
2022-02-18	169.820007	170.539993	166.190002	167.300003	167.300003	82614200	170.208002	171.663998	168.304001	170.080002	170.080002	72770540.0
2022-02-22	164.979996	166.690002	162.149994	164.320007	164.320007	91162800	169.730002	171.085999	167.422000	169.168002	169.168002	73766000.0
2022-02-23	165.539993	166.149994	159.750000	160.070007	160.070007	90009200	168.644000	169.725998	165.322000	166.624005	166.624005	78910580.0
2022-02-24	152.580002	162.850006	152.000000	162.740005	162.740005	141147500	164.789999	167.628000	161.712000	164.662006	164.662006	94904600.0
2022-02-25	163.839996	165.119995	160.869995	164.850006	164.850006	91881700	163.351999	166.269998	160.191998	163.856006	163.856006	99363080.0

10390 rows × 12 columns

Plotting SMA of All

smac = [c for c in tdf.columns if "sma" in c]
col = [c for c in tdf.columns if "sma" not in c]

for s,c in zip(smac,col):
    fig = tdf[[c, s]].iplot(kind="line", title=f"{s} vs {c}", xTitle="Date", asFigure=True)
    fig.write_image(f"stock_analysis/sma_{c}.png")
    fig.show()

We can not see the much difference between WMA and SMA and it is because of the level (daily) of our data. Lets try to plot data of last 100 days only.

for s,c,w in zip(smac,col, wmac):
    fig=tdf[-100:][[c, s, w]].iplot(kind="line", title=f"{s} vs {c} vs {w}",  xTitle="Date", asFigure=True)
    fig.write_image(f"stock_analysis/sma_{c}2.png")
    fig.show()

Now it is more clearer. Looking over the plot of open,

We can clearly see that the default value of the open have some spikes and huge downfalls.
But the value of SMA and WMA is not getting affected that much by those sudden rise/fall in the value because it contains the value from the past.
If we look over the November to December, the Open value is increasing rapidly but the SMA is increasing slowly because it knows that there were some small values in last 5 days. Similarly WMA is also increasing slowly. But we can see that WMA is always much nearer to the Open value because it is giving much importance to the latest value.
The sudden rise/fall in open is not the sign or price up/down in long run because one must always consider moving averages.

Exponential Moving Average (EMA)

It is similar to the WMA in the sense of giving weights to values but, instead of the linear weights, we will give exponential weights.

A general formula of EMA at time t is:

\[EMA_t = \left[V_t * \left(\frac{s}{1+d}\right)\right] + EMA_y * \left[1-\left(\frac{s}{1+d}\right)\right]\]

Where,

EMAt is EMA value at t
Vt is value at t
EMAy is EMA at t-1
s is smoothing parameter
d is number of ts

Purpose of using EMA is to give high weights to more recent values and shows more sensitivity to more recent data. This average is more responsive to the latest price changes than SMA.

We do not have to use this scary formula from the scratch because pandas gives us some ways to do it with little code. Please refer to Pandas documentation for more info about EWM.

\[y_0 = x_0 \\\ y_t = (1 - \alpha) y_{t-1} + \alpha x_t,\]

Where, alpha is either the value given by us or smoothing/(time periods+1). Smoothing is generally taken as 2 and time periods is taken as our requirement.

emadf=df.ewm(span=5, min_periods=5, adjust=True).mean()
emadf

	open	high	low	close	adj_close	volume
Date
1980-12-12	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-15	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-16	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-17	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-18	0.118153	0.118552	0.117994	0.117994	0.092232	1.239301e+08
...	...	...	...	...	...	...
2022-02-18	170.786802	171.961456	168.279383	169.672761	169.670227	7.574409e+07
2022-02-22	168.851200	170.204305	166.236253	167.888510	167.886821	8.088366e+07
2022-02-23	167.747464	168.852868	164.074169	165.282342	165.281216	8.392551e+07
2022-02-24	162.691644	166.851914	160.049446	164.434897	164.434146	1.029995e+08
2022-02-25	163.074428	166.274608	160.322962	164.573267	164.572766	9.929357e+07

10390 rows × 6 columns

for c in emadf.columns:
    tdf[f"ema_{c}"] = emadf[c]
tdf

	open	high	low	close	adj_close	volume	sma_open	sma_high	sma_low	sma_close	sma_adj_close	sma_volume	wma_open	wma_high	wma_low	wma_close	wma_adj_close	wma_volume	ema_open	ema_high	ema_low	ema_close	ema_adj_close	ema_volume
Date
1980-12-12	0.128348	0.128906	0.128348	0.128348	0.100326	469033600	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-15	0.122210	0.122210	0.121652	0.121652	0.095092	175884800	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-16	0.113281	0.113281	0.112723	0.112723	0.088112	105728000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-17	0.115513	0.116071	0.115513	0.115513	0.090293	86441600	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1980-12-18	0.118862	0.119420	0.118862	0.118862	0.092911	73449600	0.119643	0.119978	0.119420	0.119420	0.093347	182107520.0	0.117932	0.118304	0.117746	0.117746	0.092038	1.234001e+08	0.118153	0.118552	0.117994	0.117994	0.092232	1.239301e+08
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2022-02-18	169.820007	170.539993	166.190002	167.300003	167.300003	82614200	170.208002	171.663998	168.304001	170.080002	170.080002	72770540.0	170.538670	171.722664	168.136002	169.608669	169.608669	7.264790e+07	170.786802	171.961456	168.279383	169.672761	169.670227	7.574409e+07
2022-02-22	164.979996	166.690002	162.149994	164.320007	164.320007	91162800	169.730002	171.085999	167.422000	169.168002	169.168002	73766000.0	168.796001	170.064666	166.084666	167.688671	167.688671	7.877865e+07	168.851200	170.204305	166.236253	167.888510	167.886821	8.088366e+07
2022-02-23	165.539993	166.149994	159.750000	160.070007	160.070007	90009200	168.644000	169.725998	165.322000	166.624005	166.624005	78910580.0	167.399332	168.419331	163.527333	164.656006	164.656006	8.419305e+07	167.747464	168.852868	164.074169	165.282342	165.281216	8.392551e+07
2022-02-24	152.580002	162.850006	152.000000	162.740005	162.740005	141147500	164.789999	167.628000	161.712000	164.662006	164.662006	94904600.0	162.044666	166.127334	159.086666	163.361339	163.361339	1.049387e+08	162.691644	166.851914	160.049446	164.434897	164.434146	1.029995e+08
2022-02-25	163.839996	165.119995	160.869995	164.850006	164.850006	91881700	163.351999	166.269998	160.191998	163.856006	163.856006	99363080.0	161.727998	165.291332	158.805998	163.424006	163.424006	1.039311e+08	163.074428	166.274608	160.322962	164.573267	164.572766	9.929357e+07

10390 rows × 24 columns

Plotting EMA of All

Instead viewing EMA of entire data, lets view it of last 100 days only.

smac = [c for c in tdf.columns if "sma" in c]
wmac = [c for c in tdf.columns if "wma" in c]
emac = [c for c in tdf.columns if "ema" in c]
col = [c for c in tdf.columns if "sma" not in c and "wma" not in c and "ema" not in c]

for s,c,w,e in zip(smac,col, wmac, emac):
    fig=tdf[-100:][[c, s, w, e]].iplot(kind="line", title=f"{s} vs {c} vs {w} vs {e}", xTitle="Date", asFigure=True)
    fig.write_image(f"stock_analysis/ema_{c}.png")
    fig.show()

Looking over the EMA,it seems that it is much more smoother than the other values. But the smoothness depends on the value of the smoothing. Based on EMA, lots of other important metrics are calculated in Stock Market Analysis and to note down few:

Guppy Moving Average (GMMA)
Percentage Price Oscillator (PPO)
Relative Strength Index (RSI)
Moving Average Convergence Divergence (MCAD)

We will be exploring all above 4 metrics in the next blog please stay tuned for that.

Plotting Candlestick

Candlesticks are often used in stock data analysis for clear visualization and lets try that as well. We will use graph_objects of Plotly.

import plotly.graph_objects as go

fig=go.Figure()

fig.add_trace(go.Candlestick(x=tdf[-1000:].index,
                open=tdf[-1000:]['open'],
                high=tdf[-1000:]['high'],
                low=tdf[-1000:]['low'],
                close=tdf[-1000:]['close'], 
                 name = 'Stock Market Data'))
fig.add_trace(go.Candlestick(x=tdf[-1000:].index,
                open=tdf[-1000:]['ema_open'],
                high=tdf[-1000:]['ema_high'],
                low=tdf[-1000:]['ema_low'],
                close=tdf[-1000:]['ema_close'], 
                 name = 'EMA Stock Market Data'))

fig.update_layout(
    title= "AAPL Stock Data",
    yaxis_title="Stock's Price in USD",
    xaxis_title="Date")               

fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=150, label="150D", step="day", stepmode="backward"),
            dict(count=4, label="4m", step="month", stepmode="backward"),
            dict(step="all")
        ])
    )
)

color_hi_fill = 'black'
color_hi_line = 'blue'

color_lo_fill = 'yellow'
color_lo_line = 'purple'

fig.data[0].increasing.fillcolor = color_hi_fill
fig.data[0].increasing.line.color = color_hi_line
fig.data[0].decreasing.fillcolor = 'rgba(0,0,0,0)'
fig.data[0].decreasing.line.color = 'rgba(0,0,0,0)'

fig.data[1].increasing.fillcolor = 'rgba(0,0,0,0)'
fig.data[1].increasing.line.color = 'rgba(0,0,0,0)'
fig.data[1].decreasing.fillcolor = color_lo_fill
fig.data[1].decreasing.line.color = color_lo_line

fig.write_image("stock_analysis/candle.png")

fig.show()

Moving Median

What if we used median instead of the mean? Lets copy and paste the codes written in above steps and calculate median instead of the mean.

tdf = df.copy()
smmdf = tdf.rolling(window=5).median()

for c in smmdf.columns:
    tdf[f"smm_{c}"] = smmdf[c]

emadf=df.ewm(span=5, min_periods=5, adjust=True).mean()

for c in emadf.columns:
    tdf[f"ema_{c}"] = emadf[c]
    
smmc = [c for c in tdf.columns if "smm" in c]
emac = [c for c in tdf.columns if "ema" in c]
col = [c for c in tdf.columns if "smm" not in c and "ema" not in c]

for s,c,e in zip(smmc,col,emac):
    fig=tdf[-100:][[c, s, e]].iplot(kind="line", title=f"{s} vs {c} vs {e}", xTitle="Date", asFigure=True)
    fig.write_image(f"stock_analysis/mma_{c}.png")
    fig.show()

EMA seems to be much near to the open and EMA is more sensitive towards the change than Simple Moving Median.

Moving Variance

tdf = df.copy()
smmdf = tdf.rolling(window=5).var()

for c in smmdf.columns:
    tdf[f"smv_{c}"] = smmdf[c]

emadf=df.ewm(span=5, min_periods=5, adjust=True).var()

for c in emadf.columns:
    tdf[f"emv_{c}"] = emadf[c]
    
smmc = [c for c in tdf.columns if "smv" in c]
emac = [c for c in tdf.columns if "emv" in c]
col = [c for c in tdf.columns if "smv" not in c and "emv" not in c]

for s,c,e in zip(smmc,col,emac):
    fig=tdf[-100:][[c, s, e]].iplot(kind="line", y = [s,e], secondary_y=c, title=f"{s} vs vs {e}", xTitle="Date", asFigure=True)
    fig.write_image(f"stock_analysis/mva_{c}.png")
    fig.show()

Variance seems to be increasing when there is sudden change in the trend and it seems to be decreasing when the change seems to be normal.

Conclusion

In this blog, we have explored some of popular moving average algorithms used in the stock market analysis and in the next blog, we will explore some of the popular metrics that uses Moving Average as the base metric.

References

Twitter Facebook LinkedIn

Quassarian Viper