Part 1
Goal
The goal of
the first part of lab 2 was to use our knowledge of basic statistics to analyse
which of two cycling teams would be better to back, using the previous race
times of each team’s 15 riders. The aim is to identify which of the teams is
more likely to be the overall winners of the Tour de Geographia, rather than
the team that has the first cyclist over the finish line, as the overall
winning team produces a larger cash return for those who back them.
Method
To decide which team would be better to back, Excel was used to calculate the
range, mean, median, mode, skewness and kurtosis of each team’s previous race
times. The standard deviation of each team’s times was also calculated by hand.
This was done using the population formula rather than the sample formula, as
we have results for all racers within each team.
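As a rough illustration of these calculations, the sketch below uses Python's
standard statistics module in place of Excel. The finishing times are
hypothetical values chosen only for the example, not either team's data. It
also shows the difference between the population formula (divide by N) and the
sample formula (divide by N - 1).

```python
import statistics

# Hypothetical finishing times in minutes; NOT the teams' actual data.
times = [2250, 2262, 2268, 2274, 2280, 2280, 2292]

time_range = max(times) - min(times)   # slowest time minus fastest time
mean = statistics.mean(times)
median = statistics.median(times)
mode = statistics.mode(times)

# Population standard deviation divides the squared deviations by N ...
pop_sd = statistics.pstdev(times)
# ... while the sample standard deviation divides by N - 1.
sample_sd = statistics.stdev(times)

# The same population calculation written out as it would be done by hand.
manual_sd = (sum((t - mean) ** 2 for t in times) / len(times)) ** 0.5

print(time_range, mean, median, mode, pop_sd, sample_sd, manual_sd)
```

Because every rider in the team is included, the population figure is the
appropriate one; it is always slightly smaller than the sample figure for the
same data.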
After inputting both teams’ previous race times, the results for range, mean,
median and mode were converted from minutes to hours and minutes for ease of
understanding.
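The conversion itself is simple arithmetic. A minimal sketch, assuming the
times are held as whole or fractional minutes (the function name is made up
for this example):

```python
def to_hours_minutes(total_minutes: float) -> str:
    """Format a duration given in minutes as 'H h M min'."""
    # Round to the nearest whole minute, then split into hours and minutes.
    hours, minutes = divmod(round(total_minutes), 60)
    return f"{hours} h {minutes} min"

# 2274 minutes corresponds to the 37 hours 54 minutes reported for Team Astana.
print(to_hours_minutes(2274))  # 37 h 54 min
```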
These
figures were then used to decide which of the teams would be the overall
winner, therefore producing the largest cash return for their backers.
Figure 1 - Results Table
Figure 2 - Standard Deviation Calculation for Team Tobler
Figure 3 - Standard Deviation Calculation for Team Astana
Results
The range for Team Astana was 1 hour and 20 minutes. For Team Tobler, this was
30 minutes. Range is a measure of variability, which measures the difference
between the highest and lowest values in a given data set. In this case, the
difference in finishing times between Team Astana’s fastest racer and their
slowest was 1 hour and 20 minutes, while for Team Tobler it was only 30
minutes. This suggests that while Team Astana has one very fast racer, the rest
of their team do not all match this speed, whereas the racers in Team Tobler
are more equal in pace. As it is the overall winning team rather than the
individual winner that we are looking to back, this statistic would suggest
that backing Team Tobler makes more sense, as it is less likely that one of
their racers will lag behind the rest, causing them to lose overall.
Next is the
mean – this is a measure of central tendency that tells us the average of the
observations within a data set – in this case, the average finishing time of
each team. Team Astana have a mean finishing time of 37 hours and 54 minutes,
while Team Tobler has an average of 38 hours and 6 minutes. This, initially,
would suggest that it would be better to back Team Astana, as their riders on
average take a shorter amount of time to finish a race. However, the mean
can be heavily influenced by outliers, and as we know Team Astana has a star
rider and a larger range between the star and their slowest team member, we can
infer that the mean has been impacted by this outlier. The same is true for
Team Tobler. Therefore, the mean here is useful but only when it is considered
alongside our other statistics.
The median
was calculated next – this is another measure of central tendency that works by
ranking the observations from highest to lowest or vice-versa, and taking the
middle value to be the median. In the case of a data set where there is an even
number of observations, the median is the average of the two middle
observations. Team Astana’s median is 38 hours, while Team Tobler’s median is
38 hours and 12 minutes. This slight difference makes it seem that backing Team
Astana would be better. However, the median doesn’t tell us anything about the
speeds of the rest of the riders, such as how much faster or slower they were
compared to this middle value, so it is not particularly helpful if we are
interested in the overall winners of the race.
The mode is
our next statistic - this is another measure of central tendency and tells us
the observation that occurs most frequently within our data set. For both Team
Astana and Team Tobler, the mode is the same as the median: 38 hours and 38
hours 12 minutes respectively. While this is interesting, it does not tell us
much about which team is better to back as it focuses on a few results only,
while we are looking to figure out which team is more likely to win overall.
Kurtosis is another statistic available to inform our judgment. It measures the
peakedness of a data set, as it would be displayed in a histogram, relative to
what it would be in a normal distribution. A negative kurtosis indicates that
the distribution of data is relatively flat, which is known as platykurtic. If
the distribution is peaked, the kurtosis is positive, known as leptokurtic. A
normal distribution is known as mesokurtic. As Excel has been used to calculate
the kurtosis here, the program automatically subtracts three from the kurtosis,
so a normal distribution scores 0. In this case a result greater than 1
indicates a positive, leptokurtic result, and anything less than -1 indicates a
negative, platykurtic one. Here, Team Astana’s kurtosis is 1.168, while Team
Tobler’s is 2.927. Both are leptokurtic, but Team Tobler’s higher value means
that its distribution of times is more peaked than Team Astana’s, so more of
Team Tobler’s riders are close to their mean time. However, as neither value is
0 (mesokurtic), each team still has some riders who are further away from the
mean, which could affect their chances of winning overall. Taken on its own,
this statistic favours Team Tobler for consistency, although it says nothing
about which team is faster.
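For readers who want to reproduce a kurtosis figure outside a spreadsheet, the
sketch below implements the small-sample excess kurtosis formula documented for
Excel's KURT function. That this is the formula behind the reported values is
an assumption, and the function name and the two example lists are
illustrative only.

```python
import math

def excess_kurtosis(values):
    """Sample excess kurtosis with the usual small-sample correction.

    A normal distribution scores about 0; positive results are leptokurtic
    (peaked) and negative results are platykurtic (flat).
    """
    n = len(values)
    if n < 4:
        raise ValueError("need at least four observations")
    mean = sum(values) / n
    # Sample standard deviation (divides by n - 1), as this formula requires.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

# Hypothetical data: evenly spread values are flat, clustered values are peaked.
flat = list(range(1, 11))
peaked = [5, 5, 5, 5, 5, 5, 5, 5, 1, 9]
print(excess_kurtosis(flat), excess_kurtosis(peaked))
```

The evenly spread list gives a negative result and the clustered list a
positive one, matching the platykurtic and leptokurtic definitions above.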
Next is the skewness, which measures how symmetrical the distribution of a data
set is about its mean. A result of 0 indicates that there is no skewness, while
anything greater than 0 is a positive skewness, which can also be seen if the
mean is greater than the median. A result of less than 0 indicates a negatively
skewed data set, in which the mean is less than the median. For Team Astana,
the skewness is -0.003. For Team Tobler, it is -1.563. Both these results
indicate a negative skewness, but Team Astana’s is so close to 0 that its times
are almost symmetrical, whereas Team Tobler’s are clearly asymmetrical. A
negative skew means that most of the times sit above (slower than) the mean,
with a tail of a few lower (faster) times pulling the mean down, and this
effect is much stronger for Team Tobler. As Team Astana’s results are more
evenly spread about the mean than Team Tobler’s, this suggests that it would be
better to back them, especially as their mean time is faster than Team Tobler’s
by twelve minutes.
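The link between the sign of the skewness and the positions of the mean and
median can be checked with a short sketch. It uses the sample skewness formula
documented for Excel's SKEW function, which is assumed rather than confirmed to
be the one behind the reported values, and the times are hypothetical.

```python
import math
import statistics

def sample_skewness(values):
    """Sample skewness with the usual small-sample adjustment."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = sum(values) / n
    # Sample standard deviation (divides by n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

# Hypothetical times in minutes: one very fast rider and a tight main group.
times = [2200, 2280, 2282, 2284, 2286, 2288]
skew = sample_skewness(times)
print(skew, statistics.mean(times), statistics.median(times))
```

Here the single fast time drags the mean below the median, and the skewness
comes out negative, which is the pattern described for both teams.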
Finally, we have the standard deviation. This is a measure of dispersion that
illustrates how clustered the observations are around the mean of a data set.
If the data are approximately normally distributed, around 68% of all
observations will be within one standard deviation on either side of the mean,
95% will fall within two standard deviations and 99.7% will fall within three.
For Team Astana, the standard deviation is 16.63 minutes, so roughly 68% of the
racers’ times are expected to fall within 16.63 minutes of the mean. For Team
Tobler, the standard deviation is 7.62 minutes. As the standard deviation for
Team Tobler is smaller, their times are more tightly clustered, and the slow
end of their three-standard-deviation band falls earlier than Team Astana’s
despite their slightly slower mean. In other words, almost all of Team Tobler’s
racers are expected to cross the finish line before the slowest of Team
Astana’s do, which would suggest that they are more likely to win.
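The comparison can be made concrete with the means and standard deviations
reported above. The sketch below assumes the times are roughly normally
distributed, which is only an approximation for 15 riders; the helper name is
made up for this example.

```python
# Means and standard deviations as reported in the text, in minutes.
teams = {
    "Astana": {"mean": 37 * 60 + 54, "sd": 16.63},
    "Tobler": {"mean": 38 * 60 + 6, "sd": 7.62},
}

def band(mean, sd, k):
    """Return the (low, high) limits lying k standard deviations from the mean."""
    return mean - k * sd, mean + k * sd

for name, stats in teams.items():
    for k in (1, 2, 3):
        low, high = band(stats["mean"], stats["sd"], k)
        print(f"{name}: within {k} SD -> {low:.1f} to {high:.1f} minutes")
```

At three standard deviations the slow end of Team Tobler's band is earlier than
Team Astana's, while the fast end of Team Astana's band is earlier than Team
Tobler's, which fits a team with a star rider but less consistent team-mates.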
Overall, it
would appear from the above statistics that Team Tobler is more likely to win
the race. The most useful statistics in predicting this were the range and
standard deviation, as we are looking for the overall winners and not the
individual winner. While the skewness and kurtosis values are interesting, they
are diagnostics which would be better suited to checking larger data sets for
errors in collection. The standard deviation tells us that almost all of Team
Tobler’s racers are likely to complete the race within a shorter amount of
time, while the range tells us that the difference between their fastest and
slowest rider is less than that of Team Astana, making them more likely to win
overall as their outliers are less significant.
Figure 4 - Results with Range, Mean, Median, Mode and Standard Deviation expressed in hours and minutes
Part 2
Goal
The goal of part two of the assignment was to use population
data for Wisconsin counties from the years 2010 and 2015 to calculate the
weighted mean centers of population.
Method
First, a shapefile of Wisconsin counties was downloaded from the US Census
Bureau website. The 2010 and 2015 population data from an Excel spreadsheet
were then joined to it. Once the data had been joined, the layer was exported
to create a new shapefile, so that analysis tools such as the Mean Center tool
could be run on it successfully.
Next, the Mean Center tool was used to find the geographical mean center of
Wisconsin, which is located in Wood County. After this, the tool was run two
further times, once weighted by the 2010 population data and once weighted by
the 2015 population data.
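What the weighting does can be sketched in a few lines of Python. This assumes
the standard definition of a weighted mean center (each coordinate averaged
using the weights) and is not the GIS tool's own code; the centroids and
populations are hypothetical values, not Wisconsin data.

```python
def mean_center(points, weights=None):
    """Return the (x, y) mean center; every point has weight 1 by default."""
    if weights is None:
        weights = [1.0] * len(points)
    if len(weights) != len(points):
        raise ValueError("points and weights must be the same length")
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return x, y

# Hypothetical county centroids (x, y) and populations; not real data.
centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
population = [100, 100, 100, 700]

print(mean_center(centroids))              # unweighted: the geometric middle
print(mean_center(centroids, population))  # pulled toward the populous corner
```

With equal weights the center sits in the middle of the four points; once
population is used as the weight, it moves toward the most populous one.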
Results
The first mean center created shows only the geographical mean center of
Wisconsin, which is in Wood County. Used in this way, the tool simply takes the
center point of each county and averages the latitude and longitude of these
points to find the central point of the state. Here, central tendency is
measured spatially. It does not tell us much about the state in this context,
but would be a useful tool when applied to other problems, such as looking for
the origin location in a crime pattern.
On the other hand, the mean centers weighted by the population of each county
for 2010 and 2015 are useful in identifying patterns in population
distribution. We can see from the map that for both 2010 and 2015 the weighted
mean centers of population were located in Green Lake County, southeast of the
unweighted mean center. The position has not changed significantly in the five
years between data sets, which could be due to the financial crash in 2008
making it more difficult for people to buy or sell homes, resulting in less
population movement. It could also reflect a strong job market in this area:
the weighted mean center is close to Madison and Milwaukee, where the
population of the surrounding counties will be larger due to commuters living
in the suburbs. There is, however, a slight shift west in the weighted mean
center for 2015, which could be due to an expanding population in the west of
the state, where it borders Minnesota and lies within commuting distance of the
Twin Cities.
Figure 5 - Map showing the geographic mean center of Wisconsin, along with mean centers weighted by 2010 and 2015 population data
Sources
US Census Bureau
Rogerson, P. A. 2015. Statistical Methods for Geography. Sage Publications: London