Monday, February 20, 2017

Assignment 2

Part 1


Goal

The goal of the first part of lab 2 was to use our knowledge of basic statistics to analyse which of two cycling teams would be better to back, using the previous race times of each team’s 15 riders. The aim is to identify which of the teams is more likely to be the overall winner of the Tour de Geographia, rather than the team that has the first cyclist over the finish line, as the overall winning team produces a larger cash return for those who back them.

Method

To decide which team would be better to back, Excel was used to calculate the range, mean, median, mode, skewness and kurtosis of each team’s previous race times. The standard deviation of each team’s times was also calculated by hand. This was done using the population formula rather than the sample formula, as we have results for all racers within each team.
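The same descriptive statistics can be reproduced outside Excel. The following is a minimal sketch using Python’s standard `statistics` module; the `times` list is made-up illustrative data, not the lab’s race times, and the `describe` helper is a hypothetical name.

```python
import statistics

# Hypothetical finishing times in minutes for one team (NOT the lab data).
times = [2270, 2275, 2280, 2280, 2285, 2290, 2300]


def describe(values):
    """Return the range, mean, median, mode and population standard deviation."""
    return {
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # pstdev divides by N (population), not N - 1 (sample),
        # matching the choice of the population formula described above.
        "pstdev": statistics.pstdev(values),
    }


summary = describe(times)
print(summary)
```

Using `statistics.stdev` instead of `statistics.pstdev` would give the sample standard deviation, which is slightly larger for the same data.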

After inputting both teams’ previous race times, the results for range, mean, median and mode were converted from minutes to hours and minutes, for ease of understanding.
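The conversion from minutes to hours and minutes is simple arithmetic. As a small sketch (the function name `to_hours_minutes` is just an illustrative choice), it could be written as:

```python
def to_hours_minutes(total_minutes):
    """Convert a duration in minutes to an 'H hr M min' string."""
    # Round to whole minutes first, then split into hours and leftover minutes.
    hours, minutes = divmod(round(total_minutes), 60)
    return f"{hours} hr {minutes} min"


print(to_hours_minutes(80))    # a range of 80 minutes
print(to_hours_minutes(2274))  # a mean of 2274 minutes
```

For example, 80 minutes becomes 1 hr 20 min, and 2274 minutes becomes 37 hr 54 min.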

These figures were then used to decide which of the teams is more likely to be the overall winner, therefore producing the largest cash return for their backers.

Figure 1 - Results Table



Figure 2 - Standard Deviation Calculation for Team Tobler


Figure 3 - Standard Deviation Calculation for Team Astana




Results

The range for Team Astana was 1 hour and 20 minutes. For Team Tobler, it was 30 minutes. Range is a measure of variability that gives the difference between the highest and lowest values in a data set. In this case, the difference in finishing times between Team Astana’s fastest racer and their slowest was 1 hour and 20 minutes, while the difference between Team Tobler’s fastest and slowest riders’ finishing times was only 30 minutes. This suggests that while Team Astana has one very fast racer, the rest of their team do not all match this speed, whereas the racers in Team Tobler are more equal in pace. As it is the overall winning team rather than the individual winner that we are looking to back, this statistic would suggest that backing Team Tobler makes more sense, as it is less likely that one of their racers will lag behind the rest, causing them to lose overall.

Next is the mean. This is a measure of central tendency that tells us the average of the observations within a data set, in this case the average finishing time of each team. Team Astana has a mean finishing time of 37 hours and 54 minutes, while Team Tobler has a mean of 38 hours and 6 minutes. Initially, this would suggest that it would be better to back Team Astana, as their riders on average take less time to finish a race. However, the mean can be heavily influenced by outliers, and as we know Team Astana has a star rider and a larger range between the star and their slowest team member, we can infer that their mean has been affected by this outlier. The same caution applies to Team Tobler’s mean. Therefore, the mean here is useful only when it is considered alongside our other statistics.

The median was calculated next. This is another measure of central tendency that works by ranking the observations from highest to lowest or vice versa, and taking the middle value to be the median. In a data set with an even number of observations, the median is the average of the two middle observations. Team Astana’s median is 38 hours, while Team Tobler’s median is 38 hours and 12 minutes. This slight difference makes it seem that backing Team Astana would be better. However, it is important to note that the median doesn’t tell us anything about the times of the rest of the riders, such as how much faster or slower they were compared to this middle value, so it is not particularly helpful if we are interested in the overall winners of the race.

The mode is our next statistic. This is another measure of central tendency and tells us the observation that occurs most frequently within our data set. For both Team Astana and Team Tobler, the mode is the same as the median: 38 hours and 38 hours 12 minutes respectively. While this is interesting, it does not tell us much about which team is better to back, as it reflects only a few results, while we are looking to figure out which team is more likely to win overall.

Kurtosis is another statistic available to make our judgment with. This measures peakedness in a data set, as it would be displayed in a histogram. The result indicates how peaked or flat the distribution of a data set is relative to a normal distribution. A negative kurtosis indicates that the distribution of data is relatively flat, which is known as platykurtic. If the distribution is peaked, the kurtosis is positive, which is known as leptokurtic. A normal distribution is known as mesokurtic. As Excel has been used to calculate the kurtosis here, the program automatically subtracts three from the kurtosis, so a normal distribution scores 0. In this case, a result greater than 1 indicates a clearly positive, leptokurtic distribution, and anything less than -1 indicates a clearly negative, platykurtic distribution. Here, Team Astana’s kurtosis is 1.168, while Team Tobler’s is 2.927. Both values are positive, so both distributions are leptokurtic, but Team Tobler’s is more peaked than Team Astana’s, meaning more of Team Tobler’s riders are close to their team’s mean time. However, as neither value is 0 (mesokurtic), both teams still have some riders who are further away from the mean, which could affect their chances of winning overall. On this measure, as the distribution of times for Team Astana is flatter, it would suggest that Team Tobler is more likely to win.

Next is the skewness, which measures how symmetrical the distribution in a data set is. It tells us whether the observations are spread evenly on either side of the mean. A result of 0 indicates that there is no skewness, while anything greater than 0 is a positive skewness, which can also be seen if the mean is greater than the median. A result of less than 0 indicates a negatively skewed data set, in which the mean is less than the median. For Team Astana, the skewness is -0.003. For Team Tobler, it is -1.563. Both results indicate a negative skewness, although Team Astana’s is so close to 0 that their distribution is almost symmetrical, while Team Tobler’s is clearly asymmetrical. A negative skew means the distribution has a longer tail toward lower values, which here are faster times, so for both teams more results are slower than the mean time than faster, and this effect is much stronger for Team Tobler. As Team Astana’s results are more symmetrical about the mean than Team Tobler’s, this suggests that it would be better to back them, especially as their mean time is faster than Team Tobler’s by twelve minutes.
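To reproduce skewness and kurtosis figures outside Excel, the sketch below implements the sample-corrected formulas that, to my understanding, Excel’s SKEW and KURT functions use. That correspondence is an assumption on my part, the function names are hypothetical, and the example data are made up rather than taken from the lab.

```python
import math


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


def excel_style_skew(values):
    """Sample-corrected skewness; 0 means symmetrical. Needs n >= 3."""
    n = len(values)
    mean = sum(values) / n
    s = sample_sd(values)
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


def excel_style_kurt(values):
    """Sample-corrected excess kurtosis; 0 means mesokurtic. Needs n >= 4."""
    n = len(values)
    mean = sum(values) / n
    s = sample_sd(values)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


# Hypothetical data with a long tail toward low values (negative skew).
left_tailed = [1, 8, 9, 9, 10]
print(excel_style_skew(left_tailed))
print(excel_style_kurt(left_tailed))
```

A perfectly symmetrical list such as `[1, 2, 3, 4, 5]` gives a skewness of 0 and a negative (platykurtic) kurtosis with these formulas.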

Finally, we have the standard deviation. This is a measure of dispersion that illustrates how clustered the observations are around the mean of a data set. For normally distributed data, approximately 68% of all observations will be within one standard deviation on either side of the mean, about 95% will fall within two standard deviations of the mean, and about 99.7% will fall within three standard deviations of the mean. For Team Astana, the standard deviation is 16.63 minutes, so about 68% of the racers’ times would be expected to fall within 16.63 minutes of the mean. For Team Tobler, the standard deviation is 7.62 minutes. As the standard deviation for Team Tobler is smaller, their times are more tightly clustered, and the time by which almost all of their racers would be expected to have crossed the finish line is earlier than the equivalent time for Team Astana, which would suggest that they are more likely to win.
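The comparison can be checked with a short worked calculation. The sketch below uses the means and standard deviations reported above, converted to minutes (37 hours 54 minutes is 2274 minutes, and 38 hours 6 minutes is 2286 minutes). It assumes the times are roughly normally distributed, which is only an approximation with 15 riders per team; the `sd_band` helper is a hypothetical name.

```python
def sd_band(mean_minutes, sd_minutes, k):
    """Return (lower, upper) bounds of mean +/- k standard deviations."""
    return (mean_minutes - k * sd_minutes, mean_minutes + k * sd_minutes)


# (mean in minutes, population standard deviation in minutes) from the results.
teams = {
    "Team Astana": (2274, 16.63),
    "Team Tobler": (2286, 7.62),
}

for name, (mean, sd) in teams.items():
    for k in (1, 2, 3):
        low, high = sd_band(mean, sd, k)
        print(f"{name}: mean +/- {k} SD = {low:.2f} to {high:.2f} minutes")
```

Under these assumptions, the upper end of Team Tobler’s three-standard-deviation band is about 2308.86 minutes (roughly 38 hours 29 minutes), which is earlier than Team Astana’s at about 2323.89 minutes (roughly 38 hours 44 minutes).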

Overall, it would appear from the above statistics that Team Tobler is more likely to win the race. The most useful statistics in predicting this were the range and standard deviation, as we are looking for the overall winners and not the individual winner. While the skewness and kurtosis values are interesting, they are diagnostics which would be better suited to testing larger data sets to make sure that there are no errors in collection. The standard deviation tells us that almost all of Team Tobler’s racers are likely to complete the race within a shorter amount of time, while the range tells us that the difference between their fastest and slowest riders is less than that of Team Astana, making them more likely to win overall as their outliers are less significant.


Figure 4 - Results with Range, Mean, Median, Mode and Standard Deviation expressed in hours and minutes


Part 2

Goal
The goal of part two of the assignment was to use population data for Wisconsin counties from the years 2010 and 2015 to calculate the weighted mean centers of population.

Method
First, a shapefile of Wisconsin counties was downloaded from (the census website). The 2010 and 2015 population data from an Excel spreadsheet were then joined to it. Once the data had been joined, the layer was exported to create a new shapefile. This step allows analysis tools, such as the Mean Center tool, to run successfully on the joined data.
Next, the Mean Center tool was used to find the geographical mean center of Wisconsin, which is located in Wood County. After this, the tool was run two further times, once weighted by the 2010 population data and once weighted by the 2015 population data.
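Conceptually, a mean center is the average of the x and y coordinates of the features, and a weighted mean center multiplies each coordinate by its weight (here, county population) before averaging. The sketch below illustrates that calculation only; it is not the GIS tool’s own code, and the coordinates and populations are invented for illustration.

```python
def weighted_mean_center(points, weights=None):
    """Return the (x, y) mean center of points, optionally weighted.

    points  -- list of (x, y) coordinate pairs, e.g. county centroids
    weights -- list of weights such as populations; equal weights if omitted
    """
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return (x, y)


# Hypothetical county centroids and populations (NOT the Wisconsin data).
centroids = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]
populations = [10_000, 30_000, 60_000]

print(weighted_mean_center(centroids))               # unweighted mean center
print(weighted_mean_center(centroids, populations))  # pulled toward large counties
```

The weighted result is pulled toward the most populous points, which is why a population-weighted center can sit well away from the purely geographical one.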

Results
The first mean center created shows only the geographical mean center of Wisconsin, which is in Wood County. Using the tool in this way simply pinpoints the center point of each county, then finds the central point in the state by averaging the latitude and longitude of these points. Here, central tendency is measured spatially. It does not tell us much about the state in this context, but would be a useful tool when applied to other problems, such as looking for the origin location in a crime pattern.
On the other hand, the mean centers weighted by the population of each county for 2010 and 2015 are useful in identifying patterns in population distribution. We can see from the map that for both 2010 and 2015 the weighted mean centers of population were located in Green Lake County, southeast of the unweighted mean center. The weighted mean center has not changed significantly in the five years between data sets, which could be due to the financial crash in 2008 making it more difficult for people to buy or sell homes, resulting in less movement in population. Its position could also reflect a strong job market in the southeast of the state: the weighted mean center is pulled toward Madison and Milwaukee, where the population in the surrounding counties will be larger due to commuters living in the suburbs. There is, however, a slight shift west in the weighted mean center for the 2015 population, which could be due to expanding population in the west of the state, where it borders Minnesota within commuting distance of the Twin Cities.

Figure 5 - Map showing the geographic mean center of Wisconsin, along with mean centers weighted by 2010 and 2015 population data

Sources
US Census Bureau
Rogerson, P. A. 2015. Statistical Methods for Geography. Sage Publications: London