Home page for Jason Thompson's Blog

Jason Thompson's Blog

Stats Final Project: Deck of Cards

This is my final project for Udacity’s Introduction to Descriptive Statistics course, but with a twist.

The assignment was to start with a deck of regular playing cards and take at least 30 samples of three cards each from the deck, record the cards drawn and return the cards to the deck after each sample.

I’ve also been taking the Data Analyst with Python Career Track at DataCamp, so I thought I’d use Python with the Pandas library for this project. After all, who wants to spend all that time manually taking and recording 30 samples when Python can do all this in an instant?

What follows, then, is a report on my statistical findings for this assignment as well as a description of how I achieved the results using Python with the Pandas library.

Deck of Cards

There are many ways to simulate sampling from a deck of cards in Python, but I wanted to practice using Pandas, so my solution used Pandas’ DataFrames to model the deck of cards and the samples I took. I quickly threw together a spreadsheet with columns for suit, card and value with each row representing a single card. Here I could have built the deck in Python, but it’s easy to import a CSV file into Pandas using the from_csv() method of DataFrame (since removed from Pandas in favour of pd.read_csv()). With that done, it was simple to create a histogram of the deck as per the project’s instructions.

import matplotlib.pyplot as plt
import pandas as pd

# Create deck of cards from spreadsheet. DataFrame.from_csv() has been
# removed from Pandas; pd.read_csv() is the equivalent call here.
deck = pd.read_csv("card_deck.csv")

# Re-order columns (of course I could have ensured that the
# spreadsheet columns were in the correct order, but it's not
# uncommon to have to fix your data with Pandas, so why not
# practice?)
cols = ['Card', 'Suit', 'Value']
deck = deck[cols]

# Deck statistics
deck_std = deck['Value'].std()
deck_mean = deck['Value'].mean()

# Create the histogram of values in the deck
deck_hist = deck.plot(kind='hist', grid=True, legend=False)
plt.xlabel('Card Value (Face cards = 10 pts)')
plt.title('Distribution of Values in Deck of Cards')

# charts_dir is the output directory string, defined elsewhere in the script
plt.savefig(charts_dir + 'deck.svg')

Distribution of values for deck of cards

Cards are valued from one (Ace) to ten, with face cards also taking a value of ten, so there’s nothing surprising about this histogram. You can see that there are four cards for each value (one for each suit) except for ten, which has sixteen cards (one ten and three face cards for each of the four suits). Although the assignment didn’t ask for it, I included the deck mean and standard deviation. Knowing the deck mean will serve as a sanity check when looking at the summary statistics later, and I’ll need the standard deviation to calculate the summary statistics’ standard error. You can see the code for calculating these two values under the “Deck statistics” comment in the code sample above.
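The spreadsheet step could also be skipped by building the deck directly in Pandas. What follows is only a minimal sketch of that alternative: the suit and card labels are assumptions, since the exact labels used in card_deck.csv aren't shown in this post, and build_deck() is a hypothetical helper name.

```python
import pandas as pd

# Assumed labels; the spreadsheet used in this post may name them differently
SUITS = ['Clubs', 'Diamonds', 'Hearts', 'Spades']
# Ace counts as 1, number cards count at face value, face cards count as 10
CARD_VALUES = {'Ace': 1, **{str(n): n for n in range(2, 11)},
               'Jack': 10, 'Queen': 10, 'King': 10}


def build_deck():
    """Return a 52-row DataFrame with Card, Suit and Value columns."""
    rows = [{'Card': card, 'Suit': suit, 'Value': value}
            for suit in SUITS
            for card, value in CARD_VALUES.items()]
    return pd.DataFrame(rows, columns=['Card', 'Suit', 'Value'])


deck = build_deck()
print(len(deck))                                  # number of cards in the deck
print(deck['Value'].value_counts().sort_index())  # how many cards hold each value
print(round(deck['Value'].mean(), 2))             # deck mean
```

The value counts give a quick check against the histogram: four cards for each value from one to nine, and sixteen cards worth ten.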

Random samples of three cards

Turns out taking a random sample of cards is as easy as using the sample() method on the deck data structure (see the deck.sample(3) call below).

def run_draws(deck, num_draws):
    """Draw three cards from the deck num_draws times.

    Returns a DataFrame with one row per draw holding the sum and
    mean of the card values in that draw.
    """
    draw_summaries = []

    for item in range(num_draws):
        # sample() picks three different cards; the deck itself is untouched,
        # so the cards are effectively returned after each draw
        draw = deck.sample(3)
        draw_number = item + 1

        # Create summary of draw
        value_sum = draw['Value'].sum()
        value_mean = draw['Value'].mean()
        draw_summaries.append({'Draw Number': draw_number, 'Sum': value_sum,
                               'Mean': value_mean})

    # DataFrame.append() has been removed from Pandas, so build the frame once
    samples_summary = pd.DataFrame(draw_summaries,
                                   columns=['Draw Number', 'Sum', 'Mean'])

    # saved_draws is the path of the results file, defined elsewhere
    samples_summary.to_csv(saved_draws)
    return samples_summary

You get a different result each time you run the program since the sample() method on DataFrame draws a random sample of a given size each time it’s called. This behaviour is part of the design, but in order to write up this report, I needed a set of data that didn’t change every time I ran the program. So at the end of the run_draws() function seen above, I save the results to a file and then, in the following code, I call run_draws() only if the file isn’t yet present. Otherwise I load the saved file with read_csv(), which replaces the since-removed from_csv().

# run sample_count samples
# (saved_draws is a path object for the results file, defined elsewhere)
if saved_draws.is_file():
    # index_col=0 restores the index column written by to_csv()
    samples_summary = pd.read_csv(saved_draws, index_col=0)
else:
    samples_summary = run_draws(deck, sample_count)
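Saving the draws to a file is one way to freeze the data. Another option, sketched here as an assumption rather than as what this project does, is to seed the random number generator so that sample() returns the same cards on every run. The function name run_seeded_draws() and the stand-in deck are hypothetical; the random_state parameter of DataFrame.sample() is part of Pandas.

```python
import numpy as np
import pandas as pd

# Stand-in deck of 52 values (Ace = 1, face cards = 10); the real script
# loads its deck from card_deck.csv instead.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10] * 4
deck = pd.DataFrame({'Value': values})


def run_seeded_draws(deck, num_draws, seed):
    """Hypothetical variant of run_draws() that is repeatable for a given seed."""
    rng = np.random.RandomState(seed)  # one generator shared by all draws
    rows = []
    for draw_number in range(1, num_draws + 1):
        draw = deck.sample(3, random_state=rng)
        rows.append({'Draw Number': draw_number,
                     'Sum': draw['Value'].sum(),
                     'Mean': draw['Value'].mean()})
    return pd.DataFrame(rows, columns=['Draw Number', 'Sum', 'Mean'])


first = run_seeded_draws(deck, 100, seed=42)
second = run_seeded_draws(deck, 100, seed=42)
print(first.equals(second))  # True: same seed, same draws
```

The trade-off is that a seed only reproduces the data while the code and library versions stay the same, whereas a saved CSV file keeps the exact numbers the report was written from.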

It’s worth noting here that the value of sample_count is 100. The assignment asked for at least 30 samples, and with the computer doing all the grunt work, why not take 100?

Here are the first few rows of the resulting samples_summary data structure. I’ve omitted the first column, which is a unique index for each row.

| Draw Number | Sum | Mean     |
|-------------|-----|----------|
| 1           | 19  | 6.333333 |
| 2           | 27  | 9.000000 |
| 3           | 16  | 5.333333 |
| 4           | 16  | 5.333333 |
| 5           | 25  | 8.333333 |

Next, I combined the deck histogram with the sample sums histogram to make comparisons between them easier.

Distribution of values for sample sums of three cards each

Unsurprisingly, the sample sums histogram looks different from the histogram for the deck of cards. The sample sums histogram has two samples of ten and one of five, so there is very little overlap between the two histograms. As seen below in the Descriptive Statistics section, because tens are so prevalent in the pack, the mean and median of the sums are near 20.

Descriptive Statistics

I calculated descriptive statistics for these samples using, as per the assignment, at least two measures of central tendency and two measures of variance. These calculations were simple to perform:

import math

# Deck sample sums statistics
sums_mean = samples_summary['Sum'].mean()
sums_median = samples_summary['Sum'].median()
means_mean = samples_summary['Mean'].mean()
sums_q1 = samples_summary['Sum'].quantile(0.25)
sums_q3 = samples_summary['Sum'].quantile(0.75)
sums_iqr = sums_q3 - sums_q1
sums_std = samples_summary['Sum'].std(ddof=1)
# Deck standard deviation divided by the square root of the sample size (3);
# strictly speaking this is the standard error of a three-card sample mean
sums_ste = (deck_std / math.sqrt(3))
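The expression deck_std / math.sqrt(3) is the textbook standard error of a sample mean. Two refinements are worth sketching, with the caveat that they are additions here and not part of the assignment: the deck is the whole population, so its standard deviation divides by N and not N - 1, and the three cards in a draw are taken without replacement, which calls for a finite population correction. The sketch below uses only the standard library and a stand-in list of card values.

```python
import math

# Card values for a full deck: Ace = 1, face cards = 10
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10] * 4

N = len(values)  # population size (52 cards)
n = 3            # cards per sample

mu = sum(values) / N
# Population standard deviation (divide by N, not N - 1)
sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / N)

# Finite population correction for sampling without replacement
fpc = math.sqrt((N - n) / (N - 1))

se_mean = sigma / math.sqrt(n) * fpc  # expected spread of the sample means
sd_sum = sigma * math.sqrt(n) * fpc   # expected spread of the sample sums

print(round(mu, 2), round(sigma, 2))
print(round(se_mean, 2), round(sd_sum, 2))
```

With only three cards out of 52 the correction is small, so the simpler formula is a reasonable approximation. Note that the spread of the sums is three times the spread of the means, since each sum is three times its mean.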

Measures of central tendency

Note that I have taken the mean of sample means. This number should be approximately the same as the population mean (the deck mean), which was 6.54. So, with a mean of sample means of 6.58, we’re within four hundredths. Good enough for government work, as they say.
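The claim that the mean of sample means should land near the deck mean can be checked exactly, because a deck is small enough to enumerate every possible three-card hand. This is a sketch using a stand-in list of card values, not the project's DataFrame.

```python
from itertools import combinations

# Card values for a full deck: Ace = 1, face cards = 10
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10] * 4
deck_mean = sum(values) / len(values)

# Every possible three-card hand, ignoring order: C(52, 3) = 22,100 hands
hand_means = [sum(hand) / 3 for hand in combinations(values, 3)]
mean_of_means = sum(hand_means) / len(hand_means)

print(len(hand_means))
print(round(deck_mean, 2))
print(round(mean_of_means, 2))
```

Averaged over all possible hands, the sample mean equals the deck mean, so any gap seen in 100 random draws is down to sampling variation alone.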

Measures of variance

Making Estimates

Finally, we were asked to make a couple of estimates about future draws.

Between which two values will you find 90% of the values?

For my first stab at answering this question, I used the z-table provided by the course to find the z-scores at the 5th and 95th percentiles and used those z-scores to determine that 90% of values lie between 10.92 and 28.61.
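The z-table lookup can also be done in code. The sketch below uses NormalDist from Python's standard library (available from Python 3.8); central_interval() is a hypothetical helper name, and the mean and standard deviation passed to it are illustrative placeholders, not the values from the samples in this post.

```python
from statistics import NormalDist


def central_interval(mean, sd, coverage=0.90):
    """Return the (lower, upper) bounds holding `coverage` of a normal distribution."""
    tail = (1 - coverage) / 2
    z_low = NormalDist().inv_cdf(tail)       # about -1.645 for 90% coverage
    z_high = NormalDist().inv_cdf(1 - tail)  # about +1.645 for 90% coverage
    return mean + z_low * sd, mean + z_high * sd


# Placeholder inputs for illustration only
lower, upper = central_interval(mean=20.0, sd=5.0)
print(round(lower, 2), round(upper, 2))
```

Because inv_cdf() isn't limited to two decimal places, this removes the rounding that comes with a printed z-table. It still assumes the sums are normally distributed.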

Another way of arriving at these values is to use DataFrame’s quantile() method as follows:

sums_lq = samples_summary['Sum'].quantile(0.05)
sums_hq = samples_summary['Sum'].quantile(0.95)

Using the quantile() method puts 90% of the values between 11.00 and 29.00. These are fairly close to those arrived at with the z-table, but not exact matches. The z-table I was using was only accurate to two decimal places, so that might account for some of the difference. I’m not sure what kind of rounding the quantile() method uses, so perhaps I’ll explore the source code for that answer.
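One likely source of the gap is that the two methods answer slightly different questions: the z-table assumes the sums follow a normal curve, while quantile() reads the percentiles off the observed data. By default quantile() doesn't round. It interpolates linearly between the two nearest observations, and its interpolation parameter selects other rules. A small sketch with made-up numbers shows the difference:

```python
import pandas as pd

# Made-up values, chosen only to make the interpolation easy to follow
sums = pd.Series([10, 11, 12, 30])

# Default rule: linear interpolation between the two nearest observations
print(sums.quantile(0.05))
# Rules that always return a value actually present in the data
print(sums.quantile(0.05, interpolation='lower'))
print(sums.quantile(0.05, interpolation='nearest'))
```

With four values, the 5th percentile sits 15% of the way from the first value to the second, which is why the default result falls between 10 and 11.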

What is the approximate chance of obtaining a value of at least 20?

The z-score of 20 would be 0.048, so the z-table puts about 52% of values below 20. That leaves about a 48% chance of drawing three cards that add up to at least 20.
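A z-table gives the area to the left of a z-score, so the chance of a value of at least 20 is the complement of the table entry. This can be checked with NormalDist from the standard library, using the z-score of 0.048 quoted above:

```python
from statistics import NormalDist

z = 0.048
below = NormalDist().cdf(z)  # area to the left of z (what a z-table lists)
at_least = 1 - below         # area to the right of z

print(round(below, 2), round(at_least, 2))  # 0.52 0.48
```

This treats the sums as continuous and normally distributed, so it is only an approximation for whole-number card sums.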

Final thoughts

As a side note, I publish this blog using Markdown, so I used the pystache templating library to output my stats as follows:

import pystache

# sums_90_lower and sums_90_upper hold the z-table estimates, calculated elsewhere
output_data = {'deck_mean': deck_mean, 'deck_std': deck_std,
               'sums_mean': sums_mean, 'sums_median': sums_median,
               'sums_iqr': sums_iqr, 'sums_std': sums_std,
               'sums_ste': sums_ste, 'means_mean': means_mean,
               'sums_90_lower': sums_90_lower, 'sums_90_upper': sums_90_upper,
               'sums_lq': sums_lq, 'sums_hq': sums_hq}

# Round the floats in output data to 2 decimals
output_data = {k: '{0:.2f}'.format(v) for k, v in output_data.items()}

# charts_dir is not a number, so it is added after the formatting
# operation above, which would otherwise crash on it.
output_data['charts_dir'] = charts_dir

output = '''

## Deck of Cards

<figure><img src="{{charts_dir}}deck.svg"></figure>

Mean of card values: {{deck_mean}}

Standard deviation of card values: {{deck_std}}

## Random samples of three cards

<figure><img src="{{charts_dir}}sums.svg"></figure>

...(rest of the text not included)
'''

renderer = pystache.Renderer()
print(renderer.render(output, output_data))

With the print statement on the last line, I printed out the rendered skeleton of the report to the console so I could use Bash’s I/O redirection to save it to a file. This output template could have been saved as a separate file and if I were developing a reporting system, I’d write some reusable code to allow this, but this quick and dirty approach worked for my current purposes.

There are likely other approaches to creating reports with Pandas data and plots and I’ll be exploring them as I continue my studies. As you can see above, I posted the plots as SVG images. It would be worth exploring JavaScript libraries for future web-based projects. And for a PDF report, I’d take yet a different approach.

One other thing I would have liked to do is mark individual statistics such as the mean directly on the histograms. A quick search shows this is possible, so I’ll look into it in my next project.
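One way this might be done, sketched here as an assumption and not as the approach eventually taken, is matplotlib's axvline(), which draws a vertical line across the axes at a given x value. The deck below is a stand-in list of card values, and the line colour and style are arbitrary choices.

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in deck of 52 values (Ace = 1, face cards = 10)
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10] * 4
deck = pd.DataFrame({'Value': values})
deck_mean = deck['Value'].mean()

ax = deck['Value'].plot(kind='hist', grid=True)
# Dashed vertical line at the mean, labelled so it appears in the legend
ax.axvline(deck_mean, color='red', linestyle='--',
           label='Mean = {0:.2f}'.format(deck_mean))
ax.set_xlabel('Card Value')
ax.set_title('Distribution of Values in Deck of Cards')
ax.legend()
```

Calling plt.savefig() afterwards would write the annotated chart to a file in the same way as the other plots in this post. The median or quartiles could be marked with further axvline() calls.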

I’m continuing on with Udacity’s free stats courses. I just started their Introduction to Inferential Statistics. As for DataCamp, I’ll also continue with their Python curriculum. One downside of their course offerings is that they don’t assign projects, but I’ll continue to use what I’m learning there to do new data projects.

What Does Illness Look Like? Part 2: Energy Envelope

This is a multipart series on my experiences with chronic fatigue syndrome/ fibromyalgia. (Update: I’ve since been diagnosed with fibromyalgia.) It’s partly a way of telling my friends what’s going on and partly a way of organizing my thinking on the subject. I hope that people suffering from similar illnesses can gain something from these posts, even if it’s just a matter of feeling a little less isolated.

“Of the two causes, overexertion is the one I can control, so in the next part of this series, I’ll begin by describing my energy envelope and what that means for my day to day life.”

Close to eight months ago I ended the first essay in this series on the optimistic note that I’d be back in a week or two with the second essay. Learning to live within your daily allotment of energy shouldn’t be that hard, I reasoned.

I didn’t count on how hard it would be to figure out my daily allotment of energy in the first place. There isn’t a hack for figuring out how much energy you have, so you’ve got to do it the hard way: trial and error. But since I suffer from chronic fatigue, an error in interpreting my energy level can be costly. A healthy individual has some leeway in this regard. For example, when a healthy person overexerts herself one day, she can recover by resting the next. When I overdo it, my symptoms will flare up, sometimes for days.

Indeed, over the summer I got into a cycle of overexerting myself, suffering the fallout and then overexerting myself again to get beyond the fallout. That’s what we’re taught, isn’t it? Just push a little harder and you’ll make it. Look at some of the older posts on this blog and you’ll see several pieces on long distance cycling. Tired after riding 100 km? Just push yourself a little harder. Make it 150 km! And your next 100 km ride is that much easier. Pushing yourself beyond your limits is ingrained in our culture and for exercise it tends to work well if practiced judiciously. But it’s not the right way to tackle fibromyalgia.

I have to work within the energy limitations I have on a given day. But how do I know what my limitations are? There’s no easy answer. The signs that I’m getting tired are hard to pick up. Take writing this essay. Can I write a few more sentences? To answer that question, I have to make a conscious effort to check in. Sometimes the signs are subtle. I’m in the flow of writing this piece right now, so I’m not paying attention to my body. I pause and note that my fingers are starting to get stiff. And I’m having more and more trouble keeping what I want to say straight in my mind. I’m forgetting words. But I love the feeling of flow I get from writing or making music or engaging in other creative activities, so it’s hard to stop and rest, even when the signs become obvious.

I’ve heard that it can take years to learn how to manage fibromyalgia and I’m learning firsthand why. For now, at least, I’m going to listen to what my body is telling me and stop writing. Will there be a third essay in this series? Probably. But I’m not going to encumber myself with a due date or a topic. In this way I’m slowly learning how to be easy on myself.

Maya Deren

Image of Maya Deren from her film, 'Meshes of the Afternoon'

After recommending Maya Deren’s short film, Meshes of the Afternoon, to a friend of mine, I took the opportunity to re-watch it this morning. I found it really influential as a film student and it still stands up today.

Meshes is probably her most accessible film, but all of them are great. Her later films, such as Ritual in Transfigured Time, are more abstract and more concerned with choreography (of the camera and of the actors/dancers).

I could go on and on about these films (I wrote an essay or two on the subject almost 20 years ago), but I’ll leave the work of analysis to you. Enjoy.

What Does Illness Look Like? Part 1

This is a multipart series on my experiences with chronic fatigue syndrome/ fibromyalgia. (Update: I’ve since been diagnosed with fibromyalgia.) It’s partly a way of telling my friends what’s going on and partly a way of organizing my thinking on the subject. I hope that people suffering from similar illnesses can gain something from these posts, even if it’s just a matter of feeling a little less isolated.

Before getting sick, I used to wake up at 5 a.m. for what I jokingly called my 20 percent time. As many of you know, Google popularized this notion of 20 percent time by letting employees spend Fridays working on their own personal projects. Waking up at 5 a.m. gave me time to work on my creative and computer programming projects before the rest of the family woke up at 7. In the same way that Google made some of these personal projects into products, I hoped that my personal work would eventually lead to new sources of income.

This worked out really well for me. During that time, I moved a lot of projects forward and, based on the programming work that I did, I got some job interviews as a programmer. But last summer I started to get sick with what I now know is chronic fatigue syndrome/fibromyalgia (CFS/FM) [1]. With fatigue as one of my primary symptoms, waking up at 5 a.m. became difficult. Sure, I could wake up that early, but my productivity would be low to say the least and I would suffer, both later on that day and in the coming days.

If I set August 1 as the day I got sick [2], then it has taken 10 months to arrive at the point where my doctor and I are pretty sure I’ve got CFS/FM. By mid-November, what had started as weekly episodes of a day or two had progressed to the point of my being off work every day. I don’t want to get bogged down in the details of CFS/FM, but if you’re interested in learning more, this is the best guide I’ve found. Here’s a rundown of some of my symptoms:

There are other symptoms that come and go and they shift over time; what was dominant last week might not be a factor this week. And there are episodes, lasting anywhere from a day to a week, when my symptoms are much worse. But I’m still symptomatic on my good days, so much so that I’m unable to work. These episodes have two causes that I’ve managed to identify: rapid weather change and overexertion.

Of the two causes, overexertion is the one I can control, so in the next part of this series, I’ll begin by describing my energy envelope and what that means for my day to day life [3].

[1] CFS and FM are closely related. One difference between the two is that people with FM usually have pain as a dominant symptom.

[2] It can be hard to set an exact date. As I said earlier, my doctor and I are just getting to the diagnosis stage after more than 10 months. As I start to understand more about CFS/FM, I see previous illnesses in a new light: could they be related? For example, what about that nausea and vertigo I went to the doctor about last spring? Still, insurance companies really like exact dates, so let's go with August 1.

[3] Which is actually quite convenient, because I probably have to leave the rest of this post for another day. I felt relatively energetic when I started writing this, but my body is telling me to stop and rest now.

found poetry 1

“intercepting sycalls and notifying an exterior sandbox host” Codius