exds_00: Getting Started (Exercise)
Introduction:
Now that you are familiar with some of the basics of pandas
and have learned a little bit about how to manipulate data in the form of DataFrame
’s, it is time to put your skills to the test. At this point you should be familiar with the following:
- importing
pandas
- creating a
DataFrame
, underlying structure - reading from and writing to .csv files
- observing the structure of a specific
DataFrame
- selecting data by column and by row
- some basic analysis using
numpy
Additionally, if you have completed one of the recent projects, you will also be familiar with the survey.csv
data, which details course evaluation responses from a previous semester. Below you will see some tasks, each with increasing difficulty, that test your newfound data science skills. You may complete this exercise in a new Jupyter Notebook file or in a normal Python file.
Task 1:
As before, we have to get the data into a usable form for pandas
. Load the survey.csv
data into a DataFrame
called survey
.
Task 2:
It is a best practice to gain an understanding of the sturcture of data before trying to conduct any sort of analysis. Use the .info()
method to get a look at the “big picture” of your DataFrame
.
Task 3:
Often, you will deal with data that has ambiguous variable naming conventions. While survey.csv
is not one of those datasets, thankfully, it is still essential to understand what each column is trying to tell you about students! Look at the columns labeled kaki_effective
, interested
, and oh_visits
by themselves and describe what they represent below.
Task 4:
At first glance, this seems to be quite a large dataset with many variables, so it may be helpful to only look at columns of interest. Create a new DataFrame
called survey_new
which contains only the columns row_number
,pace
,difficulty
,understanding
,interested
,valuable
,grade
,and would_recommend
. Note that there are many ways to do this, some MUCH more efficient than others. A potential use for this new data would be if we want a more streamlined look at whether perception of difficulty aligns with a student’s grade.
Task 5:
Perhaps something we are interested in is the difference in difficulty
between first-year students and returning students. Subset the data such that only first-year students are included, and store this in a new DataFrame
called first_year
. Similarly, store returning students in a DataFrame
called returning
. Then, calculate the average grade and standard deviation using numpy
for these two groups. Write a brief statement about these two quantities.
Task 6: Statistics Challenge
An essential skill for any data scientist (or any professional that uses programming) is the ability to research new techniques and read documentation.
In this challenging task, we want to analyse the difference we saw above, but more rigorously! :)
Research hypothesis tests specifically for the difference of two means. Write a brief paragraph about your findings, especially detailing the t-distribution. Then look into the scipy
.ttest_ind()
method and read the documentation. Write a brief statement about your findings. Finally, conduct the hypothesis test for the null hypothesis that the mean difficulty
for first-years is equal to that for returning students. Report your findings!