{"cells":[{"cell_type":"markdown","source":["
"],"metadata":{"id":"-pZO6_XxsZJC"}},{"cell_type":"markdown","metadata":{"id":"-TM003HVDmy9"},"source":["# Statistics"]},{"cell_type":"markdown","metadata":{"id":"fpPtZgqIvuXz"},"source":["Sources and inspiration:\n","\n","\n","* https://www.kaggle.com/code/tirendazacademy/penguin-dataset-data-visualization-with-seaborn#Penguin-Dataset:-Data-Visualization-with-Seaborn\n","* https://seaborn.pydata.org/tutorial/categorical.html\n","* https://pandas.pydata.org/docs/user_guide/visualization.html\n","* https://levelup.gitconnected.com/statistics-on-seaborn-plots-with-statannotations-2bfce0394c00\n","\n","If running this from Google Colab, uncomment the cell below and run it. Otherwise, just skip it."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"5saSBc40voZF"},"outputs":[],"source":["# !pip install seaborn\n","# !pip install scikit_posthocs\n","# !pip install watermark"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"cgfXd-tzqFqA","pycharm":{"name":"#%%\n"}},"outputs":[],"source":["import pandas as pd\n","import numpy as np\n","import seaborn as sns\n","from scipy.stats import mannwhitneyu, normaltest"]},{"cell_type":"markdown","metadata":{"id":"y2Wl1yTAY1dp"},"source":["## Introduction"]},{"cell_type":"markdown","metadata":{"id":"mKQgwkRfRA-q"},"source":["Many libraries are available in Python to clean, analyze, and plot data.\n","Python also has robust statistical packages that are used by thousands of other projects.\n","\n","We will work with the penguins dataset from seaborn."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"A6Gi_GYV-8Dm","outputId":"a7c8dbd4-5588-4d65-e530-46df563c7725"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," \n"," species\n"," island\n"," bill_length_mm\n"," bill_depth_mm\n"," flipper_length_mm\n"," body_mass_g\n"," sex\n"," \n"," \n"," \n"," \n"," 0\n"," Adelie\n"," Torgersen\n"," 39.1\n"," 18.7\n"," 181.0\n"," 3750.0\n"," Male\n"," \n"," \n"," 1\n"," Adelie\n"," 
Torgersen\n"," 39.5\n"," 17.4\n"," 186.0\n"," 3800.0\n"," Female\n"," \n"," \n"," 2\n"," Adelie\n"," Torgersen\n"," 40.3\n"," 18.0\n"," 195.0\n"," 3250.0\n"," Female\n"," \n"," \n"," 4\n"," Adelie\n"," Torgersen\n"," 36.7\n"," 19.3\n"," 193.0\n"," 3450.0\n"," Female\n"," \n"," \n"," 5\n"," Adelie\n"," Torgersen\n"," 39.3\n"," 20.6\n"," 190.0\n"," 3650.0\n"," Male\n"," \n"," \n","\n",""],"text/plain":[" species island bill_length_mm bill_depth_mm flipper_length_mm \\\n","0 Adelie Torgersen 39.1 18.7 181.0 \n","1 Adelie Torgersen 39.5 17.4 186.0 \n","2 Adelie Torgersen 40.3 18.0 195.0 \n","4 Adelie Torgersen 36.7 19.3 193.0 \n","5 Adelie Torgersen 39.3 20.6 190.0 \n","\n"," body_mass_g sex \n","0 3750.0 Male \n","1 3800.0 Female \n","2 3250.0 Female \n","4 3450.0 Female \n","5 3650.0 Male "]},"execution_count":3,"metadata":{},"output_type":"execute_result"}],"source":["penguins = sns.load_dataset(\"penguins\")\n","penguins_cleaned = penguins.dropna()\n","penguins_cleaned.head()"]},{"cell_type":"markdown","metadata":{"id":"1nof0GrHSlKH"},"source":["### Exploring the Data\n","\n","We already prepared the penguin dataset with `penguins_cleaned = penguins.dropna()`, but we should double check."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"t1uqIFnqsLe-","outputId":"4019d541-af4d-4095-bcd0-a953a4df6425"},"outputs":[{"data":{"text/plain":["species 0\n","island 0\n","bill_length_mm 0\n","bill_depth_mm 0\n","flipper_length_mm 0\n","body_mass_g 0\n","sex 0\n","dtype: int64"]},"execution_count":4,"metadata":{},"output_type":"execute_result"}],"source":["penguins_cleaned.isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"id":"fGN_N1SwTLTQ","outputId":"70640773-5854-43cd-d097-46de28371b4c"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," \n"," species\n"," island\n"," bill_length_mm\n"," bill_depth_mm\n"," 
flipper_length_mm\n"," body_mass_g\n"," sex\n"," \n"," \n"," \n"," \n"," 0\n"," Adelie\n"," Torgersen\n"," 39.1\n"," 18.7\n"," 181.0\n"," 3750.0\n"," Male\n"," \n"," \n"," 1\n"," Adelie\n"," Torgersen\n"," 39.5\n"," 17.4\n"," 186.0\n"," 3800.0\n"," Female\n"," \n"," \n"," 2\n"," Adelie\n"," Torgersen\n"," 40.3\n"," 18.0\n"," 195.0\n"," 3250.0\n"," Female\n"," \n"," \n"," 4\n"," Adelie\n"," Torgersen\n"," 36.7\n"," 19.3\n"," 193.0\n"," 3450.0\n"," Female\n"," \n"," \n"," 5\n"," Adelie\n"," Torgersen\n"," 39.3\n"," 20.6\n"," 190.0\n"," 3650.0\n"," Male\n"," \n"," \n","\n",""],"text/plain":[" species island bill_length_mm bill_depth_mm flipper_length_mm \\\n","0 Adelie Torgersen 39.1 18.7 181.0 \n","1 Adelie Torgersen 39.5 17.4 186.0 \n","2 Adelie Torgersen 40.3 18.0 195.0 \n","4 Adelie Torgersen 36.7 19.3 193.0 \n","5 Adelie Torgersen 39.3 20.6 190.0 \n","\n"," body_mass_g sex \n","0 3750.0 Male \n","1 3800.0 Female \n","2 3250.0 Female \n","4 3450.0 Female \n","5 3650.0 Male "]},"execution_count":5,"metadata":{},"output_type":"execute_result"}],"source":["penguins_cleaned.head()"]},{"cell_type":"markdown","metadata":{"id":"3-gpNBPPsWqw"},"source":["There are 3 species of penguins. We can access their names by applying the `.unique()` method on the 'species' column. 
It returns the unique values in that column."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"s7OjfV-oTt94","outputId":"02661aa4-c893-45e8-d537-22c2b090572e"},"outputs":[{"data":{"text/plain":["array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)"]},"execution_count":7,"metadata":{},"output_type":"execute_result"}],"source":["penguins_cleaned['species'].unique()"]},{"cell_type":"markdown","metadata":{"id":"vdc1o_nRTcdd"},"source":["We can get average measurements of a property split by categories with the `.pivot_table()` method."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":143},"id":"I8JtXQ6cT_2g","outputId":"674fa87d-408f-49ca-8315-ebf40cc2b6bf"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," species\n"," Adelie\n"," Chinstrap\n"," Gentoo\n"," \n"," \n"," sex\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," Female\n"," 37.257534\n"," 46.573529\n"," 45.563793\n"," \n"," \n"," Male\n"," 40.390411\n"," 51.094118\n"," 49.473770\n"," \n"," \n","\n",""],"text/plain":["species Adelie Chinstrap Gentoo\n","sex \n","Female 37.257534 46.573529 45.563793\n","Male 40.390411 51.094118 49.473770"]},"execution_count":8,"metadata":{},"output_type":"execute_result"}],"source":["penguins_cleaned.pivot_table('bill_length_mm', index='sex', columns='species', aggfunc='mean')"]},{"cell_type":"markdown","metadata":{"id":"7bOz4KTfjf6R"},"source":["### Splitting data into sub-groups\n","\n","Furthermore, we can prepare a subset for each penguin species by indexing the dataframe with a different boolean mask."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"jKu-kNKvVEOd"},"outputs":[],"source":["Adelie_values = penguins_cleaned[penguins_cleaned['species']=='Adelie']\n","Chinstrap_values = penguins_cleaned[penguins_cleaned['species']=='Chinstrap']\n","Gentoo_values = 
penguins_cleaned[penguins_cleaned['species']=='Gentoo']"]},{"cell_type":"markdown","metadata":{"id":"O9vw7B9hlYke"},"source":["Let's explore the values for each species."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":300},"id":"D7-b8W3YVZ4F","outputId":"fbb4ae50-dc60-4531-d117-0d85fcb7a397"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," \n"," bill_length_mm\n"," bill_depth_mm\n"," flipper_length_mm\n"," body_mass_g\n"," \n"," \n"," \n"," \n"," count\n"," 146.000000\n"," 146.000000\n"," 146.000000\n"," 146.000000\n"," \n"," \n"," mean\n"," 38.823973\n"," 18.347260\n"," 190.102740\n"," 3706.164384\n"," \n"," \n"," std\n"," 2.662597\n"," 1.219338\n"," 6.521825\n"," 458.620135\n"," \n"," \n"," min\n"," 32.100000\n"," 15.500000\n"," 172.000000\n"," 2850.000000\n"," \n"," \n"," 25%\n"," 36.725000\n"," 17.500000\n"," 186.000000\n"," 3362.500000\n"," \n"," \n"," 50%\n"," 38.850000\n"," 18.400000\n"," 190.000000\n"," 3700.000000\n"," \n"," \n"," 75%\n"," 40.775000\n"," 19.000000\n"," 195.000000\n"," 4000.000000\n"," \n"," \n"," max\n"," 46.000000\n"," 21.500000\n"," 210.000000\n"," 4775.000000\n"," \n"," \n","\n",""],"text/plain":[" bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n","count 146.000000 146.000000 146.000000 146.000000\n","mean 38.823973 18.347260 190.102740 3706.164384\n","std 2.662597 1.219338 6.521825 458.620135\n","min 32.100000 15.500000 172.000000 2850.000000\n","25% 36.725000 17.500000 186.000000 3362.500000\n","50% 38.850000 18.400000 190.000000 3700.000000\n","75% 40.775000 19.000000 195.000000 4000.000000\n","max 46.000000 21.500000 210.000000 
4775.000000"]},"execution_count":10,"metadata":{},"output_type":"execute_result"}],"source":["Adelie_values.describe()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":300},"id":"It7rckCZVbme","outputId":"ecdc4b09-8b23-48ff-fdce-09e29ae3b4b4"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," \n"," bill_length_mm\n"," bill_depth_mm\n"," flipper_length_mm\n"," body_mass_g\n"," \n"," \n"," \n"," \n"," count\n"," 68.000000\n"," 68.000000\n"," 68.000000\n"," 68.000000\n"," \n"," \n"," mean\n"," 48.833824\n"," 18.420588\n"," 195.823529\n"," 3733.088235\n"," \n"," \n"," std\n"," 3.339256\n"," 1.135395\n"," 7.131894\n"," 384.335081\n"," \n"," \n"," min\n"," 40.900000\n"," 16.400000\n"," 178.000000\n"," 2700.000000\n"," \n"," \n"," 25%\n"," 46.350000\n"," 17.500000\n"," 191.000000\n"," 3487.500000\n"," \n"," \n"," 50%\n"," 49.550000\n"," 18.450000\n"," 196.000000\n"," 3700.000000\n"," \n"," \n"," 75%\n"," 51.075000\n"," 19.400000\n"," 201.000000\n"," 3950.000000\n"," \n"," \n"," max\n"," 58.000000\n"," 20.800000\n"," 212.000000\n"," 4800.000000\n"," \n"," \n","\n",""],"text/plain":[" bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n","count 68.000000 68.000000 68.000000 68.000000\n","mean 48.833824 18.420588 195.823529 3733.088235\n","std 3.339256 1.135395 7.131894 384.335081\n","min 40.900000 16.400000 178.000000 2700.000000\n","25% 46.350000 17.500000 191.000000 3487.500000\n","50% 49.550000 18.450000 196.000000 3700.000000\n","75% 51.075000 19.400000 201.000000 3950.000000\n","max 58.000000 20.800000 212.000000 4800.000000"]},"execution_count":11,"metadata":{},"output_type":"execute_result"}],"source":["Chinstrap_values.describe()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":300},"id":"6q-3KASyVdcF","outputId":"1864d1fb-e1f2-4fe9-8782-86d56f07a6c8"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," \n"," 
bill_length_mm\n"," bill_depth_mm\n"," flipper_length_mm\n"," body_mass_g\n"," \n"," \n"," \n"," \n"," count\n"," 119.000000\n"," 119.000000\n"," 119.000000\n"," 119.000000\n"," \n"," \n"," mean\n"," 47.568067\n"," 14.996639\n"," 217.235294\n"," 5092.436975\n"," \n"," \n"," std\n"," 3.106116\n"," 0.985998\n"," 6.585431\n"," 501.476154\n"," \n"," \n"," min\n"," 40.900000\n"," 13.100000\n"," 203.000000\n"," 3950.000000\n"," \n"," \n"," 25%\n"," 45.350000\n"," 14.200000\n"," 212.000000\n"," 4700.000000\n"," \n"," \n"," 50%\n"," 47.400000\n"," 15.000000\n"," 216.000000\n"," 5050.000000\n"," \n"," \n"," 75%\n"," 49.600000\n"," 15.750000\n"," 221.500000\n"," 5500.000000\n"," \n"," \n"," max\n"," 59.600000\n"," 17.300000\n"," 231.000000\n"," 6300.000000\n"," \n"," \n","\n",""],"text/plain":[" bill_length_mm bill_depth_mm flipper_length_mm body_mass_g\n","count 119.000000 119.000000 119.000000 119.000000\n","mean 47.568067 14.996639 217.235294 5092.436975\n","std 3.106116 0.985998 6.585431 501.476154\n","min 40.900000 13.100000 203.000000 3950.000000\n","25% 45.350000 14.200000 212.000000 4700.000000\n","50% 47.400000 15.000000 216.000000 5050.000000\n","75% 49.600000 15.750000 221.500000 5500.000000\n","max 59.600000 17.300000 231.000000 6300.000000"]},"execution_count":12,"metadata":{},"output_type":"execute_result"}],"source":["Gentoo_values.describe()"]},{"cell_type":"markdown","metadata":{"id":"7_ie4_CksWqy"},"source":["## Applying Statistical Tests"]},{"cell_type":"markdown","metadata":{"id":"di_UCUwdXKDb"},"source":["### Normality Test"]},{"cell_type":"markdown","metadata":{"id":"PVcy5evnYtSL"},"source":["`normaltest()` tests whether a sample differs from a normal distribution.\n","\n","This function tests the null hypothesis that a sample comes from a normal distribution."]},{"cell_type":"markdown","metadata":{"id":"WaIMJYl_ZqAp"},"source":["The p-value range is between 0 and 
1."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"UO0yB80vqFqO","outputId":"a54f5a1a-d8f6-4d66-ad84-09b81a301301","pycharm":{"name":"#%%\n"}},"outputs":[{"name":"stdout","output_type":"stream","text":["Adelie: 0.7046667395852243\n","Chinstrap: 0.9143668075479967\n","Gentoo: 0.002785628232779262\n"]}],"source":["from scipy.stats import normaltest\n","print(\"Adelie: \", normaltest(Adelie_values['bill_length_mm']).pvalue)\n","print(\"Chinstrap: \", normaltest(Chinstrap_values['bill_length_mm']).pvalue)\n","print(\"Gentoo: \", normaltest(Gentoo_values['bill_length_mm']).pvalue)\n"]},{"cell_type":"markdown","metadata":{"id":"jR6_9Nn3abKY"},"source":["Traditionally, in statistics, we need a p-value of less than 0.05 to\n","reject the null hypothesis. In this case, 2 out of 3 species have a p-value > 0.05, so for them we cannot reject the null hypothesis. In other words, we have found no evidence that these 2 samples deviate from a normal distribution.\n","\n","But what about the last one?"]},{"cell_type":"markdown","metadata":{"id":"liLKelkYa2Sw"},"source":["Aren't we forgetting something? 
Do we know enough about the data?"]},{"cell_type":"markdown","metadata":{"id":"XndEb8ycnU2h"},"source":[""]},{"cell_type":"code","execution_count":null,"metadata":{"id":"NTJlvbMybCEo"},"outputs":[],"source":["# import matplotlib.pyplot as plt\n","# # Set up the matplotlib figure\n","# f, ax = plt.subplots(figsize=(10, 5))\n","\n","# sns.histplot(x = \"bill_length_mm\", data = penguins_cleaned, binwidth=1, hue=\"species\", kde=True)\n","# # plt.title(\"Bill Length\", size=20, color=\"red\") # would look weird\n","# ax.set(title=\"Bill Length\")\n","\n","# sns.move_legend(\n","# ax, \"upper center\",\n","# bbox_to_anchor=(.5, 1), ncol=3, title=None, frameon=False,\n","# )\n","\n","# f.savefig('penguins_species_bill-length_PNG.png', dpi=300)"]},{"cell_type":"markdown","metadata":{"id":"iolWcFn0bgLo"},"source":["A normal distribution is symmetric about the mean. \n","A normal distribution also has a specific width for a given height.\n","\n","If you double the height, the width scales proportionally. However,\n","you could imagine stretching a bell curve out in weird ways without\n","changing its symmetry. 
You could have a sharp, pointy distribution,\n","or a fat, boxy one."]},{"cell_type":"markdown","metadata":{"id":"mKuk_-LFnjCE"},"source":[""]},{"cell_type":"code","execution_count":null,"metadata":{"id":"JhH952mmbqPn"},"outputs":[],"source":["# import matplotlib.pyplot as plt\n","\n","# # Set up the matplotlib figure\n","# f, ax = plt.subplots(figsize=(10, 5))\n","\n","# sns.histplot(x = \"bill_length_mm\", data = Gentoo_values, binwidth=1, hue=\"sex\", kde=True)\n","# # plt.title(\"Bill Length\", size=20, color=\"red\") # would look wierd\n","# ax.set(title=\"Bill Length of Gentoo\")\n","\n","# sns.move_legend(\n","# ax, \"upper center\",\n","# bbox_to_anchor=(.5, 1), ncol=3, title=None, frameon=False,\n","# )\n","\n","# f.savefig('penguins_gentoo_bill-length_PNG.png', dpi=300)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"BsPrjqZFcBBv","outputId":"91b6c933-f0f1-475a-bea2-41614680fe7d"},"outputs":[{"name":"stdout","output_type":"stream","text":["Gentoo male: 0.0005453287292381343\n","Gentoo female: 0.9029679515828937\n"]}],"source":["Gentoo_values_male=Gentoo_values[Gentoo_values.sex=='Male']\n","Gentoo_values_female=Gentoo_values[Gentoo_values.sex=='Female']\n","\n","print(\"Gentoo male: \", normaltest(Gentoo_values_male['bill_length_mm']).pvalue)\n","print(\"Gentoo female: \", normaltest(Gentoo_values_female['bill_length_mm']).pvalue)"]},{"cell_type":"markdown","metadata":{"id":"Esjlxz45c49X"},"source":["Luckily we are able to test and compare sets of values, even if they do not come from Normal Distribution"]},{"cell_type":"markdown","metadata":{"id":"-iJhxc8TsWq5"},"source":["### Comparing 2 groups"]},{"cell_type":"markdown","metadata":{"id":"uwN3p34bcsDF"},"source":["#### Parametric Tests\n","\n","##### T-Test"]},{"cell_type":"markdown","metadata":{"id":"6BUomQcSfbkl"},"source":["`ttest_ind()` calculates the T-test for the means of two independent samples of scores.\n","\n","This is a test for 
the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.\n","\n","There is a catch! We need either the same number of samples in both groups or the same variance. With `equal_var=True` (the default), `ttest_ind()` performs a standard independent 2-sample test that assumes equal population variances. With `equal_var=False`, it performs Welch’s t-test, which does not assume equal population variances.\n","\n","---\n","**Note**\n","*Using Student's original definition of the t-test, the two populations being compared should have the same variance. If the sample sizes in the two groups being compared are equal, Student's original t-test is highly robust to the presence of unequal variances. Welch's t-test is insensitive to equality of the variances regardless of whether the sample sizes are similar.*"]},{"cell_type":"markdown","metadata":{"id":"na3ztua1gqrP"},"source":["We first perform the Levene test for equal variances.\n","\n","The Levene test tests the null hypothesis that all input samples are from populations with equal variances."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"pBrWKB1dgUuk"},"outputs":[],"source":["from scipy.stats import ttest_ind, levene"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"N8Qy-aPUdJw2","outputId":"4d224e1e-38bb-4edc-b2f9-c7cbad1e748c"},"outputs":[{"name":"stdout","output_type":"stream","text":["Adelie vs Chinstrap, variance: LeveneResult(statistic=4.529733833453024, pvalue=0.03446512682844295)\n"]}],"source":["# pvalues with scipy:\n","stat_results = [levene(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'])]\n","\n","print(\"Adelie vs Chinstrap, variance: \", stat_results[0])\n","\n","pvalues = [result.pvalue for result in stat_results]"]},{"cell_type":"markdown","metadata":{"id":"6azzEtg3iLB6"},"source":["With a p-value < 0.05, we reject the null hypothesis that all input samples are from populations with 
equal variances. So we need Welch’s t-test."]},{"cell_type":"markdown","metadata":{"id":"foRasXqAjkAp"},"source":["---\n","The null hypothesis is that 2 independent samples have identical average (expected) values."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"o1vYC0PniQi7","outputId":"69bcb07e-0031-47ed-8aaf-bc81a920e8f6"},"outputs":[{"name":"stdout","output_type":"stream","text":["Adelie vs Chinstrap, mean: TtestResult(statistic=-21.712498056635937, pvalue=3.1490764303457434e-41, df=108.17221912082128)\n"]}],"source":["# pvalues with scipy:\n","stat_results_ACh = [ttest_ind(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], equal_var=False)]\n","\n","print(\"Adelie vs Chinstrap, mean: \", stat_results_ACh[0])\n","\n","pvalues = [result.pvalue for result in stat_results_ACh]"]},{"cell_type":"markdown","metadata":{"id":"hVwFdAUWjtzI"},"source":["With a p-value < 0.05, we reject the null hypothesis. Now, how can we plot this?"]},{"cell_type":"markdown","metadata":{"id":"TthTHYeKcuaP"},"source":["#### Non-parametric\n","\n","##### Mann-Whitney U Test"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"P1AR8Gd2dcpO","outputId":"ea7b7591-3640-4d0b-c731-5109fafaa9a5"},"outputs":[{"name":"stdout","output_type":"stream","text":["Gentoo male vs Gentoo female, bill_length_mm: MannwhitneyuResult(statistic=3125.5, pvalue=5.58594267834405e-13)\n"]}],"source":["from scipy.stats import mannwhitneyu\n","\n","# pvalues with scipy:\n","stat_results_GFM = [mannwhitneyu(Gentoo_values_male['bill_length_mm'], Gentoo_values_female['bill_length_mm'], alternative=\"two-sided\"),]\n","\n","print(\"Gentoo male vs Gentoo female, bill_length_mm: \", stat_results_GFM[0])\n","\n","pvalues = [result.pvalue for result in stat_results_GFM]"]},{"cell_type":"markdown","metadata":{"id":"fyK-kOiXj6Ze"},"source":["### Comparing more than 2 
groups"]},{"cell_type":"markdown","metadata":{"id":"AN5wIjCTvFWa"},"source":["#### Analysis of Variance\n","\n","Now we have three samples, so a t-test is not appropriate. To test the null hypothesis that there is no difference between the sample means, we should apply a one-way ANOVA."]},{"cell_type":"markdown","metadata":{"id":"3Y9HoWv_wfN2"},"source":["Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the \"variation\" among and between groups) used to analyze the differences among means. ANOVA was developed by the statistician Ronald Fisher. ANOVA is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means. **In other words, the ANOVA is used to test the difference between two or more means.**"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"7aBgPsKdvE9a"},"outputs":[],"source":["from scipy.stats import f_oneway\n","\n","stat_results_f_oneway = f_oneway(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], Gentoo_values['bill_length_mm'])"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"wgVCvkZWwUYx","outputId":"2a827ba3-8980-4a77-a3bd-d55cf6419d62"},"outputs":[{"data":{"text/plain":["F_onewayResult(statistic=397.29943741282835, pvalue=1.3809842053150027e-88)"]},"execution_count":17,"metadata":{},"output_type":"execute_result"}],"source":["stat_results_f_oneway"]},{"cell_type":"markdown","metadata":{"id":"xXJQhqC-wufJ"},"source":["#### Kruskal–Wallis test\n","The Kruskal–Wallis test is a **non-parametric** method for testing whether samples originate from the same distribution. 
It is used for comparing two or more independent samples of equal or different sample sizes. It extends the Mann–Whitney U test, which is used for comparing only two groups."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"lIiCDHqMxa0g"},"outputs":[],"source":["from scipy.stats import kruskal\n","\n","stat_results_kruskal = kruskal(Adelie_values['bill_length_mm'], Chinstrap_values['bill_length_mm'], Gentoo_values['bill_length_mm'])"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Kteu-hZSxhZ8","outputId":"9457db2f-608b-45b8-e751-cd35950930a7"},"outputs":[{"data":{"text/plain":["KruskalResult(statistic=236.8992355590763, pvalue=3.6139705965512625e-52)"]},"execution_count":19,"metadata":{},"output_type":"execute_result"}],"source":["stat_results_kruskal"]},{"cell_type":"markdown","metadata":{"id":"7Zasv4RyxsLS"},"source":["These tests look at all groups at the same time, but we usually want to know which pairs of groups actually differ! **For such pairwise comparisons, we need to include a multiple testing correction!**"]},{"cell_type":"markdown","metadata":{"id":"ZQhI3SxUx1Av"},"source":["#### Multiple testing correction\n","\n","For ANOVA, there is Tukey's test. For non-parametric tests, there is Dunn’s test."]},{"cell_type":"markdown","metadata":{"id":"V9eYOcgA6UEw"},"source":["##### Tukey\n","\n","This test uses pairwise post-hoc testing for **ANOVA** to determine whether there is a difference between the means of each possible pair of groups, using a studentized range distribution. 
This method tests every possible pair of all groups."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"81cffTajx6aY"},"outputs":[],"source":["from statsmodels.stats.multicomp import pairwise_tukeyhsd\n","\n","# For Tukey the dataframe needs to be melted\n","\n","data = penguins_cleaned[['species','bill_length_mm']]"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"aB4XYCB2zoab"},"outputs":[],"source":["# perform multiple pairwise comparison (Tukey HSD)\n","stat_results_pairwise_tukeyhsd = pairwise_tukeyhsd(endog=data['bill_length_mm'], groups=data['species'], alpha=0.05)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"UV43AXiny-F_","outputId":"c1da7cee-8dc6-47ab-f597-491f505b84f2"},"outputs":[{"name":"stdout","output_type":"stream","text":[" Multiple Comparison of Means - Tukey HSD, FWER=0.05 \n","==========================================================\n"," group1 group2 meandiff p-adj lower upper reject\n","----------------------------------------------------------\n"," Adelie Chinstrap 10.0099 0.0 8.9828 11.0369 True\n"," Adelie Gentoo 8.7441 0.0 7.8801 9.6081 True\n","Chinstrap Gentoo -1.2658 0.0148 -2.3292 -0.2023 True\n","----------------------------------------------------------\n"]}],"source":["print(stat_results_pairwise_tukeyhsd)"]},{"cell_type":"markdown","metadata":{"id":"nNybO7cV5bkE"},"source":["##### Dunn's test\n","\n","If the results of a **Kruskal-Wallis** test are statistically significant, then it’s appropriate to conduct Dunn’s Test to determine exactly which groups are different."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"PijgOoRT6G_4"},"outputs":[],"source":["# !pip install scikit_posthocs"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"PVsfEvEp5aAA"},"outputs":[],"source":["data=[Adelie_values['bill_length_mm'].to_numpy(), Chinstrap_values['bill_length_mm'].to_numpy(), 
Gentoo_values['bill_length_mm'].to_numpy()]"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":852},"id":"tECveDfg5nxD","outputId":"eccd2fa2-2a43-4e92-b2e2-93e5bcf17bb1"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," \n"," 1\n"," 2\n"," 3\n"," \n"," \n"," \n"," \n"," 1\n"," 1.000000e+00\n"," 4.623888e-36\n"," 1.030275e-37\n"," \n"," \n"," 2\n"," 4.623888e-36\n"," 1.000000e+00\n"," 2.697766e-01\n"," \n"," \n"," 3\n"," 1.030275e-37\n"," 2.697766e-01\n"," 1.000000e+00\n"," \n"," \n","\n",""],"text/plain":[" 1 2 3\n","1 1.000000e+00 4.623888e-36 1.030275e-37\n","2 4.623888e-36 1.000000e+00 2.697766e-01\n","3 1.030275e-37 2.697766e-01 1.000000e+00"]},"execution_count":24,"metadata":{},"output_type":"execute_result"}],"source":["# Perform Dunn's test using a Bonferroni correction for the p-values\n","import scikit_posthocs as sp\n","sp.posthoc_dunn(data, p_adjust = 'bonferroni', group_col='species', val_col='bill_length_mm')"]},{"cell_type":"markdown","metadata":{"id":"OV1g1q3FqFqY"},"source":["---\n","# (Bonus preview) Use statannotations to apply scipy tests"]},{"cell_type":"markdown","metadata":{"id":"tEY7A5YlqFqY"},"source":["Finally, `statannotations` can take care of most of the steps required to run the test by calling `scipy.stats` directly\n","and annotating the plot.\n","The available options are\n","\n","- Mann-Whitney\n","- t-test (independent and paired)\n","- Welch's t-test\n","- Levene test\n","- Wilcoxon test\n","- Kruskal-Wallis test\n","\n","We will cover how to use a test that is not one of those already interfaced in `statannotations`.\n","If you are curious, you can also take a look at the usage\n","[notebook](https://github.com/trevismd/statannotations/blob/master/usage/example.ipynb) in the project 
repository."]},{"cell_type":"markdown","metadata":{"id":"vsdfj_gM5qbt"},"source":[""]},{"cell_type":"markdown","metadata":{"id":"wyUoyKXHj2UQ"},"source":["---"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"iH1jL0baGMJW","outputId":"dd7307d6-eb5b-4ed0-c233-32574ddb0c40"},"outputs":[{"name":"stdout","output_type":"stream","text":["Last updated: 2023-08-25T10:49:29.031581+02:00\n","\n","Python implementation: CPython\n","Python version : 3.9.17\n","IPython version : 8.14.0\n","\n","Compiler : MSC v.1929 64 bit (AMD64)\n","OS : Windows\n","Release : 10\n","Machine : AMD64\n","Processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel\n","CPU cores : 16\n","Architecture: 64bit\n","\n","watermark : 2.4.3\n","numpy : 1.23.5\n","pandas : 2.0.3\n","seaborn : 0.12.2\n","matplotlib : 3.7.2\n","scipy : 1.11.2\n","statannotations: 0.4.4\n","\n"]}],"source":["from watermark import watermark\n","watermark(iversions=True, globals_=globals())\n","print(watermark())\n","print(watermark(packages=\"watermark,numpy,pandas,seaborn,matplotlib,scipy,statannotations\"))"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"c-OuhB6MsWrE"},"outputs":[],"source":[]}],"metadata":{"colab":{"provenance":[],"toc_visible":true},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.9.17"}},"nbformat":4,"nbformat_minor":0}