{"cells":[{"cell_type":"markdown","source":["
"],"metadata":{"id":"8B0bvhGssIL5"}},{"cell_type":"markdown","metadata":{"id":"Vkf1B-vMwpVB"},"source":["# Exploratory Data Analysis"]},{"cell_type":"markdown","metadata":{"id":"fpPtZgqIvuXz"},"source":["Inspiration and some of the parts came from: Python Data Science [GitHub repository](https://github.com/jakevdp/PythonDataScienceHandbook/tree/master), [MIT License](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/LICENSE-CODE) and [Introduction to Pandas](https://colab.research.google.com/notebooks/mlcc/intro_to_pandas.ipynb) by Google, [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)\n","\n","If running this from Google Colab, uncomment the cell below and run it. Otherwise, just skip it."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"5saSBc40voZF"},"outputs":[],"source":["#!pip install seaborn\n","#!pip install watermark"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"KoPnqdBKsHas"},"outputs":[],"source":["import pandas as pd\n","import seaborn as sns\n","from scipy import stats"]},{"cell_type":"markdown","metadata":{"id":"ZkUd2sa-yP5e"},"source":["## Learning Objectives:\n","\n"," * descriptive statistics/EDA\n"," * corr matrix\n","\n","For this notebook, we will use the california housing dataframes."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"elapsed":359,"status":"ok","timestamp":1692082289947,"user":{"displayName":"Martin Schätz","userId":"14609383414092679868"},"user_tz":-120},"id":"av6RYOraVG1V","outputId":"6d1929a3-fec1-4e28-c968-b5d78c224fe9"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," \n"," longitude\n"," latitude\n"," housing_median_age\n"," total_rooms\n"," total_bedrooms\n"," population\n"," households\n"," median_income\n"," median_house_value\n"," \n"," \n"," \n"," \n"," 0\n"," -114.31\n"," 34.19\n"," 15.0\n"," 5612.0\n"," 1283.0\n"," 1015.0\n"," 472.0\n"," 1.4936\n"," 66900.0\n"," 
\n"," \n"," 1\n"," -114.47\n"," 34.40\n"," 19.0\n"," 7650.0\n"," 1901.0\n"," 1129.0\n"," 463.0\n"," 1.8200\n"," 80100.0\n"," \n"," \n"," 2\n"," -114.56\n"," 33.69\n"," 17.0\n"," 720.0\n"," 174.0\n"," 333.0\n"," 117.0\n"," 1.6509\n"," 85700.0\n"," \n"," \n"," 3\n"," -114.57\n"," 33.64\n"," 14.0\n"," 1501.0\n"," 337.0\n"," 515.0\n"," 226.0\n"," 3.1917\n"," 73400.0\n"," \n"," \n"," 4\n"," -114.57\n"," 33.57\n"," 20.0\n"," 1454.0\n"," 326.0\n"," 624.0\n"," 262.0\n"," 1.9250\n"," 65500.0\n"," \n"," \n"," ...\n"," ...\n"," ...\n"," ...\n"," ...\n"," ...\n"," ...\n"," ...\n"," ...\n"," ...\n"," \n"," \n"," 16995\n"," -124.26\n"," 40.58\n"," 52.0\n"," 2217.0\n"," 394.0\n"," 907.0\n"," 369.0\n"," 2.3571\n"," 111400.0\n"," \n"," \n"," 16996\n"," -124.27\n"," 40.69\n"," 36.0\n"," 2349.0\n"," 528.0\n"," 1194.0\n"," 465.0\n"," 2.5179\n"," 79000.0\n"," \n"," \n"," 16997\n"," -124.30\n"," 41.84\n"," 17.0\n"," 2677.0\n"," 531.0\n"," 1244.0\n"," 456.0\n"," 3.0313\n"," 103600.0\n"," \n"," \n"," 16998\n"," -124.30\n"," 41.80\n"," 19.0\n"," 2672.0\n"," 552.0\n"," 1298.0\n"," 478.0\n"," 1.9797\n"," 85800.0\n"," \n"," \n"," 16999\n"," -124.35\n"," 40.54\n"," 52.0\n"," 1820.0\n"," 300.0\n"," 806.0\n"," 270.0\n"," 3.0147\n"," 94600.0\n"," \n"," \n","\n","17000 rows × 9 columns\n",""],"text/plain":[" longitude latitude housing_median_age total_rooms total_bedrooms \\\n","0 -114.31 34.19 15.0 5612.0 1283.0 \n","1 -114.47 34.40 19.0 7650.0 1901.0 \n","2 -114.56 33.69 17.0 720.0 174.0 \n","3 -114.57 33.64 14.0 1501.0 337.0 \n","4 -114.57 33.57 20.0 1454.0 326.0 \n","... ... ... ... ... ... 
\n","16995 -124.26 40.58 52.0 2217.0 394.0 \n","16996 -124.27 40.69 36.0 2349.0 528.0 \n","16997 -124.30 41.84 17.0 2677.0 531.0 \n","16998 -124.30 41.80 19.0 2672.0 552.0 \n","16999 -124.35 40.54 52.0 1820.0 300.0 \n","\n"," population households median_income median_house_value \n","0 1015.0 472.0 1.4936 66900.0 \n","1 1129.0 463.0 1.8200 80100.0 \n","2 333.0 117.0 1.6509 85700.0 \n","3 515.0 226.0 3.1917 73400.0 \n","4 624.0 262.0 1.9250 65500.0 \n","... ... ... ... ... \n","16995 907.0 369.0 2.3571 111400.0 \n","16996 1194.0 465.0 2.5179 79000.0 \n","16997 1244.0 456.0 3.0313 103600.0 \n","16998 1298.0 478.0 1.9797 85800.0 \n","16999 806.0 270.0 3.0147 94600.0 \n","\n","[17000 rows x 9 columns]"]},"execution_count":5,"metadata":{},"output_type":"execute_result"}],"source":["california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n","california_housing_dataframe"]},{"cell_type":"markdown","metadata":{"id":"dRGUa2xD_-kt"},"source":["## Exploring Data"]},{"cell_type":"markdown","metadata":{"id":"5tP90yYg8vP7"},"source":["As shown above, after loading a large `DataFrame`, it may be a bit hard to have a good overview of what is inside it just by looking at a few rows. 
Thus, the `DataFrame.describe` method is useful to show interesting statistics about a `DataFrame`."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":300},"executionInfo":{"elapsed":411,"status":"ok","timestamp":1692082316437,"user":{"displayName":"Martin Schätz","userId":"14609383414092679868"},"user_tz":-120},"id":"pEn_CnT28vQJ","outputId":"e68ecb73-8a09-46fe-b456-7be9ed403b7c"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," \n"," longitude\n"," latitude\n"," housing_median_age\n"," total_rooms\n"," total_bedrooms\n"," population\n"," households\n"," median_income\n"," median_house_value\n"," \n"," \n"," \n"," \n"," count\n"," 17000.000000\n"," 17000.000000\n"," 17000.000000\n"," 17000.000000\n"," 17000.000000\n"," 17000.000000\n"," 17000.000000\n"," 17000.000000\n"," 17000.000000\n"," \n"," \n"," mean\n"," -119.562108\n"," 35.625225\n"," 28.589353\n"," 2643.664412\n"," 539.410824\n"," 1429.573941\n"," 501.221941\n"," 3.883578\n"," 207300.912353\n"," \n"," \n"," std\n"," 2.005166\n"," 2.137340\n"," 12.586937\n"," 2179.947071\n"," 421.499452\n"," 1147.852959\n"," 384.520841\n"," 1.908157\n"," 115983.764387\n"," \n"," \n"," min\n"," -124.350000\n"," 32.540000\n"," 1.000000\n"," 2.000000\n"," 1.000000\n"," 3.000000\n"," 1.000000\n"," 0.499900\n"," 14999.000000\n"," \n"," \n"," 25%\n"," -121.790000\n"," 33.930000\n"," 18.000000\n"," 1462.000000\n"," 297.000000\n"," 790.000000\n"," 282.000000\n"," 2.566375\n"," 119400.000000\n"," \n"," \n"," 50%\n"," -118.490000\n"," 34.250000\n"," 29.000000\n"," 2127.000000\n"," 434.000000\n"," 1167.000000\n"," 409.000000\n"," 3.544600\n"," 180400.000000\n"," \n"," \n"," 75%\n"," -118.000000\n"," 37.720000\n"," 37.000000\n"," 3151.250000\n"," 648.250000\n"," 1721.000000\n"," 605.250000\n"," 4.767000\n"," 265000.000000\n"," \n"," \n"," max\n"," -114.310000\n"," 41.950000\n"," 52.000000\n"," 37937.000000\n"," 6445.000000\n"," 35682.000000\n"," 
6082.000000\n"," 15.000100\n"," 500001.000000\n"," \n"," \n","\n",""],"text/plain":[" longitude latitude housing_median_age total_rooms \\\n","count 17000.000000 17000.000000 17000.000000 17000.000000 \n","mean -119.562108 35.625225 28.589353 2643.664412 \n","std 2.005166 2.137340 12.586937 2179.947071 \n","min -124.350000 32.540000 1.000000 2.000000 \n","25% -121.790000 33.930000 18.000000 1462.000000 \n","50% -118.490000 34.250000 29.000000 2127.000000 \n","75% -118.000000 37.720000 37.000000 3151.250000 \n","max -114.310000 41.950000 52.000000 37937.000000 \n","\n"," total_bedrooms population households median_income \\\n","count 17000.000000 17000.000000 17000.000000 17000.000000 \n","mean 539.410824 1429.573941 501.221941 3.883578 \n","std 421.499452 1147.852959 384.520841 1.908157 \n","min 1.000000 3.000000 1.000000 0.499900 \n","25% 297.000000 790.000000 282.000000 2.566375 \n","50% 434.000000 1167.000000 409.000000 3.544600 \n","75% 648.250000 1721.000000 605.250000 4.767000 \n","max 6445.000000 35682.000000 6082.000000 15.000100 \n","\n"," median_house_value \n","count 17000.000000 \n","mean 207300.912353 \n","std 115983.764387 \n","min 14999.000000 \n","25% 119400.000000 \n","50% 180400.000000 \n","75% 265000.000000 \n","max 500001.000000 "]},"execution_count":6,"metadata":{},"output_type":"execute_result"}],"source":["california_housing_dataframe.describe()"]},{"cell_type":"markdown","metadata":{"id":"pFuzC-Gh8vQK"},"source":["Another useful function is `DataFrame.head`, which displays the first few records of a `DataFrame`. 
You can give it a number of rows to display."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"elapsed":501,"status":"ok","timestamp":1692082318640,"user":{"displayName":"Martin Schätz","userId":"14609383414092679868"},"user_tz":-120},"id":"s3ND3bgOkB5k","outputId":"0487c669-1316-4f15-8bf7-7a6c30840053"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," \n"," longitude\n"," latitude\n"," housing_median_age\n"," total_rooms\n"," total_bedrooms\n"," population\n"," households\n"," median_income\n"," median_house_value\n"," \n"," \n"," \n"," \n"," 0\n"," -114.31\n"," 34.19\n"," 15.0\n"," 5612.0\n"," 1283.0\n"," 1015.0\n"," 472.0\n"," 1.4936\n"," 66900.0\n"," \n"," \n"," 1\n"," -114.47\n"," 34.40\n"," 19.0\n"," 7650.0\n"," 1901.0\n"," 1129.0\n"," 463.0\n"," 1.8200\n"," 80100.0\n"," \n"," \n"," 2\n"," -114.56\n"," 33.69\n"," 17.0\n"," 720.0\n"," 174.0\n"," 333.0\n"," 117.0\n"," 1.6509\n"," 85700.0\n"," \n"," \n"," 3\n"," -114.57\n"," 33.64\n"," 14.0\n"," 1501.0\n"," 337.0\n"," 515.0\n"," 226.0\n"," 3.1917\n"," 73400.0\n"," \n"," \n"," 4\n"," -114.57\n"," 33.57\n"," 20.0\n"," 1454.0\n"," 326.0\n"," 624.0\n"," 262.0\n"," 1.9250\n"," 65500.0\n"," \n"," \n"," 5\n"," -114.58\n"," 33.63\n"," 29.0\n"," 1387.0\n"," 236.0\n"," 671.0\n"," 239.0\n"," 3.3438\n"," 74000.0\n"," \n"," \n"," 6\n"," -114.58\n"," 33.61\n"," 25.0\n"," 2907.0\n"," 680.0\n"," 1841.0\n"," 633.0\n"," 2.6768\n"," 82400.0\n"," \n"," \n"," 7\n"," -114.59\n"," 34.83\n"," 41.0\n"," 812.0\n"," 168.0\n"," 375.0\n"," 158.0\n"," 1.7083\n"," 48500.0\n"," \n"," \n"," 8\n"," -114.59\n"," 33.61\n"," 34.0\n"," 4789.0\n"," 1175.0\n"," 3134.0\n"," 1056.0\n"," 2.1782\n"," 58400.0\n"," \n"," \n"," 9\n"," -114.60\n"," 34.83\n"," 46.0\n"," 1497.0\n"," 309.0\n"," 787.0\n"," 271.0\n"," 2.1908\n"," 48100.0\n"," \n"," \n","\n",""],"text/plain":[" longitude latitude housing_median_age total_rooms total_bedrooms \\\n","0 -114.31 34.19 
15.0 5612.0 1283.0 \n","1 -114.47 34.40 19.0 7650.0 1901.0 \n","2 -114.56 33.69 17.0 720.0 174.0 \n","3 -114.57 33.64 14.0 1501.0 337.0 \n","4 -114.57 33.57 20.0 1454.0 326.0 \n","5 -114.58 33.63 29.0 1387.0 236.0 \n","6 -114.58 33.61 25.0 2907.0 680.0 \n","7 -114.59 34.83 41.0 812.0 168.0 \n","8 -114.59 33.61 34.0 4789.0 1175.0 \n","9 -114.60 34.83 46.0 1497.0 309.0 \n","\n"," population households median_income median_house_value \n","0 1015.0 472.0 1.4936 66900.0 \n","1 1129.0 463.0 1.8200 80100.0 \n","2 333.0 117.0 1.6509 85700.0 \n","3 515.0 226.0 3.1917 73400.0 \n","4 624.0 262.0 1.9250 65500.0 \n","5 671.0 239.0 3.3438 74000.0 \n","6 1841.0 633.0 2.6768 82400.0 \n","7 375.0 158.0 1.7083 48500.0 \n","8 3134.0 1056.0 2.1782 58400.0 \n","9 787.0 271.0 2.1908 48100.0 "]},"execution_count":7,"metadata":{},"output_type":"execute_result"}],"source":["california_housing_dataframe.head(10)"]},{"cell_type":"markdown","metadata":{"id":"2O6_QUmu9Ncp"},"source":["Or `DataFrame.tail`, which displays the last few records of a `DataFrame`:"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"elapsed":558,"status":"ok","timestamp":1692082320554,"user":{"displayName":"Martin Schätz","userId":"14609383414092679868"},"user_tz":-120},"id":"YWzM1PxE9Nc0","outputId":"7a26815d-5b4a-4a25-881c-7e3d1ae381ee"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," \n"," longitude\n"," latitude\n"," housing_median_age\n"," total_rooms\n"," total_bedrooms\n"," population\n"," households\n"," median_income\n"," median_house_value\n"," \n"," \n"," \n"," \n"," 16995\n"," -124.26\n"," 40.58\n"," 52.0\n"," 2217.0\n"," 394.0\n"," 907.0\n"," 369.0\n"," 2.3571\n"," 111400.0\n"," \n"," \n"," 16996\n"," -124.27\n"," 40.69\n"," 36.0\n"," 2349.0\n"," 528.0\n"," 1194.0\n"," 465.0\n"," 2.5179\n"," 79000.0\n"," \n"," \n"," 16997\n"," -124.30\n"," 41.84\n"," 17.0\n"," 2677.0\n"," 531.0\n"," 1244.0\n"," 456.0\n"," 
3.0313\n"," 103600.0\n"," \n"," \n"," 16998\n"," -124.30\n"," 41.80\n"," 19.0\n"," 2672.0\n"," 552.0\n"," 1298.0\n"," 478.0\n"," 1.9797\n"," 85800.0\n"," \n"," \n"," 16999\n"," -124.35\n"," 40.54\n"," 52.0\n"," 1820.0\n"," 300.0\n"," 806.0\n"," 270.0\n"," 3.0147\n"," 94600.0\n"," \n"," \n","\n",""],"text/plain":[" longitude latitude housing_median_age total_rooms total_bedrooms \\\n","16995 -124.26 40.58 52.0 2217.0 394.0 \n","16996 -124.27 40.69 36.0 2349.0 528.0 \n","16997 -124.30 41.84 17.0 2677.0 531.0 \n","16998 -124.30 41.80 19.0 2672.0 552.0 \n","16999 -124.35 40.54 52.0 1820.0 300.0 \n","\n"," population households median_income median_house_value \n","16995 907.0 369.0 2.3571 111400.0 \n","16996 1194.0 465.0 2.5179 79000.0 \n","16997 1244.0 456.0 3.0313 103600.0 \n","16998 1298.0 478.0 1.9797 85800.0 \n","16999 806.0 270.0 3.0147 94600.0 "]},"execution_count":8,"metadata":{},"output_type":"execute_result"}],"source":["california_housing_dataframe.tail()"]},{"cell_type":"markdown","metadata":{"id":"ALnk--Ap37Ny"},"source":["## Correlation Matrix\n","\n","Consider the table of measurements below."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"lWtYeuES-Jtq","outputId":"afdceba3-5d99-42d3-efd3-ef0b3b2c6bd3"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," \n"," area\n"," mean_intensity\n"," minor_axis_length\n"," major_axis_length\n"," eccentricity\n"," extent\n"," feret_diameter_max\n"," equivalent_diameter_area\n"," bbox-0\n"," bbox-1\n"," bbox-2\n"," bbox-3\n"," \n"," \n"," \n"," \n"," 0\n"," 422\n"," 192.379147\n"," 16.488550\n"," 34.566789\n"," 0.878900\n"," 0.586111\n"," 35.227830\n"," 23.179885\n"," 0\n"," 11\n"," 30\n"," 35\n"," \n"," \n"," 1\n"," 182\n"," 180.131868\n"," 11.736074\n"," 20.802697\n"," 0.825665\n"," 0.787879\n"," 21.377558\n"," 15.222667\n"," 0\n"," 53\n"," 11\n"," 74\n"," \n"," \n"," 2\n"," 661\n"," 205.216339\n"," 28.409502\n"," 30.208433\n"," 0.339934\n"," 0.874339\n"," 32.756679\n"," 29.010538\n"," 
0\n"," 95\n"," 28\n"," 122\n"," \n"," \n"," 3\n"," 437\n"," 216.585812\n"," 23.143996\n"," 24.606130\n"," 0.339576\n"," 0.826087\n"," 26.925824\n"," 23.588253\n"," 0\n"," 144\n"," 23\n"," 167\n"," \n"," \n"," 4\n"," 476\n"," 212.302521\n"," 19.852882\n"," 31.075106\n"," 0.769317\n"," 0.863884\n"," 31.384710\n"," 24.618327\n"," 0\n"," 237\n"," 29\n"," 256\n"," \n"," \n","\n",""],"text/plain":[" area mean_intensity minor_axis_length major_axis_length eccentricity \\\n","0 422 192.379147 16.488550 34.566789 0.878900 \n","1 182 180.131868 11.736074 20.802697 0.825665 \n","2 661 205.216339 28.409502 30.208433 0.339934 \n","3 437 216.585812 23.143996 24.606130 0.339576 \n","4 476 212.302521 19.852882 31.075106 0.769317 \n","\n"," extent feret_diameter_max equivalent_diameter_area bbox-0 bbox-1 \\\n","0 0.586111 35.227830 23.179885 0 11 \n","1 0.787879 21.377558 15.222667 0 53 \n","2 0.874339 32.756679 29.010538 0 95 \n","3 0.826087 26.925824 23.588253 0 144 \n","4 0.863884 31.384710 24.618327 0 237 \n","\n"," bbox-2 bbox-3 \n","0 30 35 \n","1 11 74 \n","2 28 122 \n","3 23 167 \n","4 29 256 "]},"execution_count":23,"metadata":{},"output_type":"execute_result"}],"source":["blobs_statistics = pd.read_csv('../../data/blobs_statistics.csv', index_col=0)\n","blobs_statistics.head()"]},{"cell_type":"markdown","metadata":{"id":"kjZrpPdcsHaw"},"source":["After measuring many features / properties, it is common for some of them to be strongly correlated and to add little new information. 
In pandas, we can calculate correlation among columns like this."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":4,"status":"ok","timestamp":1692083624021,"user":{"displayName":"Martin Schätz","userId":"14609383414092679868"},"user_tz":-120},"id":"SKuQJYgB-CvX","outputId":"62a97fbc-5989-4591-9dff-a1360930dbe7"},"outputs":[{"data":{"text/html":["\n","\n","\n"," \n"," \n"," \n"," area\n"," mean_intensity\n"," minor_axis_length\n"," major_axis_length\n"," eccentricity\n"," extent\n"," feret_diameter_max\n"," equivalent_diameter_area\n"," bbox-0\n"," bbox-1\n"," bbox-2\n"," bbox-3\n"," \n"," \n"," \n"," \n"," area\n"," 1.000000\n"," 0.548612\n"," 0.890649\n"," 0.895282\n"," -0.192147\n"," -0.267454\n"," 0.916652\n"," 0.975964\n"," -0.066508\n"," -0.081937\n"," 0.034083\n"," -0.003961\n"," \n"," \n"," mean_intensity\n"," 0.548612\n"," 1.000000\n"," 0.657131\n"," 0.440678\n"," -0.362592\n"," -0.011555\n"," 0.487183\n"," 0.611103\n"," 0.015188\n"," 0.217484\n"," 0.069184\n"," 0.266504\n"," \n"," \n"," minor_axis_length\n"," 0.890649\n"," 0.657131\n"," 1.000000\n"," 0.664507\n"," -0.566486\n"," -0.037872\n"," 0.716706\n"," 0.937795\n"," -0.163017\n"," -0.056785\n"," -0.077817\n"," 0.015790\n"," \n"," \n"," major_axis_length\n"," 0.895282\n"," 0.440678\n"," 0.664507\n"," 1.000000\n"," 0.168454\n"," -0.551362\n"," 0.995196\n"," 0.880909\n"," -0.010743\n"," -0.128821\n"," 0.093556\n"," -0.057776\n"," \n"," \n"," eccentricity\n"," -0.192147\n"," -0.362592\n"," -0.566486\n"," 0.168454\n"," 1.000000\n"," -0.432629\n"," 0.103529\n"," -0.272402\n"," 0.257938\n"," -0.060467\n"," 0.253671\n"," -0.076793\n"," \n"," \n"," extent\n"," -0.267454\n"," -0.011555\n"," -0.037872\n"," -0.551362\n"," -0.432629\n"," 1.000000\n"," -0.517428\n"," -0.278453\n"," -0.076688\n"," 0.048511\n"," -0.128149\n"," 0.019310\n"," \n"," \n"," feret_diameter_max\n"," 0.916652\n"," 0.487183\n"," 0.716706\n"," 0.995196\n"," 
0.103529\n"," -0.517428\n"," 1.000000\n"," 0.911211\n"," -0.025173\n"," -0.122607\n"," 0.080054\n"," -0.049283\n"," \n"," \n"," equivalent_diameter_area\n"," 0.975964\n"," 0.611103\n"," 0.937795\n"," 0.880909\n"," -0.272402\n"," -0.278453\n"," 0.911211\n"," 1.000000\n"," -0.107059\n"," -0.096706\n"," -0.004660\n"," -0.018489\n"," \n"," \n"," bbox-0\n"," -0.066508\n"," 0.015188\n"," -0.163017\n"," -0.010743\n"," 0.257938\n"," -0.076688\n"," -0.025173\n"," -0.107059\n"," 1.000000\n"," 0.050957\n"," 0.993418\n"," 0.053563\n"," \n"," \n"," bbox-1\n"," -0.081937\n"," 0.217484\n"," -0.056785\n"," -0.128821\n"," -0.060467\n"," 0.048511\n"," -0.122607\n"," -0.096706\n"," 0.050957\n"," 1.000000\n"," 0.032728\n"," 0.996062\n"," \n"," \n"," bbox-2\n"," 0.034083\n"," 0.069184\n"," -0.077817\n"," 0.093556\n"," 0.253671\n"," -0.128149\n"," 0.080054\n"," -0.004660\n"," 0.993418\n"," 0.032728\n"," 1.000000\n"," 0.041855\n"," \n"," \n"," bbox-3\n"," -0.003961\n"," 0.266504\n"," 0.015790\n"," -0.057776\n"," -0.076793\n"," 0.019310\n"," -0.049283\n"," -0.018489\n"," 0.053563\n"," 0.996062\n"," 0.041855\n"," 1.000000\n"," \n"," \n","\n",""],"text/plain":[" area mean_intensity minor_axis_length \\\n","area 1.000000 0.548612 0.890649 \n","mean_intensity 0.548612 1.000000 0.657131 \n","minor_axis_length 0.890649 0.657131 1.000000 \n","major_axis_length 0.895282 0.440678 0.664507 \n","eccentricity -0.192147 -0.362592 -0.566486 \n","extent -0.267454 -0.011555 -0.037872 \n","feret_diameter_max 0.916652 0.487183 0.716706 \n","equivalent_diameter_area 0.975964 0.611103 0.937795 \n","bbox-0 -0.066508 0.015188 -0.163017 \n","bbox-1 -0.081937 0.217484 -0.056785 \n","bbox-2 0.034083 0.069184 -0.077817 \n","bbox-3 -0.003961 0.266504 0.015790 \n","\n"," major_axis_length eccentricity extent \\\n","area 0.895282 -0.192147 -0.267454 \n","mean_intensity 0.440678 -0.362592 -0.011555 \n","minor_axis_length 0.664507 -0.566486 -0.037872 \n","major_axis_length 1.000000 0.168454 -0.551362 \n","eccentricity 
0.168454 1.000000 -0.432629 \n","extent -0.551362 -0.432629 1.000000 \n","feret_diameter_max 0.995196 0.103529 -0.517428 \n","equivalent_diameter_area 0.880909 -0.272402 -0.278453 \n","bbox-0 -0.010743 0.257938 -0.076688 \n","bbox-1 -0.128821 -0.060467 0.048511 \n","bbox-2 0.093556 0.253671 -0.128149 \n","bbox-3 -0.057776 -0.076793 0.019310 \n","\n"," feret_diameter_max equivalent_diameter_area \\\n","area 0.916652 0.975964 \n","mean_intensity 0.487183 0.611103 \n","minor_axis_length 0.716706 0.937795 \n","major_axis_length 0.995196 0.880909 \n","eccentricity 0.103529 -0.272402 \n","extent -0.517428 -0.278453 \n","feret_diameter_max 1.000000 0.911211 \n","equivalent_diameter_area 0.911211 1.000000 \n","bbox-0 -0.025173 -0.107059 \n","bbox-1 -0.122607 -0.096706 \n","bbox-2 0.080054 -0.004660 \n","bbox-3 -0.049283 -0.018489 \n","\n"," bbox-0 bbox-1 bbox-2 bbox-3 \n","area -0.066508 -0.081937 0.034083 -0.003961 \n","mean_intensity 0.015188 0.217484 0.069184 0.266504 \n","minor_axis_length -0.163017 -0.056785 -0.077817 0.015790 \n","major_axis_length -0.010743 -0.128821 0.093556 -0.057776 \n","eccentricity 0.257938 -0.060467 0.253671 -0.076793 \n","extent -0.076688 0.048511 -0.128149 0.019310 \n","feret_diameter_max -0.025173 -0.122607 0.080054 -0.049283 \n","equivalent_diameter_area -0.107059 -0.096706 -0.004660 -0.018489 \n","bbox-0 1.000000 0.050957 0.993418 0.053563 \n","bbox-1 0.050957 1.000000 0.032728 0.996062 \n","bbox-2 0.993418 0.032728 1.000000 0.041855 \n","bbox-3 0.053563 0.996062 0.041855 1.000000 "]},"execution_count":24,"metadata":{},"output_type":"execute_result"}],"source":["blobs_statistics.corr()\n"]},{"cell_type":"markdown","metadata":{"id":"23_l5V_DBGrp"},"source":["It can be hard to read in numeric format. 
Is there a better way to look at the data?\n","\n","Below we take a quick shortcut to seaborn to show how the correlation can be displayed as a heatmap."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":620},"executionInfo":{"elapsed":991,"status":"ok","timestamp":1692083686368,"user":{"displayName":"Martin Schätz","userId":"14609383414092679868"},"user_tz":-120},"id":"A0KoCKpwBG9A","outputId":"aa6d5813-d204-496c-adf7-ebf9f97d3395"},"outputs":[{"data":{"text/plain":[""]},"execution_count":25,"metadata":{},"output_type":"execute_result"},{"data":{"image/png":"","text/plain":[""]},"metadata":{},"output_type":"display_data"}],"source":["# calculate the correlation matrix on the numeric columns\n","corr = blobs_statistics.select_dtypes('number').corr()\n","\n","# plot the heatmap\n","sns.heatmap(corr, cmap=\"Blues\", annot=False)"]},{"cell_type":"markdown","metadata":{"id":"fH1zusN7GKCx"},"source":["**Watermark**"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":312,"status":"ok","timestamp":1692083689688,"user":{"displayName":"Martin Schätz","userId":"14609383414092679868"},"user_tz":-120},"id":"iH1jL0baGMJW","outputId":"b40c2332-1e58-487f-d385-12bb80706639"},"outputs":[{"name":"stdout","output_type":"stream","text":["Last updated: 2023-08-24T14:26:10.347260+02:00\n","\n","Python implementation: CPython\n","Python version : 3.9.17\n","IPython version : 8.14.0\n","\n","Compiler : MSC v.1929 64 bit (AMD64)\n","OS : Windows\n","Release : 10\n","Machine : AMD64\n","Processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel\n","CPU cores : 16\n","Architecture: 64bit\n","\n","watermark : 2.4.3\n","numpy : 1.23.5\n","pandas : 2.0.3\n","seaborn : 0.12.2\n","pivottablejs: 0.9.0\n","\n"]}],"source":["from watermark import watermark\n","watermark(iversions=True, 
globals_=globals())\n","print(watermark())\n","print(watermark(packages=\"watermark,numpy,pandas,seaborn,pivottablejs\"))"]}],"metadata":{"colab":{"provenance":[]},"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.9.17"}},"nbformat":4,"nbformat_minor":0}