Chapter 3 Data transformation
3.1 Load data
Because the dataset is well organized in a CSV file, we simply call read.csv(...) to load the wine dataset. Note that it has 12 columns in total (11 attributes and 1 outcome). The first six rows are shown below.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56
## 2 7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20 0.68
## 3 7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26 0.65
## 4 11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16 0.58
## 5 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56
## 6 7.4 0.66 0.00 1.8 0.075 13 40 0.9978 3.51 0.56
## alcohol quality
## 1 9.4 5
## 2 9.8 5
## 3 9.8 5
## 4 9.8 6
## 5 9.4 5
## 6 9.4 5
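The loading step described above can be sketched as follows. This is a minimal example, not necessarily the exact call used for this chapter: the file name is not given in the text, so the six rows printed above are passed to read.csv(...) as inline text, and the names csv_text and wine are our own choices.

```r
# Stand-in for the CSV file: the six rows printed above, as comma-separated text.
# (Assumption: the real file has the same column names and uses commas.)
csv_text <- "fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
7.8,0.76,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5
11.2,0.28,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.4,0.66,0.00,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5"

# read.csv() accepts either a file path or, as here, a `text` argument.
# With a file on disk the call would take the path to that file instead.
wine <- read.csv(text = csv_text)

head(wine)   # print the first six rows
ncol(wine)   # count the columns (11 attributes + 1 outcome)
```

Calling head(...) right after loading is a quick check that the header row and the column separator were parsed as expected.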
3.2 Data statistics
Next, we compute summary statistics (minimum, quartiles, median, mean, and maximum) for each attribute and for the outcome.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200 Median :0.07900 Median :14.00 Median : 38.00
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539 Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500 Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol quality
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
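A table in this format is what base R's summary(...) prints for a data frame of numeric columns. The sketch below assumes that function and, to stay self-contained, uses a small illustrative data frame (wine_sample, a name of our own) holding only the alcohol and quality values from the six rows shown in Section 3.1, so its numbers differ from the full-dataset statistics above.

```r
# Illustrative subset: two columns taken from the six rows shown in Section 3.1.
wine_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# summary() on a data frame reports Min, 1st Qu., Median, Mean, 3rd Qu., Max
# for every numeric column.
stats <- summary(wine_sample)
print(stats)

# The same six statistics for a single column, returned as a named vector.
alcohol_stats <- summary(wine_sample$alcohol)
print(alcohol_stats)
```

On the full dataset the equivalent call would be summary(...) applied to the loaded data frame.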
3.3 Categorize quality column
Here, we cut quality into three factor levels, defined as follows, and store the result in the dataframe as a new column, quality.category:
poor: quality < 5
average: quality = 5 or 6
good: quality > 6
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56
## 2 7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20 0.68
## 3 7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26 0.65
## 4 11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16 0.58
## 5 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56
## 6 7.4 0.66 0.00 1.8 0.075 13 40 0.9978 3.51 0.56
## alcohol quality quality.category
## 1 9.4 5 average
## 2 9.8 5 average
## 3 9.8 5 average
## 4 9.8 6 average
## 5 9.4 5 average
## 6 9.4 5 average
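One way to produce this column is base R's cut(...). The sketch below is an assumption about how the binning could be written, not a record of the exact call used here: the break points -Inf, 4, 6, Inf are our own encoding of the three rules, relying on quality being an integer score, and wine_sample is an illustrative data frame covering the observed range of quality (3 to 8).

```r
# Illustrative data: one row per quality score in the observed range 3..8.
wine_sample <- data.frame(quality = c(3L, 4L, 5L, 6L, 7L, 8L))

# cut() uses right-closed intervals by default, so these breaks give
#   (-Inf, 4] -> poor     (quality < 5)
#   (4, 6]    -> average  (quality = 5 or 6)
#   (6, Inf)  -> good     (quality > 6)
wine_sample$quality.category <- cut(
  wine_sample$quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("poor", "average", "good")
)

print(wine_sample)
table(wine_sample$quality.category)   # number of rows per category
```

Because cut(...) returns a factor, the levels keep the order poor, average, good, which is convenient for later plots and tables.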