Chapter 3 Data transformation

3.1 Load data

The dataset is stored as a well-organized CSV file, so a single call to read.csv(...) is enough to load the wine data. The resulting data frame has 12 columns in total: 11 attributes and 1 outcome (quality). The first six rows are shown below.

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 1           7.4             0.70        0.00            1.9     0.076                  11                   34  0.9978 3.51      0.56
## 2           7.8             0.88        0.00            2.6     0.098                  25                   67  0.9968 3.20      0.68
## 3           7.8             0.76        0.04            2.3     0.092                  15                   54  0.9970 3.26      0.65
## 4          11.2             0.28        0.56            1.9     0.075                  17                   60  0.9980 3.16      0.58
## 5           7.4             0.70        0.00            1.9     0.076                  11                   34  0.9978 3.51      0.56
## 6           7.4             0.66        0.00            1.8     0.075                  13                   40  0.9978 3.51      0.56
##   alcohol quality
## 1     9.4       5
## 2     9.8       5
## 3     9.8       5
## 4     9.8       6
## 5     9.4       5
## 6     9.4       5
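The loading step can be sketched as follows. This is a minimal, self-contained sketch rather than the exact code used for this chapter: the file name `winequality-red.csv` and the variable name `wine` are assumptions, and a two-row stand-in file (copied from the first two rows shown above) is written to a temporary directory so the example runs on its own.

```r
# Hypothetical file name and location; the actual path is not given in the text.
csv_path <- file.path(tempdir(), "winequality-red.csv")

# Write a two-row stand-in file (values taken from rows 1 and 2 above)
# so that this sketch is runnable without the real dataset.
writeLines(c(
  paste("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar",
        "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide",
        "density", "pH", "sulphates", "alcohol", "quality", sep = ","),
  "7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5",
  "7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5"
), csv_path)

# read.csv() assumes a header row and comma-separated fields by default.
wine <- read.csv(csv_path)

ncol(wine)   # number of columns: 11 attributes plus the quality outcome
head(wine)   # first rows of the data frame (at most six)
```

With the real file, only `csv_path` would need to change; if the file uses a different separator, the `sep` argument of read.csv() would have to be set accordingly.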

3.2 Data statistics

Next, we compute summary statistics (minimum, quartiles, median, mean, and maximum) for each attribute and for the outcome.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar     chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900   Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200   Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539   Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500   Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol         quality     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000
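Output in this layout is what base R's summary() prints for a data frame of numeric columns. The sketch below assumes that function was used; the data frame `wine_head` is a hypothetical stand-in holding only the alcohol and quality values of the six rows shown in section 3.1, so its numbers will differ from the full-dataset statistics above.

```r
# Stand-in data: alcohol and quality for the six rows displayed in section 3.1.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5, 5, 5, 6, 5, 5)
)

# summary() on a data frame returns a per-column table of
# Min., 1st Qu., Median, Mean, 3rd Qu., and Max. for numeric columns.
stats <- summary(wine_head)
print(stats)

# The same statistics for a single column, as a named numeric vector.
quality_stats <- summary(wine_head$quality)
print(quality_stats)
```

Calling summary() on the full data frame loaded earlier would give one block of six statistics per column, as in the output above.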

3.3 Categorize quality column

Here, we bin quality into three factor levels, defined as follows, and store the result in the data frame as a new column, quality.category:

  • poor: quality < 5

  • average: quality = 5 or 6

  • good: quality > 6

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 1           7.4             0.70        0.00            1.9     0.076                  11                   34  0.9978 3.51      0.56
## 2           7.8             0.88        0.00            2.6     0.098                  25                   67  0.9968 3.20      0.68
## 3           7.8             0.76        0.04            2.3     0.092                  15                   54  0.9970 3.26      0.65
## 4          11.2             0.28        0.56            1.9     0.075                  17                   60  0.9980 3.16      0.58
## 5           7.4             0.70        0.00            1.9     0.076                  11                   34  0.9978 3.51      0.56
## 6           7.4             0.66        0.00            1.8     0.075                  13                   40  0.9978 3.51      0.56
##   alcohol quality quality.category
## 1     9.4       5          average
## 2     9.8       5          average
## 3     9.8       5          average
## 4     9.8       6          average
## 5     9.4       5          average
## 6     9.4       5          average
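The binning described above can be sketched with base R's cut(). This is one possible implementation under stated assumptions, not necessarily the code used here: the break points are chosen to match the three rules for integer scores, and the data frame `scores` is a hypothetical stand-in covering the observed range of quality (3 to 8).

```r
# Stand-in data: one row per quality score in the observed range 3..8.
scores <- data.frame(quality = c(3, 4, 5, 6, 7, 8))

# cut() uses right-closed intervals by default, so these breaks give
# (-Inf, 4] -> poor, (4, 6] -> average, (6, Inf) -> good,
# which matches quality < 5, quality = 5 or 6, and quality > 6
# for integer-valued scores.
scores$quality.category <- cut(
  scores$quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("poor", "average", "good")
)

print(scores)
table(scores$quality.category)  # number of rows in each category
```

The result is a factor with levels in the order poor, average, good, which keeps the categories in a sensible order in later tables and plots.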