Task 2: Machine Learning automatic resolution

Course: Artificial Intelligence
Date: Wednesday, 20 May 2026, 02:49

Description

Use some R packages for machine learning to predict SDG indicators

1. Task 1: Install R language

R is a programming language that gives the user, who is also the programmer, a vast library of mathematical and statistical functions and facilities for manipulating vectors and arrays. You interact with R through a command line where you type instructions, much like an advanced calculator, but you can also save programs and the state of the session.

The R language is free, so it can be downloaded and installed from the official website, http://www.r-project.org/, which also hosts various resources. You can also use RStudio, an Integrated Development Environment (IDE) for R (https://www.rstudio.com/products/rstudio/); RStudio does not bundle R, so R itself must be installed as well.


 

Action:

 Install the R language on your computer

Expected:


2. Task 2: Flash introduction to the R language


 Action:

 Expected:

Initial tests. Try typing:

> 45
> 5*4
> "Hello World"
> c(23, 56, 73)

 

The R language works like a command line console: each expression you type is evaluated and its value is printed. To print a string, simply type it in quotes. R is an interpreted language, so programs are not compiled and no executable file is generated; the R program itself reads the instructions and executes them.

The last instruction above builds a vector of elements, which is returned.

 Action:

 Expected:

> # assigning values to variables
> x=46
> y=5*4
> text="Hello World!"
> vector=c(23, 56, 73)
> aux
> x
> ls()
> x;y
> cat("Value of X: ", x, " Value of Y: ", y)

 

Comments in R start with # and continue to the end of the line. Notice the attempt to use the aux variable without assigning it first: R reports an error because no such object exists. Also note that the type of the variables is not declared. The ls() function lists all the objects that are defined. In this case these are only the x, y, text and vector objects, since the aux object was never created.

The rm(<object>) function allows you to remove an object. When you exit the application, using q(), you can save the working environment so that you can continue later. The penultimate command is actually two commands on a single line. To separate two commands on one line, use a semicolon. In this case, the values of the variables x and y have been requested.

To get a result closer to printf in the C language, you can use the cat function, which prints its arguments in order, so strings and variables are passed interspersed as separate arguments.
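As a small illustration of cat, the sketch below prints the same two values twice. The sep argument and the trailing "\n" do not appear in the commands above; they are shown here only as an optional refinement.

```r
x <- 46
y <- 5 * 4

# By default cat() separates its arguments with a single space
cat("Value of X:", x, "Value of Y:", y, "\n")

# sep = "" gives full control over the spacing, closer to printf in C
cat("x=", x, ", y=", y, "\n", sep = "")
```

Unlike simply typing a variable name, cat does not add a line break by itself, which is why "\n" is passed as the last argument.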

You can redirect R's input and output with functions source(<file with commands>) and sink(<file with output>), but it's just as easy to insert the commands in a text file and simply copy/paste them into R so that the commands are executed.

Files are read and written relative to the working directory, which by default is the directory R was started from.

 Action:

 Expected:

> year <- 1345
> if(year%%4==0 && year%%100!=0 || year%%400==0) "Leap" else "Common"
 

Instead of the = sign, the <- operator was used for the assignment. Both work at the command line, as the earlier examples show, but <- is the conventional assignment operator in R. There is also the -> operator, where the expression is on the left and the variable to be assigned is on the right.

Conditionals can be used in much the same way as in C. The division remainder operator is %%, and integer division is %/%. The logical operators &&, || and ! have the same meaning as in C; R also has & and |, which operate element by element on vectors.
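A short sketch of these operators follows. The numbers are arbitrary, and the element-wise & operator on the last line is an addition for comparison with &&.

```r
17 %% 5     # remainder of the division: 2
17 %/% 5    # integer division: 3

# && and || combine single logical values, as in C
(17 %% 5 == 2) && (17 %/% 5 == 3)   # TRUE

# & and | combine logical vectors element by element
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)   # TRUE FALSE FALSE
```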

In the R language you normally don't need to know the size of variables, as it is a dynamically typed language. In this type of language, the memory occupied by a variable or data structure should not be the programmer's concern. However, there are basic variable modes in R: numeric, complex, logical and character. To obtain the basic mode of a variable, use the mode(<object>) function.
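The four basic modes can be checked directly at the console. The values below are arbitrary examples.

```r
mode(46)               # "numeric"
mode(2 + 3i)           # "complex"
mode(46 > 5)           # "logical"
mode("Hello World!")   # "character"
mode(c(23, 56, 73))    # "numeric": a vector has the mode of its elements
```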

 Action:

 Expected:

> sum=0; i=1
> while(i<=4) { sum=sum+i*i; i=i+1 }
> sum
 

Here while is used in a similar way to the C language. In this case, the first 4 square numbers are added together, giving 30. You can group instructions into blocks with curly braces. An instruction can span several lines of code, but at the console you can only edit one line at a time, so it is preferable to write longer code in a file and copy it into R once it is ready.

 Action:

 Expected:

> Leap <- function(year) year%%4==0 && year%%100!=0 || year%%400==0
> Leap(2344)
 

Defining a function in R is also an assignment, in this case of an expression of the form:

function(<arguments>) expression

The expression can be a block of instructions in curly braces, so that the function can contain more than one command; the value of the function is the value of the last expression evaluated.
Once a function has been defined, it can be used inside other expressions.
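As a sketch of a function whose body is a block, the hypothetical helper days_in_year below (not part of the exercise) reuses the leap year test and returns the number of days in the year.

```r
# days_in_year is an illustrative name, not a built-in R function
days_in_year <- function(year) {
  leap <- (year %% 4 == 0 && year %% 100 != 0) || year %% 400 == 0
  # the value of the last expression is the value returned by the function
  if (leap) 366 else 365
}

days_in_year(2024)   # 366
days_in_year(1900)   # 365 (divisible by 100 but not by 400)
```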

 Action:

 Expected:

> sum<-0
> for(i in 1:4) sum <- sum+i*i
> sum
 

The R language also has for loops. Unlike in C, the loop has no initialisation, condition and increment; the iterator variable simply takes each value of a vector in turn. Note the notation "1:4". The variable i will have the values from 1 to 4, and the expression will be evaluated for each of them.

 Action:

 Expected:

> 1:4
> c(1,2,3,4)
 

These two expressions produce vectors with the same values (1:4 stores them as integers, c(1,2,3,4) as doubles), so the colon notation can be used to quickly build vectors with sequential content.

In R, there is also the repeat loop followed by an expression, which has no exit condition. To leave the loop you must use the "break" instruction, which exists in R with the same meaning as in C; the "next" instruction is equivalent to "continue" in C. Usually, it is not advisable to use this type of instruction in either C or R, so for and while loops should be preferred.
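For completeness, a minimal sketch of repeat with break and next follows. The task (adding the odd numbers up to 9) is an arbitrary example chosen for illustration.

```r
total <- 0
i <- 0
repeat {
  i <- i + 1
  if (i > 9) break          # the only way out of a repeat loop
  if (i %% 2 == 0) next     # skip even numbers, go to the next iteration
  total <- total + i
}
total   # 1 + 3 + 5 + 7 + 9 = 25
```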

 Action:

 Expected:

> vector <- c(12, 45, 66, c(23, 455, 6))
> vector
> sum(vector)
> mean(vector)
 

In R you can easily define vectors. Note that using a vector inside c() does not nest it as a single element: its elements are appended, so the vector above has 6 elements. You can use loops for sums, averages, variances, etc., or simply use the functions already implemented for this purpose, without the need for loops, as in the example.
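The sketch below checks the length of the vector from the example and compares an explicit loop with the built-in functions. The variable names values and total are chosen here for illustration.

```r
values <- c(12, 45, 66, c(23, 455, 6))
length(values)   # 6: the inner vector was flattened into the outer one

# Explicit loop, as one would write it in C
total <- 0
for (v in values) total <- total + v

total                    # 607, the same as sum(values)
total / length(values)   # the same as mean(values)
```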

 Action:

 Expected:

> vector[1]
> vector[vector>25]
> vector[vector>25] <- 0
> vector
 

To access an element of the vector, you can indicate its index (starting at 1). You can also put a condition on the values of the vector inside the brackets, which returns another vector containing only the elements that satisfy the condition. The same notation can be used on the left of an assignment to reset those values, as shown in the third expression.
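To see why a condition works as an index, it helps to look at the intermediate logical vector. This is a sketch using the same values as the example; the names v and mask are illustrative.

```r
v <- c(12, 45, 66, 23, 455, 6)

mask <- v > 25   # one TRUE/FALSE per element
mask             # FALSE TRUE TRUE FALSE TRUE FALSE

v[mask]          # 45 66 455: only the elements where mask is TRUE
which(mask)      # 2 3 5: the positions of those elements

v[mask] <- 0     # assignment through the same index
v                # 12 0 0 23 0 6
```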

 Action:

 Expected:

> matrix <- array(0, dim=c(10,10))
> matrix
> matrix[,1]
> matrix[1,]
 

The code above defines a 2-dimensional array, 10 by 10, with its elements initialized to 0. Arrays are easily created with the array function. The first argument could be a vector with the initial values to place in the array, instead of 0. The second argument, "dim", holds the dimensions of the array. You can define 2, 3 or more dimensions, but usually you don't need more than 3. A position in the array is accessed by placing all the indices inside a single pair of brackets, separated by commas, e.g. matrix[2,3]. This differs from C, where each index has its own pair of brackets.

The array visualization shows the column and row headings, and also reveals a simple notation for extracting part of the array. The index [,1] returns a vector with the first column of the array, while the index [1,] returns a vector with the first row of the array.
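A smaller array with distinct initial values makes the indexing easier to follow. This is a sketch; the name m and the 2 by 3 size are arbitrary.

```r
m <- array(1:6, dim = c(2, 3))   # values are filled column by column
m

m[2, 3]   # element in row 2, column 3: 6
m[, 1]    # first column: 1 2
m[1, ]    # first row: 1 3 5
```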

3. Task 3: Structuring data to train a learning method

Learning methods require information to be structured into elements (also called observations, cases or instances), and each observation must have measured characteristics (also called indicators or properties). The goal is to classify an element, or to estimate one of its indicators, based on its features. The variable you want to know can be binary, discrete or real. We will focus our examples on binary variables, as this is the simplest case of learning. For ease and affinity with statistical methods, we call the characteristics independent variables, the variable we want to know the dependent variable, and the elements observations.

For supervised learning methods to work, there must be past observations that are already classified, so for each of them all the independent variables and the value of the dependent variable must be known. A given method can then be trained to classify unseen observations. This information can be arranged as a table, with the variables in columns and the observations in rows.

To load the data into R, we can place it in a .csv file (data.csv) and read it:

 Action:

 Expected:

> setwd("set working folder")
> data = read.csv("data.csv", header = TRUE, sep=";")
> data


 

This is an example data file with student achievements in a given class. The columns in this file are as follows: Case; Materials; FAs; Interventions; Evaluations; Grade. The first column only has the case ID, which is not relevant, and the last column is the dependent variable with the grade. All the others are the independent variables from which we want to predict the student's grade.

 Action:

 Expected:

> data[,2:5]

> data[,6]


 

As we can see, once the .csv file is loaded it is easy to extract the independent and dependent variables.
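The same steps can be tried without the file. The sketch below reads a few made-up rows (not the real contents of data.csv) that follow the column layout described above, using the text argument of read.csv in place of a file name. The names csv_text, x and y are illustrative.

```r
# Made-up rows for illustration only; the real data.csv is not reproduced here
csv_text <- "Case;Materials;FAs;Interventions;Evaluations;Grade
1;2;1;0;1;2
2;1;0;1;0;1
3;2;1;1;1;2
4;0;0;0;1;1"

data <- read.csv(text = csv_text, header = TRUE, sep = ";")

x <- data[, 2:5]   # independent variables (a data frame with 4 columns)
y <- data[, 6]     # dependent variable, Grade (a vector)

names(x)
y
```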

4. Task 4

Decision trees, K nearest neighbours, Neural networks

4.1. Task 4.1: Decision Trees

Documentation: randomForest function - RDocumentation


Action:

Expected:

> install.packages('randomForest')
> library('randomForest')

 

Installing a package is easy: just select a mirror close to you and wait.

Action:

Expected:

> # ensure that the data is discrete (otherwise the columns are treated as numeric)
> for(i in 2:6) data[,i]<-factor(data[,i])
> test <- randomForest(x=data[,2:5], y=data[,6], ntree=1, importance=TRUE)
> print(test)
> getTree(test, 1) # shows tree information, must be drawn
> test$type # confirm that it was a classification and not a regression
> predict(test, data[,2:5])


 

The method was called with a single decision tree. The printed result shows the confusion matrix and an estimated error rate of 33%. getTree returns the decision tree with one row per node, giving its left and right daughter nodes, or 0 if it is a terminal node, in which case a prediction is given. We can also use predict to classify a specific set of data; in this case we reuse the training data and obtain the predicted value for each case.

Action: repeat with training and test sets
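One possible way to build the training and test sets uses only base R. This is a sketch under assumptions: the toy data frame holds random made-up values with the same column layout as data.csv, and the 70%/30% proportion and the seed are arbitrary choices.

```r
set.seed(42)   # makes the random split reproducible

# Toy data with made-up values; replace with the data read from data.csv
toy <- data.frame(Case = 1:20,
                  Materials = sample(0:2, 20, replace = TRUE),
                  FAs = sample(0:1, 20, replace = TRUE),
                  Interventions = sample(0:1, 20, replace = TRUE),
                  Evaluations = sample(0:1, 20, replace = TRUE),
                  Grade = sample(1:2, 20, replace = TRUE))

# Draw 70% of the row numbers at random for training
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))

train <- toy[train_idx, ]      # rows used to fit the method
test_set <- toy[-train_idx, ]  # remaining rows, kept for evaluation

nrow(train)      # 14
nrow(test_set)   # 6
```

The method is then fitted on train[,2:5] and train[,6], and predict is called on test_set[,2:5]; comparing the result with test_set[,6] estimates how well the method classifies observations it has not seen.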

4.2. Task 4.2: K nearest neighbours

Documentation: knn3 function - RDocumentation; knnreg function - RDocumentation



Action:

Expected:

> install.packages('caret')
> library('caret')

> test <- knn3(x=data[,2:5], y=factor(data[,6]), k=3)
> print(test)  # shows number of cases for each value
> predict(test, data[,2:5])

 

The prediction returns, for each case, the probability of each classification.
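To compare such probabilities with the known grades, each row can be reduced to its most probable class. The sketch below uses a hypothetical probability matrix and hypothetical actual values, not real output from knn3, and relies only on base R (max.col and table).

```r
# Hypothetical probabilities, one row per case and one column per class
probs <- matrix(c(0.67, 0.33,
                  0.00, 1.00,
                  0.33, 0.67),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("1", "2")))

# For each row, take the name of the column with the highest probability
pred <- colnames(probs)[max.col(probs, ties.method = "first")]
pred   # "1" "2" "2"

actual <- c("1", "2", "1")   # hypothetical known classes

table(predicted = pred, actual = actual)   # confusion matrix
mean(pred == actual)                       # proportion of correct predictions
```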

Action: repeat with training and test sets

4.3. Task 4.3: Neural networks

Documentation: nnet function - RDocumentation

Action:

Expected:

> install.packages('nnet')
> library('nnet')

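> # if the columns were converted to factors in Task 4.1, reload the data so they are numeric again
> data = read.csv("data.csv", header = TRUE, sep=";")
> # subtract 1 from y to ensure values are between 0 and 1
> test <- nnet(x=data[,2:5], y=data[,6]-1, size=1)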
> print(test)     # network detail (weights) isn't shown
> test$wts        # weights
> predict(test, data[,2:5])


 

The prediction gives, for each case, a value between 0 and 1 that can be read as the probability of belonging to the category coded as 1.
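A single probability per case can be turned into a class with a threshold. In this sketch the values in p are hypothetical, not real output from nnet, and the 0.5 cut-off is an assumption that may need adjusting.

```r
p <- c(0.91, 0.12, 0.55, 0.40)   # hypothetical network outputs

# Cases above the threshold are assigned to the category coded as 1
predicted <- ifelse(p > 0.5, 1, 0)
predicted   # 1 0 1 0
```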

Action: repeat with training and test sets