Practical assignment: Applying methods of machine learning

Task:

For this assignment, you will select a dataset and apply unsupervised and supervised learning methods to it. The goal of the assignment is to develop your skills in using machine learning algorithms and analysing the results obtained. The deliverable is a report that you prepare on completion of the assignment.

To complete the assignment, you must use the Orange tool.

Before starting to work on your assignment, you must find and choose a dataset on the web. Some of the well-known

repositories are the following:

When selecting the dataset, take into consideration the following aspects:

• select a dataset that is suitable for a classification task;

• it is preferable to select a dataset that is already provided as a .csv file;

• the dataset should be well-documented (there should be information about who created the set, when and what

the data source is);

• the dataset should be of reasonable size (at least 200 data objects);

• the dataset should be deeply annotated (there should be information about which features are stored and what

they mean);

• the number of features should be between 5 and 15;

• the dataset should be labelled;

• you should avoid datasets in which many features are Boolean (true/false, 1/0, etc.). It is preferable to use datasets with continuous and/or discrete (more than 2 values) features;

• you should avoid datasets of unlabelled data (e.g. text corpora and raw images).
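
The selection criteria above can be checked mechanically before committing to a dataset. The following Python sketch is only an illustration and is not part of the required Orange workflow. The function names (`load_rows`, `check_dataset`), the sample column names, and the rule that a feature with at most two distinct values counts as Boolean-like are assumptions made for this example; the thresholds (200 data objects, 5 to 15 features) come from the list above.

```python
import csv
import io

MIN_OBJECTS = 200         # "at least 200 data objects"
FEATURE_RANGE = (5, 15)   # "between 5 and 15" features


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def check_dataset(rows, label_column, id_column=None):
    """Check a list of row dicts against the selection criteria."""
    if not rows:
        raise ValueError("dataset is empty")
    skip = {label_column, id_column}
    features = [name for name in rows[0] if name not in skip]
    # Assumption: a feature with at most two distinct values is Boolean-like.
    binary = [f for f in features if len({row[f] for row in rows}) <= 2]
    low, high = FEATURE_RANGE
    return {
        "n_objects": len(rows),
        "enough_objects": len(rows) >= MIN_OBJECTS,
        "n_features": len(features),
        "feature_count_ok": low <= len(features) <= high,
        # Every data object must carry a non-empty class label.
        "fully_labelled": all((row[label_column] or "").strip() for row in rows),
        "binary_features": binary,
    }


# Tiny in-memory sample (hypothetical data) to show the call.
sample = "id,class,f1,f2\n1,a,0.5,yes\n2,b,0.7,no\n3,a,0.9,yes\n"
report = check_dataset(load_rows(sample), label_column="class", id_column="id")
print(report)
```

The report only lists which features look Boolean-like; deciding whether that is "a lot" for a given dataset is left to the reader, since the criteria do not give a threshold.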

Part 1 – Pre-processing/Exploring the data

To complete this part of the assignment, you will need to take the following steps:

1. Selecting and describing the dataset based on the information given in the repository/database where the dataset was located.

2. If the dataset you have acquired from the repository is not in a format that is easy to work with (such as a comma-separated values, or .csv, file), convert it into such a format. Your dataset file should consist of an n×d table, where n is the number of data objects and d is the number of dimensions of the data. Your columns should be arranged in the following way: data object ID, the class label of the data object, and then the collected feature values.

3. If the values of any feature are textual values (e.g. yes/no, positive/neutral/negative, etc.), they must be

transformed into numerical values.

4. If some data objects are missing values of features, it is necessary to find a way to obtain them by studying

additional sources of information.

5. Representing your training dataset visually and statistically:

a) you must create at least two 2- or 3-dimensional scatter plots illustrating the separability of classes in your dataset based on different features; avoid using the data object ID as a variable in the scatter plots;

b) you must create at least 2 histograms showing the separation of classes for the features of interest;

c) you must show 2 distributions for the features of interest;

d) you must calculate statistics on your data (at least the central tendency and the dispersion of the feature values).
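
Steps 2, 3 and 5d can be cross-checked outside Orange with a few lines of Python. This is a minimal sketch under stated assumptions, not the required workflow: the helper names (`encode_textual`, `to_table`, `describe`), the sample column names and the sample values are all hypothetical, and the standard deviation is used as one possible measure of dispersion.

```python
import statistics


def encode_textual(values, mapping=None):
    """Map textual feature values to numbers (step 3).

    Without a mapping, distinct values are numbered in sorted order;
    for ordered categories an explicit mapping is the safer choice.
    """
    if mapping is None:
        mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]


def to_table(rows, id_column, label_column):
    """Arrange rows as ID, class label, then feature values (step 2)."""
    features = [c for c in rows[0] if c not in (id_column, label_column)]
    header = [id_column, label_column] + features
    return [header] + [[row[c] for c in header] for row in rows]


def describe(values):
    """Central tendency and dispersion of one numeric feature (step 5d)."""
    values = [float(v) for v in values]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation needs at least two values.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


# Hypothetical rows, as csv.DictReader would produce them.
rows = [
    {"petal": "1.4", "mood": "negative", "id": "1", "class": "a"},
    {"petal": "4.7", "mood": "positive", "id": "2", "class": "b"},
    {"petal": "5.1", "mood": "neutral", "id": "3", "class": "b"},
]
order = {"negative": 0, "neutral": 1, "positive": 2}
codes = encode_textual([r["mood"] for r in rows], order)
for row, code in zip(rows, codes):
    row["mood"] = code

table = to_table(rows, id_column="id", label_column="class")
print(table[0])
print(describe([r["petal"] for r in rows]))
```

An explicit mapping is passed for the ordered values (negative/neutral/positive) so that the numeric codes preserve their order; the default sorted-order numbering is only reasonable for unordered categories such as yes/no.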

Include the following information in the report:

• description of the dataset (providing references to the sources of information used):

- title, source, author and/or owner of the dataset;

- description of the problem domain of the dataset;

- licensing regarding the dataset (if any);

- how the dataset was collected;

• description of the content of the dataset (providing references to the sources of information used):

- the number of data objects in the dataset;

- the number of classes in the dataset, the meaning of each class and the way the classes are represented (an explanation of the labels assigned to classes); if the dataset provides several possible classifications, the report must clearly identify which classification is considered in the assignment;

- the number of data objects belonging to each class;

- the number and meaning of features in the dataset, as well as their value types and ranges (this

information should be presented in a table consisting of the feature representation, its meaning, value

type and range of values available in the dataset);

- a snippet of the structure of your datafile in which the columns of your datafile and class labels are

shown together with some data objects;

• conclusions drawn from the analysis of scatter plots, histograms and distributions (from Step 5 in Part 1) about the separability of your classes (remember to include your graphs in the report). Try to answer the following questions:

- Are the classes in your dataset balanced, or do one or several classes prevail? This is determined by how many data objects belong to each class.

- Does the visual representation of the data allow the structure of the data to be seen? It is a question of

whether data objects belonging to different classes ar
