Practical assignment: Applying methods of machine learning
Task:
For this assignment, you will select a dataset and apply unsupervised and supervised learning methods to it. The goal of
the assignment is to develop your skills in using machine learning algorithms and analysing the obtained results. The
deliverable is a report that you prepare on completing the assignment.
You must use the Orange tool to carry out the assignment.
Before starting to work on your assignment, you must find and choose a dataset on the web. Some of the well-known
repositories are the following:
When selecting the dataset, take into consideration the following aspects:
• select a dataset that is suitable for classification task;
• it is preferable to select a dataset that is already given in the format of .csv datafile;
• the dataset should be well-documented (there should be information about who created the set, when and what
the data source is);
• the dataset should be of reasonable size (at least 200 data objects);
• the dataset should be well annotated (there should be information about which features are stored and what
they mean);
• the number of features should be between 5 and 15;
• the dataset should be labelled;
• you should avoid datasets in which many features take Boolean values (true/false, 1/0, etc.). It is preferable
to use datasets with continuous and/or discrete (with more than 2 values) feature values;
• you should avoid datasets of unlabelled data (e.g. text corpora and raw images).
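The criteria above can be checked quickly before committing to a dataset. The following is a minimal sketch in plain Python (standard library only), offered as an optional sanity check outside Orange; the function name `check_dataset`, the column names and the sample rows are all hypothetical, and treating a feature with at most two distinct values as "Boolean-like" is an assumption of this sketch.

```python
import csv
import io
from collections import Counter


def check_dataset(rows, label_column, id_column=None):
    """Summarise how a list of dict rows measures up to the selection criteria."""
    # Every column except the label and the (optional) ID is treated as a feature.
    features = [c for c in rows[0] if c not in (label_column, id_column)]
    # Assumption: a feature with at most two distinct values is Boolean-like.
    boolean_like = [f for f in features if len({r[f] for r in rows}) <= 2]
    return {
        "n_objects": len(rows),
        "enough_objects": len(rows) >= 200,
        "n_features": len(features),
        "feature_count_ok": 5 <= len(features) <= 15,
        "labelled": all(r[label_column] not in ("", None) for r in rows),
        "boolean_like_features": boolean_like,
        "class_counts": dict(Counter(r[label_column] for r in rows)),
    }


# Tiny invented sample, only to show the call; a real dataset would be read from a .csv file.
sample_csv = """id,label,f1,f2,f3,f4,f5
1,a,0.5,3,yes,10,2.1
2,b,0.7,4,no,12,2.4
3,a,0.2,5,yes,11,2.2
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
report = check_dataset(rows, label_column="label", id_column="id")
for key, value in report.items():
    print(key, value)
```

For a real file, the rows would come from `csv.DictReader(open(path, newline=""))` instead of the in-memory sample. The documentation-related criteria (author, source, meaning of features) cannot be checked this way and still have to be verified by reading the repository page.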
Part 1 – Pre-processing/Exploring the data
To complete this part of the assignment, you will need to take the following steps:
1. Selecting and describing the dataset based on the information given in the repository/database where the dataset
was located.
2. If the dataset you have acquired from the repository is not in a format that is easy to work with (like a comma-
separated-values, or .csv, file), convert it into the needed format. Your dataset file should consist of an n×d table,
where d is the number of dimensions of the data and n is the number of data objects. Your columns should be
arranged in the following way: data object ID, the class label of the data object, and then the collected feature
values.
3. If the values of any feature are textual values (e.g. yes/no, positive/neutral/negative, etc.), they must be
transformed into numerical values.
4. If some data objects are missing values of features, it is necessary to find a way to obtain them by studying
additional sources of information.
5. Representing your training dataset visually and statistically:
a) you must create at least two 2- or 3-dimensional scatter plots illustrating the separability of classes in your
dataset based on different features; you should avoid using the data object ID as a variable in the
scatter plot;
b) you must create at least 2 histograms showing the separation of classes for the features of interest;
c) you must show 2 distributions for the features of interest;
d) you must calculate statistics on your data (at least the central tendency and the dispersion of the feature values).
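Steps 2, 3 and 5d can be illustrated with a small sketch. It is written in plain Python (standard library only) as an aid to understanding, not as a replacement for doing the work in Orange. All column names, values and the `TEXT_TO_NUMBER` mapping are invented for illustration, and the choice of mean, median and sample standard deviation as the reported statistics is an assumption; other measures of central tendency and dispersion are equally valid.

```python
import statistics

# Hypothetical raw records with the columns in an arbitrary order.
raw_rows = [
    {"height": 1.2, "weight": 10.0, "treated": "yes", "label": "A", "id": 1},
    {"height": 1.8, "weight": 14.0, "treated": "no", "label": "B", "id": 2},
    {"height": 1.5, "weight": 12.0, "treated": "yes", "label": "A", "id": 3},
    {"height": 2.1, "weight": 16.0, "treated": "no", "label": "B", "id": 4},
]

# Step 3: mapping from textual feature values to numbers (assumed encoding).
TEXT_TO_NUMBER = {"treated": {"no": 0, "yes": 1}}

# Step 2: data object ID first, then the class label, then the feature values.
COLUMN_ORDER = ["id", "label", "height", "weight", "treated"]


def prepare(rows):
    """Encode textual feature values and put the columns in the required order."""
    prepared = []
    for row in rows:
        encoded = {k: TEXT_TO_NUMBER.get(k, {}).get(v, v) for k, v in row.items()}
        prepared.append({k: encoded[k] for k in COLUMN_ORDER})
    return prepared


def describe(rows, feature):
    """Step 5d: central tendency and dispersion of one numeric feature."""
    values = [row[feature] for row in rows]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


table = prepare(raw_rows)
height_stats = describe(table, "height")
print(list(table[0].keys()))
print(height_stats)
```

The class label is deliberately left as text here, since only feature values are required to be numeric; the same `describe` call can be repeated per class to compare how a feature is distributed across classes.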
Include the following information in the report:
• description of the dataset (providing references to the sources of information used):
- title, source, author and/or owner of the dataset;
- description of the problem domain of the dataset;
- licensing regarding the dataset (if any);
- how the dataset was collected;
• description of the content of the dataset (providing references to the sources of information used):
- the number of data objects in the dataset;
- the number of classes in the dataset, the meaning of each class and the way classes are represented
(explanation of the labels assigned to classes); if the dataset provides several possible
classifications, then the report must clearly identify which classification is considered in the assignment;
- the number of data objects belonging to each class;
- the number and meaning of features in the dataset, as well as their value types and ranges (this
information should be presented in a table consisting of the feature representation, its meaning, value
type and range of values available in the dataset);
- a snippet of the structure of your datafile in which the columns of your datafile and class labels are
shown together with some data objects;
• conclusions drawn from the analysis of scatter plots, histograms and distributions (from Step 5 in Part 1) about the
separability of your classes (remember to include your graphs in the report). Try to answer the following questions:
- Are the classes in your dataset balanced, or do one or several classes prevail? This is determined
by how many data objects belong to each class.
- Does the visual representation of the data allow the structure of the data to be seen? It is a question of
whether data objects belonging to different classes ar