UNR-IDD Dataset
IMPORTANT
You may redistribute, republish, and mirror the UNR-IDD dataset in any form. However, any use or redistribution of the data must include a citation to the UNR-IDD paper, available at the following link: https://ieeexplore.ieee.org/abstract/document/10059640
Please click here to download the dataset
Abstract
With the expanded applications of modern-day networking, network infrastructures are at risk from cyber-attacks and intrusions. Multiple datasets have been proposed in the literature that can be used to create Machine Learning (ML) based Network Intrusion Detection Systems (NIDS). However, many of these datasets yield sub-optimal model performance and do not adequately and effectively represent all types of intrusions. Another problem with these datasets is low classification accuracy on tail classes. To address these issues, we propose the University of Nevada - Reno Intrusion Detection Dataset (UNR-IDD), which provides researchers with a wider range of samples and scenarios.
Introduction
Modern computer networks and their connected applications have reached unprecedented growth with implementations like the Internet of Things, smart homes, and software-defined networks. However, this has also increased the potential for network intrusions, which are a continuous threat to network infrastructures as they attempt to compromise the major principles of computing systems: availability, authority, confidentiality, and integrity. These threats are difficult to detect unaided, as they exhibit network traffic patterns indistinguishable from normal functionality. To provide enhanced protection against intrusions, the use of machine learning for NIDS has gained traction in the last decade as various open-sourced datasets have been proposed and established by multiple research groups globally.
However, a common problem that has been identified with many of these datasets is inadequate modeling of tail classes. Another limitation of the current datasets is that they mostly depend on flow level statistics, which can limit the transferability of the NIDS solutions to other network configurations. Lastly, some existing datasets suffer from incomplete or missing records. These records or samples must be ignored or dropped from the overall dataset, which leads to sub-optimal performance.
To address these limitations, we propose the University of Nevada - Reno Intrusion Detection Dataset (UNR-IDD), a NIDS dataset. The main difference between UNR-IDD and existing datasets is that UNR-IDD consists primarily of network port statistics: the observed port metrics recorded at switch/router ports within a networking environment. The dataset also includes delta port statistics, which indicate the change in magnitude of the observed port statistics within a time interval. Compared to datasets that primarily use flow-level statistics, these port statistics provide a fine-grained analysis of network flows at the port level, where decisions are made, rather than at the flow level. This can lead to rapid identification of potential intrusions. We also address the tail-class limitation: the dataset ensures that every class has enough samples for ML classifiers to achieve high F-Measure scores on each. Our proposed dataset further ensures that there are no missing network metrics and that all data samples are complete.
Data Collection Setup
Testbed Configuration
To set up the testbed, we use the Open Network Operating System (ONOS) SDN controller (API version 2.5.0) alongside Mininet for the network topology generation. ONOS uses the Open Service Gateway Initiative (OSGi) service component at runtime for the creation and activation of components and for auto-wiring components together, which makes it easy to create and deploy new user-defined components without altering the core constituents. Mininet creates the desired virtual network and runs real kernel, switch, and application code on a single machine, thereby generating a realistic testbed environment. We also implemented our own ONOS application (component) to collect network statistics. Specifically, we gathered delta and cumulative port, flow entry, and flow table statistics for each connected Open vSwitch in the Mininet topology. We created a custom Mininet topology using the Mininet API (version 2.3.0) with the OpenFlow 1.4 (OF14) protocol deployed to the switches. The generated SDN topology consists of 10 hosts and 12 switches, as presented in the associated figure.
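The exact wiring of the 10 hosts and 12 switches is shown only in the associated figure, so the sketch below is purely illustrative: it builds a hypothetical topology of the same size (a ring of switches with hosts attached), expressed as plain host/switch/link lists that would map directly onto `addHost`/`addSwitch`/`addLink` calls in a Mininet `Topo` subclass.

```python
# Illustrative sketch only: the real UNR-IDD topology layout differs and is
# given in the paper's figure. Names h1..h10 and s1..s12 are assumptions.

def build_topology(num_hosts=10, num_switches=12):
    """Return hosts, switches, and links for a Mininet-style custom topology."""
    hosts = [f"h{i}" for i in range(1, num_hosts + 1)]
    switches = [f"s{i}" for i in range(1, num_switches + 1)]
    links = []
    # Connect the switches in a ring (hypothetical layout).
    for i in range(num_switches):
        links.append((switches[i], switches[(i + 1) % num_switches]))
    # Attach each host to one switch, round-robin.
    for i, h in enumerate(hosts):
        links.append((h, switches[i % num_switches]))
    return hosts, switches, links
```

In an actual Mininet script, these lists would be consumed inside a `Topo.build` method, with the switches started with `protocols=OpenFlow14` to match the testbed configuration described above.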
Flow Simulation
iperf is used to create TCP and UDP data streams simulating network flows in virtual and real networks using dummy payloads. Using the Mininet API and iperf, we created a Python script to simulate realistic network flows. Once every 5 seconds, we initiated iperf traffic between a randomly chosen source-destination host pair with a bandwidth of 10 Mbps and a duration of 5 seconds. We then simulated flows under both normal and intrusion conditions to gather data in every possible scenario. To ensure that each normal and anomalous scenario is adequately and evenly represented, we execute the same number of flows while simulating each scenario.
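The flow-generation loop described above can be sketched as follows. This is a minimal illustration, not the authors' script: it only shows the random pair selection and the shape of the iperf server/client invocations (the `-b 10M` target-bandwidth flag applies to UDP streams; the function and variable names are assumptions).

```python
import random

def pick_pair(hosts, rng=random):
    """Choose a distinct (source, destination) host pair at random."""
    src, dst = rng.sample(hosts, 2)
    return src, dst

def iperf_commands(dst_ip, udp=False):
    """Build server/client iperf argument lists for one 5-second flow."""
    client = ["iperf", "-c", dst_ip, "-t", "5"]   # -t 5: 5-second flow
    if udp:
        client += ["-u", "-b", "10M"]             # -b: UDP target bandwidth
    server = ["iperf", "-s"] + (["-u"] if udp else [])
    return server, client
```

In a Mininet script, the server command would be run on the destination host and the client command on the source host every 5 seconds, e.g. via `host.cmd(...)`.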
Data Collection
We create a custom application to collect and log the available statistics, which are captured periodically (once every 5 seconds) from OpenFlow (OF) switches. The statistics are collected by means of OFPPortStatsRequest and OFPPortStatsReply messages between the controller and the switches. The delta port statistics are computed on the controller side by taking the difference between the last two collected data instances. We create a key-value map of this data by gathering it from the data storage service, using the "Device Service" API provided by ONOS. After this, we log the map of the collected statistics to a JavaScript Object Notation (.json) file with the name N_i.json.
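The controller-side bookkeeping can be illustrated with the short sketch below: cumulative counters arrive with each stats reply, deltas are the element-wise difference of the last two snapshots, and the merged map is written to an N_i.json file. The counter names and JSON keys here are illustrative assumptions, not the exact keys used by the ONOS application.

```python
import json

def compute_deltas(previous, current):
    """Delta port statistics: change in each counter over one interval."""
    return {k: current[k] - previous.get(k, 0) for k in current}

def log_snapshot(i, cumulative, deltas, directory="."):
    """Write one collection interval to a file named N_i.json."""
    record = {"port_stats": cumulative, "delta_port_stats": deltas}
    path = f"{directory}/N_{i}.json"
    with open(path, "w") as f:
        json.dump(record, f)
    return path
```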
Captured Data Features
Port Statistics
The corresponding table shows the collected port statistics and their descriptions per port on every switch in the simulated SDN. These statistics relay the collected metrics and magnitudes from every single port within the SDN when a flow is simulated between two hosts.
Delta Port Statistics
The corresponding table illustrates the collected delta port statistics and their descriptions per port on every switch. These delta statistics are used to capture the change in collected metrics from every single port within the SDN when a flow is simulated between two hosts. The time interval for these observed metrics is configured as 5 seconds, which can provide greater detail in detecting intrusions.
Flow Entry and Flow Table Statistics
Additionally, we also collect some flow entry and flow table statistics to work in conjunction with the collected port statistics as seen in the corresponding table. These metrics provide information about the conditions of switches in the network and can be collected in any network setting.
Labels
The following intrusions are simulated in the data collection phase:
TCP-SYN Flood: A Distributed Denial of Service (DDoS) attack where attackers target hosts by initiating many Transmission Control Protocol (TCP) handshake processes without waiting for the response from the target node. By doing so, the target device's resources are consumed as it has to keep allocating some memory space for every new TCP request.
Port scan: An attack in which attackers scan available ports on a host device to learn information about the services, versions, and even security mechanisms that are running on that host.
Flow Table Overflow: An attack that targets network switches/routers, in which attackers compromise the functionality of a switch/router by filling its packet-forwarding flow tables with illegitimate flow entries and rules so that legitimate flow entries and rules cannot be installed.
Blackhole: An attack that targets network switches/routers, causing them to discard the packets that pass through instead of relaying them to the next hop.
Traffic Diversion: An attack that targets network switches/routers to reroute packets away from their destination, intending to increase travel time and/or spy on network traffic through a man-in-the-middle scenario.
These intrusion types were selected for this dataset as they are common cyber attacks that can occur in any networking environment. Also, these intrusion types cover attacks that can be launched on both network devices and end hosts. This dataset can be broken down into two different machine learning classification problems: binary and multi-class classification.
Binary Classification
The goal of binary classification is to differentiate intrusions from normal working conditions. Binary classification can estimate if a network is under attack but does not provide any information about the type of attack. The labels for binary classification in UNR-IDD are illustrated in the accompanying table.
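Concretely, the binary task can be obtained by collapsing the multi-class labels: every intrusion label maps to a single attack class, and only normal operation keeps its own class. The label strings below are assumptions for illustration; the released dataset may use different spellings.

```python
# Hypothetical label strings; the released CSV may spell these differently.

def to_binary(label):
    """Map a multi-class UNR-IDD label onto the binary scheme."""
    return "Normal" if label == "Normal" else "Attack"

multiclass = ["Normal", "TCP-SYN", "PortScan", "Overflow",
              "Blackhole", "Diversion"]
binary = [to_binary(x) for x in multiclass]
```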
Multi-class Classification
The goal of multi-class classification is to differentiate the intrusions not only from normal working conditions but also from each other. Multi-class classification helps us to learn about the root causes of network intrusions. The labels for multi-class classification in UNR-IDD are illustrated in the accompanying table.
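To make the per-class F-Measure discussion concrete, the sketch below computes a one-vs-rest F1 score for each label in plain Python: for every class, precision and recall are computed against all other classes and combined as a harmonic mean. This is a standard metric computation, not code from the dataset release; the labels in the usage example are illustrative.

```python
def per_class_f1(y_true, y_pred):
    """One-vs-rest F1 score for each class appearing in y_true."""
    scores = {}
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[cls] = (2 * precision * recall / (precision + recall)
                       if precision + recall else 0.0)
    return scores
```

Averaging these per-class scores (macro-F1) weights every class equally, which is exactly why a well-represented tail class matters: a classifier cannot reach a high macro-F1 while ignoring rare intrusion types.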