Sophos and ReversingLabs have released SoReL-20M, a dataset covering 20 million Windows Portable Executable (PE) files, including 10 million malware samples.
The SoReL-20M dataset includes curated and labeled samples along with security-relevant metadata, and it can serve as training data for the machine learning engines used in anti-malware solutions.
It contains metadata, labels, and features for all 20 million files. The 10 million malware samples have been disarmed and are available for download to support research on feature extraction and to drive industry-wide improvements in security.
SoReL-20M is the first production-scale malware research dataset to be made publicly available for research on malware detection via machine learning.
According to the experts, large numbers of curated and labeled samples are very expensive and difficult to obtain. Most work on malware detection is based on private, internal datasets that cannot be shared, so the results cannot be directly compared with each other.
For each sample, the dataset provides features extracted based on the EMBER 2.0 dataset, labels, and detection metadata; complete binaries are provided for the malware samples.
The experts also released a set of PyTorch (https://pytorch.org/) and LightGBM (https://github.com/Microsoft/LightGBM) models pre-trained on this dataset.
Sophos also released scripts to load and iterate over the data, as well as to load, train, and test the models.
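To give a feel for what loading and iterating over labeled samples involves at this scale, here is a minimal sketch in Python. It is not the Sophos tooling: the `samples` table, its `sha256` and `is_malware` columns, and the helper names are all hypothetical, and a tiny in-memory SQLite table stands in for the real metadata. The point is only that records are streamed in fixed-size batches rather than loaded all at once.

```python
import sqlite3
from typing import Iterator, List, Tuple

Row = Tuple[str, int]  # (sha256, is_malware); hypothetical record layout


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory table standing in for a sample-metadata store.

    The table name and columns are assumptions made for this illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (sha256 TEXT PRIMARY KEY, is_malware INTEGER)"
    )
    # Ten fake records: zero-padded hex IDs, alternating benign/malware labels.
    rows = [(f"{i:064x}", i % 2) for i in range(10)]
    conn.executemany("INSERT INTO samples VALUES (?, ?)", rows)
    conn.commit()
    return conn


def iter_batches(conn: sqlite3.Connection, batch_size: int) -> Iterator[List[Row]]:
    """Yield rows in fixed-size batches so memory use stays bounded."""
    cursor = conn.execute(
        "SELECT sha256, is_malware FROM samples ORDER BY sha256"
    )
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


def count_labels(conn: sqlite3.Connection, batch_size: int = 4) -> Tuple[int, int]:
    """Return (malware_count, benign_count) by streaming over all batches."""
    malware = benign = 0
    for batch in iter_batches(conn, batch_size):
        for _sha256, is_malware in batch:
            if is_malware:
                malware += 1
            else:
                benign += 1
    return malware, benign


if __name__ == "__main__":
    demo = build_demo_db()
    print(count_labels(demo))
```

In a real training pipeline, each batch would be joined with its feature vectors and passed to a model instead of just being counted.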
Publicly available training sets like SoReL-20M could, however, be misused by sophisticated attackers to create new threats. Sophos countered that well-resourced attackers could already have access to easy-to-use and cost-effective malware datasets.
Sophos therefore considers it necessary to give security researchers this dataset and help them build a new generation of tools that could be effective for malware detection.
ReversingLabs said that the introduction of machine learning technologies represents a significant step for threat detection, and that these systems are only as good as the datasets they have access to.
All this data gives customers a well-defined set of threat intelligence to leverage in their defenses and in their threat hunting programs, both to block active attacks and to search for threats that may otherwise be invisible to the traditional security stack.
Image Credits : Sophos