Definition: Data mining is the search for patterns and statistical dependencies in large datasets. It is often associated with machine learning or knowledge discovery.
Application: DM is applied in biology, medicine, and various industries for personalized recommendations, personalized medicine, etc.
Vision: Tailor treatments to patients’ genetic/molecular properties to increase efficacy. Drugs often work for only a fraction of patients due to genetic variability.
Technological Advances: Genome sequencing has become rapid and scalable, identifying variations among millions of genetic bases.
Goals: Detect correlations between diseases, drug responses, and genetic variations.
\( p = 1 \) recovers the Manhattan distance, \( p = 2 \) recovers the Euclidean distance, and for \( p \rightarrow \infty \), it converges to the Chebyshev distance.
String Similarity (k-mer): Quantifies similarity by representing strings as histograms of k-mer frequencies. Example: the strings "downtown" and "known" can be compared based on their 3-mer substrings.
Graph Comparison Problems: Key problems include determining if two graphs are identical (graph isomorphism) or finding if one graph is contained within another (subgraph isomorphism).
Weisfeiler-Lehman Kernel: Efficient method for graph comparison by iterating over node neighborhoods, compressing them, and relabeling based on sorted labels.
This method is highly scalable and is commonly used in chemoinformatics and bioinformatics.
Data Mining in Genetics: Searches for disease-associated loci in genomes. Challenges include:
Missing Heritability: Many diseases show weak correlations with genetic loci due to small sample sizes, environmental factors, and oversimplified models.
Interaction Search: Efficient algorithms are needed for exploring interactions between millions of genetic loci without exhaustive enumeration.
Graph Kernels: Algorithms for comparing large graphs (like biological networks) are critical in drug discovery and gene interaction studies.
Future Trends: Increasing use of wearable devices, electronic health records, and indirect monitoring through social media is expected to fuel personalized medicine.