{% extends "layout.html" %} {% block content %}
Story-style intuition: The Efficient Librarian
Imagine our Supermarket Detective (from the Apriori guide) has a new colleague, an efficient librarian. The detective uses a "horizontal" approach: they go through each shopping receipt one by one to see what's inside. The librarian uses a "vertical" approach. Instead of looking at receipts, they create an index card for every single item in the store. On the card for "Milk," they simply list the ID number of every receipt that contains milk. To find out how many people bought {Milk, Bread} together, they just take the two cards and find the common receipt IDs. This is the core idea of Eclat. It's often much faster because finding common IDs between two lists is a very quick operation.
The Eclat Algorithm (Equivalence Class Clustering and bottom-up Lattice Traversal) is an efficient algorithm for frequent itemset mining. Unlike Apriori, which scans the database horizontally (transaction by transaction), Eclat uses a vertical data format and finds frequent itemsets by intersecting transaction ID lists. This approach can be significantly faster, especially for dense datasets.
Example:
TID(Milk) = {T1, T3, T4}
TID(Bread) = {T1, T2, T4, T5}
The Librarian's Smart Trick: The librarian's method for finding the support of a combined itemset is incredibly fast. To find the support of {Milk, Bread}, they don't need to look at any receipts. They just take the two index cards and find the common numbers (the intersection).
The core principle of Eclat is that the support of a larger itemset can be computed directly by intersecting the tidsets of its smaller subsets. The size of the resulting intersection is the support count.
$$ \text{Support}(X \cup Y) = |TID(X) \cap TID(Y)| $$
Example:
TID(Milk) = {T1, T3, T4}
TID(Bread) = {T1, T2, T4, T5}
TID({Milk, Bread}) = TID(Milk) ∩ TID(Bread) = {T1, T4}
Support({Milk, Bread}) = |{T1, T4}| = 2.
Eclat uses a depth-first search (DFS) approach to explore the search space of itemsets.
| Feature | Eclat | Apriori |
|---|---|---|
| Data Format | Vertical (Item → {TID1, TID2, ...}) | Horizontal (TID → {Item1, Item2, ...}) |
| Search Method | Depth-First Search (DFS) | Breadth-First Search (BFS) |
| Main Operation | Tidset intersection. | Candidate generation and database scanning. |
| Performance | Generally faster, especially on dense datasets. | Can be slow due to repeated database scans and large candidate sets. |
Since Eclat is less common in libraries like `scikit-learn`, here's a conceptual Python example using a library called `pyECLAT`. The logic mirrors the algorithm steps: we prepare the data, create an Eclat object, and call `fit()` to get the frequent itemsets.
# NOTE: You would need to install pyECLAT first: pip install pyECLAT
import pandas as pd
from pyECLAT import ECLAT
# --- 1. Create a Sample Dataset in the right format ---
# The data is a DataFrame where each row is a transaction.
# NaN values are used for padding.
data = {'Transaction': [1, 2, 3, 4, 5],
'Items': [['Milk', 'Beer', 'Diapers'],
['Bread', 'Butter', 'Milk'],
['Beer', 'Diapers', 'Milk', 'Cola'],
['Bread', 'Butter', 'Beer', 'Diapers'],
['Bread', 'Butter']]}
df = pd.DataFrame(data)
# --- 2. Initialize and Run the Eclat Algorithm ---
# We create an ECLAT object from our transactions data.
eclat_instance = ECLAT(data=df['Items'])
# You can see the binary (one-hot encoded) format it uses internally
# print(eclat_instance.df_bin)
# --- 3. Find Frequent Itemsets ---
# We set min_support to 0.4, meaning itemsets in at least 2 of the 5 transactions.
# The 'fit' method does all the work of intersecting tidsets.
min_support = 0.4
rule_indices, rule_supports = eclat_instance.fit(min_support=min_support,
min_combination=1, # Min number of items in an itemset
max_combination=3) # Max number of items
print("--- Frequent Itemsets (Support >= 40%) ---")
print(rule_supports)
1. Apriori scans data horizontally (it reads each transaction to see what items it contains). Eclat uses a vertical format (it looks at each item to see which transactions it appeared in).
2. The support is the size of the intersection of the tidsets: |{1, 2, 5} ∩ {2, 4, 5, 6}| = |{2, 5}| = 2.
3. The main disadvantage is high memory consumption, as the tidsets for very frequent items can become extremely large lists containing millions of transaction IDs.
4. Eclat uses a Depth-First Search (DFS) strategy to traverse the lattice of potential itemsets.
The Story: Decoding the Efficient Librarian's Index