MEGA DatA Lab

Model Estimation, Graphical Algorithms, Data Analysis Lab

EECS, University of California, Irvine, CA 92697.


Learning hidden group dynamics via conditional latent tree models (CLTM)

“Are you going to the party: Depends, who else is coming? [Learning hidden group dynamics via conditional latent tree models]” by F. Arabshahi, F. Huang, A. Anandkumar, C. T. Butts and S. M. Fitzhugh, appearing in the Proceedings of the International Conference on Data Mining (ICDM), IEEE, Nov 2015.
Download: ArXiv version.

The goal of this project is to track and predict the evolution of time series. Let us give some examples of such time-evolving data. Consider the performance of students in an online course (like a course on Coursera). Students answer questions on different topics throughout the course. From their answers and the topics they have worked on, we can track and predict their learning behavior. This information is valuable to teachers as the course progresses, since it shows which topics need more time and which topics are easier for the students to follow. As another example, consider how the participants in an online social network like Twitter change through time. We can model a Twitter network by a social graph in which the participants are the nodes and the communication ties, here direct messages between participants, are the edges. As expected, the nodes and the edges of such a graph change drastically from one time point to the next. We are interested in tracking the learning behavior of the students and the dynamics of the Twitter network.

Now how do we predict such dynamics? Our claim is that this dynamic behavior is driven by hidden groupings in the data together with some exogenous factors, a.k.a. covariates. To make this clearer, let us give some examples for the educational data and the Twitter data described above. We claim that knowing the day of the week matters for prediction, as students are more reluctant to work on weekends. This also holds for the Twitter network, as the tendency of users to be active on the network is affected by the day of the week. In other words, the behavior of the network on a Saturday is different from its behavior on a Tuesday, and taking such exogenous information into account can help prediction. As another example, it is useful to know which concept the students are working on in any given day. However, these exogenous factors are not enough to track the dynamic behavior of the users. We claim that there are also hidden groupings in the data that drive the dynamics: for example, unseen groups of strong and weak learners in the educational data who tend to act similarly, or unseen friendships or communities that the users belong to in a Twitter network.
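To make the notion of a covariate concrete, the following is a minimal sketch of how a day-of-week covariate could be encoded for each observation day. The function name day_covariates and the particular encoding (seven one-hot weekday entries plus a weekend flag) are assumptions made for illustration; they are not taken from the paper.

```python
from datetime import date


def day_covariates(d):
    """Encode calendar day d as a covariate vector (illustrative encoding only).

    Returns 7 one-hot weekday entries (Monday first) followed by a 0/1 weekend flag.
    """
    one_hot = [0] * 7
    one_hot[d.weekday()] = 1  # date.weekday(): Monday = 0, ..., Sunday = 6
    is_weekend = 1 if d.weekday() >= 5 else 0
    return one_hot + [is_weekend]


# A Saturday and a Tuesday get different covariate vectors.
saturday = day_covariates(date(2012, 12, 1))
tuesday = day_covariates(date(2012, 12, 4))
```

The weekend flag is redundant given the one-hot entries; it is included only to show that several covariates can be stacked into one vector per day.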

We present conditional latent tree models (CLTM) to account for the effect of the covariates as well as the effect of the hidden groupings. The problem, therefore, reduces to a structured prediction problem. The overall idea of the model is that the students/Twitter users are dependent random variables whose dependence structure is represented by a conditional latent tree, an undirected graph whose structure and latent nodes are learned from the data conditioned on the covariates. This is an unsupervised learning task. Once we have the dependency structure of the data, we perform expectation maximization to maximize the likelihood of the data. More details about the model and how it is trained are found here.
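To give a feel for what a tree-structured dependence among users looks like, here is a deliberately simplified sketch. It is not the CLTM algorithm: it builds a Chow-Liu-style maximum spanning tree over empirical mutual information between observed binary activity series, and it neither inserts latent nodes nor conditions on covariates. The function names (mutual_information, chow_liu_tree) and the toy participants are hypothetical.

```python
import math
from itertools import combinations


def mutual_information(a, b):
    """Empirical mutual information (in nats) of two equal-length discrete sequences."""
    n = len(a)
    joint = {}
    for x, y in zip(a, b):
        joint[(x, y)] = joint.get((x, y), 0) + 1
    pa = {v: a.count(v) / n for v in set(a)}
    pb = {v: b.count(v) / n for v in set(b)}
    mi = 0.0
    for (x, y), count in joint.items():
        pxy = count / n
        mi += pxy * math.log(pxy / (pa[x] * pb[y]))
    return mi


def chow_liu_tree(series):
    """Return the edges of a maximum spanning tree under pairwise mutual information.

    series maps a variable name to its list of 0/1 observations (one per day).
    Uses Kruskal's algorithm with a small union-find structure.
    """
    names = sorted(series)
    weighted = sorted(
        ((mutual_information(series[u], series[v]), u, v)
         for u, v in combinations(names, 2)),
        reverse=True,
    )
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    edges = []
    for _, u, v in weighted:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding this edge does not create a cycle
            parent[root_u] = root_v
            edges.append((u, v))
    return edges


# Toy daily activity (1 = active) for four hypothetical participants.
activity = {
    "alice": [1, 1, 0, 0, 1, 0, 1, 0],
    "bob":   [1, 1, 0, 0, 1, 0, 1, 0],
    "carol": [1, 0, 1, 0, 1, 0, 1, 0],
    "dave":  [0, 1, 0, 1, 0, 1, 0, 1],
}
tree_edges = chow_liu_tree(activity)
```

In this toy data, alice and bob behave identically and dave always does the opposite of carol, so both pairs end up directly connected in the tree. A latent tree model goes further by explaining such pairs through inserted hidden nodes.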

We consider three datasets to demonstrate the performance of our method. The educational dataset records the performance of students in an online psychology course on the Stanford Open Learning Library and is available on CMU DataShop. It contains 2,493,612 records of 5,615 students throughout a 92-day semester. We consider a subset of 244 students who stay active throughout the whole semester. Another dataset is a subset of the Twitter network with 333 participants who discuss an emergency management topic, #smemchat. The network was observed for 6 months from Dec 1st 2012 to Apr 20th 2013. The last dataset is Freeman's beach dataset, a one-month observation of 95 beachgoers who windsurf at a southern California beach.

Links to the visualization of the learning algorithm and the performance

The final learned tree over the knowledge components is given here. The yellow nodes are the hidden variables inserted by the algorithm and the blue nodes are the labeled knowledge components covered in the course. You can zoom into the tree for a better look at the labels and the learned relations among them.

The process of the structure learning algorithm can be viewed here. You can see the steps of the learning algorithm as the latent tree evolves over iterations. You can also zoom into different parts of the learned tree and explore the relevant parts that are grouped together.

Finally, you can look at the performance of the strong and weak student clusters found by the student tree, shown for each concept cluster, here.