Empirical analysis of collective human behavior for extraordinary events in blogosphere

To uncover an underlying mechanism of collective human dynamics, we survey more than 1.8 billion blog entries and observe the statistical properties of word appearances. We focus on words that show dynamic growth and decay with a tendency to diverge on a certain day. After careful pretreatment and fitting, we find that power laws generally approximate the functional forms of growth and decay, with exponent values between −0.1 and −2.5. We also observe news words whose frequencies increase suddenly and then decay following power laws. To explain these dynamics, we propose a simple model of posting blogs involving a keyword, and we check its validity directly from the data. The model suggests that bloggers not only respond to the latest number of blogs but also feel deadline pressure from the divergence day. Our empirical results can be used for predicting the number of blogs in advance and for estimating the time needed to return to the normal fluctuation level.

I. INTRODUCTION
Collective behavior in human society has attracted considerable interest in the past decade [1–14], because developments in information technology have enabled the storage of large volumes of high-frequency human activity data. Examples include detecting bubbles in stock exchange activities [6], modeling dealer behavior using real data in foreign exchange markets [7], and empirical analysis of consumer behavior in supermarkets and convenience stores using purchase history and point-of-sale (POS) data [8,9]. Human activity data collected from the Web, for example, from YouTube videos and the social network service Facebook, are analyzed not only to explain basic individual human behavior but also to elucidate hidden network structures in society [10,11].
Here we also use data from the Web to uncover nontrivial mechanisms of collective human activities. Because word frequency on the Web is expected to immediately reflect the real social mood, it has attracted increasing attention from academic and industrial researchers, and such data are now stored electronically and analyzed widely. For example, the Library of Congress in the United States, which is the largest library in the world, has been archiving all public tweets of Twitter, a microblogging system, since 2007 (http://blog.twitter.com/2010/04/tweet-preservation.html).
A blog is a type of Web site that is maintained by an individual, with entries displayed chronologically with time stamps. The term blog originated from the combination of web and log and was popularized around the year 2000, when Internet service companies began to provide free blog services. A blogger, who is the owner of a blog site, can easily upload his or her "entries" at any time, and readers can easily post comments on the blog page. This interactive quality has contributed to the success of blogs; they are now widely used as basic social communication tools. The whole blog community is often called the blogosphere, and its scientific study is expected to be a promising new field of science, as a huge amount of records is compiled as digital data.
* sano.yukie@nihon-u.ac.jp
In this study, we analyze the keyword appearance rate in blogs in which the functional forms of growth and decay around the peak are approximated by power-law functions of time. Similar power laws have been established in other fields of human activity. For example, a power law can describe a decrease in online book sales with an exponent that depends on endogenous or exogenous shocks [12]. Relaxation in audience number for online movies can also be described by power laws with various values of exponents that reflect the quality of the content [10]. Alfi et al. found that growth in conference registration numbers is also approximated by a power law diverging at the deadline [13].
In Sec. II, we describe the analyzed data and Japanese blogs. In Sec. III, we introduce our pretreatment procedures and peaked words. In Sec. IV, we focus on the time evolution of these peaked words and show that they grow and decay following power laws. To reproduce the power laws, we introduce a simple model of posting blogs in Sec. V. In Sec. VI, we discuss the predictability of our model from the standpoint of application, and the final section is devoted to conclusions.

II. DATA DESCRIPTION
The data analyzed in this study were obtained from the blogosphere written in Japanese over a period of 4 years, from November 1, 2006, to October 31, 2010. According to the technical report by the Internet search engine company Technorati (http://technorati.com), which tracked more than 70 million blogs worldwide in 2007, the share of Japanese blogs is 37%, the largest among all languages. Although we analyze only the Japanese blogosphere, we show an example in which the dynamic properties in Japanese and English are considerably similar. Figure 1 shows the temporal change of the frequency of the English "April fool" observed by Google Trends (http://www.google.com/trends) and surveyed worldwide compared to the number of blog entries containing the corresponding Japanese. In both cases, we confirm that there is a clear peak on the week including April Fools' Day.
FIG. 1. (Color online) Temporal change of the word frequency of "April fool" per week. The results are from Google Trends, which is targeted worldwide, and our blog data "Kuchikomi@kakaricho," which is targeted only in Japan. The number of blogs is normalized by the whole number.
In blogosphere research, it is important to note the existence of spam blogs. These are automatically generated blogs in which the same words are repeated multiple times, mainly for the purpose of advertising. As the share of spam in the Japanese blogosphere is said to be 40%, it is important to exclude spam from the data. We used a new Internet service called "Kuchikomi@kakaricho" (http://kakaricho.jp) to collect the data. This service provides an application programming interface (API) that counts the number of entries in which a given target word appeared in a given period by using search engine technology with a spam filter. There are three levels of spam filtering, and we apply the middle level, which is known to remove most of the spam while keeping most of the human-written blogs untouched. The API counts entries rather than word occurrences: if one entry includes the target word multiple times, it is counted only once.
The API started crawling the blogosphere on November 1, 2006, and covered major blog service providers. It covers more than 1.8 billion blog entries in 15 million blogs accounting for 90% of the Japanese blogosphere.
For analysis of Japanese we introduced a pretreatment to separate Japanese words that are not separated by spaces. Here we use the commonly used Japanese morphological analyzer "MeCab" (http://mecab.sourceforge.net/) to individually separate words according to a dictionary. By adding words to its dictionary, this software can treat multiword phrases such as April Fool as one word, April-Fool. Most of the words used in this study are already listed in the software's dictionary as one word, except names of people.

III. PEAKED WORDS
In the blogosphere, there are special words whose frequency grows or decays around a peak day such as April Fool with the peak on April 1. In the following discussion, we denote these words as "peaked words" and analyze their functional forms of growth and decay.

A. Pretreatment
We first apply the following pretreatment to the data to exclude both trivial circadian human activity patterns and systematic noise. In this subsection, we mainly focus on the statistics of the blogosphere itself, not on peaked words.
Time shift: In the blogosphere, although a day formally starts at 00:00:00, many bloggers are active around midnight. Therefore, we examine the complete circadian activity pattern and introduce a correction for our daily data. For this purpose, we randomly chose 10 000 bloggers whose activities are time stamped in seconds. By counting the number of entries posted in every hour, we obtain the circadian activity pattern plotted in Fig. 2. The solid line shows the 24-h activity pattern obtained directly from the data. However, we found a certain number of blog entries with time stamps of exactly 00:00:00. We consider this time stamp to be an artifact of a system specification or an error, and we exclude these entries from the statistics when capturing the circadian pattern. The red bars in Fig. 2 show the revised circadian activity pattern. Using a 24-h clock, we find that blogging activity is lowest around 4:00, and, thus, we consider it reasonable to start a day at 05:00.
Because the share of activity in the interval between 00:00 and 05:00 is approximately 10% of the complete activity of a day, we can correct the daily number of blog entries including the $j$-th target word on the $t$-th day, $\tilde{x}_j(t)$, by the following equation:
$$\tilde{x}_j(t) = w\,\tilde{x}^{(0)}_j(t) + (1 - w)\,\tilde{x}^{(0)}_j(t + 1), \qquad (1)$$
where $\tilde{x}^{(0)}_j(t)$ denotes the original count for calendar day $t$ and the weight is set as $w = 0.9$. With this modification, we obtain the time-shifted time series. In Fig. 3, open circles show the original daily data, for which $w = 1.0$ in Eq. (1), and colored circles show the time-shifted data, for which $w = 0.9$. Without this modification, the circadian effect makes the value on the day after the peak always higher than that on the day before the peak; the time-shifted data show a more symmetric pattern than the original data. We also apply this procedure to the time series of the total number of blog entries per day, $\tilde{x}(t)$.
To clarify the effect of this time-shift procedure, we also show results without this time-shift procedure in Appendix A.
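The time-shift correction can be sketched in a few lines of Python. This is only an illustration under an assumption: it takes the correction to be a weighted sum of a day's raw count and the following day's raw count with weight w = 0.9, so that roughly 10% of the next calendar day (the 00:00 to 05:00 share) is attributed to the current day. The function name `time_shift` and the sample numbers are hypothetical.

```python
def time_shift(raw_counts, w=0.9):
    """Return time-shifted daily counts w * raw[t] + (1 - w) * raw[t + 1].

    The last day is dropped because its successor is not available.
    The weighted-sum form is an assumption made for this sketch.
    """
    shifted = []
    for t in range(len(raw_counts) - 1):
        shifted.append(w * raw_counts[t] + (1.0 - w) * raw_counts[t + 1])
    return shifted


# Hypothetical raw counts around a peak on the third day.
raw = [10, 20, 100, 60, 30]
shifted = time_shift(raw)          # w = 0.9: time-shifted series
unshifted = time_shift(raw, w=1.0) # w = 1.0 reproduces the raw series
```

With w = 1.0 the series is unchanged, which corresponds to the "original daily data" case; with w = 0.9 part of the excess on the day after the peak is moved back to the peak day.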
Normalization: The total number of entries per day has nonuniform and nonstationary properties [14]. For example, there was a sudden drop in February 2007 that was caused by system maintenance of the search engine software. To reduce the systematic fluctuations caused by such nonuniform properties, we apply the following normalization procedure. There is already a method to separate internal and external noises [15], which simply deducts the external factor depending on its ratio of the total traffic. They assume that each traffic [...] Here we simply divide $\tilde{x}_j(t)$ by the total number, $\tilde{x}(t)$. The normalized number of entries for the $j$-th word on the $t$-th day is defined by
$$x_j(t) = \frac{\tilde{x}_j(t)}{\tilde{x}(t)}\,\langle \tilde{x} \rangle,$$
where $\langle \tilde{x} \rangle$ denotes the mean value of $\tilde{x}(t)$ averaged over the entire observation period. This normalized quantity is proportional to the probability that a blog contains the $j$-th word on the $t$-th day, and it is not necessarily an integer.
By introducing this normalization, the fluctuations caused by the aforementioned nonuniform properties can be reduced. In this study, we measure the word frequency using this normalization procedure.
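As a minimal sketch of this normalization, assuming the daily word counts and the daily totals are available as two equally long lists, the word count is divided by that day's total and rescaled by the mean total over the whole period. The function name `normalize` and the numbers are hypothetical.

```python
def normalize(word_counts, total_counts):
    """Divide each daily word count by that day's total, times the mean total."""
    mean_total = sum(total_counts) / len(total_counts)
    return [xj / x * mean_total for xj, x in zip(word_counts, total_counts)]


# Hypothetical example: on day 2 the crawler collected only half of the
# usual volume, so the raw word count drops although the share is unchanged.
word_counts = [10, 5, 10]
total_counts = [1000, 500, 1000]
normalized = normalize(word_counts, total_counts)
```

In this toy case the raw counts fall from 10 to 5 on the second day, while the normalized values stay constant, which is the intended effect of removing fluctuations in the total volume.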

B. Word selection
We determine candidates for peaked words in the following three categories.

Event: We selected the names of 14 public holidays and 16 major annual events in Japan. The appearance rate of these words grows and decays around the date of the event. In addition, there are words affiliated with an event, such as Santa Claus for Christmas, and we can observe similar growth and decay behaviors for those words. However, in this analysis we neglected such affiliated words.
Date: We selected dates such as May 9, resulting in 365 words. There are many blog entries that announce some special day, e.g., birthdays and festivals. Growth and decay of these words always show a clear peak at the date.
News: A word such as earthquake occurs suddenly right after the occurrence of the event and the word appearance rate generally decays slowly. In order to observe the functional form of such decay after a significant event, we selected names of the places impacted by earthquakes. We also selected 33 names of famous people who died suddenly. In addition, we included the names of the Japanese scientists who received a Nobel Prize during our observation period.

IV. DYNAMICS OF PEAKED WORDS
We call the slopes before the peak day fore-slopes and those after the peak day after-slopes, and we examine both in this section. As no standard method is known for checking the validity of approximating a given time series by a power-law time evolution, we apply a statistical test for power-law functions introduced by Preis et al. [16] that is based on the Kolmogorov-Smirnov statistical test [17].

A. Method
We define the number of days in each slope as the number of consecutive days, counted from the peak, on which the word frequency is larger than its median value. The median value is estimated over the entire observation period. We then approximate the functional form of the slopes using two models, a power law,
$$x_j(t) = A_j\,|t - t_c|^{-\alpha_j}, \qquad (2)$$
and an exponential law,
$$x_j(t) = B_j \exp\!\left(-\beta_j\,|t - t_c|\right). \qquad (3)$$
The parameters of these models, $\alpha_j$, $A_j$, $\beta_j$, and $B_j$, are determined by the least-squares method. The fitting region is $[t_c \pm 1, t_c \pm n]$, where $n$ is the number of days in the slope. We then apply the Kolmogorov-Smirnov (KS) goodness-of-fit test to choose the better model. The test was originally designed for distributions; here we apply it to evaluate how well a functional form fits a time series. For both models we calculate the KS statistic $D$, representing the deviation and defined as
$$D = \max_t \left| X^{(\mathrm{empirical})}_j(t) - X^{(\mathrm{model})}_j(t) \right|, \qquad (4)$$
where $X^{(\mathrm{empirical})}_j(t)$ is the cumulative number counted from the data and $X^{(\mathrm{model})}_j(t)$ is the cumulative number calculated from the model. In both cases, numbers are normalized by $X_j(t_c \pm 1)$. The power-law model is accepted if its $D$ value is the smaller of the two.
When the power law is accepted, we check the validity of the model as introduced in Ref. [16]. First, we generate 1000 synthetic data sets, each containing $n$ data points. Synthetic data points are generated randomly following the normal distribution with the mean value $x^{(\mathrm{model})}_j(t)$ given by the best-fit model and the standard deviation $\sigma(x^{(\mathrm{model})}_j(t))$ as follows:
$$\sigma(x) = \sqrt{x\left(1 + a^2 x\right)}, \qquad (5)$$
where $a = \sqrt{\langle X^2 \rangle_c}/\langle X \rangle = 0.08$ is a constant parameter characterizing the fluctuation in the number of all bloggers, which is determined independently of the word [see Appendix B for the theoretical derivation of Eq. (5)].
For each synthetic time series, we compare its $D$ value with that of the empirical one, and we define $q$ as the fraction of synthetic time series whose $D$ value is larger than the empirical one; for example, 100 such cases out of the 1000 synthetic samples give $q = 0.1$. Contrary to the ordinary sense of the $p$ value, the power-law hypothesis is considered to be valid for larger $q$. Thus, if $q$ is close to 1, the difference between the empirical data and the model can be attributed to statistical fluctuation alone and we accept the power-law hypothesis. If $q$ is smaller than 0.1, we reject the power-law hypothesis. We vary the border of the fitting region $n$ from 5 days to the maximum slope length. The value of the power exponent, $\alpha_j$, is given by the value for the case with the largest $n$ (see Table I).
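The procedure of this subsection can be sketched as follows. This is a simplified illustration, not the implementation used for the analysis, and it makes several assumptions that are not stated in the text: the least-squares fits are done by linear regression in log-log (power law) and semi-log (exponential) space, each cumulative curve is normalized by its own total, the synthetic noise has standard deviation sqrt(m(1 + a²m)) around the model value m, and synthetic values are clipped to stay positive. All function names are hypothetical.

```python
import math
import random


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_power_law(days, counts):
    """Fit counts ~ A * days**(-alpha) in log-log space (assumption)."""
    intercept, slope = linear_fit([math.log(d) for d in days],
                                  [math.log(c) for c in counts])
    return math.exp(intercept), -slope


def fit_exponential(days, counts):
    """Fit counts ~ B * exp(-beta * days) in semi-log space (assumption)."""
    intercept, slope = linear_fit(list(days), [math.log(c) for c in counts])
    return math.exp(intercept), -slope


def ks_distance(empirical, model):
    """Largest gap between the two cumulative curves, each scaled to end at 1."""
    def cumulative(values):
        total, running, out = sum(values), 0.0, []
        for v in values:
            running += v
            out.append(running / total)
        return out
    return max(abs(e - m)
               for e, m in zip(cumulative(empirical), cumulative(model)))


def q_value(days, counts, a=0.08, n_synthetic=1000, seed=0):
    """Fraction of synthetic series whose D exceeds the empirical D."""
    rng = random.Random(seed)
    A, alpha = fit_power_law(days, counts)
    model = [A * d ** (-alpha) for d in days]
    d_empirical = ks_distance(counts, model)
    larger = 0
    for _ in range(n_synthetic):
        # Assumed noise level; clipping keeps the cumulative curve well defined.
        synthetic = [max(rng.gauss(m, math.sqrt(m * (1 + a * a * m))), 1e-9)
                     for m in model]
        if ks_distance(synthetic, model) > d_empirical:
            larger += 1
    return larger / n_synthetic


# Hypothetical after-slope: days are distances |t - t_c| = 1..30 from the peak.
days = list(range(1, 31))
counts = [1000.0 * d ** (-1.2) for d in days]
A, alpha = fit_power_law(days, counts)
B, beta = fit_exponential(days, counts)
d_power = ks_distance(counts, [A * d ** (-alpha) for d in days])
d_exp = ks_distance(counts, [B * math.exp(-beta * d) for d in days])
q = q_value(days, counts)
```

For this noise-free toy series the power law gives the smaller D and q is close to 1, so the power-law hypothesis would be kept; real data would of course give intermediate values.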

B. Results
The absolute value of the power exponents of the after-slopes is larger than that of the fore-slopes in 58% of the 65 samples for Event and 80.6% of the 603 samples for Date. For Date, we confirm a significant difference between fore-slopes and after-slopes by use of a t test with $p < 2 \times 10^{-16}$, while a significant difference is rejected with $p = 0.80$ for Event. The number of days of the after-slopes is larger than that of the fore-slopes in 55% of the 65 samples for Event and 65.8% of the 603 samples for Date. For Date, we confirm a significant difference between fore-slopes and after-slopes by use of the KS test with $p < 2 \times 10^{-16}$, while a significant difference is rejected with $p = 0.22$ for Event.
In the case of the news words, there is no fore-slope and we cannot compare the values of the exponents before and after the peak. The absolute values of the exponent after the peak tend to be estimated as smaller for high impact news because of the effect of sequential broadcasts after the news. For example, in the case of the sudden death of the world-famous entertainer Michael Jackson, which marked the peak day, there was a funeral service after a few days and a memorial CD released after a few weeks. Both can be regarded as aftershocks that remind us of the main news. Because of such repetition, the keyword appearance rate after the peak day is enhanced, the decay of the word appearance becomes slower, and the power exponent tends to take a smaller value.

C. An extreme case: Tsunami
The power-law decay per day of the word tsunami in the Japanese blogosphere is shown in Fig. 6(a). The peak day was March 12, 2011, the day after the quake, with 142 617 posts, or 12.6% of all blog posts, in the raw data. After pretreatment with time shift and normalization, the estimated power exponent $\alpha_j$ is 0.67 with $A_j = 61788$ ($n = 50$) using Eq. (2). If we simply extrapolate the power-law function, it is expected to take approximately 8623 days (∼23.4 years) to return to the normal fluctuation level. The normal fluctuation level was 140 appearances per day, estimated from the data 1 month before the quake. Although most of the news words decay in approximately 10 days, the case of tsunami is a rare exception because the number of entries is still 10 times higher than before the peak, even a year after the quake.
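The return-time estimate is a one-line inversion of the power law: setting A t^(−α) equal to the normal level and solving for t gives t = (A / level)^(1/α). The sketch below, with the hypothetical helper `days_to_return`, applies this to the rounded values quoted in the text. Because the result is very sensitive to α, the rounded exponent 0.67 yields a figure of the same order as, but not identical to, the quoted number of days, which was presumably obtained with an unrounded exponent.

```python
def days_to_return(A, alpha, baseline):
    """Solve A * t**(-alpha) = baseline for t (days after the peak)."""
    return (A / baseline) ** (1.0 / alpha)


# Rounded values quoted in the text for the word "tsunami".
t_rounded = days_to_return(61788, 0.67, 140)
years_rounded = t_rounded / 365.0

# Sensitivity check: a change in the third decimal of alpha shifts the
# estimate by hundreds of days.
t_slightly_steeper = days_to_return(61788, 0.672, 140)
```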
Twitter also shows a similar power-law behavior, even though the time resolution differs. Figure 6(b) shows the number of tweets measured per hour that include tsunami, calculated based on 1 397 783 tweets. We believe that this type of power law reflects the robustness of the empirically observed dynamics of collective human behavior.

V. THE MODEL
In this section, we propose a simple dynamic model to describe the typical power-law growth and decay of the frequency of blogs with peaked words. There is already a simple model that describes people's universal behavior before a deadline by assuming pressure inversely proportional to the remaining time [13]. As this simple model can describe only the special case $\alpha = 1$, a kind of utility function that includes the tendency to postpone the action is introduced there to describe the general case. Here, we introduce another approach to describe the general case. We introduce the following two assumptions for the change in the number of blogs including the $j$-th target word on the $t$-th day, $\Delta x_j(t) = x_j(t + 1) - x_j(t)$, which is an increment for the fore-slope and a decrement for the after-slope.
1. The pressure from the peak day $t_c$ is inversely proportional to the time from it, $1/|t_c - t|$ [13].
2. The change $\Delta x_j(t)$ is proportional to the number of blogs including the $j$-th target word, $x_j(t)$.
We can write these two assumptions in mathematical form in the continuous case, where we assume $\Delta x_j(t) \approx dx_j(t)/dt$. The time evolution of blogs for the fore-slope is given as
$$\frac{dx_j(t)}{dt} = \alpha^{(\mathrm{fore})}_j\,\frac{x_j(t)}{t_c - t} + f(t), \qquad (6)$$
where $f(t)$ is an independent noise with a zero mean. The value $\alpha^{(\mathrm{fore})}_j > 0$ is a proportionality factor that describes the effect of the above-mentioned two assumptions. Similarly, the decrement of the after-slope is given as
$$\frac{dx_j(t)}{dt} = -\alpha^{(\mathrm{after})}_j\,\frac{x_j(t)}{t - t_c} + f(t), \qquad (7)$$
where $\alpha^{(\mathrm{after})}_j > 0$ is also a proportionality factor that describes the effect of the two assumptions. Because we know that blogs decrease after $t_c$, we add a negative sign to Eq. (7). It is easy to confirm that both Eqs. (6) and (7) lead to the power-law divergence, Eq. (2), in the case with no noise term $f(t)$. Thus, $x_j(t) \propto (t_c - t)^{-\alpha^{(\mathrm{fore})}_j}$ for fore-slopes and $x_j(t) \propto (t - t_c)^{-\alpha^{(\mathrm{after})}_j}$ for after-slopes. In the case that there is no pressure from the peak day $t_c$, blog dynamics follow Eq. (3) of the exponential law.
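That the two assumptions lead to a power law can be checked numerically. The sketch below integrates the noise-free fore-slope dynamics, taken here to be dx/dt = α x / (t_c − t), with a simple Euler scheme and compares the result with the closed form x(t) = x(0) (t_c / (t_c − t))^α obtained by separating variables. The parameter values, the step size, and the function name are illustrative choices, not values from the data.

```python
def integrate_fore_slope(alpha, t_c, t_end, x0=1.0, dt=0.001):
    """Euler integration of dx/dt = alpha * x / (t_c - t) from t = 0 to t_end."""
    steps = round(t_end / dt)
    x = x0
    for i in range(steps):
        t = i * dt  # computed from the index to avoid accumulating round-off
        x += dt * alpha * x / (t_c - t)
    return x


alpha, t_c = 0.8, 100.0
x_numeric = integrate_fore_slope(alpha, t_c, t_end=99.0)

# Closed form from separation of variables: ln x = -alpha * ln(t_c - t) + const.
x_exact = (t_c / (t_c - 99.0)) ** alpha
relative_error = abs(x_numeric / x_exact - 1.0)
```

The numerical and analytical values agree to well within 1% one day before the divergence point; replacing 1/(t_c − t) by a constant in the update would instead produce exponential growth.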
As a check of our assumptions, we rewrite Eqs. (6) and (7) into the following form without the noise term $f(t)$,
$$\frac{\Delta x_j(t)}{x_j(t)} = \pm\,\frac{\alpha_j}{|t_c - t|}, \qquad (8)$$
with the plus sign for fore-slopes and the minus sign for after-slopes, and we calculate the left-hand-side and right-hand-side values from the real data. Note that $t_c$ is not necessarily an integer, since the divergence point is expected to lie somewhere within a single-day time period. We surveyed all 1341 keywords for after-slopes as listed in Table I, and the median and the upper and lower quantile points are plotted in Fig. 7. As this figure shows, the median and quantile points fit well with the theoretical curve. This means that for the majority of words the relation Eq. (8) holds, implying that the change in the number of blogs is proportional to the number of recently written blogs, $x_j(t)$, and is also proportional to $1/(t - t_c)$.
Now we know that the above relation [Eq. (8)] holds for the system as a whole; however, there remain two scenarios that could realize it. In the first scenario, the main bloggers forming the peaked behavior are repeaters and the assumptions hold for each blogger individually. In the second scenario, the main bloggers are newly joined bloggers and the assumptions hold for general bloggers, implying the existence of collective interaction in the blogosphere. To clarify which scenario is right, we pay attention to 30 000 randomly chosen bloggers whose activities can be traced precisely. For all these bloggers we observe the days when they posted the typical keyword Marine Day. We count the total number of blogs including this keyword among these bloggers for each week, as plotted in Fig. 8 (bottom); we also count the number of bloggers who posted the keyword for the first time, and the ratio of the number of newcomers to the total number in the week is plotted in Fig. 8 (top). From this figure we confirm that the share of repeaters in the peaked behavior is generally less than half; namely, the power-law behavior is formed mainly by newly joined bloggers.
Similar results are also confirmed for some other typical keywords. This fact implies that the second scenario is correct and that the factor $\alpha$ characterizes the strength of the influence of written blogs on general bloggers, representing the existence of interaction in the blogosphere.
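The newcomer analysis can be sketched as follows, assuming the traced activity is available as a list of (blogger, week) records for posts containing the keyword. For each week the sketch counts the posts and the share of them that come from bloggers who use the keyword for the first time. The record format, the function name `newcomer_ratio`, and the toy data are hypothetical.

```python
from collections import defaultdict


def newcomer_ratio(posts):
    """posts: iterable of (blogger_id, week_index) for keyword posts.

    Returns {week: (total_posts, newcomers / total_posts)}, where a newcomer
    is a blogger who has not posted the keyword in any earlier week.
    """
    by_week = defaultdict(list)
    for blogger, week in posts:
        by_week[week].append(blogger)

    seen = set()
    result = {}
    for week in sorted(by_week):
        bloggers = by_week[week]
        newcomers = set(bloggers) - seen
        result[week] = (len(bloggers), len(newcomers) / len(bloggers))
        seen |= set(bloggers)
    return result


# Toy records: blogger "a" is a repeater, "c" and "d" join in week 1.
posts = [("a", 0), ("b", 0), ("a", 1), ("c", 1), ("d", 1), ("a", 1)]
ratios = newcomer_ratio(posts)
```

A ratio above one half near the peak would correspond to the situation described above, in which the peaked behavior is formed mainly by newly joined bloggers.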
For fore-slopes there is a natural reason for the appearance of the factor $1/(t_c - t)$ in Eq. (6), explained by the deadline effect [13]: a blogger who plans to post the keyword Marine Day may think that there are $t_c - t$ days before the deadline and that a posting date can be chosen from $t_c - t$ candidates, so the probability of posting a blog on a given day is proportional to $1/(t_c - t)$. This effect can be regarded as a universal property that holds for each blogger individually.
On the other hand, the reason for after-slopes is less obvious. For a blogger who wants to post the keyword after the event, the probability of writing a blog including the keyword might be proportional to the decaying strength of memory. In the field of psychology, the functional form of memory decay is usually approximated by a nonlinear function [18], and here, as the simplest assumption, we introduce the inverse power law of memory decay from the deadline, $1/(t - t_c)$, which has the same functional form as the case of fore-slopes. With this assumption we can explain the nontrivial exponents of the power-law behaviors of blogs by introducing the factor $\alpha$ that describes the strength of the influence of written blogs on general bloggers.

VI. PREDICTABILITY OF FREQUENCY
As an application of this study, we explore the possibility of estimating the word frequency in the near future. In Fig. 9, we show an example of the prediction of the blog frequency of Marine Day in 2008. In this case, we already know that the peak day is July 21, 2008; thus, we can fix the divergence point $t_c$. From the data, we find that the slope period starts on April 28, 85 days before $t_c$, as the normalized frequency continuously exceeds the median value from this day. In Fig. 9(a), the prediction made 20 days before the divergence point, using 65 data points with Eq. (2), is shown by the red line. In Fig. 9(b), the prediction made 5 days before the peak day is shown. As expected, the prediction error becomes smaller for a shorter prediction period.
Note that a small difference in estimation of the exponent α j makes a big difference near the peak; thus, the number of data points plays an important role in its accuracy.
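A prediction of this kind can be sketched as follows: fix the divergence point, fit the power law to the part of the fore-slope observed so far, and extrapolate toward the peak. The sketch uses synthetic data with hypothetical parameters (A = 3000, α = 0.9, 5% multiplicative noise) and a log-log regression, which is an assumption about how the least-squares fit is carried out; it is not the data or the code behind Fig. 9.

```python
import math
import random


def fit_power_law(remaining_days, counts):
    """Least-squares fit of log(count) = log(A) - alpha * log(remaining_days)."""
    lx = [math.log(d) for d in remaining_days]
    ly = [math.log(c) for c in counts]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    return math.exp(my - slope * mx), -slope


rng = random.Random(1)
true_A, true_alpha = 3000.0, 0.9  # hypothetical values for illustration

# Observed part of the fore-slope: 85 down to 21 days before t_c (65 points).
history_days = list(range(85, 20, -1))
history = [true_A * d ** (-true_alpha) * math.exp(rng.gauss(0.0, 0.05))
           for d in history_days]

A, alpha = fit_power_law(history_days, history)

# Forecast for the last 20 days before the divergence point.
forecast = {d: A * d ** (-alpha) for d in range(20, 0, -1)}
true_value_5_days_before = true_A * 5 ** (-true_alpha)
```

Because the forecast is an extrapolation in log-log space, a small error in the fitted exponent is amplified as the remaining time shrinks, which is why the number of data points in the fit matters.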

VII. CONCLUSIONS
By analyzing a large database of Japanese blogs, we showed that the functional forms of growth and decay of the appearance of words that peak on a certain day are generally approximated by power laws with exponent values between −0.1 and −2.5. The values of the power exponents depend on the category of words, such as Event, Date, and News. In the case of Event and Date, clarification of the asymmetry in the power exponents of the fore-slope and after-slope is an interesting subject for future research on collective human behavior. In the case of News, the power law can be observed only after the peak, and its power exponent depends on the impact of the news. In the case of significant news such as the March 11 earthquake in 2011, the absolute value of the power exponent is clearly smaller than 1.
We also checked the validity of our simple model, which indicates that bloggers change their probability of posting in proportion to the number of blogs and in inverse proportion to the time interval from the peak. By checking the data of bloggers' detailed activities, we confirmed that the peaked behavior mainly consists of newly joined bloggers. This implies that there exists a kind of global interaction between the newcomers and the keywords, which makes the numbers of newcomers and keywords proportional.
In addition, these power functions can also be observed in Twitter, which suggests that these power-law behaviors are universal in social phenomena. An agent-based mathematical model will be used to reproduce these empirical properties of blogger activity in the near future [19].

APPENDIX A
[...] without a time shift for the word Marine Day, as mentioned in Sec. IV B. There is no major change in the power exponent $\alpha_j$ for the fore-slope and after-slope. However, for the value of the intercept $A_j$, we find a major deviation, especially for the fore-slope ($A_j = 3171$ with a time shift and $A_j = 2273$ without a time shift). In Fig. 11 and Table II, we summarize the whole samples.

APPENDIX B: MODIFIED RANDOM DIFFUSION MODEL
We introduce a modified random diffusion model, which is used in Eq. (5). The random diffusion model was originally introduced to describe diffusion properties of random walkers on a given network [20,21], and two of the authors (Y.S. and M.T.) have modified the model to be applicable to the fluctuations in word appearance in the blogosphere [22]. In our modified random diffusion model, we assume that there are two states, active and nonactive, for each blogger, and the number of active bloggers fluctuates randomly each day. Each active blogger randomly decides to post a blog including the $j$-th word. There is a key parameter in this stochastic process; the share of the $j$-th word, $c_j$, is defined by the following equation:
$$c_j = \frac{\langle x_j(t) \rangle}{\langle X(t) \rangle}, \qquad (B1)$$
where $x_j(t)$ is the number of blog entries including the $j$-th word on the $t$-th day, $X(t)$ is the number of active bloggers on the $t$-th day, and the brackets show the mean over all instances. We assume that the number of active bloggers $X(t)$, $X(t) \geq 0$, fluctuates randomly following an independent probability density distribution $\varphi(X)$ with finite moments. The probability of posting $x_j$ entries is calculated using a Poisson distribution with the mean number $c_j X$, given as follows:
$$P(x_j | c_j) = \int_0^\infty \varphi(X)\,\frac{(c_j X)^{x_j}}{x_j!}\,e^{-c_j X}\,dX. \qquad (B2)$$
When $\langle x_j \rangle$ is small, a Poisson distribution is approximated by a Bernoulli distribution that assumes $x_j = 0$ with a probability $1 - c_j X$ and $x_j = 1$ with a probability $c_j X$. Thus, we have the following evaluations for an arbitrary distribution of $\varphi(X)$:
$$P(x_j = 0 | c_j) \approx 1 - c_j \langle X \rangle, \qquad P(x_j = 1 | c_j) \approx c_j \langle X \rangle. \qquad (B3)$$
For $x_j \geq 2$, $P(x_j \geq 2 | c_j) \approx 0$, thereby $P(x_j | c_j)$ is approximated by the Poisson distribution with both the mean and the variance given by $c_j \langle X \rangle$. For $\langle x_j \rangle \gg 1$, the Poisson distribution can be approximated by a normal distribution,
$$P(x_j | c_j) \approx \int_0^\infty \varphi(X)\,\frac{1}{\sqrt{2\pi c_j X}}\exp\!\left[-\frac{(x_j - c_j X)^2}{2 c_j X}\right] dX. \qquad (B4)$$
By introducing a new variable, $y_j = x_j / (c_j \langle X \rangle)$, Eq. (B4) becomes
$$P(y_j | c_j) \approx \int_0^\infty \varphi(X)\,\sqrt{\frac{c_j \langle X \rangle^2}{2\pi X}}\exp\!\left[-\frac{c_j \langle X \rangle^2}{2X}\left(y_j - \frac{X}{\langle X \rangle}\right)^2\right] dX. \qquad (B5)$$
When $\langle x_j \rangle = c_j \langle X \rangle \gg 1$, the weight function in the integral can be approximated by Dirac's $\delta$ function as
$$P(y_j | c_j) \approx \int_0^\infty \varphi(X)\,\delta\!\left(y_j - \frac{X}{\langle X \rangle}\right) dX. \qquad (B6)$$
Therefore, we have the following simple evaluation, for x j ,