Journal of Housing Research

ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/rjrh20

A Word to the Wise Analyzing the Impact of Textual Strategies in Determining House Pricing

Vincenzo Alfano & Massimo Guarino

To cite this article: Vincenzo Alfano & Massimo Guarino (2022): A Word to the Wise Analyzing
the Impact of Textual Strategies in Determining House Pricing, Journal of Housing Research, DOI:
10.1080/10527001.2021.2013058

To link to this article: https://doi.org/10.1080/10527001.2021.2013058

Published online: 19 Jan 2022.

Full Terms & Conditions of access and use can be found at


https://www.tandfonline.com/action/journalInformation?journalCode=rjrh20

A Word to the Wise Analyzing the Impact of Textual Strategies in Determining House Pricing
Vincenzo Alfano (a) and Massimo Guarino (b)
(a) Department of Economics, Westminster International University in Tashkent, Tashkent, Uzbekistan;
(b) Department of Literary, Linguistic and Comparative Studies, University of Napoli, l’Orientale,
Naples, Italy

ABSTRACT
Some of the mechanisms through which house prices are determined remain unclear. This
work aims to shed some light on the subject by analyzing the impact of text structure and
given keywords in the announcements of house sales over the internet. It seems that,
especially in trade among private individuals, the marketing of the announcement can make
the difference. By retrieving data regarding houses on sale in Italy, one of the countries
in which the housing market is still overwhelmingly composed of deals between private
individuals (and not a market shared by a few big companies), we derived via OLS and
fractional response probit estimation the impact of the text structure and several
keywords. Our results show that there is no mark-up due to verbs and punctuation, or
transport- and tourism-related keywords, while an abundance of nouns and adjectives or
keywords related to investment, panorama, and cultural heritage have a positive (or
expected) impact on the price. This suggests that using many nouns and adjectives in
writing a house sale announcement helps to sell the property at a higher price, and also
that while the latter set of keywords play a role in determining house prices,
announcements leveraging transport opportunities and touristic opportunities do not.

KEYWORDS
House pricing; text analysis; Napoli

JEL CODES
O18; R21; R31

CONTACT Vincenzo Alfano vincenzo.alfano@unina.it Department of Economics, Westminster International
University in Tashkent, Tashkent, Uzbekistan.
© 2022 American Real Estate Society

Introduction
The aim of this paper is to investigate the impact of sales advertisements in determining
house prices. In this work, we focus especially on text structure, in terms of the relative
frequency of parts of speech (POS), and on the presence of certain keywords.
Surprisingly, while how house prices are set is still not entirely clear to the scientific
community (Marsh & Gibb, 2011), little attention has been paid by scholars to the impact
of the characteristics of sales announcements (i.e., of one of the primary sources of
marketing) on the price. The most prominent and recent works on the subject are Nowak
and Smith (2017) and Goodwin et al. (2018). The contribution of Nowak and Smith is
mainly focused on improving the hedonic price model in the U.S. real estate market by
including unigrams (i.e., single words) and bigrams (i.e., pairs of co-occurring words)
collected from the public remarks on properties. More specifically, one of their empirical
findings reveals which words affect a property’s price positively or negatively, and to
what extent.
While this is certainly one of the first attempts to use textual information to capture
features of a property that are not observable from objective data alone, it provides
neither information on the seller’s writing style (i.e., the use of punctuation and the
number of adjectives, nouns, verbs, and adverbs) nor the effect of specific variables
(e.g., the presence of a panoramic view).
The work of Goodwin et al. (2018) is probably the first contribution aimed at raising
awareness of the importance of the announcement’s writing style for buyers’ behavior in
terms of word desirability. Their major findings unveil the potential role played by
factors such as race, gender, income, and education in influencing a buyer’s interpretation
of the language in a real estate listing. Because the choice of words is likely to affect
marketing outcomes, they suggest a more adaptive strategy for real estate agents
when writing a property description. Despite its importance in the field, the work
does not use a price model to determine whether certain words or word-variables (i.e.,
variables built from textual information) have an impact on property prices. Furthermore,
only the buyer-side desirability of words or style is provided; an analysis of the actual
writing style of the sellers (focused on syntactic analysis) could add useful insights on
the opportunity for real estate agents to implement a more user-oriented marketing
strategy.
Previous contributions include the use of transaction-based real estate data to study the
determination of price (Abraham & Schauman, 1991; Ambrose & Shen, 2021; Case et al.,
1991; Palmquist, 1984; Peek & Wilcox, 1991).
We consider this a very important topic for several reasons: principally, there is no
doubt that the housing market plays a very important role in shaping the economy over-
all. It has been suggested that housing renovation and construction boosts the economy
by increasing the house sales rate, as well as employment and expenditures (Ahmed &
Moustafa, 2016). The housing market also affects the demand for other related industries,
such as construction supplies and household durables (Li & Leatham, 2011).
Furthermore, houses are a popular form of investment in Europe, and even more so in
Italy (Abate & Losa, 2016), meaning it is of the utmost importance to understand what
influences house prices, since private investments are also affected.
In this respect, it is important to highlight that the existing literature has not yet
agreed on how house prices are set. Indeed, although the scientific literature has long
paid attention to the topic, there are still many doubts about the theoretical functioning
of the mechanism through which house prices are determined. Research examining the
differences in house prices within a given city often focuses on estimating the (implicit)
price at which the market clears, i.e., when buyers and sellers are willing to exchange a
contract for structural features, accessibility levels, and neighborhood amenities. This lit-
erature often uses the hedonic price approach (Cheshire & Sheppard, 1998; Rosen, 1974).
In this work, we will empirically analyze the impact of the style of the online
announcement of sales. We expect this impact to be higher on private buyers than on
professional market operators since the former should be more sensitive to the way an
announcement is written than the latter, who are less likely to be influenced by textual
strategy and more interested in facts. In other words, we expect that while professional
people working in this specific market have a more objective approach in the estimation
of the prices of the subject of their expertise, private operators may be more subject to
good (or bad) marketing strategies, misevaluating the objective value of a house (assum-
ing such a price always exists).
This is also suggested by the fact that several websites are devoted to advising on the
best real estate marketing strategy. These often encourage people to take advantage of
descriptive words (mostly adjectives) that may help generate a better offer. Here are
some examples that help better clarify how professionals advise writing an appropriate
real estate announcement: “The use of descriptive words may require using a thesaurus,
but it’s well worth the extra effort”;1 “Sprinkle the copy (i.e., the raw version of the
announcement) with real estate adjectives – give your copy some style and help paint a
mental picture for the reader”;2 “It takes real effort to be creative and think of descriptive
words that will actually make your property sound appealing”; “Here’s What I Want To
Give You: 200 Powerful Adjectives That Will Boost The Desirability Of Your Property.”3
In this respect, we think that our work can contribute to the house price literature and
also be a valuable starting point in opening a sub-strand of the related literature for at
least two reasons.
First, local estate markets are characterized by sellers who try to appeal to local prefer-
ences (by using meaningful keywords) when writing announcements, in order to high-
light their relevance for buyers.
Second, even if we are interested in gauging the relevance of certain keywords in pre-
dicting higher prices in the estate market, under the assumption that on average the
houses are sold at the price advertised (or at a linear function of this price), our method-
ology also allows us to infer to what extent the real estate market demand is sensitive
to the language used in the descriptions of houses.
As already stated by some pioneering works (Goodwin et al., 2018; Nowak & Smith,
2017), this is a prominent research area aimed at revealing the need for sellers to pay
attention to what kind of description people look for when seeking a house.
Another aim of this research is to keep raising awareness of the relevance of text ana-
lysis in obtaining insights from the markets and more specifically from the real estate
market. Indeed, this approach can be helpful for researchers working on house price
modeling, and, through textual information retrieval techniques, enable the detection of
house features that would otherwise be hidden. This is a very important problem for cur-
rent research in house pricing. It can also shed some light on how and to what extent
the use of language can affect the setting of prices.
Before moving on to the details of our analysis, it is important to summarize the state
of the art of the economic theoretical literature. In this regard, we may claim that the
scientific community mostly agrees (and uses as an explicative paradigm in the vast
majority of research papers on this topic) that the standard way to explain how house
prices are set is derived from the neoclassical paradigm (Marsh & Gibb, 2011), although
it is important to observe that there are some notable exceptions (Earls, 2005). The neo-
classical paradigm, in its many applications, assumes the rationality of agents. This
assumption has been challenged various times from many sides: for instance, Daniel
et al. (1998) claimed that asset-pricing models based on the assumption of agent
rationality are unable to explain patterns that are regularly observed in the real world by
economists. A number of alternative assumptions have thus been taken into account to
cope with the limits of the consumer rational behavior hypothesis (e.g. the so-called
“bounded rationality assumption”). This is based on the idea that each agent has a lim-
ited ability to access and process information, and thus it is unlikely one will find com-
plete rationality when observing real-world data. Conlisk (1996) suggests that agents
modeled in this way are more likely to behave like real-world agents, especially in
comparison with “completely rational” ones. Consider, for instance, a listing agent who
intentionally leaves out information for self-serving reasons: this is certainly rational
behavior, but possibly hard to model.
In any case, even if the neoclassical approach continues to be the mainstream stand-
ard in economics, alternative theories flourish. Indeed, many economists, especially in
the new millennium (among others: Shiller, 2005; Akerlof & Shiller, 2009; Stiglitz, 2010)
have stressed the need to move economic modeling toward an approach that empha-
sizes the non-conformity of human actions to rational principles. It is important to note
that the critique of agent rationality is not a new one in economics. Among these cri-
tiques is that of John Maynard Keynes, one of the discipline’s founding fathers. His chal-
lenge of the rational agent approach is implicit in his description of the choice-making
behavior of economic agents as being largely determined by “animal spirits,” i.e., high-
lighting that human beings are affected by irrational exuberance or pessimism. These
terms have been re-adapted by practitioners of behavioral economics, an approach that
tries explicitly to model the non-rationality of agents. This discipline has been described
as the marriage between psychology and economics, insofar as behavioral economists
try to use insights from psychology to model the behavior of economic agents more
realistically.
One of the most promising and interesting fields of research in the house pricing
subfield is the cross-fertilization between different disciplines, and especially the
application of lexicology (through text data analysis) to the mechanism through which
house prices are set (see Gentzkow & Shapiro, 2010 for a pioneering approach to the topic,
and Gentzkow et al., 2019 for a more recent application). Previous contributions also
include attempts to use information from property advertisements to value real estate
(Goodwin et al., 2014; Levitt & Syverson, 2008; Pryce & Oates, 2008).
In recent decades, most of the housing market has moved to advertising its offerings
over the internet. Accordingly, many specialized websites have arisen, offering
a virtual location in which potential buyers may compare different houses and select
the ones they are interested in. This is a large and, perhaps more importantly, a growing
market (Cherif, 2013), which offers a very important source of data. It seems legitimate
to assume that, aside from summarizing the main characteristics of the house, as hap-
pens in several other markets, the description in the announcement is also meant to pre-
sent the property to the reader (i.e., the potential buyer) in the best possible way, and at
the very least to attract her or his attention. Is there some textual structure or set of spe-
cific keywords that are correlated to higher prices? In other words, is there a relationship
between the use of the language made in the description of the house and the price of
the property? Does sales announcement marketing matter, and how?

Figure 1. Share of population living in owner-occupied dwellings in EU (source: Eurostat).

In this study, we aim to answer these questions via quantitative analysis, contributing
to the literature that applies Natural Language Processing (NLP) tools to real estate
advertisements to model house prices (Liu et al., 2020; Shen & Ross, 2019). We choose to
use Italy as the setting of our empirical analysis, since it is one of the EU and Western
countries with the largest percentages of citizens who own a house, as can be seen in
Figure 1 (over 80% of Italians prefer buying to renting,4 and over 72% of Italians live in
an owner-occupied dwelling, a number above the EU average of 69.3%)5, and thus there
is a large market for private buyers and sellers, which seems to us the best framework in
which to find data about agents interested and affected by house descriptions (as
already explained, we expect that private buyers will be more affected by this mechanism
than professionals in the field). In this respect, we believe that there are not many
differences between the Italian real estate market and that of the US, or at least that
the two are comparable, as also suggested in research from the
National Association of Home Builders6.
We focus our analysis on Naples, a city that has not seen many new buildings in
recent decades; its geographic characteristics, with the city trapped between a volcano
and the sea, make it hard to imagine that the supply of houses may expand in any
substantial way, thus influencing the price dynamics. At the same time, Naples is interesting
insofar as it is one of the cities with the highest levels of growth in tourism in Italy (and
in Italy arrivals increased from over 98,000,000 in 2010 to about 128,000,000 in 2018,
according to ISTAT estimates)7. In addition to this, anecdotal evidence suggests it is
experiencing gentrification, especially in the growth of bed and breakfasts and other
small accommodation businesses.8 Furthermore, Naples is the third-largest city in Italy in
terms of population (with around one million inhabitants): this should ensure an ideal
size for such an analysis, being big enough to have heterogeneity in its different neigh-
borhoods, but not so big as to include neighborhoods in the analysis that are not
really comparable.
By retrieving data from a popular real estate website, performing text analysis on both
raw and processed text data, and running some regressions on the datasets built in this
way, we test the importance of the presence of given words and the text composition of
the sales announcement on the final price of a house. More precisely, we estimate an
equation designed to assess the impact of the presence of a specific structure or sets of
words on the house price, with several different specifications, via both Ordinary Least
Squares (OLS) and fractional response probit estimators. By including in the analysis the
usual variables thought to determine house prices, we aim to isolate the impact of
text characteristics (i.e., the relative frequencies of the parts of speech and the presence
of some sets of given keywords) on house pricing.
In order to perform this analysis, we collected the text of several single sale announce-
ments to investigate the hidden effects of other relevant information on house pricing.
Indeed, under the assumption that all (or at least most) of the heterogeneity in prices
due to the characteristics of the house is captured by the control variables, and the rest
by the error term, the variable explaining the use of the words in the announcement
and its composition should capture the rest of the price difference, and thus the differ-
ence due to the characteristics of the announcement.

Data
We built our dataset through web scraping techniques, using Python’s Beautiful Soup
library, which enabled us to collect meaningful information from www.immobiliare.it (a
website with a scope similar to that of www.realtor.com in the US market).
This is possibly the most important and frequently visited Italian website for selling or
renting properties.9
Given the presence of a relevant variety of property types, we gathered data with
respect only to full-ownership flats for sale in Naples, excluding from the dataset those
under judicial auction (in order to avoid biases due to houses priced very differently
from their market value). Moreover, as the website allows us to filter the location of flats,
all the observations were grouped into 14 different urban areas, the neighborhoods into
which Immobiliare.it divides the city. While the choice to use a website platform as a
source of data allows us to gather a more realistic corpus for text analysis with respect
to a real estate market, some constraints, unfortunately, limit our analysis. For instance,
our data lack the actual selling price, and also the time on the market: in this respect,
our dataset differs from the MLS database used by many scholars in their research.
Nonetheless, while we recognize this lack of information and warn the reader about it,
we also believe that the quality of the text analysis offered by our data makes our contri-
bution valuable to the literature.
More specifically, for each observation, we recorded its ID, the sale announcement
name, the selling price posted by the agents (as already said it is not possible to deter-
mine the actual selling price), the size (expressed in square meters), and the number of
rooms and toilets10. For each observed flat, we also noted the floor on which it is
located. These values range from 1 to 11 and allow us to identify certain peculiarities as
well (for instance penthouses, or flats located on the ground floor, or in basements).
Furthermore, and this is possibly the most important and original contribution of this
study to the literature, we also collected the text of the single sale announcement in
order to investigate the hidden effects of further relevant information on house pricing
through the analysis of the included keywords.
To this end, text data were pre-processed following the most common Natural
Language Processing (NLP) techniques. The pre-processing procedure took the following
three steps:

1. Tokenization. This is the most basic step in NLP; it is the process of decomposing
a text value into a sequence of words or, more technically, “tokens” (for instance,
the text string “Hello World!” would be split into the following three base units of
text, i.e., tokens: ‘Hello’ - ‘World’ - ‘!’) (Manning & Schütze, 1999). In addition, we
removed punctuation while applying tokenization so that our tokens only
included words.
2. Normalization. Tokenized texts were normalized by putting all the characters in
lower case, so that there would be no difference between the words “The” and
“the,” for example (Bird et al., 2009).
3. Italian stop-words removal. Tokenized and normalized texts were further prepro-
cessed by removing Italian stop-words, i.e., a set of the most common words of a
language that are often found in sentences. Given their high frequency of occur-
rence, their removal allows for better detection of the keywords11 (Silva &
Ribeiro, 2003).
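
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the regular-expression tokenizer and the tiny stop-word set are assumptions made for the example, not the tokenizer or the stop-word list actually used in the study.

```python
import re

# Hypothetical, deliberately tiny stop-word set (illustration only);
# the list used in the study is not reproduced here.
ITALIAN_STOP_WORDS = {"il", "la", "di", "e", "in", "con", "un", "una", "a"}


def preprocess(text, stop_words=ITALIAN_STOP_WORDS):
    """Tokenize, lower-case, and remove stop-words from one announcement."""
    # Step 1: tokenization, keeping only word tokens (punctuation is dropped).
    tokens = re.findall(r"\w+", text)
    # Step 2: normalization to lower case.
    tokens = [token.lower() for token in tokens]
    # Step 3: stop-word removal.
    return [token for token in tokens if token not in stop_words]


example = "Appartamento luminoso con vista panoramica, in zona centrale!"
print(preprocess(example))
```

Note that a simple pattern such as `\w+` splits elided forms (e.g., an article joined to a noun by an apostrophe) into two tokens; a dedicated NLP tokenizer would handle such cases more carefully.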

Moreover, we also checked for the occurrence of bigrams, i.e., sequences of two
contiguous elements from a set of tokens: two adjacent tokens that can be considered
as one token based on their co-occurrence in the corpus (the set of all the tokens).
For instance, the tokens “large” and “room” are treated as a single token “large_room”
if their co-occurrence is relatively high. We provide more details on the bigram
formation algorithm in the following section.
The whole dataset contains 5,577 observations, gathered on 6 October 2019.
Unfortunately, values were not available for all the variables in every observation
(i.e., there are observations with missing or incomplete data on price, floor number,
size, etc., or even on the announcement text). In order to avoid biases, we proceeded
with listwise deletion, and thus ended up with a final sample of 5,040 observations.
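
Listwise deletion simply drops every observation that has at least one missing value among the variables of interest. The sketch below illustrates the idea on plain Python dictionaries; the field names and the sample records are hypothetical and do not come from the dataset.

```python
# Hypothetical field names, chosen for illustration only.
REQUIRED_FIELDS = ("price", "floor", "area", "text")


def listwise_delete(records, required=REQUIRED_FIELDS):
    """Keep only the records where every required field is present and non-empty."""
    return [
        record
        for record in records
        if all(record.get(field) not in (None, "") for field in required)
    ]


records = [
    {"price": 250000, "floor": 2, "area": 90, "text": "Ampio appartamento"},
    {"price": None, "floor": 1, "area": 70, "text": "Bilocale"},   # missing price
    {"price": 180000, "floor": 3, "area": 65, "text": ""},         # missing text
]
complete = listwise_delete(records)
print(len(records), len(complete))
```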
While Table 1 below provides descriptive statistics of the data, Table 2a shows summary
statistics on the tokenization for both the pre-processed and raw (i.e., not fully
pre-processed) corpora. By raw corpora we mean that each announcement was normalized
(put in lower case) and tokenized, keeping stop-words and punctuation tokens as well.
Given that we are interested in investigating the role played by other meaningful
information in house pricing, we built some dichotomous variables from the text of the
announcements. For each announcement, these dummy variables take the value of 1 if
any element from a set of specific keywords occurs (see Appendix A), and 0 otherwise.
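
As a rough illustration of how such dummy variables can be built from a tokenized announcement, consider the sketch below. The keyword sets shown are invented placeholders (the actual lists are those in Appendix A), and the function name is hypothetical.

```python
# Placeholder keyword sets for illustration; the real ones are in Appendix A.
KEYWORD_SETS = {
    "panoramic": {"panoramico", "panoramica", "vista_mare"},
    "transport": {"metro", "metropolitana", "stazione"},
    "investment": {"investimento", "rendita"},
}


def keyword_dummies(tokens, keyword_sets=KEYWORD_SETS):
    """Return 1 for each variable whose keyword set intersects the tokens, else 0."""
    token_set = set(tokens)
    return {
        name: int(bool(token_set & keywords))
        for name, keywords in keyword_sets.items()
    }


tokens = ["appartamento", "panoramico", "vicino", "metro"]
print(keyword_dummies(tokens))
```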
Table 1. Descriptive statistics of variables.

| Variable | Label | Observations | Mean | Std. Dev. | Min | Max |
|---|---|---|---|---|---|---|
| Natural logarithm of the price | lnprice | 5,040 | 12.39126 | .8576682 | 6.016157 | 15.68731 |
| Natural logarithm of the price per square meter | lnpricesqmt | 5,040 | 7.846102 | .5589894 | 2.104134 | 12.25486 |
| Share of the price of the most expensive of the neighborhood | frpr | 5,040 | .2143096 | .1583343 | .0002733 | 1 |
| Share of the price per square meter of the most expensive of the neighborhood | frprsqmt | 5,040 | .3042956 | .211933 | .0008573 | 1 |
| Number of rooms | rooms | 5,040 | 3.419841 | 1.157357 | 1 | 5 |
| Area of the house in square meters | area | 5,040 | 106.6248 | 56.73743 | 1 | 875 |
| Square of the area | areasqr | 5,040 | 14587.35 | 23555.81 | 1 | 765625 |
| Number of toilets | toilet | 5,040 | 1.536111 | .6473154 | 1 | 3 |
| Dichotomous variable equal to 1 if the house is on the first floor | dfflo | 5,040 | .2892857 | .4534758 | 0 | 1 |
| Dichotomous variable equal to 1 if the house is on the ground floor | dgflo | 5,040 | .1099206 | .3128218 | 0 | 1 |
| Dichotomous variable equal to 1 if the house is a penthouse | daflo | 5,040 | .027381 | .163207 | 0 | 1 |
| Dichotomous variable equal to 1 if the house’s announcement text expresses being close to cultural heritage places | cultural_heritage | 5,040 | .0126984 | .1119806 | 0 | 1 |
| Dichotomous variable equal to 1 if the house’s announcement text expresses being close to transportation | transport | 5,040 | .2811508 | .4496055 | 0 | 1 |
| Dichotomous variable equal to 1 if the house’s announcement text expresses being an investment opportunity | investment | 5,040 | .1464286 | .3535704 | 0 | 1 |
| Dichotomous variable equal to 1 if the house’s announcement text expresses being useful for tourism | touristic | 5,040 | .0597222 | .236995 | 0 | 1 |
| Dichotomous variable equal to 1 if the house’s announcement text expresses being panoramic | panoramic | 5,040 | .1359127 | .3427298 | 0 | 1 |
| Share of items in the text that are nouns | nouns | 5,040 | .2756813 | .0350169 | .1111111 | .4827586 |
| Share of items in the text that are verbs | verbs | 5,040 | .0784999 | .0247386 | 0 | .2222222 |
| Share of items in the text that are adjectives | adjectives | 5,040 | .1169727 | .0348362 | 0 | .3809524 |
| Share of items in the text that are punctuation marks | punctuation | 5,040 | .1344977 | .0432985 | 0 | .3461539 |

Table 2a. Summary statistics for the entire corpus and per the announcement.

| | Max | Mean | Total corpus |
|---|---|---|---|
| Processed tokens | 290 | 70.19 | 353,764 |
| Un-processed tokens | 569 | 138.015 | 695,596 |
| Nr. of announcements | | | 5,040 |

Table 2b. Part of speech tags classification over the entire raw corpus (not fully preprocessed tokens).

| POS tag | Relative freq. |
|---|---|
| Adposition | 0.127 |
| Noun | 0.268 |
| Adverb | 0.029 |
| Proper noun | 0.068 |
| Verb | 0.082 |
| Adjective | 0.115 |
| Determiner | 0.069 |
| Punctuation | 0.132 |
| Auxiliary | 0.009 |
| Conjunction | 0.037 |
| Numeral | 0.037 |
| Pronoun | 0.023 |
| Other | 0.002 |
| Interjection | 0.000 |
| Subordinating conjunction | 0.002 |
| Symbol | 0.000 |
| Particle | 0.000 |

More specifically, we have five variables of this kind: cultural_heritage, transport,
investment, touristic, and panoramic (additional details are provided below).
Furthermore, in order to assess the effect of the announcement style on house pricing,
we counted the absolute frequencies of four part-of-speech (POS) tags for each
announcement using the Python library “SpaCy,”12 which is commonly used in the NLP literature.
Part of speech tagging is one of the most basic tasks in NLP and consists of annotat-
ing each word within a single document (announcement) with its most suitable syntac-
tic category.
We were able to detect up to 16 POS tag categories for the Italian language within
the examined corpus (see Table 2b). The existing literature on NLP machine learning
techniques suggests taking the following tags into account: adjectives, nouns, adverbs,
pronouns, and verbs (Daoud et al., 2010; Daud et al., 2018; Whittle et al., 2014).
Given the nature of our text (which is mostly descriptive, as shown in the previous
section), and the low mean length of each announcement (see Tables 2a and b), we
used only the most relevant POS tags for this task, i.e., nouns (with the exclusion of
proper nouns), verbs, adjectives, and punctuation. Indeed, while announcement readabil-
ity can suffer from ineffective use of punctuation, there are no specific reasons to believe
that other POS, such as adverbs or pronouns, play a pivotal role in the writing style of
real estate announcements.
In this respect, we chose to run the algorithm on the raw tokens so as not to lose
information on punctuation and the original volume of each considered lexical category
(obviously, we do not provide for the detection of bigrams in this task).

By dividing the count of each lexical category by the total number of tokens in the
announcement, we obtain the share of each POS tag in each announcement, creating a
variable that weights the importance of each POS tag in each text, normalized by
its full length.
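
The share computation can be illustrated as follows. The sketch assumes that the announcement has already been tagged (for instance by a POS tagger) and is available as a list of (token, tag) pairs; the tag names follow the Universal POS convention, and the hand-tagged example is invented for illustration.

```python
from collections import Counter

# The four categories retained in the analysis, in Universal POS notation.
TRACKED_TAGS = ("NOUN", "VERB", "ADJ", "PUNCT")


def pos_shares(tagged_tokens, tracked=TRACKED_TAGS):
    """Share of each tracked POS tag over all tokens of one announcement."""
    total = len(tagged_tokens)
    if total == 0:
        return {tag: 0.0 for tag in tracked}
    counts = Counter(tag for _, tag in tagged_tokens)
    return {tag: counts[tag] / total for tag in tracked}


# Hand-tagged toy announcement (an assumption, not tagger output).
tagged = [
    ("appartamento", "NOUN"), ("luminoso", "ADJ"), ("con", "ADP"),
    ("vista", "NOUN"), ("panoramica", "ADJ"), (".", "PUNCT"),
    ("affaccia", "VERB"), ("sul", "ADP"), ("mare", "NOUN"), (".", "PUNCT"),
]
print(pos_shares(tagged))
```

Because the denominator is the full token count (including tags that are not tracked), the four shares need not sum to one.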
A preliminary examination of the data suggests that the quality of an announcement
does have an impact on the price. Indeed, there seems to be a positive correlation
between the number of pre-processed tokens (i.e., “word tokens”) in the sale announce-
ment and the log of its price (Figure 2). This suggests that houses sold at higher prices
have longer descriptions in the sale announcement. However, at this stage of the
analysis, this may be due either to greater care in the marketing of more expensive houses
or to a better market positioning of houses with better announcements.
Finally, it is important to state that a potential problem in our dataset is the absence
of objective information about the characteristics of the houses. Indeed, while our ana-
lysis is aimed at measuring the impact of the quality of the announcement, it may
instead be argued that our regressions caught the effects of the quality of the houses.
While this is certainly a limitation of the analysis that may not be solved with the avail-
able data, and which suggests that our results be taken with caution, it is important to
specify that there are reasons to believe that the quality of the announcement makes a
difference regardless of the specific properties of a house. Indeed, there is great
heterogeneity in the quality of the announcements, and not all the houses are showcased
as well as they could be. For instance, in our sample 685 houses are described as having
panoramic views, while a total of 138 are on the top floor. Of this subsample, only 42
are described as having panoramic views. This suggests that there is indeed a degree of
heterogeneity in the descriptions of the houses and that by controlling for all the
important sources of heterogeneity of the prices, our analysis may actually capture the
importance of some keywords.
Another way of seeing our analysis is that we focus on the price requested by the
seller, rather than on the sale price. While this can undoubtedly cause problems, it is still
interesting since there is of course a relationship between the requested price and the
final price, and it seems legitimate to imagine that the requested price is related to the
announcement and its quality.

Methodology
Bigram Formation Model
A bigram formation algorithm allows for the detection of meaningful expressions made
of two words and helps to reduce the number of tokens to extract more relevant infor-
mation from text data (e.g. “large room” and “panoramic view,” are both treated as one
token, respectively “large_room” and “panoramic_view”).
Figure 2. Correlation between log price and number of tokens.

Generally, the bigram algorithm predicts the probability that the next token $w_i$ is
found on the basis of the preceding token $w_{i-1}$. In terms of probability, this leads to
the approximation given by the Markov assumption, for which the probability of a word
depends only on the previous one. If $w_i$ and $w_{i-1}$ are two contiguous words, then the
conditional probability of $w_i$ given $w_{i-1}$ is thus the following:

$$P(w_i \mid w_{i-1}) = \frac{P(w_i, w_{i-1})}{P(w_{i-1})} \tag{1}$$

where $P(w_i, w_{i-1})$ is the joint probability of the two words within the corpus, i.e., the
set of all the tokens. It is not the aim of this article to go into detail discussing bigrams,
n-gram algorithms, and the Markovian processes in language modeling; for a detailed
description see Jurafsky (2000).
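
In practice, the conditional probability in Equation (1) is commonly estimated from corpus counts, as the number of times the two words occur consecutively divided by the number of occurrences of the preceding word. The sketch below shows this count-based (maximum-likelihood) estimate on a toy token list; the function name and the example tokens are illustrative assumptions.

```python
from collections import Counter


def bigram_conditional_probability(tokens, prev_word, word):
    """Count-based estimate of P(word | prev_word) from a list of tokens."""
    unigram_counts = Counter(tokens)
    # Adjacent pairs (w_{i-1}, w_i) in corpus order.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]


toy_tokens = ["vista", "panoramica", "e", "vista", "mare", "con", "vista", "panoramica"]
p = bigram_conditional_probability(toy_tokens, "vista", "panoramica")
print(round(p, 3))
```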
It is nonetheless important to state that while the aforementioned formula is theoretic-
ally correct, it puts a severe constraint on the formation of a meaningful bigram by not
taking into account, for instance, the so-called “collocations” phenomenon,13 or the pres-
ence of unknown stop-words as well as of rare words14 among the tokens of a possible
candidate couple $(w_i, w_{i-2})$.
To this end, we relied on the “Gensim” Python library (Rehurek & Sojka, 2010), a widely
used and comprehensive framework for text analysis and bigram detection that supports
phrase modeling, i.e., a more general and accurate approach that predicts words by
learning combinations (co-occurrences) of tokens that together represent meaningful
multi-word concepts.15
More specifically, we formed bigrams using the scoring function (i.e., the function that
scores the relevance of a bigram) implemented in Gensim and proposed by Mikolov
et al. (2013), which takes two parameters: the minimum count for a single
pair of two words and a threshold value. The first prevents the scoring function from
forming inconsistent bigrams from rare words, and the second filters the output by
returning only relevant bigrams.

In other words, given a pair of two words $(w_i, w_j)$ within a corpus of $N$ tokens, following
Mikolov et al. (2013), the formula our phrase models use to establish whether the
two tokens constitute a bigram is:
$$\frac{(\mathrm{count}(w_i, w_j) - \delta) \times N}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)} > \mathrm{threshold} \qquad (2)$$
where $\delta$ is the minimum occurrences value before a word can be considered to form a
bigram (the default value in the Gensim library is 5 and the default threshold value is
10). Notably, the higher both values are, the more selective the algorithm will be (i.e.,
fewer bigrams found).
The default values are standard in the scientific literature (Mrini et al.,
2017; Plotnikova et al., 2015), but they are usually applied to long documents. Given the short
mean length of each announcement (Pomikalek & Rehurek, 2007), and after
some tests that confirmed the robustness of this choice, we chose to raise only the
threshold, setting it to a very high precautionary value (100) in order to limit the bigrams
to those that are semantically meaningful (thus trying to exclude mere “algorithmic” findings).16
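
The scoring rule in Equation (2) can be sketched in a few lines of plain Python. This is an illustrative re-implementation under stated assumptions, not the Gensim code: N is taken as the total number of tokens, following the text (the library's own scorer may define it differently), and the function name, the toy corpus and the rescaled threshold used in the example call are hypothetical.

```python
from collections import Counter


def score_bigrams(sentences, min_count=5, threshold=100.0):
    """Return the pairs whose Equation (2) score exceeds the threshold."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        unigram_counts.update(sentence)
        bigram_counts.update(zip(sentence, sentence[1:]))

    # Assumption: N is the total number of tokens in the corpus.
    n_tokens = sum(unigram_counts.values())

    accepted = {}
    for (first, second), count in bigram_counts.items():
        score = ((count - min_count) * n_tokens
                 / (unigram_counts[first] * unigram_counts[second]))
        if score > threshold:
            accepted[(first, second)] = score
    return accepted


# Hypothetical toy corpus; far too small for a threshold of 100,
# so the example call uses a rescaled threshold.
corpus = [["flat", "panoramic", "view"]] * 6 + [["flat", "near", "station"]] * 2
bigrams = score_bigrams(corpus, min_count=5, threshold=0.6)
merged = ["_".join(pair) for pair in bigrams]
print(merged)
```

Pairs occurring no more often than the minimum count receive a non-positive score and are therefore never accepted, which is how the first parameter protects against bigrams built from rare words.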

The Econometric Model


In order to estimate the impact of the presence of a given keyword on the value of the
houses, we express the houses’ value as a function of several variables. More precisely,
we estimated the value of a house through the following equation:
$$y = \alpha + \beta_1 Size + \beta_2 Char + \beta_3 Text + \beta_4 Neigh + \varepsilon \qquad (3)$$
where y is our dependent variable indicating the value of the house. As explained above, it
is very hard to isolate the impact of each variable on the actual price, because
the characteristics matter differently in houses of different sizes. Indeed, the
final price, as we stated, is the result of a very complex interaction of several mechanisms.
For this reason, we chose to consider different transformations of the price in an attempt to
mitigate the problem and reduce biases and noise in the estimates. Depending on the case, in
the different regressions y may be operationalized via four different variables, namely:

• LnPrice, which equals the natural logarithm of the price;
• LnPriceSqMt, which equals the natural logarithm of the price per square meter;
• FrPr, which is the price share computed with respect to the most expensive house in
the same neighborhood: it is a strictly positive variable (no house is given away for
free), and assumes the value of 1 for the most expensive house of each given neigh-
borhood. This variable captures the heterogeneity in prices under the assumption
that different neighborhoods are too different to be compared: indeed, in that case,
the relative price is with regard to houses of the same neighborhood. It has been
argued that the location differential in house prices will not only be captured by
accessibility effects (Ahlfeldt, 2011; Shen & Karimi, 2016), but also by local area effect
as defined by the street network (Law, 2017). If this is true, this variable captures the
price differences by taking into account the structure of the city, and thus cleaning
the estimates from the previous (potential) error;
• FrPrSqMt, which is the same as FrPr, but using prices per square meter instead of
absolute prices.
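
The four transformations of the price listed above can be computed as in the following minimal sketch. The dictionary keys, the function name and the three listings are hypothetical and serve only to show the arithmetic.

```python
import math


def price_transforms(listings):
    """Compute LnPrice, LnPriceSqMt, FrPr and FrPrSqMt for each listing."""
    # First pass: maximum price and maximum price per square meter
    # within each neighborhood.
    max_price, max_price_sqm = {}, {}
    for house in listings:
        nb = house["neighborhood"]
        price_sqm = house["price"] / house["area"]
        max_price[nb] = max(max_price.get(nb, 0.0), house["price"])
        max_price_sqm[nb] = max(max_price_sqm.get(nb, 0.0), price_sqm)

    # Second pass: logs and shares relative to the neighborhood maximum.
    rows = []
    for house in listings:
        nb = house["neighborhood"]
        price_sqm = house["price"] / house["area"]
        rows.append({
            "LnPrice": math.log(house["price"]),
            "LnPriceSqMt": math.log(price_sqm),
            "FrPr": house["price"] / max_price[nb],
            "FrPrSqMt": price_sqm / max_price_sqm[nb],
        })
    return rows


# Hypothetical listings (illustrative values only).
listings = [
    {"price": 200000.0, "area": 100.0, "neighborhood": "A"},
    {"price": 100000.0, "area": 40.0, "neighborhood": "A"},
    {"price": 150000.0, "area": 75.0, "neighborhood": "B"},
]
transformed = price_transforms(listings)
print(transformed[1])
```

By construction both shares lie in the interval (0, 1] and equal 1 for the most expensive house of each neighborhood, which is the property that motivates a fractional response estimator for these two variables.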

Continuing the description of the equation, we modeled the price as a function of sev-
eral variables. These are:

• Size is a matrix meant to measure the dimension of the house. It is composed of
three different variables, namely Rooms, the number of rooms in the house (which
we expect to be positively related to the house price); Area, the area of the house
expressed in square meters (which we also expect to have a positive coefficient); and
finally AreaSqr, the square of the previous variable (which we expect instead to have
a coefficient with a negative sign since the price of a very large house is higher in a
less than proportional way compared to smaller ones). The last two variables are
included in the matrix to control for the impact of size on the price, which the scien-
tific literature suggests as being non-linear (i.e., the relation between prices and
houses size is concave) because of so-called subdivision costs (Lin & Evans, 2000).
Please also note that these last two variables are not included in the matrix when
the dependent variable refers to the price per square meter, since the area already
enters the dependent variable and including it would create an endogeneity problem;
• Char is a matrix meant to discriminate for the characteristics of the house. It is
composed of four variables, namely: Toilets, expressing the number of toilets in
the house; and three dichotomous variables indicating whether the house is on a
specific floor, usually correlated to a different price. These variables are equal to 1
if the house is on the specified floor, or to 0 otherwise, and are: First Floor,
Ground Floor (for which we expect a negative coefficient given that these floors
are less attractive to the potential buyer), and Penthouse (for which we expect
instead a positive coefficient, due to the prestige associated with buying
a penthouse);
• Text is either a matrix of five variables, meant to capture the keywords
chosen in the sale announcement, or a matrix of four part-of-speech shares.
More precisely, the variables belonging to the first matrix are:

– the variable cultural_heritage, which captures announcements stressing
the proximity to an important monument or a site of cultural interest. In this
regard, we considered the co-occurrences of site-specific keywords in each
announcement (see Appendix B). More precisely, the keyword set was constructed
according to the ranking of the thirty most popular monuments and historical
sites in Naples on TripAdvisor on 7 December 2019 (over 599 sites of
historical interest; the ranking is unchanged from October 2019 until now).17
– the variable transport, which is meant to capture the intention of advertising a
house as being near a local public transport facility (such as train stations, high-
ways, underground and bus lines);
– the variable investment, which accounts for a flat on sale as being advertised as
already rented, thus indicating its use as a means to generate income flow (for
this variable we found the relevance of the bigram “investment_use”);
– the variable touristic, which detects the effect of a house being advertised as useful
for touristic flows. It considers both the house location (houses whose intrinsic values
are notably associated with the possibility of exploiting touristic attendances) and its
current economic destination (e.g. houses already in use as B&Bs and thus income
generators). This is particularly interesting since, as we said, anecdotal evidence sug-
gests that Naples, and especially its historical center, is suffering from gentrification,
driven by Airbnb and other forms of short-term renting aimed at tourists;
– finally, the variable panorama, which reflects the importance for the sellers of eliciting
buyers’ interests, highlighting the special view that can be enjoyed from a house.
The variables belonging to the second matrix, on the other hand, alternatively used as
Text in another set of regressions, are:

– the variable Nouns, measuring the share of nouns in the sale announcement. It is
computed as the total number of nouns divided by the total length in tokens
(pre-processed) of the announcement. While it is interesting to see its effect on
the price, it is hard to predict its sign beforehand.
– the variable Verbs, measuring the share of verbs in the sale announcement. It is
computed as the total number of verbs divided by the total length in tokens (pre-
processed) of the announcement. Once again, while it is interesting to see its
effect on the price, it is hard to predict its sign beforehand.
– the variable Adjectives, measuring the share of adjectives in the sale announce-
ment. It is computed as the total number of adjectives divided by the total length
in tokens (pre-processed) of the announcement. We expect this variable to have a
positive effect on the price, given the pronounced tendency of marketers to
encourage the use of adjectives to increase buyers’ interest.
– the variable Punctuation, measuring the share of punctuation signs in the sale
announcement. It is computed as the total number of punctuation signs divided by
the total length in tokens (pre-processed) of the announcement. This variable may tell
us the importance of correct punctuation in the announcements for the buyers.
Indeed, a quicker style is frequently adopted on the internet, more often than not
loose in terms of grammar, and punctuation is one of the first victims of this
“fast” style.

• Neigh is a matrix of thirteen dichotomous variables, one for each neighborhood
(with the exception of Barra-Ponticelli-Teduccio, excluded in order to avoid the
so-called dummy variable trap), to control for the variance due to the different
parts of the city. These variables are equal to 1 if the house is in the given
neighborhood, and 0 otherwise.
• $\varepsilon$, as usual, is the error term.
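
The two alternative versions of the Text matrix can be illustrated with a minimal sketch. It assumes that each announcement has already been pre-processed and tagged upstream into (token, tag) pairs using Universal POS tag names; the function names, the keyword set and the example announcement are hypothetical, and the actual keyword lists are those described in Appendix B.

```python
from collections import Counter

# Mapping from the variable names used in the text to Universal POS tags.
POS_VARIABLES = {
    "Nouns": "NOUN",
    "Verbs": "VERB",
    "Adjectives": "ADJ",
    "Punctuation": "PUNCT",
}


def pos_shares(tagged_tokens):
    """Share of each part of speech over the total length in tokens."""
    total = len(tagged_tokens)
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    return {
        name: (tag_counts[tag] / total if total else 0.0)
        for name, tag in POS_VARIABLES.items()
    }


def keyword_dummy(tagged_tokens, keywords):
    """Return 1 if at least one token belongs to the keyword set, else 0."""
    return int(any(token in keywords for token, _ in tagged_tokens))


# Hypothetical tagged announcement (illustrative only).
announcement = [
    ("bright", "ADJ"), ("flat", "NOUN"), ("panoramic_view", "NOUN"),
    (",", "PUNCT"), ("renovated", "ADJ"), ("kitchen", "NOUN"),
    ("sold", "VERB"), ("furnished", "ADJ"),
]
shares = pos_shares(announcement)
panorama = keyword_dummy(announcement, {"panoramic_view", "panorama"})
print(shares, panorama)
```

In this example three of the eight tokens are nouns and three are adjectives, so both shares equal 0.375, and the panorama dummy equals 1 because the bigram token is in the keyword set.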

While this econometric specification certainly relies on the more consolidated literature
on hedonic pricing (Goodman, 1978), it is important to highlight that our focus is on the
impact of the language used in the sale announcement, rather than on the characteris-
tics of the houses, which we mainly use as a control.
We expect this model to be able to assess the impact of the presence of the keywords
or of the text structure of the announcement, proxied by the variables in the
Text matrix, which are inserted alternately in the regressions. Once the usual factors
that affect the price of a house are controlled for, the coefficients of these variables
should express their impact as a percentage of the final price (when the dependent
variable is expressed in logarithmic form).
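
The percentage reading of a coefficient in a log-price regression can be made concrete with a small calculation. For a dichotomous regressor, the exact implied change in price is 100 × (exp(β) − 1), which is close to 100 × β only when β is small. The sketch below is illustrative: the function name is hypothetical, and 0.144 is the Panoramic coefficient reported in column 3.3 of Table 3.

```python
import math


def pct_effect(beta):
    """Exact percentage change in price implied by a dummy coefficient
    in a regression whose dependent variable is the log of the price."""
    return 100.0 * (math.exp(beta) - 1.0)


# Panoramic coefficient from column 3.3 of Table 3.
panoramic_pct = pct_effect(0.144)
print(round(panoramic_pct, 1))
```

For this coefficient the exact effect is roughly 15.5 percent, slightly above the 14.4 percent suggested by reading the coefficient directly.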

Results
In order to assess the impact of the text announcement on the house price, we estimate
our equation via several regressions. First, we run a baseline model using OLS
as the estimator, presented in Table 3. We add the matrices to the estimates
one at a time, to check whether including certain variables in the model changes
the sign and statistical significance of the results, which would suggest a problem in
the specification. The only exception is the Neigh matrix, which is
always included in order to discriminate between neighborhoods that have distinct
baseline prices. Thus, apart from the Neigh matrix, the first specification of each
model only includes the Size matrix, the second adds the Char matrix, the third adds
Text to Size, and the fourth has all three matrices at once. The Text matrix is included in both its
versions, namely with the dummies signaling the presence of given keywords (3.3 and
3.4) or with the shares of given parts of speech (3.5 and 3.6).
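
The way the matrices enter the six specifications of Table 3 can be summarized as follows. This is only a bookkeeping sketch: the column names follow the variable names used in the text, while the neighborhood dummy names are placeholders.

```python
SIZE = ["Rooms", "Area", "AreaSqr"]
CHAR = ["Toilets", "FirstFloor", "GroundFloor", "Penthouse"]
TEXT_KEYWORDS = ["cultural_heritage", "transport", "investment",
                 "touristic", "panorama"]
TEXT_SHARES = ["Nouns", "Verbs", "Adjectives", "Punctuation"]
# Thirteen neighborhood dummies (placeholder names), always included.
NEIGH = [f"Neigh_{k}" for k in range(1, 14)]

specifications = {
    "3.1": SIZE + NEIGH,
    "3.2": SIZE + CHAR + NEIGH,
    "3.3": SIZE + TEXT_KEYWORDS + NEIGH,
    "3.4": SIZE + CHAR + TEXT_KEYWORDS + NEIGH,
    "3.5": SIZE + TEXT_SHARES + NEIGH,
    "3.6": SIZE + CHAR + TEXT_SHARES + NEIGH,
}

for name, regressors in specifications.items():
    print(name, len(regressors))
```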
As can be seen in Table 3, all the variables have the expected signs, and in several
specifications all of them are also statistically significant. In this respect, please note
that the regressions presented have robust standard errors.18 Furthermore, the R squared
has a very high value in the most complete specifications (3.4 and 3.6), suggesting that the
model describes the phenomenon well. Notably, in all the specifications the price increases
with the increase of the size of the house, the number of rooms and toilets; on the other
hand, it decreases with the square of the area (suggesting that bigger houses have prices
less than proportionally higher with respect to smaller houses), and there is a penalty in the
price for being on the first or ground floor, with the latter penalty higher than the former.
On the other hand, if the house is on the top floor of a building, there is no increase in
price. This result is surprising but may find an explanation in the fact that the vast majority
of houses in the city center and historical center of Napoli, which is a very old city, do not
have an elevator.19 Thus, the value of being on the top floor and being able to enjoy a
beautiful view and peaceful setting is possibly a benefit that is canceled out in the price by
the need to climb many flights of stairs.
Coming to the first matrix of text analysis variables (specifications 3.3 and 3.4), references
in the sales announcement to a famous monument or spot of cultural interest, or to a
panoramic view, are related to an increase in price. On the other hand, language referring
to the house as an investment opportunity is related to a decrease in the price. This is prob-
ably because a reference to an investment opportunity is very often a way of expressing
that the flat is currently rented, and Italian law prohibits the landlord from ending the lease
contract before its expiration, even in the event of a sale, so the value of the house is thus
reduced since it is not immediately available for the buyer to use for her or his own pur-
poses. Furthermore, the investment dummy-variable is mainly conceived to detect all those

Table 3. OLS regression - log price over determinants.

                             (3.1)       (3.2)       (3.3)       (3.4)       (3.5)       (3.6)
                          LogPrice    LogPrice    LogPrice    LogPrice    LogPrice    LogPrice
Rooms                        0.172       0.135       0.157       0.125       0.171       0.135
                           (12.18)     (10.70)     (11.95)     (10.45)     (12.39)     (10.86)
Area                        0.0110     0.00988      0.0103     0.00931      0.0110     0.00987
                           (11.71)     (11.39)     (11.81)     (11.38)     (11.99)     (11.70)
AreaSqr                 -0.0000119  -0.0000106  -0.0000108 -0.00000974  -0.0000119  -0.0000106
                           (-4.87)     (-4.79)     (-4.83)     (-4.72)     (-5.00)     (-4.94)
Toilets                                 0.0993                  0.0978                  0.0976
                                        (7.71)                  (7.80)                  (7.60)
First Floor                            -0.0795                 -0.0638                 -0.0796
                                       (-7.12)                 (-5.81)                 (-7.13)
Ground Floor                            -0.357                  -0.316                  -0.354
                                      (-15.68)                (-14.27)                (-15.70)
Penthouse                              -0.0637                 -0.0838                 -0.0582
                                       (-1.29)                 (-1.69)                 (-1.18)
Cultural Heritage                                    0.215       0.216
                                                    (3.52)      (3.61)
Transport                                           0.0159     0.00853
                                                    (1.35)      (0.75)
Investment                                          -0.256      -0.227
                                                  (-13.37)    (-12.39)
Touristic                                         -0.00261     0.00970
                                                   (-0.09)      (0.36)
Panoramic                                            0.144       0.120
                                                    (9.56)      (7.88)
Share of Nouns                                                               0.630       0.645
                                                                            (3.45)      (3.74)
Share of Verbs                                                               0.165       0.154
                                                                            (0.60)      (0.59)
Share of Adjectives                                                          0.825       0.696
                                                                            (5.08)      (4.48)
Share of punctuation                                                        -0.114     -0.0952
                                                                           (-0.82)     (-0.72)
Dummy Neighborhood             YES         YES         YES         YES         YES         YES
Constant                     10.20       10.34       10.31       10.42       9.936       10.08
                          (272.93)    (283.83)    (276.98)    (291.15)    (124.35)    (132.41)
Observations                  5040        5040        5040        5040        5040        5040
R2                           0.797       0.813       0.810       0.823       0.799       0.815
t statistics in parentheses  p < 0.1,  p < 0.05,  p < 0.01.

announcements related to the selling of already-rented flats that can generate income without
any extra cost in the medium-long term (i.e., the cost of finding a tenant): as a consequence,
the selling price is also affected by a discounting effect. Finally, the prices of houses
marketed as touristic in their announcements are not statistically significantly different from
the others, and the same holds for houses advertised as being close to transport facilities. While this result may
be surprising, the former finding suggests that the gentrification which seems to affect
Napoli according to anecdotal evidence, is possibly not so important in determining house
prices: houses sold as touristic or as appropriate for establishing as tourism-related busi-
nesses (such as a bed and breakfast or a flat to be rented on a short-term basis to tourists)
are not sold on the market for higher prices20. On the other hand, for the latter result, the
statistical non-significance on the price of transport facilities advertised in the house
announcement is possibly due to the many supposed transport opportunities (such as metro
stations) that are constantly being announced (and in some cases also inaugurated by politi-
cians several times) without ever seeing a start to their service. The building of Naples’ new
metro line began in 1976, and the total length is now 18 kilometers,21 which suggests an
average increase of 418 meters per year, hardly an impressive figure. This may mean that
people do not really give an added value to the fact that a sale announcement advertises a
property as being close to a metro station since many people do not believe it is actually
going to open as soon as claimed by the public authorities, due to this long history of failed
expectations.
On the other hand, the second matrix of text analysis variables (specifications 3.5 and
3.6) shows how a greater share of nouns and adjectives is correlated to higher prices. It
suggests that houses with sale announcements that have a greater share of nouns and
adjectives are sold at higher prices than houses advertised with fewer nouns and adjec-
tives. The coefficients suggest that the frequency of adjectives especially impacts on the
price. On the other hand, in this regression, neither the relative share of verbs nor that
of punctuation has an impact on the price. While a greater usage of nouns can be at
least partly explained by the presence of more facilities or features in the flats, adjectives
mainly serve to qualify these characteristics in order to keep the flats’ price
at a higher level. In this respect, we observe flats with a “very bright kitchen” or
bedroom, a “wide living room,” “long balconies,” or an “elegant parquet,” sold in
“just-renovated buildings,” and so on. The adjective is hence mainly used to qualify some key
aspects of a given flat in order to act as a stimulus for the buyers.
Sellers can, of course, also lie. While this remains a veiled phenomenon in our analysis
(our dataset does not allow us to observe such behavior), we strongly
believe that an agent has no interest in posting a false announcement,
mainly for the following reason: a lie in the announcement can be unmasked by the
subsequent physical tour of the flat.
From this point of view, while the client loses time that could have been spent visiting other
properties, the seller runs a high risk of losing credibility. Put differently, a seller can find
himself in hot water if the client (potential buyer) decides to report the seller’s behavior to the
platform’s management (through a specific alert function). One of the options a potential
buyer can choose to report a scam is in fact “the announcement is a potential fraud.”
As a consequence, many announcements mention the presence of certain
characteristics without commenting on their actual condition when it is poor (that is to say, a
flat with a very bad kitchen left by the owner is simply reported as “a flat with kitchen,”
without the compromising qualifications given by the class of boosting adjectives).
We replicated the analysis by changing our dependent variable to the log of the price
per square meter of the house. Indeed, as the effect of the area of the house is
non-linear, which is already suggested by the regressions in Table 3, the price per square
meter may better indicate the value of the house. As can be seen in Table 4,
all the coefficients have the same sign and pass the usual statistical significance
thresholds, as in the previous battery of regressions.
As a further analysis, we also changed our dependent variable to the house price as a share
of the maximum price in its neighborhood, and to the price per square meter as a share of the
maximum price per square meter in its neighborhood. These two variables
have an upper bound of 1, the value assumed when the price (either total or
per square meter) is the highest in the neighborhood, and a (theoretical) lower
bound of 0 if the house has no value. Since both variables are bounded between these two
values, we decided to employ a different estimator, namely a fractional probit estimator, which
is especially useful for modeling outcomes that are shares between 0 and 1.
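
For reference, the fractional probit model is usually written as below. This is the standard textbook formulation, given here as a sketch rather than as the exact specification we estimated: $x_i$ is assumed to collect the regressors of Equation (3), $\Phi$ and $\phi$ denote the standard normal distribution and density functions, and the estimator maximizes a Bernoulli quasi-log-likelihood.

```latex
\begin{align}
E[y_i \mid x_i] &= \Phi(x_i'\beta), \qquad 0 \le y_i \le 1, \\
\hat{\beta} &= \arg\max_{\beta} \sum_{i=1}^{n}
  \Big[ y_i \ln \Phi(x_i'\beta) + (1 - y_i) \ln\big(1 - \Phi(x_i'\beta)\big) \Big], \\
\frac{\partial E[y_i \mid x_i]}{\partial x_{ik}} &= \phi(x_i'\beta)\,\beta_k .
\end{align}
```

The last line is the marginal effect of a continuous regressor; it depends on $x_i$, which is why marginal effects rather than raw coefficients are the quantities usually reported for this estimator.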
In this set of regressions, of which the marginal effects are presented in Table 5, the
signs of the coefficients and the statistical significance are also the same, as commented
previously, suggesting that the estimated effects are stable and the equation is solid in
considering the sources of variance of the price.

Conclusions
Real estate markets are of great importance in strengthening economies both at a national
and a local level. From this point of view, Italy is a pivotal case study because of its relatively
high rate of house owners, compared to the vast majority of European and also American
countries. It is also important since investment in houses is an important financial tool in
Italian culture, and many citizens still consider the property a safe form of asset.
Furthermore, while many studies of house pricing modeling have a theoretical founda-
tion in the neoclassical framework, increasing criticism of this approach has come from
scholars who find the assumption that people act as rational agents inappropriate and
unlikely. Moreover, aside from the debate regarding the different theoretical assump-
tions, some researchers argue about the difficulty in measuring a housing market, due to
the incredible heterogeneity of attributes that define practically every house (Cho, 1996).
Both the aforementioned considerations led us to adopt a different approach that
does not consider the agent to be perfectly rational, instead basing our empirics on
the use of text analysis in sales announcements of a local real estate market. This is done to
obtain relevant insights into agent behavior, and to understand what influences the price
of a house in a sale announcement.
More precisely, we were able to investigate the role played by textual descriptions of
houses through the identification of certain meaningful keywords, which seem useful in
predicting local real estate market prices. While the importance of text data analysis has
been growing in economic and econometric literature (Fagan & Genc, 2011), we still
noticed few contributions with regard to the effect of textual descriptions of houses in
real estate markets (among them, the most prominent are Nowak & Smith, 2017;

Table 4. OLS regression - log of price per square meter over determinants.

                                 (4.1)           (4.2)           (4.3)           (4.4)           (4.5)           (4.6)
                        LogPrice Sq.Mt  LogPrice Sq.Mt  LogPrice Sq.Mt  LogPrice Sq.Mt  LogPrice Sq.Mt  LogPrice Sq.Mt
Rooms                           0.0922          0.0351          0.0685          0.0202          0.0908          0.0349
                               (17.38)          (5.33)         (11.83)          (2.98)         (16.93)          (5.25)
Toilets                                          0.107                          0.0994                           0.104
                                                (9.63)                          (9.10)                          (9.41)
First Floor                                    -0.0731                         -0.0587                         -0.0729
                                               (-6.52)                         (-5.24)                         (-6.51)
Ground Floor                                    -0.251                          -0.219                          -0.248
                                              (-11.33)                         (-9.89)                        (-11.30)
Penthouse                                      -0.0569                         -0.0772                         -0.0524
                                               (-1.04)                         (-1.41)                         (-0.96)
Cultural Heritage                                                0.175           0.169
                                                                (3.11)          (3.04)
Transport                                                      0.00503       0.0000649
                                                                (0.42)          (0.01)
Investment                                                      -0.175          -0.152
                                                               (-9.00)         (-7.92)
Touristic                                                      0.00214         0.00845
                                                                (0.08)          (0.32)
Panoramic                                                        0.148           0.123
                                                               (10.16)          (8.29)
Share of Nouns                                                                                   0.471           0.512
                                                                                                (2.72)          (3.05)
Share of Verbs                                                                                   0.203           0.247
                                                                                                (0.78)          (0.98)
Share of Adjectives                                                                              0.853           0.730
                                                                                                (5.39)          (4.73)
Share of punctuation                                                                            -0.172          -0.121
                                                                                               (-1.29)         (-0.94)
Dummy Neighborhood                 YES             YES             YES             YES             YES             YES
Constant                         6.938           7.031           7.016           7.085           6.717           6.802
                              (186.48)        (194.29)        (181.83)        (190.29)         (87.48)         (91.43)
Observations                      5040            5040            5040            5040            5040            5040
R2                               0.516           0.541           0.535           0.554           0.520           0.544
t statistics in parentheses  p < 0.1,  p < 0.05,  p < 0.01.

Table 5. Fractional probit regression: marginal effects, share of maximum price per square
meter over determinants.

                              (5.1)        (5.2)        (5.3)        (5.4)
                        Sh.MaxPrice  Sh.MaxPrice   Sh.PrSq.Mt   Sh.PrSq.Mt
Rooms                        0.0177       0.0193      0.00465      0.00851
                             (7.32)       (7.97)       (2.37)       (4.36)
Area                        0.00162      0.00171
                            (10.28)      (11.33)
AreaSqr                 -0.00000127  -0.00000142
                            (-3.38)      (-3.97)
Toilets                      0.0261       0.0264       0.0326       0.0340
                             (9.17)       (9.07)       (9.90)      (10.23)
First Floor                 -0.0142      -0.0175      -0.0213      -0.0260
                            (-6.02)      (-7.32)      (-5.89)      (-7.17)
Ground Floor                -0.0402      -0.0462      -0.0551      -0.0632
                            (-8.44)      (-9.71)      (-8.51)      (-9.94)
Penthouse                   -0.0116     -0.00621     -0.00838      0.00114
                            (-1.69)      (-0.89)      (-0.65)       (0.09)
Cultural Heritage            0.0589                    0.0568
                             (3.68)                    (3.22)
Transport                   0.00138                  -0.00231
                             (0.56)                   (-0.61)
Investment                  -0.0390                   -0.0388
                           (-10.79)                   (-7.52)
Touristic                 -0.000669                   0.00444
                            (-0.12)                    (0.63)
Panoramic                    0.0270                    0.0355
                             (7.50)                    (7.40)
Share of Nouns                             0.130                     0.179
                                          (3.64)                    (3.38)
Share of Verbs                             0.104                     0.142
                                          (1.88)                    (1.85)
Share of Adjectives                        0.203                     0.251
                                          (5.91)                    (5.21)
Share of punctuation                     -0.0126                   -0.0805
                                         (-0.45)                   (-1.99)
Neighborhood                    YES          YES          YES          YES
Observations                   5040         5040         5040         5040
Log likelihood              -2342.4      -2345.5      -2615.7      -2618.5
Chi 2                       12399.9      11608.7       4279.2       4353.6
t statistics in parentheses  p < 0.1,  p < 0.05,  p < 0.01.

Goodwin et al., 2018)22 and, in particular, none of them is related to a hedonic price
model integrated with part-of-speech variables or word-variables.
To summarize our overall result, we show that some keywords (namely those related
to a panoramic view and cultural goods), while we cannot say whether they match the
effective existence of the corresponding physical features, strongly correlate with higher
prices. According to our interpretation and within the framework of behavioral econom-
ics, this could reveal the presence of an economic incentive for those sellers who want
to keep prices to higher levels, attracting buyers’ interest on the basis of greatly desir-
able house features.
Furthermore, the use of nouns and adjectives in sale announcements is linked with
higher prices. This result suggests to people interested in selling a house that crafting a
text filled with these two parts of speech may help to sell the house at a higher price.
This finding is consistent with the advice offered by several marketing websites,
which recommend the use of many adjectives in the sale announcement.
In conclusion, this paper raises further awareness of the relevance of text analysis for
the real estate market. It can help researchers interested in the modeling of house prices
to detect hidden house features (which is still a critical issue in house pricing) and draw
inference on how and to what extent language use can affect house prices.
In the future, further contributions may try to replicate this analysis on other datasets
of the same or a different city, to extend and generalize its results, or
take into account other stylometric features of a house description and test their
relevance for price prediction.

Acknowledgments
The authors wish to thank Dr. Lorenzo Cicatiello for several useful comments on a preliminary ver-
sion of the article.

Notes
1. https://www.realestateexpress.com/career-hub/grow-your-real-estate-career/real-estate-listing-
descriptions/ (accessed 18 March 2020).
2. https://www.rubyhome.com/blog/how-to-write-effective-real-estate-ads/ (accessed 18 March 2020).
3. The last two can be found at: https://retipster.com/howtowriterealestateads/ (accessed 18
March 2020).
4. http://www.italy24.ilsole24ore.com/art/real-estate/2017-01-09/istat-80percento-of-italians-live-
owned-property-130059.php?uuid=ADpZcFTC (accessed 26 December 2019).
5. https://ec.europa.eu/eurostat/web/products-eurostat-news/-/DDN-20171102-1
6. The research can be found at: https://www.nahbclassic.org/generic.aspx?genericContentID=
57411&fromGSA=1 (accessed 19/08/2021).
7. ISTAT, clients’ movement in hotels, annual averages.
8. See https://espresso.repubblica.it/attualita/2019/01/14/news/airbnb-napoli-1.330438 or also
https://corrieredelmezzogiorno.corriere.it/napoli/arte_e_cultura/19_dicembre_16/napoli-record-
centro-storico-unesco-rischio-gentrificazione-307a7b02-2035-11ea-b618-2a8c8b16f4a2.shtml
9. The most popular Italian website in the field of real estate was Immobiliare.it, as of November
2019 (https://www.similarweb.com/top-websites/italy/category/business-and-consumer-services/
real-estate, accessed 20 December 2019).
10. Unfortunately, we do not have any ID related to the single agent but only to
announcements. Hence, it could be possible that we observe different announcements (with
different IDs) from the same agent.
11. The NLTK (Natural Language Tool Kit) Python platform for computational linguistics is used to
remove Italian stop words (Bird et al., 2009). A full list of the NLTK Italian stop words is not
included for reasons of space, but available upon request [some examples of Italian stop
words are: avere (verb: to have - in all its verb forms), essere (verb: to be), anche (conjunction
or adverb: also or even), vostro (possessive adjective: your), con (preposition: with)].
12. SpaCy assigns part-of-speech tags on the basis of a convolutional
neural network (CNN) pretrained on the Universal Dependencies and WikiNER corpora. For a more detailed
description, see: https://spacy.io/models/it. Finally, for a complete list of the available POS
tags, see: https://spacy.io/api/annotation.
13. A collocation is a linguistic expression made of two or more words corresponding to some
conventional way of saying things (Manning & Schütze, 1999). Empirically, a collocation can
be seen as recurrent and predictable word combinations or, in more formal terms, the
quantification of the “mutual expectancy” (Firth, 1957) between words and “the statistical
influence a word exerts on its neighbourhood” (Evert, 2008). Examples of collocation are: “Do
your best,” “Do the washing up,” “maiden voyage,” “regular exercise,” etc. …
14. As shown below, we cannot simply drop rare words from the dataset given that we used
them in some of our queries so as to build the aforementioned 5 key dummy variables.
15. For more details on the algorithm used, please see https://radimrehurek.com/gensim/models/
phrases.html.
16. Our script for bigrams as well as for parameters values is based largely on the following
tutorial: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/.
17. https://www.tripadvisor.it/Attractions-g187785-Activitiesc47Naples_Province_of_Naples_
Campania.html
18. As a robustness check, we replicated the analysis using clustered standard errors at
neighborhood level, obtaining equivalent results.
19. Unfortunately, it is not possible to know whether the house has an elevator or not from our
data, since it is not explicitly highlighted.
20. Although quite different from each other, the variables touristic and investment could capture
at least some of the same effects. To avoid such a bias, we computed a Pearson correlation
coefficient which is equal to 0.224 thus showing a low degree of overlap.
21. https://www.metropolitanadinapoli.it/timeline/ (accessed 3 January 2020).
22. Many contributions to study of the house market within the context of text analysis are
focused on the recognition and evaluation of the demand sentiment (Zhu et al., 2018;
Ruscheinsky et al., 2018; Tsolacos, 2012).
23. We used the operator “and not” in order to distinguish “Chiesa di San Lorenzo” from
“Quartiere San Lorenzo,” which is a neighborhood.
24. As in the previous case, we used the operator “and not” because we are only interested in
those announcements mentioning Toledo underground station as the art-station, given its
peculiar architecture.
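
To illustrate the bigram detection referred to in notes 13, 15 and 16, the following is a minimal pure-Python sketch and not the gensim implementation itself. It assumes a collocation score of the kind proposed by Mikolov et al. (2013), namely the bigram count minus a discount, divided by the product of the two unigram counts; the scaling by vocabulary size, the function names, the toy announcements and the parameter values are all assumptions made for illustration.

```python
from collections import Counter


def find_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent word pairs and keep those above the threshold.

    Assumed score: (count(a, b) - min_count) * vocab_size / (count(a) * count(b)).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    accepted = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            accepted[(a, b)] = score
    return accepted


def merge_bigrams(tokens, accepted):
    """Join accepted adjacent pairs with an underscore, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in accepted:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical tokenized announcements (not taken from the dataset).
toy_ads = [
    ["appartamento", "vista", "mare", "centro"],
    ["vista", "mare", "terrazzo"],
    ["vista", "mare", "metro"],
    ["appartamento", "centro", "metro"],
]
accepted_pairs = find_bigrams(toy_ads)
merged_ad = merge_bigrams(["vista", "mare", "terrazzo"], accepted_pairs)
```

In this toy corpus only the pair ("vista", "mare") recurs often enough to be accepted, so it is merged into the single token vista_mare, the underscore-linked form used in the keyword lists of the appendices.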

ORCID
Vincenzo Alfano http://orcid.org/0000-0002-4981-748X

References
Abate, G., & Losa, G. (2016). Real estate in Italy: Markets, investment vehicles and performance.
Routledge.
Abraham, J. M., & Schauman, W. S. (1991). New evidence on home prices from Freddie Mac repeat
sales. Real Estate Economics, 19(3), 333–352.
Ahlfeldt, G. (2011). If Alonso was right: Modeling accessibility and explaining the residential land
gradient. Journal of Regional Science, 51(2), 318–338.
Ahmed, E., & Moustafa, M. (2016). House price estimation from visual and textual features. In
J. J. Merelo, F. Melício, J. M. Cadenas, A. Dourado, K. Madani, A. Ruano, & J. Filipe (Eds.), Proceedings of the
8th International Joint Conference on Computational Intelligence (pp. 62–68). Scitepress.
Akerlof, G., & Shiller, R. (2009). Animal spirits: How human psychology drives the economy and why it
matters for global capitalism (2nd ed.). Princeton University Press.
Ambrose, B. W., & Shen, L. (2021). Past experiences and investment decisions: Evidence from real estate
markets. Journal of Real Estate Finance and Economics. https://doi.org/10.1007/s11146-021-09844-2
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the
natural language toolkit. O’Reilly Media, Inc.
Case, B., Pollakowski, H. O., & Wachter, S. M. (1991). On choosing among house price index
methodologies. Real Estate Economics, 19(3), 286–307.
Cherif, E. (2013). Analysis of the Internet impact on the real estate industry. International Journal
of Service Science, Management, Engineering, and Technology (IJSSMET), 4(3), 51–67.
Cheshire, P., & Sheppard, S. (1998). Estimating the demand for housing, land, and neighbourhood
characteristics. Oxford Bulletin of Economics and Statistics, 60, 357–382.
Cho, M. (1996). House price dynamics: A survey of theoretical and empirical issues. Journal of
Housing Research, 7(2), 145–172.
Conlisk, J. (1996). Why bounded rationality? Journal of Economic Literature, 34(2), 669–700.
Daniel, K., Hirshleifer, D., & Subrahmanyam, A. (1998). Investor psychology and security market
under and over reactions. The Journal of Finance, 53(6), 1839–1885.
Daoud, M., Boitet, C., Kageura, K., Kitamoto, A., Mangeot, M., & Daoud, D. (2010). Building specialized
multilingual lexical graphs using community resources. In Z. Lacroix (Ed.), Resource discovery.
RED 2009. Lecture notes in computer science (Vol. 6162). Heidelberg, Berlin: Springer.
Daud, A., Khan, J., Nasir, J., Abbasi, R., Aljohani, N., & Alowibdi, J. (2018). Latent Dirichlet allocation
and POS tags based method for external plagiarism detection: LDA and POS tags based
plagiarism detection. International Journal on Semantic Web and Information Systems, 14, 53–69.
10.4018/IJSWIS.2018070103.
Earls, P. (2005). Economics and psychology in the twenty first century. Cambridge Journal of
Economics, 29(6), 909–926.
Evert, S. (2008). Corpora and collocations. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics. An
international handbook (Vol. 2, pp. 1212–1248). Berlin: Mouton de Gruyter.
Fagan, S., & Genc, R. (2011). An introduction to textual econometrics. In A. Ullah & D. Giles (Eds.),
Handbook of empirical economics and finance (pp. 136–158). Chapman and Hall/CRC.
Firth, J. R. (1957). A synopsis of linguistic theory 1930–55. In F. R. Palmer (Ed.), Studies in linguistic
analysis (Vol. 24, pp. 1–32). The Philological Society. Reprinted in Palmer (1968), pp. 168–205.
Gentzkow, M., & Shapiro, J. M. (2010). What drives media slant? Evidence from US daily
newspapers. Econometrica, 78(1), 35–71.
Gentzkow, M., Kelly, B., & Taddy, M. (2019). Text as data. Journal of Economic Literature, 57(3),
535–574.
Goodman, A. C. (1978). Hedonic price, price indices and housing markets. Journal of Urban
Economics, 5, 471–484.
Goodwin, K. R., Waller, B. D., & Weeks, H. S. (2014). The impact of broker vernacular in residential
real estate. Journal of Housing Research, 23(2), 143–161.
Goodwin, K. R., Waller, B. D., & Weeks, H. S. (2018). Connotation and textual analysis in real estate
listings. Journal of Housing Research, 27(2), 93–106.
Jurafsky, D. (2000). Speech & language processing. Pearson Education India.
Law, S. (2017). Defining street-based local area and measuring its effect on house price using a
hedonic price approach: The case study of Metropolitan London. Cities, 60, 166–179.
Levitt, S. D., & Syverson, C. (2008). Market distortions when agents are better informed: The value
of information in real estate transactions. Review of Economics and Statistics, 90(4), 599–611.
Li, Y., & Leatham, D. J. (2011, July 24–26). Forecasting housing prices: Dynamic factor model versus
LBVAR model. Selected Paper prepared for presentation at the Agricultural & Applied Economics
Association’s 2011 AAEA & NAREA Joint Annual Meeting, Pennsylvania.
Lin, T., & Evans, A. (2000). The relationship between the price of land and size of plot when plots
are small. Land Economics, 76(3), 386–394.
Liu, C. H., Nowak, A. D., & Smith, P. S. (2020). Asymmetric or incomplete information about asset
values? The Review of Financial Studies, 33(7), 2898–2936.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT
Press.
Marsh, A., & Gibb, K. (2011). Uncertainty, expectations and behavioural aspects of housing market
choices. Housing, Theory and Society, 28(3), 215–235.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of
words and phrases and their compositionality. In Advances in neural information processing
systems (pp. 3111–3119).
Mrini, K., Pappas, N., & Popescu-Belis, A. (2017). Cross-lingual transfer for news article labeling:
Benchmarking statistical and neural models. Idiap Research Report.
Nowak, A., & Smith, P. (2017). Textual analysis in real estate. Journal of Applied Econometrics, 32(4),
896–918.
Palmquist, R. B. (1984). Estimating the demand for the characteristics of housing. The Review of
Economics and Statistics, 66(3), 394–404.
Peek, J., & Wilcox, J. A. (1991). The measurement and determinants of single-family house prices.
Real Estate Economics, 19(3), 353–382.
Plotnikova, N., Kohl, M., Volkert, K., Evert, S., Lerner, A., Dykes, N., & Ermer, H. (2015). KLUEless:
Polarity classification and association. In Proceedings of the 9th International Workshop on
Semantic Evaluation (SemEval 2015), pp. 619–625.
Pomikalek, J., & Rehurek, R. (2007). The influence of preprocessing parameters on text
categorization. International Journal of Applied Science, Engineering and Technology, 1, 430–444.
Pryce, G., & Oates, S. (2008). Rhetoric in the language of real estate marketing. Housing Studies,
23(2), 319–348.
Rehurek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In
Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.
Rosen, S. (1974). Hedonic prices and implicit markets: Product differentiation in pure competition.
Journal of Political Economy, 82(1), 34–55.
Ruscheinsky, J., Lang, M., & Schäfers, W. (2018). Real estate media sentiment through textual
analysis. Journal of Property Investment & Finance, 36(5), 410–428.
Shen, L., & Ross, S. L. (2019). Housing prices and property descriptions: Using soft information to value
real assets. In Working Papers 2019-20. University of Connecticut, Department of Economics.
Shen, Y., & Karimi, K. (2016). Urban function connectivity: Characterisation of functional urban
streets with social media check-in data. Cities, 55, 9–21.
Shiller, R. J. (2005). Irrational exuberance (2nd ed.). Princeton University Press.
Silva, C., & Ribeiro, B. (2003). The importance of stop word removal on recall values in text
categorization. In Proceedings of the International Joint Conference on Neural Networks, 2003 (Vol. 3, pp.
1661–1666). IEEE.
Smith, S. (2011). Housing economics: The heterodox experiment. Housing, Theory and Society, 28(3),
300–304.
Stiglitz, J. (2010). Freefall: Free markets and the sinking of the global economy (2nd ed.). Penguin
Books.
Tsolacos, S. (2012). The role of sentiment indicators for real estate market forecasting. Journal of
European Real Estate Research, 5(2), 109–120.
Whittle, R., Davies, T., & Gobey, M. (2014). Behavioural economics and house prices: A literature
review. Business and Management Horizons, 2(2), 15–28.
Zhu, E., Jing, W., Hongyu, L., & Keyang, L. (2018). A sentiment index of the housing market: Text
mining of narratives on social media.

Appendix A
Key-words for dummy variables construction
panorama = ['panorama','panoramico','panoramica','vista_mare','vista_golfo']
Transport = ['mezzi_pubblici','mezzi_trasporto','metropolitana','metro','trasporti_pubblici',
'funicolare','linea','linee','cicumvesuviana','cumana','circumflegrea','stazione','fermata',
'tangenziale','autostrada','autobus']
Tourism = ['turista','turistico','turistici','turismo','turistica','turistiche','vacanza','vacanze','breakfast',
'casa_vacanze','b&b','turisti','struttura_ricettiva','ricettiva']
Investment = ['investimento','uso_investimento','rendita']
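
As an illustration of how such keyword lists can be turned into dummy variables, the following minimal sketch assumes that an announcement has already been tokenized (with bigrams joined by an underscore) and that a dummy equals 1 when at least one keyword of its list appears in the announcement. This matching rule, the function name and the example tokens are assumptions made for illustration; only two of the four lists are reproduced for brevity.

```python
# Two of the keyword lists from Appendix A, copied verbatim.
KEYWORDS = {
    "panorama": ["panorama", "panoramico", "panoramica", "vista_mare", "vista_golfo"],
    "investment": ["investimento", "uso_investimento", "rendita"],
}


def keyword_dummies(tokens, keywords=KEYWORDS):
    """Return a 0/1 dummy per list: 1 if any keyword occurs among the tokens."""
    present = set(tokens)
    return {
        name: int(any(word in present for word in words))
        for name, words in keywords.items()
    }


# Hypothetical tokenized announcements.
ad_with_view = ["appartamento", "luminoso", "vista_mare", "ottima", "rendita"]
plain_ad = ["bilocale", "ristrutturato", "secondo", "piano"]

dummies_view = keyword_dummies(ad_with_view)
dummies_plain = keyword_dummies(plain_ad)
```

Under these assumptions the first announcement is flagged for both panorama and investment, while the second is flagged for neither.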

Appendix B
Key-words for dummy variable “Cultural Heritage” related to the 30 most voted TripAdvisor sites of
interest in Naples (tokens linked by an underscore are bigrams)
TripAdvisor ranking | Name | Keyword query (words 1 to 6)
1 | Cappella sansevero | (cappella and sansevero) or san_severo
2 | Oltre i resti | (oltre and resti) or oltre_resti
3 | Miglio sacro | miglio and miglio_sacro
4 | Chiesa Museo di Santa Luciella ai Librai | chiesa and luciella and librai
5 | Catacombe San Gaudioso | catacombe and gaudioso
6 | Galleria Borbonica | (galleria or tunnel) and (borbonica or Borbonico) or galleria_borbonica or tunnel_borbonico
7 | Catacombe San Gennaro | catacombe and gennaro
8 | Chiesa dei Santi Filippo e Giacomo - Complesso Museale dell’Arte della Seta | chiesa and Filippo and giacomo
9 | Acquaquiglia del pozzaro | acquaquiglia or pozzaro
10 | LAES - La Napoli sotterranea | (napoli and sotteranea) or napoli_sotterranea
11 | Teatro di San Carlo | (teatro and san and carlo) or san_carlo
12 | Museo delle Arti Sanitarie e Farmacia Storica degli Incurabili | farmacia and (storica or incurabili)
13 | Chiesa del Gesu Nuovo | chiesa and Gesu and nuovo
14 | Chiesa di San Giovanni a Carbonara | chiesa and Giovanni and carbonara
15 | Parco archeologico Pausilypon | Parco and Archeologico and pausylipon
16 | Via Caracciolo e Lungomare di Napoli | lungomare and caracciolo
17 | Grotta di Seiano | Grotta and Seiano
18 | Sant’Anna dei Lombardi (Monteoliveto) | (sant’ and anna and lombardi) or anna_lombardi or monteoliveto
19 | Complesso Monumentale Donnaregina - Museo Diocesano Napoli | (chiesa and donnaregina) or (museo and diocesano)
20 | Complesso Monumentale di Santa Chiara | (santa and chiara) or santa_chiara
21 | Cimitero delle fontanelle | cimitero and fontanelle
22 | La Neapolis Sotterrata - complesso monumentale di San Lorenzo Maggiore | (chiesa and san and lorenzo and maggiore) or (lorenzo_maggiore and not quartiere) [24]
23 | Pio monte della misericordia | pio_monte and misericordia
24 | Chiesa e chiostro di San Gregorio Armeno | chiesa and gregorio_armeno
25 | Stazione Toledo (stazione metropolitana dell’arte) | toledo and (arte or stazione_arte) and not (metropolitana or metro) [25]
26 | Napoli Sotterranea | (napoli and sotteranea) or napoli_sotterranea
27 | Spaccanapoli | spaccanapoli
28 | Via San Gregorio Armeno | via and gregorio_armeno
29 | Duomo di Napoli | (chiesa and duomo and napoli) and not (metropolitana or metro)
30 | Complesso museale Santa Maria delle Anime del Purgatorio ad Arco | chiesa and anime and purgatorio and arco
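
The Boolean queries listed above can be evaluated mechanically against a tokenized announcement. The following minimal sketch encodes two of them (rankings 1 and 29) with small helper functions. The helper names, the lower-casing of tokens and the rule that the “Cultural Heritage” dummy equals 1 when at least one site query matches are assumptions made for illustration, not a description of the script actually used.

```python
def word(w):
    """Query that matches when the token w is present."""
    return lambda tokens: w in tokens


def all_of(*queries):
    """Logical 'and' over sub-queries."""
    return lambda tokens: all(q(tokens) for q in queries)


def any_of(*queries):
    """Logical 'or' over sub-queries."""
    return lambda tokens: any(q(tokens) for q in queries)


def but_not(query, excluded):
    """Logical 'and not': query must match and excluded must not."""
    return lambda tokens: query(tokens) and not excluded(tokens)


# Two queries from Appendix B, transcribed from the table.
SITE_QUERIES = {
    "Cappella sansevero": any_of(
        all_of(word("cappella"), word("sansevero")), word("san_severo")
    ),
    "Duomo di Napoli": but_not(
        all_of(word("chiesa"), word("duomo"), word("napoli")),
        any_of(word("metropolitana"), word("metro")),
    ),
}


def cultural_heritage(tokens, queries=SITE_QUERIES):
    """Return 1 if any site query matches the (lower-cased) tokens, else 0."""
    token_set = {t.lower() for t in tokens}
    return int(any(q(token_set) for q in queries.values()))


# Hypothetical tokenized announcements.
near_duomo = cultural_heritage(["vicino", "chiesa", "duomo", "napoli"])
duomo_metro = cultural_heritage(["chiesa", "duomo", "napoli", "metro"])
near_chapel = cultural_heritage(["adiacente", "Cappella", "Sansevero"])
```

The second announcement illustrates the role of “and not”: it mentions the Duomo together with the metro, so it is treated as a reference to the underground station and is not flagged.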
